100+ datasets found
  1. 👨‍🎓 Open University Learning Analytics

    • kaggle.com
    zip
    Updated Mar 5, 2024
    Cite
    mexwell (2024). 👨‍🎓 Open University Learning Analytics [Dataset]. https://www.kaggle.com/datasets/mexwell/open-university-learning-analytics
    Available download formats: zip (44198573 bytes)
    Dataset updated
    Mar 5, 2024
    Authors
    mexwell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset introduces the anonymised Open University Learning Analytics Dataset (OULAD). It contains data about courses, students and their interactions with Virtual Learning Environment (VLE) for seven selected courses (called modules). Presentations of courses start in February and October - they are marked by “B” and “J” respectively. The dataset consists of tables connected using unique identifiers. All tables are stored in the csv format.

    Database schema

    https://analyse.kmi.open.ac.uk/resources/images/model.png

    courses.csv File contains the list of all available modules and their presentations. The columns are:

    • code_module – code name of the module, which serves as the identifier.
    • code_presentation – code name of the presentation. It consists of the year and “B” for the presentation starting in February and “J” for the presentation starting in October.
    • length – length of the module-presentation in days.

    The structure of B and J presentations may differ, so it is good practice to analyse them separately. For some presentations, however, the corresponding previous B/J presentation does not exist, so the J presentation must be used to inform the B presentation or vice versa. In the dataset this is the case for the CCC, EEE and GGG modules.
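Since B and J presentations are usually analysed separately, it can help to split code_presentation into its parts first. The sketch below assumes the codes look like "2013J" (a four-digit year followed by "B" or "J"); the description only says the code consists of the year and the letter, so the exact layout is an assumption.

```python
import re

# Assumption: codes look like "2013J" (four-digit year + "B" or "J").
START_MONTH = {"B": "February", "J": "October"}


def parse_presentation(code):
    """Split a code_presentation value into (year, start_month)."""
    match = re.fullmatch(r"(\d{4})([BJ])", code)
    if match is None:
        raise ValueError(f"unexpected code_presentation: {code!r}")
    return int(match.group(1)), START_MONTH[match.group(2)]


print(parse_presentation("2013J"))
```

Grouping rows by the second element of the returned tuple gives one way to keep the February and October presentations apart.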

    assessments.csv This file contains information about assessments in module-presentations. Usually, every presentation has a number of assessments followed by the final exam. CSV contains columns:

    • code_module – identification code of the module, to which the assessment belongs.
    • code_presentation - identification code of the presentation, to which the assessment belongs.
    • id_assessment – identification number of the assessment.
    • assessment_type – type of assessment. Three types of assessments exist: Tutor Marked Assessment (TMA), Computer Marked Assessment (CMA) and Final Exam (Exam).
    • date – information about the final submission date of the assessment calculated as the number of days since the start of the module-presentation. The starting date of the presentation has number 0 (zero).
    • weight - weight of the assessment in %. Typically, Exams are treated separately and have the weight 100%; the sum of all other assessments is 100%. If the information about the final exam date is missing, it is at the end of the last presentation week.
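The weight rule above (exams carry 100% on their own, all other assessments sum to 100%) is described as typical rather than guaranteed, so it may be worth checking. The following is a minimal sketch using only the standard library; the sample rows are hypothetical and only mimic the column layout of assessments.csv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the assessments.csv layout described above.
SAMPLE = """code_module,code_presentation,id_assessment,assessment_type,date,weight
AAA,2013J,1,TMA,19,40
AAA,2013J,2,TMA,54,60
AAA,2013J,3,Exam,,100
BBB,2013B,4,CMA,30,50
BBB,2013B,5,Exam,,100
"""


def weight_totals(rows):
    """Sum weights per (module, presentation), keeping exams separate."""
    totals = defaultdict(lambda: {"Exam": 0.0, "other": 0.0})
    for row in rows:
        key = (row["code_module"], row["code_presentation"])
        bucket = "Exam" if row["assessment_type"] == "Exam" else "other"
        totals[key][bucket] += float(row["weight"])
    return dict(totals)


totals = weight_totals(csv.DictReader(io.StringIO(SAMPLE)))
for key, sums in sorted(totals.items()):
    note = "" if sums["other"] == 100 else "  <- non-exam weights do not sum to 100"
    print(key, sums, note)
```

To run it against the real file, replace the io.StringIO(SAMPLE) argument with an open handle on assessments.csv.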

    vle.csv The csv file contains information about the available materials in the VLE. Typically these are html pages, pdf files, etc. Students have access to these materials online and their interactions with the materials are recorded. The vle.csv file contains the following columns:

    • id_site – an identification number of the material.
    • code_module – an identification code for module.
    • code_presentation - the identification code of presentation.
    • activity_type – the role associated with the module material.
    • week_from – the week from which the material is planned to be used.
    • week_to – week until which the material is planned to be used.

    studentInfo.csv This file contains demographic information about the students together with their results. File contains the following columns:

    • code_module – an identification code for a module on which the student is registered.
    • code_presentation - the identification code of the presentation during which the student is registered on the module.
    • id_student – a unique identification number for the student.
    • gender – the student’s gender.
    • region – identifies the geographic region, where the student lived while taking the module-presentation.
    • highest_education – highest student education level on entry to the module presentation.
    • imd_band – specifies the Index of Multiple Deprivation band of the place where the student lived during the module-presentation.
    • age_band – band of the student’s age.
    • num_of_prev_attempts – the number of times the student has attempted this module.
    • studied_credits – the total number of credits for the modules the student is currently studying.
    • disability – indicates whether the student has declared a disability.
    • final_result – student’s final result in the module-presentation.

    studentRegistration.csv This file contains information about the time when the student registered for the module presentation. For students who unregistered the date of unregistration is also recorded. File contains five columns:

    • code_module – an identification code for a module.
    • code_presentation - the identification code of the presentation.
    • id_student – a unique identification number for the student.
    • date_registration – the date of student’s registration on the module presentation, this is the number of days measured relative to the start of the module-presentation (e.g. the negative value -30 means that the student registered to module presentation 30 days before it started).
    • date_unr...
  2. Data from: Dataset of Computer Science Course Queries from Students:...

    • data.mendeley.com
    Updated Jan 5, 2024
    Cite
    Khandoker Ashik Uz Zaman (2024). Dataset of Computer Science Course Queries from Students: Categorized and Scored According to Bloom's Taxonomy [Dataset]. http://doi.org/10.17632/w5zt9n6vsc.1
    Dataset updated
    Jan 5, 2024
    Authors
    Khandoker Ashik Uz Zaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of 3 .csv files - 1. Data_Structure.csv 2. Introduction_to_Computers_and_Research.csv 3. Irrelevant_Questions.csv.

    Each file consists of questions asked by students of Independent University, Bangladesh, in Computer Science courses during the Summer 2023 semester.

    The questions have been manually pre-processed and categorized according to their course and topics. The questions have also been scored using the six levels of Bloom's taxonomy: remember (5 points), understand (10 points), apply (15 points), analyze (20 points), evaluate (20 points), create (30 points).
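The scoring scheme can be written down as a small lookup table. This is only a sketch: the point values are the ones listed in the description, while the function name and the idea of passing in a list of level labels are assumptions, since the CSV column names are not given here.

```python
# Point values as listed in the dataset description.
BLOOM_POINTS = {
    "remember": 5,
    "understand": 10,
    "apply": 15,
    "analyze": 20,
    "evaluate": 20,
    "create": 30,
}


def total_score(levels):
    """Add up the points for a sequence of Bloom level labels (hypothetical helper)."""
    return sum(BLOOM_POINTS[level.strip().lower()] for level in levels)


print(total_score(["Remember", "create"]))
```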

    File-1 consists of the scored and categorized questions from the "Data Structure" course. File-2 consists of the scored and categorized questions from the "Introduction to Computers and Research" course. File-3 consists of the irrelevant questions which do not belong to the courses above but were asked by the students from those courses.

  3. U.S. Education Datasets: Unification Project

    • kaggle.com
    zip
    Updated Apr 13, 2020
    Cite
    Roy Garrard (2020). U.S. Education Datasets: Unification Project [Dataset]. https://www.kaggle.com/noriuk/us-education-datasets-unification-project
    Available download formats: zip (155201337 bytes)
    Dataset updated
    Apr 13, 2020
    Authors
    Roy Garrard
    Area covered
    United States
    Description

    Author's Note 2019/04/20: Revisiting this project, I recently discovered the incredibly comprehensive API produced by the Urban Institute. It achieves all of the goals laid out for this dataset in wonderful detail. I recommend that users interested pay a visit to their site.

    Context

    This dataset is designed to bring together multiple facets of U.S. education data into one convenient CSV (states_all.csv).

    Contents

    • states_all.csv: The primary data file. Contains aggregates from all state-level sources in one CSV.

    • output_files/states_all_extended.csv: The contents of states_all.csv with additional data related to race and gender.

    Column Breakdown

    Identification

    • PRIMARY_KEY: A combination of the year and state name.
    • YEAR
    • STATE

    Enrollment

    A breakdown of students enrolled in schools by school year.

    • GRADES_PK: Number of students in Pre-Kindergarten education.

    • GRADES_4: Number of students in fourth grade.

    • GRADES_8: Number of students in eighth grade.

    • GRADES_12: Number of students in twelfth grade.

    • GRADES_1_8: Number of students in the first through eighth grades.

    • GRADES_9_12: Number of students in the ninth through twelfth grades.

    • GRADES_ALL: The count of all students in the state. Comparable to ENROLL in the financial data (which is the U.S. Census Bureau's estimate for students in the state).

    The extended version of states_all contains additional columns that break down enrollment by race and gender. For example:

    • G06_A_A: Total number of sixth grade students.

    • G06_AS_M: Number of sixth grade male students whose ethnicity was classified as "Asian".

    • G08_AS_A_READING: Average reading score of eighth grade students whose ethnicity was classified as "Asian".

    The represented races include AM (American Indian or Alaska Native), AS (Asian), HI (Hispanic/Latino), BL (Black or African American), WH (White), HP (Hawaiian Native/Pacific Islander), and TR (Two or More Races). The represented genders include M (Male) and F (Female).
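The extended column names appear to follow a grade_race_gender pattern with an optional suffix. The sketch below decodes such names under that reading; treating "A" as "all" for both race and gender and labelling columns without a suffix as enrollment counts are assumptions inferred from the three examples above, not documented rules.

```python
import re

# Race and gender codes from the description; "A" = all is an assumption.
RACES = {
    "A": "All",
    "AM": "American Indian or Alaska Native",
    "AS": "Asian",
    "HI": "Hispanic/Latino",
    "BL": "Black or African American",
    "WH": "White",
    "HP": "Hawaiian Native/Pacific Islander",
    "TR": "Two or More Races",
}
GENDERS = {"A": "All", "M": "Male", "F": "Female"}

# G<two-digit grade>_<race>_<gender>[_<measure>]
PATTERN = re.compile(r"G(\d{2})_([A-Z]{1,2})_([AMF])(?:_(\w+))?")


def parse_column(name):
    """Decode an extended column name, or return None if it does not fit."""
    match = PATTERN.fullmatch(name)
    if match is None or match.group(2) not in RACES:
        return None
    grade, race, gender, measure = match.groups()
    return {
        "grade": int(grade),
        "race": RACES[race],
        "gender": GENDERS[gender],
        "measure": measure or "ENROLLMENT",  # assumed default label
    }


for column in ("G06_A_A", "G06_AS_M", "G08_AS_A_READING", "GRADES_ALL"):
    print(column, parse_column(column))
```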

    Financials

    A breakdown of states by revenue and expenditure.

    • ENROLL: The U.S. Census Bureau's count for students in the state. Should be comparable to GRADES_ALL (which is the NCES's estimate for students in the state).

    • TOTAL_REVENUE: The total amount of revenue for the state.

      • FEDERAL_REVENUE
      • STATE_REVENUE
      • LOCAL_REVENUE
    • TOTAL_EXPENDITURE: The total expenditure for the state.

      • INSTRUCTION_EXPENDITURE
      • SUPPORT_SERVICES_EXPENDITURE

      • CAPITAL_OUTLAY_EXPENDITURE

      • OTHER_EXPENDITURE

    Academic Achievement

    A breakdown of student performance as assessed by the corresponding exams (math and reading, grades 4 and 8).

    • AVG_MATH_4_SCORE: The state's average score for fourth graders taking the NAEP math exam.

    • AVG_MATH_8_SCORE: The state's average score for eighth graders taking the NAEP math exam.

    • AVG_READING_4_SCORE: The state's average score for fourth graders taking the NAEP reading exam.

    • AVG_READING_8_SCORE: The state's average score for eighth graders taking the NAEP reading exam.

    Data Processing

    The original sources can be found here:

    # Enrollment
    https://nces.ed.gov/ccd/stnfis.asp
    # Financials
    https://www.census.gov/programs-surveys/school-finances/data/tables.html
    # Academic Achievement
    https://www.nationsreportcard.gov/ndecore/xplore/NDE
    

    Data was aggregated using a Python program I wrote. The code (as well as additional project information) can be found here.

    Methodology Notes

    • Spreadsheets for NCES enrollment data for 2014, 2011, 2010, and 2009 were modified to place key data on the same sheet, making scripting easier.

    • The column 'ENROLL' represents the U.S. Census Bureau data value (financial data), while the column 'GRADES_ALL' represents the NCES data value (demographic data). Though the two organizations correspond on this matter, these values (which are ostensibly the same) do vary. Their documentation chalks this up to differences in membership (i.e. what is and is not a fourth grade student).

    • Enrollment data from NCES has seen a number of changes across survey years. One of the more notable is that data on student gender does not appear to have been collected until 2009. The information in states_all_extended.csv reflects this.

    • NAEP test score data is only available for certain years.

    • The current version of this data is concerned with state-level patterns. It is the author's hope that future versions will allow for school district-level granularity.

    Acknowledgements

    Data is sourced from the U.S. Census Bureau and the National Center for Education Statistics (NCES).

    Licensing Notes

    The licensing of these datasets state that it must not be us...

  4. Dataset for Paper "Towards Increased Diversity in STEM Education: Five...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jul 17, 2024
    Cite
    Anonymous; Anonymous (2024). Dataset for Paper "Towards Increased Diversity in STEM Education: Five archetypes Derived through a Data-Driven Approach Examining a Computer Science Student Cohort" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4737551
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Anonymous
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for Paper "Towards Increased Diversity in STEM Education: Five archetypes Derived through a Data-Driven Approach Examining a Computer Science Student Cohort" - Rev #1

    This is the dataset for the paper titled "Towards Increased Diversity in STEM Education: Five archetypes Derived through a Data-Driven Approach Examining a Computer Science Student Cohort".

    In case of questions, feel free to contact the authors, anonymised, ORCID: https://orcid.org/*anonymised*, current affiliation and email: anonymised

    Survey 2019

    The raw survey data for the initial 2019 survey is available in the file survey2019_anon.csv. Note that the data is anonymised as free-text comments have been removed. Explanations on the variables and their levels are given in the files variables_survey2019.csv and values_survey2019.csv. The questionnaire for the 2019 survey is contained in survey2019_instrument.pdf.

    Survey 2020

    The raw survey data for the 2020 survey is available in the file rdata_anon_survey2020.csv. Additional scripts are supplied to reproduce the exploratory factor analysis. The main entry is the file EFA.R, which imports the data. The file contains some comments on the process. The questionnaire for the 2020 survey is contained in survey2020_instrument.pdf.

    Interviews

    The interview guide used for the five interviews is available in the file interview_instrument.pdf.

  5. XAI-FUNGI: Dataset from the user study on comprehensibility of XAI...

    • zenodo.org
    csv, pdf, zip
    Updated Oct 15, 2024
    Cite
    Szymon Bobek; Paloma Korycińska; Monika Krakowska; Maciej Mozolewski; Dorota Rak; Magdalena Zych; Magdalena Wójcik; Grzegorz J. Nalepa (2024). XAI-FUNGI: Dataset from the user study on comprehensibility of XAI algorithms [Dataset]. http://doi.org/10.5281/zenodo.11448395
    Available download formats: csv, zip, pdf
    Dataset updated
    Oct 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Szymon Bobek; Paloma Korycińska; Monika Krakowska; Maciej Mozolewski; Dorota Rak; Magdalena Zych; Magdalena Wójcik; Grzegorz J. Nalepa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    XAI-FUNGI: Dataset from the user study on comprehensibility of XAI algorithms

    We present a dataset created during a user study on the evaluation of explainability of artificial intelligence (AI) at the Jagiellonian University, as a collaborative work of the computer science (GEIST team) and information sciences research groups. The main goal of the research was to explore effective explanations of AI model patterns to diverse audiences.

    The dataset contains material collected from 39 participants during the interviews conducted by the Information Sciences research group. The participants were recruited from 149 candidates to form three groups that represented domain experts in the field of mycology (DE), students with data science and visualization background (IT) and students from social sciences and humanities (SSH). Each group was given an explanation of a machine learning model trained to predict edible and non-edible mushrooms and asked to interpret the explanations and answer various questions during the interview. The machine learning model and explanations for its decision were prepared by the computer science research team.

    The resulting dataset was constructed from the surveys obtained from the candidates, anonymized transcripts of the interviews, the results from thematic analysis, and original explanations with modifications suggested by the participants. The dataset is complemented with the source code allowing one to reproduce the initial machine learning model and explanations.

    The general structure of the dataset is described in the following table. The files that contain in their names [RR]_[SS]_[NN] contain the individual results obtained from particular participant. The meaning of the prefix is as follows:

    • RR - initials of the researcher conducting the interview,
    • SS - type of the participant (DE for domain expert, SSH for social sciences and humanities students, or IT for computer science students),
    • NN - number of the participant
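The [RR]_[SS]_[NN] naming scheme can be decoded with a short regular expression. This is a sketch under assumptions: the example file name is hypothetical, and the exact form of the initials and the number of digits in NN are not specified in the description.

```python
import re
from pathlib import PurePath

# Participant types as listed above.
GROUPS = {
    "DE": "domain expert",
    "SSH": "social sciences and humanities student",
    "IT": "IT student",
}

# Assumption: initials contain no underscore; NN is one or more digits.
ID_PATTERN = re.compile(r"([^_]+)_(DE|SSH|IT)_(\d+)")


def parse_participant_id(filename):
    """Return the parts of a [RR]_[SS]_[NN] file name, or None if it does not fit."""
    match = ID_PATTERN.fullmatch(PurePath(filename).stem)
    if match is None:
        return None
    researcher, group, number = match.groups()
    return {
        "researcher": researcher,
        "group": GROUPS[group],
        "participant": int(number),
    }


# "XY_SSH_07.csv" is a made-up name used only for illustration.
print(parse_participant_id("XY_SSH_07.csv"))
```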

    File | Description
    SURVEY.csv | The results from a survey that was filled in by 149 participants, out of which 39 were selected to form the final group of participants.
    CODEBOOK.csv | The codebook used in thematic analysis and MAXQDA coding.
    QUESTIONS.csv | List of questions that the participants were asked during interviews.
    SLIDES.csv | List of slides used in the study with their interpretation and reference to MAXQDA themes and VISUAL_MODIFICATIONS tables.
    MAXQDA_SUMMARY.csv | Summary of thematic analysis performed with codes used in CODEBOOK for each participant.
    PROBLEMS.csv | List of problems that participants were asked to solve during interviews. They correspond to three instances from the dataset that the participants had to classify using knowledge gained from explanations.
    PROBLEMS_RESPONSES.csv | Each participant's responses to the problems listed in PROBLEMS.csv.
    VISUALIZATION_MODIFICATIONS.csv | Information on how the order of the slides was modified by the participant, which slides (explanations) were removed, and what kind of additional explanation was suggested.
    ORIGINAL_VISUZALIZATIONS.pdf | The PDF file containing the visualization of explanations presented to the participants during the interviews.
    VISUALIZATION_MODIFICATIONS.zip | The original slides from ORIGINAL_VISUZALIZATIONS.pdf with the modifications suggested by each participant. Each file is a PDF file named with the participant ID, i.e. [RR]_[SS]_[NN].pdf.
    TRANSCRIPTS.zip | The anonymized transcripts of interviews for each participant, zipped into one archive. Each transcript is named after the participant ID, i.e. [RR]_[SS]_[NN].csv, and contains text tagged with the slide number it relates to, the question number from QUESTIONS.csv, and the problem number from PROBLEMS.csv.

    The detailed structure of the files presented in the previous Table is given in the Technical info section.

    The source code used to train the ML model and to generate explanations is available on GitLab.

  6. US Dept of Education: College Scorecard

    • kaggle.com
    zip
    Updated Nov 9, 2017
    Cite
    Kaggle (2017). US Dept of Education: College Scorecard [Dataset]. https://www.kaggle.com/forums/f/810/us-dept-of-education-college-scorecard
    Available download formats: zip (589617678 bytes)
    Dataset updated
    Nov 9, 2017
    Dataset authored and provided by
    Kaggle
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    It's no secret that US university students often graduate with debt repayment obligations that far outstrip their employment and income prospects. While it's understood that students from elite colleges tend to earn more than graduates from less prestigious universities, the finer relationships between future income and university attendance are quite murky. In an effort to make educational investments less speculative, the US Department of Education has matched information from the student financial aid system with federal tax returns to create the College Scorecard dataset.

    Kaggle is hosting the College Scorecard dataset in order to facilitate shared learning and collaboration. Insights from this dataset can help make the returns on higher education more transparent and, in turn, more fair.

    Data Description

    Here's a script showing an exploratory overview of some of the data.

    college-scorecard-release-*.zip contains a compressed version of the same data available through Kaggle Scripts.

    It consists of three components:

    • All the raw data files released in version 1.40 of the college scorecard data
    • Scorecard.csv, a single CSV file with all years' data combined. In it, we've converted categorical variables represented by integer keys in the original data to their labels and added a Year column
    • database.sqlite, a SQLite database containing a single Scorecard table that contains the same information as Scorecard.csv
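The SQLite component can be queried with Python's built-in sqlite3 module. The sketch below runs against an in-memory stand-in so that it is self-contained: the Scorecard table name and the Year column come from the description, while the second column and all sample rows are hypothetical.

```python
import sqlite3

# In-memory stand-in; with the real file this would be
# sqlite3.connect("database.sqlite").
conn = sqlite3.connect(":memory:")

# Only the table name and the Year column are taken from the description;
# "name" and the rows below are made up for illustration.
conn.execute("CREATE TABLE Scorecard (Year INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO Scorecard VALUES (?, ?)",
    [(2012, "College A"), (2012, "College B"), (2013, "College A")],
)

# Count rows per year.
rows_per_year = conn.execute(
    "SELECT Year, COUNT(*) FROM Scorecard GROUP BY Year ORDER BY Year"
).fetchall()
print(rows_per_year)

conn.close()
```

Given the size of the archive, aggregating in SQL like this is likely cheaper than loading all of Scorecard.csv into memory.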

    New to data exploration in R? Take the free, interactive DataCamp course, "Data Exploration With Kaggle Scripts," to learn the basics of visualizing data with ggplot. You'll also create your first Kaggle Scripts along the way.

  7. Drop Project Student Plugin for IntelliJ IDEA - Evaluation Survey

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated May 30, 2024
    Cite
    Bruno Pereira Cipriano; Bernardo Baltazar; Pedro Alves; Nuno Fachada (2024). Drop Project Student Plugin for IntelliJ IDEA - Evaluation Survey [Dataset]. http://doi.org/10.5281/zenodo.8432997
    Available download formats: csv
    Dataset updated
    May 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bruno Pereira Cipriano; Bernardo Baltazar; Pedro Alves; Nuno Fachada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contains a CSV file with students' replies to the survey used to evaluate the Drop Project Student plugin for IntelliJ IDEA.

    To support international readers, the question names (CSV headers) were translated to English and/or match the numbering that appears in the paper. However, the students' textual replies to the open-ended questions were left in their original language, Portuguese.

  8. AP Computer Science A Exam Dataset

    • kaggle.com
    zip
    Updated Nov 13, 2016
    Cite
    Institute for Computing Education at Georgia Tech (2016). AP Computer Science A Exam Dataset [Dataset]. https://www.kaggle.com/iceatgt/ap-computer-science-a-exam-dataset
    Available download formats: zip (10410 bytes)
    Dataset updated
    Nov 13, 2016
    Dataset authored and provided by
    Institute for Computing Education at Georgia Tech
    Description

    Context

    The datasets contain all the data for the number of CS AP A exams taken in each state from 1998 to 2013, and detailed data on pass rates, race, and gender from 2006 to 2013. The data was compiled from the data available at http://research.collegeboard.org/programs/ap/data. This data was originally gathered by the CSTA board, but Barb Ericson of Georgia Tech keeps adding to it each year.

    Content

    historical.csv contains data for the number of CS AP A exams taken in each state from 1998 to 2013:

    • state: US states

    • 1998-2013

    • Pop: population

    pass_06_13.csv contains exam pass rates, race and gender data from 2006 to 2013 for selected states.

    pass_12_13.csv contains exam pass rates, race and gender information for every state for 2012 and 2013.

    Acknowledgements

    The original datasets can be found here and here.

    Inspiration

    Using the datasets, can you examine the temporal trends in the exam pass rates by race, gender, and geographical location?

  9. Data from: Automatic composition of descriptive music: A case study of the...

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Lucía Martín-Gómez (2023). Automatic composition of descriptive music: A case study of the relationship between image and sound [Dataset]. http://doi.org/10.6084/m9.figshare.6682998.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lucía Martín-Gómez
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    FANTASIA

    This repository contains the data related to image descriptors and sound associated with a selection of frames of the films Fantasia and Fantasia 2000 produced by Disney.

    About

    This repository contains the data used in the article Automatic composition of descriptive music: A case study of the relationship between image and sound, published in the 6th International Workshop on Computational Creativity, Concept Invention, and General Intelligence (C3GI). The data structure is explained in detail in the article.

    Abstract

    Human beings establish relationships with the environment mainly through sight and hearing. This work focuses on the concept of descriptive music, which makes use of sound resources to narrate a story. The Fantasia film, produced by Walt Disney, was used in the case study. One of its musical pieces is analyzed in order to obtain the relationship between image and music. This connection is subsequently used to create a descriptive musical composition from a new video. Naive Bayes, Support Vector Machine and Random Forest are the three classifiers studied for the model induction process. After an analysis of their performance, it was concluded that Random Forest provided the best solution; the produced musical composition had a considerably high descriptive quality.

    Data

    • Nutcracker_data.arff: Image descriptors and the most important sound of each frame from the fragment "The Nutcracker Suite" in the film Fantasia. Data stored in ARFF format.
    • Firebird_data.arff: Image descriptors of each frame from the fragment "The Firebird" in the film Fantasia 2000. Data stored in ARFF format.
    • Firebird_midi_prediction.csv: Frame number of the fragment "The Firebird" in the film Fantasia 2000 and the sound predicted by the system, encoded in MIDI. Data stored in CSV format.
    • Firebird_prediction.mp3: Audio file with the synthesis of the prediction data for the fragment "The Firebird" of the film Fantasia 2000.

    License

    Data is available under MIT License. To make use of the data the article must be cited.

  10. Annotated Benchmark of Real-World Data for Approximate Functional Dependency...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jul 1, 2023
    Cite
    Marcel Parciak; Sebastiaan Weytjens; Frank Neven; Niel Hens; Liesbet M. Peeters; Stijn Vansummeren (2023). Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery [Dataset]. http://doi.org/10.5281/zenodo.8098909
    Available download formats: csv
    Dataset updated
    Jul 1, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marcel Parciak; Sebastiaan Weytjens; Frank Neven; Niel Hens; Liesbet M. Peeters; Stijn Vansummeren
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery

    This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport.

    The file ground_truth.csv is a comma-separated file containing approximate functional dependencies. The column table names the relation we refer to; lhs and rhs reference two columns of that relation where, semantically, we found that lhs implies rhs.

    The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded or included in the manual annotation, respectively. We excluded a candidate if there was no tuple where both attributes had a value or if the g3_prime value was too small.
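To make the idea of an approximate functional dependency concrete, the sketch below computes the classic g3 error for a candidate lhs -> rhs: the smallest fraction of rows that would have to be removed for the dependency to hold exactly. This is an assumption-laden illustration, not the benchmark's procedure: the g3_prime measure mentioned above is not defined in this description and is not reproduced here, and the sample airport rows are hypothetical.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Classic g3 error of lhs -> rhs; rows missing either value are skipped.

    Returns None when no row has both attributes.
    """
    groups = defaultdict(Counter)
    total = 0
    for row in rows:
        left, right = row.get(lhs), row.get(rhs)
        if left in (None, "") or right in (None, ""):
            continue
        groups[left][right] += 1
        total += 1
    if total == 0:
        return None
    # For each lhs value keep only its most frequent rhs value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return (total - kept) / total


# Hypothetical rows; only the column names come from the example above.
rows = [
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy International"},
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy International"},
    {"AirportCode": "JFK", "AirportName": "JFK Intl"},
    {"AirportCode": "LAX", "AirportName": "Los Angeles International"},
]
print(g3_error(rows, "AirportCode", "AirportName"))
```

An error of 0 means the dependency holds exactly on the rows that have both values; small positive values suggest an approximate dependency with a few inconsistent entries.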

    Dataset References

  11. Data from: 2024 dataset on independent researchers collected from OpenAlex

    • zenodo.org
    • repository.uantwerpen.be
    • +1more
    csv, tsv
    Updated Apr 22, 2024
    Cite
    Eline Vandewalle; Camilla Hertil Lindelöw (2024). 2024 dataset on independent researchers collected from OpenAlex [Dataset]. http://doi.org/10.5281/zenodo.10925112
    Available download formats: csv, tsv
    Dataset updated
    Apr 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eline Vandewalle; Camilla Hertil Lindelöw
    License

    https://creativecommons.org/public-domain

    Description

    This dataset belongs to a paper about independent researchers submitted to the STI conference 2024 (https://sti2024.org/). It consists of several files described below. The data is from OpenAlex, collected through the InSySPo instance of the February snapshot of OpenAlex, hosted on Google Cloud. Because Topics are a new feature of OpenAlex and therefore not part of the snapshot, Topics data, along with some other data not available at the InSySPo instance at the time of collection, was collected through the OpenAlex API and incorporated in the files. Data from Scopus and Web of Science may be retrieved by using the search string in the appendix of the article.

    Files all domains

    240307_open_alex_works.tsv

    contains all works retrieved with the search string for Independent researchers in OpenAlex in the article's appendix.

    Files Social Sciences and/or Arts & Humanities

    240312_open_alex_works_soc_sci_arts_2010.tsv

    contains articles by Independent researchers in Social Sciences and Humanities published from 2010 and retrieved from OpenAlex.

    240312_open_alex_authors_soc_sci_arts_2010.tsv

    contains authors who are Independent researchers in Social Sciences and Humanities with works published from 2010 and retrieved from OpenAlex.

    240313_open_alex_authors_all_works_soc_sci_arts_2010.tsv

    contains all works by Independent researchers in Social Sciences and Humanities published from 2010 and retrieved from OpenAlex. "All works" means that the researcher has indicated independent status in the affiliation at least once, and that the author's other works are also included.

    author_distribution_domain1.csv

    contains number of works per number of authors in the domain Social Sciences (includes Arts & Humanities).

    author_distribution_field33.csv

    contains number of works per number of authors in the field Social Sciences.

    author_distribution_field12.csv

    contains number of works per number of authors in the field Arts & Humanities.

    all_ssh_oa.csv

    contains data for analyzing open access patterns for the domain Social Sciences (includes Arts & Humanities).

  12. 1. data all field studies CSV.csv

    • psycharchives.org
    Updated Aug 5, 2022
    Cite
    (2022). 1. data all field studies CSV.csv [Dataset]. https://psycharchives.org/en/item/5bb80531-2812-4a0a-9b75-b396c8543d34
    Dataset updated
    Aug 5, 2022
    License

    https://doi.org/10.23668/psycharchives.4988

    Description

    Citizen Science (CS) projects play a crucial role in engaging citizens in conservation efforts. While implicitly mostly considered as an outcome of CS participation, citizens may also have a certain attitude toward engagement in CS when starting to participate in a CS project. Moreover, there is a lack of CS studies that consider changes over longer periods of time. Therefore, this research presents two-wave data from four field studies of a CS project about urban wildlife ecology using cross-lagged panel analyses. We investigated the influence of attitudes toward engagement in CS on self-related, ecology-related, and motivation-related outcomes. We found that positive attitudes toward engagement in CS at the beginning of the CS project had positive influences on participants’ psychological ownership and pride in their participation, their attitudes toward and enthusiasm about wildlife, and their internal and external motivation two months later. We discuss the implications for CS research and practice. Dataset for: Greving, H., Bruckermann, T., Schumann, A., Stillfried, M., Börner, K., Hagen, R., Kimmig, S. E., Brandt, M., & Kimmerle, J. (2023). Attitudes Toward Engagement in Citizen Science Increase Self-Related, Ecology-Related, and Motivation-Related Outcomes in an Urban Wildlife Project. BioScience, 73(3), 206–219. https://doi.org/10.1093/biosci/biad003: Data (CSV format) collected for all field studies

  13. Dataset for algorithmic thinking skills assessment: Results from the virtual...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Apr 24, 2025
    Cite
    Giorgia Adorni (2025). Dataset for algorithmic thinking skills assessment: Results from the virtual CAT large-scale study in Swiss compulsory education [Dataset]. http://doi.org/10.5281/zenodo.10912340
    Available download formats: csv
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Giorgia Adorni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 3, 2024
    Area covered
    Switzerland
    Description

    Overview
    This dataset was collected during a main study that evaluated the virtual Cross Array Task (CAT) platform as an assessment tool for algorithmic thinking (AT) skills among K-12 students in Swiss compulsory education.
    As algorithmic thinking becomes increasingly vital in our digital age, this study bridges the gap between traditional assessments and the needs of today's learners by introducing a digital platform. The virtual CAT, a digital adaptation of an unplugged assessment activity, offers scalable, automated assessments with reduced human intervention.

    Study Context, Location and Participants
    To comprehensively investigate algorithmic competencies within compulsory education, exploring their variations and determining the factors influencing them, we conducted an experimental study with the virtual CAT in Spring 2023.
    The sample comprises 129 students (65 girls and 64 boys), selected from nine classes across five public schools in Ticino and Solothurn cantons.

    Data Collection
    During the data collection process, session and participant details were manually recorded by the administrator.
    Each session has been assigned a unique identifier, and specific details, such as the date, canton, school name and type, and the students’ HarmoS grade (HG) level, have been recorded.
    Student information is limited to sex and date of birth, with birth dates used to calculate ages, a significant factor in our demographic analysis.
    To protect student privacy, unique identifiers have been assigned to each participant, keeping the data anonymous and secure.
    The assessment tool automatically tracked all user interaction within the platform.
    All data collected have been pseudonymised, aligning with prevailing open science practices in Switzerland (SNSF, 2021).
    Data collection was integrated into a validation module of the app.
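
The age calculation mentioned above can be sketched as follows. This is a minimal illustration, assuming age is counted in completed years on the session date; the description does not say how the authors derived it, and the function name and example dates are hypothetical.

```python
from datetime import date

def age_on(birth_date: date, session_date: date) -> int:
    """Age in completed years on the session date."""
    years = session_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the session year.
    if (session_date.month, session_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

# Hypothetical dates for illustration only.
print(age_on(date(2012, 9, 15), date(2023, 4, 3)))  # 10
```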

    Data Features
    The dataset comprises the following files:

    • STUDENTS_SESSIONS.csv
    • RESULTS.csv
    • LOGS.csv
    • CANTONS.csv
    • ALGORITHMS.csv

    These files collectively provide insights into the algorithmic actions of the students, demographic details, session logs, results, and more.

    Usage & Ethics
    In the spirit of open science, this dataset is made available to the public after meticulous anonymisation to ensure all participants' privacy and ethical treatment.
    Initial authorisations were secured from school administrators, teachers, and parents.
    Detailed communication regarding the study's nature, data handling, and objectives was transparently shared with all stakeholders.

    REFERENCES

    [1] A. Piatti, G. Adorni, L. El-Hamamsy, L. Negrini, D. Assaf, L. Gambardella & F. Mondada. (2022). The CT-cube: A framework for the design and the assessment of computational thinking activities. Computers in Human Behavior Reports, 5, 100166. https://doi.org/10.1016/j.chbr.2021.100166

    [2] Adorni, G., Piatti, S., & Karpenko, V. (2023). virtual CAT: An app for algorithmic thinking assessment within Swiss compulsory education. Zenodo Software. https://doi.org/10.5281/zenodo.10027851 On GitHub: https://github.com/GiorgiaAuroraAdorni/virtual-CAT-app/

    [3] Adorni, G., & Karpenko, V. (2023). virtual CAT programming language interpreter. Zenodo Software. https://doi.org/10.5281/zenodo.10016535 On GitHub: https://github.com/GiorgiaAuroraAdorni/virtual-CAT-programming-language-interpreter/

    [4] Adorni, G., & Karpenko, V. (2023). virtual CAT data infrastructure. Zenodo Software. https://doi.org/10.5281/zenodo.10015011 On GitHub: https://github.com/GiorgiaAuroraAdorni/virtual-CAT-data-infrastructure

  14. Dataset on the Human Body as a Signal Propagation Medium

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    J. Ormanis; V. Medvedevs; V. Aristovs; V. Abolins; A. Sevcenko; A. Elsts (2024). Dataset on the Human Body as a Signal Propagation Medium [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8214496
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Institute of Electronics and Computer Science
    Authors
    J. Ormanis; V. Medvedevs; V. Aristovs; V. Abolins; A. Sevcenko; A. Elsts
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This is a large-scale dataset with impedance and signal loss data recorded on volunteer test subjects using low-voltage, sine-shaped alternating current signals. The signal frequencies range from 50 kHz to 20 MHz.

    Applications: This dataset is intended to support investigation of the human body as a signal propagation medium and to capture how the properties of the human body (age, sex, composition, etc.), the measurement locations, and the signal frequencies affect signal loss over the human body.

    Overview statistics:

    Number of subjects: 30

    Number of transmitter locations: 6

    Number of receiver locations: 6

    Number of measurement frequencies: 19

    Input voltage: 1 V

    Load resistance: 50 ohm and 1 megaohm

    Measurement group statistics:

    Height: 174.10 (7.15)

    Weight: 72.85 (16.26)

    BMI: 23.94 (4.70)

    Body fat %: 21.53 (7.55)

    Age group: 29.00 (11.25)

    Male/female ratio: 50%

    Included files:

    experiment_protocol_description.docx - protocol used in the experiments

    electrode_placement_schematic.png - schematic of placement locations

    electrode_placement_photo.jpg - photo of the experiment setup on a volunteer subject

    RawData - the full measurement results and experiment info sheets

    all_measurements.csv - the most important results extracted to .csv

    all_measurements_filtered.csv - same, but after z-score filtering

    all_measurements_by_freq.csv - the most important results extracted to .csv, single frequency per row

    all_measurements_by_freq_filtered.csv - same, but after z-score filtering

    summary_of_subjects.csv - key statistics on the subjects from the experiment info sheets

    process_json_files.py - script that creates .csv from the raw data

    filter_results.py - outlier removal based on z-score

    plot_sample_curves.py - visualization of a randomly selected measurement result subset

    plot_measurement_group.py - visualization of the measurement group
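
For readers unfamiliar with z-score filtering, the following is a generic sketch of the idea behind the "_filtered" files. It is not the logic of filter_results.py, which is not shown here; the threshold, the use of the population standard deviation, and the sample values are all assumptions.

```python
from statistics import mean, pstdev

def zscore_filter(values, threshold=3.0):
    """Keep only values whose absolute z-score does not exceed the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs(v - mu) / sigma <= threshold]

# Hypothetical receiver gains in dB; the last value is an obvious outlier.
gains = [-40.1, -39.8, -40.3, -40.0, -39.9, -12.0]
print(zscore_filter(gains, threshold=2.0))
```

With very small samples, a single outlier inflates the standard deviation, so a lower threshold (here 2.0) is needed to flag it.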

    CSV file columns:

    subject_id - participant's random unique ID

    experiment_id - measurement session's number for the participant

    height - participant's height, cm

    weight - participant's weight, kg

    BMI - body mass index, computed from the values above

    body_fat_% - body fat composition, as measured by bioimpedance scales

    age_group - age rounded to 10 years, e.g. 20, 30, 40 etc.

    male - 1 if male, 0 if female

    tx_point - transmitter point number

    rx_point - receiver point number

    distance - distance, in relative units, between the tx and rx points. Not scaled in terms of participant's height and limb lengths!

    tx_point_fat_level - transmitter point location's average fat content metric. Not scaled for each participant individually.

    rx_point_fat_level - receiver point location's average fat content metric. Not scaled for each participant individually.

    total_fat_level - sum of rx and tx fat levels

    bias - constant term to simplify data analytics, always equal to 1.0
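
The two derived columns above can be sketched as follows. BMI uses the standard formula (weight in kg divided by the square of height in m). The age_group rule is an assumption: the description says only "rounded to 10 years", so rounding to the nearest ten with halves rounded up is one possible reading. Function names are hypothetical.

```python
def bmi(height_cm: float, weight_kg: float) -> float:
    """Body mass index from height in cm and weight in kg."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)

def age_group(age_years: int) -> int:
    """Age rounded to the nearest 10 years (assumed rule, halves round up)."""
    return int((age_years + 5) // 10) * 10

print(bmi(200, 80))   # 20.0
print(age_group(24))  # 20
print(age_group(36))  # 40
```

Note that applying bmi to the group means of height and weight will not reproduce the reported mean BMI exactly, since a mean of ratios differs from a ratio of means.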

    CSV file columns, frequency-specific:

    tx_abs_Z_... - transmitter-side impedance, as computed by the process_json_files.py script from the voltage drop

    rx_gain_50_f_... - experimentally measured gain on the receiver, in dB, using 50 ohm load impedance

    rx_gain_1M_f_... - experimentally measured gain on the receiver, in dB, using 1 megaohm load impedance

    Acknowledgments: The dataset collection was funded by the Latvian Council of Science, project “Body-Coupled Communication for Body Area Networks”, project No. lzp-2020/1-0358.

    References: For more detailed information, see this article: J. Ormanis, V. Medvedevs, A. Sevcenko, V. Aristovs, V. Abolins, and A. Elsts. Dataset on the Human Body as a Signal Propagation Medium for Body Coupled Communication. Submitted to Elsevier Data in Brief, 2023.

    Contact information: info@edi.lv

  15. Replication Data for: kluster: An Efficient Scalable Procedure for...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Estiri, Hossein (2023). Replication Data for: kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in Unsupervised Learning [Dataset]. http://doi.org/10.7910/DVN/LLIOHM
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Estiri, Hossein
    Description

    182 simulated datasets (the first set contains small datasets and the second set contains large datasets) with different cluster compositions, i.e., different numbers of clusters and separation values, generated using the clusterGeneration package in R. Each set consists of 91 datasets in comma-separated values (csv) format (182 csv files in total) with 3-15 clusters and separation values from 0.1 to 0.7. Separation values can range over (−0.999, 0.999), where a higher value indicates more separable clusters. The size of the dataset, the number of clusters, and the separation value are encoded in the file name, size_X_n_Y_sepval_Z.csv:

    • X - size of the dataset
    • Y - number of clusters in the dataset
    • Z - separation value of the clusters in the dataset
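
Since the simulation parameters are carried in the file names, they can be recovered with a small parser. The sketch below assumes the pattern size_X_n_Y_sepval_Z.csv with integer X and Y and a decimal Z; the exact number formatting in the real file names, the example name, and the function name are assumptions.

```python
import re

# Assumed pattern: size_<int>_n_<int>_sepval_<decimal>.csv
FILENAME_RE = re.compile(
    r"^size_(?P<size>\d+)_n_(?P<n>\d+)_sepval_(?P<sepval>-?\d+(?:\.\d+)?)\.csv$"
)

def parse_name(filename):
    """Return the parameters encoded in a file name, or None if it does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return {
        "size": int(match["size"]),
        "n_clusters": int(match["n"]),
        "sepval": float(match["sepval"]),
    }

# Hypothetical file name for illustration.
print(parse_name("size_1000_n_5_sepval_0.3.csv"))
# {'size': 1000, 'n_clusters': 5, 'sepval': 0.3}
```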

  16. Logs and Mined Sequential Patterns of Programming Processes from...

    • figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Minji Kong; Lori Pollock (2023). Logs and Mined Sequential Patterns of Programming Processes from "Semi-Automatically Mining Students' Common Scratch Programming Behaviors" [Dataset]. http://doi.org/10.6084/m9.figshare.12100797.v1
    Available download formats: txt
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Minji Kong; Lori Pollock
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a ProgSnap2-based dataset containing anonymized logs of over 34,000 programming events exhibited by 81 programming students in Scratch, a visual programming environment, during our designed study as described in the paper "Semi-Automatically Mining Students' Common Scratch Programming Behaviors." We also include a list of approx. 3100 mined sequential patterns of programming processes that are performed by at least 10% of the 62 novice programmers among the 81 students, and that represent maximal patterns generated by the MG-FSM algorithm while allowing a gap of one programming event.

    • README.txt - overview of the dataset and its properties
    • mainTable.csv - main event table of the dataset holding rows of programming events
    • codeState.csv - table holding XML representations of code snapshots at the time of each programming event
    • datasetMetadata.csv - describes features of the dataset
    • Scratch-SeqPatterns.txt - list of sequential patterns mined from the Main Event Table
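
To make the "10% support with a gap of one event" criterion concrete, the sketch below checks whether a pattern occurs in an event sequence with at most one skipped event between consecutive pattern items, and then computes the share of students whose logs contain it. This is a plain illustration of the criterion, not the MG-FSM algorithm itself; the event names, function names, and sample logs are hypothetical.

```python
def occurs_with_gap(pattern, events, max_gap=1):
    """True if pattern occurs in events with at most max_gap skipped events between items."""
    if not pattern:
        return True
    # Indices where the pattern prefix matched so far can end.
    ends = [i for i, event in enumerate(events) if event == pattern[0]]
    for item in pattern[1:]:
        next_ends = set()
        for end in ends:
            # The next item may sit at most max_gap + 1 positions after the previous one.
            for j in range(end + 1, min(end + max_gap + 2, len(events))):
                if events[j] == item:
                    next_ends.add(j)
        ends = sorted(next_ends)
        if not ends:
            return False
    return bool(ends)

def support(pattern, sequences, max_gap=1):
    """Fraction of sequences (one per student) that contain the pattern."""
    hits = sum(occurs_with_gap(pattern, seq, max_gap) for seq in sequences)
    return hits / len(sequences)

# Hypothetical event logs, one list per student.
logs = [
    ["add_block", "run", "delete_block"],
    ["add_block", "move_block", "run"],
    ["run", "add_block"],
    ["add_block", "move_block", "move_block", "run"],
]
print(support(["add_block", "run"], logs))  # 0.5
```

Under the threshold described above, a pattern would count as common when its support is at least 0.10.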

  17. Data from: Student grade prediction dataset

    • data.mendeley.com
    Updated Jun 16, 2022
    Cite
    Nonso Nnamoko (2022). Student grade prediction dataset [Dataset]. http://doi.org/10.17632/wf8568hxb7.1
    Dataset updated
    Jun 16, 2022
    Authors
    Nonso Nnamoko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset provides a collection of 160 instances belonging to two classes ('pass' = 136 and 'fail' = 24). The data is an anonymised, statistically sound and reliable representation of the original data collected from students studying computer science modules at a UK university. Each instance is made up of 19 features plus the class label. Eight of the features represent students' online behaviour, including bio information retrieved from the Virtual Learning Environment. Eleven of the features represent students' neighbourhood influence retrieved from the Office for Students database. The data has been compiled and made available in de-facto/de-jure standard open formats (CSV and JSON).

    This data was collected and used in a research study undertaken by academics and researchers at the Computer Science Department, Edge Hill University, United Kingdom. To encourage reproducibility of the experiments and results reported, the data is provided in the exact training-validation-testing splits used in the experiments.

  18. 4. codebook all field studies CSV.csv

    • psycharchives.org
    Updated Aug 5, 2022
    Cite
    (2022). 4. codebook all field studies CSV.csv [Dataset]. https://psycharchives.org/en/item/5bb80531-2812-4a0a-9b75-b396c8543d34
    Dataset updated
    Aug 5, 2022
    License

    https://doi.org/10.23668/psycharchives.4988

    Description

    Citizen Science (CS) projects play a crucial role in engaging citizens in conservation efforts. While implicitly mostly considered as an outcome of CS participation, citizens may also have a certain attitude toward engagement in CS when starting to participate in a CS project. Moreover, there is a lack of CS studies that consider changes over longer periods of time. Therefore, this research presents two-wave data from four field studies of a CS project about urban wildlife ecology using cross-lagged panel analyses. We investigated the influence of attitudes toward engagement in CS on self-related, ecology-related, and motivation-related outcomes. We found that positive attitudes toward engagement in CS at the beginning of the CS project had positive influences on participants’ psychological ownership and pride in their participation, their attitudes toward and enthusiasm about wildlife, and their internal and external motivation two months later. We discuss the implications for CS research and practice. Dataset for: Greving, H., Bruckermann, T., Schumann, A., Stillfried, M., Börner, K., Hagen, R., Kimmig, S. E., Brandt, M., & Kimmerle, J. (2023). Attitudes Toward Engagement in Citizen Science Increase Self-Related, Ecology-Related, and Motivation-Related Outcomes in an Urban Wildlife Project. BioScience, 73(3), 206–219. https://doi.org/10.1093/biosci/biad003: Codebook (CSV format) of the variables of all field studies

  19. Trusted Research Environments: Analysis of Characteristics and Data...

    • researchdata.tuwien.ac.at
    bin, csv
    Updated Jun 25, 2024
    Cite
    Martin Weise; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
    Available download formats: bin, csv
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Martin Weise; Andreas Rauber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data, through technical, organizational, and legal measures, from being (accidentally) leaked outside the facility. While many TREs exist in Europe, little information is publicly available on their architecture, the descriptions of their building blocks, and their slight technical variations. To shed light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system descriptions. We further analyze their technical characteristics, especially their commonalities and variations, and provide insight into their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data, of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available a majority of the sensitive data records included in this study.

    Methodology

    We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:

    • Peer-reviewed articles where available,
    • TRE websites,
    • TRE metadata catalogs.

    The goal for this literature study is to discover existing TREs, analyze their characteristics and data availability to give an overview on available infrastructure for sensitive data research as many European initiatives have been emerging in recent months.

    Technical details

    This dataset consists of five comma-separated values (.csv) files describing our inventory:

    • countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-1 alpha-3 encoding, optional)
    • tres.csv: Table of TREs with columns id (number), name (text), countryid (number, referring to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonymized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records], optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
    • access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
    • inclusion.csv: Table of included TREs into the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
    • major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).

    Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed, properly modelling the schema for databases:

    • schema.sql: Schema definition file to create the tables and views used in the analysis.
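
As a lightweight alternative to loading the schema into MariaDB, the relationship between tres.csv and countries.csv can be explored directly from the CSV files. The sketch below joins the two tables on countryid using only the column names listed above. The sample rows are invented for illustration, and the presence of a header row in the real files is an assumption.

```python
import csv
import io

# Invented sample content; the real files would be opened from disk instead.
countries_csv = "id,name,code\n1,Austria,AUT\n2,Finland,FIN\n"
tres_csv = "id,name,countryid\n10,Example TRE A,1\n11,Example TRE B,2\n"

def load(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

# Index countries by id so each TRE row can look up its country.
countries = {row["id"]: row for row in load(countries_csv)}

joined = []
for tre in load(tres_csv):
    country = countries.get(tre["countryid"], {})
    joined.append({"tre": tre["name"], "country": country.get("name")})

print(joined)
```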

    The analysis was done in a Jupyter Notebook, which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb

  20. [Dataset] Does Volunteer Engagement Pay Off? An Analysis of User...

    • zenodo.org
    • recerca.uoc.edu
    zip
    Updated Nov 28, 2022
    Cite
    Simon Krukowski; Ishari Amarasinghe; Nicolás Felipe Gutiérrez-Páez; H. Ulrich Hoppe (2022). [Dataset] Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects [Dataset]. http://doi.org/10.5281/zenodo.7357747
    Available download formats: zip
    Dataset updated
    Nov 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Simon Krukowski; Ishari Amarasinghe; Nicolás Felipe Gutiérrez-Páez; H. Ulrich Hoppe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Explanation/Overview:

    This dataset accompanies the analyses and results achieved in the CS Track project in the research line on participation analyses, which are also reported in the publication "Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects", a paper for the conference CollabTech 2022: Collaboration Technologies and Social Computing, published as part of the Lecture Notes in Computer Science book series (LNCS, volume 13632). The usernames have been anonymised.

    Purpose:

    The purpose of this dataset is to provide the basis to reproduce the results reported in the associated deliverable, and in the above-mentioned publication. As such, it does not represent raw data, but rather files that already include certain analysis steps (like calculated degrees or other SNA-related measures), ready for analysis, visualisation and interpretation with R.

    Relatedness:

    The data of the different projects was derived from the forums of 7 Zooniverse projects based on similar discussion board features. The projects are: 'Galaxy Zoo', 'Gravity Spy', 'Seabirdwatch', 'Snapshot Wisconsin', 'Wildwatch Kenya', 'Galaxy Nurseries', 'Penguin Watch'.

    Content:

    In this Zenodo entry, several files can be found. The structure is as follows (files and folders and descriptions).

    • corresponding_calculations.html
      • Quarto-notebook to view in browser
    • corresponding_calculations.qmd
      • Quarto-notebook to view in RStudio
    • assets
      • data
        • annotations
          • annotations.csv
            • List of annotations made per day for each of the analysed projects
        • comments
          • comments.csv
            • Total list of comments with several data fields (i.e., comment id, text, reply_user_id)
        • rolechanges
          • 478_rolechanges.csv
            • List of roles per user to determine number of role changes
          • 1104_rolechanges.csv
            • ...
          • ...
        • totalnetworkdata
          • Edges
            • 478_edges.csv
              • Network data (edge set) for the given projects (without time slices)
            • 1104_edges.csv
              • ...
            • ...
          • Nodes
            • 478_nodes.csv
              • Network data (node set) for the given projects (without time slices)
            • 1104_nodes.csv
              • ...
            • ...
        • trajectories
          • Network data (edge and node sets) for the given projects and all time slices (Q1 2016 - Q4 2021)
          • 478
            • Edges
              • edges_4782016_q1.csv
              • edges_4782016_q2.csv
              • edges_4782016_q3.csv
              • edges_4782016_q4.csv
              • ...
            • Nodes
              • nodes_4782016_q1.csv
              • nodes_4782016_q2.csv
              • nodes_4782016_q3.csv
              • nodes_4782016_q4.csv
              • ...
          • 1104
            • Edges
              • ...
            • Nodes
              • ...
          • ...
      • scripts
        • datavizfuncs.R
          • script for the data visualisation functions, automatically executed from within corresponding_calculations.qmd
        • import.R
          • script for the import of data, automatically executed from within corresponding_calculations.qmd
    • corresponding_calculations_files
      • files for the html/qmd view in the browser/RStudio

    Grouping:

    The data is grouped according to given criteria (e.g., project_title or time). Accordingly, the respective files can be found in the data structure

Cite
mexwell (2024). 👨‍🎓 Open University Learning Analytics [Dataset]. https://www.kaggle.com/datasets/mexwell/open-university-learning-analytics

👨‍🎓 Open University Learning Analytics

Anonymised Open University Learning Analytics Dataset (OULAD)

Available download formats: zip (44198573 bytes)
Dataset updated
Mar 5, 2024
Authors
mexwell
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset introduces the anonymised Open University Learning Analytics Dataset (OULAD). It contains data about courses, students and their interactions with Virtual Learning Environment (VLE) for seven selected courses (called modules). Presentations of courses start in February and October - they are marked by “B” and “J” respectively. The dataset consists of tables connected using unique identifiers. All tables are stored in the csv format.

Database schema

https://analyse.kmi.open.ac.uk/resources/images/model.png

courses.csv This file contains the list of all available modules and their presentations. The columns are:

  • code_module – code name of the module, which serves as the identifier.
  • code_presentation – code name of the presentation. It consists of the year and “B” for the presentation starting in February and “J” for the presentation starting in October.
  • length – length of the module-presentation in days.

The structure of B and J presentations may differ, and therefore it is good practice to analyse the B and J presentations separately. Nevertheless, for some presentations the corresponding previous B/J presentation does not exist, and therefore the J presentation must be used to inform the B presentation or vice versa. In the dataset this is the case for the CCC, EEE and GGG modules.
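
Because B and J presentations are best analysed separately, it helps to split code_presentation into its year and start month. A minimal sketch, assuming the code is the four-digit year followed by a single letter (for example "2013J"); the function name is hypothetical.

```python
# Mapping taken from the description: B starts in February, J in October.
START_MONTH = {"B": "February", "J": "October"}

def parse_presentation(code):
    """Split a code_presentation value into (year, start month)."""
    year, letter = int(code[:-1]), code[-1]
    return year, START_MONTH[letter]

print(parse_presentation("2013J"))  # (2013, 'October')
print(parse_presentation("2014B"))  # (2014, 'February')
```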

assessments.csv This file contains information about assessments in module-presentations. Usually, every presentation has a number of assessments followed by the final exam. CSV contains columns:

  • code_module – identification code of the module, to which the assessment belongs.
  • code_presentation - identification code of the presentation, to which the assessment belongs.
  • id_assessment – identification number of the assessment.
  • assessment_type – type of assessment. Three types of assessments exist: Tutor Marked Assessment (TMA), Computer Marked Assessment (CMA) and Final Exam (Exam).
  • date – information about the final submission date of the assessment calculated as the number of days since the start of the module-presentation. The starting date of the presentation has number 0 (zero).
  • weight - weight of the assessment in %. Typically, Exams are treated separately and have the weight 100%; the sum of all other assessments is 100%.

If the information about the final exam date is missing, it is at the end of the last presentation week.
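The weight rule is described as typical rather than guaranteed, so it can be worth checking per module-presentation. The sketch below sums the non-exam weights and lists the presentations that deviate from 100%. The rows in `SAMPLE` are invented for illustration (the module codes XXX and YYY are placeholders, not real data), and `non_exam_weight_totals` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assessments.csv column layout described above.
SAMPLE = """code_module,code_presentation,id_assessment,assessment_type,date,weight
XXX,2013J,1,TMA,19,40
XXX,2013J,2,CMA,54,60
XXX,2013J,3,Exam,,100
YYY,2014B,4,TMA,30,50
YYY,2014B,5,Exam,240,100
"""


def non_exam_weight_totals(rows):
    """Sum the weights of all non-exam assessments per module-presentation."""
    totals = defaultdict(float)
    for row in rows:
        if row["assessment_type"] != "Exam":
            key = (row["code_module"], row["code_presentation"])
            totals[key] += float(row["weight"])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
totals = non_exam_weight_totals(rows)

# Presentations whose non-exam weights do not add up to 100%.
unusual = {key: total for key, total in totals.items() if total != 100.0}

# Exams with no recorded date (to be read as "end of the last presentation week").
undated_exams = [r["id_assessment"] for r in rows
                 if r["assessment_type"] == "Exam" and not r["date"].strip()]
```

For the real file, the same functions would be applied to `csv.DictReader(open("assessments.csv", newline=""))`, assuming the file sits in the working directory.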

vle.csv The csv file contains information about the available materials in the VLE. Typically these are html pages, pdf files, etc. Students have access to these materials online and their interactions with the materials are recorded. The vle.csv file contains the following columns:

  • id_site – an identification number of the material.
  • code_module – an identification code for module.
  • code_presentation - the identification code of presentation.
  • activity_type – the role associated with the module material.
  • week_from – the week from which the material is planned to be used.
  • week_to – week until which the material is planned to be used.
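The week_from and week_to columns describe a planned usage window, so a common question is which materials are planned for a given week. The sketch below is one way to answer it under two assumptions that the description does not confirm: that both bounds are inclusive, and that a material with a blank bound has no planned window. The rows in `SAMPLE` and the helper names are invented for illustration.

```python
import csv
import io
from typing import Optional

# Made-up rows in the vle.csv column layout described above.
SAMPLE = """id_site,code_module,code_presentation,activity_type,week_from,week_to
101,XXX,2013J,resource,1,3
102,XXX,2013J,url,,
103,XXX,2013J,quiz,4,4
"""


def parse_week(value: str) -> Optional[int]:
    """Return the week number, or None when the field is blank."""
    value = value.strip()
    return int(value) if value else None


def planned_for_week(row, week: int) -> bool:
    """True when `week` falls inside the material's planned window (inclusive)."""
    start = parse_week(row["week_from"])
    end = parse_week(row["week_to"])
    if start is None or end is None:
        # Assumption: no planned window recorded, so never matched.
        return False
    return start <= week <= end


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
week_3 = [row["id_site"] for row in rows if planned_for_week(row, 3)]
week_4 = [row["id_site"] for row in rows if planned_for_week(row, 4)]
```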

studentInfo.csv This file contains demographic information about the students together with their results. File contains the following columns:

  • code_module – an identification code for a module on which the student is registered.
  • code_presentation - the identification code of the presentation during which the student is registered on the module.
  • id_student – a unique identification number for the student.
  • gender – the student’s gender.
  • region – identifies the geographic region, where the student lived while taking the module-presentation.
  • highest_education – highest student education level on entry to the module presentation.
  • imd_band – specifies the Index of Multiple Deprivation band of the place where the student lived during the module-presentation.
  • age_band – band of the student’s age.
  • num_of_prev_attempts – the number of times the student has attempted this module.
  • studied_credits – the total number of credits for the modules the student is currently studying.
  • disability – indicates whether the student has declared a disability.
  • final_result – student’s final result in the module-presentation.
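Because each studentInfo.csv row pairs a student with one module-presentation, outcomes are naturally summarised per (code_module, code_presentation). The sketch below counts final_result values for each pair. Only a subset of the columns is shown, the rows in `SAMPLE` are invented, and the result labels are placeholders, since the description above does not list the values that final_result can take.

```python
import csv
import io
from collections import Counter

# Made-up rows using a subset of the studentInfo.csv columns described above.
SAMPLE = """code_module,code_presentation,id_student,final_result
XXX,2013J,1,Pass
XXX,2013J,2,Fail
XXX,2013J,3,Pass
YYY,2014B,1,Withdrawn
"""


def result_counts(rows):
    """Count final_result values for each module-presentation."""
    counts = {}
    for row in rows:
        key = (row["code_module"], row["code_presentation"])
        counts.setdefault(key, Counter())[row["final_result"]] += 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = result_counts(rows)
```

Keying on the module-presentation pair rather than on id_student alone keeps the same student's results on different modules apart, as with student 1 in the sample.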

studentRegistration.csv This file contains information about the time when the student registered for the module presentation. For students who unregistered, the date of unregistration is also recorded. File contains five columns:

  • code_module – an identification code for a module.
  • code_presentation - the identification code of the presentation.
  • id_student – a unique identification number for the student.
  • date_registration – the date of student’s registration on the module presentation, this is the number of days measured relative to the start of the module-presentation (e.g. the negative value -30 means that the student registered to module presentation 30 days before it started).
  • date_unr...