Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a ProgSnap2-based dataset containing anonymized logs of over 34,000 programming events produced by 81 programming students in Scratch, a visual programming environment, during the study described in the paper "Semi-Automatically Mining Students' Common Scratch Programming Behaviors." We also include a list of approximately 3,100 mined sequential patterns of programming processes performed by at least 10% of the 62 novice programmers among the 81 students; these are the maximal patterns generated by the MG-FSM algorithm when allowing a gap of one programming event. The dataset comprises the following files:
README.txt — overview of the dataset and its properties
mainTable.csv — main event table of the dataset holding rows of programming events
codeState.csv — table holding XML representations of code snapshots at the time of each programming event
datasetMetadata.csv — describes features of the dataset
Scratch-SeqPatterns.txt — list of sequential patterns mined from the Main Event Table
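As a quick illustration of how the tables relate, the sketch below (assuming pandas is installed and that mainTable.csv uses the standard ProgSnap2 column names SubjectID, EventType and CodeStateID; check README.txt for the actual columns) loads the event table and attaches the XML code snapshots:

# Exploratory sketch under the assumptions stated above; not part of the dataset.
import pandas as pd

events = pd.read_csv("mainTable.csv")
code_states = pd.read_csv("codeState.csv")

print(events.groupby("SubjectID").size().describe())   # events logged per student
print(events["EventType"].value_counts().head(10))     # most frequent event types

# Attach the XML snapshot recorded at the time of each event.
merged = events.merge(code_states, on="CodeStateID", how="left")
print(merged.head())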
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScD (Leicester Scientific Dictionary)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC and instructions for using the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.
LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.
LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of a word in the entire corpus

Processing the LSC
Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document excluding abstracts, and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of the metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.
1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters with space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose the actual meaning of such words. Uniting prefixes with words is performed in a later step of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united into a single word. The list of prefixes united for this research is given in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing the meaning of the word before removing the character "-". Some examples of such words are "z-test", "well-known" and "chi-square". These words have been substituted with "ztest", "wellknown" and "chisquare". Identification of such words was done by sampling abstracts from the LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
5. Removing the character "-": All remaining "-" characters are replaced by space.
6. Removing numbers: All digits which are not included in a word are replaced by space. All words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulae might be important for our analysis. Some examples are "co2", "h2o" and "21st".
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

The Organisation of the LScD
The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
Word: Contains the unique words from the corpus. All words are in lowercase and in their stemmed form. The field is sorted by the number of documents containing the word, in descending order.
Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it is counted as 1; if the word exists more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
Metadata File: Includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from the LSC as defined in the previous section.
The code can be used as follows:
1. Download the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'.
2. Open the LScD_Creation.R script.
3. Change the parameters in the script: replace them with the full path of the directory with the source files and the full path of the directory to write the output files.
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
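For a quick sense of what the Step 4 operations do, the sketch below re-implements them in Python on a single abstract. It is an illustration only, not the published pipeline: the authors' processing is done by the R script LScD_Creation.R [2] with the 'tm' package, whereas this sketch uses NLTK's Porter stemmer and stop-word list and only a sample of the prefixes and substitutions from the accompanying CSV files.

# Illustrative Python sketch of the Step 4 pre-processing (not the authors' R pipeline).
# Requires: pip install nltk; then run nltk.download("stopwords") once.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

PREFIXES = {"e", "extra", "per", "self", "ultra", "non", "pre"}   # sample only; full list is in list_of_prefixes.csv
SUBSTITUTIONS = {"z-test": "ztest", "well-known": "wellknown", "chi-square": "chisquare"}  # sample; see list_of_substitution.csv

def preprocess(abstract):
    text = re.sub(r"[^A-Za-z0-9\s-]", " ", abstract)               # 1. replace punctuation/special characters, keep "-"
    text = text.lower()                                            # 2. lowercase
    text = re.sub(r"\b(" + "|".join(PREFIXES) + r")-", r"\1", text)  # 3. unite listed prefixes ("pre-processing" -> "preprocessing")
    for src, dst in SUBSTITUTIONS.items():                         # 4. substitute selected hyphenated words
        text = text.replace(src, dst)
    text = text.replace("-", " ")                                  # 5. remove remaining "-"
    text = re.sub(r"\b\d+\b", " ", text)                           # 6. remove standalone numbers, keep tokens like "co2"
    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))                        # NLTK list; the authors used the 174 stop words in 'tm'
    return [stemmer.stem(tok) for tok in text.split() if tok not in stops]  # 7-8. stem and drop stop words

print(preprocess("Pre-processing of the z-test data (n = 21) uses CO2 measurements."))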
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset containing Integrated Development Environment (IDE) logs from an introductory programming MOOC. The dataset contains information on when actions in the IDE were performed in relation to deadlines over the different parts of the course. One exceptional aspect of the dataset is that part of the logs has been gathered at the keystroke level, allowing for fine-grained insight into the learning process. In addition to the IDE logs themselves, the dataset records whether the students included in the data passed the course. This can facilitate further research that analyzes how time-related behavior relates to performance in introductory programming courses.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In 2017 an online survey of University of Reading staff active in or supporting research, and of registered PhD students, was undertaken to assess the nature and extent of research programming and software development activities in the University, and to understand how the University might provide guidance, training and support. The survey was administered by the Research Data Manager on behalf of the University's Research Data Management Steering Group. The survey ran from 1st November to 15th December 2017 and collected a total of 170 responses.
The survey sought responses from anyone in the University who was involved in any of the following activities:
The survey was distributed using the Bristol Online Survey tool. A dataset of anonymised survey responses and a PDF of the survey questions are included here.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of performance measures of STEED compared with manual human ascertainment.
This resource collects teaching materials that are originally created for the in-person course 'GEOSC/GEOG 497 – Data Mining in Environmental Sciences' at Penn State University (co-taught by Tao Wen, Susan Brantley, and Alan Taylor) and then refined/revised by Tao Wen to be used in the online teaching module 'Data Science in Earth and Environmental Sciences' hosted on the NSF-sponsored HydroLearn platform.
This resource includes both R Notebooks and Python Jupyter Notebooks to teach the basics of R and Python coding, data analysis and data visualization, as well as building machine learning models in both programming languages by using authentic research data and questions. All of these R/Python scripts can be executed either on the CUAHSI JupyterHub or on your local machine.
This resource is shared under the CC-BY license. Please contact the creator Tao Wen at Syracuse University (twen08@syr.edu) for any questions you have about this resource. If you identify any errors in the files, please contact the creator.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for the paper titled "Self-Admitted Technical Debt in R Packages: An Exploratory Study" (Vidoni, 2021), appearing at: https://2021.msrconf.org/track/msr-2021-technical-papers#Accepted-Papers-
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
*Methods that allow the use of covariates.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Issue tracking systems enable users and developers to comment on problems plaguing a software system. Empirical Software Engineering (ESE) researchers study (open-source) project issues and the comments and threads within them to discover, among other things, the challenges developers face when, e.g., incorporating new technologies, platforms, and programming language constructs. However, issue discussion threads accumulate over time and thus can become unwieldy, hindering any insight that researchers may gain. While existing approaches alleviate this burden by classifying issue thread comments, there is a gap between searching popular open-source software repositories (e.g., those on GitHub) for issues containing particular keywords and feeding the results into a classification model. In this paper, we demonstrate a research infrastructure tool called QuerTCI that bridges this gap by integrating the GitHub issue comment search API with the classification models found in existing approaches. Using queries, ESE researchers can retrieve GitHub issues containing particular keywords, e.g., those related to a certain programming language construct, and subsequently classify the kinds of discussions occurring in those issues. With our tool, we hope that ESE researchers can uncover challenges related to particular technologies, using certain keywords, across popular open-source repositories more seamlessly than previously possible. A tool demonstration video may be found at: https://youtu.be/fADKSxn0QUk.
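A rough sketch of the query-then-classify workflow the tool automates is shown below. It is not the QuerTCI implementation: it calls the public GitHub REST search endpoint directly, and classify() is a hypothetical placeholder for the comment classification models the paper integrates.

# Illustration of the gap being bridged, not QuerTCI itself.
import requests

def search_issues(keyword, token=None):
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"{keyword} is:issue", "per_page": 10},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

def classify(text):
    # Hypothetical stand-in: a real pipeline would load a trained issue-comment classifier here.
    return "question" if "?" in text else "other"

for issue in search_issues("async await"):
    print(classify(issue.get("body") or ""), issue["html_url"])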
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').
There are four datasets: 1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]; 2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs; 3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs); 4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs). The smaller datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).
The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).
Input variables:
1 - age (numeric)
2 - job: type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
3 - marital: marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
5 - default: has credit in default? (categorical: 'no', 'yes', 'unknown')
6 - housing: has housing loan? (categorical: 'no', 'yes', 'unknown')
7 - loan: has personal loan? (categorical: 'no', 'yes', 'unknown')
8 - contact: contact communication type (categorical: 'cellular', 'telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y: has the client subscribed to a term deposit? (binary: 'yes', 'no')
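As a quick baseline, the sketch below (assuming pandas and scikit-learn are installed, and the semicolon-separated layout these CSV files are usually distributed with) loads bank-additional-full.csv, drops the duration attribute as advised above, and fits a simple logistic-regression classifier for y:

# Baseline sketch under the assumptions stated above; not part of the original dataset.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("bank-additional-full.csv", sep=";")
X = df.drop(columns=["y", "duration"])   # drop duration for a realistic model (see note on variable 11)
y = (df["y"] == "yes").astype(int)

categorical = X.select_dtypes(include="object").columns
numeric = X.select_dtypes(exclude="object").columns

model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), categorical),
        (StandardScaler(), numeric),
    ),
    LogisticRegression(max_iter=1000),
)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))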
S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimaraes, Portugal, October, 2011. EUROSIS. [bank.zip]
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of 2000 conversations from 5 programming-related Q&A channels hosted on Slack, and accompanies the paper "Automatically Identifying Archival-worthy, Software-related Slack Conversations". In addition to the text of the conversations, each conversation has been annotated as either archival-worthy or not. Our definition of archival-worthiness is:
"If a conversation contains information that could be useful to other users, whether in the Slack channel or elsewhere, then it should be archived. These conversations have no determinate length and no need for objectivity. A conversation should be archived based on the availability and ease of identifying information that could help a person to gain useful software-related knowledge."
Data Origin: Numerous public Slack chat channels (https://slack.com/) have recently become available that are focused on specific software engineering-related discussion topics, e.g., Python Development (https://pyslackers.com/web/slack). The data reflects a portion of the conversations on public channels related to Python, Clojure, Elm and Racket programming.
Data Pre-Processing: To protect privacy, we replace usernames with fake names and replace absolute times with relative times (in seconds). The conversations are disentangled from the overall chat stream, with each unique thread in the dataset specifying a conversation in the channel. Archival-worthy conversations are marked with 1, and non-archival-worthy conversations with 0.
The MovieLens 1M dataset is a collection of movie ratings provided by the GroupLens Research Project at the University of Minnesota. This dataset contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 users who joined MovieLens in 2000. It is widely used in research on recommendation systems, data mining, and machine learning.
Ratings: Each rating has the following attributes: a user ID, a movie ID, a rating on a 5-star scale (whole-star ratings only), and a timestamp.
Movies: Each movie entry includes: a movie ID, the title (with release year), and a list of genres.
Users: Each user entry includes: a user ID, gender, an age group, an occupation code, and a zip code.
The dataset is provided in plain text format with tab-separated values, making it easy to load and process using various data analysis tools and programming languages.
This dataset is made available by the GroupLens Research Project at the University of Minnesota. For more information about their work, visit the GroupLens website.
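A minimal loading sketch, assuming pandas. The canonical MovieLens 1M files (e.g. ratings.dat) use "::" as the field separator; if this copy is distributed as the tab-separated text described above, adjust sep and the file name accordingly.

# Loading sketch under the assumptions stated above.
import pandas as pd

ratings = pd.read_csv(
    "ratings.dat",
    sep="::",
    engine="python",   # the two-character separator requires the python engine
    names=["user_id", "movie_id", "rating", "timestamp"],
)
print(ratings.shape)   # expect (1000209, 4)
print(ratings.groupby("movie_id")["rating"].mean().sort_values(ascending=False).head())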
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Preface
This is the data repository for the paper accepted for publication in NUSA's special issue on Linguistic studies using large annotated corpora (co-edited by Hiroki Nomoto and David Moeljadi).

How to cite the dataset
If you use, adapt, and/or modify any of the data in this repository for your research or teaching purposes (except for malindo_dbase, see below), please cite:
Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): Dataset for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Fileset. https://doi.org/10.6084/m9.figshare.8187155
Alternatively, click on the dark pink Cite button to browse different citation styles (the default is DataCite).
The malindo_dbase data in this repository is from Nomoto et al. (2018) (cf. the GitHub repository), so please also cite their work if you use it for your research:
Nomoto, Hiroki, Hannah Choi, David Moeljadi and Francis Bond. 2018. MALINDO Morph: Morphological dictionary and analyser for Malay/Indonesian. In Kiyoaki Shirai (ed.), Proceedings of the LREC 2018 Workshop "The 13th Workshop on Asian Language Resources", 36-43.
A tutorial on how to use the data, together with the R Markdown Notebook for the analyses, is available on GitHub and figshare:
Rajeg, Gede Primahadi Wijaya; Denistia, Karlina; Musgrave, Simon (2019): R Markdown Notebook for Vector space model and the usage patterns of Indonesian denominal verbs. figshare. Software. https://doi.org/10.6084/m9.figshare.9970205

Dataset description
1. Leipzig_w2v_vector_full.bin is the vector space model used in the paper. We built it using the wordVectors package (Schmidt & Li 2017) via the MonARCH High Performance Computing Cluster (we thank Philip Chan for his help with access to MonARCH).
2. Files beginning with ngramexmpl_... are data for the n-grams (i.e. word sequences) of the verbs discussed in the paper. The files are in tab-separated format.
3. Files beginning with sentence_... are full sentences for the verbs discussed in the paper (in plain text format and R dataset format [.rds]). Information on the corpus file and sentence number in which each verb is found is included.
4. me_parsed_nountaggedbase (in three different file formats) contains a database of the me- words with noun-tagged roots that MorphInd identified as occurring in the three morphological schemas we focus on (me-, me-/-kan, and me-/-i). The database has columns for the verbs' token frequency in the corpus, root forms, and MorphInd parsing output, among others.
5. wordcount_leipzig_allcorpus (in three different file formats) contains information on the size of each corpus file used in the paper and from which the vector space model is built.
6. wordlist_leipzig_ME_DI_TER_percorpus.tsv is a tab-separated frequency list of words prefixed with me-, di-, and ter- in all thirteen corpus files used. The wordlist is built by first tokenising each corpus file, lowercasing the tokens, and then extracting the words with the corresponding three prefixes using the following regular expressions:
- For me-: ^(?i)(me)([a-z-]{3,})$
- For di-: ^(?i)(di)([a-z-]{3,})$
- For ter-: ^(?i)(ter)([a-z-]{3,})$
7. malindo_dbase is the MALINDO Morphological Dictionary (see above).

References
Schmidt, Ben & Jian Li. 2017. wordVectors: Tools for creating and analyzing vector-space models of texts. R package. http://github.com/bmschmidt/wordVectors.
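An illustrative sketch (in Python, not the authors' original tooling) of applying the three prefix patterns above to a tokenised, lowercased word list; the inline (?i) flag is expressed as re.IGNORECASE because Python 3.11+ rejects (?i) when it does not open the pattern.

# Illustration only: the published wordlist was built from the full Leipzig corpus files.
import re
from collections import Counter

PATTERNS = {
    "me-": re.compile(r"^(me)([a-z-]{3,})$", re.IGNORECASE),
    "di-": re.compile(r"^(di)([a-z-]{3,})$", re.IGNORECASE),
    "ter-": re.compile(r"^(ter)([a-z-]{3,})$", re.IGNORECASE),
}

tokens = ["membaca", "dibaca", "terbaca", "buku", "menulis"]   # hypothetical sample tokens

freqs = {name: Counter() for name in PATTERNS}
for tok in tokens:
    for name, pattern in PATTERNS.items():
        if pattern.match(tok):
            freqs[name][tok] += 1

for name, counter in freqs.items():
    print(name, dict(counter))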
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Educational Data Mining (EDM) holds promise in uncovering insights from educational data to predict and enhance students’ performance. This paper presents an advanced EDM system tailored for classifying and improving tertiary students’ programming skills. Our approach emphasizes effective feature engineering, appropriate classification techniques, and the integration of Explainable Artificial Intelligence (XAI) to elucidate model decisions. Through rigorous experimentation, including an ablation study and evaluation of six machine learning algorithms, we introduce a novel ensemble method, Stacking-SRDA, which outperforms others in accuracy, precision, recall, f1-score, ROC curve, and McNemar test. Leveraging XAI tools, we provide insights into model interpretability. Additionally, we propose a system for identifying skill gaps in programming among weaker students, offering tailored recommendations for skill enhancement.
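The paper's Stacking-SRDA ensemble is not reproduced here, but the general stacking idea it builds on can be sketched with scikit-learn; the base learners, meta-learner, and synthetic data below are arbitrary placeholders rather than the configuration used in the study.

# Generic stacking-ensemble sketch; see the paper for the actual Stacking-SRDA design.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=12, random_state=0)   # stand-in for engineered student features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)), ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print("held-out accuracy:", stack.score(X_test, y_test))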
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all data collected during the study, "How Are Different Asynchronous Programming Constructs in JavaScript Related to Software Quality? A Repository Mining Study on GitHub".