Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was created by Ananya Nayan
Released under Database: Open Database, Contents: © Original Authors
This dataset was created by Mohamed Khaled
This dataset was created by karthickveerakumar
This dataset was created by FayeJavad
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The datasets used in this research relate to the aims of Sustainable Development Goal 7 (SDG 7). They were used to train and test a machine learning model based on an artificial neural network, along with other machine learning regression models, to predict scores measuring how far the SDG 7 aims have been realized. The training dataset was built from data for 2013 to 2021 and contains 261 samples. The test dataset contains 29 samples. Source data from 2013 to 2022 are available in 10 XLSX and CSV files. The training and test datasets are available as XLSX and CSV files. A detailed description of the data is available in a PDF file.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The methodology is the core component of any research work: it sets out the methods used to obtain the results. Here, the entire implementation is done in Python. The work involves the following steps:
1. Acquire Personality Dataset
Kaggle hosts a collection of datasets and data generators that the machine learning community uses for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the International Personality Item Pool (IPIP). The dataset can be downloaded as a zip file from the link provided. It consists of two CSV files (test.csv and train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output. The dataset is multivariate. Data preprocessing is applied to check for inconsistent behaviors or trends.
2. Data preprocessing
After data acquisition, the next step is to clean and preprocess the data. The dataset has numerical features. The target is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is then split into training and testing sets by passing the feature values, target values, and test size to the train-test split method of the scikit-learn package. After the split, the training data is used to fit the Logistic Regression and SVM models, and the test data is then used to measure the accuracy of the trained models.
3. Feature Extraction
The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.
EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I get irritated easily.
EST3 I worry about things.
EST4 I change my mood a lot.
AGR1 I have a soft heart.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I am not really interested in others.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I follow a schedule.
CSN4 I make a mess of things.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I do not have a good imagination.
OPN4 I use difficult words.
4. Training the Model
Train/test is a method for measuring the accuracy of a model. It is so called because the dataset is split into two sets: a training set (80%) and a testing set (20%). The model is trained on the training set. Here, the models were trained using linear_model.LogisticRegression() and svm.SVC() from the sklearn package.
5. Personality Prediction Output
After training, the Logistic Regression and SVM models are evaluated on the test data using Cohen's kappa (cohen_kappa_score) and the accuracy score.
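To make the two evaluation metrics concrete, here is a minimal pure-Python sketch of what accuracy and Cohen's kappa compute. It is an illustration only, not the authors' code: in practice the sklearn.metrics functions named above would be used, and the label lists below are invented for the example.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that both label sources pick the same
    # class if they chose independently with their observed frequencies.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        return float("nan")  # undefined when both lists hold one identical class
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical labels, invented for illustration (not taken from the dataset).
y_true = ["serious", "lively", "lively", "responsible", "dependable", "extraverted"]
y_pred = ["serious", "lively", "serious", "responsible", "dependable", "lively"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"kappa    = {cohen_kappa(y_true, y_pred):.3f}")
```

Kappa is lower than accuracy here because part of the agreement would be expected by chance alone, which is why it is often reported alongside accuracy for multi-class targets.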
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The statistical data used in the combined machine-learning functions are available here as csv files. This includes all geomorphic data at the determined radial buffer size (36 variables), the results of the bagged trees function (12 variables), and the bagged trees resulting feature importance data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data are genetic assignments to upstream or downstream of the Site C dam (bull trout, Arctic grayling, and rainbow trout). Columns are defined in the csv file. A file of R code to run the analysis is also included.
The "Dataset_HIR" folder contains the data to reproduce the results of the data mining approach proposed in the manuscript titled "Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model".
More specifically, the folder contains the raw electronic structure calculation input data provided by the domain experts as well as the training and testing dataset with the extracted features.
The "Dataset_HIR" folder contains the following subfolders:
1. Electronic structure calculation input data: contains the electronic structure calculation input generated by the Gaussian program
1.1. Training data: contains the raw data of all training species (each is stored in a separate folder) used for extracting the dataset for the training and validation phase.
1.2. Testing data: contains the raw data of all testing species (each is stored in a separate folder) used for extracting data for the testing phase.
2. Dataset
2.1. Training dataset: used to produce the results in Tables 3 and 4 in the manuscript
+ datasetTrain_raw.csv: contains the features for all vibrational modes associated with corresponding labeled species to let the chemists select the Hindered Internal Rotor from the list easily for the training and validation steps.
+ datasetTrain.csv: refines the datasetTrain_raw.csv where the names of the species are all removed to transform the dataset into an appropriate form for the modeling and validation steps.
2.2. Testing dataset: used to produce the results of the data mining approach in Table 5 in the manuscript.
+ datasetTest_raw.csv: contains the features for all vibrational modes of each labeled species to let the chemists select the Hindered Internal Rotor from the list for the testing step.
+ datasetTest.csv: refines the datasetTest_raw.csv where the names of the species are all removed to transform the dataset into an appropriate form for the testing step.
Note for the Result feature in the dataset: 1 is for the mode needed to be treated as Hindered Internal Rotor, and 0 otherwise.
This dataset was created by Study Mart
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data sets are used in a controlled experiment in which two classifiers are compared. train_a.csv and explain.csv are slices of the original data set. train_b.csv contains the same instances as train_a.csv, but with feature x1 set to 0 to make it unusable to classifier B.
The original data set was created and split using this Python code:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Generate the original two-feature data set and rescale it
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           class_sep=0.75, random_state=0)
X *= 100

# Classifier A: trained on both features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
lm = LogisticRegression()
lm.fit(X_train, y_train)
clf_a = lm

# Classifier B: same split, but with feature x1 zeroed out
clf_b = LogisticRegression()
X2 = X.copy()
X2[:, 0] = 0
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
clf_b.fit(X2_train, y2_train)

# The held-out half is used as the data to explain
X_explain = X_test
y_explain = y_test
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the input datasets and the results of the analyses reported on the paper titled "Are citation networks relevant to explain academic promotions? An empirical analysis of the Italian national scientific qualification".
Abstract:
The aim of this paper is to study the role of citation network measures in the assessment of scientific maturity. Referring to the case of the Italian national scientific qualification (ASN), we investigate whether there is a relationship between citation network indices and the results of the researchers’ evaluation procedures. In particular, we want to understand whether network measures can enhance the prediction accuracy of the results of the evaluation procedures beyond basic performance indices. Moreover, we want to highlight which citation network indices prove to be more relevant in explaining the ASN results, and whether quantitative indices used in the citation-based disciplines assessment can replace the citation network measures in non-citation-based disciplines. Data concerning Statistics and Computer Science disciplines are collected from different sources (ASN, Italian Ministry of University and Research, and Scopus) and processed in order to calculate the citation-based measures used in this study. We then apply classification models to estimate the effects of network variables. We find that network measures are strongly related to the results of the ASN and significantly improve the explanatory power of the models, especially for the research fields of Statistics. Additionally, citation networks in the specific sub-disciplines are far more relevant than those in the general disciplines. Finally, results show that the citation network measures are not a substitute for the citation-based bibliometric indices.
Code
The code to collect and process the data used in this paper is available on GitHub at https://github.com/DigitalDataLab/ASN16-18_CitationNetwork.
Dataset description
The files AdjacencyMatrix_01B1.csv, AdjacencyMatrix_09H1.csv, AdjacencyMatrix_13D1.csv, AdjacencyMatrix_13D2.csv and AdjacencyMatrix_13D3.csv are the citation matrices for Italian academics (i.e. ASN candidates and permanent positions in the Italian academic system) in the Recruitment Fields (RFs) 01/B1, 09/H1, 13/D1, 13/D2 and 13/D3, respectively.
The files AdjacencyMatrix_CS.csv and AdjacencyMatrix_ST.csv are the citation matrices for the Italian academics in the Computer Science disciplines (i.e. RFs 01/B1 and 09/H1) and the Statistical disciplines (i.e. RFs 13/D1, 13/D2 and 13/D3), respectively.
The files CS_01B1_1.csv, CS_09H1_1.csv, ST_13D1_1.csv, ST_13D2_1.csv and ST_13D3_1.csv contain the data used to build the logistic regression models presented in the paper for the Italian academics at the Full Professor (FP) level.
The files CS_01B1_2.csv, CS_09H1_2.csv, ST_13D1_2.csv, ST_13D2_2.csv and ST_13D3_2.csv contain the data used to build the logistic regression models presented in the paper for the Italian academics at the Associate Professor (AP) level.
The file Codebook.pdf is the codebook of the previous ten files.
The file Appendix.pdf contains the final results of the stepwise logistic regressions computed for each level (i.e. Full Professor and Associate Professor) and Recruitment Field in the Computer Science and Statistics disciplines.
The file NormalityAssessment.pdf contains the normality assessment of citation network indices.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://spdx.org/licenses/CC0-1.0.html
Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning.
Methods
How the data were collected: In this project, we collected gaze data from 5 macaques while they watched sequential visual displays designed to elicit probabilistic expectations. Gaze was recorded using the Eyelink Toolbox and sampled at 1000 Hz by an infrared eye-monitoring camera system.
Dataset:
"csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:
subj: subject_ID = {"B":104, "C":102,"H":101,"J":103,"K":203}
trialtime: start time of current trial in seconds
trial: current trial number (each trial featured one of 80 possible visual-event sequences) (in order)
seq current: sequence number (one of 80 sequences)
seq_item: current item number in a seq (in order)
active_item: pop-up item (active box)
pre_active: prior pop-up item (active box) {-1: "the first active object in the sequence/ no active object before the currently active object in the sequence"}
next_active: next pop-up item (active box) {-1: "the last active object in the sequence/ no active object after the currently active object in the sequence"}
firstappear: {0: "not first", 1: "first appear in the seq"}
looks_blank: csv: total amount of time spent looking at blank space for current event (ms); csv_timestamp: {1: "look blank at timestamp", 0: "not look blank at timestamp"}
looks_offscreen: csv: total amount of time spent looking offscreen for current event (ms); csv_timestamp: {1: "look offscreen at timestamp", 0: "not look offscreen at timestamp"}
time till target: time elapsed before first looking at the target object (ms) {-1: "never look at the target"}
looks target: csv: time spent looking at the target object (ms); csv_timestamp: look at the target or not at current timestamp (1 or 0)
look1,2,3: time spent looking at each object (ms)
location 123X, 123Y: location of each box (the locations of the three boxes for a given sequence were chosen randomly, but remained static throughout the sequence)
item123id: pop-up item ID (remained static throughout a sequence)
event time: total time spent for the whole event (pop-up and go back) (ms)
eyeposX,Y: eye position at current timestamp
"csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:
rt: time till target {-1: "never look at the target"}. In data analysis, we included data that have rt > 0.
already_there: {NA: "never look at the target object"}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and already_there is not NA.
looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"}
prob: the probability of the occurrence of object
surprisal: unigram surprisal value
bisurprisal: transitional surprisal value
std_surprisal: standardized unigram surprisal value
std_bisurprisal: standardized transitional surprisal value
binned_surprisal_means: the means of unigram surprisal values binned to three groups of evenly spaced intervals according to surprisal values.
binned_bisurprisal_means: the means of transitional surprisal values binned to three groups of evenly spaced intervals according to surprisal values.
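As a rough illustration of the surprisal and bisurprisal columns, the sketch below computes unigram and transitional surprisal for one event sequence. It rests on assumptions not stated in the description: probabilities are plain relative frequencies over the whole sequence, and surprisal is taken as the negative base-2 log probability. The values in the dataset come from the authors' notebook and may be computed differently; the sequence here is invented.

```python
import math
from collections import Counter

def unigram_surprisal(events):
    """Surprisal -log2 p(x) of each event, with p(x) as a relative frequency."""
    counts = Counter(events)
    total = len(events)
    return [-math.log2(counts[e] / total) for e in events]

def transitional_surprisal(events):
    """Surprisal -log2 p(x_t | x_{t-1}); None for the first event,
    which has no predecessor (compare the NA values in the dataset)."""
    pair_counts = Counter(zip(events, events[1:]))
    prev_counts = Counter(events[:-1])
    result = [None]
    for prev, cur in zip(events, events[1:]):
        result.append(-math.log2(pair_counts[(prev, cur)] / prev_counts[prev]))
    return result

# Hypothetical sequence of active-item IDs (assumption, not real data).
seq = [1, 2, 1, 3, 1, 2]
uni = unigram_surprisal(seq)
bi = transitional_surprisal(seq)
for item, u, b in zip(seq, uni, bi):
    print(item, round(u, 3), None if b is None else round(b, 3))
```

Rare items and rare transitions receive higher surprisal, which is the quantity the analyses relate to looking behavior.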
"csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc. "AllSeq.csv" includes event information of all 80 sequences
Empty Values in Datasets:
The original dataset "csv-combined.csv" has no missing values. Missing values (marked as NA) occur in the columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in "prev_active" and "next_active" mark the first or the last active object in the sequence, that is, there is no active object before or after the currently active object. When we analyzed the variable "already_there", we excluded rows whose "prev_active" value is NA. NAs in "already_there" mean that the subject never looked at the target object in the current event. When we analyzed "already_there", we also excluded rows whose "already_there" value is NA. NAs occur in "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" for the first event in a sequence, where the transitional probability cannot be computed because no event precedes it. When we fitted models for transitional statistics, we excluded rows in which "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" are NA.
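The exclusion rules above amount to dropping rows that hold NA in the columns relevant to a given analysis. A minimal sketch with the standard csv module follows. The inline sample is invented and uses only three of the column names mentioned in the text; the real files contain many more columns, and the authors' own filtering is done in their R code.

```python
import csv
import io

# Hypothetical excerpt (assumption): values are made up for illustration.
sample = io.StringIO(
    "seq_item,already_there,bisurprisal\n"
    "1,NA,NA\n"
    "2,FALSE,1.58\n"
    "3,NA,0.58\n"
    "4,TRUE,2.00\n"
)

def drop_na(rows, columns):
    """Keep only rows in which none of the given columns holds the string 'NA'."""
    return [r for r in rows if all(r[c] != "NA" for c in columns)]

rows = list(csv.DictReader(sample))
for_already_there = drop_na(rows, ["already_there"])  # rows usable for already_there
for_transitional = drop_na(rows, ["bisurprisal"])     # rows usable for transitional models

print(len(rows), len(for_already_there), len(for_transitional))
```

Filtering per analysis, instead of dropping every row with any NA, keeps as much data as possible for each model.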
Codes:
In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. In this notebook we computed the following variables, which are necessary for further analysis, modeling, and visualization (see above for details): active_item, pre_active, next_active, firstappear, looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means.
"Analysis_Code_final.Rmd" is the main script, in which we further processed the data, built models, and created visualizations. We evaluated the statistical significance of variables using mixed-effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariates such as whether an object is a repeat, the distance between the current and previous pop-up object, and the trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data.
"helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains all datasets and source code used in our study on phishing URL detection. The study investigates the effectiveness of traditional machine learning (ML), deep learning (DL), and large language model (LLM)-based methods using a consistent set of 32 URL-based features.

Directory Structure
├── Datasets/
│   ├── Dataset-1.csv
│   ├── Dataset-2.csv
│   ├── Dataset-3.csv
│   ├── Dataset-4.csv
│   ├── Dataset-5.csv
│   ├── Phishing_Site_URLs_32_Features_Extracted_Data.csv
│   └── Legit_Phish_32_Features_Extracted_Data.csv
│
└── Source_Codes/
    ├── Feature_extraction_source_code.py
    ├── Feature_importance_analysis_source_code.py
    ├── ML/
    │   ├── Seven_ML_Models_trained_on_LP.py
    │   ├── Seven_ML_Models_trained_on_PSU.py
    │   ├── SoftVoting_trained_on_LP.py
    │   ├── SoftVoting_trained_on_PSU.py
    │   ├── HardVoting_trained_on_LP.py
    │   └── HardVoting_trained_on_PSU.py
    │
    ├── DL/
    │   ├── [DLModel1]_trained_on_LP.py
    │   ├── [DLModel1]_trained_on_PSU.py
    │   └── ... (total 16 files for 8 DL algorithms)
    │
    └── LLM/
        ├── BERT_Fine_Tuned_on_LP.py
        ├── BERT_Fine_Tuned_on_PSU.py
        ├── DistilBERT_Fine_Tuned_on_LP.py
        ├── DistilBERT_Fine_Tuned_on_PSU.py
        ├── PhishBERT_Evaluation.py
        └── URLBERT_Evaluation.py

Datasets:
Dataset-1.csv to Dataset-5.csv: Used for feature importance analysis.
Phishing_Site_URLs_32_Features_Extracted_Data.csv (PSU dataset): Includes phishing and legitimate URLs with 32 extracted lexical features.
Legit_Phish_32_Features_Extracted_Data.csv (LP dataset): Another benchmark dataset with the same 32 features, used for comparative evaluation.
Note: PSU and LP datasets are used for both training and evaluating ML, DL, and LLM-based models.

Source Code:
Feature_extraction_source_code.py: Extracts 32 handcrafted lexical features from raw URL data.
Feature_importance_analysis_source_code.py: Performs feature selection using seven statistical and model-based ranking methods.

Machine Learning (ML)
Implements ML classifiers individually trained on LP and PSU datasets: Logistic Regression, Decision Tree, Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and XGBoost. Soft Voting and Hard Voting ensembles are also implemented.
Scripts:
Seven_ML_Models_trained_on_LP.py
Seven_ML_Models_trained_on_PSU.py
SoftVoting_trained_on_LP.py, SoftVoting_trained_on_PSU.py
HardVoting_trained_on_LP.py, HardVoting_trained_on_PSU.py

Deep Learning (DL)
Implements eight deep learning architectures (each trained separately on LP and PSU). Total of 16 scripts: 2 per DL model (1 for LP, 1 for PSU).

Large Language Models (LLMs)
Fine-tuned:
BERT_Fine_Tuned_on_LP.py, BERT_Fine_Tuned_on_PSU.py
DistilBERT_Fine_Tuned_on_LP.py, DistilBERT_Fine_Tuned_on_PSU.py
Pre-trained, zero-shot or direct evaluation:
PhishBERT_Evaluation.py
URLBERT_Evaluation.py
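The difference between the hard-voting and soft-voting ensembles mentioned in the ML section can be sketched in a few lines of plain Python. This is a generic illustration of the two voting rules, not code from the package's scripts. The probabilities are invented, and the class encoding (0 = legitimate, 1 = phishing) is an assumption.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Majority vote over the class labels predicted by each model.
    Ties resolve to the label encountered first."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(predicted_probas):
    """Average the per-class probabilities across models,
    then return the index of the most probable class."""
    n_models = len(predicted_probas)
    n_classes = len(predicted_probas[0])
    mean = [sum(p[c] for p in predicted_probas) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

# Hypothetical [P(legitimate), P(phishing)] outputs of three models for one URL.
probas = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]
# Each model's own label is its most probable class.
labels = [max(range(2), key=p.__getitem__) for p in probas]

print("hard vote:", hard_vote(labels))
print("soft vote:", soft_vote(probas))
```

In this example the two rules disagree: two weakly confident models outvote one highly confident model under hard voting, while soft voting lets the confident model dominate the average.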
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets presented in this repository were obtained by applying the logistic regression (LogisticRegression) algorithm with the following hyperparameter settings and the different inputs described in the journal paper "Data mining techniques for endometriosis detection in a data-scarce medical dataset".
Hyperparameters
C: 1
Solver: liblinear
max_iter: 1000000
Files
result_eb.csv: Results for EB sample type.
result_ef.csv: Results for EF sample type.
result_vagina.csv: Results for vagina sample type.
result_oral.csv: Results for oral sample type.
result_feces.csv: Results for feces sample type.
result_frt.csv: Results for FRT (EB + EF + vagina) sample type.
result_frt2.csv: Results for FRT2 (EB + vagina) sample type.
properties_eb.log: Arguments and result information for EB sample type (C, n_split, solver, max_iter, random_state, len_scores_before_filtering, len_scores_after_filtering, len_f1).
properties_ef.log: Arguments and result information for EF sample type.
properties_vagina.log: Arguments and result information for vagina sample type.
properties_oral.log: Arguments and result information for oral sample type.
properties_feces.log: Arguments and result information for feces sample type.
properties_frt.log: Arguments and result information for FRT sample type.
properties_frt2.log: Arguments and result information for FRT2 sample type.
R code for conducting analyses described in Johnson, N.S., W.D. Swink, and T.O. Brenden. Field study suggests that sex determination in sea lamprey is directly influenced by larval growth rate. Proceedings of the Royal Society B.
data.csv is the raw data for fitting the Bayesian hierarchical logistic regression model
Read me_Metadata... is the metadata describing the variables in the data.csv file
Rscript.R is the R script for fitting the Bayesian hierarchical logistic regression model
https://creativecommons.org/publicdomain/zero/1.0/
The Transportation and Logistics Tracking Dataset comprises multiple datasets related to various aspects of transportation and logistics operations. It includes information on on-time delivery impact, routes by rating, customer ratings, delivery times with and without congestion, weather conditions, and differences between fixed and main delivery times across different regions.
On-Time Delivery Impact: This dataset provides insights into the impact of on-time delivery, categorizing deliveries based on their impact and counting the occurrences for each category.
Routes by Rating: Here, the dataset illustrates the relationship between routes and their corresponding ratings, offering a visual representation of route performance across different rating categories.
Customer Ratings and On-Time Delivery: This dataset explores the relationship between customer ratings and on-time delivery, presenting a comparison of delivery counts based on customer ratings and on-time delivery status.
Delivery Time with and Without Congestion: It contains information on delivery times in various cities, both with and without congestion, allowing for an analysis of how congestion affects delivery efficiency.
Weather Conditions: This dataset provides a summary of weather conditions, including counts for different weather conditions such as partly cloudy, patchy light rain with thunder, and sunny.
Difference between Fixed and Main Delivery Times: Lastly, the dataset highlights the differences between fixed and main delivery times across different regions, shedding light on regional variations in delivery schedules.
Overall, this dataset offers valuable insights into the transportation and logistics domain, enabling analysis and decision-making to optimize delivery processes and enhance customer satisfaction.
Dataset Description:
The "Exam Scores Dataset" is a synthetic dataset generated using Python. It contains records of exam scores for two exams, labeled as "Exam Score1" and "Exam Score2". Each row in the dataset represents a student's performance in these exams. Additionally, there is a binary indicator labeled "Pass", denoting whether the student has passed both exams.
The dataset is designed for educational purposes, providing a practical tool for learners to engage in hands-on exercises and gain a deeper understanding of binary classification concepts, particularly in the context of educational assessment and student performance evaluation. Feel free to utilize the dataset for practice and experimentation to enhance your proficiency in machine learning algorithms and techniques.
Example Task: Predicting Exam 3 Pass Status
The task involves using logistic regression to predict whether a student can pass exam 3 based on their scores in exams 1 and 2. Given a student's scores of 77 in Exam Score1 and 58 in Exam Score2, the logistic regression model will be trained using the generated dataset to predict whether the student can pass exam 3.
This task serves as an introductory exercise to binary classification using logistic regression, demonstrating how machine learning models can be applied to predict binary outcomes based on input features.
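The example task can be sketched without any external library by fitting a logistic regression with batch gradient descent. Everything below is an assumption made for illustration: the synthetic scores, the labeling rule (pass when the two scores sum to at least 130), the learning rate, and the epoch count are invented and are not the dataset's actual generator. In practice a library such as scikit-learn would typically be used instead.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def standardize(X):
    """Rescale each column to zero mean and unit variance."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]
    return scaled, means, stds

def predict_proba(x, w, b):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical synthetic data (assumption): uniform scores, pass if the sum >= 130.
rng = random.Random(0)
scores = [(rng.uniform(30, 100), rng.uniform(30, 100)) for _ in range(200)]
passed = [1 if s1 + s2 >= 130 else 0 for s1, s2 in scores]

X, means, stds = standardize(scores)
w, b = fit(X, passed)

# The student from the task description: 77 in Exam Score1, 58 in Exam Score2.
student = [(77 - means[0]) / stds[0], (58 - means[1]) / stds[1]]
p = predict_proba(student, w, b)
train_acc = sum((predict_proba(x, w, b) >= 0.5) == bool(t) for x, t in zip(X, passed)) / len(X)
print(f"P(pass) for (77, 58) = {p:.2f}, training accuracy = {train_acc:.2f}")
```

Standardizing the scores before fitting keeps the gradient steps well scaled. The same means and standard deviations must then be applied to any new student before predicting.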
https://www.gnu.org/licenses/gpl-3.0-standalone.html
Replication Package for "A Study on the Pythonic Functional Constructs' Understandability" to appear at ICSE 2024
Authors: Cyrine Zid, Fiorella Zampetti, Giuliano Antoniol, Massimiliano Di Penta
Article Preprint: https://mdipenta.github.io/files/ICSE24_funcExperiment.pdf
Artifacts: https://doi.org/10.5281/zenodo.8191782
License: GPL V3.0
This package contains folders and files with code and data used in the study described in the paper. In the following, we first provide all fields required for the submission, and then report a detailed description of all repository folders.
Artifact Description
Purpose
The artifact is about a controlled experiment aimed at investigating the extent to which Pythonic functional constructs have an impact on source code understandability. The artifact archive contains:
The material to allow replicating the study (see Section Experimental-Material)
Raw quantitative results, working datasets, and scripts to replicate the statistical analyses reported in the paper. Specifically, the executable part of the replication package reproduces figures and tables of the quantitative analysis (RQ1 and RQ2) of the paper starting from the working datasets.
Spreadsheets used for the qualitative analysis (RQ3).
We apply for the following badges:
Available and reusable: because we provide all the material that can be used to replicate the experiment, but also to perform the statistical analyses and the qualitative analyses (spreadsheets, in this case)
Provenance
Paper preprint link: https://mdipenta.github.io/files/ICSE24_funcExperiment.pdf
Artifacts: https://doi.org/10.5281/zenodo.8191782
Data
Results have been obtained by conducting the controlled experiment involving Prolific workers as participants. Data collection and processing followed a protocol approved by the University ethical board. Note that all data enclosed in the artifact is completely anonymized and does not contain sensitive information.
Further details about the provided dataset can be found in the Section Results' directory and files
Setup and Usage (for executable artifacts):
See the Section Scripts to reproduce the results, and instructions for running them
Experiment-Material/
Contains the material used for the experiment, and, specifically, the following subdirectories:
Google-Forms/
Contains (as PDF documents) the questionnaires submitted to the ten experimental groups.
Task-Sources/
Contains, for each experimental group (G-1...G-10), the sources used to produce the Google Forms, and, specifically:
- The cover letter (Letter.docx).
- A directory for each experimental task (Lambda 1, Lambda 2, Comp 1, Comp 2, MRF 1, MRF 2, Lambda Comparison, Comp Comparison, MRF Comparison). Each directory contains the exercise text (in both Word and .txt format), the source code snippet, and its .png image to be used in the form. Note: the "Comparison" tasks do not have any exercise as the purpose is always the same, i.e., to compare the (perceived) understandability of the snippets and return the results of the comparison.
Code-Examples-Table1/
Contains the source code snippets used as objects of the study (the same you can find under "Task-Sources/"), named as reported in Table 1.
Results' directory and files
raw-responses/
Contains, as spreadsheets, the raw responses provided by the study participants through Google forms.
raw-results-RQ1/
Contains the raw results for RQ1. Specifically, the directory contains a subdirectory for each group (G1-G10). Each subdirectory contains:
- For each user (named using their Prolific IDs), a directory containing, for each question (Q1-Q6), the produced Python code (Qn.py), its output (QnR.txt), and its StdErr output (QnErr.txt).
- "expected-outputs/": A directory containing the expected outputs for each task (Qn.txt).
working-results/RQ1-RQ2-files-for-statistical-analysis/
Contains three .csv files used as input for conducting the statistical analysis and drawing the graphs for addressing the first two research questions of the study. Specifically:
ConstructUsage.csv contains the declared frequency usage of the three functional constructs object of the study. This file is used to draw Figure 4. The file contains an entry for each participant, reporting the (text-coded) frequency of construct usage for Comprehension, Lambda, and MRF.
RQ1.csv contains the collected data used for the mixed-effect logistic regression relating the use of functional constructs with the correctness of the change task, as well as the logistic regression relating the use of map/reduce/filter functions with the correctness of the change task. The csv file contains an entry for each answer provided by each subject, and features the following columns:
Group: experimental group to which the participant is assigned
User: user ID
Time: task time in seconds
Approvals: number of approvals on previous tasks performed on Prolific
Student: whether the participant declared themselves as a student
Section: section of the questionnaire (lambda, comp, or mrf)
Construct: specific construct being presented (same as "Section" for lambda and comp, for mrf it says whether it is a map, reduce, or filter)
Question: question ID, from Q1 to Q6, indicating the ordering of the question
MainFactor: main factor treatment for the given question - "f" for functional, "p" for procedural counterpart
Outcome: TRUE if the task was correctly performed, FALSE otherwise
Complexity: cyclomatic complexity of the construct (empty for mrf)
UsageFrequency: usage frequency of the given construct
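As a quick sanity check before any modelling, the columns above are enough to tally the share of correct answers per treatment. The sketch below is only a descriptive summary, not the mixed-effect logistic regression used in the study. The sample rows, the comma delimiter, and the helper name `correctness_by_factor` are assumptions; the values are made up.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the column layout described above (values are made up).
SAMPLE = """Group,User,Time,Approvals,Student,Section,Construct,Question,MainFactor,Outcome,Complexity,UsageFrequency
G1,u1,120,50,TRUE,lambda,lambda,Q1,f,TRUE,2,Often
G1,u1,95,50,TRUE,comp,comp,Q2,p,FALSE,3,Rarely
G2,u2,200,10,FALSE,mrf,map,Q1,f,FALSE,,Never
G2,u2,80,10,FALSE,lambda,lambda,Q2,p,TRUE,1,Often
"""

def correctness_by_factor(csv_file):
    """Return {MainFactor: share of rows whose Outcome is TRUE}."""
    counts = defaultdict(lambda: [0, 0])  # factor -> [correct, total]
    for row in csv.DictReader(csv_file):
        tally = counts[row["MainFactor"]]
        tally[0] += row["Outcome"].strip().upper() == "TRUE"
        tally[1] += 1
    return {factor: correct / total for factor, (correct, total) in counts.items()}

rates = correctness_by_factor(io.StringIO(SAMPLE))

# With the real file (assuming a comma-separated layout):
# with open("RQ1.csv", newline="") as fh:
#     rates = correctness_by_factor(fh)
```

Here "f" and "p" are the functional and procedural treatments recorded in MainFactor.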
RQ1Paired-RQ2.csv contains the collected data used for the ordinal logistic regression of the relationship between the perceived ease of understanding of the functional constructs and (i) participants' usage frequency, and (ii) constructs' complexity (except for map/reduce/filter). The file features a row for each participant, and the columns are the following:
Group: experimental group to which the participant is assigned
User: user ID
Time: task time in seconds
Approvals: number of approvals on previous tasks performed on Prolific
Student: whether the participant declared themselves as a student
LambdaF: result for the change task related to a lambda construct
LambdaP: result for the change task related to the procedural counterpart of a lambda construct
CompF: result for the change task related to a comprehension construct
CompP: result for the change task related to the procedural counterpart of a comprehension construct
MrfF: result for the change task related to an MRF construct
MrfP: result for the change task related to the procedural counterpart of an MRF construct
LambdaComp: perceived understandability level for the comparison task (RQ2) between a lambda and its procedural counterpart
CompComp: perceived understandability level for the comparison task (RQ2) between a comprehension and its procedural counterpart
MrfComp: perceived understandability level for the comparison task (RQ2) between an MRF and its procedural counterpart
LambdaCompCplx: cyclomatic complexity of the lambda construct involved in the comparison task (RQ2)
CompCompCplx: cyclomatic complexity of the comprehension construct involved in the comparison task (RQ2)
MrfCompType: type of MRF construct (map, reduce, or filter) used in the comparison task (RQ2)
LambdaUsageFrequency: self-declared usage frequency of lambda constructs
CompUsageFrequency: self-declared usage frequency of comprehension constructs
MrfUsageFrequency: self-declared usage frequency of MRF constructs
LambdaComparisonAssessment: outcome of the manual assessment of the answer to the "check question" required for the lambda comparison ("yes" means valid, "no" means wrong, "moderatechatgpt" and "extremechatgpt" are the results of GPTZero)
CompComparisonAssessment: as above, but for comprehension
MrfComparisonAssessment: as above, but for MRF
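The assessment columns make it possible to restrict an analysis to answers judged valid. The sketch below keeps only rows whose LambdaComparisonAssessment is "yes" and cross-tabulates LambdaUsageFrequency against LambdaComp. It is an illustration under assumptions: the sample uses only a subset of the columns, the coded values for LambdaComp and the usage frequencies are made up, and whether the authors excluded every non-"yes" answer is not stated here.

```python
import csv
import io
from collections import Counter

# Hypothetical rows using a subset of the columns above (values are made up).
SAMPLE = """User,LambdaComp,LambdaUsageFrequency,LambdaComparisonAssessment
u1,4,Often,yes
u2,2,Rarely,no
u3,5,Often,yes
u4,3,Never,moderatechatgpt
u5,4,Rarely,yes
"""

def valid_lambda_ratings(csv_file):
    """Count (usage frequency, perceived understandability) pairs
    over rows whose lambda check question was assessed as "yes"."""
    table = Counter()
    for row in csv.DictReader(csv_file):
        if row["LambdaComparisonAssessment"].strip().lower() != "yes":
            continue  # drop "no" and the GPTZero-flagged answers
        table[(row["LambdaUsageFrequency"], row["LambdaComp"])] += 1
    return table

table = valid_lambda_ratings(io.StringIO(SAMPLE))
```

The same pattern applies to the comprehension and MRF columns by swapping the three column names.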
working-results/inter-rater-RQ3-files/
This directory contains four .csv files used as input for computing the inter-rater agreement for the manual labeling used for addressing RQ3. Specifically, you will find one file for each functional construct, i.e., comprehension.csv, lambda.csv, and mrf.csv, and a different file used for highlighting the reasons why participants prefer to use the procedural paradigm, i.e., procedural.csv.
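The description does not say which agreement statistic was computed from these files. As one common choice for two raters and nominal labels, the sketch below computes Cohen's kappa in plain Python. The choice of Cohen's kappa, the function name `cohen_kappa`, and the example labels are assumptions, and the column layout of the four .csv files is not assumed at all: the function takes two equal-length label lists.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters assigning one nominal label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two raters for four answers.
rater_1 = ["readability", "conciseness", "readability", "habit"]
rater_2 = ["readability", "conciseness", "habit", "habit"]
kappa = cohen_kappa(rater_1, rater_2)
```

If more than two raters were involved, a statistic such as Fleiss' kappa or Krippendorff's alpha would be needed instead.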
working-results/RQ2ManualValidation.csv
This file contains the results of the manual validation performed to sanitize the participants' answers used for addressing RQ2. Specifically, we coded the behaviour description using four different levels: (i) correct ("yes"), (ii) somewhat correct ("partial"), (iii) wrong ("no"), and (iv) automatically generated. The file features a row for each participant, and the columns are the following:
ID: ID we used to refer to the participant in the paper's qualitative analysis
Group: experimental group to which the participant is assigned
ProlificID: user ID
Comparison for lambda construct description: answer provided by the user for the lambda comparison task
Final Classification: our assessment of the lambda comparison answer
Comparison for comprehension description: answer provided by the user for the comprehension comparison task
Final Classification: our assessment of the comprehension comparison answer
Comparison for MRF description: answer provided by the user for the MRF comparison task
Final Classification: our assessment of the MRF comparison answer
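Because "Final Classification" appears three times in the column list, reading the file by column name can silently keep only the last of the three (this is how Python's `csv.DictReader` treats duplicate headers). The sketch below reads by position instead. It assumes the header matches the list above exactly and in that order; the sample row and the helper name `read_validation` are hypothetical.

```python
import csv
import io

# Hypothetical file content following the column order listed above.
SAMPLE = (
    "ID,Group,ProlificID,"
    "Comparison for lambda construct description,Final Classification,"
    "Comparison for comprehension description,Final Classification,"
    "Comparison for MRF description,Final Classification\n"
    'P1,G1,abc,"Lambda is shorter",yes,"Loop is clearer",partial,"Same to me",no\n'
)

CONSTRUCTS = ("lambda", "comprehension", "mrf")

def read_validation(csv_file):
    """Return one record per participant, pairing each answer with the
    classification column that immediately follows it."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header, whose names are not unique
    records = []
    for fields in reader:
        record = {"ID": fields[0], "Group": fields[1], "ProlificID": fields[2]}
        for i, construct in enumerate(CONSTRUCTS):
            record[construct] = {
                "answer": fields[3 + 2 * i],
                "classification": fields[4 + 2 * i],
            }
        records.append(record)
    return records

records = read_validation(io.StringIO(SAMPLE))
```

Reading positionally keeps all three classifications distinct without renaming columns in the original file.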
working-results/RQ3ManualValidation.xlsx
This file contains the results of the open coding applied to address our third research question. Specifically, you will find four sheets, one for each functional construct and one for the procedural paradigm. Each sheet reports the provided answers together with the categories assigned to them. Each sheet contains the following columns:
ID: ID we used to refer to the participant in the paper's qualitative
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was created by Ananya Nayan
Released under Database: Open Database, Contents: © Original Authors