100+ datasets found

Logistic Regression
kaggle.com
zip
Updated Dec 24, 2017
Cite
Ananya Nayan (2017). Logistic Regression [Dataset]. https://www.kaggle.com/datasets/dragonheir/logistic-regression
Explore at:
Available download formats: zip (3349 bytes)
Dataset updated
Dec 24, 2017
Authors
Ananya Nayan
License
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Dataset

This dataset was created by Ananya Nayan

Released under Database: Open Database, Contents: © Original Authors

Contents
Titaanic Dataset for logistic regression
kaggle.com
zip
Updated Mar 21, 2023
Cite
Mohamed Khaled (2023). Titaanic Dataset for logistic regression [Dataset]. https://www.kaggle.com/datasets/moro146/titaanic-dataset-for-logistic-regression
Explore at:
Available download formats: zip (4965 bytes)
Dataset updated
Mar 21, 2023
Authors
Mohamed Khaled
Description
Dataset

This dataset was created by Mohamed Khaled

Contents
Startup - Multiple Linear Regression
kaggle.com
zip
Updated Jan 29, 2018
Cite
karthickveerakumar (2018). Startup - Multiple Linear Regression [Dataset]. https://www.kaggle.com/datasets/karthickveerakumar/startup-logistic-regression
Explore at:
Available download formats: zip (1330 bytes)
Dataset updated
Jan 29, 2018
Authors
karthickveerakumar
Description
Dataset

This dataset was created by karthickveerakumar

Contents
Marketing Linear Multiple Regression
kaggle.com
zip
Updated Apr 24, 2020
Cite
FayeJavad (2020). Marketing Linear Multiple Regression [Dataset]. https://www.kaggle.com/datasets/fayejavad/marketing-linear-multiple-regression
Explore at:
Available download formats: zip (1907 bytes)
Dataset updated
Apr 24, 2020
Authors
FayeJavad
Description
Dataset

This dataset was created by FayeJavad

Contents
Datasets used to train and test prediction model to predict scores in terms...
data.mendeley.com
Updated Mar 5, 2025
Cite
Jarosław Wątróbski (2025). Datasets used to train and test prediction model to predict scores in terms of SDG 7 realization [Dataset]. http://doi.org/10.17632/6c8fm7s4y2.1
Explore at:
Unique identifier
https://doi.org/10.17632/6c8fm7s4y2.1
Dataset updated
Mar 5, 2025
Authors
Jarosław Wątróbski
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The datasets used in this research work relate to the aims of Sustainable Development Goal 7 (SDG 7). They were used to train and test a machine learning model based on an artificial neural network, along with other machine learning regression models, to predict scores for the realization of SDG 7 aims. The training dataset was created from data for 2013 to 2021 and includes 261 samples. The test dataset includes 29 samples. Source data from 2013 to 2022 are available in 10 XLSX and CSV files. The training and test datasets are available as XLSX and CSV files. A detailed description of the data is available in a PDF file.
Prediction of Personality Traits using the Big 5 Framework
zenodo.org
csv, text/x-python
Updated Feb 2, 2023
Cite
Neelima Brahmbhatt; Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
Explore at:
Available download formats: text/x-python, csv
Unique identifier
https://doi.org/10.5281/zenodo.7596072
Dataset updated
Feb 2, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Neelima Brahmbhatt; Neelima Brahmbhatt
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The methodology is the core component of any research work; it describes the methods used to obtain the results. Here, the whole implementation is done in Python. The steps involved are as follows:

1. Acquire Personality Dataset

Kaggle hosts a collection of datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP. The dataset can be downloaded as a zip file from the link provided. It consists of two CSV files (test.csv and train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output. The dataset has multivariate characteristics. Data preprocessing is done to check for inconsistent behaviors or trends.

2. Data preprocessing

After data acquisition, the next step is to clean and preprocess the data. The dataset has numerical features. The target value is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is then split into training and testing datasets by passing the feature values, target values, and test size to the train-test split method of the scikit-learn package. After the split, the training data is used to train the Logistic Regression and SVM models, and the test data is then used to measure the accuracy of the trained models.

3. Feature Extraction

The following items were presented on one page and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I get irritated easily.
EST3 I worry about things.
EST4 I change my mood a lot.
AGR1 I have a soft heart.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I am not really interested in others.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I follow a schedule.
CSN4 I make a mess of things.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I do not have a good imagination.
OPN4 I use difficult words.

4. Training the Model

Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets: a training set (80% of the data) and a testing set (20%). You train the model using the training set. In this work, the models were trained using linear_model.LogisticRegression() and svm.SVC() from the sklearn package.

5. Personality Prediction Output

After training, the Logistic Regression and SVM models are evaluated using cohen_kappa_score and the accuracy score.
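The two metrics named in step 5 can be worked through by hand. The sketch below is a from-scratch illustration in plain Python, not the dataset author's code (which, per the description, relies on scikit-learn). The function names and the label lists are made up for the example.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    p_expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels, for illustration only.
y_true = ["serious", "lively", "lively", "responsible", "serious", "lively"]
y_pred = ["serious", "lively", "serious", "responsible", "serious", "lively"]

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

Kappa is lower than accuracy here because part of the agreement would be expected by chance from the label frequencies alone.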
2. All Data: Using landslide-inventory mapping for a combined bagged-trees...
geolsoc.figshare.com
txt
Updated Apr 23, 2021
Cite
Matthew M. Crawford; Jason M. Dortch; Hudson J. Koch; Ashton A. Killen; Junfeng Zhu; Yichaun Zhu; Lindsey S. Bryson; William C. Haneberg (2021). 2. All Data: Using landslide-inventory mapping for a combined bagged-trees and logistic-regression approach to determining landslide susceptibility in eastern Kentucky, United States [Dataset]. http://doi.org/10.6084/m9.figshare.14473487.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.14473487.v1
Dataset updated
Apr 23, 2021
Dataset provided by
Geological Society of London (http://www.geolsoc.org.uk/)
Authors
Matthew M. Crawford; Jason M. Dortch; Hudson J. Koch; Ashton A. Killen; Junfeng Zhu; Yichaun Zhu; Lindsey S. Bryson; William C. Haneberg
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United States, Kentucky
Description
The statistical data used in the combined machine-learning functions are available here as csv files. This includes all geomorphic data at the determined radial buffer size (36 variables), the results of the bagged trees function (12 variables), and the bagged trees resulting feature importance data.
Replication Data for: Site C Logistic Regression model
borealisdata.ca
Updated Nov 18, 2025
Cite
Eric Taylor (2025). Replication Data for: Site C Logistic Regression model [Dataset]. http://doi.org/10.5683/SP3/MA1ATA
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.5683/SP3/MA1ATA
Dataset updated
Nov 18, 2025
Dataset provided by
Borealis
Authors
Eric Taylor
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Site C dam, Ft St John, Peace River, Canada, BC
Description
The data are genetic assignments to upstream or downstream of Site C dam (bull trout, Arctic grayling, and rainbow trout). Columns are defined in the csv file. A file of R code to run the analysis is also included.
Data for: Identification of hindered internal rotational mode for complex...
narcis.nl
data.mendeley.com
Updated Nov 8, 2017
Cite
Le, T (via Mendeley Data) (2017). Data for: Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model [Dataset]. http://doi.org/10.17632/d37mzs3b3m.2
Explore at:
Unique identifier
https://doi.org/10.17632/d37mzs3b3m.2
Dataset updated
Nov 8, 2017
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Le, T (via Mendeley Data)
Description
The "Dataset_HIR" folder contains the data to reproduce the results of the data mining approach proposed in the manuscript titled "Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model".

More specifically, the folder contains the raw electronic structure calculation input data provided by the domain experts as well as the training and testing dataset with the extracted features.

The "Dataset_HIR" folder contains the following subfolders:

1. Electronic structure calculation input data: contains the electronic structure calculation input generated by the Gaussian program.

1.1. Training data: contains the raw data of all training species (each stored in a separate folder) used for extracting the dataset for the training and validation phase.

1.2. Testing data: contains the raw data of all testing species (each stored in a separate folder) used for extracting data for the testing phase.

2. Dataset

2.1. Training dataset: used to produce the results in Tables 3 and 4 in the manuscript.

+ datasetTrain_raw.csv: contains the features for all vibrational modes associated with the corresponding labeled species, so that chemists can easily select the Hindered Internal Rotor from the list for the training and validation steps.
+ datasetTrain.csv: refines datasetTrain_raw.csv by removing the names of the species, which puts the dataset into an appropriate form for the modeling and validation steps.

2.2. Testing dataset: used to produce the results of the data mining approach in Table 5 in the manuscript.

+ datasetTest_raw.csv: contains the features for all vibrational modes of each labeled species, so that chemists can select the Hindered Internal Rotor from the list for the testing step.
+ datasetTest.csv: refines datasetTest_raw.csv by removing the names of the species, which puts the dataset into an appropriate form for the testing step.

Note for the Result feature in the dataset: 1 is for the mode needed to be treated as Hindered Internal Rotor, and 0 otherwise.
marital status using logistic regression
kaggle.com
Updated Aug 27, 2020
Cite
Study Mart (2020). marital status using logistic regression [Dataset]. https://www.kaggle.com/datasets/studymart/marital-status-using-logistic-regression
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Aug 27, 2020
Dataset provided by
Kaggle
Authors
Study Mart
Description
Dataset

This dataset was created by Study Mart

Contents
One Classifier Ignores a Feature
data.niaid.nih.gov
Updated Apr 29, 2022
Cite
Maier, Karl (2022). One Classifier Ignores a Feature [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6502642
Explore at:
Dataset updated
Apr 29, 2022
Authors
Maier, Karl
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data sets are used in a controlled experiment, where two classifiers should be compared. train_a.csv and explain.csv are slices from the original data set. train_b.csv contains the same instances as in train_a.csv, but with feature x1 set to 0 to make it unusable to classifier B.

The original data set was created and split using this Python code:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, class_sep=0.75, random_state=0)
X *= 100

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
lm = LogisticRegression()
lm.fit(X_train, y_train)
clf_a = lm

clf_b = LogisticRegression()
X2 = X.copy()
X2[:, 0] = 0
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
clf_b.fit(X2_train, y2_train)

X_explain = X_test
y_explain = y_test
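Why zeroing x1 makes the feature unusable to classifier B can be seen in a from-scratch sketch. This is an illustration only, not the dataset author's code: it uses plain, unregularized batch gradient descent instead of scikit-learn's LogisticRegression, and the toy data, learning rate, and function name are assumptions. Every gradient contribution for a weight is multiplied by that feature's value, so a feature that is always 0 never moves its weight away from its starting value of 0.

```python
import math
import random


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent for logistic regression, weights initialised to 0."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj  # zero whenever the feature value is zero
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Hypothetical two-feature data in which both features carry signal.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
y = [1 if x0 + x1 > 0 else 0 for x0, x1 in X]

w_a, b_a = fit_logistic(X, y)        # classifier A sees both features
X_b = [[0.0, x1] for _, x1 in X]     # classifier B: first feature set to 0
w_b, b_b = fit_logistic(X_b, y)

print("A weights:", w_a)
print("B weights:", w_b)
```

Classifier B ends up with a weight of exactly 0 on the zeroed feature and has to rely on the remaining one.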
Datasets and results of the paper titled "Are citation networks relevant to...
nde-dev.biothings.io
data.niaid.nih.gov
Updated Jul 17, 2024
Cite
Elvira Pelle (2024). Datasets and results of the paper titled "Are citation networks relevant to explain academic promotions? An empirical analysis of the Italian national scientific qualification" [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_5118644
Explore at:
Dataset updated
Jul 17, 2024
Dataset provided by
Andrea Sciandra
Elvira Pelle
Francesco Poggi
Maria Cristiana Martini
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These are the input datasets and the results of the analyses reported on the paper titled "Are citation networks relevant to explain academic promotions? An empirical analysis of the Italian national scientific qualification".

Abstract:

The aim of this paper is to study the role of citation network measures in the assessment of scientific maturity. Referring to the case of the Italian national scientific qualification (ASN), we investigate if there is a relationship between citation network indices and the results of the researchers’ evaluation procedures. In particular, we want to understand if network measures can enhance the prediction accuracy of the results of the evaluation procedures beyond basic performance indices. Moreover, we want to highlight which citation network indices prove to be more relevant in explaining the ASN results, and if quantitative indices used in the citation-based disciplines assessment can replace the citation network measures in non-citation-based disciplines. Data concerning Statistics and Computer Science disciplines are collected from different sources (ASN, Italian Ministry of University and Research, and Scopus) and processed in order to calculate the citation-based measures used in this study. Following, we apply classification models to estimate the effects of network variables. We find that network measures are strongly related to the results of the ASN and significantly improve the explanatory power of the models, especially for the research fields of Statistics. Additionally, citation networks in the specific sub-disciplines are far more relevant than those in the general disciplines. Finally, results show that the citation network measures are not a substitute of the citation-based bibliometric indices.

Code

The code to collect and process the data used in this paper is available on GitHub at https://github.com/DigitalDataLab/ASN16-18_CitationNetwork.

Dataset description

The files AdjacencyMatrix_01B1.csv, AdjacencyMatrix_09H1.csv, AdjacencyMatrix_13D1.csv, AdjacencyMatrix_13D2.csv and AdjacencyMatrix_13D3.csv are the citation matrices for Italian academics (i.e. ASN candidates and permanent positions in the Italian academic system) in the Recruitment Fields (RFs) 01/B1, 09/H1, 13/D1, 13/D2 and 13/D3, respectively.

The files AdjacencyMatrix_CS.csv and AdjacencyMatrix_ST.csv are the citation matrices for the Italian academics in the Computer Science disciplines (i.e. RFs 01/B1 and 09/H1) and the Statistical disciplines (i.e. RFs 13/D1, 13/D2 and 13/D3), respectively.

The files CS_01B1_1.csv, CS_09H1_1.csv, ST_13D1_1.csv, ST_13D2_1.csv and ST_13D3_1.csv contain the data used to build the logistic regression models presented in the paper for the Italian academics at the Full Professor (FP) level.

The files CS_01B1_2.csv, CS_09H1_2.csv, ST_13D1_2.csv, ST_13D2_2.csv and ST_13D3_2.csv contain the data used to build the logistic regression models presented in the paper for the Italian academics at the Associate Professor (AP) level.

The file Codebook.pdf is the codebook of the previous ten files.

The file Appendix.pdf contains the final results of the stepwise logistic regressions computed for each level (i.e. Full Professor and Associate Professor) and Recruitment Field in the Computer Science and Statistics disciplines.

The file NormalityAssessment.pdf contains the normality assessment of citation network indices.
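As a rough illustration of how a network measure can be read off a citation matrix such as the AdjacencyMatrix_*.csv files, the sketch below computes in-degree and out-degree from a small matrix. It assumes entry [i][j] counts citations from academic i to academic j; the actual file layout, and the specific network indices used in the paper, are not stated here, so the toy matrix and function name are hypothetical.

```python
def citation_degrees(matrix):
    """Return (in_degree, out_degree) lists for a square citation matrix.

    Assumes matrix[i][j] is the number of citations from node i to node j.
    """
    out_degree = [sum(row) for row in matrix]
    in_degree = [sum(col) for col in zip(*matrix)]
    return in_degree, out_degree


# Hypothetical 3-author citation matrix, for illustration only.
toy = [
    [0, 2, 1],
    [0, 0, 3],
    [1, 0, 0],
]

in_deg, out_deg = citation_degrees(toy)
print("citations received:", in_deg)
print("citations given:", out_deg)
```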
[A Procedure for Multilevel Logistic Modeling] Appendix, Datasets, and...
figshare.com
pdf
Updated May 29, 2024
Cite
Nicolas Sommet (2024). [A Procedure for Multilevel Logistic Modeling] Appendix, Datasets, and Syntax Files [Dataset]. http://doi.org/10.6084/m9.figshare.5350786.v6
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.5350786.v6
Dataset updated
May 29, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nicolas Sommet
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
- For each software, a series of sub-appendices describes the way to handle each stage of the procedure;
- For each software, a .zip file contains the dataset in .dta (for Stata), .rdata (for R), .dat (for Mplus), and .sav (for SPSS), as well as the syntax file(s) in .do (for Stata), .R (for R), .inp (for Mplus), and .sps (for SPSS)
- The dataset is also provided in .csv format.

If you notice a mistake in the Stata or SPSS-related Appendices and/or syntax files, please report it to Nicolas Sommet (nicolas.sommet@unil.ch). If you notice a mistake in the R or Mplus-related Appendices and/or syntax files, please report it to Davide Morselli (davide.morselli@unil.ch).

Sommet, N. and Morselli, D. (2017). Keep Calm and Learn Multilevel Logistic Modeling: A Simplified Three-Step Procedure Using Stata, R, Mplus, and SPSS. International Review of Social Psychology, 30, 203–218, DOI: https://doi.org/10.5334/irsp.90
Data from: Macaques preferentially attend to intermediately surprising...
data.niaid.nih.gov
datadryad.org
zip
Updated Apr 26, 2022
Cite
Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd (2022). Macaques preferentially attend to intermediately surprising information [Dataset]. http://doi.org/10.6078/D15Q7Q
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6078/D15Q7Q
Dataset updated
Apr 26, 2022
Dataset provided by
Klaviyo
Yale University
University of Minnesota
University of California, Berkeley
Authors
Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd
License
https://spdx.org/licenses/CC0-1.0.html
Description
Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning.

Methods

How the data were collected: In this project, we collected gaze data from 5 macaques while they watched sequential visual displays designed to elicit probabilistic expectations. Gaze data were collected using the Eyelink Toolbox and sampled at 1000 Hz by an infrared eye-monitoring camera system.

Dataset:

"csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:

subj: subject_ID = {"B":104, "C":102,"H":101,"J":103,"K":203}
trialtime: start time of current trial in seconds
trial: current trial number (each trial featured one of 80 possible visual-event sequences) (in order)
seq current: sequence number (one of 80 sequences)
seq_item: current item number in a seq (in order)
active_item: pop-up item (active box)
pre_active: prior pop-up item (active box) {-1: "the first active object in the sequence/ no active object before the currently active object in the sequence"}
next_active: next pop-up item (active box) {-1: "the last active object in the sequence/ no active object after the currently active object in the sequence"}
firstappear: {0: "not first", 1: "first appear in the seq"}
looks_blank: csv: total amount of time spent looking at blank space for current event (ms); csv_timestamp: {1: "look blank at timestamp", 0: "not look blank at timestamp"}
looks_offscreen: csv: total amount of time spent looking offscreen for current event (ms); csv_timestamp: {1: "look offscreen at timestamp", 0: "not look offscreen at timestamp"}
time till target: time until the subject first starts looking at the target object (ms) {-1: "never look at the target"}
looks target: csv: time spent looking at the target object (ms); csv_timestamp: look at the target or not at current timestamp (1 or 0)
look1,2,3: time spent looking at each object (ms)
location 123X, 123Y: location of each box (the locations of the three boxes for a given sequence were chosen randomly, but remained static throughout the sequence)
item123id: pop-up item ID (remained static throughout a sequence)
event time: total time spent for the whole event (pop-up and go back) (ms)
eyeposX,Y: eye position at current timestamp

"csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:

rt: time till target {-1: "never look at the target"}. In data analysis, we included data that have rt > 0.
already_there: {NA: "never look at the target object"}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and already_there is not NA.
looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"}
prob: the probability of the occurrence of object
surprisal: unigram surprisal value
bisurprisal: transitional surprisal value
std_surprisal: standardized unigram surprisal value
std_bisurprisal: standardized transitional surprisal value
binned_surprisal_means: the means of unigram surprisal values binned to three groups of evenly spaced intervals according to surprisal values.
binned_bisurprisal_means: the means of transitional surprisal values binned to three groups of evenly spaced intervals according to surprisal values.
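The relationship between the prob, surprisal, std_surprisal, and binned_surprisal_means columns described above can be sketched as follows. This is not the code from Monkilock_Data_Processing.ipynb, which is not shown here. It assumes the usual definition of surprisal as the negative log probability, and the log base (2), the use of the population standard deviation, the equal-width binning rule, and the probability values are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def surprisal(prob):
    """Surprisal in bits: -log2(p). The base is an assumption."""
    return -math.log2(prob)


def standardize(values):
    """Z-score the values (population standard deviation assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def binned_means(values, n_bins=3):
    """Mean of the values falling in each of n_bins equal-width intervals.

    Requires at least two distinct values; empty bins yield None.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        bins[idx].append(v)
    return [mean(b) if b else None for b in bins]


# Hypothetical event probabilities, for illustration only.
probs = [0.5, 0.25, 0.125, 0.5, 0.0625]
s = [surprisal(p) for p in probs]
z = standardize(s)
bin_means = binned_means(s)
print(s, bin_means)
```

Rarer events get larger surprisal values, and the three bin means summarize low, intermediate, and high surprisal events.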

"csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc.

"AllSeq.csv" includes event information of all 80 sequences.

Empty Values in Datasets:

There are no missing values in the original dataset "csv-combined.csv". Missing values (marked as NA in the datasets) occur in the columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv".

NAs in the columns "prev_active" and "next_active" mean that the event is the first or the last active object in the sequence, that is, there is no active object before or after the currently active object in the sequence. When we analyzed the variable "already_there", we eliminated rows whose "prev_active" value is NA.

NAs in the column "already_there" mean that the subject never looked at the target object in the current event. When we analyzed the variable "already_there", we eliminated rows whose "already_there" value is NA.

Missing values occur in the columns "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" when the event is the first in the sequence, so its transitional probability cannot be computed because no event precedes it in the sequence. When we fitted models for transitional statistics, we eliminated rows whose "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" values are NA.
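The row-filtering rules described above (rt > 0, and dropping rows with NA in the relevant column) can be sketched with the standard library. The authors' own analysis is in R (Analysis_Code_final.Rmd); this Python version is only an illustration, and the inline sample rows are invented rather than taken from "csv-surprisal-prob.csv".

```python
import csv
import io

# Hypothetical excerpt using three of the documented column names.
SAMPLE = """rt,already_there,bisurprisal
350,FALSE,1.2
-1,NA,0.8
420,TRUE,NA
510,FALSE,2.1
"""


def parse(value):
    """Map the literal string 'NA' to None; leave other values untouched."""
    return None if value == "NA" else value


rows = [
    {key: parse(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(SAMPLE))
]

# Keep events where the subject looked at the target and already_there is known.
looked = [r for r in rows if float(r["rt"]) > 0 and r["already_there"] is not None]

# For transitional statistics, additionally drop rows without a bisurprisal value.
transitional = [r for r in looked if r["bisurprisal"] is not None]

print(len(rows), len(looked), len(transitional))
```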

Codes:

In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. In this notebook we computed the following variables, which are necessary for further analysis, modeling, and visualization (see above for details): active_item, pre_active, next_active, firstappear, looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means.

"Analysis_Code_final.Rmd" is the main script, in which we further processed the data, built models, and created visualizations. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop-up object, and trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data.

"helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd.
Replication Package of "Battling Phish"
figshare.com
csv
Updated Oct 9, 2025
Cite
Anonymous Author (2025). Replication Package of "Battling Phish" [Dataset]. http://doi.org/10.6084/m9.figshare.30324559.v1
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.30324559.v1
Dataset updated
Oct 9, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Anonymous Author
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This replication package contains all datasets and source code used in our study on phishing URL detection. The study investigates the effectiveness of traditional machine learning (ML), deep learning (DL), and large language model (LLM)-based methods using a consistent set of 32 URL-based features.

Directory Structure

├── Datasets/
│   ├── Dataset-1.csv
│   ├── Dataset-2.csv
│   ├── Dataset-3.csv
│   ├── Dataset-4.csv
│   ├── Dataset-5.csv
│   ├── Phishing_Site_URLs_32_Features_Extracted_Data.csv
│   └── Legit_Phish_32_Features_Extracted_Data.csv
│
└── Source_Codes/
    ├── Feature_extraction_source_code.py
    ├── Feature_importance_analysis_source_code.py
    ├── ML/
    │   ├── Seven_ML_Models_trained_on_LP.py
    │   ├── Seven_ML_Models_trained_on_PSU.py
    │   ├── SoftVoting_trained_on_LP.py
    │   ├── SoftVoting_trained_on_PSU.py
    │   ├── HardVoting_trained_on_LP.py
    │   └── HardVoting_trained_on_PSU.py
    │
    ├── DL/
    │   ├── [DLModel1]_trained_on_LP.py
    │   ├── [DLModel1]_trained_on_PSU.py
    │   └── ... (total 16 files for 8 DL algorithms)
    │
    └── LLM/
        ├── BERT_Fine_Tuned_on_LP.py
        ├── BERT_Fine_Tuned_on_PSU.py
        ├── DistilBERT_Fine_Tuned_on_LP.py
        ├── DistilBERT_Fine_Tuned_on_PSU.py
        ├── PhishBERT_Evaluation.py
        └── URLBERT_Evaluation.py

Datasets:
Dataset-1.csv to Dataset-5.csv: Used for feature importance analysis.
Phishing_Site_URLs_32_Features_Extracted_Data.csv (PSU dataset): Includes phishing and legitimate URLs with 32 extracted lexical features.
Legit_Phish_32_Features_Extracted_Data.csv (LP dataset): Another benchmark dataset with the same 32 features, used for comparative evaluation.
Note: PSU and LP datasets are used for both training and evaluating ML, DL, and LLM-based models.

Source Code:
Feature_extraction_source_code.py: Extracts 32 handcrafted lexical features from raw URL data.
Feature_importance_analysis_source_code.py: Performs feature selection using seven statistical and model-based ranking methods.

Machine Learning (ML)
Implements ML classifiers individually trained on LP and PSU datasets: Logistic Regression, Decision Tree, Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and XGBoost. Soft Voting and Hard Voting ensembles are also implemented.
Scripts:
Seven_ML_Models_trained_on_LP.py
Seven_ML_Models_trained_on_PSU.py
SoftVoting_trained_on_LP.py, SoftVoting_trained_on_PSU.py
HardVoting_trained_on_LP.py, HardVoting_trained_on_PSU.py

Deep Learning (DL)
Implements eight deep learning architectures (each trained separately on LP and PSU): Total of 16 scripts, 2 per DL model (1 for LP, 1 for PSU).

Large Language Models (LLMs)
Fine-tuned:
BERT_Fine_Tuned_on_LP.py, BERT_Fine_Tuned_on_PSU.py
DistilBERT_Fine_Tuned_on_LP.py, DistilBERT_Fine_Tuned_on_PSU.py
Pre-trained, zero-shot or direct evaluation:
PhishBERT_Evaluation.py
URLBERT_Evaluation.py
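The difference between the Soft Voting and Hard Voting ensembles mentioned above can be shown with a small sketch. It is not taken from the replication package's scripts; the function names, the 0.5 threshold, the tie rule, and the probabilities are assumptions for illustration. Hard voting counts each model's thresholded class vote, while soft voting averages the predicted probabilities before thresholding.

```python
def hard_vote(probas, threshold=0.5):
    """Majority vote over each model's thresholded prediction (ties go to class 0)."""
    votes = sum(p >= threshold for p in probas)
    return int(votes > len(probas) / 2)


def soft_vote(probas, threshold=0.5):
    """Threshold the mean of the models' predicted probabilities."""
    return int(sum(probas) / len(probas) >= threshold)


# Hypothetical phishing probabilities from three classifiers for one URL.
probas = [0.45, 0.40, 0.95]

hard = hard_vote(probas)
soft = soft_vote(probas)
print("hard vote:", hard, "soft vote:", soft)
```

The two rules disagree on this input: two of three models vote "legitimate", but the one confident model pulls the average above the threshold.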
LogisticRegression scores
figshare.com
txt
Updated Mar 4, 2024
Cite
Pablo Caballero (2024). LogisticRegression scores [Dataset]. http://doi.org/10.6084/m9.figshare.23904903.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.23904903.v1
Dataset updated
Mar 4, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Pablo Caballero
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The datasets presented in this repository are obtained by applying the logistic regression (LogisticRegression) algorithm with the following specific hyperparameter setting and the different inputs that have been described in the journal paper "Data mining techniques for endometriosis detection in a data-scarce medical dataset".

Hyperparameters
C: 1
Solver: liblinear
max_iter: 1000000

Files
result_eb.csv: Results for EB sample type.
result_ef.csv: Results for EF sample type.
result_vagina.csv: Results for vagina sample type.
result_oral.csv: Results for oral sample type.
result_feces.csv: Results for feces sample type.
result_frt.csv: Results for FRT (EB + EF + vagina) sample type.
result_frt2.csv: Results for FRT2 (EB + vagina) sample type.
properties_eb.log: Arguments and result information for EB sample type (C, n_split, solver, max_iter, random_state, len_scores_before_filtering, len_scores_after_filtering, len_f1).
properties_ef.log: Arguments and result information for EF sample type.
properties_vagina.log: Arguments and result information for vagina sample type.
properties_oral.log: Arguments and result information for oral sample type.
properties_feces.log: Arguments and result information for feces sample type.
properties_frt.log: Arguments and result information for FRT sample type.
properties_frt2.log: Arguments and result information for FRT2 sample type.
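Assuming "LogisticRegression" refers to the scikit-learn class of that name (the description does not say so explicitly), the listed hyperparameters would map onto its constructor roughly as sketched below. The toy data is invented for illustration. Any random_state or cross-validation settings recorded in the properties_*.log files are not reproduced here.

```python
from sklearn.linear_model import LogisticRegression

# Hyperparameters as listed in the description: C=1, solver=liblinear, max_iter=1000000.
clf = LogisticRegression(C=1, solver="liblinear", max_iter=1_000_000)

# Hypothetical two-feature toy data, for illustration only.
X = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
y = [0, 1, 0, 1]

clf.fit(X, y)
pred = clf.predict([[0.1, 0.8], [0.8, 0.2]])
print(pred)
```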
R script and dataset for Bayesian hierarchical logistic modeling of percent...
datasetcatalog.nlm.nih.gov
figshare.com
+1 more
Updated Mar 1, 2017
Cite
Brenden, Travis (2017). R script and dataset for Bayesian hierarchical logistic modeling of percent male in sea lamprey populations in lentic and river environments [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001810488
Explore at:
Dataset updated
Mar 1, 2017
Authors
Brenden, Travis
Description
R code for conducting analyses described in Johnson, N.S., W.D. Swink, and T.O. Brenden. Field study suggests that sex determination in sea lamprey is directly influenced by larval growth rate. Proceedings of the Royal Society B.

data.csv is the raw data for fitting the Bayesian hierarchical logistic regression model

Read me_Metadata... is the metadata describing the variables in the data.csv file

Rscript.R is the R script for fitting the Bayesian hierarchical logistic regression model
Transportation and Logistics Tracking Dataset
kaggle.com
zip
Updated May 5, 2024
Share
Cite
Nicole Machado (2024). Transportation and Logistics Tracking Dataset [Dataset]. https://www.kaggle.com/datasets/nicolemachado/transportation-and-logistics-tracking-dataset
Explore at:
Available download formats: zip (3705944 bytes)
Dataset updated
May 5, 2024
Authors
Nicole Machado
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The Transportation and Logistics Tracking Dataset comprises multiple datasets related to various aspects of transportation and logistics operations. It includes information on on-time delivery impact, routes by rating, customer ratings, delivery times with and without congestion, weather conditions, and differences between fixed and main delivery times across different regions.

On-Time Delivery Impact: This dataset provides insights into the impact of on-time delivery, categorizing deliveries based on their impact and counting the occurrences for each category.

Routes by Rating: Here, the dataset illustrates the relationship between routes and their corresponding ratings, offering a visual representation of route performance across different rating categories.

Customer Ratings and On-Time Delivery: This dataset explores the relationship between customer ratings and on-time delivery, presenting a comparison of delivery counts based on customer ratings and on-time delivery status.

Delivery Time with and Without Congestion: It contains information on delivery times in various cities, both with and without congestion, allowing for an analysis of how congestion affects delivery efficiency.

Weather Conditions: This dataset provides a summary of weather conditions, including counts for different weather conditions such as partly cloudy, patchy light rain with thunder, and sunny.

Difference between Fixed and Main Delivery Times: Lastly, the dataset highlights the differences between fixed and main delivery times across different regions, shedding light on regional variations in delivery schedules.

Overall, this dataset offers valuable insights into the transportation and logistics domain, enabling analysis and decision-making to optimize delivery processes and enhance customer satisfaction.
Pass or Not? Students Exam Score Data
kaggle.com
zip
Updated Apr 2, 2024
Share
Cite
Yuanyuan CHEN (2024). Pass or Not? Students Exam Score Data [Dataset]. https://www.kaggle.com/datasets/cchen002/pass-or-not-students-exam-score-data
Explore at:
Available download formats: zip (2675 bytes)
Dataset updated
Apr 2, 2024
Authors
Yuanyuan CHEN
Description
Dataset Description:

The "Exam Scores Dataset" is a synthetic dataset generated using Python. It contains records of exam scores for two exams, labeled as "Exam Score1" and "Exam Score2". Each row in the dataset represents a student's performance in these exams. Additionally, there is a binary indicator labeled "Pass", denoting whether the student has passed both exams.

Exam Score1: Represents the score obtained by the student in the first exam. Scores range from 0 to 100, inclusive, with decimal values.

Exam Score2: Represents the score obtained by the student in the second exam. Scores range from 0 to 100, inclusive, with decimal values.

Pass: A binary indicator denoting the pass status of the student. It takes a value of 1 if the student has passed both exams, and 0 otherwise.

The dataset is designed for educational purposes, providing a practical tool for learners to engage in hands-on exercises and gain a deeper understanding of binary classification concepts, particularly in the context of educational assessment and student performance evaluation. Feel free to utilize the dataset for practice and experimentation to enhance your proficiency in machine learning algorithms and techniques.

Example Task: Predicting Exam 3 Pass Status

The task involves using logistic regression to predict whether a student can pass exam 3 based on their scores in exams 1 and 2. Given a student's scores of 77 in Exam Score1 and 58 in Exam Score2, the logistic regression model will be trained using the generated dataset to predict whether the student can pass exam 3.

This task serves as an introductory exercise to binary classification using logistic regression, demonstrating how machine learning models can be applied to predict binary outcomes based on input features.
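
The example task above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the dataset author's code: the data are regenerated here with a hypothetical rule (uniform scores, Pass = 1 when both scores are at least 50), and the helper names `make_synthetic`, `fit_logistic`, and `predict_proba` are invented for this sketch. With the real file, the rows would be read from the CSV instead.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_synthetic(n=300, seed=0, pass_mark=50.0):
    # Hypothetical stand-in for the Kaggle file: (Exam Score1, Exam Score2, Pass)
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        s1, s2 = rng.uniform(0, 100), rng.uniform(0, 100)
        rows.append((s1, s2, 1 if s1 >= pass_mark and s2 >= pass_mark else 0))
    return rows

def scale(score):
    # Map a 0..100 score to roughly -1..1 so gradient descent is well conditioned
    return (score - 50.0) / 50.0

def fit_logistic(rows, lr=1.0, epochs=2000):
    # Batch gradient descent on the mean log-loss
    w1 = w2 = b = 0.0
    n = len(rows)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for s1, s2, y in rows:
            x1, x2 = scale(s1), scale(s2)
            err = sigmoid(w1 * x1 + w2 * x2 + b) - y
            g1 += err * x1
            g2 += err * x2
            gb += err
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b

def predict_proba(params, s1, s2):
    w1, w2, b = params
    return sigmoid(w1 * scale(s1) + w2 * scale(s2) + b)

rows = make_synthetic()
params = fit_logistic(rows)
p = predict_proba(params, 77, 58)  # the student from the example task
accuracy = sum(
    (predict_proba(params, s1, s2) >= 0.5) == bool(y) for s1, s2, y in rows
) / len(rows)
print(f"P(pass | 77, 58) = {p:.3f}, training accuracy = {accuracy:.2f}")
```

The predicted class is 1 when the probability is at least 0.5. Because the hypothetical pass rule is a conjunction of two thresholds, a linear decision boundary can only approximate it, so some training error is expected.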
Data from: Replication package for the paper: "A Study on the Pythonic...
data.niaid.nih.gov
nde-dev.biothings.io
+1 more
Updated Jan 23, 2024
Share
Cite
Zid, Cyrine; Zampetti, Fiorella; Antoniol, Giuliano; Di Penta, Massimiliano (2024). Replication package for the paper: "A Study on the Pythonic Functional Constructs' Understandability" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8191782
Explore at:
Dataset updated
Jan 23, 2024
Dataset provided by
Polytechnique Montréal
University of Sannio
Authors
Zid, Cyrine; Zampetti, Fiorella; Antoniol, Giuliano; Di Penta, Massimiliano
License
https://www.gnu.org/licenses/gpl-3.0-standalone.html
Description
Replication Package for "A Study on the Pythonic Functional Constructs' Understandability" to appear at ICSE 2024

Authors: Cyrine Zid, Fiorella Zampetti, Giuliano Antoniol, Massimiliano Di Penta

Article Preprint: https://mdipenta.github.io/files/ICSE24_funcExperiment.pdf

Artifacts: https://doi.org/10.5281/zenodo.8191782

License: GPL V3.0

This package contains folders and files with code and data used in the study described in the paper. In the following, we first provide all fields required for the submission, and then report a detailed description of all repository folders.

Artifact Description

Purpose

The artifact is about a controlled experiment aimed at investigating the extent to which Pythonic functional constructs have an impact on source code understandability. The artifact archive contains:

The material to allow replicating the study (see Section Experimental-Material)

Raw quantitative results, working datasets, and scripts to replicate the statistical analyses reported in the paper. Specifically, the executable part of the replication package reproduces figures and tables of the quantitative analysis (RQ1 and RQ2) of the paper starting from the working datasets.

Spreadsheets used for the qualitative analysis (RQ3).

We apply for the following badges:

Available and reusable: because we provide all the material that can be used to replicate the experiment, but also to perform the statistical analyses and the qualitative analyses (spreadsheets, in this case)

Provenance

Paper preprint link: https://mdipenta.github.io/files/ICSE24_funcExperiment.pdf

Artifacts: https://doi.org/10.5281/zenodo.8191782

Data

Results have been obtained by conducting the controlled experiment involving Prolific workers as participants. Data collection and processing followed a protocol approved by the University ethical board. Note that all data enclosed in the artifact is completely anonymized and does not contain sensitive information.

Further details about the provided dataset can be found in the Section Results' directory and files

Setup and Usage (for executable artifacts):

See the Section Scripts to reproduce the results, and instructions for running them

Experiment-Material/

Contains the material used for the experiment, and, specifically, the following subdirectories:

Google-Forms/

Contains (as PDF documents) the questionnaires submitted to the ten experimental groups.

Task-Sources/

Contains, for each experimental group (G-1...G-10), the sources used to produce the Google Forms, and, specifically:

- The cover letter (Letter.docx).

- A directory for each experimental task (Lambda 1, Lambda 2, Comp 1, Comp 2, MRF 1, MRF 2, Lambda Comparison, Comp Comparison, MRF Comparison). Each directory contains the exercise text (in both Word and .txt format), the source code snippet, and its .png image to be used in the form.

Note: the "Comparison" tasks do not have any exercise as the purpose is always the same, i.e., to compare the (perceived) understandability of the snippets and return the results of the comparison.

Code-Examples-Table1/

Contains the source code snippets used as objects of the study (the same you can find under "Task-Sources/"), named as reported in Table 1.

Results' directory and files

raw-responses/

Contains, as spreadsheets, the raw responses provided by the study participants through Google forms.

raw-results-RQ1/

Contains the raw results for RQ1. Specifically, the directory contains a subdirectory for each group (G1-G10). Each subdirectory contains:

- For each user, a directory (named using their Prolific ID) containing, for each question (Q1-Q6), the produced Python code (Qn.py), its output (QnR.txt), and its StdErr output (QnErr.txt).

- "expected-outputs/": a directory containing the expected outputs for each task (Qn.txt).
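
Given that layout, checking a participant's outputs against the expected ones could look roughly like the sketch below. It is only an illustration: the function name `score_user`, the participant directory name, and the file contents are hypothetical, and exact matching after trimming whitespace is an assumption, since the package description does not say how correctness was actually judged.

```python
import tempfile
from pathlib import Path

def score_user(user_dir: Path, expected_dir: Path, questions=range(1, 7)):
    """Return {question: True/False/None}; None when a file is missing."""
    results = {}
    for n in questions:
        produced = user_dir / f"Q{n}R.txt"
        expected = expected_dir / f"Q{n}.txt"
        if not produced.exists() or not expected.exists():
            results[f"Q{n}"] = None
            continue
        # Assumption: a task is correct on an exact match after trimming whitespace
        results[f"Q{n}"] = produced.read_text().strip() == expected.read_text().strip()
    return results

# Tiny demonstration with a made-up group directory
with tempfile.TemporaryDirectory() as tmp:
    group = Path(tmp) / "G1"
    user = group / "hypothetical-prolific-id"
    expected = group / "expected-outputs"
    user.mkdir(parents=True)
    expected.mkdir()
    (expected / "Q1.txt").write_text("[1, 4, 9]\n")
    (expected / "Q2.txt").write_text("42\n")
    (user / "Q1R.txt").write_text("[1, 4, 9]")
    (user / "Q2R.txt").write_text("41\n")
    demo = score_user(user, expected, questions=range(1, 4))

print(demo)
```

Returning `None` for missing files keeps unanswered questions distinct from wrong answers, which matters if the two are treated differently in a later analysis.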

working-results/RQ1-RQ2-files-for-statistical-analysis/

Contains three .csv files used as input for conducting the statistical analysis and drawing the graphs for addressing the first two research questions of the study. Specifically:

ConstructUsage.csv contains the declared usage frequency of the three functional constructs that are the object of the study. This file is used to draw Figure 4. The file contains an entry for each participant, reporting the (text-coded) frequency of construct usage for Comprehension, Lambda, and MRF.

RQ1.csv contains the collected data used for the mixed-effect logistic regression relating the use of functional constructs with the correctness of the change task, as well as the logistic regression relating the use of map/reduce/filter functions with the correctness of the change task. The csv file contains an entry for each answer provided by each subject, and features the following columns:

Group: experimental group to which the participant is assigned

User: user ID

Time: task time in seconds

Approvals: number of approvals on previous tasks performed on Prolific

Student: whether the participant declared themselves as a student

Section: section of the questionnaire (lambda, comp, or mrf)

Construct: specific construct being presented (same as "Section" for lambda and comp, for mrf it says whether it is a map, reduce, or filter)

Question: question id, from Q1 to Q6, indicates the ordering of the question

MainFactor: main factor treatment for the given question - "f" for functional, "p" for procedural counterpart

Outcome: TRUE if the task was correctly performed, FALSE otherwise

Complexity: cyclomatic complexity of the construct (empty for mrf)

UsageFrequency: usage frequency of the given construct
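
As a quick way to get oriented in a file with these columns, the sketch below computes the share of correct answers per MainFactor value using only the standard library. The sample rows are invented for illustration (the real values, including the text coding of UsageFrequency, will differ), and `correctness_by_factor` is a hypothetical helper, not part of the replication package. This is a descriptive summary only, not the mixed-effect logistic regression used in the paper.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using the column names listed above; real values will differ
SAMPLE = """Group,User,Time,Approvals,Student,Section,Construct,Question,MainFactor,Outcome,Complexity,UsageFrequency
G1,u01,120,35,TRUE,lambda,lambda,Q1,f,TRUE,2,often
G1,u01,95,35,TRUE,lambda,lambda,Q2,p,TRUE,2,often
G1,u02,210,12,FALSE,comp,comp,Q3,f,FALSE,3,rarely
G1,u02,180,12,FALSE,mrf,map,Q5,p,TRUE,,never
G2,u03,160,50,FALSE,mrf,filter,Q6,f,TRUE,,sometimes
"""

def correctness_by_factor(handle):
    # MainFactor value -> [number correct, number of answers]
    counts = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(handle):
        counts[row["MainFactor"]][1] += 1
        if row["Outcome"].strip().upper() == "TRUE":
            counts[row["MainFactor"]][0] += 1
    return {factor: correct / total for factor, (correct, total) in counts.items()}

rates = correctness_by_factor(io.StringIO(SAMPLE))
print(rates)
```

To run it on the actual file, replace `io.StringIO(SAMPLE)` with `open("RQ1.csv", newline="")`, assuming the file has a header row with these column names.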

RQ1Paired-RQ2.csv contains the collected data used for the ordinal logistic regression of the relationship between the perceived ease of understanding of the functional constructs and (i) participants' usage frequency, and (ii) constructs' complexity (except for map/reduce/filter). The file features a row for each participant, and the columns are the following:

Group: experimental group to which the participant is assigned

User: user ID

Time: task time in seconds

Approvals: number of approvals on previous tasks performed on Prolific

Student: whether the participant declared themselves as a student

LambdaF: result for the change task related to a lambda construct

LambdaP: result for the change task related to the procedural counterpart of a lambda construct

CompF: result for the change task related to a comprehension construct

CompP: result for the change task related to the procedural counterpart of a comprehension construct

MrfF: result for the change task related to an MRF construct

MrfP: result for the change task related to the procedural counterpart of an MRF construct

LambdaComp: perceived understandability level for the comparison task (RQ2) between a lambda and its procedural counterpart

CompComp: perceived understandability level for the comparison task (RQ2) between a comprehension and its procedural counterpart

MrfComp: perceived understandability level for the comparison task (RQ2) between an MRF and its procedural counterpart

LambdaCompCplx: cyclomatic complexity of the lambda construct involved in the comparison task (RQ2)

CompCompCplx: cyclomatic complexity of the comprehension construct involved in the comparison task (RQ2)

MrfCompType: type of MRF construct (map, reduce, or filter) used in the comparison task (RQ2)

LambdaUsageFrequency: self-declared usage frequency on lambda constructs

CompUsageFrequency: self-declared usage frequency on comprehension constructs

MrfUsageFrequency: self-declared usage frequency on MRF constructs

LambdaComparisonAssessment: outcome of the manual assessment of the answer to the "check question" required for the lambda comparison ("yes" means valid, "no" means wrong, "moderatechatgpt" and "extremechatgpt" are the results of GPTZero)

CompComparisonAssessment: as above, but for comprehension

MrfComparisonAssessment: as above, but for MRF

working-results/inter-rater-RQ3-files/

This directory contains four .csv files used as input for computing the inter-rater agreement for the manual labeling used for addressing RQ3. Specifically, you will find one file for each functional construct, i.e., comprehension.csv, lambda.csv, and mrf.csv, and a different file used for highlighting the reasons why participants prefer to use the procedural paradigm, i.e., procedural.csv.
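
The description does not name the agreement statistic that was computed. As one common choice for two raters, Cohen's kappa can be sketched as below; treating it as the statistic used here is an assumption, and the category labels and rater lists are invented for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels assigned by two raters to six answers
rater_1 = ["readability", "conciseness", "readability", "habit", "conciseness", "readability"]
rater_2 = ["readability", "conciseness", "habit", "habit", "conciseness", "conciseness"]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 4 of 6 items (observed agreement 24/36) against a chance agreement of 11/36, which gives kappa = 13/25.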

working-results/RQ2ManualValidation.csv

This file contains the results of the manual validation performed to sanitize the answers provided by our participants and used for addressing RQ2. Specifically, we coded the behaviour description using four different levels: (i) correct ("yes"), (ii) somewhat correct ("partial"), (iii) wrong ("no"), and (iv) automatically generated. The file features a row for each participant, and the columns are the following:

ID: ID we used to refer to the participant in the paper's qualitative analysis

Group: experimental group to which the participant is assigned

ProlificID: user ID

Comparison for lambda construct description: answer provided by the user for the lambda comparison task

Final Classification: our assessment of the lambda comparison answer

Comparison for comprehension description: answer provided by the user for the comprehension comparison task

Final Classification: our assessment of the comprehension comparison answer

Comparison for MRF description: answer provided by the user for the MRF comparison task

Final Classification: our assessment of the MRF comparison answer

working-results/RQ3ManualValidation.xlsx

This file contains the results of the open coding applied to address our third research question. Specifically, you will find four sheets, one for each functional construct and one for the procedural paradigm. Each sheet reports the provided answers together with the categories assigned to them. Each sheet contains the following columns:

ID: ID we used to refer the participant in the paper's qualitative

