100+ datasets found
  1. Logistic Regression

    • kaggle.com
    zip
    Updated Dec 24, 2017
    Cite
    Ananya Nayan (2017). Logistic Regression [Dataset]. https://www.kaggle.com/datasets/dragonheir/logistic-regression
    Explore at:
    zip (3349 bytes)
    Dataset updated
    Dec 24, 2017
    Authors
    Ananya Nayan
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Ananya Nayan

    Released under Database: Open Database, Contents: © Original Authors

    Contents

  2. Titaanic Dataset for logistic regression

    • kaggle.com
    zip
    Updated Mar 21, 2023
    Cite
    Mohamed Khaled (2023). Titaanic Dataset for logistic regression [Dataset]. https://www.kaggle.com/datasets/moro146/titaanic-dataset-for-logistic-regression
    Explore at:
    zip (4965 bytes)
    Dataset updated
    Mar 21, 2023
    Authors
    Mohamed Khaled
    Description

    Dataset

    This dataset was created by Mohamed Khaled

    Contents

  3. Startup - Multiple Linear Regression

    • kaggle.com
    zip
    Updated Jan 29, 2018
    Cite
    karthickveerakumar (2018). Startup - Multiple Linear Regression [Dataset]. https://www.kaggle.com/datasets/karthickveerakumar/startup-logistic-regression
    Explore at:
    zip (1330 bytes)
    Dataset updated
    Jan 29, 2018
    Authors
    karthickveerakumar
    Description

    Dataset

    This dataset was created by karthickveerakumar

    Contents

  4. Marketing Linear Multiple Regression

    • kaggle.com
    zip
    Updated Apr 24, 2020
    Cite
    FayeJavad (2020). Marketing Linear Multiple Regression [Dataset]. https://www.kaggle.com/datasets/fayejavad/marketing-linear-multiple-regression
    Explore at:
    zip (1907 bytes)
    Dataset updated
    Apr 24, 2020
    Authors
    FayeJavad
    Description

    Dataset

    This dataset was created by FayeJavad

    Contents

  5. Datasets used to train and test prediction model to predict scores in terms of SDG 7 realization

    • data.mendeley.com
    Updated Mar 5, 2025
    + more versions
    Cite
    Jarosław Wątróbski (2025). Datasets used to train and test prediction model to predict scores in terms of SDG 7 realization [Dataset]. http://doi.org/10.17632/6c8fm7s4y2.1
    Explore at:
    Dataset updated
    Mar 5, 2025
    Authors
    Jarosław Wątróbski
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The datasets used in this research work refer to the aims of Sustainable Development Goal 7. They were used to train and test a machine learning model based on an artificial neural network, as well as other machine learning regression models, for predicting scores in terms of SDG 7 realization. The training dataset was created from data covering 2013 to 2021 and includes 261 samples; the test dataset includes 29 samples. Source data from 2013 to 2022 are available in 10 XLSX and CSV files, the train and test datasets are available as XLSX and CSV files, and a detailed description of the data is provided in a PDF file.
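
    A minimal sketch of how such train/test files might be loaded and used with a scikit-learn regressor is shown below; the file names, column layout, and the "score" target column are assumptions for illustration, not taken from the dataset documentation.

    import pandas as pd
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_absolute_error

    # Hypothetical file and column names; adjust to the actual CSV layout.
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
    X_train, y_train = train.drop(columns=["score"]), train["score"]
    X_test, y_test = test.drop(columns=["score"]), test["score"]

    # An artificial neural network regressor, one of the model families mentioned above.
    model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=5000, random_state=0)
    model.fit(X_train, y_train)
    print("MAE on the test samples:", mean_absolute_error(y_test, model.predict(X_test)))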

  6. Prediction of Personality Traits using the Big 5 Framework

    • zenodo.org
    csv, text/x-python
    Updated Feb 2, 2023
    Cite
    Neelima Brahmbhatt; Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
    Explore at:
    text/x-python, csv
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Neelima Brahmbhatt; Neelima Brahmbhatt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The methodology is the core component of any research-related work; it describes how the results were obtained. Here, the whole research implementation is done using Python. The work involves the following steps:

    1. Acquire Personality Dataset

    Kaggle hosts a collection of datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP, and it can be downloaded as a zip file from the link provided. The download consists of two CSV files (test.csv and train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is then performed to check for inconsistent behaviors or trends.

    2. Data preprocessing

    After data acquisition, the next step is to clean and preprocess the data. The available dataset has numerical features, and the target is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is split into training and testing sets by passing the feature values, target values, and test size to the train_test_split method of the scikit-learn package. The training data is then used to fit the logistic regression and SVM models, and the test data is used to estimate the accuracy of the trained models.

    3. Feature Extraction

    The following items were presented on one page and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

            EXT1 I am the life of the party.
            EXT2  I don't talk a lot.
            EXT3  I feel comfortable around people.
            EXT4  I am quiet around strangers.
            EST1  I get stressed out easily.
            EST2  I get irritated easily.
            EST3  I worry about things.
            EST4  I change my mood a lot.
            AGR1  I have a soft heart.
            AGR2  I am interested in people.
            AGR3  I insult people.
            AGR4  I am not really interested in others.
            CSN1  I am always prepared.
            CSN2  I leave my belongings around.
            CSN3  I follow a schedule.
            CSN4  I make a mess of things.
            OPN1  I have a rich vocabulary.
            OPN2  I have difficulty understanding abstract ideas.
            OPN3  I do not have a good imagination.
            OPN4  I use difficult words.

    4. Training the Model

    Train/test is a method to measure the accuracy of a model. It is called train/test because the data set is split into two sets: a training set and a testing set, here 80% for training and 20% for testing. The model is trained using the training set. In this work, the models were trained using linear_model.LogisticRegression() and svm.SVC() from the sklearn package.

    5. Personality Prediction Output

    After training, the Logistic Regression and SVM models are evaluated on the test data using cohen_kappa_score and accuracy_score.
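
    A compact sketch of the pipeline described in steps 2-5 is given below, assuming train.csv holds the seven numeric attributes plus a "Personality" label column (the column name is illustrative, not taken from the dataset documentation).

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, cohen_kappa_score

    data = pd.read_csv("train.csv")             # personality prediction data from Kaggle
    X = data.drop(columns=["Personality"])      # seven numeric attributes ("Personality" column name is assumed)
    y = data["Personality"]                     # five-level personality label

    # 80/20 train/test split, as described in step 4.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    for name, clf in [("Logistic Regression", LogisticRegression(max_iter=1000)), ("SVM", SVC())]:
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        print(name, "accuracy:", accuracy_score(y_test, pred), "kappa:", cohen_kappa_score(y_test, pred))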

  7. 2. All Data: Using landslide-inventory mapping for a combined bagged-trees and logistic-regression approach to determining landslide susceptibility in eastern Kentucky, United States

    • geolsoc.figshare.com
    txt
    Updated Apr 23, 2021
    + more versions
    Cite
    Matthew M. Crawford; Jason M. Dortch; Hudson J. Koch; Ashton A. Killen; Junfeng Zhu; Yichaun Zhu; Lindsey S. Bryson; William C. Haneberg (2021). 2. All Data: Using landslide-inventory mapping for a combined bagged-trees and logistic-regression approach to determining landslide susceptibility in eastern Kentucky, United States [Dataset]. http://doi.org/10.6084/m9.figshare.14473487.v1
    Explore at:
    txt
    Dataset updated
    Apr 23, 2021
    Dataset provided by
    Geological Society of London (http://www.geolsoc.org.uk/)
    Authors
    Matthew M. Crawford; Jason M. Dortch; Hudson J. Koch; Ashton A. Killen; Junfeng Zhu; Yichaun Zhu; Lindsey S. Bryson; William C. Haneberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States, Kentucky
    Description

    The statistical data used in the combined machine-learning functions are available here as CSV files. This includes all geomorphic data at the determined radial buffer size (36 variables), the results of the bagged-trees function (12 variables), and the feature-importance data produced by the bagged trees.
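
    The paper's exact workflow is not reproduced here, but a rough, hypothetical sketch of a combined bagged-trees and logistic-regression approach on tabular geomorphic variables could look like the following (the file name, the "landslide" label column, and the 12-variable cutoff are assumptions for illustration).

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("all_geomorphic_data.csv")              # hypothetical file with 36 geomorphic variables
    X, y = df.drop(columns=["landslide"]), df["landslide"]   # "landslide" label column is assumed

    # Bagged decision trees to rank the geomorphic variables by importance.
    bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=200, random_state=0)
    bag.fit(X, y)
    importances = np.mean([t.feature_importances_ for t in bag.estimators_], axis=0)
    top12 = X.columns[np.argsort(importances)[::-1][:12]]

    # Logistic regression fitted on the 12 highest-ranked variables.
    logit = LogisticRegression(max_iter=1000).fit(X[top12], y)
    print(dict(zip(top12, logit.coef_[0].round(3))))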

  8. Replication Data for: Site C Logistic Regression model

    • borealisdata.ca
    Updated Nov 18, 2025
    Cite
    Eric Taylor (2025). Replication Data for: Site C Logistic Regression model [Dataset]. http://doi.org/10.5683/SP3/MA1ATA
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 18, 2025
    Dataset provided by
    Borealis
    Authors
    Eric Taylor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Site C dam, Ft St John, Peace River, Canada, BC
    Description

    The data are genetic assignments to upstream or downstream of the Site C dam (bull trout, Arctic grayling, and rainbow trout). Columns are defined in the CSV file. A file of R code to run the analysis is also included.

  9. Data for: Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model

    • narcis.nl
    • data.mendeley.com
    Updated Nov 8, 2017
    + more versions
    Cite
    Le, T (via Mendeley Data) (2017). Data for: Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model [Dataset]. http://doi.org/10.17632/d37mzs3b3m.2
    Explore at:
    Dataset updated
    Nov 8, 2017
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Le, T (via Mendeley Data)
    Description

    The "Dataset_HIR" folder contains the data to reproduce the results of the data mining approach proposed in the manuscript titled "Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model".

    More specifically, the folder contains the raw electronic structure calculation input data provided by the domain experts as well as the training and testing dataset with the extracted features.

    The "Dataset_HIR" folder contains the following subfolders namely:

    1. Electronic structure calculation input data: contains the electronic structure calculation input generated by the Gaussian program

      1.1. Training data: contains the raw data of all training species (each is stored in a separate folder) used for extracting the dataset for the training and validation phase.

      1.2. Testing data: contains the raw data of all testing species (each is stored in a separate folder) used for extracting data for the testing phase.

    2. Dataset

      2.1. Training dataset: used to produce the results in Tables 3 and 4 in the manuscript.

      + datasetTrain_raw.csv: contains the features for all vibrational modes associated with corresponding labeled species to let the chemists select the Hindered Internal Rotor from the list easily for the training and validation steps.  
      
      + datasetTrain.csv: refines the datasetTrain_raw.csv where the names of the species are all removed to transform the dataset into an appropriate form for the modeling and validation steps.
      

      2.2. Testing dataset: used to produce the results of the data mining approach in Table 5 in the manuscript.

      + datasetTest_raw.csv: contains the features for all vibrational modes of each labeled species to let the chemists select the Hindered Internal Rotor from the list for the testing step.
      
      + datasetTest.csv: refines the datasetTest_raw.csv where the names of the species are all removed to transform the dataset into an appropriate form for the testing step.
      

    Note for the Result feature in the dataset: 1 is for the mode needed to be treated as Hindered Internal Rotor, and 0 otherwise.
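
    A minimal sketch of how datasetTrain.csv and datasetTest.csv might be used to fit the multivariate logistic regression is shown below, assuming the binary label is stored in a column named "Result" as described in the note above (the exact column layout is an assumption).

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    train = pd.read_csv("datasetTrain.csv")
    test = pd.read_csv("datasetTest.csv")

    # Result = 1 marks a mode to be treated as a Hindered Internal Rotor, 0 otherwise.
    X_train, y_train = train.drop(columns=["Result"]), train["Result"]
    X_test, y_test = test.drop(columns=["Result"]), test["Result"]

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))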

  10. marital status using logistic regression

    • kaggle.com
    Updated Aug 27, 2020
    Cite
    Study Mart (2020). marital status using logistic regression [Dataset]. https://www.kaggle.com/datasets/studymart/marital-status-using-logistic-regression
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 27, 2020
    Dataset provided by
    Kaggle
    Authors
    Study Mart
    Description

    Dataset

    This dataset was created by Study Mart

    Contents

  11. One Classifier Ignores a Feature

    • data.niaid.nih.gov
    Updated Apr 29, 2022
    Cite
    Maier, Karl (2022). One Classifier Ignores a Feature [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6502642
    Explore at:
    Dataset updated
    Apr 29, 2022
    Authors
    Maier, Karl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data sets are used in a controlled experiment in which two classifiers are compared. train_a.csv and explain.csv are slices of the original data set. train_b.csv contains the same instances as train_a.csv, but with feature x1 set to 0 to make it unusable to classifier B.

    The original data set was created and split using this Python code:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Synthetic two-feature data set, scaled by 100
    X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                               n_clusters_per_class=1, class_sep=0.75, random_state=0)
    X *= 100

    # Classifier A: trained on both features (train_a.csv)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    lm = LogisticRegression()
    lm.fit(X_train, y_train)
    clf_a = lm

    # Classifier B: feature x1 zeroed out before training (train_b.csv)
    clf_b = LogisticRegression()
    X2 = X.copy()
    X2[:, 0] = 0
    X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
    clf_b.fit(X2_train, y2_train)

    # Held-out slice used as the explanation set (explain.csv)
    X_explain = X_test
    y_explain = y_test

  12. Datasets and results of the paper titled "Are citation networks relevant to explain academic promotions? An empirical analysis of the Italian national scientific qualification"

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    Updated Jul 17, 2024
    Cite
    Elvira Pelle (2024). Datasets and results of the paper titled "Are citation networks relevant to explain academic promotions? An empirical analysis of the Italian national scientific qualification" [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_5118644
    Explore at:
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Andrea Sciandra
    Elvira Pelle
    Francesco Poggi
    Maria Cristiana Martini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are the input datasets and the results of the analyses reported in the paper titled "Are citation networks relevant to explain academic promotions? An empirical analysis of the Italian national scientific qualification".

    Abstract:

    The aim of this paper is to study the role of citation network measures in the assessment of scientific maturity. Referring to the case of the Italian national scientific qualification (ASN), we investigate if there is a relationship between citation network indices and the results of the researchers’ evaluation procedures. In particular, we want to understand if network measures can enhance the prediction accuracy of the results of the evaluation procedures beyond basic performance indices. Moreover, we want to highlight which citation network indices prove to be more relevant in explaining the ASN results, and if quantitative indices used in the citation-based disciplines assessment can replace the citation network measures in non-citation-based disciplines. Data concerning Statistics and Computer Science disciplines are collected from different sources (ASN, Italian Ministry of University and Research, and Scopus) and processed in order to calculate the citation-based measures used in this study. Following, we apply classification models to estimate the effects of network variables. We find that network measures are strongly related to the results of the ASN and significantly improve the explanatory power of the models, especially for the research fields of Statistics. Additionally, citation networks in the specific sub-disciplines are far more relevant than those in the general disciplines. Finally, results show that the citation network measures are not a substitute of the citation-based bibliometric indices.

    Code

    The code to collect and process the data used in this paper is available on GitHub at https://github.com/DigitalDataLab/ASN16-18_CitationNetwork.

    Dataset description

    The files AdjacencyMatrix_01B1.csv, AdjacencyMatrix_09H1.csv, AdjacencyMatrix_13D1.csv, AdjacencyMatrix_13D2.csv and AdjacencyMatrix_13D3.csv are the citation matrices for Italian academics (i.e. ASN candidates and permanent positions in the Italian academic system) in the Recruitment Fields (RFs) 01/B1, 09/H1, 13/D1, 13/D2 and 13/D3, respectively.

    The files AdjacencyMatrix_CS.csv and AdjacencyMatrix_ST.csv are the citation matrices for the Italian academics in the Computer Science disciplines (i.e. RFs 01/B1 and 09/H1) and the Statistical disciplines (i.e. RFs 13/D1, 13/D2 and 13/D3), respectively.

    The files CS_01B1_1.csv, CS_09H1_1.csv, ST_13D1_1.csv, ST_13D2_1.csv and ST_13D3_1.csv contain the data used to build the logistic regression models presented in the paper for the Italian academics at the Full Professor (FP) level.

    The files CS_01B1_2.csv, CS_09H1_2.csv, ST_13D1_2.csv, ST_13D2_2.csv and ST_13D3_2.csv contain the data used to build the logistic regression models presented in the paper for the Italian academics at the Associate Professor (AP) level.

    The file Codebook.pdf is the codebook of the previous ten files.

    The file Appendix.pdf contains the final results of the stepwise logistic regressions computed for each level (i.e. Full Professor and Associate Professor) and Recruitment Field in the Computer Science and Statistics disciplines.

    The file NormalityAssessment.pdf contains the normality assessment of citation network indices.

  13. [A Procedure for Multilevel Logistic Modeling] Appendix, Datasets, and Syntax Files

    • figshare.com
    pdf
    Updated May 29, 2024
    Cite
    Nicolas Sommet (2024). [A Procedure for Multilevel Logistic Modeling] Appendix, Datasets, and Syntax Files [Dataset]. http://doi.org/10.6084/m9.figshare.5350786.v6
    Explore at:
    pdf
    Dataset updated
    May 29, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    figshare
    Authors
    Nicolas Sommet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • For each software, a series of sub-appendices describes the way to handle each stage of the procedure.
    • For each software, a .zip file contains the dataset in .dta (for Stata), .rdata (for R), .dat (for Mplus), and .sav (for SPSS), as well as the syntax file(s) in .do (for Stata), .R (for R), .inp (for Mplus), and .sps (for SPSS).
    • The dataset is also provided in .csv format.

    If you notice a mistake in the Stata or SPSS-related Appendices and/or syntax files, please report it to Nicolas Sommet (nicolas.sommet@unil.ch). If you notice a mistake in the R or Mplus-related Appendices and/or syntax files, please report it to Davide Morselli (davide.morselli@unil.ch).

    Sommet, N. and Morselli, D. (2017). Keep Calm and Learn Multilevel Logistic Modeling: A Simplified Three-Step Procedure Using Stata, R, Mplus, and SPSS. International Review of Social Psychology, 30, 203–218. DOI: https://doi.org/10.5334/irsp.90

  14. Data from: Macaques preferentially attend to intermediately surprising information

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Apr 26, 2022
    Cite
    Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd (2022). Macaques preferentially attend to intermediately surprising information [Dataset]. http://doi.org/10.6078/D15Q7Q
    Explore at:
    zip
    Dataset updated
    Apr 26, 2022
    Dataset provided by
    Klaviyo
    Yale University
    University of Minnesota
    University of California, Berkeley
    Authors
    Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning.

    Methods

    How the data were collected: In this project, we collected gaze data of 5 macaques as they watched sequential visual displays designed to elicit probabilistic expectations. Gaze was recorded using the Eyelink Toolbox and sampled at 1000 Hz by an infrared eye-monitoring camera system.

    Dataset:

    "csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:

    subj: subject_ID = {"B":104, "C":102, "H":101, "J":103, "K":203}
    trialtime: start time of current trial in seconds
    trial: current trial number (each trial featured one of 80 possible visual-event sequences) (in order)
    seq current: sequence number (one of 80 sequences)
    seq_item: current item number in a seq (in order)
    active_item: pop-up item (active box)
    pre_active: prior pop-up item (active box) {-1: "the first active object in the sequence / no active object before the currently active object in the sequence"}
    next_active: next pop-up item (active box) {-1: "the last active object in the sequence / no active object after the currently active object in the sequence"}
    firstappear: {0: "not first", 1: "first appear in the seq"}
    looks_blank: csv: total amount of time looking at blank space for the current event (ms); csv_timestamp: {1: "look blank at timestamp", 0: "not look blank at timestamp"}
    looks_offscreen: csv: total amount of time looking offscreen for the current event (ms); csv_timestamp: {1: "look offscreen at timestamp", 0: "not look offscreen at timestamp"}
    time till target: time spent to first start looking at the target object (ms) {-1: "never look at the target"}
    looks target: csv: time spent looking at the target object (ms); csv_timestamp: look at the target or not at current timestamp (1 or 0)
    look1,2,3: time spent looking at each object (ms)
    location 123X, 123Y: location of each box (the locations of the three boxes for a given sequence were chosen randomly, but remained static throughout the sequence)
    item123id: pop-up item ID (remained static throughout a sequence)
    event time: total time spent for the whole event (pop-up and go back) (ms)
    eyeposX,Y: eye position at current timestamp

    "csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:

    rt: time till target {-1: "never look at the target"}. In data analysis, we included data that have rt > 0.
    already_there: {NA: "never look at the target object"}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and already_there is not NA.
    looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"}
    prob: the probability of the occurrence of the object
    surprisal: unigram surprisal value
    bisurprisal: transitional surprisal value
    std_surprisal: standardized unigram surprisal value
    std_bisurprisal: standardized transitional surprisal value
    binned_surprisal_means: the means of unigram surprisal values binned to three groups of evenly spaced intervals according to surprisal values
    binned_bisurprisal_means: the means of transitional surprisal values binned to three groups of evenly spaced intervals according to surprisal values

    "csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables, etc.

    "AllSeq.csv" includes event information of all 80 sequences.

    Empty Values in Datasets:

    There is no missing value in the original dataset "csv-combined.csv". Missing values (marked as NA in datasets) happen in columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in columns "prev_active" and "next_active" mean that the first or the last active object in the sequence/no active object before or after the currently active object in the sequence. When we analyzed the variable "already_there", we eliminated data that their "prev_active" variable is NA. NAs in column "already there" mean that the subject never looks at the target object in the current event. When we analyzed the variable "already there", we eliminated data that their "already_there" variable is NA. Missing values happen in columns "bisurprisal", "std_bisurprisal", "sq_std_bisurprisal" when it is the first event in the sequence and the transitional probability of the event cannot be computed because there's no event happening before in this sequence. When we fitted models for transitional statistics, we eliminated data that their "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" are NAs.

    Codes:

    In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. We computed the following variables which are necessary for further analysis, modeling, and visualizations in this notebook (see above for details): active_item, pre_active, next_active, firstappear ,looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means. "Analysis_Code_final.Rmd" is the main scripts that we further processed the data, built models, and created visualizations for data. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop up object, trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data. "helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd

  15. Replication Package of "Battling Phish"

    • figshare.com
    csv
    Updated Oct 9, 2025
    Cite
    Anonymous Author (2025). Replication Package of "Battling Phish" [Dataset]. http://doi.org/10.6084/m9.figshare.30324559.v1
    Explore at:
    csv
    Dataset updated
    Oct 9, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    figshare
    Authors
    Anonymous Author
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains all datasets and source code used in our study on phishing URL detection. The study investigates the effectiveness of traditional machine learning (ML), deep learning (DL), and large language model (LLM)-based methods using a consistent set of 32 URL-based features.

    Directory Structure

    ├── Datasets/
    │   ├── Dataset-1.csv
    │   ├── Dataset-2.csv
    │   ├── Dataset-3.csv
    │   ├── Dataset-4.csv
    │   ├── Dataset-5.csv
    │   ├── Phishing_Site_URLs_32_Features_Extracted_Data.csv
    │   └── Legit_Phish_32_Features_Extracted_Data.csv
    └── Source_Codes/
        ├── Feature_extraction_source_code.py
        ├── Feature_importance_analysis_source_code.py
        ├── ML/
        │   ├── Seven_ML_Models_trained_on_LP.py
        │   ├── Seven_ML_Models_trained_on_PSU.py
        │   ├── SoftVoting_trained_on_LP.py
        │   ├── SoftVoting_trained_on_PSU.py
        │   ├── HardVoting_trained_on_LP.py
        │   └── HardVoting_trained_on_PSU.py
        ├── DL/
        │   ├── [DLModel1]_trained_on_LP.py
        │   ├── [DLModel1]_trained_on_PSU.py
        │   └── ... (total 16 files for 8 DL algorithms)
        └── LLM/
            ├── BERT_Fine_Tuned_on_LP.py
            ├── BERT_Fine_Tuned_on_PSU.py
            ├── DistilBERT_Fine_Tuned_on_LP.py
            ├── DistilBERT_Fine_Tuned_on_PSU.py
            ├── PhishBERT_Evaluation.py
            └── URLBERT_Evaluation.py

    Datasets:

    Dataset-1.csv to Dataset-5.csv: Used for feature importance analysis.
    Phishing_Site_URLs_32_Features_Extracted_Data.csv (PSU dataset): Includes phishing and legitimate URLs with 32 extracted lexical features.
    Legit_Phish_32_Features_Extracted_Data.csv (LP dataset): Another benchmark dataset with the same 32 features, used for comparative evaluation.

    Note: the PSU and LP datasets are used for both training and evaluating the ML, DL, and LLM-based models.

    Source Code:

    Feature_extraction_source_code.py: Extracts 32 handcrafted lexical features from raw URL data.
    Feature_importance_analysis_source_code.py: Performs feature selection using seven statistical and model-based ranking methods.

    Machine Learning (ML): Implements ML classifiers individually trained on the LP and PSU datasets: Logistic Regression, Decision Tree, Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and XGBoost. Soft Voting and Hard Voting ensembles are also implemented.
    Scripts: Seven_ML_Models_trained_on_LP.py, Seven_ML_Models_trained_on_PSU.py, SoftVoting_trained_on_LP.py, SoftVoting_trained_on_PSU.py, HardVoting_trained_on_LP.py, HardVoting_trained_on_PSU.py

    Deep Learning (DL): Implements eight deep learning architectures, each trained separately on LP and PSU. Total of 16 scripts, 2 per DL model (1 for LP, 1 for PSU).

    Large Language Models (LLMs):
    Fine-tuned: BERT_Fine_Tuned_on_LP.py, BERT_Fine_Tuned_on_PSU.py, DistilBERT_Fine_Tuned_on_LP.py, DistilBERT_Fine_Tuned_on_PSU.py
    Pre-trained, zero-shot or direct evaluation: PhishBERT_Evaluation.py, URLBERT_Evaluation.py
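
    As a rough sketch of the ML portion of the pipeline, training a soft-voting ensemble on the LP dataset might look like the following (the "label" column name and the particular base classifiers shown are assumptions; the actual scripts are those listed under Source_Codes/ML/).

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("Legit_Phish_32_Features_Extracted_Data.csv")   # LP dataset with 32 lexical features
    X, y = df.drop(columns=["label"]), df["label"]                   # "label" column name is assumed

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
            ("gb", GradientBoostingClassifier(random_state=0)),
        ],
        voting="soft",   # voting="hard" gives the hard-voting variant
    )
    ensemble.fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, ensemble.predict(X_test)))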

  16. LogisticRegression scores

    • figshare.com
    txt
    Updated Mar 4, 2024
    Cite
    Pablo Caballero (2024). LogisticRegression scores [Dataset]. http://doi.org/10.6084/m9.figshare.23904903.v1
    Explore at:
    txt
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    figshare
    Authors
    Pablo Caballero
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets presented in this repository are obtained by applying the logistic regression (LogisticRegression) algorithm with the following specific hyperparameter setting and the different inputs that have been described in the journal paper "Data mining techniques for endometriosis detection in a data-scarce medical dataset".

    Hyperparameters

    C: 1
    Solver: liblinear
    max_iter: 1000000

    Files

    result_eb.csv: Results for EB sample type.
    result_ef.csv: Results for EF sample type.
    result_vagina.csv: Results for vagina sample type.
    result_oral.csv: Results for oral sample type.
    result_feces.csv: Results for feces sample type.
    result_frt.csv: Results for FRT (EB + EF + vagina) sample type.
    result_frt2.csv: Results for FRT2 (EB + vagina) sample type.
    properties_eb.log: Arguments and result information for EB sample type (C, n_split, solver, max_iter, random_state, len_scores_before_filtering, len_scores_after_filtering, len_f1).
    properties_ef.log: Arguments and result information for EF sample type.
    properties_vagina.log: Arguments and result information for vagina sample type.
    properties_oral.log: Arguments and result information for oral sample type.
    properties_feces.log: Arguments and result information for feces sample type.
    properties_frt.log: Arguments and result information for FRT sample type.
    properties_frt2.log: Arguments and result information for FRT2 sample type.
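
    The reported hyperparameter setting maps directly to the scikit-learn call sketched below; the synthetic data stands in for the sample-type datasets described in the paper and is only there to make the snippet runnable.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Hyperparameters exactly as listed above.
    clf = LogisticRegression(C=1, solver="liblinear", max_iter=1000000)

    # Placeholder data standing in for a sample-type dataset (EB, EF, vagina, oral, feces, FRT, FRT2).
    X, y = make_classification(n_samples=120, n_features=20, random_state=0)
    print(cross_val_score(clf, X, y, cv=5))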

  17. R script and dataset for Bayesian hierarchical logistic modeling of percent male in sea lamprey populations in lentic and river environments

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    • +1more
    Updated Mar 1, 2017
    Cite
    Brenden, Travis (2017). R script and dataset for Bayesian hierarchical logistic modeling of percent male in sea lamprey populations in lentic and river environments [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001810488
    Explore at:
    Dataset updated
    Mar 1, 2017
    Authors
    Brenden, Travis
    Description

    R code for conducting analyses described in Johnson, N.S., W.D. Swink, and T.O. Brenden. Field study suggests that sex determination in sea lamprey is directly influenced by larval growth rate. Proceedings of the Royal Society B.

    data.csv is the raw data for fitting the Bayesian hierarchical logistic regression model.
    Read me_Metadata... is the metadata describing the variables in the data.csv file.
    Rscript.R is the R script for fitting the Bayesian hierarchical logistic regression model.

  18. Transportation and Logistics Tracking Dataset

    • kaggle.com
    zip
    Updated May 5, 2024
    Cite
    Nicole Machado (2024). Transportation and Logistics Tracking Dataset [Dataset]. https://www.kaggle.com/datasets/nicolemachado/transportation-and-logistics-tracking-dataset
    Explore at:
    zip (3705944 bytes)
    Dataset updated
    May 5, 2024
    Authors
    Nicole Machado
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Transportation and Logistics Tracking Dataset comprises multiple datasets related to various aspects of transportation and logistics operations. It includes information on on-time delivery impact, routes by rating, customer ratings, delivery times with and without congestion, weather conditions, and differences between fixed and main delivery times across different regions.

    On-Time Delivery Impact: This dataset provides insights into the impact of on-time delivery, categorizing deliveries based on their impact and counting the occurrences for each category.
    Routes by Rating: Here, the dataset illustrates the relationship between routes and their corresponding ratings, offering a visual representation of route performance across different rating categories.
    Customer Ratings and On-Time Delivery: This dataset explores the relationship between customer ratings and on-time delivery, presenting a comparison of delivery counts based on customer ratings and on-time delivery status.
    Delivery Time with and Without Congestion: It contains information on delivery times in various cities, both with and without congestion, allowing for an analysis of how congestion affects delivery efficiency.
    Weather Conditions: This dataset provides a summary of weather conditions, including counts for different weather conditions such as partly cloudy, patchy light rain with thunder, and sunny.
    Difference between Fixed and Main Delivery Times: Lastly, the dataset highlights the differences between fixed and main delivery times across different regions, shedding light on regional variations in delivery schedules.

    Overall, this dataset offers valuable insights into the transportation and logistics domain, enabling analysis and decision-making to optimize delivery processes and enhance customer satisfaction.

  19. Pass or Not? Students Exam Score Data

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    Yuanyuan CHEN (2024). Pass or Not? Students Exam Score Data [Dataset]. https://www.kaggle.com/datasets/cchen002/pass-or-not-students-exam-score-data
    Explore at:
    zip (2675 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    Yuanyuan CHEN
    Description

    Dataset Description:

    The "Exam Scores Dataset" is a synthetic dataset generated using Python. It contains records of exam scores for two exams, labeled as "Exam Score1" and "Exam Score2". Each row in the dataset represents a student's performance in these exams. Additionally, there is a binary indicator labeled "Pass", denoting whether the student has passed both exams.

    • Exam Score1: Represents the score obtained by the student in the first exam. Scores range from 0 to 100, inclusive, with decimal values.
    • Exam Score2: Represents the score obtained by the student in the second exam. Scores range from 0 to 100, inclusive, with decimal values.
    • Pass: A binary indicator denoting the pass status of the student. It takes a value of 1 if the student has passed both exams, and 0 otherwise.

    The dataset is designed for educational purposes, providing a practical tool for learners to engage in hands-on exercises and gain a deeper understanding of binary classification concepts, particularly in the context of educational assessment and student performance evaluation. Feel free to utilize the dataset for practice and experimentation to enhance your proficiency in machine learning algorithms and techniques.

    Example Task: Predicting Exam 3 Pass Status

    The task involves using logistic regression to predict whether a student can pass exam 3 based on their scores in exams 1 and 2. Given a student's scores of 77 in Exam Score1 and 58 in Exam Score2, the logistic regression model will be trained using the generated dataset to predict whether the student can pass exam 3.

    This task serves as an introductory exercise to binary classification using logistic regression, demonstrating how machine learning models can be applied to predict binary outcomes based on input features.
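
    A minimal sketch of the example task, assuming the CSV exposes the three columns named above (the file name and exact column labels are assumptions).

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("exam_scores.csv")                  # hypothetical file name for this dataset
    X = df[["Exam Score1", "Exam Score2"]]
    y = df["Pass"]

    model = LogisticRegression().fit(X, y)

    # Predict the pass status for a student scoring 77 on Exam Score1 and 58 on Exam Score2.
    student = pd.DataFrame({"Exam Score1": [77], "Exam Score2": [58]})
    print("Predicted pass:", model.predict(student)[0])
    print("Pass probability:", model.predict_proba(student)[0, 1])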

  20. Data from: Replication package for the paper: "A Study on the Pythonic Functional Constructs' Understandability"

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +1more
    Updated Jan 23, 2024
    Cite
    Zid, Cyrine; Zampetti, Fiorella; Antoniol, Giuliano; Di Penta, Massimiliano (2024). Replication package for the paper: "A Study on the Pythonic Functional Constructs' Understandability" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8191782
    Explore at:
    Dataset updated
    Jan 23, 2024
    Dataset provided by
    Polytechnique Montréal
    University of Sannio
    Authors
    Zid, Cyrine; Zampetti, Fiorella; Antoniol, Giuliano; Di Penta, Massimiliano
    License

    https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    Replication Package for "A Study on the Pythonic Functional Constructs' Understandability" to appear at ICSE 2024

    Authors: Cyrine Zid, Fiorella Zampetti, Giuliano Antoniol, Massimiliano Di Penta

    Article Preprint: https://mdipenta.github.io/files/ICSE24_funcExperiment.pdf

    Artifacts: https://doi.org/10.5281/zenodo.8191782

    License: GPL V3.0

    This package contains folders and files with code and data used in the study described in the paper. In the following, we first provide all fields required for the submission, and then report a detailed description of all repository folders.

    Artifact Description

    Purpose

    The artifact is about a controlled experiment aimed at investigating the extent to which Pythonic functional constructs have an impact on source code understandability. The artifact archive contains:

    The material to allow replicating the study (see Section Experimental-Material)

    Raw quantitative results, working datasets, and scripts to replicate the statistical analyses reported in the paper. Specifically, the executable part of the replication package reproduces figures and tables of the quantitative analysis (RQ1 and RQ2) of the paper starting from the working datasets.

    Spreadsheets used for the qualitative analysis (RQ3).

    We apply for the following badges:

    Available and reusable: because we provide all the material that can be used to replicate the experiment, but also to perform the statistical analyses and the qualitative analyses (spreadsheets, in this case)

    Provenance

    Paper preprint link: https://mdipenta.github.io/files/ICSE24_funcExperiment.pdf

    Artifacts: https://doi.org/10.5281/zenodo.8191782

    Data

    Results have been obtained by conducting the controlled experiment involving Prolific workers as participants. Data collection and processing followed a protocol approved by the University ethical board. Note that all data enclosed in the artifact is completely anonymized and does not contain sensitive information.

    Further details about the provided dataset can be found in the Section Results' directory and files

    Setup and Usage (for executable artifacts):

    See the Section Scripts to reproduce the results, and instructions for running them

    Experiment-Material/

    Contains the material used for the experiment, and, specifically, the following subdirectories:

    Google-Forms/

    Contains (as PDF documents) the questionnaires submitted to the ten experimental groups.

    Task-Sources/

    Contains, for each experimental group (G-1...G-10), the sources used to produce the Google Forms, and, specifically:

    - The cover letter (Letter.docx).

    - A directory for each experimental task (Lambda 1, Lambda 2, Comp 1, Comp 2, MRF 1, MRF 2, Lambda Comparison, Comp Comparison, MRF Comparison). Each directory contains (i) the exercise text (in both Word and .txt format), (ii) the source code snippet, and (iii) its .png image to be used in the form.

    Note: the "Comparison" tasks do not have any exercise as the purpose is always the same, i.e., to compare the (perceived) understandability of the snippets and return the results of the comparison.

    Code-Examples-Table1/

    Contains the source code snippets used as objects of the study (the same you can find under "Task-Sources/"), named as reported in Table 1.

    Results' directory and files

    raw-responses/

    Contains, as spreadsheets, the raw responses provided by the study participants through Google forms.

    raw-results-RQ1/

    Contains the raw results for RQ1. Specifically, the directory contains a subdirectory for each group (G1-G10). Each subdirectory contains:

    - For each user (named using their Prolific IDs), a directory containing, for each question (Q1-Q6), the produced Python code (Qn.py), its output (QnR.txt), and its StdErr output (QnErr.txt).

    - "expected-outputs/": a directory containing the expected outputs for each task (Qn.txt).

    working-results/RQ1-RQ2-files-for-statistical-analysis/

    Contains three .csv files used as input for conducting the statistical analysis and drawing the graphs for addressing the first two research questions of the study. Specifically:

    ConstructUsage.csv contains the declared frequency usage of the three functional constructs object of the study. This file is used to draw Figure 4. The file contains an entry for each participant, reporting the (text-coded) frequency of construct usage for Comprehension, Lambda, and MRF.

    RQ1.csv contains the collected data used for the mixed-effect logistic regression relating the use of functional constructs with the correctness of the change task, as well as the logistic regression relating the use of map/reduce/filter functions with the correctness of the change task. The csv file contains an entry for each answer provided by each subject, and features the following columns:

    Group: experimental group to which the participant is assigned

    User: user ID

    Time: task time in seconds

    Approvals: number of approvals on previous tasks performed on Prolific

    Student: whether the participant declared themselves as a student

    Section: section of the questionnaire (lambda, comp, or mrf)

    Construct: specific construct being presented (same as "Section" for lambda and comp, for mrf it says whether it is a map, reduce, or filter)

    Question: question id, from Q1 to Q6, indicating the ordering of the questions

    MainFactor: main factor treatment for the given question - "f" for functional, "p" for procedural counterpart

    Outcome: TRUE if the task was correctly performed, FALSE otherwise

    Complexity: cyclomatic complexity of the construct (empty for mrf)

    UsageFrequency: usage frequency of the given construct

    RQ1Paired-RQ2.csv contains the collected data used for the ordinal logistic regression of the relationship between the perceived ease of understanding of the functional constructs and (i) participants' usage frequency, and (ii) constructs' complexity (except for map/reduce/filter). The file features a row for each participant, and the columns are the following:

    Group: experimental group to which the participant is assigned

    User: user ID

    Time: task time in seconds

    Approvals: number of approvals on previous tasks performed on Prolific

    Student: whether the participant declared themselves as a student

    LambdaF: result for the change task related to a lambda construct

    LambdaP: result for the change task related to the procedural counterpart of a lambda construct

    CompF: result for the change task related to a comprehension construct

    CompP: result for the change task related to the procedural counterpart of a comprehension construct

    MrfF: result for the change task related to an MRF construct

    MrfP: result for the change task related to the procedural counterpart of an MRF construct

    LambdaComp: perceived understandability level for the comparison task (RQ2) between a lambda and its procedural counterpart

    CompComp: perceived understandability level for the comparison task (RQ2) between a comprehension and its procedural counterpart

    MrfComp: perceived understandability level for the comparison task (RQ2) between an MRF and its procedural counterpart

    LambdaCompCplx: cyclomatic complexity of the lambda construct involved in the comparison task (RQ2)

    CompCompCplx: cyclomatic complexity of the comprehension construct involved in the comparison task (RQ2)

    MrfCompType: type of MRF construct (map, reduce, or filter) used in the comparison task (RQ2)

    LambdaUsageFrequency: self-declared usage frequency on lambda constructs

    CompUsageFrequency: self-declared usage frequency on comprehension constructs

    MrfUsageFrequency: self-declared usage frequency on MRF constructs

    LambdaComparisonAssessment: outcome of the manual assessment of the answer to the "check question" required for the lambda comparison ("yes" means valid, "no" means wrong, "moderatechatgpt" and "extremechatgpt" are the results of GPTZero)

    CompComparisonAssessment: as above, but for comprehension

    MrfComparisonAssessment: as above, but for MRF

    working-results/inter-rater-RQ3-files/

    This directory contains four .csv files used as input for computing the inter-rater agreement for the manual labeling used for addressing RQ3. Specifically, you will find one file for each functional construct, i.e., comprehension.csv, lambda.csv, and mrf.csv, and a different file used for highlighting the reasons why participants prefer to use the procedural paradigm, i.e., procedural.csv.

    working-results/RQ2ManualValidation.csv

    This file contains the results of the manual validation being done to sanitize the answers provided by our participants used for addressing RQ2. Specifically, we coded the behaviour description using four different levels: (i) correct ("yes"), (ii) somewhat correct ("partial"), (iii) wrong ("no"), and (iv) automatically generated. The file features a row for each participant, and the columns are the following:

    ID: ID we used to refer the participant in the paper's qualitative analysis

    Group: experimental group to which the participant is assigned

    ProlificID: user ID

    Comparison for lambda construct description: answer provided by the user for the lambda comparison task

    Final Classification: our assessment of the lambda comparison answer

    Comparison for comprehension description: answer provided by the user for the comprehension comparison task

    Final Classification: our assessment of the comprehension comparison answer

    Comparison for MRF description: answer provided by the user for the MRF comparison task

    Final Classification: our assessment of the MRF comparison answer

    working-results/RQ3ManualValidation.xlsx

    This file contains the results of the open coding applied to address our third research question. Specifically, you will find four sheets, one for each functional construct and one for the procedural paradigm. Each sheet reports the provided answers together with the categories assigned to them. Each sheet contains the following columns:

    ID: ID we used to refer the participant in the paper's qualitative analysis
