100+ datasets found
  1. Exploratory Data Analysis on Automobile Dataset

    • kaggle.com
    zip
    Updated Sep 12, 2022
    Cite
    Monis Ahmad (2022). Exploratory Data Analysis on Automobile Dataset [Dataset]. https://www.kaggle.com/datasets/monisahmad/automobile
    Available download formats: zip (4915 bytes)
    Dataset updated
    Sep 12, 2022
    Authors
    Monis Ahmad
    Description

    This dataset was created by Monis Ahmad.

  2. Sample data files for Python Course

    • figshare.com
    txt
    Updated Nov 4, 2022
    Cite
    Peter Verhaar (2022). Sample data files for Python Course [Dataset]. http://doi.org/10.6084/m9.figshare.21501549.v1
    Available download formats: txt
    Dataset updated
    Nov 4, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Peter Verhaar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample data set used in an introductory course on Programming in Python

  3. sales_data

    • kaggle.com
    zip
    Updated Jan 31, 2024
    Cite
    Amalawa Max Ogbomo (2024). sales_data [Dataset]. https://www.kaggle.com/datasets/amalawaogbomo/sales-data
    Available download formats: zip (1827 bytes)
    Dataset updated
    Jan 31, 2024
    Authors
    Amalawa Max Ogbomo
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset was created by Amalawa Max Ogbomo.

    Released under Apache 2.0.

  4. Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results

    • dataverse.harvard.edu
    • figshare.com
    Updated Mar 21, 2022
    Cite
    Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.7910/DVN/SXMSDZ
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 21, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Elizabeth Szkirpan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).

  5. Data analysis codes

    • figshare.com
    txt
    Updated Sep 7, 2024
    Cite
    Dr Auguste Vadisiute; Fernando Messore; Marissa Mueller (2024). Data analysis codes [Dataset]. http://doi.org/10.6084/m9.figshare.26963674.v1
    Available download formats: txt
    Dataset updated
    Sep 7, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dr Auguste Vadisiute; Fernando Messore; Marissa Mueller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data analysis scripts for neurons, glial cells, and interneurons.

  6. Ecommerce Dataset for Data Analysis

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Cite
    Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
    Available download formats: zip (2028853 bytes)
    Dataset updated
    Sep 19, 2024
    Authors
    Shrishti Manja
    Description

    This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

    About the Dataset:

    - CID (Customer ID): A unique identifier for each customer.
    - TID (Transaction ID): A unique identifier for each transaction.
    - Gender: The gender of the customer, categorized as Male or Female.
    - Age Group: Age group of the customer, divided into several ranges.
    - Purchase Date: The timestamp of when the transaction took place.
    - Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
    - Discount Availed: Indicates whether the customer availed any discount (Yes/No).
    - Discount Name: Name of the discount applied (e.g., FESTIVE50).
    - Discount Amount (INR): The amount of discount availed by the customer.
    - Gross Amount: The total amount before applying any discount.
    - Net Amount: The final amount after applying the discount.
    - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
    - Location: The city where the purchase took place.

    Use Cases:

    1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data (see the sketch below).
    2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
    3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
    4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

    This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

    This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning.
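    As a quick illustration of the EDA use case above, here is a minimal pandas sketch. The file name ecommerce_dataset.csv is a placeholder (the actual file inside the zip may be named differently); the column names come from the dataset description.

    import pandas as pd

    # Load the extracted CSV (placeholder file name)
    df = pd.read_csv("ecommerce_dataset.csv")

    # Basic structure and summary statistics
    print(df.shape)
    print(df.dtypes)
    print(df["Net Amount"].describe())

    # Average spend with and without a discount
    print(df.groupby("Discount Availed")["Net Amount"].mean())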

  7. Pandas Practice Dataset

    • kaggle.com
    zip
    Updated Jan 27, 2023
    Cite
    Mrityunjay Pathak (2023). Pandas Practice Dataset [Dataset]. https://www.kaggle.com/datasets/themrityunjaypathak/pandas-practice-dataset/discussion
    Available download formats: zip (493 bytes)
    Dataset updated
    Jan 27, 2023
    Authors
    Mrityunjay Pathak
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    What is Pandas?

    Pandas is a Python library used for working with data sets.

    It has functions for analyzing, cleaning, exploring, and manipulating data.

    The name "Pandas" is a reference to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008.

    Why Use Pandas?

    Pandas allows us to analyze big data and draw conclusions based on statistical theory.

    Pandas can clean messy data sets, and make them readable and relevant.

    Relevant data is very important in data science.

    What Can Pandas Do?

    Pandas gives you answers about the data, like:

    Is there a correlation between two or more columns?

    What is the average value?

    What is the max value?

    What is the min value?
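    A minimal sketch of those questions in pandas, assuming the practice file is saved locally as data.csv (a placeholder name):

    import pandas as pd

    df = pd.read_csv("data.csv")  # placeholder file name

    print(df.corr(numeric_only=True))  # correlation between numeric columns
    print(df.mean(numeric_only=True))  # average value of each numeric column
    print(df.max(numeric_only=True))   # max value of each numeric column
    print(df.min(numeric_only=True))   # min value of each numeric column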

  8. Python use cases globally 2022

    • statista.com
    Updated Jul 11, 2025
    Cite
    Statista (2025). Python use cases globally 2022 [Dataset]. https://www.statista.com/statistics/1338409/python-use-cases/
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2022 - Dec 2022
    Area covered
    Worldwide
    Description

    Python has become one of the most popular programming languages, with a wide variety of use cases. In 2022, Python was most used for web development and data analysis, at ** percent and ** percent respectively.

  9. Python script with data analysis.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Sep 21, 2022
    Cite
    Romanowska, Iza; Raja, Rubina; Jiménez, Joan Campmany; Seland, Eivind H. (2022). Python script with data analysis. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000444350
    Dataset updated
    Sep 21, 2022
    Authors
    Romanowska, Iza; Raja, Rubina; Jiménez, Joan Campmany; Seland, Eivind H.
    Description

    The file is prepared for use with Jupyter Notebook. It contains data analysis for climate proxies and estimates of carrying capacity over time. (IPYNB)

  10. COM model and data analysis scripts

    • figshare.com
    • search.datacite.org
    txt
    Updated Jan 19, 2016
    Cite
    José Pedro Correia (2016). COM model and data analysis scripts [Dataset]. http://doi.org/10.6084/m9.figshare.1428652.v1
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    José Pedro Correia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This fileset contains scripts used for model implementation, simulation execution, and data processing for the work presented in J.P. Correia, R. Ocelák, and J. Mašek's "Towards more realistic modeling of linguistic color categorization" (to appear). The Python script for model implementation and simulation execution is adapted from an implementation originally by Gerhard Jaeger, later extended by Michael Franke. The code is provided as is to support a deeper understanding of the details involved in the data analysis we carried out. It is not fully organized or documented (it might even be a bit hacky in places), and for that we apologize.

  11. Data from: PLEIAData: consumption, HVAC (Heating, Ventilation & Air Conditioning), temperature, weather and motion sensor data for smart buildings applications

    • zenodo.org
    • portalinvestigacion.um.es
    • +2 more
    zip
    Updated Feb 8, 2023
    + more versions
    Cite
    Antonio Martínez Ibarra; Aurora González-Vidal; Antonio Skarmeta Gómez (2023). PLEIAData: consumption, HVAC (Heating, Ventilation & Air Conditioning), temperature, weather and motion sensor data for smart buildings applications [Dataset]. http://doi.org/10.5281/zenodo.7620136
    Available download formats: zip
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio Martínez Ibarra; Aurora González-Vidal; Antonio Skarmeta Gómez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset presents detailed building operation data from the three blocks (A, B and C) of the Pleiades building of the University of Murcia, which is a pilot building of the European project PHOENIX. The aim of PHOENIX is to improve building efficiency, and therefore we included information on:

    (i) consumption data, aggregated by block in kWh;
    (ii) HVAC (Heating, Ventilation and Air Conditioning) data with several features, such as state (ON=1, OFF=0), operation mode (None=0, Heating=1, Cooling=2), setpoint and device type;
    (iii) indoor temperature per room;
    (iv) weather data, including temperature, humidity, radiation, dew point, wind direction and precipitation;
    (v) carbon dioxide and presence data for a few rooms;
    (vi) relationships between the HVAC, temperature, carbon dioxide and presence sensor identifiers and their respective rooms and blocks.

    Weather data was acquired from the IMIDA (Instituto Murciano de Investigación y Desarrollo Agrario y Alimentario).
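    A minimal sketch of decoding the HVAC state and operation-mode codes listed above; the column names ("state", "mode") are illustrative assumptions, not taken from the dataset files:

    import pandas as pd

    STATE = {0: "OFF", 1: "ON"}
    MODE = {0: "None", 1: "Heating", 2: "Cooling"}

    # Toy frame standing in for the HVAC table (column names are assumed)
    hvac = pd.DataFrame({"state": [0, 1, 1], "mode": [0, 1, 2]})
    hvac["state_label"] = hvac["state"].map(STATE)
    hvac["mode_label"] = hvac["mode"].map(MODE)
    print(hvac)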

  12. Data from: tableone: An open source Python package for producing summary statistics for research papers

    • zenodo.org
    • search.dataone.org
    • +1 more
    csv, txt
    Updated May 30, 2022
    Cite
    Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark (2022). Data from: tableone: An open source Python package for producing summary statistics for research papers [Dataset]. http://doi.org/10.5061/dryad.26c4s35
    Available download formats: csv, txt
    Dataset updated
    May 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Objectives: In quantitative research, understanding basic parameters of the study population is key for interpretation of the results. As a result, it is typical for the first table ("Table 1") of a research paper to include summary statistics for the study data. Our objectives are 2-fold. First, we seek to provide a simple, reproducible method for providing summary statistics for research papers in the Python programming language. Second, we seek to use the package to improve the quality of summary statistics reported in research papers.

    Materials and Methods: The tableone package is developed following good practice guidelines for scientific computing and all code is made available under a permissive MIT License. A testing framework runs on a continuous integration server, helping to maintain code stability. Issues are tracked openly and public contributions are encouraged.

    Results: The tableone software package automatically compiles summary statistics into publishable formats such as CSV, HTML, and LaTeX. An executable Jupyter Notebook demonstrates application of the package to a subset of data from the MIMIC-III database. Tests such as Tukey's rule for outlier detection and Hartigan's Dip Test for modality are computed to highlight potential issues in summarizing the data.

    Discussion and Conclusion: We present open source software for researchers to facilitate carrying out reproducible studies in Python, an increasingly popular language in scientific research. The toolkit is intended to mature over time with community feedback and input. Development of a common tool for summarizing data may help to promote good practice when used as a supplement to existing guidelines and recommendations. We encourage use of tableone alongside other methods of descriptive statistics and, in particular, visualization to ensure appropriate data handling. We also suggest seeking guidance from a statistician when using tableone for a research study, especially prior to submitting the study for publication.
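    As a minimal usage sketch of the package described above, the snippet below builds a grouped "Table 1" from a toy dataframe; the columns and groups are illustrative and not taken from the MIMIC-III subset:

    import pandas as pd
    from tableone import TableOne

    df = pd.DataFrame({
        "age": [34, 61, 47, 55, 29, 72],
        "sex": ["F", "M", "F", "M", "F", "M"],
        "group": ["treated", "control", "treated", "control", "treated", "control"],
    })

    # Summary statistics for age and sex, stratified by group
    table = TableOne(df, columns=["age", "sex"], categorical=["sex"], groupby="group")
    print(table.tabulate(tablefmt="github"))
    table.to_csv("table1.csv")  # CSV, HTML, and LaTeX exports are also supported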

  13. Data analysis V5 for python.xlsx

    • figshare.com
    xlsx
    Updated May 8, 2025
    Cite
    Pingfei Jiang (2025). Data analysis V5 for python.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.28956233.v1
    Available download formats: xlsx
    Dataset updated
    May 8, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Pingfei Jiang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the original data processed for the manuscript "A Comparative Study on Retrieval-Augmented Generation and Chain-of-Thought Applications for LLM-Assisted Engineering Design Ideation".

  14. Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations"

    • catalog.data.gov
    • gimi9.com
    Updated Nov 7, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-a-data-engineering-framework-for-chemical-flow-analysis-of-industr
    Dataset updated
    Nov 7, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The EPA GitHub repository PAU4Chem, as described in its README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flow transfers, estimating releases, and identifying potential occupational exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories.

    The EPA GitHub repository PAU_case_study, as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly-available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset was built using the repository PAU4Chem.

    Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly-available sources: EPA, NOAA, OSHA, and the Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). All GitHub repositories describe the Python libraries required for running their code, how to use them, the output files obtained after running the Python script modules, and the corresponding EPA Disclaimer.

    This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).

  15. S&P 500 Companies Analysis Project

    • kaggle.com
    zip
    Updated Apr 6, 2025
    Cite
    anshadkaggle (2025). S&P 500 Companies Analysis Project [Dataset]. https://www.kaggle.com/datasets/anshadkaggle/s-and-p-500-companies-analysis-project
    Available download formats: zip (9721576 bytes)
    Dataset updated
    Apr 6, 2025
    Authors
    anshadkaggle
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This project focuses on analyzing the S&P 500 companies using data analysis tools like Python (Pandas), SQL, and Power BI. The goal is to extract insights related to sectors, industries, locations, and more, and visualize them using dashboards.

    Included Files:

    sp500_cleaned.csv – Cleaned dataset used for analysis

    sp500_analysis.ipynb – Jupyter Notebook (Python + SQL code)

    dashboard_screenshot.png – Screenshot of Power BI dashboard

    README.md – Summary of the project and key takeaways

    This project demonstrates practical data cleaning, querying, and visualization skills.
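    A minimal sketch of loading the included sp500_cleaned.csv with pandas; the "Sector" column name is an assumption based on the project description:

    import pandas as pd

    sp500 = pd.read_csv("sp500_cleaned.csv")
    print(sp500.head())

    # Example insight: number of companies per sector (assumed column name)
    print(sp500["Sector"].value_counts())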

  16. Python Program Development Report

    • datainsightsmarket.com
    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jan 25, 2025
    Cite
    Data Insights Market (2025). Python Program Development Report [Dataset]. https://www.datainsightsmarket.com/reports/python-program-development-1402872
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jan 25, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Python Program Development market was valued at USD XXX million in 2024 and is projected to reach USD XXX million by 2033, with an expected CAGR of XX% during the forecast period.

  17. DustNet - structured data and Python code to reproduce the model, statistical analysis and figures

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1 more
    Updated Jul 7, 2024
    Cite
    Nowak, T. E.; Augousti, Andy T.; Simmons, Benno I.; Siegert, Stefan (2024). DustNet - structured data and Python code to reproduce the model, statistical analysis and figures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10631953
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    University of Exeter
    Kingston University
    Authors
    Nowak, T. E.; Augousti, Andy T.; Simmons, Benno I.; Siegert, Stefan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and Python code used for AOD (aerosol optical depth) prediction with the DustNet model - a machine learning/AI based forecasting approach.

    Model input data and code

    Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables* ready to reproduce the DustNet model results or for similar forecasting with Machine Learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as the Jupyter Notebook ‘DustNet_model_code.ipynb’. A subfolder with normalised data split into training/validation/testing sets is also provided, along with Python code for two additional ML-based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.

    Model output data and code

    This dataset was constructed by running ‘DustNet_model_code.ipynb’ (see above). It consists of 1095 days of forecast AOD data (2020-2022) from CAMS, the DustNet model, a naïve prediction (persistence) and gridded climatology. The ground truth raw AOD data from MODIS is provided for comparison and statistical analysis of predictions. It is intended for quick reproduction of the figures and statistical analysis presented in the paper introducing DustNet.

    *datasets are NumPy arrays (v1.23) created in Python v3.8.18.

    **all ML models were created with Keras in Python v3.10.10.
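    A minimal sketch of loading one of the provided arrays with NumPy; the file name aod_timeseries.npy is a placeholder for the actual array files in the archive:

    import numpy as np

    aod = np.load("aod_timeseries.npy")  # placeholder file name
    print(aod.shape, aod.dtype)

    # Daily time series for 2003-2022: the leading axis should index days
    print("days:", aod.shape[0])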

  18. Python and R Basics for Environmental Data Sciences

    • hydroshare.org
    zip
    Updated Nov 1, 2020
    Cite
    Tao Wen (2020). Python and R Basics for Environmental Data Sciences [Dataset]. https://www.hydroshare.org/resource/114e5092ab684bd9beb9fc845a25a087
    Available download formats: zip (282.7 MB)
    Dataset updated
    Nov 1, 2020
    Dataset provided by
    HydroShare
    Authors
    Tao Wen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This resource collects teaching materials that were originally created for the in-person course 'GEOSC/GEOG 497 – Data Mining in Environmental Sciences' at Penn State University (co-taught by Tao Wen, Susan Brantley, and Alan Taylor) and then refined/revised by Tao Wen for use in the online teaching module 'Data Science in Earth and Environmental Sciences' hosted on the NSF-sponsored HydroLearn platform.

    This resource includes both R Notebooks and Python Jupyter Notebooks to teach the basics of R and Python coding, data analysis and data visualization, as well as building machine learning models in both programming languages by using authentic research data and questions. All of these R/Python scripts can be executed either on the CUAHSI JupyterHub or on your local machine.

    This resource is shared under the CC-BY license. Please contact the creator Tao Wen at Syracuse University (twen08@syr.edu) for any questions you have about this resource. If you identify any errors in the files, please contact the creator.

  19. Replication Package: Unboxing Default Argument Breaking Changes in 1 + 2 Data Science Libraries in Python

    • zenodo.org
    application/gzip
    Updated Jul 15, 2024
    Cite
    João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Daniel Prates; Arthur Bonifácio; Ghizlane El Boussaidi (2024). Replication Package: Unboxing Default Argument Breaking Changes in 1 + 2 Data Science Libraries in Python [Dataset]. http://doi.org/10.5281/zenodo.11584961
    Available download formats: application/gzip
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Daniel Prates; Arthur Bonifácio; Ghizlane El Boussaidi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package

    This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".

    Requirements

    We recommend the following requirements to replicate our study:

    1. Internet access
    2. At least 100GB of space
    3. Docker installed
    4. Git installed

    Package Structure

    We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:

    • data-analysis, an R-based Container we used to run our data analysis.
    • data-collection, a Python Container we used to collect Scikit's default arguments and detect them in client applications.
    • database, a Postgres Container we used to store clients' data, obtained from Grotov et al.
    • storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared in both containers.
    • docker-compose.yml, the Docker Compose file that configures all containers used in the package.

    In the remainder of this document, we describe how to set up each container properly.

    Using VSCode to Setup the Package

    We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both the data-analysis and data-collection containers, so you can access and run each container directly from the IDE without any extra configuration.

    You first need to set up the containers:

    $ cd /replication/package/folder
    $ docker-compose build
    $ docker-compose up
    # Wait for Docker to create and run all containers
    

    Then, you can open them in Visual Studio Code:

    1. Open VSCode in project root folder
    2. Access the command palette and select "Dev Container: Reopen in Container"
      1. Select either Data Collection or Data Analysis.
    3. Start working

    If you want/need a more customized organization, the remainder of this file describes it in detail.

    Longest Road: Manual Package Setup

    Database Setup

    The database container will automatically restore the dump in dump_matroskin.tar on its first launch. To set up and run the container, you should:

    Build an image:

    $ cd ./database
    $ docker build --tag 'dabc-database' .
    $ docker image ls
    REPOSITORY      TAG      IMAGE ID       CREATED          SIZE
    dabc-database   latest   b6f8af99c90d   50 minutes ago   18.5GB
    

    Create and enter the container:

    $ docker run -it --name dabc-database-1 dabc-database
    $ docker exec -it dabc-database-1 /bin/bash
    root# psql -U postgres -h localhost -d jupyter-notebooks
    jupyter-notebooks=# \dt
           List of relations
     Schema |       Name        | Type  | Owner
    --------+-------------------+-------+-------
     public | Cell              | table | root
     public | Code_cell         | table | root
     public | Md_cell           | table | root
     public | Notebook          | table | root
     public | Notebook_features | table | root
     public | Notebook_metadata | table | root
     public | repository        | table | root
    

    If you got the tables list as above, your database is properly setup.

    It is important to mention that this database is extended from the one provided by Grotov et al. Basically, we added three columns to the table Notebook_features (API_functions_calls, defined_functions_calls, and other_functions_calls) containing the function calls performed by each client in the database.

    Data Collection Setup

    This container is responsible for collecting the data to answer our research questions. It has the following structure:

    • dabcs.py, extracts DABCs from Scikit Learn source code and exports them to a CSV file.
    • dabcs-clients.py, extracts function calls from clients and exports them to a CSV file. We rely on a modified version of Matroskin to collect the function calls. You can find the tool's source code in the `matroskin` directory.
    • Makefile, commands to set up and run both dabcs.py and dabcs-clients.py
    • matroskin, the directory containing the modified version of the matroskin tool. We extended the library to collect the function calls performed in the client notebooks of Grotov's dataset.
    • storage, a docker volume where the data-collection should save the exported data. This data will be used later in Data Analysis.
    • requirements.txt, Python dependencies adopted in this module.

    Note that the container will automatically configure this module for you, e.g., install dependencies, configure matroskin, download scikit learn source code, etc. For this, you must run the following commands:

    $ cd ./data-collection
    $ docker build --tag "data-collection" .
    $ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
    $ docker exec -it data-collection-1 /bin/bash
    $ ls
    Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py
    

    If you see project files, it means the container is configured accordingly.

    Data Analysis Setup

    We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:

    • dependencies.R, an R script containing the dependencies used in our data analysis.
    • data-analysis.Rmd, the R notebook we used to perform our data analysis
    • datasets, a docker volume pointing to the storage directory.

    Execute the following commands to run this container:

    $ cd ./data-analysis
    $ docker build --tag "data-analysis" .
    $ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-collection/datasets/ data-analysis
    $ docker exec -it data-analysis-1 /bin/bash
    $ ls
    data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile
    

    If you see project files, it means the container is configured accordingly.

    A note on storage shared folder

    As mentioned, the storage folder is mounted as a volume and shared between the data-collection and data-analysis containers. We compressed the content of this folder due to space constraints. Therefore, before starting work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the Makefile inside the storage folder.

    $ make unzip # extract files
    $ ls
    clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv
    $ make zip # compress files
    $ ls
    csv-files.tar.gz Makefile
  20. Python Script for Simulating, Analyzing, and Evaluating Statistical Mirroring-Based Ordinalysis and Other Estimators

    • data.mendeley.com
    Updated Jun 5, 2025
    Cite
    Kabir Bindawa Abdullahi (2025). Python Script for Simulating, Analyzing, and Evaluating Statistical Mirroring-Based Ordinalysis and Other Estimators [Dataset]. http://doi.org/10.17632/zdhy83cv4p.3
    Dataset updated
    Jun 5, 2025
    Authors
    Kabir Bindawa Abdullahi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This presentation involves simulation and data generation processes, data analysis, and evaluation of classical and proposed methods of ordinal data analysis. All the parameters and metrics used are based on the methodology presented in the article titled "Statistical Mirroring-Based Ordinalysis: A Sensitive, Robust, Efficient, and Ordinality-Preserving Descriptive Method for Analyzing Ordinal Assessment Data," authored by Kabir Bindawa Abdullahi in 2024. For further details, you can follow the paper's publication submitted to MethodsX Elsevier Publishing.

    The validation process of ordinal data analysis methods (estimators) has the following specifications:

    • Simulation process: Monte Carlo simulation.
    • Data generation distributions: categorical, normal, and multivariate model distributions.
    • Data analysis:
      - Classical estimators: sum, average, and median ordinal score.
      - Proposed estimators: Kabirian coefficient of proximity, probability of proximity, probability of deviation.
    • Evaluation metrics:
      - Overall estimates average.
      - Overall estimates median.
      - Efficiency (by statistical absolute meanic deviation method).
      - Sensitivity (by entropy method).
      - Normality, Mann-Whitney U test, and others.
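    A minimal sketch of the classical estimators named above (sum, average, and median ordinal score) on Monte Carlo-simulated categorical ordinal data; the 5-point scale, sample size, and category probabilities are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(42)
    # Simulate 1,000 respondents on a 5-point ordinal scale
    ratings = rng.choice([1, 2, 3, 4, 5], size=1000, p=[0.1, 0.2, 0.4, 0.2, 0.1])

    print("sum ordinal score:", ratings.sum())
    print("average ordinal score:", ratings.mean())
    print("median ordinal score:", np.median(ratings))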
