This dataset was created by Ranjeet Jain
The bank.csv dataset describes phone calls between customers and the customer-care staff of a Portuguese banking institution. Each record indicates whether the customer took up a scheme or product such as a bank term deposit, so most of the attributes hold 'yes'/'no' type values.
The main goal is to predict if clients will subscribe to a term deposit or not.
Bank Client Data: 1 - age: (numeric) 2 - job: type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown) 3 - marital: marital status (categorical: divorced, married, single, unknown; note: divorced means either divorced or widowed) 4 - education: (categorical: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown) 5 - default: has credit in default? (categorical: no, yes, unknown) 6 - housing: has housing loan? (categorical: no, yes, unknown) 7 - loan: has personal loan? (categorical: no, yes, unknown)
Related to the Last Contact of the Current Campaign: 8 - contact: contact communication type (categorical: cellular, telephone) 9 - month: last contact month of year (categorical: jan, feb, mar, ..., nov, dec) 10 - day_of_week: last contact day of the week (categorical: mon, tue, wed, thu, fri) 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other Attributes: 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact) 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) 14 - previous: number of contacts performed before this campaign and for this client (numeric) 15 - poutcome: outcome of the previous marketing campaign (categorical: failure, nonexistent, success)
Social and Economic Context Attributes: 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric) 17 - cons.price.idx: consumer price index - monthly indicator (numeric) 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 19 - euribor3m: euribor 3 month rate - daily indicator (numeric) 20 - nr.employed: number of employees - quarterly indicator (numeric)
Output Variable (Desired Target): 21 - y (deposit): has the client subscribed to a term deposit? (binary: yes, no) -> column title changed from '***y***' to '***deposit***'
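Following the note above about discarding duration, a minimal preprocessing sketch (assuming a delimited bank.csv with the renamed deposit column; the delimiter is auto-detected because different copies use ',' or ';'):

```python
import pandas as pd

# Load the data; sep=None lets pandas sniff the delimiter.
df = pd.read_csv("bank.csv", sep=None, engine="python")

# The target may be called 'deposit' (renamed from 'y') depending on the version.
target_col = "deposit" if "deposit" in df.columns else "y"
y = (df[target_col] == "yes").astype(int)

# 'duration' is only known after the call, so drop it for a realistic model.
X = df.drop(columns=[target_col, "duration"], errors="ignore")

# One-hot encode the categorical attributes before fitting any classifier.
X = pd.get_dummies(X, drop_first=True)
print(X.shape, y.mean())
```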
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Waiter's Tips Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aminizahra/tips-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
One waiter recorded information about each tip he received over a period of a few months working in one restaurant. In all he recorded 244 tips.
The data was reported in a collection of case studies for business statistics.
Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing
The dataset is also available through the Python package Seaborn.
Note that this version of the dataset has additional columns compared to other copies of the tips dataset.
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
# Column Non-Null Count Dtype
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null object
3 smoker 244 non-null object
4 day 244 non-null object
5 time 244 non-null object
6 size 244 non-null int64
7 price_per_person 244 non-null float64
8 Payer Name 244 non-null object
9 CC Number 244 non-null int64
10 Payment ID 244 non-null object
dtypes: float64(3), int64(2), object(6)
total_bill a numeric vector, the bill amount (dollars)
tip a numeric vector, the tip amount (dollars)
sex a factor with levels Female Male, gender of the payer of the bill
smoker a factor with levels No Yes, whether the party included smokers
day a factor with levels Friday Saturday Sunday Thursday, day of the week
time a factor with levels Day Night, rough time of day
size a numeric vector, number of people in the party
--- Original source retains full ownership of the source dataset ---
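Since the core table also ships with seaborn, a quick exploration sketch (the seaborn version does not include the extra price_per_person, Payer Name, CC Number, and Payment ID columns of this Kaggle copy):

```python
import seaborn as sns

# Load the canonical tips data bundled with seaborn.
tips = sns.load_dataset("tips")
tips.info()

# Average tip by day of the week as a first look at the data.
print(tips.groupby("day")["tip"].mean())
```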
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: While the acquired risk factors of liver cancer in Asia are relatively well understood, the underlying genetic background of liver cancer in Asians has not been well established or correlated with clinical outcomes. Objective: To identify gene mutations linked with worse outcomes in Asian patients with hepatocellular carcinoma (HCC). Methods: A total of 347 Asian and Non-Asian patients with HCC were analyzed in this study. TCGA patient mutation and clinical data were downloaded through TCGAbiolinksGUI and analyzed using the Python NumPy, Matplotlib, seaborn, and SciPy libraries. Statistical significance was determined by Welch’s t-test (unequal variances t-test), with P-values < 0.05 considered to be statistically significant. Results: Mutations in five genes (TP53, TTN, OBSCN, MUC5B, CSMD1) were statistically linked with increased mortality in Asians compared to non-Asians, four of which (TTN, OBSCN, MUC5B, CSMD1) were also more prevalent in the Asian population. Within the Asian cohort, two gene mutations (TTN, HMCN1) were statistically linked with worse outcomes. The TP53 mutation predicts worse outcomes within the non-Asian cohort, but not within the Asian cohort. Conclusions: This study identified multiple genetic biomarkers that can aid in the recognition, surveillance, prognosis, and gene therapy of HCC.
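The study's analysis code is not included here; as a rough sketch of the named test (Welch's unequal-variances t-test via SciPy) on placeholder survival-time arrays:

```python
import numpy as np
from scipy import stats

# Hypothetical survival-time samples for mutation carriers vs. non-carriers;
# the real study used TCGA clinical data downloaded via TCGAbiolinksGUI.
carriers = np.array([14.2, 9.8, 22.1, 7.5, 18.3])
non_carriers = np.array([30.4, 25.7, 41.2, 28.9, 35.0, 27.6])

# Welch's t-test: equal_var=False selects the unequal-variances formulation.
t_stat, p_value = stats.ttest_ind(carriers, non_carriers, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 treated as significant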
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘A Waiter's Tips’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jsphyg/tipping on 30 September 2021.
--- Dataset description provided by original source is as follows ---
One waiter recorded information about each tip he received over a period of a few months working in one restaurant. In all he recorded 244 tips.
Can you predict the tip amount?
The data was reported in a collection of case studies for business statistics.
Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing
The dataset is also available through the Python package Seaborn.
--- Original source retains full ownership of the source dataset ---
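As a minimal baseline for the tip-prediction question above (a sketch on the seaborn copy of the data, not the original source's analysis):

```python
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

tips = sns.load_dataset("tips")
X = pd.get_dummies(tips.drop(columns="tip"), drop_first=True)
y = tips["tip"]

# Simple hold-out evaluation of an ordinary least-squares baseline.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```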
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html
Data for 2D Lagrangian Particle tracking and evaluation for their hydrodynamic characteristics

## Abstract
This dataset entails Python code for the fluid-mechanic evaluation of Lagrangian Particles with the "Consensus-Based tracking with Selective Rejection of Tracklets" (CSRT) algorithm in the OpenCV library, written by Ryan Rautenbach in the framework of his Master thesis.

## Workflow for Lagrangian Particle tracking and evaluation via OpenCV
In the following, a brief introduction and guide based on the folders in the repository is laid out. More code-specific instructions can be found in the respective codes.

working_env_RMR.yml --> Contains the entire environment, including software versions (here used with the Spyder IDE and Conda), with which the datasets were evaluated.
01 --> Tracking always begins with the 01_milti[...] folder, in which the Python code with the OpenCV algorithm is located. For the tracking to work, certain directories are required: one in which the raw images are stored (separate from anything else) and one in which the results are saved (not the same directory as the raw data). After tracking is completed for all respective experiments and the results directories are adequately labelled and stored, any of the other code files can be used for the respective analyses. The order of the folders beyond the first 01 directory has no relevance to the order of evaluation, but following it can ease the understanding of the evaluated data.
02 --> Evaluation of the number of circulations and the respective circulation times in the experimental vat. (The code can be extended to calculate the circulation time based on the various planes that are artificially set.)
03 --> Code for calculating the number of contacts with the vat floor. The code requires certain visual evaluations based on the LP trajectories, as the plane/barrier for the contact evaluation has to be set manually.
04 --> Contains two codes that can be applied to the results data to combine individual results into larger, more processable arrays within Python.
05 --> Contains the code to plot the trajectory of single experiments of Lagrangian particles based on their positional results and the velocity at the respective position, highlighting the trajectory over the experiment.
06 --> Codes to create 1D histograms based on the probability density distribution and velocity distributions in cumulative experiments.
07 --> Codes for plotting the 2D probability density distribution (2D histograms) of Lagrangian Particles based on the cumulative experiments. The code provides values for the 2D grid; plotting is conducted in Origin Lab or similar graphing tools. Graphing can also be conducted in Python, for which the seaborn (matplotlib) library is suggested.
08 --> Contains the code for the dimensionless evaluation of the results based on the respective Stokes number approaches and weighted averages. 2D histograms are also vital to this evaluation, and the plotting is again conducted in Origin Lab, as values are only calculated in the code.
09 --> Does not contain any Python codes, but instead contains the respective Origin Lab files for the graphing, plotting, and evaluation of results calculated via Python. Respective tables, histograms, and heat maps are given to be used as templates if necessary.
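The tracking code itself lives in the repository; a minimal sketch of CSRT-based tracking with OpenCV (the image paths and initial bounding box below are placeholders) looks roughly like:

```python
import glob

import cv2

# Hypothetical directory; the real workflow keeps raw images separate from results.
frames = sorted(glob.glob("raw_images/*.png"))

# Initialise the CSRT tracker on the first frame with a manually chosen
# bounding box around the particle (x, y, width, height).
first = cv2.imread(frames[0])
bbox = (100, 120, 30, 30)  # placeholder values
tracker = cv2.TrackerCSRT_create()
tracker.init(first, bbox)

positions = []
for path in frames[1:]:
    frame = cv2.imread(path)
    ok, bbox = tracker.update(frame)
    if not ok:
        break  # tracking lost; real code would handle re-initialisation
    x, y, w, h = bbox
    positions.append((x + w / 2, y + h / 2))  # particle centre per frame
```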
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The goal of this project is to classify apple leaf diseases using a Vision Transformer (ViT) model. The dataset is divided into four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The script relies on matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.
A walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class. The images are organised into Train, Val, and Test directories, each containing subdirectories for the four classes.
The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.
The model includes a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer (a generic sketch of such a layer is given after the usage steps below).
Results are visualized with seaborn to provide a clear understanding of the model's predictions.
Usage:
1. Dataset Preparation: organise the data into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
2. Install Required Libraries: pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
3. Run the Script
4. Analyze Results
5. Fine-Tuning
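The repository's exact Patches implementation is not reproduced in this description; a minimal sketch of such a patch-extraction layer in Keras (patch_size is an assumed hyperparameter, and details may differ from the actual script):

```python
import tensorflow as tf

class Patches(tf.keras.layers.Layer):
    """Cuts each image into non-overlapping patch_size x patch_size tiles."""

    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patch_dims = patches.shape[-1]
        # Flatten the spatial grid of patches into a sequence for the transformer.
        return tf.reshape(patches, [batch_size, -1, patch_dims])

# Example: a batch of 224x224 RGB images cut into 16x16 patches.
dummy = tf.random.uniform((2, 224, 224, 3))
print(Patches(16)(dummy).shape)  # (2, 196, 768)
```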
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Replication Kit for the Paper "Are Unit and Integration Test Definitions Still Valid for Modern Java Projects? An Empirical Study on Open-Source Projects"
This additional material shall provide other researchers with the ability to replicate our results. Furthermore, we want to facilitate further insights that might be generated based on our data sets.
Structure
The structure of the replication kit is as follows:
Additional Visualizations
We provide two additional visualizations for each project:
1) For each of these data sets there exists one visualization per project that shows four Venn diagrams, one for each of the different defect types. These Venn diagrams show the number of defects that were detected by unit tests, by integration tests, or by both.
2) Furthermore, we added boxplots for each of the data sets (i.e., ALL and DISJ) showing the scores of unit and integration tests for each defect type.
Analysis scripts
Requirements:
- python3.5
- tabulate
- scipy
- seaborn
- mongoengine
- pycoshark
- pandas
- matplotlib
Both python files contain all code for the statistical analysis we performed.
Data Collection Tools
We provide all data collection tools that we have implemented and used throughout our paper. Overall it contains six different projects and one python script:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data deposit contains synthetically generated crime data from the Agent-Based Reinforcement Learning Model of Burglary developed for the research article: Learning the Rational Choice Perspective: A Reinforcement Learning Approach to Simulating Offender Behaviours in Criminological Agent-Based Models
The data directory is as follows:
Model/
Data_Analysis_Notebook.ipynb MC1_Data MC2_Data MC3_Data
The Data_Analysis_Notebook.ipynb is the jupyter notebook used to produce the analysis within the article. This notebook requires python 3.* with packages such as matplotlib, seaborn, numpy, pandas, plotly, scipy to run.
The MC1, MC2 and MC3 folders contain the .txt files containing the data outputs used for analysis in the article. Where MC1 = Experiment Condition 1 in the article.
Each column of the data is described as follows:
AgentID: A unique agent identifier.
Action: The current action an agent has chosen; one of [OFFEND, DON'T OFFEND, MOVE].
Area: The locality in which the above action has taken place.
Target_Attractiveness: The target attractiveness value of the property that has been victimised.
Target_Reward: The reward at the property that has been victimised.
Target_Risk: The risk surrounding the property that has been victimised.
Target_Effort: The effort of the property victimised by the specific offender agent.
Total_Cumulative_Reward: The total sum of Target_Attractiveness acquired by the offender agent.
xAxisPos: The x-axis position of the cell the offender agent is currently in.
zAxisPos: The y-axis position of the cell the offender agent is currently in.
Zone_Travelled_To: The locality the offender agent is currently travelling towards.
Episode: The current episode.
Distance_To_Home: The normalised Euclidean distance to the offender agent's home node from the current victimised target.
Distance_To_Next_Node: The normalised Euclidean distance to the next routine activity node from the current victimised target.
Timestep: The current discrete time point.
Target_Cumulative_Reward: The total amount of Target_Attractiveness the offender agent wants to achieve.
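A loading sketch (assuming the .txt files are delimited tables whose header matches the columns above; the file pattern and delimiter are placeholders):

```python
import glob

import pandas as pd

# Placeholder path; adjust to the actual MC1_Data files. sep=None sniffs the delimiter.
frames = [
    pd.read_csv(path, sep=None, engine="python")
    for path in sorted(glob.glob("MC1_Data/*.txt"))
]
data = pd.concat(frames, ignore_index=True)

# Example query: how often each action was chosen per episode.
print(data.groupby(["Episode", "Action"]).size().unstack(fill_value=0).head())
```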
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the supplementary material of the paper "Wealth Consumption, Sociopolitical Organization, and Change: A Perspective from Burial Analysis on the Middle Bronze Age in the Carpathian Basin" (accessible over doi: https://doi.org/10.1515/opar-2022-0281). Please consult the publication for in depth description of the data, its context and for the method applied on the data, as well as references to primary sources. The data tables comprise the burial data of the Hungarian Middle Bronze Age cemeteries of Dunaújváros-Duna-dűlő, Dömsöd, Adony, Lovasberény, Csanytelek-Palé, Kelebia, Hernádkak, Gelej, Pusztaszikszó and Streda nad Bodrogom. The script "supplementary_material_2_wealth_index_calculation.py" provides the calculation of a wealth index, based on grave goods, for the provided data. The script "supplementary_material_3_population_estimation.py" models the living population of Dunaújváros-Duna-dűlő. Both can be run by double-click. Requirements to be installed to run the scripts: Python 3 (https://www.python.org/) with the packages numpy (https://numpy.org/), pandas (https://pandas.pydata.org/), matplotlib (https://matplotlib.org/), seaborn (https://seaborn.pydata.org/) and scipy (https://scipy.org/); all included in Ancaonda (Python-Distribution, https://www.anaconda.com/).
Ever wonder how much to tip your waiter? One dedicated waiter meticulously recorded information about each tip he received over a few months while working at a restaurant. In total, he documented 244 tips. Now, the challenge is: can you predict the tip amount?
Stacking and ensembling techniques seem to work wonders with this dataset!
This dataset is a treasure trove of information from a collection of case studies for business statistics. Special thanks to Bryant and Smith for their diligent work:
Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing.
You can also access this dataset now through the Python package Seaborn. Happy tipping (prediction)!
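To illustrate the stacking idea mentioned above, a small scikit-learn sketch on the seaborn copy of the data (not a tuned solution):

```python
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

tips = sns.load_dataset("tips")
X = pd.get_dummies(tips.drop(columns="tip"), drop_first=True)
y = tips["tip"]

# Stack a tree ensemble and a linear model, blended by a ridge meta-learner.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("ridge", Ridge(alpha=1.0)),
    ],
    final_estimator=Ridge(),
)
scores = cross_val_score(stack, X, y, cv=5, scoring="neg_mean_absolute_error")
print("CV MAE:", -scores.mean())
```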
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Quantitative data of figures and graphing scripts from the thesis titled 'Developing a congestion management scheme to reduce the impact of congestion in mixed traffic LoRaWANs'. The files contain the processed output of simulations conducted with a modified version of the ns-3 plugin lorawan. The processed simulation output consists of Pandas dataframes stored in text files. Software used: ns-3 (version 3.30), Jupyter notebooks, and Python with the packages sem, pandas, seaborn, plus a modified version of the lorawan module from signetlabdei. The Python scripts refer to Std and Ex: Std refers to the standard LoRaWAN module and Ex to the extended version of the module with the algorithms presented in the thesis. Each text file contains a legend at the top listing all of the fields present in the dataframe.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Required Libraries
The following libraries are required to run the scripts in this repository. You can install them using `pip`:
```bash
pip install pandas numpy argparse json time random openai copy statistics krippendorff sklearn seaborn matplotlib together anthropic google-generativeai
```
Make sure to also install any other dependencies required by the specific model API if you plan on using models like GPT-4 or Claude:
openai
anthropic
together
All the experiments were done using python 3.10.11
For each dataset, we have a folder that contains process.py, heatmap.py, ira_sample.py. The folder also contains the relevant datasets and plots.
File Description:
Commands for datasets (Except Code Summarization):
Generating samples for different models:
python process.py --model gpt-4 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --model gpt-3.5-turbo --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --model llama3 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --model mixtral --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --model claude --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --model gemini --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
For Figure (1-5):
python heatmap.py
For Figure (7-10):
python ira_sample.py
Commands for datasets (Code Summarization):
python process.py --what accurate --model gpt-4 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --what accurate --model gpt-3.5-turbo --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --what accurate --model llama3 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --what accurate --model mixtral --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --what accurate --model claude --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --what accurate --model gemini --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
For Figure (1-5):
python heatmap.py
For Figure (7-10):
python ira_sample.py
The --what argument can be set to "accurate", "adequate", "concise", or "similarity".
For Figure 6:
python scatter.py
For Figure 12 & 13, please copy majority.py and probability.py outside the shared folders.
For Figure 12:
python probability.py
For Figure 13:
python majority.py
We also provided sample prompts from all datasets in Prompts.pdf
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains 10,000 synthetic records simulating the migratory behavior of various bird species across global regions. Each entry represents a single bird tagged with a tracking device and includes detailed information such as flight distance, speed, altitude, weather conditions, tagging information, and migration outcomes.
The data was entirely synthetically generated using randomized yet realistic values based on known ranges from ornithological studies. It is ideal for practicing data analysis and visualization techniques without privacy concerns or real-world data access restrictions. Because it’s artificial, the dataset can be freely used in education, portfolio projects, demo dashboards, machine learning pipelines, or business intelligence training.
With over 40 columns, this dataset supports a wide array of analysis types. Analysts can explore questions like “Do certain species migrate in larger flocks?”, “How does weather impact nesting success?”, or “What conditions lead to migration interruptions?”. Users can also perform geospatial mapping of start and end locations, cluster birds by behavior, or build time series models based on migration months and environmental factors.
For data visualization, tools like Power BI, Python (Matplotlib/Seaborn/Plotly), or Excel can be used to create insightful dashboards and interactive charts.
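For example, a quick seaborn sketch (the file name and column names such as Species and Flight_Distance_km are placeholders; check the actual CSV header before running):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Placeholder file and column names; adapt to the dataset's actual schema.
birds = pd.read_csv("bird_migration_synthetic.csv")

# Compare flight distances across species as a quick first visualization.
sns.boxplot(data=birds, x="Species", y="Flight_Distance_km")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```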
Join the Fabric Community DataViz Contest | May 2025: https://community.fabric.microsoft.com/t5/Power-BI-Community-Blog/%EF%B8%8F-Fabric-Community-DataViz-Contest-May-2025/ba-p/4668560
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the dataset, build and run scripts for the paper "Fmmgen: Automatic Code Generation of Cartesian Fast Multipole and Barnes-Hut Operators", by Ryan Alexander Pepper and Hans Fangohr.
Organisation
The repository is organised as follows:
sim-scripts
sim-scripts/ contains the source code, build and run scripts for running the FMM calculations described in the paper. To reproduce the results from the paper, you need as prerequisites:
* An installation and license of the Intel Compiler (Parallel Studio 2019 Update 3 was used for the paper).
* An installation of the GNU compiler suite.
* GNU Make
* Python 3.6
* A copy of fmmgen v1.0 (available at https://github.com/rpep/fmmgen or https://zenodo.org/record/3842591)
* An installation of Fidimag v3.0 (available at http://github.com/computationalmodelling/fidimag or http://dx.doi.org/10.5281/zenodo.3841935)
With these prerequisites, simply run from the sim-scripts directory:
```
# To build the executables
make build
# To run the studies
make run
```
The four scripts are:
* run-harmonic-cse-comparison.sh - Runs the Fast Multipole Method for 50000 Coulomb particles, varying the order of expansion, and evaluating the performance benefits of various optimisation strategies introduced in the code generation stage.
* run-scaling-comparison.sh - Runs comparisons between the Barnes-Hut and Fast Multipole Methods for different expansion orders and values of theta, the 'opening angle' parameter.
* run-error-comparison.sh - Runs the FMM and Barnes-Hut calculations, saving the fields and performing the direct calculation, allowing evaluation of the errors for the two methods.
* run-fidimag-tests.sh - Runs the Fidimag scaling tests for a series of magnetic dipoles placed on a lattice.
results
results contains the output data from the sim-scripts scripts. Note: running the scripts will overwrite this data!
figure-scripts
This contains Python scripts needed to reproduce the figures from the paper. These generated figures are included in the repository for convenience. To run these scripts, you require:
* Python >= 3.6
* Matplotlib >= 3.1.1
* Seaborn >= 0.9.1
* NumPy >= 1.17.1
figures
This contains the output figures included in the paper.