This dataset was created by Ranjeet Jain
The bank.csv dataset describes phone calls between customers and the customer-care staff of a Portuguese banking institution. Each record indicates whether the customer took up a scheme or product such as a bank term deposit, so most of the attributes hold 'yes'/'no' type values.
The main goal is to predict if clients will subscribe to a term deposit or not.
Bank Client Data: 1 - age: (numeric) 2 - job: type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown) 3 - marital: marital status (categorical: divorced, married, single, unknown; note: divorced means either divorced or widowed) 4 - education: (categorical: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown) 5 - default: has credit in default? (categorical: no, yes, unknown) 6 - housing: has housing loan? (categorical: no, yes, unknown) 7 - loan: has personal loan? (categorical: no, yes, unknown)
Related to the Last Contact of the Current Campaign: 8 - contact: contact communication type (categorical: cellular, telephone) 9 - month: last contact month of year (categorical: jan, feb, mar, ..., nov, dec) 10 - day_of_week: last contact day of the week (categorical: mon, tue, wed, thu, fri) 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other Attributes: 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact) 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) 14 - previous: number of contacts performed before this campaign and for this client (numeric) 15 - poutcome: outcome of the previous marketing campaign (categorical: failure, nonexistent, success)
Social and Economic Context Attributes: 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric) 17 - cons.price.idx: consumer price index - monthly indicator (numeric) 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 19 - euribor3m: euribor 3 month rate - daily indicator (numeric) 20 - nr.employed: number of employees - quarterly indicator (numeric)
Output Variable (Desired Target): 21 - y (deposit): has the client subscribed to a term deposit? (binary: yes, no) -> column title changed from '***y***' to '***deposit***'
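Following the note above about discarding duration, a minimal preprocessing sketch (assuming a delimited bank.csv with the renamed deposit column; the delimiter is auto-detected because different copies use ',' or ';'):

```python
import pandas as pd

# Load the data; sep=None lets pandas sniff the delimiter.
df = pd.read_csv("bank.csv", sep=None, engine="python")

# The target may be called 'deposit' (renamed from 'y') depending on the version.
target_col = "deposit" if "deposit" in df.columns else "y"
y = (df[target_col] == "yes").astype(int)

# 'duration' is only known after the call, so drop it for a realistic model.
X = df.drop(columns=[target_col, "duration"], errors="ignore")

# One-hot encode the categorical attributes before fitting any classifier.
X = pd.get_dummies(X, drop_first=True)
print(X.shape, y.mean())
```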
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Waiter's Tips Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aminizahra/tips-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
One waiter recorded information about each tip he received over a period of a few months working in one restaurant. In all he recorded 244 tips.
The data was reported in a collection of case studies for business statistics.
Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing
The dataset is also available through the Python package Seaborn.
Note that this version of the dataset has additional columns compared to other copies of the tips dataset.
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
# Column Non-Null Count Dtype
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null object
3 smoker 244 non-null object
4 day 244 non-null object
5 time 244 non-null object
6 size 244 non-null int64
7 price_per_person 244 non-null float64
8 Payer Name 244 non-null object
9 CC Number 244 non-null int64
10 Payment ID 244 non-null object
dtypes: float64(3), int64(2), object(6)
total_bill a numeric vector, the bill amount (dollars)
tip a numeric vector, the tip amount (dollars)
sex a factor with levels Female Male, gender of the payer of the bill
smoker a factor with levels No Yes, whether the party included smokers
day a factor with levels Friday Saturday Sunday Thursday, day of the week
time a factor with levels Day Night, rough time of day
size a numeric vector, number of people in the party
--- Original source retains full ownership of the source dataset ---
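Since the core table also ships with seaborn, a quick exploration sketch (the seaborn version does not include the extra price_per_person, Payer Name, CC Number, and Payment ID columns of this Kaggle copy):

```python
import seaborn as sns

# Load the canonical tips data bundled with seaborn.
tips = sns.load_dataset("tips")
tips.info()

# Average tip by day of the week as a first look at the data.
print(tips.groupby("day")["tip"].mean())
```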
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: While the acquired risk factors of liver cancer in Asia are relatively well understood, the underlying genetic background of liver cancer in Asians has not been well established or correlated with clinical outcomes. Objective: To identify gene mutations linked with worse outcomes in Asian patients with hepatocellular carcinoma (HCC). Methods: A total of 347 Asian and Non-Asian patients with HCC were analyzed in this study. TCGA patient mutation and clinical data were downloaded through TCGAbiolinksGUI and analyzed using the Python NumPy, Matplotlib, seaborn, and SciPy libraries. Statistical significance was determined by Welch’s t-test (unequal variances t-test), with P-values < 0.05 considered to be statistically significant. Results: Mutations in five genes (TP53, TTN, OBSCN, MUC5B, CSMD1) were statistically linked with increased mortality in Asians compared to non-Asians, four of which (TTN, OBSCN, MUC5B, CSMD1) were also more prevalent in the Asian population. Within the Asian cohort, two gene mutations (TTN, HMCN1) were statistically linked with worse outcomes. The TP53 mutation predicts worse outcomes within the non-Asian cohort, but not within the Asian cohort. Conclusions: This study identified multiple genetic biomarkers that can aid in the recognition, surveillance, prognosis, and gene therapy of HCC.
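The study's analysis code is not included here; as a rough sketch of the named test (Welch's unequal-variances t-test via SciPy) on placeholder survival-time arrays:

```python
import numpy as np
from scipy import stats

# Hypothetical survival-time samples for mutation carriers vs. non-carriers;
# the real study used TCGA clinical data downloaded via TCGAbiolinksGUI.
carriers = np.array([14.2, 9.8, 22.1, 7.5, 18.3])
non_carriers = np.array([30.4, 25.7, 41.2, 28.9, 35.0, 27.6])

# Welch's t-test: equal_var=False selects the unequal-variances formulation.
t_stat, p_value = stats.ttest_ind(carriers, non_carriers, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 treated as significant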
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘A Waiter's Tips’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/jsphyg/tipping on 30 September 2021.
--- Dataset description provided by original source is as follows ---
One waiter recorded information about each tip he received over a period of a few months working in one restaurant. In all he recorded 244 tips.
Can you predict the tip amount?
The data was reported in a collection of case studies for business statistics.
Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing
The dataset is also available through the Python package Seaborn.
--- Original source retains full ownership of the source dataset ---
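As a minimal baseline for the tip-prediction question above (a sketch on the seaborn copy of the data, not the original source's analysis):

```python
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

tips = sns.load_dataset("tips")
X = pd.get_dummies(tips.drop(columns="tip"), drop_first=True)
y = tips["tip"]

# Simple hold-out evaluation of an ordinary least-squares baseline.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```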
GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html
Data for 2D Lagrangian Particle tracking and evaluation for their hydrodynamic characteristics

## Abstract
This dataset entails Python code for the fluid-mechanic evaluation of Lagrangian Particles with the "Consensus-Based tracking with Selective Rejection of Tracklets" (CSRT) algorithm in the OpenCV library, written by Ryan Rautenbach in the framework of his Master thesis.

## Workflow for Lagrangian Particle tracking and evaluation via OpenCV
In the following, a brief introduction and guide based on the folders in the repository is laid out. More code-specific instructions can be found in the respective codes.

working_env_RMR.yml --> Contains the entire environment, including software versions (here used with the Spyder IDE and Conda), with which the datasets were evaluated.
01 --> Tracking always begins with the 01_milti[...] folder, in which the Python code with the OpenCV algorithm is located. For the tracking to work, certain directories are required: one in which the raw images are stored (separate from anything else) and one in which the results are saved (not the same directory as the raw data). After tracking is completed for all respective experiments and the results directories are adequately labelled and stored, any of the other code files can be used for the respective analyses. The order of the folders beyond the first 01 directory has no relevance to the order of evaluation, but following it can ease the understanding of the evaluated data.
02 --> Evaluation of the number of circulations and the respective circulation times in the experimental vat. (The code can be extended to calculate the circulation time based on the various planes that are artificially set.)
03 --> Code for calculating the number of contacts with the vat floor. The code requires certain visual evaluations based on the LP trajectories, as the plane/barrier for the contact evaluation has to be set manually.
04 --> Contains two codes that can be applied to the results data to combine individual results into larger, more processable arrays within Python.
05 --> Contains the code to plot the trajectory of single experiments of Lagrangian particles based on their positional results and the velocity at the respective position, highlighting the trajectory over the experiment.
06 --> Codes to create 1D histograms based on the probability density distribution and velocity distributions in cumulative experiments.
07 --> Codes for plotting the 2D probability density distribution (2D histograms) of Lagrangian Particles based on the cumulative experiments. The code provides values for the 2D grid; plotting is conducted in Origin Lab or similar graphing tools. Graphing can also be conducted in Python, for which the seaborn (matplotlib) library is suggested.
08 --> Contains the code for the dimensionless evaluation of the results based on the respective Stokes number approaches and weighted averages. 2D histograms are also vital to this evaluation, and the plotting is again conducted in Origin Lab, as values are only calculated in the code.
09 --> Does not contain any Python codes, but instead contains the respective Origin Lab files for the graphing, plotting, and evaluation of results calculated via Python. Respective tables, histograms, and heat maps are given to be used as templates if necessary.
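The tracking code itself lives in the repository; a minimal sketch of CSRT-based tracking with OpenCV (the image paths and initial bounding box below are placeholders) looks roughly like:

```python
import glob

import cv2

# Hypothetical directory; the real workflow keeps raw images separate from results.
frames = sorted(glob.glob("raw_images/*.png"))

# Initialise the CSRT tracker on the first frame with a manually chosen
# bounding box around the particle (x, y, width, height).
first = cv2.imread(frames[0])
bbox = (100, 120, 30, 30)  # placeholder values
tracker = cv2.TrackerCSRT_create()
tracker.init(first, bbox)

positions = []
for path in frames[1:]:
    frame = cv2.imread(path)
    ok, bbox = tracker.update(frame)
    if not ok:
        break  # tracking lost; real code would handle re-initialisation
    x, y, w, h = bbox
    positions.append((x + w / 2, y + h / 2))  # particle centre per frame
```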
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The goal of this project is to classify apple leaf diseases using a Vision Transformer (ViT) model. The dataset is divided into four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The script relies on matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.
A walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class. The images are organised into Train, Val, and Test directories, each containing subdirectories for the four classes.
The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.
The model includes a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer (a generic sketch of such a layer is given after the usage steps below).
Results are visualized with seaborn to provide a clear understanding of the model's predictions.
Usage:
1. Dataset Preparation: organise the data into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
2. Install Required Libraries: pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
3. Run the Script
4. Analyze Results
5. Fine-Tuning
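The repository's exact Patches implementation is not reproduced in this description; a minimal sketch of such a patch-extraction layer in Keras (patch_size is an assumed hyperparameter, and details may differ from the actual script):

```python
import tensorflow as tf

class Patches(tf.keras.layers.Layer):
    """Cuts each image into non-overlapping patch_size x patch_size tiles."""

    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patch_dims = patches.shape[-1]
        # Flatten the spatial grid of patches into a sequence for the transformer.
        return tf.reshape(patches, [batch_size, -1, patch_dims])

# Example: a batch of 224x224 RGB images cut into 16x16 patches.
dummy = tf.random.uniform((2, 224, 224, 3))
print(Patches(16)(dummy).shape)  # (2, 196, 768)
```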
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Replication Kit for the Paper "Are Unit and Integration Test Definitions Still Valid for Modern Java Projects? An Empirical Study on Open-Source Projects"
This additional material shall provide other researchers with the ability to replicate our results. Furthermore, we want to facilitate further insights that might be generated based on our data sets.
Structure
The structure of the replication kit is as follows:
Additional Visualizations
We provide two additional visualizations for each project:
1) For each of these data sets there exists one visualization per project that shows four Venn diagrams, one for each of the different defect types. These Venn diagrams show the number of defects that were detected by unit tests, by integration tests, or by both.
2) Furthermore, we added boxplots for each of the data sets (i.e., ALL and DISJ) showing the scores of unit and integration tests for each defect type.
Analysis scripts
Requirements:
- python3.5
- tabulate
- scipy
- seaborn
- mongoengine
- pycoshark
- pandas
- matplotlib
Both python files contain all code for the statistical analysis we performed.
Data Collection Tools
We provide all data collection tools that we have implemented and used throughout our paper. Overall it contains six different projects and one python script:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data deposit contains synthetically generated crime data from the Agent-Based Reinforcement Learning Model of Burglary developed for the research article: Learning the Rational Choice Perspective: A Reinforcement Learning Approach to Simulating Offender Behaviours in Criminological Agent-Based Models
The data directory is as follows:
Model/
Data_Analysis_Notebook.ipynb MC1_Data MC2_Data MC3_Data
The Data_Analysis_Notebook.ipynb is the jupyter notebook used to produce the analysis within the article. This notebook requires python 3.* with packages such as matplotlib, seaborn, numpy, pandas, plotly, scipy to run.
The MC1, MC2 and MC3 folders contain the .txt files containing the data outputs used for analysis in the article. Where MC1 = Experiment Condition 1 in the article.
Each column of the data is described as follows:
AgentID: A unique agent identifier.
Action: The current action an agent has chosen; one of [OFFEND, DON'T OFFEND, MOVE].
Area: The locality in which the above action has taken place.
Target_Attractiveness: The target attractiveness value of the property that has been victimised.
Target_Reward: The reward at the property that has been victimised.
Target_Risk: The risk surrounding the property that has been victimised.
Target_Effort: The effort of the property victimised by the specific offender agent.
Total_Cumulative_Reward: The total sum of Target_Attractiveness acquired by the offender agent.
xAxisPos: The x-axis position of the cell the offender agent is currently in.
zAxisPos: The y-axis position of the cell the offender agent is currently in.
Zone_Travelled_To: The locality the offender agent is currently travelling towards.
Episode: The current episode.
Distance_To_Home: The normalised Euclidean distance to the offender agent's home node from the current victimised target.
Distance_To_Next_Node: The normalised Euclidean distance to the next routine activity node from the current victimised target.
Timestep: The current discrete time point.
Target_Cumulative_Reward: The total amount of Target_Attractiveness the offender agent wants to achieve.
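A loading sketch (assuming the .txt files are delimited tables whose header matches the columns above; the file pattern and delimiter are placeholders):

```python
import glob

import pandas as pd

# Placeholder path; adjust to the actual MC1_Data files. sep=None sniffs the delimiter.
frames = [
    pd.read_csv(path, sep=None, engine="python")
    for path in sorted(glob.glob("MC1_Data/*.txt"))
]
data = pd.concat(frames, ignore_index=True)

# Example query: how often each action was chosen per episode.
print(data.groupby(["Episode", "Action"]).size().unstack(fill_value=0).head())
```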
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the supplementary material of the paper "Wealth Consumption, Sociopolitical Organization, and Change: A Perspective from Burial Analysis on the Middle Bronze Age in the Carpathian Basin" (accessible over doi: https://doi.org/10.1515/opar-2022-0281). Please consult the publication for in depth description of the data, its context and for the method applied on the data, as well as references to primary sources. The data tables comprise the burial data of the Hungarian Middle Bronze Age cemeteries of Dunaújváros-Duna-dűlő, Dömsöd, Adony, Lovasberény, Csanytelek-Palé, Kelebia, Hernádkak, Gelej, Pusztaszikszó and Streda nad Bodrogom. The script "supplementary_material_2_wealth_index_calculation.py" provides the calculation of a wealth index, based on grave goods, for the provided data. The script "supplementary_material_3_population_estimation.py" models the living population of Dunaújváros-Duna-dűlő. Both can be run by double-click. Requirements to be installed to run the scripts: Python 3 (https://www.python.org/) with the packages numpy (https://numpy.org/), pandas (https://pandas.pydata.org/), matplotlib (https://matplotlib.org/), seaborn (https://seaborn.pydata.org/) and scipy (https://scipy.org/); all included in Ancaonda (Python-Distribution, https://www.anaconda.com/).
Ever wonder how much to tip your waiter? One dedicated waiter meticulously recorded information about each tip he received over a few months while working at a restaurant. In total, he documented 244 tips. Now, the challenge is: can you predict the tip amount?
Stacking and ensembling techniques seem to work wonders with this dataset!
This dataset is a treasure trove of information from a collection of case studies for business statistics. Special thanks to Bryant and Smith for their diligent work:
Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing.
You can also access this dataset now through the Python package Seaborn. Happy tipping (prediction)!
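To illustrate the stacking idea mentioned above, a small scikit-learn sketch on the seaborn copy of the data (not a tuned solution):

```python
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

tips = sns.load_dataset("tips")
X = pd.get_dummies(tips.drop(columns="tip"), drop_first=True)
y = tips["tip"]

# Stack a tree ensemble and a linear model, blended by a ridge meta-learner.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("ridge", Ridge(alpha=1.0)),
    ],
    final_estimator=Ridge(),
)
scores = cross_val_score(stack, X, y, cv=5, scoring="neg_mean_absolute_error")
print("CV MAE:", -scores.mean())
```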
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Quantitative data of figures and graphing scripts from the thesis titled 'Developing a congestion management scheme to reduce the impact of congestion in mixed traffic LoRaWANs'. The files contain the processed output of simulations conducted with a modified version of the ns-3 plugin lorawan. The processed simulation output consists of Pandas dataframes stored in text files. Software used: ns-3 (version 3.30), Jupyter notebooks, and Python with the packages sem, pandas, seaborn, plus a modified version of the lorawan module from signetlabdei. The Python scripts refer to Std and Ex: Std refers to the standard LoRaWAN module and Ex to the extended version of the module with the algorithms presented in the thesis. Each text file contains a legend at the top listing all of the fields present in the dataframe.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Required Libraries
The following libraries are required to run the scripts in this repository. You can install them using `pip`:
```bash
pip install pandas numpy argparse json time random openai copy statistics krippendorff sklearn seaborn matplotlib together anthropic google-generativeai
```
Make sure to also install any other dependencies required by the specific model API if you plan on using models like GPT-4 or Claude:
openai
anthropic
together
All the experiments were done using python 3.10.11
For each dataset, we have a folder that contains process.py, heatmap.py, ira_sample.py. The folder also contains the relevant datasets and plots.
File Description:
Commands for datasets (Except Code Summarization):
Generating samples for different models:
python process.py --model gpt-4 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --model gpt-3.5-turbo --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --model llama3 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --model mixtral --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --model claude --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --model gemini --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
For Figure (1-5):
python heatmap.py
For Figure (7-10):
python ira_sample.py
Commands for datasets (Code Summarization):
python process.py --what accurate --model gpt-4 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --what accurate --model gpt-3.5-turbo --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --what accurate --model llama3 --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --what accurate --model mixtral --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --what accurate --model claude --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
python process.py --what accurate --model gemini --fewshot yes --openai_key xxxx --together_key xxxx --claude_key xxxx --google_key xxxx
For Figure (1-5):
python heatmap.py
For Figure (7-10):
python ira_sample.py
The --what argument can be set to "accurate", "adequate", "concise", or "similarity".
For Figure 6:
python scatter.py
For Figure 12 & 13, please copy majority.py and probability.py outside the shared folders.
For Figure 12:
python probability.py
For Figure 13:
python majority.py
We also provided sample prompts from all datasets in Prompts.pdf
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains 10,000 synthetic records simulating the migratory behavior of various bird species across global regions. Each entry represents a single bird tagged with a tracking device and includes detailed information such as flight distance, speed, altitude, weather conditions, tagging information, and migration outcomes.
The data was entirely synthetically generated using randomized yet realistic values based on known ranges from ornithological studies. It is ideal for practicing data analysis and visualization techniques without privacy concerns or real-world data access restrictions. Because it’s artificial, the dataset can be freely used in education, portfolio projects, demo dashboards, machine learning pipelines, or business intelligence training.
With over 40 columns, this dataset supports a wide array of analysis types. Analysts can explore questions like “Do certain species migrate in larger flocks?”, “How does weather impact nesting success?”, or “What conditions lead to migration interruptions?”. Users can also perform geospatial mapping of start and end locations, cluster birds by behavior, or build time series models based on migration months and environmental factors.
For data visualization, tools like Power BI, Python (Matplotlib/Seaborn/Plotly), or Excel can be used to create insightful dashboards and interactive charts.
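For example, a quick seaborn sketch (the file name and column names such as Species and Flight_Distance_km are placeholders; check the actual CSV header before running):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Placeholder file and column names; adapt to the dataset's actual schema.
birds = pd.read_csv("bird_migration_synthetic.csv")

# Compare flight distances across species as a quick first visualization.
sns.boxplot(data=birds, x="Species", y="Flight_Distance_km")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```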
Join the Fabric Community DataViz Contest | May 2025: https://community.fabric.microsoft.com/t5/Power-BI-Community-Blog/%EF%B8%8F-Fabric-Community-DataViz-Contest-May-2025/ba-p/4668560
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the dataset, build and run scripts for the paper "Fmmgen: Automatic Code Generation of Cartesian Fast Multipole and Barnes-Hut Operators", by Ryan Alexander Pepper and Hans Fangohr.
Organisation
The repository is organised as follows:
sim-scripts
sim-scripts/ contains the source code, build and run scripts for running the FMM calculations described in the paper. To reproduce the results from the paper, you need as prerequisites:
* An installation and license of the Intel Compiler (Parallel Studio 2019 Update 3 was used for the paper).
* An installation of the GNU compiler suite.
* GNU Make
* Python 3.6
* A copy of fmmgen v1.0 (available at https://github.com/rpep/fmmgen or https://zenodo.org/record/3842591)
* An installation of Fidimag v3.0 (available at http://github.com/computationalmodelling/fidimag or http://dx.doi.org/10.5281/zenodo.3841935)
With these prerequisites, simply run from the sim-scripts directory:
```
# To build the executables
make build
# To run the studies
make run
```
The four scripts are:
* run-harmonic-cse-comparison.sh - Runs the Fast Multipole Method for 50000 Coulomb particles, varying the order of expansion, and evaluating the performance benefits of various optimisation strategies introduced in the code generation stage.
* run-scaling-comparison.sh - Runs comparisons between the Barnes-Hut and Fast Multipole Methods for different expansion orders and values of theta, the 'opening angle' parameter.
* run-error-comparison.sh - Runs the FMM and Barnes-Hut calculations, saving the fields and performing the direct calculation, allowing evaluation of the errors for the two methods.
* run-fidimag-tests.sh - Runs the Fidimag scaling tests for a series of magnetic dipoles placed on a lattice.
results
results contains the output data from the sim-scripts scripts. Note: running the scripts will overwrite this data!
figure-scripts
This contains Python scripts needed to reproduce the figures from the paper. These generated figures are included in the repository for convenience. To run these scripts, you require:
* Python >= 3.6
* Matplotlib >= 3.1.1
* Seaborn >= 0.9.1
* NumPy >= 1.17.1
figures
This contains the output figures included in the paper.