100+ datasets found

Numpy , pandas and matplot lib practice
kaggle.com
zip
Updated Jul 16, 2023
Cite
pratham saraf (2023). Numpy , pandas and matplot lib practice [Dataset]. https://www.kaggle.com/datasets/prathamsaraf1389/numpy-pandas-and-matplot-lib-practise/suggestions
Explore at:
Available download formats: zip (385020 bytes)
Dataset updated
Jul 16, 2023
Authors
pratham saraf
License
https://cdla.io/permissive-1-0/
Description
The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.

Specifics of the Dataset:

The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.

One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:

- Certain columns are randomly selected to be populated with NaN values, simulating the common challenge of missing data. The proportion of missing values in each of these columns varies randomly between 1% and 70%.
- Statistical noise has been introduced in the dataset. For numerical values in some features, this noise follows a distribution with mean 0 and standard deviation 0.1.
- Categorical noise has been introduced in some features, with categories randomly altered in about 1% of the rows.
- Outliers have also been embedded in the dataset, in line with the Interquartile Range (IQR) rule.
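The challenges listed above can be reproduced and inspected on a toy column. The following is a minimal sketch, not the author's generation script: the column name `feature_1`, the base distribution, the 10% missing rate, and the planted value 500 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical numeric feature standing in for one of the dataset's columns.
values = rng.normal(loc=50.0, scale=5.0, size=5000)

# Additive noise with mean 0 and standard deviation 0.1, as the description states.
values = values + rng.normal(loc=0.0, scale=0.1, size=values.size)

# Plant a few extreme values so the IQR rule has something to find.
values[:10] = 500.0

# Blank out 10% of the entries (the description says the rate varies per column).
missing_idx = rng.choice(values.size, size=500, replace=False)
values[missing_idx] = np.nan

col = pd.Series(values, name="feature_1")

# Share of missing values in the column.
missing_share = col.isna().mean()

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; quantile() skips NaNs.
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
```

Comparisons against NaN evaluate to False, so missing entries are never flagged as outliers here; they need to be handled separately, for example with `isna()`.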

Context of the Dataset:

The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.

Sources of the Dataset:

The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.
COVID-19 Dataset
kaggle.com
zip
Updated Oct 17, 2024
Cite
Anushka Ranjan (2024). COVID-19 Dataset [Dataset]. https://www.kaggle.com/datasets/anushkaranjan/covid-19-dataset
Explore at:
Available download formats: zip (11178 bytes)
Dataset updated
Oct 17, 2024
Authors
Anushka Ranjan
Description
COVID-19 DATASET

This dataset contains comprehensive information related to the COVID-19 pandemic. It includes data collected from various reliable sources, providing insights into the spread, impact, and outcomes of the virus across different regions. The dataset is structured to facilitate analysis on trends such as infection rates, recovery statistics, death tolls, and vaccination progress.

Potential Use Cases:

1. Trend Analysis: Analyze the spread and control of the virus over time.
2. Predictive Modeling: Build models to forecast future infection rates or outcomes.
3. Policy Research: Evaluate the effectiveness of public health policies across regions.
4. Healthcare Resource Planning: Assist in managing healthcare resources and response strategies.

The dataset will require cleaning and formatting on the user's end, but it is good practice material if you are learning pandas and NumPy. This dataset serves as a vital resource for researchers, data scientists, healthcare professionals, and policy-makers aiming to gain a deeper understanding of the global pandemic and devise strategies for future preparedness.
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating, and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit is divided into subreddits; this dataset uses the r/AskScience subreddit.

The dataset was extracted from the /r/AskScience subreddit on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. It includes information about the questions asked on the subreddit: the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, with some light cleaning done using NumPy and pandas (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions:

- author - Redditor name.
- author_fullname - Redditor full name.
- contest_mode - Contest mode [implement obscured scores and randomized sorting].
- created_utc - Time the submission was created, represented in Unix time.
- domain - Domain of the submission.
- edited - Whether or not the post has been edited.
- full_link - Link to the post on the subreddit.
- id - ID of the submission.
- is_self - Whether or not the submission is a self post (text-only).
- link_flair_css_class - CSS class used to identify the flair.
- link_flair_text - The link flair's text content.
- locked - Whether or not the submission has been locked.
- num_comments - The number of comments on the submission.
- over_18 - Whether or not the submission has been marked as NSFW.
- permalink - A permalink for the submission.
- retrieved_on - Time ingested.
- score - The number of upvotes for the submission.
- description - Description of the submission.
- spoiler - Whether or not the submission has been marked as a spoiler.
- stickied - Whether or not the submission is stickied.
- thumbnail - Thumbnail of the submission.
- question - Question asked in the submission.
- url - The URL the submission links to, or the permalink if a self post.
- year - Year of the submission.
- banned - Whether or not the submission was banned by a moderator.
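Because `created_utc` is stored as Unix time, a common first step is converting it to timestamps. The sketch below uses made-up rows with three of the column names from the list; `created` and `year_from_utc` are hypothetical helper columns, and it assumes the timestamps are in seconds.

```python
import pandas as pd

# Toy rows using column names from the list above; the values are made up.
df = pd.DataFrame(
    {
        "created_utc": [1451606400, 1653004800],  # Unix time, assumed to be in seconds
        "link_flair_text": ["Physics", "Biology"],
        "over_18": [False, False],
    }
)

# Interpret created_utc as seconds since the epoch, in UTC.
df["created"] = pd.to_datetime(df["created_utc"], unit="s", utc=True)

# Recompute the year and count submissions per year and flair.
df["year_from_utc"] = df["created"].dt.year
counts = df.groupby(["year_from_utc", "link_flair_text"]).size()
```

The two sample timestamps correspond to 2016-01-01 and 2022-05-20 (UTC), the collection window stated in the description.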

This dataset can be used for flair prediction, NSFW classification, and various text mining/NLP tasks. Exploratory data analysis can also be done to gain insights and see trends and patterns over the years.
Bank Data Analysis
kaggle.com
zip
Updated Feb 23, 2022
Cite
Steve Gallegos (2022). Bank Data Analysis [Dataset]. https://www.kaggle.com/stevegallegos/bank-marketing-data-set
Explore at:
Available download formats: zip (376757 bytes)
Dataset updated
Feb 23, 2022
Authors
Steve Gallegos
Description
Data Set Information

The bank.csv dataset describes phone calls between customers and the customer care staff of a Portuguese banking institution. It records whether the customer subscribes to a product such as a bank term deposit. Most of the data is of the ‘yes’ or ‘no’ type.

bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

The file name was changed to bank.csv after the data was delimited.

Goal

The main goal is to predict if clients will subscribe to a term deposit or not.

Attribute Information

Input Variables

Bank Client Data:
1 - age: (numeric)
2 - job: type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)
3 - marital: marital status (categorical: divorced, married, single, unknown; note: divorced means either divorced or widowed)
4 - education: (categorical: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown)
5 - default: has credit in default? (categorical: no, yes, unknown)
6 - housing: has housing loan? (categorical: no, yes, unknown)
7 - loan: has personal loan? (categorical: no, yes, unknown)

Related with the Last Contact of the Current Campaign:
8 - contact: contact communication type (categorical: cellular, telephone)
9 - month: last contact month of year (categorical: jan, feb, mar, ..., nov, dec)
10 - day_of_week: last contact day of the week (categorical: mon, tue, wed, thu, fri)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other Attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: failure, nonexistent, success)

Social and Economic Context Attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output Variable (Desired Target):
21 - y (deposit): has the client subscribed to a term deposit? (binary: yes, no). The column title was changed from 'y' to 'deposit'.
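Two notes in the attribute list translate directly into preprocessing steps: `pdays` uses 999 as a sentinel, and `duration` should be discarded for a realistic model. A minimal sketch on made-up rows follows; `previously_contacted` is a hypothetical helper column, and keeping the sentinel as a separate flag is one reasonable choice, not something the dataset page prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows with a few of the columns listed above; the values are made up.
df = pd.DataFrame(
    {
        "age": [34, 51, 29],
        "duration": [120, 0, 310],
        "pdays": [999, 6, 999],
        "deposit": ["no", "no", "yes"],
    }
)

# pdays uses 999 as a sentinel for "not previously contacted":
# keep that information as a flag and turn the sentinel into NaN.
df["previously_contacted"] = df["pdays"] != 999
df["pdays"] = df["pdays"].replace(999, np.nan)

# duration is only known after the call, so drop it for a realistic model.
X = df.drop(columns=["duration", "deposit"])

# Encode the binary target as 0/1.
y = (df["deposit"] == "yes").astype(int)
```

Leaving 999 in place as a number would distort any statistic or distance computed on `pdays`, which is why the sketch separates the "never contacted" case from the day counts.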

Source

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
watches
huggingface.co
Updated Nov 17, 2025
Cite
gil (2025). watches [Dataset]. https://huggingface.co/datasets/yotam22/watches
Explore at:
Dataset updated
Nov 17, 2025
Authors
gil
Description
🕰️ Exploratory Data Analysis of Luxury Watch Prices

Overview

This project analyzes a large dataset of luxury watches to understand which factors influence price. We focus on brand, movement type, case material, size, gender, and production year. All work was done in Python (Pandas, NumPy, Matplotlib/Seaborn) on Google Colab.

Dataset

Rows: ~172,000
Columns: 14
Unit of observation: one watch listing

Main columns

name – watch/listing title
price – listed… See the full description on the dataset page: https://huggingface.co/datasets/yotam22/watches.
Dataset for 'Identifying Key Drivers of Product Formation in Microbial...
data.4tu.nl
zip
Cite
Marika Zegers; Moumita Roy; Ludovic Jourdin, Dataset for 'Identifying Key Drivers of Product Formation in Microbial Electrosynthesis with a Mixed Linear Regression Analysis' [Dataset]. http://doi.org/10.4121/5e840d08-55f6-4daa-a639-048cebcd8266.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.4121/5e840d08-55f6-4daa-a639-048cebcd8266.v1
Dataset provided by
4TU.ResearchData
Authors
Marika Zegers; Moumita Roy; Ludovic Jourdin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Feb 1, 2024 - Dec 1, 2024
Dataset funded by
Delft University of Technology
NWO
Description
The analysed data and complete scripts for the permutation tests and mixed linear regression models (MLRMs) used in the paper 'Identifying Key Drivers of Product Formation in Microbial Electrosynthesis with a Mixed Linear Regression Analysis'.
Python version 3.10.13 with packages numpy, pandas, os, scipy.optimize, scipy.stats, sklearn.metrics, matplotlib.pyplot, statsmodels.formula.api, seaborn are required to run the .py files. Ensure all packages are installed before running the scripts. Data files required to run the code (.xlsx and .csv format) are included in the relevant folders.
Data from: SalmonScan: A Novel Image Dataset for Machine Learning and Deep...
data.mendeley.com
Updated Apr 2, 2024
+ more versions
Cite
Md Shoaib Ahmed (2024). SalmonScan: A Novel Image Dataset for Machine Learning and Deep Learning Analysis in Fish Disease Detection in Aquaculture [Dataset]. http://doi.org/10.17632/x3fz2nfm4w.3
Explore at:
Unique identifier
https://doi.org/10.17632/x3fz2nfm4w.3
Dataset updated
Apr 2, 2024
Authors
Md Shoaib Ahmed
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The SalmonScan dataset is a collection of images of salmon fish, including healthy fish and infected fish. The dataset consists of two classes of images:

- Fresh salmon 🐟
- Infected Salmon 🐠

This dataset is ideal for various computer vision tasks in machine learning and deep learning applications. Whether you are a researcher, developer, or student, the SalmonScan dataset offers a rich and diverse data source to support your projects and experiments.

So, dive in and explore the fascinating world of salmon health and disease!

The SalmonScan dataset (raw) consists of 24 fresh fish and 91 infected fish. [Due to server cleaning in the past, some raw datasets have been deleted]

The SalmonScan dataset (augmented) consists of approximately 1,208 images of salmon fish, classified into two classes:

Fresh salmon (healthy fish with no visible signs of disease), 456 images

Infected Salmon containing disease, 752 images

Each class contains a representative and diverse collection of images, capturing a range of different perspectives, scales, and lighting conditions. The images have been carefully curated to ensure that they are of high quality and suitable for use in a variety of computer vision tasks.

Data Preprocessing

The input images were preprocessed to enhance their quality and suitability for further analysis. The following steps were taken:

- Resizing 📏: All the images were resized to a uniform size of 600 pixels in width and 250 pixels in height to ensure compatibility with the learning algorithm.
- Image Augmentation 📸: To overcome the small number of images, various image augmentation techniques were applied to the input images. These included:
  - Horizontal Flip ↩️: The images were horizontally flipped to create additional samples.
  - Vertical Flip ⬆️: The images were vertically flipped to create additional samples.
  - Rotation 🔄: The images were rotated to create additional samples.
  - Cropping 🪓: A portion of the image was randomly cropped to create additional samples.
  - Gaussian Noise 🌌: Gaussian noise was added to the images to create additional samples.
  - Shearing 🌆: The images were sheared to create additional samples.
  - Contrast Adjustment (Gamma) ⚖️: Gamma correction was applied to the images to adjust their contrast.
  - Contrast Adjustment (Sigmoid) ⚖️: The sigmoid function was applied to the images to adjust their contrast.
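Several of the augmentations above reduce to simple array operations. The following NumPy sketch illustrates flips, Gaussian noise, and the two contrast adjustments on a random stand-in image; it is not the authors' pipeline, and the noise sigma, gamma value, sigmoid cutoff, and gain are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one resized RGB image: 250 rows (height) x 600 columns (width), values in [0, 1].
image = rng.random((250, 600, 3))

h_flip = image[:, ::-1, :]  # horizontal flip: reverse the column order
v_flip = image[::-1, :, :]  # vertical flip: reverse the row order

# Gaussian noise; sigma=0.05 is an arbitrary choice, and clipping keeps values in range.
noisy = np.clip(image + rng.normal(0.0, 0.05, image.shape), 0.0, 1.0)

# Gamma correction: out = in ** gamma (gamma < 1 brightens, gamma > 1 darkens).
gamma_adjusted = image ** 0.8

# Sigmoid contrast: out = 1 / (1 + exp(gain * (cutoff - in))); cutoff and gain are assumptions.
cutoff, gain = 0.5, 10.0
sigmoid_adjusted = 1.0 / (1.0 + np.exp(gain * (cutoff - image)))
```

Rotation, cropping, and shearing involve resampling and are usually delegated to an image library rather than written by hand.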

Usage

To use the salmon scan dataset in your ML and DL projects, follow these steps:

Clone or download the salmon scan dataset repository from GitHub.

Use standard libraries such as numpy or pandas to convert the images into arrays, which can be input into a machine learning or deep learning model.

Split the dataset into training, validation, and test sets as per your requirement.

Preprocess the data as needed, such as resizing and normalizing the images.

Train your ML/DL model using the preprocessed training data.

Evaluate the model on the test set and make predictions on new, unseen data.
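The splitting step in the list above can be sketched with a shuffled index array. The class sizes come from the description; the 70/15/15 ratios and the seed are arbitrary choices, and the labels here are placeholders rather than real image data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Class sizes of the augmented set as stated above: 456 fresh (0), 752 infected (1).
labels = np.array([0] * 456 + [1] * 752)
indices = rng.permutation(labels.size)

# 70/15/15 split; the ratios are an arbitrary choice.
n_train = int(0.70 * labels.size)
n_val = int(0.15 * labels.size)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]
```

With augmented data, copies derived from the same raw image can land in different subsets under a purely random split, so splitting by source image may be preferable when that mapping is available.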
Classicmodels
kaggle.com
zip
Updated Dec 15, 2024
Cite
Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
Explore at:
Available download formats: zip (65751 bytes)
Dataset updated
Dec 15, 2024
Authors
Javier Landaeta
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Abstract

This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

Methodology

1. Data Extraction:

A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.

A reusable function is created to read each table and load it into a Pandas DataFrame.

2. Data Cleansing and Transformation:

An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.

New variables are calculated, such as the total value of each sale, cost, and profit.

Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

3. Exploratory Data Analysis (EDA):

Key metrics such as total sales, number of unique customers, and average order value are calculated.

Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.

Results are visualized using relevant graphics (histograms, bar charts, etc.).
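The join, derived-variable, and grouping steps described above can be sketched with pandas. The toy tables and their values are made up, and the column names (`quantityOrdered`, `priceEach`, `buyPrice`, `productCode`) are assumed to match the classicmodels schema; the project's own code is not shown on this page.

```python
import pandas as pd

# Toy stand-ins for the orderdetails and products tables; all values are made up.
orderdetails = pd.DataFrame(
    {
        "orderNumber": [10100, 10100, 10101],
        "productCode": ["S18_1749", "S18_2248", "S18_1749"],
        "quantityOrdered": [30, 50, 25],
        "priceEach": [100.0, 50.0, 110.0],
    }
)
products = pd.DataFrame(
    {
        "productCode": ["S18_1749", "S18_2248"],
        "buyPrice": [80.0, 30.0],
    }
)

# Join on the key shared by both tables.
sales = orderdetails.merge(products, on="productCode", how="left")

# Derived variables: total value, cost, and profit per order line.
sales["total"] = sales["quantityOrdered"] * sales["priceEach"]
sales["cost"] = sales["quantityOrdered"] * sales["buyPrice"]
sales["profit"] = sales["total"] - sales["cost"]

# Group by product to rank products by profit.
profit_by_product = (
    sales.groupby("productCode")["profit"].sum().sort_values(ascending=False)
)
```

A left join keeps every order line even if a product record is missing, which makes gaps visible as NaN instead of silently dropping sales.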

4. Modeling and Prediction:

Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

5. Report Generation:

Detailed reports are created in Pandas DataFrames format that answer specific business questions.

These reports are stored in new PostgreSQL tables for further analysis and visualization.

Results

- Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
- Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
- Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

Conclusions

This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

Technologies Used

- Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
- Database: PostgreSQL
- Tools: Jupyter Notebook
- Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence
An Empirical Study on Energy Usage Patterns of Different Variants of Data...
figshare.com
zip
Updated Nov 5, 2024
+ more versions
Cite
Princy Chauhan (2024). An Empirical Study on Energy Usage Patterns of Different Variants of Data Processing Libraries [Dataset]. http://doi.org/10.6084/m9.figshare.27611421.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.27611421.v1
Dataset updated
Nov 5, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
Princy Chauhan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As computing power grows, so does the need for data processing, which uses a lot of energy in steps like cleaning and analyzing data. This study looks at the energy and time efficiency of four common Python libraries (Pandas, Vaex, Scikit-learn, and NumPy), tested on five datasets across 21 tasks. We compared the energy use of the newest and older versions of each library. Our findings show that no single library always saves the most energy. Instead, energy use varies by task type, how often tasks are done, and the library version. In some cases, newer versions use less energy, pointing to the need for more research on making data processing more energy-efficient.

A zip file accompanying this study contains the scripts, datasets, and a README file for guidance. This setup allows for easy replication and testing of the experiments described, helping to further analyze energy efficiency across different libraries and tasks.
Data from: CADDI: An in-Class Activity Detection Dataset using IMU data from...
observatorio-cientifico.ua.es
scidb.cn
Updated 2025
Cite
Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel (2025). CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors [Dataset]. https://observatorio-cientifico.ua.es/documentos/668fc49bb9e7c03b01be251c
Explore at:
Dataset updated
2025
Authors
Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel
Description
Data Description

The CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.

Data Generation Procedures

The data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:

- A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100Hz.
- A ZED stereo camera capturing 1080p images at 25-30 fps.
- A synchronized computer acting as a data hub, receiving IMU data and storing images in real-time.
- A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.

Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.

Temporal and Spatial Scope

- The dataset contains a total of 472.03 minutes of recorded data.
- The IMU sensors operate at 100Hz, while the stereo camera captures images at 25-30Hz.
- Data was collected from 12 participants, each performing all 19 activities multiple times.
- The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.

Dataset Components

The dataset is organized into JSON and PNG files, structured hierarchically:

- IMU Data: Stored in JSON files, containing:
  - Samsung Linear Acceleration Sensor (X, Y, Z values, 100Hz)
  - LSM6DSO Gyroscope (X, Y, Z values, 100Hz)
  - Samsung Rotation Vector (X, Y, Z, W quaternion values, 100Hz)
  - Samsung HR Sensor (heart rate, 1Hz)
  - OPT3007 Light Sensor (ambient light levels, 5Hz)
- Stereo Camera Images: High-resolution 1920×1080 PNG files from left and right cameras.
- Synchronization: Each IMU data record and image is timestamped for precise alignment.

Data Structure

The dataset is divided into continuous and instantaneous activities:

- Continuous Activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained.
- Instantaneous Activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution.

The dataset is structured as:

/continuous/subject_id/activity_name/
    /camera_a/ → Left camera images
    /camera_b/ → Right camera images
    /sensors/ → JSON files with IMU data

/instantaneous/subject_id/activity_name/repetition_id/
    /camera_a/
    /camera_b/
    /sensors/

Data Quality & Missing Data

- The smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss.
- Synchronization latency between the smartwatch and the computer is negligible.
- Not all IMU samples have corresponding images due to different recording rates.
- Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.

Error Ranges & Limitations

- Sensor data may contain noise due to minor hand movements.
- The heart rate sensor operates at 1Hz, limiting its temporal resolution.
- Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.

File Formats & Software Compatibility

- IMU data is stored in JSON format, readable with Python’s json library.
- Images are in PNG format, compatible with all standard image processing tools.
- Recommended libraries for data analysis:
  - Python: numpy, pandas, scikit-learn, tensorflow, pytorch
  - Visualization: matplotlib, seaborn
  - Deep Learning: Keras, PyTorch

Potential Applications

- Development of activity recognition models in educational settings.
- Study of student engagement based on movement patterns.
- Investigation of sensor fusion techniques combining visual and IMU data.

This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.

Citation

If you find this project helpful for your research, please cite our work using the following bibtex entry:

@misc{marquezcarpintero2025caddiinclassactivitydetection,
  title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso},
  year={2025},
  eprint={2503.02853},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.02853},
}
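The rule that continuous activities were recorded for 210 seconds with the central 200 seconds retained can be illustrated at the stated 100 Hz IMU rate. This is a sketch under the assumption that the trimming is symmetric (5 seconds from each end); `central_window` is a hypothetical helper, not a function shipped with the dataset.

```python
import numpy as np

SAMPLE_RATE_HZ = 100    # IMU rate stated in the description
RECORDED_SECONDS = 210  # length of a continuous recording
KEPT_SECONDS = 200      # central window that is retained


def central_window(samples: np.ndarray) -> np.ndarray:
    """Return the central KEPT_SECONDS of a recording sampled at SAMPLE_RATE_HZ."""
    keep = KEPT_SECONDS * SAMPLE_RATE_HZ
    start = (samples.shape[0] - keep) // 2
    return samples[start:start + keep]


# Stand-in for accelerometer readings: one row per sample, columns X, Y, Z.
recording = np.zeros((RECORDED_SECONDS * SAMPLE_RATE_HZ, 3))
trimmed = central_window(recording)
```

At 100 Hz, a 210-second recording has 21,000 samples, so the retained window covers 20,000 samples starting at sample 500.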
Data and Code for: "Universal Adaptive Normalization Scale (AMIS):...
search.dataone.org
dataverse.harvard.edu
Updated Nov 15, 2025
Cite
Kravtsov, Gennady (2025). Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System" [Dataset]. http://doi.org/10.7910/DVN/BISM0N
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/BISM0N
Dataset updated
Nov 15, 2025
Dataset provided by
Harvard Dataverse
Authors
Kravtsov, Gennady
Description
Dataset Title: Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System"

Description: This dataset contains source data and processing results for validating the Adaptive Multi-Interval Scale (AMIS) normalization method. Includes educational performance data (student grades), economic statistics (World Bank GDP), and Python implementation of the AMIS algorithm with graphical interface.

Contents:
- Source data: educational grades and GDP statistics
- AMIS normalization results (3, 5, 9, 17-point models)
- Comparative analysis with linear normalization
- Ready-to-use Python code for data processing

Applications:
- Educational data normalization and analysis
- Economic indicators comparison
- Development of unified metric systems
- Methodology research in data scaling

Technical info: Python code with pandas, numpy, scipy, matplotlib dependencies. Data in Excel format.
Amazon Sales Data Analysis Project1
kaggle.com
zip
Updated Sep 3, 2022
Cite
sushant pattanaik (2022). Amazon Sales Data Analysis Project1 [Dataset]. https://www.kaggle.com/datasets/sushantpattanaik/amazon-sales-data-analysis-project1/code
Explore at:
Available download formats: zip (844623 bytes)
Dataset updated
Sep 3, 2022
Authors
sushant pattanaik
Description
Problem Statement: Sales management has gained importance to meet increasing competition and the need for improved methods of distribution to reduce cost and to increase profits. Sales management today is the most important function in a commercial and business enterprise. We need to extract all the Amazon sales datasets, transform them using data cleaning and data preprocessing, and finally load them for analysis. We need to visualize sales trends month-wise, year-wise, and yearly-month-wise. Moreover, we need to find key metrics and factors and show meaningful relationships between attributes.

Approach: The main goal of the project is to find key metrics and factors and then show meaningful relationships between them based on different features available in the dataset.

Data Collection : Imported data from various datasets available in the project using Pandas library.

Data Cleaning : Removed missing values and created new features as per insights.

Data Preprocessing : Modified the structure of the data to make it more understandable and convenient for statistical analysis.

Data Analysis : Analyzed the dataset using Pandas, NumPy, Matplotlib and Seaborn.

Data Visualization : Plotted graphs to get insights about dependent and independent variables. Also used Tableau and PowerBI for data visualization.
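The month-wise, year-wise, and yearly-month-wise views mentioned in the problem statement map onto three pandas groupings. The sketch below uses made-up order records; `order_date` and `revenue` are hypothetical column names, since the dataset's actual schema is not shown here.

```python
import pandas as pd

# Toy order records; the column names are hypothetical.
orders = pd.DataFrame(
    {
        "order_date": pd.to_datetime(
            ["2021-01-15", "2021-01-20", "2021-02-03", "2022-01-09"]
        ),
        "revenue": [100.0, 150.0, 200.0, 50.0],
    }
)

year = orders["order_date"].dt.year.rename("year")
month = orders["order_date"].dt.month.rename("month")

# Month-wise: all Januaries together, all Februaries together, and so on.
month_wise = orders.groupby(month)["revenue"].sum()

# Year-wise: one total per calendar year.
year_wise = orders.groupby(year)["revenue"].sum()

# Yearly-month-wise: one total per (year, month) pair.
year_month_wise = orders.groupby([year, month])["revenue"].sum()
```

Each result is a Series that can be passed straight to `.plot()` for a trend chart once Matplotlib is installed.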
Supporting data for “Deep learning methods and applications to digital...
datahub.hku.hk
Updated Oct 3, 2024
Cite
Shichao Ma (2024). Supporting data for “Deep learning methods and applications to digital health” [Dataset]. http://doi.org/10.25442/hku.27060427.v1
Explore at:
Unique identifier
https://doi.org/10.25442/hku.27060427.v1
Dataset updated
Oct 3, 2024
Dataset provided by
HKU Data Repository
Authors
Shichao Ma
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This repository contains three folders, which hold either the data or the source code for the three main chapters (Chapters 3, 4, and 5) of the thesis:

1) Dataset (Chapter 3): It contains phonocardiogram signals (/PhysioNet2016) used in Chapters 3 and 4 as the upstream pretraining data. This is a public dataset. /SourceCode includes all the statistical analysis and visualization scripts for Chapter 3. Yaseen_dataset and PASCAL contain phonocardiogram signals with pathological features; Yaseen_dataset serves as the downstream fine-tuning dataset in Chapter 3, while the PASCAL dataset serves as the secondary testing dataset in Chapter 3.
2) Dataset (Chapter 4): /SourceCode includes all the statistical analysis and visualization scripts for Chapter 4.
3) Dataset (Chapter 5): PAD-UFES-20_processed contains dermatology images processed from the PAD-UFES-20 dataset, which is a public dataset and is used in Chapter 5. /SourceCode includes all the statistical analysis and visualization scripts for Chapter 5.

Several packages are mandatory to run the source code, including: Python > 3.6 (3.11 preferred), TensorFlow > 2.16, Keras > 3.3, NumPy > 1.26, Pandas > 2.2, SciPy > 1.13
Fracture toughness of mixed-mode anticracks in highly porous materials...
zenodo.org
bin, text/x-python +1
Updated Sep 2, 2024
Cite
Valentin Adam; Bastian Bergfeld; Philipp Weißgraeber; Alec van Herwijnen; Philipp L. Rosendahl (2024). Fracture toughness of mixed-mode anticracks in highly porous materials dataset and data processing [Dataset]. http://doi.org/10.5281/zenodo.11443644
Explore at:
Available download formats: text/x-python, txt, bin
Unique identifier
https://doi.org/10.5281/zenodo.11443644
Dataset updated
Sep 2, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Valentin Adam; Bastian Bergfeld; Philipp Weißgraeber; Alec van Herwijnen; Philipp L. Rosendahl
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the code and datasets used in the data analysis for "Fracture toughness of mixed-mode anticracks in highly porous materials". The analysis is implemented in Python, using Jupyter Notebooks.

Contents

main.ipynb: Jupyter notebook with the main data analysis workflow.

energy.py: Methods for the calculation of energy release rates.

regression.py: Methods for the regression analyses.

visualization.py: Methods for generating visualizations.

df_mmft.pkl: Pickled DataFrame with experimental data gathered in the present work.

df_legacy.pkl: Pickled DataFrame with literature data.

Prerequisites

To run the scripts and notebooks, you need:

Python 3.12 or higher

Jupyter Notebook or JupyterLab

Libraries: pandas, matplotlib, numpy, scipy, tqdm, uncertainties, weac

Setup

Download the zip file or clone this repository to your local machine.

Ensure that Python and Jupyter are installed.

Install required Python libraries using pip install -r requirements.txt.

Running the Analysis

Open the main.ipynb notebook in Jupyter Notebook or JupyterLab.

Execute the cells in sequence to reproduce the analysis.

Data Description

The data included in this repository is encapsulated in two pickled DataFrame files, df_mmft.pkl and df_legacy.pkl, which contain experimental measurements and corresponding parameters. Below are the descriptions for each column in these DataFrames:

df_mmft.pkl

Includes data such as experiment identifiers, datetime, and physical measurements like slope inclination and critical cut lengths.

exp_id: Unique identifier for each experiment.

datestring: Date of the experiment as a string.

datetime: Timestamp of the experiment.

bunker: Field site of the experiment. Bunker IDs 1 and 2 correspond to field sites A and B, respectively.

slope_incl: Inclination of the slope in degrees.

h_sledge_top: Distance from sample top surface to the sled in mm.

h_wl_top: Distance from sample top surface to weak layer in mm.

h_wl_notch: Distance from the notch root to the weak layer in mm.

rc_right: Critical cut length in mm, measured on the front side of the sample.

rc_left: Critical cut length in mm, measured on the back side of the sample.

rc: Mean of rc_right and rc_left.

densities: List of density measurements in kg/m^3 for each distinct slab layer of each sample.

densities_mean: Daily mean of densities.

layers: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.

layers_mean: Daily mean of layers.

surface_lineload: Surface line load of added surface weights in N/mm.

wl_thickness: Weak-layer thickness in mm.

notes: Additional notes regarding the experiment or observations.

L: Length of the slab–weak-layer assembly in mm.

df_legacy.pkl

Contains robustness data such as radii of curvature, slope inclination, and various geometrical measurements.

#: Record number.

rc: Critical cut length in mm.

slope_incl: Inclination of the slope in degrees.

h: Slab height in mm.

density: Mean slab density in kg/m^3.

L: Length of the slab–weak-layer assembly in mm.

collapse_height: Weak-layer height reduction through collapse.

layers_mean: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.

wl_thickness: Weak-layer thickness in mm.

surface_lineload: Surface line load from added weights in N/mm.

For more detailed information on the datasets, refer to the paper or the documentation provided within the Jupyter notebook.
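As an illustration of how the column descriptions above can be used, the following sketch checks the documented relation that rc is the mean of rc_right and rc_left. The helper name check_mean_cut_length, the tolerance, and the demo values are assumptions made for illustration. Only the column names and the file name df_mmft.pkl come from the description. Note that pickle files can execute code when loaded, so pandas.read_pickle should only be used on files from a trusted source.

```python
import pandas as pd


def check_mean_cut_length(df, tol=1e-9):
    """Return a boolean Series: True where rc equals the mean of rc_right and rc_left."""
    expected = (df["rc_right"] + df["rc_left"]) / 2
    return (df["rc"] - expected).abs() <= tol


# Tiny made-up frame with the documented column names (values are illustrative only).
demo = pd.DataFrame(
    {
        "exp_id": ["a", "b"],
        "rc_right": [300.0, 250.0],
        "rc_left": [320.0, 270.0],
        "rc": [310.0, 260.0],
    }
)
print(check_mean_cut_length(demo).all())

# With the real file in the working directory, the same check would be:
# df_mmft = pd.read_pickle("df_mmft.pkl")
# print(check_mean_cut_length(df_mmft).all())
```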

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

You are free to:

Share — copy and redistribute the material in any medium or format

Adapt — remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

Citation

Please cite the following paper if you use this analysis or the accompanying datasets:

Adam, V., Bergfeld, B., Weißgraeber, P., van Herwijnen, A., Rosendahl, P.L. Fracture toughness of mixed-mode anticracks in highly porous materials. Nature Communications 15, 7379 (2024). https://doi.org/10.1038/s41467-024-51491-7
Datasets for manuscript: A Graph-Based Modeling Framework for Tracing...
catalog.data.gov
Updated Nov 5, 2023
Cite
U.S. EPA Office of Research and Development (ORD) (2023). Datasets for manuscript: A Graph-Based Modeling Framework for Tracing Hydrological Pollutant Transport in Surface Waters [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-a-graph-based-modeling-framework-for-tracing-hydrological-pollutan
Explore at:
Dataset updated
Nov 5, 2023
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Description
Hydrology Graphs

This repository contains the code for the manuscript "A Graph Formulation for Tracing Hydrological Pollutant Transport in Surface Waters." There are three main folders containing code and data, and these are outlined below. We call the framework for building a graph of these hydrological systems "Hydrology Graphs".

Several of the data files for building this framework are large and cannot be stored on GitHub. To conserve space, the notebook get_and_unpack_data.ipynb or the script get_and_unpack_data.py can be used to download the data from the Watershed Boundary Dataset (WBD), the National Hydrography Dataset (NHDPlusV2), and the agricultural land dataset for the state of Wisconsin. The files WILakes.df and WIRivers.df mentioned in section 1 below are contained within the WI_lakes_rivers.zip folder, and the files of the 24k Hydro Waterbodies dataset are contained in a zip file under the directory DNR_data/Hydro_Waterbodies. These files can also be unpacked by running the corresponding cells in the notebook get_and_unpack_data.ipynb or the script get_and_unpack_data.py.

1. graph_construction
This folder contains the data and code for building a graph of the watershed-river-waterbody hydrological system. It uses data from the Watershed Boundary Dataset (link here) and the National Hydrography Dataset (link here) as a basis and builds a list of directed edges. We use NetworkX to build and visualize the list as a graph.

2. case_studies
This folder contains three .ipynb files for three separate case studies. These three case studies focus on how "Hydrology Graphs" can be used to analyze pollutant impacts in surface waters. Details of these case studies can be found in the manuscript above.

3. DNR_data
This folder contains data from the Wisconsin Department of Natural Resources (DNR) on water quality in several Wisconsin lakes. The data was obtained from here using the file Web_scraping_script.py. The original downloaded reports are found in the folder original_lake_reports. These reports were then cleaned and reformatted using the notebook DNR_data_filter.ipynb. The resulting cleaned reports are found in the Lakes folder. Each subfolder of the Lakes folder contains data for a single lake. The .csv file lake_index_WBIC.csv contains an index of the lake to which each numbered subfolder corresponds. In addition, we added the corresponding COMID in lake_index_WBIC_COMID.csv by matching the NHDPlusV2 data to the Wisconsin DNR's 24k Hydro Waterbodies dataset, which we downloaded from here. The DNR's reported data only matches lakes to a waterbody identification code (WBIC), so we use HYDROLakes (indexed by WBIC) to match to the COMID. This is also done in the DNR_data_filter.ipynb notebook.

Python Versions
The .py files in graph_construction/ were run using Python version 3.9.7. The scripts used the following packages and version numbers: geopandas (0.10.2), shapely (1.8.1.post1), tqdm (4.63.0), networkx (2.7.1), pandas (1.4.1), numpy (1.21.2).

This dataset is associated with the following publication: Cole, D.L., G.J. Ruiz-Mercado, and V.M. Zavala. A graph-based modeling framework for tracing hydrological pollutant transport in surface waters. COMPUTERS AND CHEMICAL ENGINEERING. Elsevier Science Ltd, New York, NY, USA, 179: 108457, (2023).
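The idea of tracing pollutant transport through a list of directed edges can be sketched as follows. This is not the repository's code: the repository uses NetworkX, whereas this sketch uses plain dictionaries and a breadth-first search so that it runs without extra packages (networkx.descendants would be the equivalent call on a DiGraph). The node names and the functions build_graph and downstream_nodes are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Map each upstream node to the list of nodes it drains into."""
    graph = defaultdict(list)
    for upstream, downstream in edges:
        graph[upstream].append(downstream)
    return graph


def downstream_nodes(graph, source):
    """Breadth-first search for every node reachable from source."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(source)  # a node is not counted as downstream of itself
    return seen


# Hypothetical directed edges (upstream, downstream); names are made up.
edges = [
    ("watershed_1", "river_A"),
    ("watershed_2", "river_A"),
    ("river_A", "lake_X"),
    ("lake_X", "river_B"),
    ("watershed_3", "river_B"),
]
graph = build_graph(edges)
print(sorted(downstream_nodes(graph, "watershed_1")))
```

A pollutant released in watershed_1 can reach every node in the returned set, while nodes outside it (for example watershed_3) are unaffected.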
Supplementary material: Burial Analysis on the Middle Bronze Age in the...
data.niaid.nih.gov
zenodo.org
Updated Dec 4, 2024
Cite
Laabs, Julian (2024). Supplementary material: Burial Analysis on the Middle Bronze Age in the Carpathian Basin (dataset and scripts) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7355008
Explore at:
Dataset updated
Dec 4, 2024
Dataset provided by
Kiel University
Authors
Laabs, Julian
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Pannonian Basin
Description
This is the supplementary material for the paper "Wealth Consumption, Sociopolitical Organization, and Change: A Perspective from Burial Analysis on the Middle Bronze Age in the Carpathian Basin" (available at https://doi.org/10.1515/opar-2022-0281). Please consult the publication for an in-depth description of the data, its context, and the method applied to the data, as well as references to primary sources. The data tables comprise the burial data of the Hungarian Middle Bronze Age cemeteries of Dunaújváros-Duna-dűlő, Dömsöd, Adony, Lovasberény, Csanytelek-Palé, Kelebia, Hernádkak, Gelej, Pusztaszikszó and Streda nad Bodrogom. The script "supplementary_material_2_wealth_index_calculation.py" provides the calculation of a wealth index, based on grave goods, for the provided data. The script "supplementary_material_3_population_estimation.py" models the living population of Dunaújváros-Duna-dűlő. Both can be run by double-clicking. Requirements to be installed to run the scripts: Python 3 (https://www.python.org/) with the packages numpy (https://numpy.org/), pandas (https://pandas.pydata.org/), matplotlib (https://matplotlib.org/), seaborn (https://seaborn.pydata.org/) and scipy (https://scipy.org/), all of which are included in Anaconda (Python distribution, https://www.anaconda.com/).
Student Performance and Learning Behavior Dataset for Educational Analytics
zenodo.org
bin, csv
Updated Aug 13, 2025
Cite
Kamal NAJEM (2025). Student Performance and Learning Behavior Dataset for Educational Analytics [Dataset]. http://doi.org/10.5281/zenodo.16459132
Explore at:
Available download formats: bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.16459132
Dataset updated
Aug 13, 2025
Dataset provided by
Zenodo http://zenodo.org/
Authors
Kamal NAJEM
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jul 26, 2025
Description
The dataset used in this study integrates quantitative data on student learning behaviors, engagement patterns, demographics, and academic performance. It was compiled by merging two publicly available Kaggle datasets, resulting in a combined file (“merged_dataset.csv”) containing 14,003 student records with 16 attributes. All records are anonymized and contain no personally identifiable information.

The dataset covers the following categories of variables:

Study behaviors and engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions

Resource access and learning environment: Resources, Internet, EduTech

Motivation and psychological factors: Motivation, StressLevel

Demographic information: Gender, Age (ranging from 18 to 30 years)

Learning preference classification: LearningStyle

Academic performance indicators: ExamScore, FinalGrade

In this study, “ExamScore” and “FinalGrade” served as the primary performance indicators. The remaining variables were used to derive behavioral and contextual profiles, which were clustered using unsupervised machine learning techniques.

The analysis and modeling were implemented in Python through a structured Jupyter Notebook (“Project.ipynb”), which included the following main steps:

Environment Setup – Import of essential libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, StatsModels, scikit-learn, imbalanced-learn) and visualization configuration.

Data Import and Integration – Loading the two source CSV files, harmonizing columns, removing irrelevant attributes, aligning formats, handling missing values, and merging them into a unified dataset (merged_dataset.csv).

Data Preprocessing –

Encoding categorical variables using LabelEncoder.

Scaling features using both z-score standardization (for statistical tests and PCA) and Min–Max normalization (for clustering).

Detecting and removing duplicates.

Clustering Analysis –

Applying K-Means clustering to segment learners into distinct profiles.

Determining the optimal number of clusters using the Elbow Method and Silhouette Score.

Evaluating cluster quality with internal metrics (Silhouette Score, Davies–Bouldin Index).

Dimensionality Reduction & Visualization – Using PCA for 2D/3D cluster visualization and feature importance exploration.

Mapping Clusters to Learning Styles – Associating each identified cluster with the most relevant learning style model based on feature patterns and alignment scores.

Statistical Analysis – Conducting ANOVA and regression to test for significant differences in performance between clusters.

Interpretation & Practical Recommendations – Analyzing cluster-specific characteristics and providing implications for adaptive and mobile learning integration.
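The two scaling schemes named in the preprocessing step can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the notebook's code: the function names zscore and minmax and the sample values are hypothetical, and the notebook may well use scikit-learn's scalers instead. Both functions assume that no column is constant, since a constant column would cause a division by zero.

```python
import numpy as np


def zscore(X):
    """Standardize each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def minmax(X):
    """Rescale each column linearly to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)


# Illustrative values only: the columns stand in for StudyHours and Attendance.
X = np.array([[5.0, 60.0], [10.0, 80.0], [15.0, 100.0]])
print(zscore(X))
print(minmax(X))
```

Z-scores put features on a common variance scale, which suits PCA and statistical tests, while Min–Max scaling bounds every feature to the same range so that no single feature dominates the Euclidean distances used by K-Means.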
Dataset for 'Novel miniaturised microbial electrosynthesis reactor: A study...
data.4tu.nl
zip
Updated May 23, 2025
Cite
Marika Zegers; Eva Augustijn; Geurt Jongbloed; Ludovic Jourdin (2025). Dataset for 'Novel miniaturised microbial electrosynthesis reactor: A study on replicability' [Dataset]. http://doi.org/10.4121/2d14c7a1-e707-4adb-939d-1556a19d9f76.v2
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.4121/2d14c7a1-e707-4adb-939d-1556a19d9f76.v2
Dataset updated
May 23, 2025
Dataset provided by
4TU.ResearchData
Authors
Marika Zegers; Eva Augustijn; Geurt Jongbloed; Ludovic Jourdin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 1, 2023 - Oct 1, 2024
Area covered
Van der Maasweg 9, the Netherlands, 2629HZ Delft
Dataset funded by
Delft University of Technology
Description
The analysed data and full scripts for the ECSA determination, onset potential determination, micro-CT data analysis and DTW distance calculation used in the paper 'Novel Miniaturised Microbial Electrosynthesis Reactor: A Study on Replicability'.
Python version 3.10.13 and the packages numpy, pandas, os, scipy.optimize, scipy.stats, sklearn.metrics, dtaidistance, math, skfda, kneed, and matplotlib.pyplot are required to run the .py files. Ensure all packages are installed before running the scripts. The data files required to run the code (.xlsx and .csv format) are included in the relevant folders.
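For readers unfamiliar with the DTW (dynamic time warping) distance mentioned above, the following is a minimal pure-Python sketch of the textbook recurrence. It is not the authors' script, which relies on the dtaidistance package. The function name dtw_distance and the choice of absolute difference as the local cost are assumptions made for illustration, so the values may differ from those produced by dtaidistance.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    n, m = len(a), len(b)
    # D[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# Two made-up series with the same shape, shifted by one step in time.
print(dtw_distance([0, 0, 1], [0, 1, 1]))
```

Because the alignment may stretch or compress the time axis, two series with the same shape but a time shift get a small distance, which is what makes DTW useful for comparing replicate time courses.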
Data set containing the energy landscapes for GPO and GPP tropocollagen...
zenodo.org
zip
Updated Sep 23, 2022
Cite
Rowe, James; Röder, Konstantin (2022). Data set containing the energy landscapes for GPO and GPP tropocollagen models under pulling forces [Dataset]. http://doi.org/10.5281/zenodo.7107608
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7107608
Dataset updated
Sep 23, 2022
Dataset provided by
University of Cambridge
Authors
Rowe, James; Röder, Konstantin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Energy landscapes (databases of minima and transition states) for GPO and GPP repeat collagen models under constant pulling forces as explored with OPTIM and PATHSAMPLE with an AMBER force field.

The systems are seven GPO or GPP per chain capped with ACE and NME.

The forces applied are 0 pN (F0), 10 pN (F1), 50 pN (F2), 100 pN (F3), 250 pN (F4), 500 pN (F5) and 750 pN (F6).

The folders contain numerous analysis scripts and graphs. Most of these assume Python with numpy and pandas, as well as cpptraj from AmberTools.
TrafficDator Madrid
data.niaid.nih.gov
Updated Apr 6, 2024
Cite
Gómez, Iván; Ilarri, Sergio (2024). TrafficDator Madrid [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10435153
Explore at:
Dataset updated
Apr 6, 2024
Dataset provided by
Universidad de Zaragoza
Authors
Gómez, Iván; Ilarri, Sergio
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Area covered
Madrid
Description
Data Origin: This dataset was generated using information from the Community of Madrid, including traffic data collected by multiple sensors located throughout the city, as well as work calendar and meteorological data, all provided by the Community.

Data Type: The data consists of traffic measurements in Madrid from June 1, 2022, to September 30, 2023. Each record includes information on the date, time, location (longitude and latitude), traffic intensity, and associated road and weather conditions (e.g., whether it is a working day, holiday, information on wind, temperature, precipitation, etc.).

Technical Details:

Data Preprocessing: We utilized advanced techniques for cleaning and normalizing traffic data collected from sensors across Madrid. This included handling outliers and missing values to ensure data quality.

Geospatial Analysis: We used GeoPandas and OSMnx to map traffic data points onto Madrid's road network. This process involved processing spatial attributes such as street lanes and speed limits to add context to the traffic data.

Meteorological Data Integration: We incorporated Madrid's weather data, including temperature, precipitation, and wind speed. Understanding the impact of weather conditions on traffic patterns was crucial in this step.

Traffic Data Clustering: We implemented K-Means clustering to identify patterns in traffic data. This approach facilitated the selection of representative sensors from each cluster, focusing on the most relevant data points.

Calendar Integration: We combined the traffic data with the work calendar to distinguish between different types of days. This provided insights into traffic variations on working days and holidays.

Comprehensive Analysis Approach: The analysis was conducted using Python libraries such as Pandas, NumPy, scikit-learn, and Shapely. It covered data from the years 2022 and 2023, focusing on the unique characteristics of the Madrid traffic dataset.
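The description states that clustering facilitated the selection of representative sensors but does not say how a representative was chosen. One plausible rule, shown here purely as an assumption, is to pick the sensor whose feature vector lies closest to the mean of its cluster. The function name representative_sensors, the sensor identifiers, and the feature values are hypothetical, and the cluster labels are assumed to come from an earlier K-Means step.

```python
import numpy as np


def representative_sensors(features, labels, sensor_ids):
    """For each cluster label, return the id of the sensor nearest the cluster mean."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    chosen = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        centroid = features[members].mean(axis=0)
        distances = np.linalg.norm(features[members] - centroid, axis=1)
        chosen[int(label)] = sensor_ids[members[distances.argmin()]]
    return chosen


# Made-up example: two traffic-intensity features per sensor, two clusters.
features = [
    [100, 110], [120, 130], [105, 118],   # low-traffic sensors
    [900, 950], [880, 940], [890, 946],   # high-traffic sensors
]
labels = [0, 0, 0, 1, 1, 1]
sensor_ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
print(representative_sensors(features, labels, sensor_ids))
```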

Data Structure: Each row of the dataset represents an individual measurement from a traffic sensor, including:

id: Unique sensor identifier.

date: Date and time of the measurement.

longitude and latitude: Geographical coordinates of the sensor.

day type: Information about the day being a working day, holiday, or festive Sunday.

intensity: Measured traffic intensity.

Additional data like wind, temperature, precipitation, etc.

Purpose of the Dataset: This dataset is useful for traffic analysis, urban mobility studies, infrastructure planning, and research related to traffic behavior under different environmental and temporal conditions.

Acknowledgment and Funding:

This dataset was obtained as part of the R&D project PID2020-113037RB-I00, funded by MCIN/AEI/10.13039/501100011033.

In addition to the NEAT-AMBIENCE project, support from the Department of Science, University, and Knowledge Society of the Government of Aragon (Government of Aragon: group reference T64_23R, COSMOS research group) is also acknowledged.

For academic and research purposes, please reference this dataset using its DOI for proper attribution and tracking.


