https://cdla.io/permissive-1-0/
The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.
Specifics of the Dataset:
The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.
One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:
- Certain columns are randomly selected to be populated with NaN values, simulating the common challenge of missing data.
- The proportion of missing values in each of these columns varies randomly between 1% and 70%.
- Statistical noise has been introduced into the dataset. For numerical values in some features, this noise follows a distribution with mean 0 and standard deviation 0.1.
- Categorical noise is introduced in some features, with categories randomly altered in about 1% of the rows.
- Outliers have also been embedded in the dataset, in line with the Interquartile Range (IQR) rule.
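As a rough illustration of how such challenges can be inspected with Pandas, the sketch below builds a small stand-in column (the name `feature_1`, the planted values, and the 10% missing share are assumptions, not taken from the dataset), measures its share of missing values, and flags outliers with the usual 1.5 × IQR fences.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical numeric column: 1,000 draws plus two planted outliers.
values = rng.normal(loc=50.0, scale=5.0, size=1000)
values[:2] = [150.0, -60.0]
feature = pd.Series(values, name="feature_1")

# Blank out 10% of the rows (never the planted outliers) to mimic missing data.
missing_idx = rng.choice(np.arange(2, 1000), size=100, replace=False)
feature.loc[missing_idx] = np.nan

# Share of missing values in the column.
missing_share = feature.isna().mean()

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
# quantile() skips NaN by default, and NaN comparisons evaluate to False.
q1, q3 = feature.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (feature < lower) | (feature > upper)

print(f"missing share: {missing_share:.0%}, flagged outliers: {int(is_outlier.sum())}")
```

The same two checks can be run column by column on the real data to decide where imputation or outlier handling is worth practicing.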
Context of the Dataset:
The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.
Sources of the Dataset:
The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.
This dataset contains comprehensive information related to the COVID-19 pandemic. It includes data collected from various reliable sources, providing insights into the spread, impact, and outcomes of the virus across different regions. The dataset is structured to facilitate analysis of trends such as infection rates, recovery statistics, death tolls, and vaccination progress.
The dataset requires cleaning and formatting on the user's end, which makes it good practice material if you are learning pandas and NumPy. It also serves as a vital resource for researchers, data scientists, healthcare professionals, and policy-makers aiming to gain a deeper understanding of the global pandemic and devise strategies for future preparedness.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits, and here we'll use the r/AskScience subreddit.
The dataset is extracted from the subreddit r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. The dataset contains information about the questions asked on the subreddit, including the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more (see the descriptions of individual columns below). The data was extracted using Python and Pushshift's API, with a little cleaning done using NumPy and pandas.
The dataset contains the following columns and descriptions:
author - Redditor name.
author_fullname - Redditor full name.
contest_mode - Contest mode [implement obscured scores and randomized sorting].
created_utc - Time the submission was created, represented in Unix time.
domain - Domain of the submission.
edited - Whether or not the post has been edited.
full_link - Link to the post on the subreddit.
id - ID of the submission.
is_self - Whether or not the submission is a self post (text-only).
link_flair_css_class - CSS class used to identify the flair.
link_flair_text - The link flair's text content.
locked - Whether or not the submission has been locked.
num_comments - The number of comments on the submission.
over_18 - Whether or not the submission has been marked as NSFW.
permalink - A permalink for the submission.
retrieved_on - Time ingested.
score - The number of upvotes for the submission.
description - Description of the submission.
spoiler - Whether or not the submission has been marked as a spoiler.
stickied - Whether or not the submission is stickied.
thumbnail - Thumbnail of the submission.
question - Question asked in the submission.
url - The URL the submission links to, or the permalink if a self post.
year - Year of the submission.
banned - Whether or not the submission was banned by a moderator.
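Because created_utc is stored as Unix time, a typical first step is converting it to timestamps. The sketch below uses three made-up rows (the values are hypothetical; only the column names come from the list above) and derives a year column comparable to the one described.

```python
import pandas as pd

# Hypothetical rows using column names from the list above.
posts = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "created_utc": [1451606400, 1577836800, 1653004800],
    "over_18": [False, False, True],
})

# Unix time counts seconds since 1970-01-01 UTC.
posts["created_at"] = pd.to_datetime(posts["created_utc"], unit="s", utc=True)
posts["year"] = posts["created_at"].dt.year

# Number of submissions per year, a starting point for trend analysis.
posts_per_year = posts.groupby("year").size()
print(posts_per_year)
```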
This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
The bank.csv dataset describes phone calls between customers and the customer care staff of a Portuguese banking institution. It records whether the customer takes up a scheme or product such as a bank term deposit. Most of the data is of the ‘yes’ or ‘no’ type.
The main goal is to predict if clients will subscribe to a term deposit or not.
Bank Client Data:
1 - age: (numeric)
2 - job: type of job (categorical: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)
3 - marital: marital status (categorical: divorced, married, single, unknown; note: divorced means either divorced or widowed)
4 - education: (categorical: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown)
5 - default: has credit in default? (categorical: no, yes, unknown)
6 - housing: has housing loan? (categorical: no, yes, unknown)
7 - loan: has personal loan? (categorical: no, yes, unknown)
Related with the Last Contact of the Current Campaign:
8 - contact: contact communication type (categorical: cellular, telephone)
9 - month: last contact month of year (categorical: jan, feb, mar, ..., nov, dec)
10 - day_of_week: last contact day of the week (categorical: mon, tue, wed, thu, fri)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other Attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: failure, nonexistent, success)
Social and Economic Context Attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output Variable (Desired Target):
21 - y (deposit): has the client subscribed a term deposit? (binary: yes, no); the column title was changed from 'y' to 'deposit'
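The notes on duration and pdays translate into a few preprocessing lines. The following is only a sketch on made-up rows (the values and the previously_contacted helper column are assumptions); the column names follow the attribute list above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the attribute list above.
calls = pd.DataFrame({
    "age": [41, 29, 56],
    "duration": [0, 310, 125],
    "pdays": [999, 6, 999],
    "deposit": ["no", "yes", "no"],
})

# duration is only known after the call, so drop it for a realistic model.
features = calls.drop(columns=["duration", "deposit"])

# 999 is a sentinel for "not previously contacted": keep that fact as a flag
# and replace the sentinel so it is not read as a real number of days.
features["previously_contacted"] = features["pdays"] != 999
features["pdays"] = features["pdays"].replace(999, np.nan)

# Encode the binary target as 1 (yes) / 0 (no).
target = calls["deposit"].map({"yes": 1, "no": 0})
```

Keeping the flag alongside the cleaned pdays column preserves the information that the sentinel carried, which a plain replacement would lose.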
🕰️ Exploratory Data Analysis of Luxury Watch Prices
Overview
This project analyzes a large dataset of luxury watches to understand which factors influence price. We focus on brand, movement type, case material, size, gender, and production year. All work was done in Python (Pandas, NumPy, Matplotlib/Seaborn) on Google Colab.
Dataset
Rows: ~172,000
Columns: 14
Unit of observation: one watch listing
Main columns
name – watch/listing title
price – listed… See the full description on the dataset page: https://huggingface.co/datasets/yotam22/watches.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The analysed data and complete scripts for the permutation tests and mixed linear regression models (MLRMs) used in the paper 'Identifying Key Drivers of Product Formation in Microbial Electrosynthesis with a Mixed Linear Regression Analysis'.
Running the .py files requires Python 3.10.13 with the packages numpy, pandas, scipy (scipy.optimize, scipy.stats), scikit-learn (sklearn.metrics), matplotlib (matplotlib.pyplot), statsmodels (statsmodels.formula.api), and seaborn, plus the standard-library os module. Ensure all packages are installed before running the scripts. Data files required to run the code (.xlsx and .csv format) are included in the relevant folders.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SalmonScan dataset is a collection of images of salmon fish, including healthy fish and infected fish. The dataset consists of two classes of images:
- Fresh salmon 🐟
- Infected salmon 🐠
This dataset is ideal for various computer vision tasks in machine learning and deep learning applications. Whether you are a researcher, developer, or student, the SalmonScan dataset offers a rich and diverse data source to support your projects and experiments.
So, dive in and explore the fascinating world of salmon health and disease!
The SalmonScan dataset (raw) consists of 24 fresh fish and 91 infected fish. [Due to server cleaning in the past, some raw datasets have been deleted]
The SalmonScan dataset (augmented) consists of approximately 1,208 images of salmon fish, classified into two classes:
Each class contains a representative and diverse collection of images, capturing a range of different perspectives, scales, and lighting conditions. The images have been carefully curated to ensure that they are of high quality and suitable for use in a variety of computer vision tasks.
Data Preprocessing
The input images were preprocessed to enhance their quality and suitability for further analysis. The following steps were taken:
- Resizing 📏: All the images were resized to a uniform size of 600 pixels in width and 250 pixels in height to ensure compatibility with the learning algorithm.
- Image Augmentation 📸: To overcome the small number of images, various image augmentation techniques were applied to the input images. These included:
  - Horizontal Flip ↩️: The images were horizontally flipped to create additional samples.
  - Vertical Flip ⬆️: The images were vertically flipped to create additional samples.
  - Rotation 🔄: The images were rotated to create additional samples.
  - Cropping 🪓: A portion of the image was randomly cropped to create additional samples.
  - Gaussian Noise 🌌: Gaussian noise was added to the images to create additional samples.
  - Shearing 🌆: The images were sheared to create additional samples.
  - Contrast Adjustment (Gamma) ⚖️: Gamma correction was applied to the images to adjust their contrast.
  - Contrast Adjustment (Sigmoid) ⚖️: The sigmoid function was applied to the images to adjust their contrast.
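Several of these augmentations can be sketched with NumPy alone. The example below is not the authors' pipeline: the random stand-in image, the noise level, the gamma value, and the 180° rotation angle are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a resized RGB image: 250 rows (height) x 600 columns (width).
image = rng.integers(0, 256, size=(250, 600, 3), dtype=np.uint8)

def add_gaussian_noise(img, sigma=10.0):
    """Add zero-mean Gaussian noise and clip back to the valid 0..255 range."""
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def adjust_gamma(img, gamma=0.8):
    """Gamma correction on intensities scaled to [0, 1]."""
    scaled = img.astype(np.float32) / 255.0
    return np.clip(np.rint(255.0 * scaled ** gamma), 0, 255).astype(np.uint8)

augmented = {
    "horizontal_flip": np.fliplr(image),      # mirror left-right
    "vertical_flip": np.flipud(image),        # mirror top-bottom
    "rotation_180": np.rot90(image, k=2),     # rotate in the image plane
    "gaussian_noise": add_gaussian_noise(image),
    "gamma": adjust_gamma(image),
}

for name, result in augmented.items():
    print(name, result.shape, result.dtype)
```

Each transform keeps the 250 × 600 shape, so the augmented samples can be stacked with the originals without further resizing.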
Usage
To use the salmon scan dataset in your ML and DL projects, follow these steps:
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Abstract
This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.
The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.
Methodology
1. Data Extraction:
2. Data Cleansing and Transformation:
3. Exploratory Data Analysis (EDA):
4. Modeling and Prediction:
5. Report Generation:
Results
- Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
- Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
- Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.
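A ranking of products and customers like the one described above can be sketched in Pandas as follows. The rows are invented, and the column names are assumed to loosely follow the classicmodels orderdetails and products tables; in the project the data would come from PostgreSQL rather than an inline DataFrame. The margin computed here is an overall margin, not necessarily the metric used in the project.

```python
import pandas as pd

# Hypothetical order lines; column names and values are assumptions.
order_lines = pd.DataFrame({
    "productCode": ["S10_1678", "S10_1678", "S12_1099", "S18_2238"],
    "customerNumber": [103, 112, 103, 114],
    "quantityOrdered": [30, 10, 5, 20],
    "priceEach": [80.0, 85.0, 150.0, 40.0],
    "buyPrice": [50.0, 50.0, 95.0, 30.0],
})

order_lines["revenue"] = order_lines["quantityOrdered"] * order_lines["priceEach"]
order_lines["profit"] = order_lines["quantityOrdered"] * (
    order_lines["priceEach"] - order_lines["buyPrice"]
)

# Rank products by total profit and customers by total revenue.
top_products = (
    order_lines.groupby("productCode")["profit"].sum().sort_values(ascending=False)
)
top_customers = (
    order_lines.groupby("customerNumber")["revenue"].sum().sort_values(ascending=False)
)

# Overall profit margin across all order lines.
profit_margin = order_lines["profit"].sum() / order_lines["revenue"].sum()
```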
Conclusions
This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.
Technologies Used
- Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
- Database: PostgreSQL
- Tools: Jupyter Notebook
- Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As computing power grows, so does the need for data processing, which uses a lot of energy in steps like cleaning and analyzing data. This study looks at the energy and time efficiency of four common Python libraries (Pandas, Vaex, Scikit-learn, and NumPy) tested on five datasets across 21 tasks. We compared the energy use of the newest and older versions of each library. Our findings show that no single library always saves the most energy. Instead, energy use varies by task type, how often tasks are done, and the library version. In some cases, newer versions use less energy, pointing to the need for more research on making data processing more energy-efficient.
A zip file accompanying this study contains the scripts, datasets, and a README file for guidance. This setup allows for easy replication and testing of the experiments described, helping to further analyze energy efficiency across different libraries and tasks.
Data Description
The CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.
Data Generation Procedures
The data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:
- A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100Hz.
- A ZED stereo camera capturing 1080p images at 25-30 fps.
- A synchronized computer acting as a data hub, receiving IMU data and storing images in real-time.
- A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.
Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.
Temporal and Spatial Scope
- The dataset contains a total of 472.03 minutes of recorded data.
- The IMU sensors operate at 100Hz, while the stereo camera captures images at 25-30Hz.
- Data was collected from 12 participants, each performing all 19 activities multiple times.
- The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.
Dataset Components
The dataset is organized into JSON and PNG files, structured hierarchically:
- IMU Data: Stored in JSON files, containing:
  - Samsung Linear Acceleration Sensor (X, Y, Z values, 100Hz)
  - LSM6DSO Gyroscope (X, Y, Z values, 100Hz)
  - Samsung Rotation Vector (X, Y, Z, W quaternion values, 100Hz)
  - Samsung HR Sensor (heart rate, 1Hz)
  - OPT3007 Light Sensor (ambient light levels, 5Hz)
- Stereo Camera Images: High-resolution 1920×1080 PNG files from left and right cameras.
- Synchronization: Each IMU data record and image is timestamped for precise alignment.
Data Structure
The dataset is divided into continuous and instantaneous activities:
- Continuous Activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained.
- Instantaneous Activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution.
The dataset is structured as:
/continuous/subject_id/activity_name/
    /camera_a/ → Left camera images
    /camera_b/ → Right camera images
    /sensors/ → JSON files with IMU data
/instantaneous/subject_id/activity_name/repetition_id/
    /camera_a/
    /camera_b/
    /sensors/
Data Quality & Missing Data
- The smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss.
- Synchronization latency between the smartwatch and the computer is negligible.
- Not all IMU samples have corresponding images due to different recording rates.
- Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.
Error Ranges & Limitations
- Sensor data may contain noise due to minor hand movements.
- The heart rate sensor operates at 1Hz, limiting its temporal resolution.
- Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.
File Formats & Software Compatibility
- IMU data is stored in JSON format, readable with Python’s json library.
- Images are in PNG format, compatible with all standard image processing tools.
- Recommended libraries for data analysis:
  - Python: numpy, pandas, scikit-learn, tensorflow, pytorch
  - Visualization: matplotlib, seaborn
  - Deep Learning: Keras, PyTorch
Potential Applications
- Development of activity recognition models in educational settings.
- Study of student engagement based on movement patterns.
- Investigation of sensor fusion techniques combining visual and IMU data.
This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.
Citation
If you find this project helpful for your research, please cite our work using the following bibtex entry:
@misc{marquezcarpintero2025caddiinclassactivitydetection,
  title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso},
  year={2025},
  eprint={2503.02853},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.02853},
}
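Since the IMU and the camera record at different rates, pairing each image with its nearest IMU sample is a common first step for sensor fusion on data of this kind. The sketch below is an assumption-laden illustration: it uses synthetic timestamps in seconds rather than the dataset's JSON records, whose field names are not described here.

```python
import numpy as np

# Hypothetical timestamps in seconds: IMU at 100 Hz, camera at 25 Hz.
imu_t = np.arange(0.0, 2.0, 0.01)
frame_t = np.arange(0.0, 2.0, 0.04) + 0.003  # small offset between streams

def nearest_index(reference, query):
    """For each query time, return the index of the closest reference time.

    `reference` must be sorted in ascending order.
    """
    right = np.searchsorted(reference, query)
    right = np.clip(right, 1, len(reference) - 1)
    left = right - 1
    pick_left = (query - reference[left]) <= (reference[right] - query)
    return np.where(pick_left, left, right)

frame_to_imu = nearest_index(imu_t, frame_t)

# Largest time gap between an image and its matched IMU sample.
max_gap = np.abs(imu_t[frame_to_imu] - frame_t).max()
print(f"{len(frame_t)} frames matched, max gap {max_gap * 1000:.1f} ms")
```

With Pandas, pd.merge_asof with direction="nearest" offers a similar alignment on timestamp columns.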
Dataset Title: Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System"
Description: This dataset contains source data and processing results for validating the Adaptive Multi-Interval Scale (AMIS) normalization method. It includes educational performance data (student grades), economic statistics (World Bank GDP), and a Python implementation of the AMIS algorithm with a graphical interface.
Contents:
- Source data: educational grades and GDP statistics
- AMIS normalization results (3, 5, 9, 17-point models)
- Comparative analysis with linear normalization
- Ready-to-use Python code for data processing
Applications:
- Educational data normalization and analysis
- Economic indicators comparison
- Development of unified metric systems
- Methodology research in data scaling
Technical info: Python code with pandas, numpy, scipy, matplotlib dependencies. Data in Excel format.
Problem Statement: Sales management has gained importance to meet increasing competition and the need for improved methods of distribution to reduce cost and to increase profits. Sales management today is the most important function in a commercial and business enterprise. We need to extract all the Amazon sales datasets, transform them using data cleaning and data preprocessing, and finally load them for analysis. We need to visualize the sales trend month-wise, year-wise, and yearly-month-wise. Moreover, we need to find key metrics and factors and show meaningful relationships between attributes.
Approach
The main goal of the project is to find key metrics and factors and then show meaningful relationships between them based on different features available in the dataset.
Data Collection: Imported data from the various datasets available in the project using the Pandas library.
Data Cleaning: Removed missing values and created new features as per insights.
Data Preprocessing: Modified the structure of the data to make it more understandable and more convenient for statistical analysis.
Data Analysis: Analyzed the dataset using Pandas, NumPy, Matplotlib, and Seaborn.
Data Visualization: Plotted graphs to get insights about dependent and independent variables. Also used Tableau and Power BI for data visualization.
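The month-wise, year-wise, and yearly-month-wise views mentioned in the problem statement map naturally onto Pandas group-bys. The sketch below uses invented records; the column names order_date and amount are assumptions and may differ from the actual Amazon sales files.

```python
import pandas as pd

# Hypothetical sales records; the column names are assumed.
sales = pd.DataFrame({
    "order_date": ["2021-01-15", "2021-01-20", "2021-02-03", "2022-01-09"],
    "amount": [120.0, 80.0, 50.0, 200.0],
})
sales["order_date"] = pd.to_datetime(sales["order_date"])
sales["year"] = sales["order_date"].dt.year
sales["month"] = sales["order_date"].dt.month

by_year = sales.groupby("year")["amount"].sum()                      # year-wise
by_month = sales.groupby("month")["amount"].sum()                    # month-wise
by_year_month = sales.groupby(["year", "month"])["amount"].sum()     # yearly-month-wise

print(by_year_month)
```

Each of the three resulting Series can be passed to a plotting call such as `.plot(kind="bar")` to produce the trend charts.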
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This repository contains three folders, which hold either the data or the source code for the three main chapters (Chapters 3, 4, and 5) of the thesis. The folders are:
1) Dataset (Chapter 3): This folder contains phonocardiogram signals (/PhysioNet2016) used in Chapters 3 and 4 as the upstream pretraining data. This is a public dataset. /SourceCode includes all the statistical analysis and visualization scripts for Chapter 3. Yaseen_dataset and PASCAL contain phonocardiogram signals with pathological features; Yaseen_dataset serves as the downstream finetuning dataset in Chapter 3, while the PASCAL dataset serves as the secondary testing dataset in Chapter 3.
2) Dataset (Chapter 4): /SourceCode includes all the statistical analysis and visualization scripts for Chapter 4.
3) Dataset (Chapter 5): PAD-UFES-20_processed contains dermatology images processed from the PAD-UFES-20 dataset, which is a public dataset. The dataset is used in Chapter 5. /SourceCode includes all the statistical analysis and visualization scripts for Chapter 5.
Several packages are required to run the source code, including: Python > 3.6 (3.11 preferred), TensorFlow > 2.16, Keras > 3.3, NumPy > 1.26, Pandas > 2.2, SciPy > 1.13
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the code and datasets used in the data analysis for "Fracture toughness of mixed-mode anticracks in highly porous materials". The analysis is implemented in Python, using Jupyter Notebooks.
main.ipynb: Jupyter notebook with the main data analysis workflow.
energy.py: Methods for the calculation of energy release rates.
regression.py: Methods for the regression analyses.
visualization.py: Methods for generating visualizations.
df_mmft.pkl: Pickled DataFrame with experimental data gathered in the present work.
df_legacy.pkl: Pickled DataFrame with literature data.
pandas, matplotlib, numpy, scipy, tqdm, uncertainties, weac
pip install -r requirements.txt.
main.ipynb notebook in Jupyter Notebook or JupyterLab.
df_mmft.pkl and df_legacy.pkl, which contain experimental measurements and corresponding parameters. Below are the descriptions for each column in these DataFrames:
df_mmft.pkl
exp_id: Unique identifier for each experiment.
datestring: Date of the experiment as a string.
datetime: Timestamp of the experiment.
bunker: Field site of the experiment. Bunker IDs 1 and 2 correspond to field sites A and B, respectively.
slope_incl: Inclination of the slope in degrees.
h_sledge_top: Distance from sample top surface to the sled in mm.
h_wl_top: Distance from sample top surface to weak layer in mm.
h_wl_notch: Distance from the notch root to the weak layer in mm.
rc_right: Critical cut length in mm, measured on the front side of the sample.
rc_left: Critical cut length in mm, measured on the back side of the sample.
rc: Mean of rc_right and rc_left.
densities: List of density measurements in kg/m^3 for each distinct slab layer of each sample.
densities_mean: Daily mean of densities.
layers: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
layers_mean: Daily mean of layers.
surface_lineload: Surface line load of added surface weights in N/mm.
wl_thickness: Weak-layer thickness in mm.
notes: Additional notes regarding the experiment or observations.
L: Length of the slab–weak-layer assembly in mm.
df_legacy.pkl
#: Record number.
rc: Critical cut length in mm.
slope_incl: Inclination of the slope in degrees.
h: Slab height in mm.
density: Mean slab density in kg/m^3.
L: Length of the slab–weak-layer assembly in mm.
collapse_height: Weak-layer height reduction through collapse.
layers_mean: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
wl_thickness: Weak-layer thickness in mm.
surface_lineload: Surface line load from added weights in N/mm.
For more detailed information on the datasets, refer to the paper or the documentation provided within the Jupyter notebook.
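To make the column relationships concrete, the sketch below rebuilds two of the documented derived quantities on invented rows: rc as the mean of rc_right and rc_left, and the bunker-to-field-site mapping. The values and the field_site helper column are assumptions; the real data would instead be loaded with pd.read_pickle("df_mmft.pkl"), which should only be done with pickle files from a trusted source.

```python
import pandas as pd

# Hypothetical rows with column names from the df_mmft.pkl description.
df = pd.DataFrame({
    "exp_id": ["e1", "e2", "e3"],
    "bunker": [1, 2, 1],
    "rc_right": [210.0, 180.0, 250.0],
    "rc_left": [190.0, 200.0, 230.0],
})

# rc is documented as the mean of the front- and back-side cut lengths (mm).
df["rc"] = df[["rc_right", "rc_left"]].mean(axis=1)

# Bunker IDs 1 and 2 correspond to field sites A and B.
df["field_site"] = df["bunker"].map({1: "A", 2: "B"})

mean_rc_by_site = df.groupby("field_site")["rc"].mean()
print(mean_rc_by_site)
```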
Hydrology Graphs
This repository contains the code for the manuscript "A Graph Formulation for Tracing Hydrological Pollutant Transport in Surface Waters." There are three main folders containing code and data, and these are outlined below. We call the framework for building a graph of these hydrological systems "Hydrology Graphs".
Several of the data files for building this framework are large and cannot be stored on GitHub. To conserve space, the notebook get_and_unpack_data.ipynb or the script get_and_unpack_data.py can be used to download the data from the Watershed Boundary Dataset (WBD), the National Hydrography Dataset (NHDPlusV2), and the agricultural land dataset for the state of Wisconsin. The files WILakes.df and WIRivers.df mentioned in section 1 below are contained within the WI_lakes_rivers.zip folder, and the files of the 24k Hydro Waterbodies dataset are contained in a zip file under the directory DNR_data/Hydro_Waterbodies. These files can also be unpacked by running the corresponding cells in the notebook get_and_unpack_data.ipynb or get_and_unpack_data.py.
1. graph_construction
This folder contains the data and code for building a graph of the watershed-river-waterbody hydrological system. It uses data from the Watershed Boundary Dataset (link here) and the National Hydrography Dataset (link here) as a basis and builds a list of directed edges. We use NetworkX to build and visualize the list as a graph.
2. case_studies
This folder contains three .ipynb files for three separate case studies. These three case studies focus on how "Hydrology Graphs" can be used to analyze pollutant impacts in surface waters. Details of these case studies can be found in the manuscript above.
3. DNR_data
This folder contains data from the Wisconsin Department of Natural Resources (DNR) on water quality in several Wisconsin lakes. The data was obtained from here using the file Web_scraping_script.py. The original downloaded reports are found in the folder original_lake_reports. These reports were then cleaned and reformatted using the script DNR_data_filter.ipynb. The resulting, cleaned reports are found in the Lakes folder. Each subfolder of the Lakes folder contains data for a single lake. The two .csvs lake_index_WBIC.csv contain an index of which lake each numbered subfolder corresponds to. In addition, we added the corresponding COMID in lake_index_WBIC_COMID.csv by matching the NHDPlusV2 data to the Wisconsin DNR's 24k Hydro Waterbodies dataset, which we downloaded from here. The DNR's reported data only matches lakes to a waterbody identification code (WBIC), so we use HYDROLakes (indexed by WBIC) to match to the COMID. This is done in the DNR_data_filter.ipynb script as well.
Python Versions
The .py files in graph_construction/ were run using Python version 3.9.7. The scripts used the following packages and version numbers: geopandas (0.10.2), shapely (1.8.1.post1), tqdm (4.63.0), networkx (2.7.1), pandas (1.4.1), numpy (1.21.2).
This dataset is associated with the following publication: Cole, D.L., G.J. Ruiz-Mercado, and V.M. Zavala. A graph-based modeling framework for tracing hydrological pollutant transport in surface waters. COMPUTERS AND CHEMICAL ENGINEERING. Elsevier Science Ltd, New York, NY, USA, 179: 108457, (2023).
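The idea of tracing pollutant transport over a list of directed edges can be illustrated without any of the repository's data. The sketch below is not the repository's code: it uses only the Python standard library instead of NetworkX, and the node names are placeholders rather than WBD or NHDPlusV2 identifiers.

```python
from collections import defaultdict, deque

# Hypothetical directed edges (upstream -> downstream); the node names are
# placeholders, not identifiers from WBD or NHDPlusV2.
edges = [
    ("watershed_A", "river_1"),
    ("watershed_B", "river_2"),
    ("river_1", "lake_X"),
    ("river_2", "lake_X"),
    ("lake_X", "river_3"),
]

# Adjacency list: for each node, the nodes directly downstream of it.
downstream = defaultdict(list)
for source, target in edges:
    downstream[source].append(target)

def reachable_from(start):
    """Return every node a pollutant released at `start` could reach."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in downstream.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

affected = reachable_from("watershed_A")
print(sorted(affected))
```

With NetworkX, building a DiGraph from the same edge list and calling networkx.descendants on the start node returns the same set.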
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the supplementary material of the paper "Wealth Consumption, Sociopolitical Organization, and Change: A Perspective from Burial Analysis on the Middle Bronze Age in the Carpathian Basin" (accessible via DOI: https://doi.org/10.1515/opar-2022-0281). Please consult the publication for an in-depth description of the data, its context, and the method applied to the data, as well as references to primary sources. The data tables comprise the burial data of the Hungarian Middle Bronze Age cemeteries of Dunaújváros-Duna-dűlő, Dömsöd, Adony, Lovasberény, Csanytelek-Palé, Kelebia, Hernádkak, Gelej, Pusztaszikszó and Streda nad Bodrogom. The script "supplementary_material_2_wealth_index_calculation.py" provides the calculation of a wealth index, based on grave goods, for the provided data. The script "supplementary_material_3_population_estimation.py" models the living population of Dunaújváros-Duna-dűlő. Both can be run by double-click. Requirements to be installed to run the scripts: Python 3 (https://www.python.org/) with the packages numpy (https://numpy.org/), pandas (https://pandas.pydata.org/), matplotlib (https://matplotlib.org/), seaborn (https://seaborn.pydata.org/) and scipy (https://scipy.org/); all are included in Anaconda (Python distribution, https://www.anaconda.com/).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study integrates quantitative data on student learning behaviors, engagement patterns, demographics, and academic performance. It was compiled by merging two publicly available Kaggle datasets, resulting in a combined file (“merged_dataset.csv”) containing 14,003 student records with 16 attributes. All records are anonymized and contain no personally identifiable information.
The dataset covers the following categories of variables:
Resource access and learning environment: Resources, Internet, EduTech
Motivation and psychological factors: Motivation, StressLevel
Demographic information: Gender, Age (ranging from 18 to 30 years)
Learning preference classification: LearningStyle
Academic performance indicators: ExamScore, FinalGrade
In this study, “ExamScore” and “FinalGrade” served as the primary performance indicators. The remaining variables were used to derive behavioral and contextual profiles, which were clustered using unsupervised machine learning techniques.
The analysis and modeling were implemented in Python through a structured Jupyter Notebook (“Project.ipynb”), which included the following main steps:
Environment Setup – Import of essential libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, StatsModels, scikit-learn, imbalanced-learn) and visualization configuration.
Data Import and Integration – Loading the two source CSV files, harmonizing columns, removing irrelevant attributes, aligning formats, handling missing values, and merging them into a unified dataset (merged_dataset.csv).
Data Preprocessing –
Encoding categorical variables using LabelEncoder.
Scaling features using both z-score standardization (for statistical tests and PCA) and Min–Max normalization (for clustering).
Detecting and removing duplicates.
Clustering Analysis –
Applying K-Means clustering to segment learners into distinct profiles.
Determining the optimal number of clusters using the Elbow Method and Silhouette Score.
Evaluating cluster quality with internal metrics (Silhouette Score, Davies–Bouldin Index).
Dimensionality Reduction & Visualization – Using PCA for 2D/3D cluster visualization and feature importance exploration.
Mapping Clusters to Learning Styles – Associating each identified cluster with the most relevant learning style model based on feature patterns and alignment scores.
Statistical Analysis – Conducting ANOVA and regression to test for significant differences in performance between clusters.
Interpretation & Practical Recommendations – Analyzing cluster-specific characteristics and providing implications for adaptive and mobile learning integration.
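The preprocessing and clustering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from "Project.ipynb": the toy data frame, the candidate range of k, and the random seeds are assumptions, and only the column names and the named techniques (LabelEncoder, Min-Max normalization, K-Means, Silhouette Score, Davies-Bouldin Index) come from the description.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical stand-in for a few columns of merged_dataset.csv.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "Motivation": rng.integers(0, 3, n),
    "StressLevel": rng.integers(0, 3, n),
    "Age": rng.integers(18, 31, n),  # ages 18 to 30
    "Gender": rng.choice(["Female", "Male"], n),
})

# Remove duplicate records, then encode the categorical column as integers.
df = df.drop_duplicates().reset_index(drop=True)
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])

# Min-Max normalization puts every feature on the [0, 1] range for clustering.
X = MinMaxScaler().fit_transform(df)

# Fit K-Means for several k and record the internal quality metrics.
scores = {}
for k in range(2, 7):  # assumed candidate range
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        "inertia": model.inertia_,  # plotted against k for the Elbow Method
        "silhouette": silhouette_score(X, model.labels_),
        "davies_bouldin": davies_bouldin_score(X, model.labels_),
    }

# One simple selection rule: the k with the highest Silhouette Score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print(best_k, scores[best_k])
```

In practice the elbow in the inertia curve and the Silhouette Score may point to different values of k, so the final choice is usually a judgment call informed by both.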
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains the analysed data and full scripts for the ECSA determination, onset potential determination, micro-CT data analysis and DTW distance calculation used in the paper 'Novel Miniaturised Microbial Electrosynthesis Reactor: A Study on Replicability'.
Running the .py files requires Python 3.10.13 with the following packages and modules: numpy, pandas, os, scipy.optimize, scipy.stats, sklearn.metrics, dtaidistance, math, skfda, kneed and matplotlib.pyplot. Ensure all packages are installed before running the scripts. The data files required to run the code (.xlsx and .csv format) are included in the relevant folders.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Energy landscapes (databases of minima and transition states) for GPO and GPP repeat collagen models under constant pulling forces as explored with OPTIM and PATHSAMPLE with an AMBER force field.
Each system has seven GPO or GPP repeats per chain, capped with ACE and NME.
The forces applied are 0 pN (F0), 10 pN (F1), 50 pN (F2), 100 pN (F3), 250 pN (F4), 500 pN (F5) and 750 pN (F6).
The folders contain numerous analysis scripts and graphs. Most of these assume Python with numpy and pandas, as well as cpptraj from AmberTools.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Data Origin: This dataset was generated using information from the Community of Madrid, including traffic data collected by multiple sensors located throughout the city, as well as work calendar and meteorological data, all provided by the Community.
Data Type: The data consists of traffic measurements in Madrid from June 1, 2022, to September 30, 2023. Each record includes information on the date, time, location (longitude and latitude), traffic intensity, and associated road and weather conditions (e.g., whether it is a working day, holiday, information on wind, temperature, precipitation, etc.).
Technical Details:
Data Preprocessing: We utilized advanced techniques for cleaning and normalizing traffic data collected from sensors across Madrid. This included handling outliers and missing values to ensure data quality.
Geospatial Analysis: We used GeoPandas and OSMnx to map traffic data points onto Madrid's road network. This process involved processing spatial attributes such as street lanes and speed limits to add context to the traffic data.
Meteorological Data Integration: We incorporated Madrid's weather data, including temperature, precipitation, and wind speed. Understanding the impact of weather conditions on traffic patterns was crucial in this step.
Traffic Data Clustering: We implemented K-Means clustering to identify patterns in traffic data. This approach facilitated the selection of representative sensors from each cluster, focusing on the most relevant data points.
Calendar Integration: We combined the traffic data with the work calendar to distinguish between different types of days. This provided insights into traffic variations on working days and holidays.
Comprehensive Analysis Approach: The analysis was conducted using Python libraries such as Pandas, NumPy, scikit-learn, and Shapely. It covered data from the years 2022 and 2023, focusing on the unique characteristics of the Madrid traffic dataset.
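As an illustration of the clustering step, the sketch below groups sensors with K-Means and keeps, for each cluster, the sensor closest to its centroid. The description does not say which features were clustered or how representatives were chosen, so the hourly intensity profiles, the number of clusters, and the nearest-to-centroid rule are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical input: one row per sensor, one column per hour of the day,
# holding that sensor's mean traffic intensity for the hour.
rng = np.random.default_rng(1)
profiles = pd.DataFrame(
    rng.gamma(shape=2.0, scale=100.0, size=(40, 24)),
    index=[f"sensor_{i}" for i in range(40)],
)

X = profiles.to_numpy()
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)  # assumed k

# Distance from each sensor to the centroid of the cluster it belongs to.
labels = kmeans.labels_
dist = np.linalg.norm(X - kmeans.cluster_centers_[labels], axis=1)

# Within each cluster, keep the sensor with the smallest distance.
assignments = pd.DataFrame({"cluster": labels, "dist": dist}, index=profiles.index)
representatives = assignments.groupby("cluster")["dist"].idxmin()
print(representatives)
```

Restricting the search to each cluster's own members guarantees that every representative actually belongs to the cluster it stands for.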
Data Structure: Each row of the dataset represents an individual measurement from a traffic sensor, including:
id: Unique sensor identifier.
date: Date and time of the measurement.
longitude and latitude: Geographical coordinates of the sensor.
day type: Information about the day being a working day, holiday, or festive Sunday.
intensity: Measured traffic intensity.
Additional data like wind, temperature, precipitation, etc.
Purpose of the Dataset: This dataset is useful for traffic analysis, urban mobility studies, infrastructure planning, and research related to traffic behavior under different environmental and temporal conditions.
Acknowledgment and Funding:
This dataset was obtained as part of the R&D project PID2020-113037RB-I00, funded by MCIN/AEI/10.13039/501100011033.
In addition to the NEAT-AMBIENCE project, support from the Department of Science, University, and Knowledge Society of the Government of Aragon (Government of Aragon: group reference T64_23R, COSMOS research group) is also acknowledged.
For academic and research purposes, please reference this dataset using its DOI for proper attribution and tracking.
https://cdla.io/permissive-1-0/
The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.
Specifics of the Dataset:
The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.
One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:
- Certain columns are randomly selected to be populated with NaN values, simulating the common challenge of missing data.
- The proportion of missing values in each of these columns varies randomly between 1% and 70%.
- Statistical noise has been introduced in the dataset. For numerical values in some features, this noise follows a distribution with mean 0 and standard deviation 0.1.
- Categorical noise has been introduced in some features, with categories randomly altered in about 1% of the rows.
- Outliers have also been embedded in the dataset, consistent with the Interquartile Range (IQR) rule.
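A learner might start by measuring these challenges. The sketch below computes the share of missing values per column and flags outliers with the IQR rule. The column names and values are made up for the example, and the multiplier 1.5 is the conventional choice for the IQR rule rather than a value stated in the dataset description.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the real dataset has 5000 rows and 20 columns.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, np.nan, 3.0, 2.5, 40.0, 1.5, np.nan],
    "feature_b": [10.0, 11.0, 9.5, 10.5, np.nan, 10.2, 9.8, 10.1],
})

# Fraction of NaN values in each column.
missing_share = df.isna().mean()


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


outlier_mask = df.apply(iqr_outliers)
print(missing_share)
print(outlier_mask.sum())
```

Quantiles are computed on the non-missing values, so missing data does not shift the outlier thresholds.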
Context of the Dataset:
The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.
Sources of the Dataset:
The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.
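The generation script itself is not provided, but a process of this kind could look roughly like the sketch below. Everything specific here is an assumption for illustration: the two column names, the normal distribution and its parameters, the Gaussian shape of the noise, and the seed. Only the row count, the noise standard deviation of 0.1, the roughly 1% categorical alteration rate, and the 1% to 70% missing range come from the description.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumed seed, for reproducibility
n_rows = 5000
categories = ["A", "B", "C"]  # hypothetical category labels

df = pd.DataFrame({
    "num_feature": rng.normal(loc=50.0, scale=10.0, size=n_rows),
    "cat_feature": rng.choice(categories, size=n_rows),
})

# Statistical noise with mean 0 and standard deviation 0.1 (assumed Gaussian).
df["num_feature"] += rng.normal(loc=0.0, scale=0.1, size=n_rows)

# Categorical noise: redraw the category in about 1% of the rows.
flip = rng.random(n_rows) < 0.01
df.loc[flip, "cat_feature"] = rng.choice(categories, size=int(flip.sum()))

# Missing data: blank out a random share of the column, between 1% and 70%.
frac = rng.uniform(0.01, 0.70)
n_missing = int(round(frac * n_rows))
missing_idx = rng.choice(n_rows, size=n_missing, replace=False)
df.loc[missing_idx, "num_feature"] = np.nan

print(df["num_feature"].isna().mean())
```

Note that redrawing a category can return the original value, so the share of rows whose category actually changes is somewhat below 1% in this sketch.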