Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
About the dataset:
- Domain: Marketing
- Project: User Profiling and Segmentation
- Dataset: user_profile_for_ads.csv
- Dataset type: Excel data
- Dataset size: 16k+ records
KPIs:
1. Distribution of key demographic variables:
   a. Count of Age
   b. Count of Gender
   c. Count of Education Level
   d. Count of Income Level
   e. Count of Device Usage
2. Understanding online behavior:
   a. Count of Time Spent Online (hrs/Weekday)
   b. Count of Time Spent Online (hrs/Weekend)
3. Ad interaction metrics:
   a. Count of Likes and Reactions
   b. Count of Click-Through Rates (CTR)
   c. Count of Conversion Rate
   d. Count of Ad Interaction Time (secs)
   e. Count of Ad Interaction Time by Top Interests

Process:
1. Understanding the problem
2. Data collection
3. Exploring and analyzing the data
4. Interpreting the results
The associated analysis uses the following Python libraries and functions: pandas, matplotlib, seaborn, isnull, set_style, suptitle, countplot, palette, tight_layout, figsize, histplot, barplot, sklearn, StandardScaler, OneHotEncoder, ColumnTransformer, Pipeline, KMeans, cluster_means, groupby, numpy, radar_df.
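The keyword list above points to a standard scikit-learn preprocessing and clustering workflow. A minimal sketch of such a segmentation pipeline is shown below; the column names are hypothetical placeholders inferred from the KPIs, not the actual schema of user_profile_for_ads.csv.

# Hedged sketch of a K-Means segmentation pipeline suggested by the keywords above.
# Column names are hypothetical; adjust them to the actual user_profile_for_ads.csv schema.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

df = pd.read_csv("user_profile_for_ads.csv")

numeric_cols = ["Time Spent Online (hrs/Weekday)", "Time Spent Online (hrs/Weekend)"]   # assumed
categorical_cols = ["Gender", "Education Level", "Income Level", "Device Usage"]        # assumed

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline = Pipeline([("prep", preprocess), ("kmeans", KMeans(n_clusters=5, random_state=42))])
df["segment"] = pipeline.fit_predict(df)

# Per-segment means of the numeric columns (the "cluster_means" referenced above).
cluster_means = df.groupby("segment")[numeric_cols].mean()
print(cluster_means)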
We created a semi-synthetic wind profile from wind turbine data and converted it to current and potential profiles for PEM and alkaline water electrolysis cells with maximum power outputs of 40 W and 4 W, respectively. We then conducted dynamic electrolysis with these profiles for up to 961 h with PEMWE and AWE single cells. The data obtained from the dynamic operation are included in the dataset. We applied two analysis methods to our datasets in Python to extract performance data from the electrolysis cells, such as I-V curves, current-density-dependent cell voltage changes, and resistances. The Python code is also part of the dataset.
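As a loose illustration of the kind of post-processing described (the dataset ships with the authors' own Python code, which should be preferred), a minimal sketch that bins a dynamic current/voltage log into an averaged I-V curve; the file and column names are assumptions, not the dataset's actual schema.

# Hedged sketch: bin a dynamic current/voltage time series into an averaged I-V curve.
# The file name and the columns "current_density_A_cm2" and "cell_voltage_V" are placeholders.
import pandas as pd

log = pd.read_csv("pemwe_dynamic_operation.csv")
bins = pd.cut(log["current_density_A_cm2"], bins=20)             # group readings by current density
iv_curve = log.groupby(bins, observed=True)["cell_voltage_V"].mean()
print(iv_curve)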
Study Tool and Dataset

[Environment preparation]
1. Python version: 3.6 or higher
2. Dependent libraries: progressbar, nltk, textblob, sklearn, matplotlib, plotly, fuzzywuzzy, statsmodels, corpora, etc. Use pip install [lib_name] to install each library.

[Running the program]
1. Command line
collect.py -- for data collection, vulnerability categorization and language interfacing classification. Type "collect.py -h" for help.
2. Command parameters
<1> collect.py
collect.py -s collect -- grab raw repositories from GitHub.
collect.py -s repostats -- collect basic properties for each repository.
collect.py -s langstats -- empirical analysis of language information: profile size, combinations, etc.
collect.py -s cmmts -- collect commits for each project, and classify the commits with fuzzywuzzy.
collect.py -s nbr -- NBR analysis on the dataset.
collect.py -s clone -- clone all projects to local storage.
collect.py -s apisniffer -- classify the projects by language interface types.
We also provide shell scripts for parallel execution in multiple processes to speed up data collection and analysis:
cmmts.sh [repository number]: execute the commit collection and classification in multiple processes
clone.sh [repository number]: clone the repositories to local storage in multiple processes
sniffer.sh [repository number]: identify and categorize the repositories by language interfacing mechanisms in multiple processes
3. Dataset
<1> Data/OriginData/Repository_List.csv: original repository profiles grabbed from GitHub.
<2> Data/CmmtSet: original commit data by repository; each file is named after the repository ID.
<3> Data/Issues: original issue information by repository.
<4> Data/StatData/CmmtSet: classified commit data by repository; each commit can be retrieved from GitHub through its 'sha' field.
<5> Data/StatData/ApiSniffer.csv: repositories classified by language interfacing mechanisms.
This data release contains acoustic Doppler current profiler (ADCP) data collected during 2023 from two uplooking tripods in bays of Lake Ontario in central New York. Data were collected at Irondequoit Bay (USGS station number 431314077315901) and at Sodus Bay (USGS station number 431533076582101). Data are organized by bay in child item datasets containing the raw binary data files from the ADCPs as well as tabulated text files of echo intensity, backscatter, velocity, and ancillary data. Tables were created by processing raw data files in R-language oceanographic package OCE (Kelley and others, 2022) and TRDI WinRiver II (Teledyne RD Instruments, 2007). All aggregation, manual magnetic variation calculations, and post-processing were completed using Python libraries pandas (McKinney, 2010) and NumPy (Harris and others, 2020).
https://crawlfeeds.com/privacy_policy
Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.
Request the large dataset here: Medium datasets
Check out the sample dataset in CSV
Training large language models (LLMs)
Analyzing content trends and engagement
Sentiment and text classification
SEO research and author profiling
Academic or commercial research
High-volume, cleanly structured JSON
Ideal for developers, researchers, and data scientists
Easy integration with Python, R, SQL, and other data pipelines (see the loading sketch after this list)
Affordable and ready-to-use
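As a small illustration of the integration claim above, a hedged sketch for loading a JSON export into pandas; the file name and field names are placeholders rather than the dataset's documented schema.

# Hedged sketch: load a Medium articles JSON export and inspect engagement fields.
# "medium_articles.json" and the field names are hypothetical placeholders.
import pandas as pd

articles = pd.read_json("medium_articles.json", lines=True)      # assumes one JSON object per line
print(articles.columns)
top = articles.sort_values("claps", ascending=False).head(10)    # assumes a "claps" field
print(top[["title", "claps"]])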
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was created as part of the research project “Python Under the Microscope: A Comparative Energy Analysis of Execution Methods” (2025). The study explores the environmental sustainability of Python software by benchmarking five execution strategies—CPython, PyPy, Cython, ctypes, and py_compile—across 15 classical algorithmic workloads.
With energy and carbon efficiency becoming critical in modern computing, this dataset aims to:
Quantify execution time, CPU energy usage, and carbon emissions
Enable reproducible analysis of performance–sustainability trade-offs
Introduce and validate the GreenScore, a composite metric for sustainability-aware software evaluation
All benchmarks were executed on a controlled laptop environment (Intel Core i5-1235U, Linux 6.8). Energy was measured via Intel RAPL counters using the pyRAPL library. Carbon footprint was estimated using a conversion factor of 0.000475 gCO₂ per joule based on regional electricity intensity.
Each algorithm–method pair was run 50 times, capturing robust statistics for energy (μJ), time (s), and derived CO₂ emissions.
Per-method folders (cpython/, pypy/, etc.) contain raw energy/ and time/ CSV files for all 15 benchmarks (50 trials each), as well as mean summaries.
Aggregate folder includes combined metric comparisons, normalized data, and carbon footprint estimations.
Analysis folder contains derived datasets: normalized scores, standard deviation, and the final GreenScore rankings used in our paper.
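To make the unit conventions concrete, here is a minimal sketch of reading one per-method energy CSV and applying the stated conversion factor of 0.000475 gCO₂ per joule; the file path and column name are assumptions, since the exact schema is documented in the dataset itself.

# Hedged sketch: convert per-trial energy readings (μJ) to joules and estimated grams of CO2.
# The path "cpython/energy/fibonacci.csv" and the column "energy_uj" are hypothetical placeholders.
import pandas as pd

GCO2_PER_JOULE = 0.000475                        # conversion factor stated by the dataset authors

trials = pd.read_csv("cpython/energy/fibonacci.csv")
energy_j = trials["energy_uj"] * 1e-6            # μJ -> J
co2_g = energy_j * GCO2_PER_JOULE
print(f"mean energy: {energy_j.mean():.4f} J, mean CO2: {co2_g.mean():.6f} g")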
This dataset is ideal for:
Reproducible software sustainability studies
Benchmarking Python execution strategies
Analyzing energy–performance–carbon trade-offs
Validating green metrics and measurement tools
Researchers and practitioners are encouraged to use, extend, and cite this dataset in sustainability-aware software design.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was generated as part of a study aimed at profiling global scientific academies, which play a significant role in promoting scholarly communication and scientific progress. Below is a detailed description of the dataset.

Data Generation Procedures and Tools: The dataset was compiled using a combination of web scraping, manual verification, and data integration from multiple sources, including Wikipedia categories, members of unions of scientific organizations, and web searches using specific query phrases (e.g., "country name + (academy OR society) AND site:.country code"). The records were enriched by cross-referencing data from the Wikidata API, the VIAF API, and the Research Organisation Registry (ROR). Additional manual curation ensured accuracy and consistency.

Temporal and Geographical Scopes: The dataset covers scientific academies across a wide temporal scope, ranging from the 15th century to the present. The geographical scope includes academies from all continents, with emphasis on both developed and post-developing countries. The dataset aims to capture the full spectrum of scientific academies across different periods of historical development.

Tabular Data Description: The dataset comprises a total of 301 academy records and 14,008 website navigation sections. Each row in the dataset represents a single scientific academy, while the columns describe attributes such as the academy's name, founding date, location (city and country), website URL, email, and address.

Missing Data: Although the dataset offers comprehensive coverage, some entries may have missing or incomplete fields. For instance, the navigation section field was not available for all records.

Data Errors and Error Ranges: The data have been verified through manual curation, reducing the likelihood of errors. However, the use of crowd-sourced data from platforms like Wikipedia introduces potential risks of outdated or incomplete information. Any errors are likely minor and confined to fields such as navigation menu classifications, which may not fully reflect the breadth of an academy's activities.

Data Files, Formats, and Sizes: The dataset is provided in CSV and JSON formats, ensuring compatibility with a wide range of software applications, including Microsoft Excel, Google Sheets, and programming languages such as Python (via libraries like pandas).

This dataset provides a valuable resource for further research into the organizational behaviors, geographic distribution, and historical significance of scientific academies across the globe. It can be used for large-scale analyses, including comparative studies across different regions or time periods. Any feedback on the data is welcome; please contact the maintainer of the dataset.

If you use the data, please cite the following paper: Xiaoli Chen and Xuezhao Wang. 2024. Profiling Global Scientific Academies. In The 2024 ACM/IEEE Joint Conference on Digital Libraries (JCDL '24), December 16–20, 2024, Hong Kong, China. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3677389.3702582
This dataset contains processed acoustic Doppler current profiler (ADCP) measurements from twenty energetic tidal energy sites in the United States, Scotland, and New Zealand, compiled for the 2025 publication Current Depth Profile Characterization for Tidal Energy Development (linked below). Measurements were sourced from peer-reviewed literature, the Marine and Hydrokinetic Data Repository, EMEC, and NOAA's C-MIST database, and were selected for sites with depth-averaged current speeds exceeding 1 m/s. Data span a range of tidal cycles, depths (5-70 m), and flow regimes, and have been quality-controlled, filtered, and transformed into principal flood and ebb flow directions. Each netCDF file corresponds to a single site, with file names based on the site codes defined in the publication. The dataset classifies current depth profiles by shape, reports their prevalence by flow regime, and provides fitted power law parameters for monotonic profiles, along with metrics for non-monotonic profiles. Detailed descriptions of variables, units, and file naming conventions are provided in the dataset README. The submission complies with FAIR data principles: it is findable through the open-access PRIMRE Marine and Hydrokinetic Data Repository with a DOI; accessible via self-describing netCDF files readable in open-source tools such as Python and R; interoperable for integration with other applications and databases; and reusable through comprehensive documentation.
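Because the files are self-describing netCDF, a minimal Python sketch for opening a single site file is shown below; the file name is a placeholder for one of the site codes defined in the publication.

# Hedged sketch: open one site's netCDF file and list its contents.
# "SITE01.nc" is a placeholder; actual file names follow the publication's site codes.
import xarray as xr

ds = xr.open_dataset("SITE01.nc")
print(ds)              # dimensions, coordinates, variables and attributes
print(ds.data_vars)    # variable names, e.g. velocity profiles and fitted parameters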
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.
We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.
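For context, a minimal sketch of how a synthetic posting can be generated with the Faker library; this mirrors the general approach described above, not the authors' actual generation script.

# Hedged sketch: generate one synthetic job posting with Faker.
# Field choices are illustrative and do not reflect the dataset's exact schema.
from faker import Faker

fake = Faker()
posting = {
    "title": fake.job(),
    "company": fake.company(),
    "location": f"{fake.city()}, {fake.country()}",
    "posted_on": fake.date_this_year().isoformat(),
    "description": fake.paragraph(nb_sentences=5),
}
print(posting)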
Please note that the examples provided are fictional and for illustrative purposes; you can tailor the descriptions and examples to match the specifics of your own dataset. The dataset is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at rrana157@gmail.com.
https://www.gnu.org/licenses/gpl-3.0.html
This dataset presents the output of the application of the Jarkus Analysis Toolbox (JAT) to the Jarkus dataset. The Jarkus dataset is one of the most elaborate coastal datasets in the world and consists of coastal profiles of the entire Dutch coast, spaced about 250-500 m apart, which have been measured yearly since 1965. Different available definitions for extracting characteristic parameters from coastal profiles were collected and implemented in the JAT. The characteristic parameters allow stakeholders (e.g. scientists, engineers and coastal managers) to study the spatial and temporal variations in parameters like dune height, dune volume, dune foot, beach width and closure depth. This dataset includes a netCDF file (on the OPeNDAP server, see data link) that contains all characteristic parameters through space and time, and a distribution plot that gives an overview of each characteristic parameter. The Jarkus Analysis Toolbox and all scripts that were used to extract the characteristic parameters and create the distribution plots are available through GitHub (https://github.com/christavanijzendoorn/JAT). Example 5, which is included in the JAT, provides a Python script that shows how to load and work with the netCDF file. Documentation: https://jarkus-analysis-toolbox.readthedocs.io/.
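Example 5 in the JAT is the maintained reference for loading the file; as a rough orientation only, a hedged sketch of inspecting a netCDF file in Python (the file name is a placeholder; use the OPeNDAP link from the data record):

# Hedged sketch: inspect the characteristic-parameters netCDF file.
# "characteristic_parameters.nc" is a placeholder file name.
from netCDF4 import Dataset

nc = Dataset("characteristic_parameters.nc")
print(list(nc.variables.keys()))    # e.g. dune height, dune volume, dune foot, beach width, ...
nc.close()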
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is an RO-Crate that bundles artifacts of an AI-based computational pipeline execution. It is an example of application of the CPM RO-Crate profile, which integrates the Common Provenance Model (CPM), and the Process Run Crate profile.
As the CPM is a groundwork for the ISO 23494 Biotechnology — Provenance information model for biological material and data provenance standards series development, the resulting profile and the example is intended to be presented at one of the ISO TC275 WG5 regular meetings, and will become an input for the ISO 23494-5 Biotechnology — Provenance information model for biological material and data — Part 5: Provenance of Data Processing standard development.
Description of the AI pipeline
The goal of the AI pipeline whose execution is described in the dataset is to train an AI model to detect the presence of carcinoma cells in high resolution human prostate images. The pipeline is implemented as a set of python scripts that work over a filesystem, where the datasets, intermediate results, configurations, logs, and other artifacts are stored. In particular, the AI pipeline consists of the following three general parts:
Image data preprocessing. Goal of this step is to prepare the input dataset – whole slide images (WSIs) and their annotations – for the AI model. As the model is not able to process the entire high resolution images, the preprocessing step of the pipeline splits the WSIs into groups (training and testing). Furthermore, each WSI is broken down into smaller overlapping parts called patches. The background patches are filtered out and the remaining tissue patches are labeled according to the provided pathologists’ annotations.
AI model training. Goal of this step is to train the AI model using the training dataset generated in the previous step of the pipeline. Result of this step is a trained AI model.
AI model evaluation. Goal of this step is to evaluate the trained model performance on a dataset which was not provided to the model during the training. Results of this step are statistics describing the AI model performance.
In addition to the above, execution of the steps results in generation of log files. The log files contain detailed traces of the AI pipeline execution, such as file paths, model weight parameters, timestamps, etc. As suggested by the CPM, the logfiles and additional metadata present on the filesystem are then used by a provenance generation step that transforms available information into the CPM compliant data structures, and serializes them into files.
Finally, all these artifacts are packed together in an RO-Crate.
For the purpose of the example, we have included only a small fragment of the input image dataset in the resulting crate, as this has no effect on how the Process Run Crate and CPM RO-Crate profiles are applied to the use case. In real world execution, the input dataset would consist of terabytes of data. In this example, we have selected a representative image for each of the input dataset parts. As a result, the only difference between the real world application and this example would be that the resulting real world crate would contain more input files.
Description of the RO-Crate
Process Run Crate related aspects
The Process Run Crate profile can be used to pack artifacts of a computational workflow whose individual steps are not controlled centrally. Since the pipeline presented in this example consists of steps that are executed individually, and its execution is not managed centrally by a workflow engine, the Process Run Crate profile can be applied.
Each of the computational steps is expressed within the crate's ro-crate-metadata.json file as a pair of elements: 1) the software used to create files; and 2) a specific execution of that software. In particular, we use the SoftwareSourceCode type to indicate the executed Python scripts and the CreateAction type to indicate actual executions (a minimal sketch of such a pair is given at the end of this subsection).
As a result, the crate contains the following seven “executables”:
Three python scripts, each corresponding to a part of the pipeline: preprocessing, training, and evaluation.
Four provenance generation scripts, three of which implement the transformation of the proprietary log files generated by the AI pipeline scripts into CPM compliant provenance files. The fourth one is a meta provenance generation script.
For each of the executables, their execution is expressed in the resulting ro-crate-metadata.json using the CreateAction type. As a result, seven create-actions are present in the resulting crate.
Input dataset, intermediate results, configuration files and resulting provenance files are expressed according to the underlying RO Crate specification.
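To make the SoftwareSourceCode/CreateAction pairing concrete, below is a minimal sketch of how one such pair could be emitted from Python into ro-crate-metadata.json; the identifiers, URLs and file names are illustrative placeholders, not the crate's actual entries.

# Hedged sketch: one SoftwareSourceCode/CreateAction pair as it could appear in ro-crate-metadata.json.
# All identifiers, URLs and file names below are illustrative placeholders.
import json

entities = [
    {"@id": "https://github.com/example/ai-pipeline/blob/main/preprocess.py",   # hypothetical
     "@type": "SoftwareSourceCode",
     "name": "WSI preprocessing script",
     "programmingLanguage": "Python"},
    {"@id": "#preprocess-run-1",
     "@type": "CreateAction",
     "name": "Execution of the preprocessing step",
     "instrument": {"@id": "https://github.com/example/ai-pipeline/blob/main/preprocess.py"},
     "object": [{"@id": "input/wsi-001.tiff"}],     # input whole slide image (placeholder)
     "result": [{"@id": "patches/train/"}]},        # generated training patches (placeholder)
]
print(json.dumps(entities, indent=2))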
CPM RO-Crate related aspects
The main purpose of the CPM RO-Crate profile is to enable identification of the CPM compliant provenance files within a crate. To achieve this, the CPM RO-Crate profile specification prescribes specific file types for such files: CPMProvenanceFile, and CPMMetaProvenanceFile.
In this case, the RO Crate contains three CPM Compliant files, each documenting a step of the pipeline, and a single meta-provenance file. These files are generated as a result of the three provenance generation scripts that use available log files and additional information to generate the CPM compliant files. In terms of the CPM, the provenance generation scripts are implementing the concept of provenance finalization event. The three provenance generation scripts are assigned SoftwareSourceCode type, and have corresponding executions expressed in the crate using the CreateAction type.
Remarks
The resulting RO Crate packs artifacts of an execution of the AI pipeline. The scripts that implement the individual pipeline steps and the provenance generation are not included in the crate directly; they are hosted on GitHub and referenced from the crate's ro-crate-metadata.json file by their remote location.
The input image files included in this RO-Crate come from the Camelyon16 dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Perfectly Accurate, Synthetic dataset featuring a virtual railway EnVironment for Multi-View Stereopsis (RailEnV-PASMVS) is presented, consisting of 40 scenes and 79,800 renderings together with ground truth depth maps, extrinsic and intrinsic camera parameters and binary segmentation masks of all the track components and surrounding environment. Every scene is rendered from a set of 3 cameras, each positioned relative to the track for optimal 3D reconstruction of the rail profile. The set of cameras is translated across the 100-meter length of tangent (straight) track to yield a total of 1,995 camera views. Photorealistic lighting of each of the 40 scenes is achieved with the implementation of high-definition, high dynamic range (HDR) environmental textures. Additional variation is introduced in the form of camera focal lengths, random noise for the camera location and rotation parameters and shader modifications of the rail profile. Representative track geometry data is used to generate random and unique vertical alignment data for the rail profile for every scene. This primary, synthetic dataset is augmented by a smaller image collection consisting of 320 manually annotated photographs for improved segmentation performance. The specular rail profile represents the most challenging component for MVS reconstruction algorithms, pipelines and neural network architectures, increasing the ambiguity and complexity of the data distribution. RailEnV-PASMVS represents an application specific dataset for railway engineering, against the backdrop of existing datasets available in the field of computer vision, providing the precision required for novel research applications in the field of transportation engineering.
File descriptions
Steps to reproduce
The open source Blender software suite (https://www.blender.org/) was used to generate the dataset, with the entire pipeline developed using the exposed Python API interface. The camera trajectory is kept fixed for all 40 scenes, except for small perturbations introduced in the form of random noise to increase the camera variation. The camera intrinsic information was initially exported as a single CSV file (scene.csv) for every scene, from which the camera information files were generated; this includes the focal length (focalLengthmm), image sensor dimensions (pixelDimensionX, pixelDimensionY), position, coordinate vector (vectC) and rotation vector (vectR). The STL model files, as provided in this data repository, were exported directly from Blender, such that the geometry/scenes can be reproduced. The data processing below is written for a Python implementation, transforming the information from Blender's coordinate system into universal rotation (R_world2cv) and translation (T_world2cv) matrices.
import numpy as np
from scipy.spatial.transform import Rotation as R

# focalLengthmm, sensorWidthmm, pixelDimensionX, pixelDimensionY, vectR and vectC
# are read from the per-scene camera file (scene.csv).
# The intrinsic matrix K is constructed using the following formulation:
focalLengthPixel = focalLengthmm * pixelDimensionX / sensorWidthmm
K = np.array([[focalLengthPixel, 0, pixelDimensionX / 2],
              [0, focalLengthPixel, pixelDimensionY / 2],
              [0, 0, 1]])
# The rotation vector as provided by Blender is first transformed to a rotation matrix:
r = R.from_euler('xyz', vectR, degrees=True)
matR = r.as_matrix()
# Transpose the rotation matrix to obtain the transformation from WORLD to BLENDER coordinates:
R_world2bcam = np.transpose(matR)
# The matrix describing the transformation from BLENDER to CV/STANDARD coordinates is:
R_bcam2cv = np.array([[1, 0, 0],
                      [0, -1, 0],
                      [0, 0, -1]])
# Thus the rotation from WORLD to CV/STANDARD coordinates is:
R_world2cv = R_bcam2cv.dot(R_world2bcam)
# The camera coordinate vector requires a similar transformation from BLENDER to WORLD coordinates:
T_world2bcam = -1 * R_world2bcam.dot(vectC)
T_world2cv = R_bcam2cv.dot(T_world2bcam)
The resulting R_world2cv and T_world2cv matrices are written to the camera information file using exactly the same format as that of BlendedMVS developed by Dr. Yao. The original rotation and translation information can be found by following the process in reverse. Note that additional steps were required to convert from Blender's unique coordinate system to that of OpenCV; this ensures universal compatibility in the way that the camera intrinsic and extrinsic information is provided.
Equivalent GPS information is provided (gps.csv), whereby the local coordinate frame is transformed into equivalent GPS information, centered around the Engineering 4.0 campus, University of Pretoria, South Africa. This information is embedded within the JPG files as EXIF data.
Supporting data for CRACMMv1, including the SPECIATE database mapped to CRACMM, input to the Speciation Tool, profile files output from the Speciation Tool for input to SMOKE, Python code for mapping species to CRACMM, the chemical mechanism, and mechanism metadata, are available at https://github.com/USEPA/CRACMM. Specific analyses and scripts used in the manuscript "Linking gas, particulate, and toxic endpoints to air emissions in the Community Regional Atmospheric Chemistry Multiphase Mechanism (CRACMM) version 1.0", such as the 2017 U.S. species-level inventory and code for figures, are available here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This item contains data sets for Sych et al., Nature Biotechnology, 2023. It contains raw fluorescence fluctuation data as Excel sheets and raw figure files.
Abstract: We introduce a method, single-particle profiler (SPP), that provides single-particle information on the content and biophysical properties of thousands of particles in the size range 5-150 nm. We apply SPP to measure the mRNA encapsulation efficiency of lipid nanoparticles, viral binding efficiency of different nanobodies, and biophysical heterogeneity of liposomes, lipoproteins, exosomes and viruses.
Data usage: Researchers are welcome to use the data contained in the dataset for any project. Please cite this item upon use or when publishing. We encourage reuse under the same CC BY 4.0 License.
Data content:
- FCS files as raw data (.fcs)
- Excel and Prism files for graphs
- .xlsx: Microsoft Excel
- .pzfx: GraphPad Prism
- .svg: Inkscape (https://inkscape.org/)
- .fcs: Single Particle Profiler (https://github.com/taras-sych/Single-particle-profiler)
- .ipynb: Jupyter Notebook, installed as part of the Anaconda platform, Python 3.8.8 (https://www.anaconda.com/)
- .py: executed via the Anaconda platform, Python 3.8.8 (https://www.anaconda.com/)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides a comprehensive collection of consumer behavior data that can be used for various market research and statistical analyses. It includes information on purchasing patterns, demographics, product preferences, customer satisfaction, and more, making it ideal for market segmentation, predictive modeling, and understanding customer decision-making processes.
The dataset is designed to help researchers, data scientists, and marketers gain insights into consumer purchasing behavior across a wide range of categories. By analyzing this dataset, users can identify key trends, segment customers, and make data-driven decisions to improve product offerings, marketing strategies, and customer engagement.
Key Features:
- Customer Demographics: Understand age, income, gender, and education level for better segmentation and targeted marketing.
- Purchase Behavior: Includes purchase amount, frequency, category, and channel preferences to assess spending patterns.
- Customer Loyalty: Features like brand loyalty, engagement with ads, and loyalty program membership provide insights into long-term customer retention.
- Product Feedback: Customer ratings and satisfaction levels allow for analysis of product quality and customer sentiment.
- Decision-Making: Time spent on product research, time to decision, and purchase intent reflect how customers make purchasing decisions.
- Influences on Purchase: Factors such as social media influence, discount sensitivity, and return rates are included to analyze how external factors affect purchasing behavior.
Columns Overview:
- Customer_ID: Unique identifier for each customer.
- Age: Customer's age (integer).
- Gender: Customer's gender (categorical: Male, Female, Non-binary, Other).
- Income_Level: Customer's income level (categorical: Low, Middle, High).
- Marital_Status: Customer's marital status (categorical: Single, Married, Divorced, Widowed).
- Education_Level: Highest level of education completed (categorical: High School, Bachelor's, Master's, Doctorate).
- Occupation: Customer's occupation (categorical: various job titles).
- Location: Customer's location (city, region, or country).
- Purchase_Category: Category of purchased products (e.g., Electronics, Clothing, Groceries).
- Purchase_Amount: Amount spent during the purchase (decimal).
- Frequency_of_Purchase: Number of purchases made per month (integer).
- Purchase_Channel: The purchase method (categorical: Online, In-Store, Mixed).
- Brand_Loyalty: Loyalty to brands (1-5 scale).
- Product_Rating: Rating given by the customer to a purchased product (1-5 scale).
- Time_Spent_on_Product_Research: Time spent researching a product (integer, hours or minutes).
- Social_Media_Influence: Influence of social media on the purchasing decision (categorical: High, Medium, Low, None).
- Discount_Sensitivity: Sensitivity to discounts (categorical: Very Sensitive, Somewhat Sensitive, Not Sensitive).
- Return_Rate: Percentage of products returned (decimal).
- Customer_Satisfaction: Overall satisfaction with the purchase (1-10 scale).
- Engagement_with_Ads: Engagement level with advertisements (categorical: High, Medium, Low, None).
- Device_Used_for_Shopping: Device used for shopping (categorical: Smartphone, Desktop, Tablet).
- Payment_Method: Method of payment used for the purchase (categorical: Credit Card, Debit Card, PayPal, Cash, Other).
- Time_of_Purchase: Timestamp of when the purchase was made (date/time).
- Discount_Used: Whether the customer used a discount (Boolean: True/False).
- Customer_Loyalty_Program_Member: Whether the customer is part of a loyalty program (Boolean: True/False).
- Purchase_Intent: The intent behind the purchase (categorical: Impulsive, Planned, Need-based, Wants-based).
- Shipping_Preference: Shipping preference (categorical: Standard, Express, No Preference).
- Payment_Frequency: Frequency of payment (categorical: One-time, Subscription, Installments).
- Time_to_Decision: Time taken from consideration to actual purchase (in days).
Use Cases:
- Market Segmentation: Segment customers based on demographics, preferences, and behavior.
- Predictive Analytics: Use data to predict customer spending habits, loyalty, and product preferences.
- Customer Profiling: Build detailed profiles of different consumer segments based on purchase behavior, social media influence, and decision-making patterns.
- Retail and E-commerce Insights: Analyze purchase channels, payment methods, and shipping preferences to optimize marketing and sales strategies.

Target Audience:
- Data scientists and analysts looking for consumer behavior data.
- Marketers interested in improving customer segmentation and targeting.
- Researchers exploring factors influencing consumer decisions and preferences.
- Companies aiming to improve customer experience and increase sales through data-driven decisions.
This dataset is available in CSV format for easy integration into data analysis tools and platforms such as Python, R, and Excel.
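As a small illustration of working with the columns listed above, a hedged pandas sketch; the file name is a placeholder and only columns named in the overview are used.

# Hedged sketch: summarize spending by income level and purchase channel.
# "consumer_behavior.csv" is a placeholder file name; column names follow the overview above.
import pandas as pd

df = pd.read_csv("consumer_behavior.csv")
summary = (df.groupby(["Income_Level", "Purchase_Channel"])["Purchase_Amount"]
             .agg(["count", "mean"])
             .round(2))
print(summary)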
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Comprehensive profiling of lipid species in a biological sample, or lipidomics, is a valuable approach to elucidating disease pathogenesis and identifying biomarkers. Currently, a typical lipidomics experiment may track hundreds to thousands of individual lipid species. However, drawing biological conclusions requires multiple steps of data processing to enrich significantly altered features and confident identification of these features. Existing solutions for these data analysis challenges (i.e., multivariate statistics and lipid identification) involve performing various steps using different software applications, which imposes a practical limitation and potentially a negative impact on reproducibility. Hydrophilic interaction liquid chromatography-ion mobility-mass spectrometry (HILIC-IM-MS) has shown advantages in separating lipids through orthogonal dimensions. However, there are still gaps in the coverage of lipid classes in the literature. To enable reproducible and efficient analysis of HILIC-IM-MS lipidomics data, we developed an open-source Python package, LiPydomics, which enables performing statistical and multivariate analyses (“stats” module), generating informative plots (“plotting” module), identifying lipid species at different confidence levels (“identification” module), and carrying out all functions using a user-friendly text-based interface (“interactive” module). To support lipid identification, we assembled a comprehensive experimental database of m/z and CCS of 45 lipid classes with 23 classes containing HILIC retention times. Prediction models for CCS and HILIC retention time for 22 and 23 lipid classes, respectively, were trained using the large experimental data set, which enabled the generation of a large predicted lipid database with 145,388 entries. Finally, we demonstrated the utility of the Python package using Staphylococcus aureus strains that are resistant to various antimicrobials.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Ethical clearance reference number: refer to the uploaded document Ethics Certificate.pdf.
General (0)
0 - Built diagrams and figures.pdf: diagrams and figures used for the thesis
Analysis of country data (1)
0 - Country selection.xlsx: In this analysis the sub-Saharan country (Niger) is selected based on the kWh per capita data obtained from sources such as the United Nations and the World Bank. Other data used from these sources include household size and electricity access. Some household data were projected using linear regression. Sample sizes versus error margins were also analyzed for the selection of a smaller area within the country.
Smart metering experiment (2)
The figures (PNG, JPG, PDF) include:
- The experiment components and assembly
- The use of device (meter and modem) software tools to program and analyse data
- Phasor and meter detail
- Extracted reports and graphs from the MDMS
The datasets (CSV, XLSX) include:
- Energy load profile and register data recorded by the smart meter and collected by both meter configuration and MDM applications.
- The data collected also include events, alarms and QoS data.
Data applicability to SEAP (3)
3 - Energy data and SEAP.pdf: as part of the Smart Metering VS SEAP framework analysis, a comparison between SEAP's data requirements, the applicable energy data for those requirements, the benefits, and the calculation of indicators where applicable.
3 - SEAP indicators.xlsx: as part of the Smart Metering VS SEAP framework analysis, the applicable calculation of indicators for SEAP's data requirements.
Load prediction by machine learning (4)
The coding (IPYNB, PY, HTML, ZIP) shows the preparation and exploration of the energy data to train the machine learning model. The datasets (CSV, XLSX), sequentially named, are part of the process of extracting, transforming and loading the data into a machine learning algorithm, identifying the best regression model based on metrics, and predicting the data.
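As a loose illustration of the model-selection step described here (not the thesis code itself), a minimal scikit-learn sketch that compares two regressors on a hold-out split; the file, feature and target names are placeholders.

# Hedged sketch: compare regressors on a hold-out split and report RMSE for each.
# "load_profile.csv", the feature columns and "load_kW" are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("load_profile.csv")
X = df[["hour", "weekday", "temperature"]]          # assumed feature columns
y = df["load_kW"]                                   # assumed target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {"linear": LinearRegression(), "random_forest": RandomForestRegressor(random_state=0)}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: RMSE = {rmse:.3f}")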
HRES analysis and optimization (5)
The figures (PNG, JPG, PDF) include:
- Household load, based on the energy data from the smart metering experiment and the machine learning exercise
- Pre-defined/synthetic load, provided by the software when no external data (household load) is available, and
- The HRES designed
- Application-generated reports with the results of the analysis, for both best case HRES and fully renewable scenarios.
The datasets (XLSX) include the 12-month input load for the simulation, and the input/output analysis and calculations.
5 - Gorou_Niger_20220529_v3.homer: software (Homer Pro) file with the simulated HRES.
Conferences (6)
6 - IEEE_MISTA_2022_paper_51.pdf: paper (research in progress) presented at the IEEE MISTA 2022 conference, held in March 2022, and published in the respective proceedings, 6 - IEEE_MISTA_2022_proceeding.pdf.
6 - ITAS_2023.pdf: paper (final research) recently presented at the ITAS 2023 conference in Doha, Qatar, in March 2023.
6 - Smart Energy Seminar 2023.pptx: PowerPoint slide version of the paper, recently presented at the Smart Energy Seminar held at CPUT in March 2023.
https://creativecommons.org/publicdomain/zero/1.0/
A data-driven end-to-end analysis of Electric Vehicle adoption, performance, and policy alignment across Washington State. This project covers everything from data cleaning and exploration to visualization and presentation — using SQL, Python, and Power BI.
GUI-based software coded in Python that supports high-throughput image processing and analytics of large satellite imagery datasets and provides spatiotemporal monitoring of crop health conditions throughout the growing season by automatically producing 1) a field map calendar (FMC) with daily thumbnails of vegetation heatmaps for each month and 2) a seasonal vegetation index (VI) profile of the crop fields. Output examples of the FMC and VI profile are provided in the files fmCalendar.jpg and NDVI_Profile.jpg, respectively, which were created from satellite imagery acquired from 5/1 to 10/31 in 2020 over a sugarbeet field in Moorhead, MN.
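For context on the vegetation index being profiled, the standard NDVI formula is NDVI = (NIR - Red) / (NIR + Red); a minimal sketch follows, with placeholder band arrays rather than the tool's own code.

# Hedged sketch: compute NDVI from near-infrared and red reflectance arrays.
# The arrays below are small placeholders, not actual satellite bands.
import numpy as np

nir = np.array([[0.52, 0.60], [0.47, 0.55]])    # placeholder NIR reflectance
red = np.array([[0.10, 0.12], [0.15, 0.11]])    # placeholder red reflectance
ndvi = (nir - red) / (nir + red)
print(ndvi.round(3))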
ABSTRACT: The World Soil Information Service (WoSIS) provides quality-assessed and standardized soil profile data to support digital soil mapping and environmental applications at broad scale levels. Since the release of the 'WoSIS snapshot 2019' many new soil data were shared with us, registered in the ISRIC data repository, and subsequently standardized in accordance with the licenses specified by the data providers. The source data were contributed by a wide range of data providers, therefore special attention was paid to the standardization of soil property definitions, soil analytical procedures and soil property values (and units of measurement). We presently consider the following soil chemical properties (organic carbon, total carbon, total carbonate equivalent, total nitrogen, phosphorus (extractable-P, total-P, and P-retention), soil pH, cation exchange capacity, and electrical conductivity) and physical properties (soil texture (sand, silt, and clay), bulk density, coarse fragments, and water retention), grouped according to analytical procedures (aggregates) that are operationally comparable. For each profile we provide the original soil classification (FAO, WRB, USDA, and version) and horizon designations as far as these have been specified in the source databases. Three measures for 'fitness-for-intended-use' are provided: positional uncertainty (for site locations), time of sampling/description, and a first approximation of the uncertainty associated with the operationally defined analytical methods. These measures should be considered during digital soil mapping and subsequent earth system modelling that use the present set of soil data.

DATA SET DESCRIPTION: The 'WoSIS 2023 snapshot' comprises data for 228k profiles from 217k geo-referenced sites that originate from 174 countries. The profiles represent over 900k soil layers (or horizons) and over 6 million records. The actual number of measurements for each property varies (greatly) between profiles and with depth, generally depending on the objectives of the initial soil sampling programmes. The data are provided in TSV (tab separated values) format and as GeoPackage. The zip-file (446 Mb) contains the following files:
- Readme_WoSIS_202312_v2.pdf: Provides a short description of the dataset, file structure, column names, units and category values (this file is also available directly under 'online resources'). The pdf includes links to tutorials for downloading the TSV files into R and Excel. See also 'HOW TO READ TSV FILES INTO R AND PYTHON' below.
- wosis_202312_observations.tsv: Lists the four to six letter codes for each observation, whether the observation is for a site/profile or layer (horizon), the unit of measurement and the number of profiles and layers, respectively, represented in the snapshot. It also provides an estimate of the inferred accuracy of the laboratory measurements.
- wosis_202312_sites.tsv: Characterizes the site locations where profiles were sampled.
- wosis_202312_profiles.tsv: Presents the unique profile ID (i.e. primary key), site_id, source of the data, country ISO code and name, positional uncertainty, latitude and longitude (WGS 1984), maximum depth of soil described and sampled, as well as information on the soil classification system and edition. Depending on the soil classification system used, the number of fields will vary.
- wosis_202312_layers.tsv: Characterizes the layers (or horizons) per profile, and lists their upper and lower depths (cm).
- wosis_202312_xxxx.tsv: Presents the results for each observation (e.g. "xxxx" = "BDFIOD"), as defined under "code" in wosis_202312_observations.tsv (e.g. wosis_202312_bdfiod.tsv).
- wosis_202312.gpkg: Contains the above data files in GeoPackage format (which stores the files within an SQLite database).

HOW TO READ TSV FILES INTO R AND PYTHON:

A) To read the data in R, first uncompress the ZIP file and set the working directory to the uncompressed folder:

setwd("/YourFolder/WoSIS_2023_December/")  ## for example: setwd('D:/WoSIS_2023_December/')

Then use read_tsv to read the TSV files, specifying the data types for each column (c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time):

observations = readr::read_tsv('wosis_202312_observations.tsv', col_types='cccciid')
observations  ## show columns and first 10 rows
sites = readr::read_tsv('wosis_202312_sites.tsv', col_types='iddcccc')
sites
profiles = readr::read_tsv('wosis_202312_profiles.tsv', col_types='icciccddcccccciccccicccci')
profiles
layers = readr::read_tsv('wosis_202312_layers.tsv', col_types='iiciciiilcc')
layers
## Do this for each observation 'XXXX', e.g. file 'wosis_202312_orgc.tsv':
orgc = readr::read_tsv('wosis_202312_orgc.tsv', col_types='iicciilccdccddccccc')
orgc

Note: one may also use the following R code (example is for file 'wosis_202312_observations.tsv'):

observations <- read.table("wosis_202312_observations.tsv", sep = "\t", header = TRUE, quote = "", comment.char = "", stringsAsFactors = FALSE)

B) To read the files into Python, first decompress the files to your selected folder. Then in Python:

# import the required library
import pandas as pd
# Read the observations data
observations = pd.read_csv("wosis_202312_observations.tsv", sep="\t")
# print the data frame header and some rows
observations.head()
# Read the sites data
sites = pd.read_csv("wosis_202312_sites.tsv", sep="\t")
# Read the profiles data
profiles = pd.read_csv("wosis_202312_profiles.tsv", sep="\t")
# Read the layers data
layers = pd.read_csv("wosis_202312_layers.tsv", sep="\t")
# Read the soil property data, e.g. 'cfvo' (do this for each observation)
cfvo = pd.read_csv("wosis_202312_cfvo.tsv", sep="\t")

CITATION: Calisto, L., de Sousa, L.M., Batjes, N.H., 2023. Standardised soil profile data for the world (WoSIS snapshot – December 2023), https://doi.org/10.17027/isric-wdcsoils-20231130. Supplement to: Batjes, N.H., Calisto, L. and de Sousa, L.M., 2023. Providing quality-assessed and standardised soil data to support global mapping and modelling (WoSIS snapshot 2023). Earth System Science Data, https://doi.org/10.5194/essd-16-4735-2024.