30 datasets found
  1. User Profiling and Segmentation Project

    • kaggle.com
    Updated Jul 9, 2024
    Cite
    Sanjana Murthy (2024). User Profiling and Segmentation Project [Dataset]. https://www.kaggle.com/datasets/sanjanamurthy392/user-profiling-and-segmentation-project
    Explore at:
    Croissant - a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sanjana Murthy
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    About the dataset:
    - Domain: Marketing
    - Project: User Profiling and Segmentation
    - Dataset: user_profile_for_ads.csv
    - Dataset type: Excel data
    - Dataset size: 16k+ records

    KPIs:

    1. Distribution of key demographic variables: a. Count of Age b. Count of Gender c. Count of Education Level d. Count of Income Level e. Count of Device Usage

    2. Understanding online behavior: a. Count of Time Spent Online (hrs/Weekday) b. Count of Time Spent Online (hrs/Weekend)

    3. Ad interaction metrics: a. Count of Likes and Reactions b. Count of Click-Through Rates (CTR) c. Count of Conversion Rate d. Count of Ad Interaction Time (secs) e. Count of Ad Interaction Time by Top Interests

    Process: 1. Understanding the problem 2. Data Collection 3. Exploring and analyzing the data 4. Interpreting the results

    The accompanying analysis uses pandas, NumPy, matplotlib and seaborn (isnull, set_style, suptitle, countplot, palette, tight_layout, figsize, histplot, barplot) together with scikit-learn components (StandardScaler, OneHotEncoder, ColumnTransformer, Pipeline, KMeans) and derived objects such as cluster_means, groupby results and radar_df.
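
    A minimal sketch of how the scikit-learn pieces listed above typically fit together for this kind of segmentation; the column names below are hypothetical placeholders, not taken from user_profile_for_ads.csv:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("user_profile_for_ads.csv")

    # Hypothetical column split; the real file may use different names.
    numeric_cols = ["Time Spent Online (hrs/Weekday)", "Time Spent Online (hrs/Weekend)"]
    categorical_cols = ["Gender", "Education Level", "Income Level", "Device Usage"]

    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])

    pipe = Pipeline([("prep", preprocess), ("kmeans", KMeans(n_clusters=5, random_state=0))])
    df["segment"] = pipe.fit_predict(df[numeric_cols + categorical_cols])

    # Per-segment means of the numeric features (the cluster_means step feeding radar_df).
    cluster_means = df.groupby("segment")[numeric_cols].mean()
    print(cluster_means)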

  2. Data from dynamic wind profile long-term operation of alkaline and PEM water...

    • data-legacy.fz-juelich.de
    • resodate.org
    bin, csv +3
    Updated May 21, 2025
    Cite
    Jülich DATA (2025). Data from dynamic wind profile long-term operation of alkaline and PEM water electrolysis with extraction of performance data in Python [Dataset]. http://doi.org/10.26165/JUELICH-DATA/PYGQTO
    Explore at:
    Available download formats: csv(754156), csv(136688970), text/x-python(2537), csv(2771450), text/x-python(72207), csv(55902591), zip(7575081), txt(255), csv(20952), csv(68679746), bin(69319)
    Dataset updated
    May 21, 2025
    Dataset provided by
    Jülich DATA
    Description

    We created a semi-synthetic wind profile from wind turbine data and converted it to current and potential profiles for PEM and alkaline water electrolysis cells with maximum power outputs of 40 W and 4 W, respectively. Then we conducted dynamic electrolysis with these profiles for up to 961 h with PEMWE and AWE single cells. The data obtained from the dynamic operation are included in the dataset. We applied two analysis methods to our datasets in Python to extract performance data from the electrolysis cells, such as I-V curves, current-density-dependent cell voltage changes, and resistances. The Python code is also part of the dataset.
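
    As a rough illustration of the kind of performance-data extraction described above, a minimal sketch assuming one of the CSV files holds a time series of current density and cell voltage (the file and column names here are hypothetical placeholders, not the actual ones from the dataset):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical file/column names; consult the dataset documentation for the real ones.
    df = pd.read_csv("pemwe_dynamic_operation.csv")

    # Collapse the dynamic profile into an I-V curve by averaging the cell voltage
    # within current-density bins.
    bins = pd.cut(df["current_density_A_cm2"], bins=50)
    grouped = df.groupby(bins, observed=True)
    plt.plot(grouped["current_density_A_cm2"].mean(), grouped["cell_voltage_V"].mean(), marker="o")
    plt.xlabel("current density / A cm^-2")
    plt.ylabel("cell voltage / V")
    plt.show()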

  3. Data from: On the Vulnerability Proneness of Multilingual Code

    • datasetcatalog.nlm.nih.gov
    Updated Sep 3, 2022
    Cite
    Li, Wen (2022). On the Vulnerability Proneness of Multilingual Code [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000394225
    Explore at:
    Dataset updated
    Sep 3, 2022
    Authors
    Li, Wen
    Description

    Study Tool and Dataset

    [Environment preparation]
    1. Python version: 3.6 or higher.
    2. Dependent libraries: progressbar, nltk, textblob, sklearn, matplotlib, plotly, fuzzywuzzy, statsmodels, corpora, etc. Use pip install [lib_name] to install each library.

    [Running the program]
    1. Command line: collect.py -- for data collection, vulnerability categorization and language interfacing classification. Type "collect.py -h" for help.
    2. Command parameters
    <1> collect.py
    collect.py -s collect -- grab raw repositories from GitHub.
    collect.py -s repostats -- collect basic properties for each repository.
    collect.py -s langstats -- empirical analysis of language information: profile size, combinations, etc.
    collect.py -s cmmts -- collect commits for each project, and classify the commits with fuzzywuzzy.
    collect.py -s nbr -- NBR analysis on the dataset.
    collect.py -s clone -- clone all projects to local storage.
    collect.py -s apisniffer -- classify the projects by language interface types.
    We also provide shell scripts for parallel execution in multiple processes to speed up data collection and analysis:
    cmmts.sh [repository number]: execute the commit collection and classification in multiple processes.
    clone.sh [repository number]: clone the repositories to local storage in multiple processes.
    sniffer.sh [repository number]: identify and categorize the repositories by language interfacing mechanism in multiple processes.
    3. Dataset
    <1> Data/OriginData/Repository_List.csv: original repository profiles grabbed from GitHub.
    <2> Data/CmmtSet: original commit data by repository; each file is named after the repository ID.
    <3> Data/Issues: original issue information by repository.
    <4> Data/StatData/CmmtSet: classified commit data by repository; each commit can be retrieved from GitHub through the 'sha' field.
    <5> Data/StatData/ApiSniffer.csv: classified repositories by language interfacing mechanism.

  4. Data from: Acoustic Doppler Current Profiler Data for Irondequoit and Sodus...

    • catalog.data.gov
    • data.usgs.gov
    Updated Sep 12, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Acoustic Doppler Current Profiler Data for Irondequoit and Sodus Bays in Central New York, 2023 [Dataset]. https://catalog.data.gov/dataset/acoustic-doppler-current-profiler-data-for-irondequoit-and-sodus-bays-in-central-new-york-
    Explore at:
    Dataset updated
    Sep 12, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Irondequoit, Central New York, Sodus, New York, Sodus
    Description

    This data release contains acoustic Doppler current profiler (ADCP) data collected during 2023 from two uplooking tripods in bays of Lake Ontario in central New York. Data were collected at Irondequoit Bay (USGS station number 431314077315901) and at Sodus Bay (USGS station number 431533076582101). Data are organized by bay in child item datasets containing the raw binary data files from the ADCPs as well as tabulated text files of echo intensity, backscatter, velocity, and ancillary data. Tables were created by processing raw data files in R-language oceanographic package OCE (Kelley and others, 2022) and TRDI WinRiver II (Teledyne RD Instruments, 2007). All aggregation, manual magnetic variation calculations, and post-processing were completed using Python libraries pandas (McKinney, 2010) and NumPy (Harris and others, 2020).
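
    A minimal sketch of working with the tabulated text files in Python, consistent with the pandas/NumPy workflow described above (file and column names are hypothetical placeholders; see the child-item metadata for the real ones):

    import pandas as pd

    # Hypothetical file and column names.
    vel = pd.read_csv("irondequoit_velocity.txt", sep="\t", parse_dates=["datetime"])
    vel = vel.set_index("datetime")

    # Hourly mean of a depth-averaged eastward velocity column.
    hourly = vel["east_velocity_m_s"].resample("1H").mean()
    print(hourly.head())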

  5. Medium articles dataset

    • crawlfeeds.com
    • kaggle.com
    json, zip
    Updated Aug 26, 2025
    Cite
    Crawl Feeds (2025). Medium articles dataset [Dataset]. https://crawlfeeds.com/datasets/medium-articles-dataset
    Explore at:
    Available download formats: json, zip
    Dataset updated
    Aug 26, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Buy Medium Articles Dataset – 500K+ Published Articles in JSON Format

    Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.

    Request the large dataset here: Medium datasets

    Check out the sample dataset in CSV

    Use Cases:

    • Training language models (LLMs)

    • Analyzing content trends and engagement

    • Sentiment and text classification

    • SEO research and author profiling

    • Academic or commercial research

    Why Choose This Dataset?

    • High-volume, cleanly structured JSON

    • Ideal for developers, researchers, and data scientists

    • Easy integration with Python, R, SQL, and other data pipelines

    • Affordable and ready-to-use

  6. Python Energy Microscope: Benchmarking 5 Execution

    • kaggle.com
    zip
    Updated Jun 18, 2025
    Cite
    Md. Fatin Shadab Turja (2025). Python Energy Microscope: Benchmarking 5 Execution [Dataset]. https://www.kaggle.com/datasets/fatinshadab/python-energy-microscope-dataset
    Explore at:
    Available download formats: zip(176065 bytes)
    Dataset updated
    Jun 18, 2025
    Authors
    Md. Fatin Shadab Turja
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset was created as part of the research project “Python Under the Microscope: A Comparative Energy Analysis of Execution Methods” (2025). The study explores the environmental sustainability of Python software by benchmarking five execution strategies—CPython, PyPy, Cython, ctypes, and py_compile—across 15 classical algorithmic workloads.

    Purpose & Motivation

    With energy and carbon efficiency becoming critical in modern computing, this dataset aims to:

    Quantify execution time, CPU energy usage, and carbon emissions

    Enable reproducible analysis of performance–sustainability trade-offs

    Introduce and validate the GreenScore, a composite metric for sustainability-aware software evaluation

    Data Collection & Tools

    All benchmarks were executed on a controlled laptop environment (Intel Core i5-1235U, Linux 6.8). Energy was measured via Intel RAPL counters using the pyRAPL library. Carbon footprint was estimated using a conversion factor of 0.000475 gCO₂ per joule based on regional electricity intensity.

    Each algorithm–method pair was run 50 times, capturing robust statistics for energy (μJ), time (s), and derived CO₂ emissions.
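
    The benchmark harness itself is not reproduced in this description, but a minimal sketch of how a single trial could be captured with pyRAPL and converted with the stated 0.000475 gCO₂/J factor might look like the following (the fib workload is only a stand-in for the 15 benchmarks):

    import pyRAPL

    pyRAPL.setup()  # requires Intel RAPL support and read access to the powercap interface

    G_CO2_PER_JOULE = 0.000475  # conversion factor used in the study

    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    meter = pyRAPL.Measurement("fib")
    meter.begin()
    fib(30)
    meter.end()

    energy_j = sum(meter.result.pkg) / 1e6                   # package energy is reported in microjoules
    print(f"time:   {meter.result.duration / 1e6:.3f} s")    # duration is reported in microseconds
    print(f"energy: {energy_j:.3f} J")
    print(f"CO2:    {energy_j * G_CO2_PER_JOULE:.6f} g")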

    Dataset Structure Overview

    Per-method folders (cpython/, pypy/, etc.) contain raw energy/ and time/ CSV files for all 15 benchmarks (50 trials each), as well as mean summaries.

    Aggregate folder includes combined metric comparisons, normalized data, and carbon footprint estimations.

    Analysis folder contains derived datasets: normalized scores, standard deviation, and the final GreenScore rankings used in our paper.

    Usage

    This dataset is ideal for:

    Reproducible software sustainability studies

    Benchmarking Python execution strategies

    Analyzing energy–performance–carbon trade-offs

    Validating green metrics and measurement tools

    Researchers and practitioners are encouraged to use, extend, and cite this dataset in sustainability-aware software design.

  7. Global scientific academies Dataset

    • scidb.cn
    Updated Nov 18, 2024
    Cite
    chen xiaoli (2024). Global scientific academies Dataset [Dataset]. http://doi.org/10.57760/sciencedb.14674
    Explore at:
    Croissant - a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 18, 2024
    Dataset provided by
    Science Data Bank
    Authors
    chen xiaoli
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset was generated as part of a study aimed at profiling global scientific academies, which play a significant role in promoting scholarly communication and scientific progress. Below is a detailed description of the dataset.

    Data Generation Procedures and Tools: The dataset was compiled using a combination of web scraping, manual verification, and data integration from multiple sources, including Wikipedia categories, membership of unions of scientific organizations, and web searches using specific query phrases (e.g., "country name + (academy OR society) AND site:.country code"). The records were enriched by cross-referencing data from the Wikidata API, the VIAF API, and the Research Organisation Registry (ROR). Additional manual curation ensured accuracy and consistency.

    Temporal and Geographical Scopes: The dataset covers scientific academies from a wide temporal scope, ranging from the 15th century to the present. The geographical scope includes academies from all continents, with emphasis on both developed and post-developing countries. The dataset aims to capture the full spectrum of scientific academies across different periods of historical development.

    Tabular Data Description: The dataset comprises a total of 301 academy records and 14,008 website navigation sections. Each row in the dataset represents a single scientific academy, while the columns describe attributes such as the academy's name, founding date, location (city and country), website URL, email, and address.

    Missing Data: Although the dataset offers comprehensive coverage, some entries may have missing or incomplete fields. For instance, the section field was not available for all records.

    Data Errors and Error Ranges: The data has been verified through manual curation, reducing the likelihood of errors. However, the use of crowd-sourced data from platforms like Wikipedia introduces potential risks of outdated or incomplete information. Any errors are likely minor and confined to fields such as navigation menu classifications, which may not fully reflect the breadth of an academy's activities.

    Data Files, Formats, and Sizes: The dataset is provided in CSV and JSON formats, ensuring compatibility with a wide range of software applications, including Microsoft Excel, Google Sheets, and programming languages such as Python (via libraries like pandas).

    This dataset provides a valuable resource for further research into the organizational behaviors, geographic distribution, and historical significance of scientific academies across the globe. It can be used for large-scale analyses, including comparative studies across different regions or time periods. Any feedback on the data is welcome! Please contact the maintainer of the dataset!

    If you use the data, please cite the following paper: Xiaoli Chen and Xuezhao Wang. 2024. Profiling Global Scientific Academies. In The 2024 ACM/IEEE Joint Conference on Digital Libraries (JCDL '24), December 16-20, 2024, Hong Kong, China. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3677389.3702582

  8. Processed ADCP Current Depth Profiles, Flow Classification, and Power Law...

    • catalog.data.gov
    • mhkdr.openei.org
    • +1more
    Updated Aug 31, 2025
    + more versions
    Cite
    Sandia National Laboratories (2025). Processed ADCP Current Depth Profiles, Flow Classification, and Power Law Parameters at Tidal Energy Sites [Dataset]. https://catalog.data.gov/dataset/processed-adcp-current-depth-profiles-flow-classification-and-power-law-parameters-at-tida
    Explore at:
    Dataset updated
    Aug 31, 2025
    Dataset provided by
    Sandia National Laboratories
    Description

    This dataset contains processed acoustic Doppler current profiler (ADCP) measurements from twenty energetic tidal energy sites in the United States, Scotland, and New Zealand, compiled for the 2025 publication Current Depth Profile Characterization for Tidal Energy Development (linked below). Measurements were sourced from peer-reviewed literature, the Marine and Hydrokinetic Data Repository, EMEC, and NOAA's C-MIST database, and were selected for sites with depth-averaged current speeds exceeding 1m/s. Data span a range of tidal cycles, depths (5-70m), and flow regimes, and have been quality-controlled, filtered, and transformed into principal flood and ebb flow directions. Each netCDF file corresponds to a single site, with file names based on the site codes defined in the publication. The dataset classifies current depth profiles by shape, reports their prevalence by flow regime, and provides fitted power law parameters for monotonic profiles, along with metrics for non-monotonic profiles. Detailed descriptions of variables, units, and file naming conventions are provided in the dataset README. The submission complies with FAIR data principles: it is findable through the open-access PRIMRE Marine and Hydrokinetic Data Repository with a DOI; accessible via self-describing netCDF files readable in open-source tools such as Python and R; interoperable for integration with other applications and databases; and reusable through comprehensive documentation.
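
    A minimal sketch of opening one of the per-site netCDF files in Python (the file and variable names below are placeholders; the dataset README documents the real ones):

    import xarray as xr

    # Placeholder file/variable names.
    ds = xr.open_dataset("site_XX.nc")
    print(ds)  # lists dimensions, coordinates and variables, with units

    # Pull one variable into a pandas DataFrame for further analysis.
    df = ds["current_speed"].to_dataframe().reset_index()
    print(df.head())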

  9. Job Dataset

    • kaggle.com
    zip
    Updated Sep 17, 2023
    Cite
    Ravender Singh Rana (2023). Job Dataset [Dataset]. https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset
    Explore at:
    Available download formats: zip(479575920 bytes)
    Dataset updated
    Sep 17, 2023
    Authors
    Ravender Singh Rana
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Dataset

    This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.

    Descriptions for each of the columns in the dataset:

    1. Job Id: A unique identifier for each job posting.
    2. Experience: The required or preferred years of experience for the job.
    3. Qualifications: The educational qualifications needed for the job.
    4. Salary Range: The range of salaries or compensation offered for the position.
    5. Location: The city or area where the job is located.
    6. Country: The country where the job is located.
    7. Latitude: The latitude coordinate of the job location.
    8. Longitude: The longitude coordinate of the job location.
    9. Work Type: The type of employment (e.g., full-time, part-time, contract).
    10. Company Size: The approximate size or scale of the hiring company.
    11. Job Posting Date: The date when the job posting was made public.
    12. Preference: Special preferences or requirements for applicants (e.g., Only Male or Only Female, or Both)
    13. Contact Person: The name of the contact person or recruiter for the job.
    14. Contact: Contact information for job inquiries.
    15. Job Title: The job title or position being advertised.
    16. Role: The role or category of the job (e.g., software developer, marketing manager).
    17. Job Portal: The platform or website where the job was posted.
    18. Job Description: A detailed description of the job responsibilities and requirements.
    19. Benefits: Information about benefits offered with the job (e.g., health insurance, retirement plans).
    20. Skills: The skills or qualifications required for the job.
    21. Responsibilities: Specific responsibilities and duties associated with the job.
    22. Company Name: The name of the hiring company.
    23. Company Profile: A brief overview of the company's background and mission.

    Potential Use Cases:

    • Building predictive models to forecast job market trends.
    • Enhancing job recommendation systems for job seekers.
    • Developing NLP models for resume parsing and job matching.
    • Analyzing regional job market disparities and opportunities.
    • Exploring salary prediction models for various job roles.

    Acknowledgements:

    We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.
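
    As a rough illustration of how Faker can produce records of this shape (this is not the authors' generation script; field names simply mirror the column list above):

    import random
    from faker import Faker

    fake = Faker()

    def synthetic_job():
        return {
            "Job Id": fake.uuid4(),
            "Location": fake.city(),
            "Country": fake.country(),
            "Latitude": float(fake.latitude()),
            "Longitude": float(fake.longitude()),
            "Work Type": random.choice(["Full-Time", "Part-Time", "Contract"]),
            "Job Posting Date": fake.date_between(start_date="-2y", end_date="today"),
            "Contact Person": fake.name(),
            "Contact": fake.phone_number(),
            "Company Name": fake.company(),
            "Job Description": fake.paragraph(nb_sentences=3),
        }

    print(synthetic_job())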

    Note:

    Please note that the examples provided are fictional and for illustrative purposes. You can tailor the descriptions and examples to match the specifics of your dataset. It is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at: rrana157@gmail.com

  10. Characteristic parameters extracted from the Jarkus dataset using the Jarkus...

    • data.4tu.nl
    • figshare.com
    zip
    Updated May 4, 2021
    Cite
    Christa van IJzendoorn (2021). Characteristic parameters extracted from the Jarkus dataset using the Jarkus Analysis Toolbox [Dataset]. http://doi.org/10.4121/14514213.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 4, 2021
    Dataset provided by
    4TU.ResearchData
    Authors
    Christa van IJzendoorn
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    This dataset presents the output of the application of the Jarkus Analysis Toolbox (JAT) to the Jarkus dataset. The Jarkus dataset is one of the most elaborate coastal datasets in the world and consists of coastal profiles of the entire Dutch coast, spaced about 250-500 m apart, which have been measured yearly since 1965. Different available definitions for extracting characteristic parameters from coastal profiles were collected and implemented in the JAT. The characteristic parameters allow stakeholders (e.g. scientists, engineers and coastal managers) to study the spatial and temporal variations in parameters like dune height, dune volume, dune foot, beach width and closure depth. This dataset includes a netcdf file (on the opendap server, see data link) that contains all characteristic parameters through space and time, and a distribution plot that shows the overview of each characteristic parameter. The Jarkus Analysis Toolbox and all scripts that were used to extract the characteristic parameters and create the distribution plots are available through GitHub (https://github.com/christavanijzendoorn/JAT). Example 5, which is included in the JAT, provides a Python script that shows how to load and work with the netcdf file. Documentation: https://jarkus-analysis-toolbox.readthedocs.io/.
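
    Example 5 in the JAT repository shows the intended workflow; a generic sketch of opening the netcdf file in Python could look like the following (file and variable names are hypothetical placeholders):

    from netCDF4 import Dataset

    nc = Dataset("jarkus_characteristic_parameters.nc")   # placeholder file name
    print(nc.variables.keys())                            # list the available characteristic parameters

    dune_height = nc.variables["dune_height"][:]          # hypothetical variable, e.g. years x transects
    print(dune_height.shape)
    nc.close()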

  11. Packing provenance using CPM RO-Crate profile

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 29, 2023
    Cite
    Rudolf Wittner; Rudolf Wittner; Matej Gallo; Matej Gallo; Simone Leo; Simone Leo; Stian Soiland-Reyes; Stian Soiland-Reyes (2023). Packing provenance using CPM RO-Crate profile [Dataset]. http://doi.org/10.5281/zenodo.7676924
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rudolf Wittner; Rudolf Wittner; Matej Gallo; Matej Gallo; Simone Leo; Simone Leo; Stian Soiland-Reyes; Stian Soiland-Reyes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is an RO-Crate that bundles artifacts of an AI-based computational pipeline execution. It is an example of application of the CPM RO-Crate profile, which integrates the Common Provenance Model (CPM), and the Process Run Crate profile.

    As the CPM is a groundwork for the ISO 23494 Biotechnology — Provenance information model for biological material and data provenance standards series development, the resulting profile and the example is intended to be presented at one of the ISO TC275 WG5 regular meetings, and will become an input for the ISO 23494-5 Biotechnology — Provenance information model for biological material and data — Part 5: Provenance of Data Processing standard development.

    Description of the AI pipeline

    The goal of the AI pipeline whose execution is described in the dataset is to train an AI model to detect the presence of carcinoma cells in high resolution human prostate images. The pipeline is implemented as a set of python scripts that work over a filesystem, where the datasets, intermediate results, configurations, logs, and other artifacts are stored. In particular, the AI pipeline consists of the following three general parts:

    • Image data preprocessing. Goal of this step is to prepare the input dataset – whole slide images (WSIs) and their annotations – for the AI model. As the model is not able to process the entire high resolution images, the preprocessing step of the pipeline splits the WSIs into groups (training and testing). Furthermore, each WSI is broken down into smaller overlapping parts called patches. The background patches are filtered out and the remaining tissue patches are labeled according to the provided pathologists’ annotations.

    • AI model training. Goal of this step is to train the AI model using the training dataset generated in the previous step of the pipeline. Result of this step is a trained AI model.

    • AI model evaluation. Goal of this step is to evaluate the trained model performance on a dataset which was not provided to the model during the training. Results of this step are statistics describing the AI model performance.

    In addition to the above, execution of the steps results in generation of log files. The log files contain detailed traces of the AI pipeline execution, such as file paths, model weight parameters, timestamps, etc. As suggested by the CPM, the logfiles and additional metadata present on the filesystem are then used by a provenance generation step that transforms available information into the CPM compliant data structures, and serializes them into files.

    Finally, all these artifacts are packed together in an RO-Crate.

    For the purpose of the example, we have included only a small fragment of the input image dataset in the resulting crate, as this has no effect on how the Process Run Crate and CPM RO-Crate profiles are applied to the use case. In real world execution, the input dataset would consist of terabytes of data. In this example, we have selected a representative image for each of the input dataset parts. As a result, the only difference between the real world application and this example would be that the resulting real world crate would contain more input files.

    Description of the RO-Crate

    Process Run Crate related aspects

    The Process Run Crate profile can be used to pack artifacts of a computational workflow of which individual steps are not controlled centrally. Since the pipeline presented in this example consists of steps that are executed individually, and that the pipeline execution is not managed centrally by a workflow engine, the process run crate can be applied.

    Each of the computational steps is expressed within the crate’s ro-crate-metadata.json file as a pair of elements: 1) SW used to create files; 2) specific execution of that SW. In particular, we use the SoftwareSourceCode type to indicate the executed python scripts and the CreateAction type to indicate actual executions.

    As a result, the crate consists of the following seven “executables”:

    • Three python scripts, each corresponding to a part of the pipeline: preprocessing, training, and evaluation.

    • Four provenance generation scripts, three of which implement the transformation of the proprietary log files generated by the AI pipeline scripts into CPM compliant provenance files. The fourth one is a meta provenance generation script.

    For each of the executables, their execution is expressed in the resulting ro-crate-metadata.json using the CreateAction type. As a result, seven create-actions are present in the resulting crate.

    Input dataset, intermediate results, configuration files and resulting provenance files are expressed according to the underlying RO Crate specification.

    CPM RO-Crate related aspects

    The main purpose of the CPM RO-Crate profile is to enable identification of the CPM compliant provenance files within a crate. To achieve this, the CPM RO-Crate profile specification prescribes specific file types for such files: CPMProvenanceFile, and CPMMetaProvenanceFile.

    In this case, the RO Crate contains three CPM Compliant files, each documenting a step of the pipeline, and a single meta-provenance file. These files are generated as a result of the three provenance generation scripts that use available log files and additional information to generate the CPM compliant files. In terms of the CPM, the provenance generation scripts are implementing the concept of provenance finalization event. The three provenance generation scripts are assigned SoftwareSourceCode type, and have corresponding executions expressed in the crate using the CreateAction type.

    Remarks

    The resulting RO Crate packs artifacts of an execution of the AI pipeline. The scripts that implement individual steps of the pipeline and provenance generation are not included in the crate directly. The implementation scripts are hosted on github and just referenced from the crate’s ro-crate-metadata.json file to their remote location.

    The input image files included in this RO-Crate are coming from the Camelyon16 dataset.

  12. RailEnV-PASMVS: a dataset for multi-view stereopsis training and...

    • zenodo.org
    • resodate.org
    • +2more
    bin, csv, png, txt +1
    Updated Jul 18, 2024
    + more versions
    Cite
    André Broekman; André Broekman; Petrus Johannes Gräbe; Petrus Johannes Gräbe (2024). RailEnV-PASMVS: a dataset for multi-view stereopsis training and reconstruction applications [Dataset]. http://doi.org/10.5281/zenodo.5233840
    Explore at:
    Available download formats: bin, csv, txt, zip, png
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    André Broekman; André Broekman; Petrus Johannes Gräbe; Petrus Johannes Gräbe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Perfectly Accurate, Synthetic dataset featuring a virtual railway EnVironment for Multi-View Stereopsis (RailEnV-PASMVS) is presented, consisting of 40 scenes and 79,800 renderings together with ground truth depth maps, extrinsic and intrinsic camera parameters and binary segmentation masks of all the track components and surrounding environment. Every scene is rendered from a set of 3 cameras, each positioned relative to the track for optimal 3D reconstruction of the rail profile. The set of cameras is translated across the 100-meter length of tangent (straight) track to yield a total of 1,995 camera views. Photorealistic lighting of each of the 40 scenes is achieved with the implementation of high-definition, high dynamic range (HDR) environmental textures. Additional variation is introduced in the form of camera focal lengths, random noise for the camera location and rotation parameters and shader modifications of the rail profile. Representative track geometry data is used to generate random and unique vertical alignment data for the rail profile for every scene. This primary, synthetic dataset is augmented by a smaller image collection consisting of 320 manually annotated photographs for improved segmentation performance. The specular rail profile represents the most challenging component for MVS reconstruction algorithms, pipelines and neural network architectures, increasing the ambiguity and complexity of the data distribution. RailEnV-PASMVS represents an application specific dataset for railway engineering, against the backdrop of existing datasets available in the field of computer vision, providing the precision required for novel research applications in the field of transportation engineering.

    File descriptions

    • RailEnV-PASMVS.blend (227 Mb) - Blender file (developed using Blender version 2.8.1) used to generate the dataset. The Blender file packs only one of the HDR environmental textures to use as an example, along with all the other asset textures.
    • RailEnV-PASMVS_sample.png (28 Mb) - A visual collage of 30 scenes, illustrating the variability introduced by using different models, illumination, material properties and camera focal lengths.
    • geometry.zip (2 Mb) - Geometry CSV files used for scenes 01 to 20. The Bezier curve defines the geometry of the rail profile (10 mm intervals).
    • PhysicalDataset.7z (2.0 Gb) - A smaller, secondary dataset of 320 manually annotated photographs of railway environments; only the railway profiles are annotated.
    • 01.7z-40.7z (2.0 Gb each) - Archive of every scene (01 through 40).
    • all_list.txt, training_list.txt, validation_list.txt - Text files containing all the scene names, together with those used for validation (validation_list.txt) and training (training_list.txt), used by MVSNet.
    • index.csv - CSV file providing a convenient reference for all the sample files, linking each file to its relative data path.

    Steps to reproduce

    The open source Blender software suite (https://www.blender.org/) was used to generate the dataset, with the entire pipeline developed using the exposed Python API interface. The camera trajectory is kept fixed for all 40 scenes, except for small perturbations introduced in the form of random noise to increase the camera variation. The camera intrinsic information was initially exported as a single CSV file (scene.csv) for every scene, from which the camera information files were generated; this includes the focal length (focalLengthmm), image sensor dimensions (pixelDimensionX, pixelDimensionY), position, coordinate vector (vectC) and rotation vector (vectR). The STL model files, as provided in this data repository, were exported directly from Blender, such that the geometry/scenes can be reproduced. The data processing below is written for a Python implementation, transforming the information from Blender's coordinate system into universal rotation (R_world2cv) and translation (T_world2cv) matrices.

    import numpy as np
    from scipy.spatial.transform import Rotation as R

    # focalLengthmm, sensorWidthmm, pixelDimensionX, pixelDimensionY, vectC and vectR
    # are read from the per-scene camera file (scene.csv).

    # The intrinsic matrix K is constructed using the following formulation:
    focalLengthPixel = focalLengthmm * pixelDimensionX / sensorWidthmm
    K = np.array([[focalLengthPixel, 0, pixelDimensionX / 2],
                  [0, focalLengthPixel, pixelDimensionY / 2],
                  [0, 0, 1]])

    # The rotation vector as provided by Blender is first transformed to a rotation matrix:
    r = R.from_euler('xyz', vectR, degrees=True)
    matR = r.as_matrix()

    # Transpose the rotation matrix to obtain the transformation from the WORLD to the BLENDER coordinate system:
    R_world2bcam = np.transpose(matR)

    # The matrix describing the transformation from BLENDER to CV/STANDARD coordinates is:
    R_bcam2cv = np.array([[1, 0, 0],
                          [0, -1, 0],
                          [0, 0, -1]])

    # Thus the rotation from WORLD to CV/STANDARD coordinates is:
    R_world2cv = R_bcam2cv.dot(R_world2bcam)

    # The camera coordinate vector requires a similar transformation, moving from BLENDER to WORLD coordinates:
    T_world2bcam = -1 * R_world2bcam.dot(vectC)
    T_world2cv = R_bcam2cv.dot(T_world2bcam)

    The resulting R_world2cv and T_world2cv matrices are written to the camera information file using exactly the same format as that of BlendedMVS developed by Dr. Yao. The original rotation and translation information can be found by following the process in reverse. Note that additional steps were required to convert from Blender's unique coordinate system to that of OpenCV; this ensures universal compatibility in the way that the camera intrinsic and extrinsic information is provided.
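
    A short sketch of that reverse step, continuing from the variables defined in the block above and using the fact that R_bcam2cv is its own inverse (a sanity check added here for clarity, not part of the published pipeline):

    # Recover Blender's Euler rotation vector and camera position from R_world2cv / T_world2cv.
    R_world2bcam = R_bcam2cv.dot(R_world2cv)   # undo the BLENDER-to-CV change of basis
    matR = np.transpose(R_world2bcam)
    vectR = R.from_matrix(matR).as_euler('xyz', degrees=True)

    T_world2bcam = R_bcam2cv.dot(T_world2cv)
    vectC = -matR.dot(T_world2bcam)            # since T_world2bcam = -R_world2bcam @ vectC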

    Equivalent GPS information is provided (gps.csv), whereby the local coordinate frame is transformed into equivalent GPS information, centered around the Engineering 4.0 campus, University of Pretoria, South Africa. This information is embedded within the JPG files as EXIF data.

  13. Data for the Community Regional Atmospheric Chemistry Multiphase Mechanism...

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Sep 29, 2022
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Data for the Community Regional Atmospheric Chemistry Multiphase Mechanism (CRACMM) version 1.0 [Dataset]. https://catalog.data.gov/dataset/data-for-the-community-regional-atmospheric-chemistry-multiphase-mechanism-cracmm-version-
    Explore at:
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Supporting data for CRACMMv1, including the SPECIATE database mapped to CRACMM, input to the Speciation Tool, profile files output from the Speciation Tool for input to SMOKE, Python code for mapping species to CRACMM, the chemical mechanism, and mechanism metadata, are available at https://github.com/USEPA/CRACMM. Specific analyses and scripts used in the manuscript "Linking gas, particulate, and toxic endpoints to air emissions in the Community Regional Atmospheric Chemistry Multiphase Mechanism (CRACMM) version 1.0", such as the 2017 U.S. species-level inventory and code for figures, are available here.

  14. Data from: High-throughput measurement of the content and properties of...

    • figshare.scilifelab.se
    • datasetcatalog.nlm.nih.gov
    • +3more
    txt
    Updated Jan 15, 2025
    Cite
    Erdinc Sezgin; Taras Sych; Jan Schlegel; Hanna Barriga; Miina Ojansivu; Leo Hanke; Florian Weber; R. Beklem Bostancioglu; Kariem Ezzat; Herbert Stangl; Birgit Plochberger; Jurga Laurencikiene; Samir El Andaloussi; Daniel Furth; Molly M. Stevens (2025). High-throughput measurement of the content and properties of nano-sized bioparticles with single-particle profiler [Dataset]. http://doi.org/10.17044/scilifelab.20338869.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Karolinska Institutet
    Authors
    Erdinc Sezgin; Taras Sych; Jan Schlegel; Hanna Barriga; Miina Ojansivu; Leo Hanke; Florian Weber; R. Beklem Bostancioglu; Kariem Ezzat; Herbert Stangl; Birgit Plochberger; Jurga Laurencikiene; Samir El Andaloussi; Daniel Furth; Molly M. Stevens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This item contains data sets for Sych et al., Nature Biotechnology, 2023. It contains raw fluorescence fluctuation data as Excel sheets and raw figure files.

    Abstract: We introduce a method, single-particle profiler (SPP), that provides single-particle information on the content and biophysical properties of thousands of particles in the size range 5-150 nm. We apply SPP to measure the mRNA encapsulation efficiency of lipid nanoparticles, viral binding efficiency of different nanobodies, and biophysical heterogeneity of liposomes, lipoproteins, exosomes and viruses.

    Data usage: Researchers are welcome to use the data contained in the dataset for any project. Please cite this item upon use or when published. We encourage reuse under the same CC BY 4.0 license.

    Data content: FCS files as raw data (.fcs); Excel and Prism files for graphs.

    Software to open the files:

    • .xlsx: Microsoft Excel
    • .pzfx: GraphPad Prism
    • .svg: Inkscape (https://inkscape.org/)
    • .fcs: Single Particle Profiler (https://github.com/taras-sych/Single-particle-profiler)
    • .ipynb: Jupyter Notebook, installed as part of the Anaconda platform, Python 3.8.8 (https://www.anaconda.com/)
    • .py: executed via the Anaconda platform, Python 3.8.8 (https://www.anaconda.com/)

  15. Ecommerce Consumer Behavior Analysis Data

    • kaggle.com
    zip
    Updated Mar 3, 2025
    Cite
    Salahuddin Ahmed (2025). Ecommerce Consumer Behavior Analysis Data [Dataset]. https://www.kaggle.com/datasets/salahuddinahmedshuvo/ecommerce-consumer-behavior-analysis-data
    Explore at:
    Available download formats: zip(44265 bytes)
    Dataset updated
    Mar 3, 2025
    Authors
    Salahuddin Ahmed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a comprehensive collection of consumer behavior data that can be used for various market research and statistical analyses. It includes information on purchasing patterns, demographics, product preferences, customer satisfaction, and more, making it ideal for market segmentation, predictive modeling, and understanding customer decision-making processes.

    The dataset is designed to help researchers, data scientists, and marketers gain insights into consumer purchasing behavior across a wide range of categories. By analyzing this dataset, users can identify key trends, segment customers, and make data-driven decisions to improve product offerings, marketing strategies, and customer engagement.

    Key Features:

    • Customer Demographics: Understand age, income, gender, and education level for better segmentation and targeted marketing.
    • Purchase Behavior: Includes purchase amount, frequency, category, and channel preferences to assess spending patterns.
    • Customer Loyalty: Features like brand loyalty, engagement with ads, and loyalty program membership provide insights into long-term customer retention.
    • Product Feedback: Customer ratings and satisfaction levels allow for analysis of product quality and customer sentiment.
    • Decision-Making: Time spent on product research, time to decision, and purchase intent reflect how customers make purchasing decisions.
    • Influences on Purchase: Factors such as social media influence, discount sensitivity, and return rates are included to analyze how external factors affect purchasing behavior.

    Columns Overview:

    • Customer_ID: Unique identifier for each customer.
    • Age: Customer's age (integer).
    • Gender: Customer's gender (categorical: Male, Female, Non-binary, Other).
    • Income_Level: Customer's income level (categorical: Low, Middle, High).
    • Marital_Status: Customer's marital status (categorical: Single, Married, Divorced, Widowed).
    • Education_Level: Highest level of education completed (categorical: High School, Bachelor's, Master's, Doctorate).
    • Occupation: Customer's occupation (categorical: various job titles).
    • Location: Customer's location (city, region, or country).
    • Purchase_Category: Category of purchased products (e.g., Electronics, Clothing, Groceries).
    • Purchase_Amount: Amount spent during the purchase (decimal).
    • Frequency_of_Purchase: Number of purchases made per month (integer).
    • Purchase_Channel: The purchase method (categorical: Online, In-Store, Mixed).
    • Brand_Loyalty: Loyalty to brands (1-5 scale).
    • Product_Rating: Rating given by the customer to a purchased product (1-5 scale).
    • Time_Spent_on_Product_Research: Time spent researching a product (integer, hours or minutes).
    • Social_Media_Influence: Influence of social media on the purchasing decision (categorical: High, Medium, Low, None).
    • Discount_Sensitivity: Sensitivity to discounts (categorical: Very Sensitive, Somewhat Sensitive, Not Sensitive).
    • Return_Rate: Percentage of products returned (decimal).
    • Customer_Satisfaction: Overall satisfaction with the purchase (1-10 scale).
    • Engagement_with_Ads: Engagement level with advertisements (categorical: High, Medium, Low, None).
    • Device_Used_for_Shopping: Device used for shopping (categorical: Smartphone, Desktop, Tablet).
    • Payment_Method: Method of payment used for the purchase (categorical: Credit Card, Debit Card, PayPal, Cash, Other).
    • Time_of_Purchase: Timestamp of when the purchase was made (date/time).
    • Discount_Used: Whether the customer used a discount (Boolean: True/False).
    • Customer_Loyalty_Program_Member: Whether the customer is part of a loyalty program (Boolean: True/False).
    • Purchase_Intent: The intent behind the purchase (categorical: Impulsive, Planned, Need-based, Wants-based).
    • Shipping_Preference: Shipping preference (categorical: Standard, Express, No Preference).
    • Payment_Frequency: Frequency of payment (categorical: One-time, Subscription, Installments).
    • Time_to_Decision: Time taken from consideration to actual purchase (in days).

    Use Cases:

    • Market Segmentation: Segment customers based on demographics, preferences, and behavior.
    • Predictive Analytics: Use the data to predict customer spending habits, loyalty, and product preferences.
    • Customer Profiling: Build detailed profiles of different consumer segments based on purchase behavior, social media influence, and decision-making patterns.
    • Retail and E-commerce Insights: Analyze purchase channels, payment methods, and shipping preferences to optimize marketing and sales strategies.

    Target Audience:

    • Data scientists and analysts looking for consumer behavior data.
    • Marketers interested in improving customer segmentation and targeting.
    • Researchers exploring factors influencing consumer decisions and preferences.
    • Companies aiming to improve customer experience and increase sales through data-driven decisions.

    This dataset is available in CSV format for easy integration into data analysis tools and platforms such as Python, R, and Excel.
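
    A minimal sketch of loading the CSV with pandas and producing a simple segmentation summary (the file name is a placeholder; column names follow the overview above):

    import pandas as pd

    df = pd.read_csv("ecommerce_consumer_behavior.csv")   # placeholder file name

    # Average spend and satisfaction by income level and purchase channel.
    summary = (df.groupby(["Income_Level", "Purchase_Channel"])
                 .agg(avg_spend=("Purchase_Amount", "mean"),
                      avg_satisfaction=("Customer_Satisfaction", "mean"),
                      customers=("Customer_ID", "nunique"))
                 .reset_index())
    print(summary)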

  16. Data from: LiPydomics: A Python Package for Comprehensive Prediction of...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 4, 2023
    Cite
    Dylan H. Ross; Jang Ho Cho; Rutan Zhang; Kelly M. Hines; Libin Xu (2023). LiPydomics: A Python Package for Comprehensive Prediction of Lipid Collision Cross Sections and Retention Times and Analysis of Ion Mobility-Mass Spectrometry-Based Lipidomics Data [Dataset]. http://doi.org/10.1021/acs.analchem.0c02560.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    ACS Publications
    Authors
    Dylan H. Ross; Jang Ho Cho; Rutan Zhang; Kelly M. Hines; Libin Xu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Comprehensive profiling of lipid species in a biological sample, or lipidomics, is a valuable approach to elucidating disease pathogenesis and identifying biomarkers. Currently, a typical lipidomics experiment may track hundreds to thousands of individual lipid species. However, drawing biological conclusions requires multiple steps of data processing to enrich significantly altered features and confident identification of these features. Existing solutions for these data analysis challenges (i.e., multivariate statistics and lipid identification) involve performing various steps using different software applications, which imposes a practical limitation and potentially a negative impact on reproducibility. Hydrophilic interaction liquid chromatography-ion mobility-mass spectrometry (HILIC-IM-MS) has shown advantages in separating lipids through orthogonal dimensions. However, there are still gaps in the coverage of lipid classes in the literature. To enable reproducible and efficient analysis of HILIC-IM-MS lipidomics data, we developed an open-source Python package, LiPydomics, which enables performing statistical and multivariate analyses (“stats” module), generating informative plots (“plotting” module), identifying lipid species at different confidence levels (“identification” module), and carrying out all functions using a user-friendly text-based interface (“interactive” module). To support lipid identification, we assembled a comprehensive experimental database of m/z and CCS of 45 lipid classes with 23 classes containing HILIC retention times. Prediction models for CCS and HILIC retention time for 22 and 23 lipid classes, respectively, were trained using the large experimental data set, which enabled the generation of a large predicted lipid database with 145,388 entries. Finally, we demonstrated the utility of the Python package using Staphylococcus aureus strains that are resistant to various antimicrobials.

  17. Data from: Smart metering and energy access programs: an approach to energy...

    • esango.cput.ac.za
    Updated May 31, 2023
    Cite
    Bennour Bacar (2023). Smart metering and energy access programs: an approach to energy poverty reduction in sub-Saharan Africa [Dataset]. http://doi.org/10.25381/cput.22264042.v1
    Explore at:
    Dataset updated
    May 31, 2023
    Dataset provided by
    Cape Peninsula University of Technology
    Authors
    Bennour Bacar
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Sub-Saharan Africa
    Description

    Ethical clearance reference number: refer to the uploaded document Ethics Certificate.pdf.

    General (0)

    0 - Built diagrams and figures.pdf: diagrams and figures used for the thesis

    Analysis of country data (1)

    0 - Country selection.xlsx: In this analysis the sub-Saharan country (Niger) is selected based on the kWh-per-capita data obtained from sources such as the United Nations and the World Bank. Other data used from these sources includes household size and electricity access. Some household data was projected using linear regression. Sample sizes vs. error margins were also analyzed for the selection of a smaller area within the country.

    Smart metering experiment (2)

    The figures (PNG, JPG, PDF) include:

        - The experiment components and assembly
        - The use of device (meter and modem) software tools to program and analyse data
        - Phasor and meter detail
        - Extracted reports and graphs from the MDMS
    

    The datasets (CSV, XLSX) include:

        - Energy load profile and register data recorded by the smart meter and collected by both meter configuration and MDM applications.
        - Data collected also includes events, alarm and QoS data.
    

    Data applicability to SEAP (3)

    3 - Energy data and SEAP.pdf: as part of the Smart Metering vs. SEAP framework analysis, a comparison between SEAP's data requirements, the energy data applicable to those requirements, the benefits, and the calculation of indicators where applicable.

    3 - SEAP indicators.xlsx: as part of the Smart Metering vs. SEAP framework analysis, the applicable calculation of indicators for SEAP's data requirements.

    Load prediction by machine learning (4)

    The coding (IPYNB, PY, HTML, ZIP) shows the preparation and exploration of the energy data to train the machine learning model. The datasets (CSV, XLSX), sequentially named, are part of the process of extracting, transforming and loading the data into a machine learning algorithm, identifying the best regression model based on metrics, and predicting the data.

    HRES analysis and optimization (5)

    The figures (PNG, JPG, PDF) include:

        - Household load, based on the energy data from the smart metering experiment and the machine learning exercise
        - Pre-defined/synthetic load, provided by the software when no external data (household load) is available, and
        - The HRES designed
        - Application-generated reports with the results of the analysis, for both best case HRES and fully renewable scenarios.
    

    The datasets (XLSX) include the 12-month input load for the simulation, and the input/output analysis and calculations.

    5 - Gorou_Niger_20220529_v3.homer: software (Homer Pro) file with the simulated HRES.

    Conferences (6)

    6 - IEEE_MISTA_2022_paper_51.pdf: paper (research in progress) presented at the IEEE MISTA 2022 conference, held in March 2022, and published in the respective proceedings (6 - IEEE_MISTA_2022_proceeding.pdf).

    6 - ITAS_2023.pdf: paper (final research) recently presented at the ITAS 2023 conference in Doha, Qatar, in March 2023.

    6 - Smart Energy Seminar 2023.pptx: PowerPoint slide version of the paper, recently presented at the Smart Energy Seminar held at CPUT in March 2023.

  18. Electric Vehicle Population Analysis

    • kaggle.com
    zip
    Updated Jun 23, 2025
    Cite
    Nibedita Sahu (2025). Electric Vehicle Population Analysis [Dataset]. https://www.kaggle.com/datasets/nibeditasahu/electric-vehicle-population-analysis
    Explore at:
    Available download formats: zip(10564209 bytes)
    Dataset updated
    Jun 23, 2025
    Authors
    Nibedita Sahu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Electric Vehicle Population Analysis

    A data-driven end-to-end analysis of Electric Vehicle adoption, performance, and policy alignment across Washington State. This project covers everything from data cleaning and exploration to visualization and presentation — using SQL, Python, and Power BI.

    Tools & Technologies

    • SQL (MySQL): Data cleaning, filtering, type conversion, preprocessing
    • Python (Jupyter Notebook): Pandas, SQLAlchemy, NumPy, Matplotlib, Seaborn
    • Pandas Profiling / YData EDA: Automated EDA for in-depth data profiling
    • Power BI: Interactive, multi-page report design and visual analysis
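
    A minimal sketch of the SQL-to-pandas handoff implied by the tool list above (connection string, table and column names are placeholders for the local setup):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder credentials and table name; adjust to the local MySQL instance.
    engine = create_engine("mysql+pymysql://user:password@localhost:3306/ev_db")
    ev = pd.read_sql("SELECT * FROM ev_population", engine)

    # Quick check of EV counts by model year (column name assumed for illustration).
    print(ev.groupby("Model Year").size().sort_index())
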
  19. Data from: iCalendar: Satellite-based Field Map Calendar

    • datasets.ai
    • catalog.data.gov
    0, 22
    Updated May 31, 2024
    + more versions
    Cite
    Department of Agriculture (2024). iCalendar: Satellite-based Field Map Calendar [Dataset]. https://datasets.ai/datasets/icalendar-satellite-based-field-map-calendar
    Explore at:
    Available download formats: 0, 22
    Dataset updated
    May 31, 2024
    Dataset authored and provided by
    Department of Agriculture
    Description

    GUI-based software coded in Python to support high-throughput image processing and analytics of large satellite-imagery datasets, providing spatiotemporal monitoring of crop health conditions throughout the growing season by automatically illustrating 1) a field map calendar (FMC) with daily thumbnails of vegetation heatmaps for each month and 2) a seasonal Vegetation Index (VI) profile of the crop fields. Output examples of the FMC and VI profile are found in the files fmCalendar.jpg and NDVI_Profile.jpg, respectively, which were created from satellite imagery acquired between 5/1 and 10/31, 2020, over a sugarbeet field in Moorhead, MN.
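
    For context, a minimal sketch of the vegetation-index computation behind such heatmaps (NDVI from red and near-infrared bands; the arrays below are placeholders standing in for satellite band rasters):

    import numpy as np

    def ndvi(nir, red):
        """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
        nir = nir.astype("float64")
        red = red.astype("float64")
        return (nir - red) / (nir + red + 1e-12)   # small epsilon avoids division by zero

    nir_band = np.array([[0.52, 0.61], [0.58, 0.65]])   # placeholder reflectance values
    red_band = np.array([[0.10, 0.12], [0.09, 0.15]])
    print(ndvi(nir_band, red_band))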

  20. WoSIS snapshot - December 2023

    • data.isric.org
    • repository.soilwise-he.eu
    Updated Dec 20, 2023
    + more versions
    Cite
    ISRIC - World Soil Information (2023). WoSIS snapshot - December 2023 [Dataset]. https://data.isric.org/geonetwork/srv/api/records/e50f84e1-aa5b-49cb-bd6b-cd581232a2ec
    Explore at:
    www:link-1.0-http--related, www:link-1.0-http--link, www:download-1.0-ftp--downloadAvailable download formats
    Dataset updated
    Dec 20, 2023
    Dataset provided by
    International Soil Reference and Information Centre
    Authors
    ISRIC - World Soil Information
    Time period covered
    Jan 1, 1918 - Dec 1, 2022
    Area covered
    Description

    ABSTRACT: The World Soil Information Service (WoSIS) provides quality-assessed and standardized soil profile data to support digital soil mapping and environmental applications at broad scale levels. Since the release of the 'WoSIS snapshot 2019' many new soil data were shared with us, registered in the ISRIC data repository, and subsequently standardized in accordance with the licenses specified by the data providers. The source data were contributed by a wide range of data providers, therefore special attention was paid to the standardization of soil property definitions, soil analytical procedures and soil property values (and units of measurement). We presently consider the following soil chemical properties (organic carbon, total carbon, total carbonate equivalent, total Nitrogen, Phosphorus (extractable-P, total-P, and P-retention), soil pH, cation exchange capacity, and electrical conductivity) and physical properties (soil texture (sand, silt, and clay), bulk density, coarse fragments, and water retention), grouped according to analytical procedures (aggregates) that are operationally comparable. For each profile we provide the original soil classification (FAO, WRB, USDA, and version) and horizon designations as far as these have been specified in the source databases. Three measures for 'fitness-for-intended-use' are provided: positional uncertainty (for site locations), time of sampling/description, and a first approximation for the uncertainty associated with the operationally defined analytical methods. These measures should be considered during digital soil mapping and subsequent earth system modelling that use the present set of soil data.

    DATA SET DESCRIPTION: The 'WoSIS 2023 snapshot' comprises data for 228k profiles from 217k geo-referenced sites that originate from 174 countries. The profiles represent over 900k soil layers (or horizons) and over 6 million records. The actual number of measurements for each property varies (greatly) between profiles and with depth, generally depending on the objectives of the initial soil sampling programmes. The data are provided in TSV (tab separated values) format and as GeoPackage. The zip-file (446 Mb) contains the following files:

    - Readme_WoSIS_202312_v2.pdf: Provides a short description of the dataset, file structure, column names, units and category values (this file is also available directly under 'online resources'). The pdf includes links to tutorials for downloading the TSV files into R respectively Excel. See also 'HOW TO READ TSV FILES INTO R AND PYTHON' below.
    - wosis_202312_observations.tsv: Lists the four to six letter code for each observation, whether the observation is for a site/profile or a layer (horizon), the unit of measurement, and the number of profiles respectively layers represented in the snapshot. It also provides an estimate of the inferred accuracy of the laboratory measurements.
    - wosis_202312_sites.tsv: Characterizes the site locations where profiles were sampled.
    - wosis_202312_profiles.tsv: Presents the unique profile ID (i.e. primary key), site_id, source of the data, country ISO code and name, positional uncertainty, latitude and longitude (WGS 1984), maximum depth of soil described and sampled, as well as information on the soil classification system and edition. Depending on the soil classification system used, the number of fields will vary.
    - wosis_202312_layers.tsv: Characterizes the layers (or horizons) per profile, and lists their upper and lower depths (cm).
    - wosis_202312_xxxx.tsv: Presents the results for each observation (e.g. "xxxx" = "BDFIOD"), as defined under "code" in wosis_202312_observations.tsv (e.g. wosis_202312_bdfiod.tsv).
    - wosis_202312.gpkg: Contains the above data files in GeoPackage format (which stores the files within an SQLite database).

    HOW TO READ TSV FILES INTO R AND PYTHON:

    A) To read the data in R, uncompress the ZIP file and set the working directory to the uncompressed folder:

    setwd("/YourFolder/WoSIS_2023_December/")  ## for example: setwd('D:/WoSIS_2023_December/')

    Then use read_tsv to read the TSV files, specifying the data type for each column (c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time):

    observations = readr::read_tsv('wosis_202312_observations.tsv', col_types='cccciid')
    observations  ## show columns and first 10 rows
    sites = readr::read_tsv('wosis_202312_sites.tsv', col_types='iddcccc')
    sites
    profiles = readr::read_tsv('wosis_202312_profiles.tsv', col_types='icciccddcccccciccccicccci')
    profiles
    layers = readr::read_tsv('wosis_202312_layers.tsv', col_types='iiciciiilcc')
    layers
    ## Do this for each observation 'xxxx', e.g. file 'wosis_202312_orgc.tsv':
    orgc = readr::read_tsv('wosis_202312_orgc.tsv', col_types='iicciilccdccddccccc')
    orgc

    Note: one may also use base R (example for 'wosis_202312_observations.tsv'):

    observations <- read.table("wosis_202312_observations.tsv", sep = "\t", header = TRUE, quote = "", comment.char = "", stringsAsFactors = FALSE)

    B) To read the files into Python, first decompress the files to your selected folder. Then in Python:

    # import the required library
    import pandas as pd
    # Read the observations data
    observations = pd.read_csv("wosis_202312_observations.tsv", sep="\t")
    # print the data frame header and some rows
    observations.head()
    # Read the sites data
    sites = pd.read_csv("wosis_202312_sites.tsv", sep="\t")
    # Read the profiles data
    profiles = pd.read_csv("wosis_202312_profiles.tsv", sep="\t")
    # Read the layers data
    layers = pd.read_csv("wosis_202312_layers.tsv", sep="\t")
    # Read the soil property data, e.g. 'cfvo' (do this for each observation)
    cfvo = pd.read_csv("wosis_202312_cfvo.tsv", sep="\t")

    CITATION: Calisto, L., de Sousa, L.M., Batjes, N.H., 2023. Standardised soil profile data for the world (WoSIS snapshot - December 2023), https://doi.org/10.17027/isric-wdcsoils-20231130. Supplement to: Batjes, N.H., Calisto, L. and de Sousa, L.M., 2023. Providing quality-assessed and standardised soil data to support global mapping and modelling (WoSIS snapshot 2023). Earth System Science Data, https://doi.org/10.5194/essd-16-4735-2024.
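
    Beyond the reading examples above, the per-property files can be joined back to the profile metadata for mapping or modelling. The sketch below is an assumption-laden illustration only: the key column (assumed here to be 'profile_id'), the coordinate columns ('latitude', 'longitude') and the measurement column (assumed 'value_avg') should be verified against Readme_WoSIS_202312_v2.pdf before use.

    # Minimal sketch (assumed column names): attach coordinates to organic carbon
    # measurements and compute a mean value per profile.
    import pandas as pd

    profiles = pd.read_csv("wosis_202312_profiles.tsv", sep="\t")
    orgc = pd.read_csv("wosis_202312_orgc.tsv", sep="\t")

    key = "profile_id"  # assumed primary-key column; check the Readme
    merged = orgc.merge(profiles[[key, "latitude", "longitude"]], on=key, how="left")

    # Mean organic carbon per profile, with coordinates for mapping.
    per_profile = merged.groupby([key, "latitude", "longitude"], as_index=False)["value_avg"].mean()
    print(per_profile.head())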
