12 datasets found
  1. Klib library python

    • kaggle.com
    zip
    Updated Jan 11, 2021
    Cite
    Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python
    Explore at:
    zip (89892446 bytes)
    Available download formats
    Dataset updated
    Jan 11, 2021
    Authors
    Sripaad Srinivasan
    Description

    The klib library enables us to quickly visualize missing data, perform data cleaning, and visualize data distributions, correlations, and categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

    Original Github repo

    klib header image: https://raw.githubusercontent.com/akanz1/klib/main/examples/images/header.png

    Usage

    !pip install klib
    
    import klib
    import pandas as pd
    
    df = pd.DataFrame(data)
    
    # klib.describe functions for visualizing datasets
    klib.cat_plot(df)          # returns a visualization of the number and frequency of categorical features
    klib.corr_mat(df)          # returns a color-encoded correlation matrix
    klib.corr_plot(df)         # returns a color-encoded heatmap, ideal for correlations
    klib.dist_plot(df)         # returns a distribution plot for every numeric feature
    klib.missingval_plot(df)   # returns a figure containing information about missing values
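
    Below is a minimal, self-contained sketch of these calls on a small synthetic DataFrame (not part of the original dataset description; it assumes klib, pandas, and numpy are installed):

    import numpy as np
    import pandas as pd
    import klib

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "age": rng.integers(18, 65, size=200).astype(float),
        "income": rng.normal(50_000, 12_000, size=200),
        "segment": rng.choice(["A", "B", "C"], size=200),
    })
    # Inject some missing values so missingval_plot has something to show.
    df.loc[df.sample(frac=0.1, random_state=0).index, "income"] = np.nan

    klib.missingval_plot(df)  # overview of missing values per column
    klib.dist_plot(df)        # distribution plot for each numeric feature
    klib.cat_plot(df)         # number and frequency of categorical feature values
    klib.corr_plot(df)        # color-encoded correlation heatmap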
    

    Examples

    Take a look at this starter notebook.

    Further examples, as well as applications of the functions can be found here.

    Contributing

    Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

    License

    MIT

  2. Data from: BaNDyT: Bayesian Network Modeling of Molecular Dynamics...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xlsx
    Updated Jan 23, 2025
    Cite
    Elizaveta Mukhaleva; Babgen Manookian; Hanyu Chen; Indira R. Sivaraj; Ning Ma; Wenyuan Wei; Konstancja Urbaniak; Grigoriy Gogoshin; Supriyo Bhattacharya; Nagarajan Vaidehi; Andrei S. Rodin; Sergio Branciamore (2025). BaNDyT: Bayesian Network Modeling of Molecular Dynamics Trajectories [Dataset]. http://doi.org/10.1021/acs.jcim.4c01981.s002
    Explore at:
    xlsx
    Available download formats
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    ACS Publications
    Authors
    Elizaveta Mukhaleva; Babgen Manookian; Hanyu Chen; Indira R. Sivaraj; Ning Ma; Wenyuan Wei; Konstancja Urbaniak; Grigoriy Gogoshin; Supriyo Bhattacharya; Nagarajan Vaidehi; Andrei S. Rodin; Sergio Branciamore
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Bayesian network modeling (BN modeling, or BNM) is an interpretable machine learning method for constructing probabilistic graphical models from the data. In recent years, it has been extensively applied to diverse types of biomedical data sets. Concurrently, our ability to perform long-time scale molecular dynamics (MD) simulations on proteins and other materials has increased exponentially. However, the analysis of MD simulation trajectories has not been data-driven but rather dependent on the user’s prior knowledge of the systems, thus limiting the scope and utility of the MD simulations. Recently, we pioneered using BNM for analyzing the MD trajectories of protein complexes. The resulting BN models yield novel fully data-driven insights into the functional importance of the amino acid residues that modulate proteins’ function. In this report, we describe the BaNDyT software package that implements the BNM specifically attuned to the MD simulation trajectories data. We believe that BaNDyT is the first software package to include specialized and advanced features for analyzing MD simulation trajectories using a probabilistic graphical network model. We describe here the software’s uses, the methods associated with it, and a comprehensive Python interface to the underlying generalist BNM code. This provides a powerful and versatile mechanism for users to control the workflow. As an application example, we have utilized this methodology and associated software to study how membrane proteins, specifically the G protein-coupled receptors, selectively couple to G proteins. The software can be used for analyzing MD trajectories of any protein as well as polymeric materials.

  3. Enterprise GenAI Adoption & Workforce Impact Data

    • kaggle.com
    zip
    Updated Jun 12, 2025
    Cite
    Rishi (2025). Enterprise GenAI Adoption & Workforce Impact Data [Dataset]. https://www.kaggle.com/datasets/tfisthis/enterprise-genai-adoption-and-workforce-impact-data/discussion?sort=undefined
    Explore at:
    zip (3081470 bytes)
    Available download formats
    Dataset updated
    Jun 12, 2025
    Authors
    Rishi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Enterprise GenAI Adoption & Workforce Impact Dataset (100K+ Rows)

    This dataset originates from a multi-year enterprise survey conducted across industries and countries. It focuses on the organizational effects of adopting Generative AI tools such as ChatGPT, Claude, Gemini, Mixtral, LLaMA, and Groq. The dataset captures detailed metrics on job role creation, workforce transformation, productivity changes, and employee sentiment.

    Data Schema

    columns = [
      "Company Name",           # Anonymized name
      "Industry",             # Sector (e.g., Finance, Healthcare)
      "Country",              # Country of operation
      "GenAI Tool",            # GenAI platform used
      "Adoption Year",           # Year of initial deployment (2022–2024)
      "Number of Employees Impacted",   # Affected staff count
      "New Roles Created",        # Number of AI-driven job roles introduced
      "Training Hours Provided",     # Upskilling time investment
      "Productivity Change (%)",     # % shift in reported productivity
      "Employee Sentiment"        # Textual feedback from employees
    ]
    

    Load the Dataset

    import pandas as pd
    
    df = pd.read_csv("Large_Enterprise_GenAI_Adoption_Impact.csv")
    df.shape
    

    Basic Exploration

    df.head(10)
    df.describe()
    df["GenAI Tool"].value_counts()
    df["Industry"].unique()
    

    Filter Examples

    Filter by Year and Country

    df[(df["Adoption Year"] == 2023) & (df["Country"] == "India")]
    

    Get Top 5 Industries by Productivity Gain

    df.groupby("Industry")["Productivity Change (%)"].mean().sort_values(ascending=False).head()
    

    Text Analysis on Employee Sentiment

    Word Frequency Analysis

    from collections import Counter
    import re
    
    text = " ".join(df["Employee Sentiment"].dropna().tolist())
    words = re.findall(r'\b\w+\b', text.lower())
    common_words = Counter(words).most_common(20)
    print(common_words)
    

    Sentiment Length Distribution

    df["Sentiment Length"] = df["Employee Sentiment"].fillna("").apply(lambda x: len(x.split()))
    df["Sentiment Length"].hist(bins=50)
    

    Group-Based Insights

    Role Creation by Tool

    df.groupby("GenAI Tool")["New Roles Created"].mean().sort_values(ascending=False)
    

    Training Hours by Industry

    df.groupby("Industry")["Training Hours Provided"].mean().sort_values(ascending=False)
    

    Sample Use Cases

    • Evaluate GenAI adoption patterns by sector or region
    • Analyze workforce upskilling initiatives and investments
    • Explore employee reactions to AI integration using NLP
    • Build models to predict productivity impact based on tool, industry, or country (see the sketch below)
    • Study role creation trends to anticipate future AI-based job market shifts
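
    For the modeling use case referenced above, here is a minimal, hedged sketch (not part of the original dataset card) that predicts the "Productivity Change (%)" column from the tool, industry, country, and training-hours columns listed in the schema; it assumes scikit-learn is installed and uses the CSV filename shown earlier:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("Large_Enterprise_GenAI_Adoption_Impact.csv")

    features = ["GenAI Tool", "Industry", "Country", "Training Hours Provided"]
    target = "Productivity Change (%)"
    df = df.dropna(subset=features + [target])  # keep only complete rows for this sketch

    # One-hot encode the categorical columns, pass the numeric column through.
    pre = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["GenAI Tool", "Industry", "Country"])],
        remainder="passthrough",
    )
    model = Pipeline([("pre", pre), ("reg", Ridge())])

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=42
    )
    model.fit(X_train, y_train)
    print("Held-out R^2:", model.score(X_test, y_test))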
  4. Replication Package: Unboxing Default Argument Breaking Changes in 1 + 2...

    • zenodo.org
    application/gzip
    Updated Jul 15, 2024
    Cite
    João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Daniel Prates; Arthur Bonifácio; Ghizlane El Boussaidi (2024). Replication Package: Unboxing Default Argument Breaking Changes in 1 + 2 Data Science Libraries in Python [Dataset]. http://doi.org/10.5281/zenodo.11584961
    Explore at:
    application/gzip
    Available download formats
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Daniel Prates; Arthur Bonifácio; Ghizlane El Boussaidi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package

    This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".

    Requirements

    We recommend the following requirements to replicate our study:

    1. Internet access
    2. At least 100GB of space
    3. Docker installed
    4. Git installed

    Package Structure

    We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:

    • data-analysis, an R-based Container we used to run our data analysis.
    • data-collection, a Python Container we used to collect Scikit's default arguments and detect them in client applications.
    • database, a Postgres Container we used to store clients' data, obtained from Grotov et al.
    • storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared in both containers.
    • docker-compose.yml, the Docker file that configures all containers used in the package.

    In the remainder of this document, we describe how to set up each container properly.

    Using VSCode to Setup the Package

    We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both the data-analysis and data-collection containers, so you can open and run each container directly without any additional configuration.

    You first need to set up the containers

    $ cd /replication/package/folder
    $ docker-compose build
    $ docker-compose up
    # Wait for Docker to create and start all containers
    

    Then, you can open them in Visual Studio Code:

    1. Open VSCode in project root folder
    2. Access the command palette and select "Dev Container: Reopen in Container"
      1. Select either Data Collection or Data Analysis.
    3. Start working

    If you want/need a more customized organization, the remainder of this file describes it in detail.

    Longest Road: Manual Package Setup

    Database Setup

    The database container will automatically restore the dump in dump_matroskin.tar on its first launch. To set up and run the container, you should:

    Build an image:

    $ cd ./database
    $ docker build --tag 'dabc-database' .
    $ docker image ls
    REPOSITORY  TAG    IMAGE ID    CREATED     SIZE
    dabc-database latest  b6f8af99c90d  50 minutes ago  18.5GB
    

    Create and enter inside the container:

    $ docker run -it --name dabc-database-1 dabc-database
    $ docker exec -it dabc-database-1 /bin/bash
    root# psql -U postgres -h localhost -d jupyter-notebooks
    jupyter-notebooks=# \dt
                 List of relations
     Schema |       Name        | Type  | Owner
    --------+-------------------+-------+-------
     public | Cell              | table | root
     public | Code_cell         | table | root
     public | Md_cell           | table | root
     public | Notebook          | table | root
     public | Notebook_features | table | root
     public | Notebook_metadata | table | root
     public | repository        | table | root
    

    If you got the tables list as above, your database is properly setup.

    It is important to mention that this database extends the one provided by Grotov et al. Basically, we added three columns to the table Notebook_features (API_functions_calls, defined_functions_calls, and other_functions_calls) containing the function calls performed by each client in the database.

    Data Collection Setup

    This container is responsible for collecting the data to answer our research questions. It has the following structure:

    • dabcs.py, extract DABCs from Scikit Learn source code, and export them to a CSV file.
    • dabcs-clients.py, extract function calls from clients and export them to a CSV file. We rely on a modified version of Matroskin to extract the function calls. You can find the tool's source code in the `matroskin` directory.
    • Makefile, commands to set up and run both dabcs.py and dabcs-clients.py
    • matroskin, the directory containing the modified version of matroskin tool. We extended the library to collect the function calls performed on the client notebooks of Grotov's dataset.
    • storage, a docker volume where the data-collection should save the exported data. This data will be used later in Data Analysis.
    • requirements.txt, Python dependencies adopted in this module.

    Note that the container will automatically configure this module for you, e.g., install dependencies, configure matroskin, download scikit learn source code, etc. For this, you must run the following commands:

    $ cd ./data-collection
    $ docker build --tag "data-collection" .
    $ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
    $ docker exec -it data-collection-1 /bin/bash
    $ ls
    Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py
    

    If you see project files, it means the container is configured accordingly.

    Data Analysis Setup

    We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:

    • dependencies.R, an R script containing the dependencies used in our data analysis.
    • data-analysis.Rmd, the R notebook we used to perform our data analysis
    • datasets, a docker volume pointing to the storage directory.

    Execute the following commands to run this container:

    $ cd ./data-analysis
    $ docker build --tag "data-analysis" .
    $ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-collection/datasets/ data-analysis
    $ docker exec -it data-analysis-1 /bin/bash
    $ ls
    data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile
    

    If you see project files, it means the container is configured accordingly.

    A note on storage shared folder

    As mentioned, the storage folder is mounted as a volume and shared between the data-collection and data-analysis containers. We compressed the contents of this folder due to space constraints. Therefore, before starting work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the Makefile inside the storage folder.

    $ make unzip # extract files
    $ ls
    clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv
    $ make zip # compress files
    $ ls
    csv-files.tar.gz Makefile
  5. Global scientific academies Dataset

    • scidb.cn
    Updated Nov 18, 2024
    Cite
    chen xiaoli (2024). Global scientific academies Dataset [Dataset]. http://doi.org/10.57760/sciencedb.14674
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 18, 2024
    Dataset provided by
    Science Data Bank
    Authors
    chen xiaoli
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset was generated as part of a study aimed at profiling global scientific academies, which play a significant role in promoting scholarly communication and scientific progress. Below is a detailed description of the dataset.

    Data Generation Procedures and Tools: The dataset was compiled using a combination of web scraping, manual verification, and data integration from multiple sources, including Wikipedia categories, membership lists of unions of scientific organizations, and web searches using specific query phrases (e.g., "country name + (academy OR society) AND site:.country code"). The records were enriched by cross-referencing data from the Wikidata API, the VIAF API, and the Research Organisation Registry (ROR). Additional manual curation ensured accuracy and consistency.

    Temporal and Geographical Scopes: The dataset covers scientific academies over a wide temporal scope, ranging from the 15th century to the present. The geographical scope includes academies from all continents, with emphasis on both developed and post-developing countries. The dataset aims to capture the full spectrum of scientific academies across different periods of historical development.

    Tabular Data Description: The dataset comprises a total of 301 academy records and 14,008 website navigation sections. Each row in the dataset represents a single scientific academy, while the columns describe attributes such as the academy's name, founding date, location (city and country), website URL, email, and address.

    Missing Data: Although the dataset offers comprehensive coverage, some entries may have missing or incomplete fields. For instance, the section field is not available for all records.

    Data Errors and Error Ranges: The data has been verified through manual curation, reducing the likelihood of errors. However, the use of crowd-sourced data from platforms like Wikipedia introduces potential risks of outdated or incomplete information. Any errors are likely minor and confined to fields such as navigation menu classifications, which may not fully reflect the breadth of an academy's activities.

    Data Files, Formats, and Sizes: The dataset is provided in CSV and JSON formats, ensuring compatibility with a wide range of software applications, including Microsoft Excel, Google Sheets, and programming languages such as Python (via libraries like pandas).

    This dataset provides a valuable resource for further research into the organizational behaviors, geographic distribution, and historical significance of scientific academies across the globe. It can be used for large-scale analyses, including comparative studies across different regions or time periods. Any feedback on the data is welcome! Please contact the maintainer of the dataset!

    If you use the data, please cite the following paper: Xiaoli Chen and Xuezhao Wang. 2024. Profiling Global Scientific Academies. In The 2024 ACM/IEEE Joint Conference on Digital Libraries (JCDL '24), December 16–20, 2024, Hong Kong, China. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3677389.3702582
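
    As a purely hypothetical loading sketch (the actual file names and column headers in the Science Data Bank deposit are not listed here, so the ones below are placeholders), the CSV version can be inspected with pandas:

    import pandas as pd

    # Placeholder filename; replace with the actual CSV file from the deposit.
    academies = pd.read_csv("global_scientific_academies.csv")
    print(academies.shape)    # expected on the order of 301 rows (one per academy)
    print(academies.columns)  # name, founding date, city/country, website URL, email, address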

  6. Python code data of attention-based dual-scale hierarchical LSTM for tool...

    • scidb.cn
    Updated Nov 7, 2022
    Cite
    Hao Guo; Kunpeng Zhu (2022). Python code data of attention-based dual-scale hierarchical LSTM for tool wear monitoring [Dataset]. http://doi.org/10.57760/sciencedb.06004
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 7, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Hao Guo; Kunpeng Zhu
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The experiment is based on a common high-speed milling data set to verify the robustness of the model to various tool types. The data set contains six sub data sets, corresponding to the wear process of six different types of tools. Three of the sub data sets contain tool wear labels, while the other three do not. The tools used are all three-edged 6 mm ball cemented carbide tools, but their geometry and coating differ. The workpiece is Inconel 718, which is widely used for jet engine blade milling. The spindle speed is 10360 rpm, and the cutting depth is 0.25 mm. The tool cuts from the upper edge of the workpiece surface to the lower edge in a zigzag manner. Over the whole milling process, the cutting length of each tool is about 0.1125 m × 315 passes ≈ 35.44 m. The cutting signals in Experiment 1 include the cutting force signal collected by a three-channel Kistler dynamometer and the vibration signal collected by a three-channel Kistler accelerometer, both at a sampling rate of 50 kHz. A LEICA MZ12 microscope was used to measure the flank wear of the three teeth offline after each tool pass. In this experiment, a cutting signal is collected at regular intervals to predict the wear of the three teeth of the tool.

    The samples are divided into a training set, evaluation set, test set, and reconstruction set. The training and evaluation set samples come from two tools, containing 30000 and 4096 samples respectively; the test set samples come from another tool, containing 9472 samples; the reconstruction set comes from the unlabeled data generated by the other three tools, containing 40832 samples. Each sample contains three channels of cutting force signal and three channels of vibration signal, with 2304 sampling points per channel. The following preprocessing steps are performed:

    1) Signal clipping. Since the feed rate and sampling rate are constant throughout the experiment, the data set of each experiment can be approximately understood as a signal matrix evenly distributed over the workpiece surface, ignoring the slight difference in the number of sampling points per tool path. The ordinate of the matrix corresponds to the index of the tool pass, and the abscissa corresponds to the index of the sampling point. Because the generation rules of cutting signals differ between the uncut, cut-in, cut-out, and stable states, the sampling points close to the edge of the workpiece are removed. Here we simply cut 2% off the two ends of the cutting signal obtained by each tool pass.

    2) Data amplification. Because tool wear can only be observed with a microscope after each tool pass, each wear label corresponds to a cutting signal containing about 120000 sampling points, and acquiring tool wear measurements also takes a lot of time. In this case, the number of labels is not enough to fit the model, nor can the robustness of the algorithm be guaranteed. It is therefore necessary to artificially split the samples and expand the tool wear labels. Considering that tool wear is a slow and continuous process, and that there is a certain deviation in the experimental measurements, linear interpolation is adopted here. We also tested quadratic interpolation and polynomial fitting, but observed no better results. It needs to be stated that the essence of prediction is to find a function that maps the sample space to the target space: for any point in the sample space, the model can find the corresponding value in the target space. What sample amplification does is sample more points in the target space, so as to describe this mapping more comprehensively, rather than redefining the relationship.

    The task of this study is to monitor the flank wear of the three teeth from the six-channel sensor signals. On the test set, the mean square error (MSE) and mean absolute percentage error (MAPE) between the predicted values and the microscope observations are 0.0013 and 4%, respectively, and the average and maximum final prediction errors (FPE) are 5 μm and 23 μm. The training time was 2130 s, and the single prediction time was 1.79 ms. The accuracy, training time, and detection efficiency of the tool wear monitoring can meet current industrial needs. As MPAN realizes the mapping from cutting signal to tool wear, the attention unit, acting as the gate controlling information flow, retains the importance information of the input features. The predicted tool wear curve is basically consistent with the curve observed by the microscope.
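
    To make the label-amplification step concrete, here is an illustrative sketch of linear interpolation of wear labels between per-pass microscope measurements (the wear values and split factor are invented for illustration and are not the authors' code):

    import numpy as np

    # One flank-wear measurement per tool pass (315 passes, as in the experiment).
    n_passes = 315
    wear_per_pass = np.linspace(0.05, 0.30, n_passes)  # placeholder wear curve (mm)

    # Each pass is split into several short signal samples; each sample gets a
    # wear label by interpolating linearly between the per-pass measurements.
    samples_per_pass = 8                                # invented split factor
    sample_positions = np.linspace(0, n_passes - 1, n_passes * samples_per_pass)
    wear_labels = np.interp(sample_positions, np.arange(n_passes), wear_per_pass)

    print(wear_labels.shape)  # (2520,): one interpolated label per split sample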

  7. Field and analytical data from Tipperary CO2 seep, Daylesford, Australia...

    • data-search.nerc.ac.uk
    • ckan.publishing.service.gov.uk
    • +2more
    Updated May 8, 2020
    Cite
    (2020). Field and analytical data from Tipperary CO2 seep, Daylesford, Australia (2017) [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?format=.py%20python%20code
    Explore at:
    Dataset updated
    May 8, 2020
    Description

    This dataset contains:
    1. An Excel spreadsheet of field data from Tipperary pool, including CO2 bubble locations, raw and derived flux data, and field descriptions, from the March 2017 field campaign.
    2. Python scripts for a two-point correlation function, a spatial statistical method used to describe the spatial distribution of points, applied to the Tipperary pool CO2 bubbling points to determine geological controls on their distribution.
    As reported in: Roberts, J.J., Leplastrier, A., Feitz, A., Bell, A., Karolyte, R., Shipton, Z.K. Structural controls on the location and distribution of CO2 leakage at a natural CO2 spring in Daylesford, Australia. IJGHGC.
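
    As a rough illustration of the two-point correlation idea (this is not the Python scripts shipped with the dataset, a proper estimator would also include edge corrections, and numpy/scipy are assumed to be installed), one can compare the pairwise-distance histogram of observed bubble locations against random points in the same window:

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(1)
    points = rng.uniform(0, 10, size=(200, 2))  # placeholder bubble coordinates (m)

    bins = np.linspace(0, 5, 26)
    data_counts, _ = np.histogram(pdist(points), bins=bins)

    # Reference: average pair counts from random (CSR) patterns in the same window.
    n_ref = 50
    ref_counts = np.zeros(len(bins) - 1)
    for _ in range(n_ref):
        rand_pts = rng.uniform(0, 10, size=points.shape)
        counts, _ = np.histogram(pdist(rand_pts), bins=bins)
        ref_counts += counts / n_ref

    # Ratios > 1 indicate clustering at that separation; < 1 indicates dispersion.
    two_point = data_counts / np.maximum(ref_counts, 1e-9)
    print(np.round(two_point, 2))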

  8. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    zip (23875170 bytes)
    Available download formats
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets that they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all the transactions that occurred over a period of time. The retailer will use the results to grow its business: by suggesting itemsets to customers, we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association Rule mining is most used when you are planning to build associations between different objects in a set. It works when you are planning to find frequent patterns in a transaction database. It can tell you what items customers frequently buy together, and it allows the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(computer mouse) = 0.08/0.10 = 0.8
    • lift = confidence / P(mouse mat) = 0.8/0.09 ≈ 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
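
    A quick sanity check of the arithmetic above in plain Python (not part of the original R workflow):

    # Rule: {computer mouse} => {mouse mat}
    n_customers = 100
    n_mouse = 10   # bought a computer mouse
    n_mat = 9      # bought a mouse mat
    n_both = 8     # bought both

    support = n_both / n_customers             # P(mouse & mat) = 0.08
    confidence = n_both / n_mouse              # support / P(mouse) = 0.8
    lift = confidence / (n_mat / n_customers)  # confidence / P(mat) ≈ 8.9

    print(support, confidence, round(lift, 1))  # 0.08 0.8 8.9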

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Row: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

    Libraries in R

    First, we need to load required libraries. Shortly I describe all libraries.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

    Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

    Data Pre-processing

    Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.

    Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
    Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

    Next, we clean our data frame by removing missing values.

    Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

    To apply Association Rule mining, we need to convert the dataframe into transaction data so that all items that are bought together in one invoice will be in ...

  9. Dash sylvereye: a Python library for dashboard-driven visualization of large...

    • repositorio.observatoriogeo.mx
    Updated Oct 21, 2025
    Cite
    (2025). Dash sylvereye: a Python library for dashboard-driven visualization of large street networks - Dataset - Repositorio del Observatorio Metropolitano CentroGeo [Dataset]. http://repositorio.observatoriogeo.mx/dataset/e0ce07a5189e
    Explore at:
    Dataset updated
    Oct 21, 2025
    Description

    State-of-the-art open network visualization tools like Gephi, KeyLines, and Cytoscape are not suitable for studying street networks with thousands of roads since they do not simultaneously support polylines for edges, navigable maps, GPU-accelerated rendering, interactivity, and the means for visualizing multivariate data. To fill this gap, the present paper presents Dash Sylvereye: a new Python library to produce interactive visualizations of primal street networks on top of tiled web maps. Thanks to its integration with the Dash framework, Dash Sylvereye can be used to develop web dashboards around temporal and multivariate street data by coordinating the various elements of a Dash Sylvereye visualization with other plotting and UI components provided by the Dash framework. Additionally, Dash Sylvereye provides convenient functions to easily import OpenStreetMap street topologies obtained with the OSMnx library. Moreover, Dash Sylvereye uses WebGL for GPU-accelerated rendering when redrawing the road network. We conduct experiments to assess the performance of Dash Sylvereye on a commodity computer when exploiting software acceleration in terms of frames per second, CPU time, and frame duration. We show that Dash Sylvereye can offer fast panning speeds, close to 60 FPS, and CPU times below 20 ms, for street networks with thousands of edges, and above 24 FPS, and CPU times below 40 ms, for networks with tens of thousands of edges. Additionally, we conduct a performance comparison against two state-of-the-art street visualization tools. We found Dash Sylvereye to be competitive when compared to the state-of-the-art visualization libraries Kepler.gl and city-roads. Finally, we describe a web dashboard application that exploits Dash Sylvereye for the analysis of a SUMO vehicle traffic simulation.

  10. Open data: The early but not the late neural correlate of auditory awareness...

    • researchdata.se
    • demo.researchdata.se
    • +1more
    Updated Jun 2, 2021
    Cite
    Stefan Wiens; Rasmus Eklund; Billy Gerdfeldter (2021). Open data: The early but not the late neural correlate of auditory awareness reflects lateralized experiences [Dataset]. http://doi.org/10.17045/STHLMUNI.13067018
    Explore at:
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Stockholm University
    Authors
    Stefan Wiens; Rasmus Eklund; Billy Gerdfeldter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GENERAL INFORMATION

    1. Title of Dataset: Open data: The early but not the late neural correlate of auditory awareness reflects lateralized experiences.

    2. Author Information A. Principal Investigator Contact Information Name: Stefan Wiens Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.su.se/profiles/swiens-1.184142 Email: sws@psychology.su.se

      B. Associate or Co-investigator Contact Information Name: Rasmus Eklund Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.su.se/profiles/raek2031-1.223133 Email: rasmus.eklund@psychology.su.se

      C. Associate or Co-investigator Contact Information Name: Billy Gerdfeldter Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.su.se/profiles/bige1544-1.403208 Email: billy.gerdfeldter@psychology.su.se

    3. Date of data collection: Subjects (N = 28) were tested between 2020-03-04 and 2020-09-18.

    4. Geographic location of data collection: Department of Psychology, Stockholm, Sweden

    5. Information about funding sources that supported the collection of the data: Marianne and Marcus Wallenberg (Grant 2019-0102)

    SHARING/ACCESS INFORMATION

    1. Licenses/restrictions placed on the data: CC BY 4.0

    2. Links to publications that cite or use the data: Eklund R., Gerdfeldter B., & Wiens S. (2021). The early but not the late neural correlate of auditory awareness reflects lateralized experiences. Neuropsychologia. https://doi.org/

    The study was preregistered: https://doi.org/10.17605/OSF.IO/PSRJF

    3. Links to other publicly accessible locations of the data: N/A

    4. Links/relationships to ancillary data sets: N/A

    5. Was data derived from another source? No

    6. Recommended citation for this dataset: Eklund R., Gerdfeldter B., & Wiens S. (2020). Open data: The early but not the late neural correlate of auditory awareness reflects lateralized experiences. Stockholm: Stockholm University. https://doi.org/10.17045/sthlmuni.13067018

    DATA & FILE OVERVIEW

    File List: The files contain the downsampled data in bids format, scripts, and results of main and supplementary analyses of the electroencephalography (EEG) study. Links to the hardware and software are provided under methodological information.

    AAN_LRclick_experiment_scripts.zip: contains the Python files to run the experiment

    AAN_LRclick_bids_EEG.zip: contains EEG data files for each subject in .eeg format.

    AAN_LRclick_behavior_log.zip: contains log files of the EEG session (generated by Python)

    AAN_LRclick_EEG_scripts.zip: Python-MNE scripts to process and to analyze the EEG data

    AAN_LRclick_results.zip: contains summary data files, figures, and tables that are created by Python-MNE.

    METHODOLOGICAL INFORMATION

    1. Description of methods used for collection/generation of data: The auditory stimuli were 4-ms clicks. The experiment was programmed in Python: https://www.python.org/ and used extra functions from here: https://github.com/stamnosslin/mn The EEG data were recorded with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com) and converted to .eeg format. For more information, see linked publication.

    2. Methods for processing the data: We computed event-related potentials. See linked publication

    3. Instrument- or software-specific information needed to interpret the data: MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html#

    4. Standards and calibration information, if appropriate: For information, see linked publication.

    5. Environmental/experimental conditions: For information, see linked publication.

    6. Describe any quality-assurance procedures performed on the data: For information, see linked publication.

    7. People involved with sample collection, processing, analysis and/or submission:

    • Data collection: Rasmus Eklund with assistance from Billy Gerdfeldter.
    • Data processing, analysis, and submission: Rasmus Eklund

    DATA-SPECIFIC INFORMATION: All relevant information can be found in the MNE-Python scripts (in EEG_scripts folder) that process the EEG data. For example, we added notes to explain what different variables mean.

    The folder structure needs to be as follows:

    AAN_LRclick (main folder)
        data
            bids (AAN_LRclick_bids_EEG)
            log (AAN_LRclick_behavior_log)
        MNE (AAN_LRclick_EEG_scripts)
        results (AAN_LRclick_results)

    To run the MNE-Python scripts: Anaconda was used with MNE-Python 0.22 (see installation at https://mne.tools/stable/index.html# ). For preprocess.py and analysis.py, the complete scripts should be run (from anaconda prompt).
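
    As a hypothetical starting point only (the dataset's own preprocess.py is the authoritative entry point), a single subject's BIDS-formatted EEG could be loaded with MNE-BIDS; the subject and task labels below are placeholders, and the mne_bids package is assumed to be installed:

    from mne_bids import BIDSPath, read_raw_bids

    # Placeholder subject/task labels; check the bids folder for the real ones.
    bids_path = BIDSPath(root="AAN_LRclick/data/bids", subject="01",
                         task="lrclick", datatype="eeg")
    raw = read_raw_bids(bids_path)
    print(raw.info)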

  11. Open data: Visual load effects on the auditory steady-state responses to...

    • data.europa.eu
    • demo.researchdata.se
    • +2more
    unknown
    Updated Nov 7, 2020
    Cite
    Stockholms universitet (2020). Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones [Dataset]. https://data.europa.eu/data/datasets/https-doi-org-10-17045-sthlmuni-12582002?locale=hr
    Explore at:
    unknown
    Available download formats
    Dataset updated
    Nov 7, 2020
    Dataset authored and provided by
    Stockholms universitet
    Description

    The main results file are saved separately:

    • ASSR2.html: R output of the main analyses (N = 33)
    • ASSR2_subset.html: R output of the main analyses for the smaller sample (N = 25)

    FIGSHARE METADATA

    Categories

    • Biological psychology
    • Neuroscience and physiological psychology
    • Sensory processes, perception, and performance

    Keywords

    • crossmodal attention
    • electroencephalography (EEG)
    • early-filter theory
    • task difficulty
    • envelope following response

    References

    GENERAL INFORMATION

    1. Title of Dataset: Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones

    2. Author Information A. Principal Investigator Contact Information Name: Stefan Wiens Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.su.se/profiles/swiens-1.184142 Email: sws@psychology.su.se

      B. Associate or Co-investigator Contact Information Name: Malina Szychowska Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.researchgate.net/profile/Malina_Szychowska Email: malina.szychowska@psychology.su.se

    3. Date of data collection: Subjects (N = 33) were tested between 2019-11-15 and 2020-03-12.

    4. Geographic location of data collection: Department of Psychology, Stockholm, Sweden

    5. Information about funding sources that supported the collection of the data: Swedish Research Council (Vetenskapsrådet) 2015-01181

    SHARING/ACCESS INFORMATION

    1. Licenses/restrictions placed on the data: CC BY 4.0

    2. Links to publications that cite or use the data: Szychowska M., & Wiens S. (2020). Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Submitted manuscript.

    The study was preregistered: https://doi.org/10.17605/OSF.IO/6FHR8

    3. Links to other publicly accessible locations of the data: N/A

    4. Links/relationships to ancillary data sets: N/A

    5. Was data derived from another source? No

    6. Recommended citation for this dataset: Wiens, S., & Szychowska M. (2020). Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Stockholm: Stockholm University. https://doi.org/10.17045/sthlmuni.12582002

    DATA & FILE OVERVIEW

    File List: The files contain the raw data, scripts, and results of main and supplementary analyses of an electroencephalography (EEG) study. Links to the hardware and software are provided under methodological information.

    ASSR2_experiment_scripts.zip: contains the Python files to run the experiment.

    ASSR2_rawdata.zip: contains raw datafiles for each subject

    • data_EEG: EEG data in bdf format (generated by Biosemi)
    • data_log: logfiles of the EEG session (generated by Python)

    ASSR2_EEG_scripts.zip: Python-MNE scripts to process the EEG data

    ASSR2_EEG_preprocessed_data.zip: EEG data in fif format after preprocessing with Python-MNE scripts

    ASSR2_R_scripts.zip: R scripts to analyze the data together with the main datafiles. The main files in the folder are:

    • ASSR2.html: R output of the main analyses
    • ASSR2_subset.html: R output of the main analyses but after excluding eight subjects who were recorded as pilots before preregistering the study

    ASSR2_results.zip: contains all figures and tables that are created by Python-MNE and R.

    METHODOLOGICAL INFORMATION

    1. Description of methods used for collection/generation of data: The auditory stimuli were amplitude-modulated tones with a carrier frequency (fc) of 500 Hz and modulation frequencies (fm) of 20.48 Hz, 40.96 Hz, or 81.92 Hz. The experiment was programmed in python: https://www.python.org/ and used extra functions from here: https://github.com/stamnosslin/mn

    The EEG data were recorded with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com) and saved in .bdf format. For more information, see linked publication.

    2. Methods for processing the data: We conducted frequency analyses and computed event-related potentials. See linked publication.

    3. Instrument- or software-specific information needed to interpret the data: MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html# Rstudio used with R (R Core Team, 2020): https://rstudio.com/products/rstudio/ Wiens, S. (2017). Aladins Bayes Factor in R (Version 3). https://www.doi.org/10.17045/sthlmuni.4981154.v3

    4. Standards and calibration information, if appropriate: For information, see linked publication.

    5. Environmental/experimental conditions: For information, see linked publication.

    6. Describe any quality-assurance procedures per

  12. Open data: Is Auditory Awareness Negativity Confounded by Performance?

    • data.europa.eu
    • demo.researchdata.se
    • +2more
    unknown
    Updated Jun 3, 2020
    Cite
    Stockholms universitet (2020). Open data: Is Auditory Awareness Negativity Confounded by Performance? [Dataset]. https://data.europa.eu/data/datasets/https-doi-org-10-17045-sthlmuni-9724280?locale=cs
    Explore at:
    unknown
    Available download formats
    Dataset updated
    Jun 3, 2020
    Dataset authored and provided by
    Stockholms universitet
    Description

    The main file is performance_correction.html in AAN3_analysis_scripts.zip. It contains the results of the main analyses.

    See AAN3_readme_figshare.txt:

    1. Title of Dataset: Open data: Is auditory awareness negativity confounded by performance?

    2. Author Information A. Principal Investigator Contact Information Name: Stefan Wiens Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.su.se/profiles/swiens-1.184142 Email: sws@psychology.su.se

      B. Associate or Co-investigator Contact Information Name: Rasmus Eklund Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.su.se/profiles/raek2031-1.223133 Email: rasmus.eklund@psychology.su.se

      C. Associate or Co-investigator Contact Information Name: Billy Gerdfeldter Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.su.se/profiles/bige1544-1.403208 Email: billy.gerdfeldter@psychology.su.se

    3. Date of data collection: Subjects (N = 28) were tested between 2018-12-03 and 2019-01-18.

    4. Geographic location of data collection: Department of Psychology, Stockholm, Sweden

    5. Information about funding sources that supported the collection of the data: Swedish Research Council / Vetenskapsrådet (Grant 2015-01181) Marianne and Marcus Wallenberg (Grant 2019-0102)

    SHARING/ACCESS INFORMATION

    1. Licenses/restrictions placed on the data: CC BY 4.0

    2. Links to publications that cite or use the data: Eklund R., Gerdfeldter B., & Wiens S. (2020). Is auditory awareness negativity confounded by performance? Consciousness and Cognition. https://doi.org/10.1016/j.concog.2020.102954

    The study was preregistered: https://doi.org/10.17605/OSF.IO/W4U7V

    3. Links to other publicly accessible locations of the data: N/A

    4. Links/relationships to ancillary data sets: N/A

    5. Was data derived from another source? No

    6. Recommended citation for this dataset: Eklund R., Gerdfeldter B., & Wiens S. (2020). Open data: Is auditory awareness negativity confounded by performance? Stockholm: Stockholm University. https://doi.org/10.17045/sthlmuni.9724280

    DATA & FILE OVERVIEW

    File List: The files contain the raw data, scripts, and results of main and supplementary analyses of the electroencephalography (EEG) study. Links to the hardware and software are provided under methodological information.

    AAN3_experiment_scripts.zip: contains the Python files to run the experiment

    AAN3_rawdata_EEG.zip: contains raw EEG data files for each subject in .bdf format (generated by Biosemi)

    AAN3_rawdata_log.zip: contains log files of the EEG session (generated by Python)

    AAN3_EEG_scripts.zip: Python-MNE scripts to process and to analyze the EEG data

    AAN3_EEG_source_localization_scripts.zip: Python-MNE files needed for source localization. The template MRI is provided in this zip. The files are obtained from the MNE tutorial (https://mne.tools/stable/auto_tutorials/source-modeling/plot_eeg_no_mri.html?highlight=template). Note that the stc folder is empty. The source time course files are not provided because of their large size. They can quickly be generated from the analysis script. They are needed for the source localization.

    AAN3_analysis_scripts.zip: R scripts to analyze the data. The main file is performance_correction.html. It contains the results of the main analyses.

    AAN3_results.zip: contains summary data files, figures, and tables that are created by Python-MNE and R.

    METHODOLOGICAL INFORMATION

    1. Description of methods used for collection/generation of data: The auditory stimuli were two 100-ms tones (f = 900 Hz and 1400 Hz, 5 ms fade-in and fade-out). The experiment was programmed in Python: https://www.python.org/ and used extra functions from here: https://github.com/stamnosslin/mn The EEG data were recorded with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com) and saved in .bdf format. For more information, see linked publication.

    2. Methods for processing the data: We computed event-related potentials and source localization. See linked publication

    3. Instrument- or software-specific information needed to interpret the data: MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html# Rstudio used with R (R Core Team, 2016): https://rstudio.com/products/rstudio/ Wiens, S. (2017). Aladins Bayes Factor in R (Version 3). https://www.doi.org/10.17045/sthlmuni.4981154.v3

    4. Standards and calibration information, if appropriate: For information, see linked publication.

    5. Environmental/experimental conditions: For information, see linked publication.

    6. Describe any quality-assurance procedures performed on the data: For information, see linked publication.

    7. People involved with sample collection, processing, analysis and/or submission:

    • Data collection: R