48 datasets found
  1. Multi-Dimensional Data Viewer (MDV) user manual for data exploration:...

    • zenodo.org
    pdf, zip
    Updated Jul 12, 2024
    + more versions
    Cite
Maria Kiourlappou; Martin Sergeant; Joshua S. Titlow; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Ilan Davis (2024). Multi-Dimensional Data Viewer (MDV) user manual for data exploration: "Systematic analysis of YFP traps reveals common discordance between mRNA and protein across the nervous system" [Dataset]. http://doi.org/10.5281/zenodo.7875495
    Explore at:
Available download formats: zip, pdf
    Dataset updated
    Jul 12, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Maria Kiourlappou; Martin Sergeant; Joshua S. Titlow; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Ilan Davis
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please also see the latest version of the repository:
    https://doi.org/10.5281/zenodo.6374011 and
    our website: https://ilandavis.com/jcb2023-yfp

The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources, provided as flat csv files, requires advanced informatics skills, which is time consuming and not accessible to all. Here, we provide a “user manual” to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using the Multi-Dimensional Data Viewer (MDV) (link), a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published systematic analysis, "Systematic analysis of 200 YFP traps reveals common discordance between mRNA and protein across the nervous system" (eprint link). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution, as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source Python code (GitHub link) used to annotate the figures, which could be adapted to any other kind of data annotation task.

  2. Data from: Multidimensional Data Exploration with Glue

    • figshare.com
    pdf
    Updated Jan 18, 2016
    Cite
    Openproceedings Bot (2016). Multidimensional Data Exploration with Glue [Dataset]. http://doi.org/10.6084/m9.figshare.935503.v1
    Explore at:
Available download formats: pdf
    Dataset updated
    Jan 18, 2016
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Openproceedings Bot
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern research projects incorporate data from several sources, and new insights are increasingly driven by the ability to interpret data in the context of other data. Glue is an interactive environment built on top of the standard Python science stack to visualize relationships within and between datasets. With Glue, users can load and visualize multiple related datasets simultaneously. Users specify the logical connections that exist between data, and Glue transparently uses this information as needed to enable visualization across files. This functionality makes it trivial, for example, to interactively overplot catalogs on top of images. The central philosophy behind Glue is that the structure of research data is highly customized and problem-specific. Glue aims to accommodate this and simplify the "data munging" process, so that researchers can more naturally explore what their data have to say. The result is a cleaner scientific workflow, faster interaction with data, and an easier avenue to insight.
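A minimal sketch of the linking idea described here, using the glue-core Python API; the dataset labels, component names, and values below are made up:

```python
# A sketch of declaring a link between two toy catalogs so that selections
# and visualizations defined on one carry over to the other.
from glue.core import Data, DataCollection
from glue.core.link_helpers import LinkSame

cat1 = Data(label="catalog1", ra=[10.0, 10.5, 11.0], flux=[1.2, 3.4, 2.2])
cat2 = Data(label="catalog2", ra=[10.2, 10.8], size=[5.0, 7.5])

dc = DataCollection([cat1, cat2])
# Tell glue that the two 'ra' components measure the same quantity
dc.add_link(LinkSame(cat1.id["ra"], cat2.id["ra"]))
print(dc)
```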

  3. DataCite public data exploration

    • redivis.com
    Updated Apr 29, 2025
    Cite
    Ian Mathews (2025). DataCite public data exploration [Dataset]. https://redivis.com/workflows/hx1e-a6w8vmwsx
    Explore at:
    Dataset updated
    Apr 29, 2025
    Dataset provided by
    Redivis Inc.
    Authors
    Ian Mathews
    Description

    This is a sample project highlighting some basic methodologies in working with the DataCite public data file and Data Citation Corpus on Redivis.

Using the transform interface, we extract all records associated with DOIs for Stanford datasets on Redivis. We then make a simple plot using a Python notebook to see DOI issuance over time. The nested nature of some of the public data file fields makes exploration a bit challenging; future work could break this dataset into multiple related tables for easier analysis.
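A sketch of that plotting step; the exported file and the 'registered' column below are hypothetical stand-ins for the actual DataCite public data file fields:

```python
# Hypothetical sketch: count DOIs registered per year and plot them.
import pandas as pd
import matplotlib.pyplot as plt

dois = pd.read_csv("stanford_redivis_dois.csv", parse_dates=["registered"])
dois.set_index("registered").resample("YS").size().plot(kind="bar")
plt.ylabel("DOIs issued")
plt.title("DOI issuance over time")
plt.show()
```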

    We can also join with the Data Citation Corpus to find all citations referencing Stanford-on-Redivis DOIs (the citation corpus is a work in progress, and doesn't currently capture many of the citations in the literature).

  4. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods mainly require scripting skills and are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often treated as a separate exercise from exploratory data analysis, but should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is built on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
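As an illustration of the gradient-boosted tree-based imputation the tool offers, here is a sketch using scikit-learn rather than ImputEHR's own code:

```python
# Iterative imputation with a gradient-boosted tree estimator on a toy matrix.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.2] = np.nan  # ~20% missing, as is common in EHRs

imputer = IterativeImputer(estimator=HistGradientBoostingRegressor(),
                           max_iter=10, random_state=0)
X_complete = imputer.fit_transform(X)
print(np.isnan(X_complete).sum())  # 0 -> a completed dataset
```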

  5. Zegami user manual for data exploration: "Systematic analysis of YFP gene...

    • zenodo.org
    pdf, zip
    Updated Jul 17, 2024
    Cite
Maria Kiourlappou; Stephen Taylor; Ilan Davis (2024). Zegami user manual for data exploration: "Systematic analysis of YFP gene traps reveals common discordance between mRNA and protein across the nervous system" [Dataset]. http://doi.org/10.5281/zenodo.6374012
    Explore at:
Available download formats: pdf, zip
    Dataset updated
    Jul 17, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Maria Kiourlappou; Stephen Taylor; Ilan Davis
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The explosion in biological data generation challenges the available technologies and methodologies for data interrogation. Moreover, highly rich and complex datasets, together with diverse linked data, are difficult to explore when provided in flat files. Here we provide a way to filter and analyse in a systematic way a dataset with more than 18 thousand data points using Zegami, a solution for interactive data visualisation and exploration. The primary data we use are derived from a systematic analysis of 200 YFP gene traps that reveals common discordance between mRNA and protein across the nervous system, which is submitted elsewhere. This manual provides the raw image data together with annotations and associated data, and explains how to use Zegami for exploring all these data types together by providing specific examples. We also provide the open-source Python code used to annotate the figures.

  6. Healthcare Workforce Mental Health Dataset

    • kaggle.com
    Updated Feb 16, 2025
    Cite
    Rivalytics (2025). Healthcare Workforce Mental Health Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/10768196
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2025
    Dataset provided by
    Kaggle
    Authors
    Rivalytics
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    📌**Context**

    The Healthcare Workforce Mental Health Dataset is designed to explore workplace mental health challenges in the healthcare industry, an environment known for high stress and burnout rates.

    This dataset enables users to analyze key trends related to:

    💠 Workplace Stressors: Examining the impact of heavy workloads, poor work environments, and emotional demands.

    💠 Mental Health Outcomes: Understanding how stress and burnout influence job satisfaction, absenteeism, and turnover intention.

    💠 Educational & Analytical Applications: A valuable resource for data analysts, students, and career changers looking to practice skills in data exploration and data visualization.

    To help users gain deeper insights, this dataset is fully compatible with a Power BI Dashboard, available as part of a complete analytics bundle for enhanced visualization and reporting.

    📌**Source**

    This dataset was synthetically generated using the following methods:

💠 Python & Data Science Techniques: Probabilistic modeling to simulate realistic data distributions, with industry-informed variable relationships based on healthcare workforce studies (see the sketch after this list).

    💠 Guidance & Validation Using AI (ChatGPT): Assisted in refining dataset realism and logical mappings.

    💠 Industry Research & Reports: Based on insights from WHO, CDC, OSHA, and academic studies on workplace stress and mental health in healthcare settings.
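As an illustration of that probabilistic approach, here is a minimal sketch of how such a dataset could be generated; every variable name, category, and probability below is invented, not taken from this dataset:

```python
# Hypothetical sketch of probabilistic synthetic-data generation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
roles = rng.choice(["Nurse", "Physician", "Technician"], size=n,
                   p=[0.5, 0.2, 0.3])
weekly_hours = rng.normal(loc=45, scale=8, size=n).clip(20, 80)
# Industry-informed relationship: burnout risk rises with workload
burnout_prob = 1 / (1 + np.exp(-(weekly_hours - 50) / 5))
burnout = rng.random(n) < burnout_prob

df = pd.DataFrame({"role": roles,
                   "weekly_hours": weekly_hours.round(1),
                   "burnout": burnout})
print(df.head())
```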

    📌**Inspiration**

    This dataset was inspired by ongoing discussions in healthcare regarding burnout, mental health, and staff retention. The goal is to bridge the gap between raw data and actionable insights by providing a structured, analyst-friendly dataset.

    For those who want a ready-to-use reporting solution, a Power BI Dashboard Template is available, designed for interactive data exploration, workforce insights, and stress factor analysis.

    📌**Important Note** This dataset is synthetic and intended for educational purposes only. It is not real-world employee data and should not be used for actual decision-making or policy implementation.

  7. Open Data Package for Article "Exploring Complexity Issues in Junior...

    • figshare.com
    xlsx
    Updated Jul 9, 2024
    Cite
    Arthur-Jozsef Molnar (2024). Open Data Package for Article "Exploring Complexity Issues in Junior Developer Code using Static Analysis and FCA" [Dataset]. http://doi.org/10.6084/m9.figshare.25729587.v1
    Explore at:
Available download formats: xlsx
    Dataset updated
    Jul 9, 2024
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Arthur-Jozsef Molnar
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The present dataset includes the SonarQube issues uncovered as part of our exploratory research targeting code complexity issues in junior developer code written in the Python or Java programming languages. The dataset also includes the actual rule configurations and thresholds used for the Python and Java languages during source code analysis.

  8. Dataset for twitter Sentiment Analysis using Roberta and Vader

    • data.mendeley.com
    Updated May 14, 2023
    Cite
Jannatul Ferdoshi (2023). Dataset for twitter Sentiment Analysis using Roberta and Vader [Dataset]. http://doi.org/10.17632/2sjt22sb55.1
    Explore at:
    Dataset updated
    May 14, 2023
    Authors
Jannatul Ferdoshi
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Our dataset comprises 1000 synthetic tweets, generated with the Python programming language and stored in a CSV file. The random module was used to generate random IDs and text, while the faker module was used to generate random user names and dates. Additionally, the textblob module was used to assign a random sentiment to each tweet.
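A sketch of that generation process; the field names and the use of fake.sentence() for tweet text are assumptions, not the authors' exact script:

```python
# Sketch: write 1000 synthetic tweets to a CSV file.
import csv
import random
from faker import Faker

fake = Faker()
random.seed(0)

with open("tweets.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "user", "date", "text", "sentiment"])
    for _ in range(1000):
        writer.writerow([
            random.randint(10**9, 10**10 - 1),      # random tweet ID
            fake.user_name(),                        # random user name
            fake.date_between("-1y", "today").isoformat(),
            fake.sentence(nb_words=12),              # random text
            random.choice(["positive", "negative", "neutral"]),
        ])
```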

    This systematic approach ensures that the dataset is well-balanced and represents different types of tweets, user behavior, and sentiment. It is essential to have a balanced dataset to ensure that the analysis and visualization of the dataset are accurate and reliable. By generating tweets with a range of sentiments, we have created a diverse dataset that can be used to analyze and visualize sentiment trends and patterns.

    In addition to generating the tweets, we have also prepared a visual representation of the data sets. This visualization provides an overview of the key features of the dataset, such as the frequency distribution of the different sentiment categories, the distribution of tweets over time, and the user names associated with the tweets. This visualization will aid in the initial exploration of the dataset and enable us to identify any patterns or trends that may be present.

  9. Titanic data for Data Preprocessing

    • kaggle.com
    Updated Oct 28, 2021
    Cite
    Akshay Sehgal (2021). Titanic data for Data Preprocessing [Dataset]. https://www.kaggle.com/akshaysehgal/titanic-data-for-data-preprocessing/code
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 28, 2021
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Akshay Sehgal
    License

CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Public "Titanic" dataset for data exploration, preprocessing and benchmarking basic classification/regression models.

    Columns

    • 'survived'
    • 'pclass'
    • 'sex'
    • 'age'
    • 'sibsp'
    • 'parch'
    • 'fare'
    • 'embarked'
    • 'class'
    • 'who'
    • 'adult_male'
    • 'deck'
    • 'embark_town'
    • 'alive'
    • 'alone'

    Acknowledgements

    Github: https://github.com/mwaskom/seaborn-data/blob/master/titanic.csv

    Inspiration

Playground for visualizations, preprocessing, feature engineering, model pipelining, and more.
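The column list above matches seaborn's bundled copy of this dataset (the GitHub source cited above), so one quick way to start exploring, assuming the seaborn package is installed:

```python
import seaborn as sns

titanic = sns.load_dataset("titanic")  # fetches the CSV from the repo above
print(titanic.shape)                              # (891, 15)
print(titanic.groupby("sex")["survived"].mean())  # survival rate by sex
```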

  10. Play Fairway Analysis of the Snake River Plain, Idaho: Final Report

    • catalog.data.gov
    • data.openei.org
    • +2more
    Updated Jan 20, 2025
    Cite
    Utah State University (2025). Play Fairway Analysis of the Snake River Plain, Idaho: Final Report [Dataset]. https://catalog.data.gov/dataset/play-fairway-analysis-of-the-snake-river-plain-idaho-final-report-3cd0c
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Utah State University
    Area covered
    Snake River, Idaho, Snake River Plain
    Description

This submission includes the final project report of the Snake River Plain Play Fairway Analysis project, as well as a separate appendix for the final report. The final report outlines the application of Play Fairway Analysis (PFA) to geothermal exploration, specifically within the Snake River Plain volcanic province. The goals of the report are to use PFA to lower the risk and cost of geothermal exploration and to stimulate development of geothermal power resources in Idaho. Further use of this report could include the application of PFA for geothermal exploration throughout the geothermal industry. The report utilizes ArcGIS and Python for data analysis, which were used to develop a systematic workflow that automates the analysis. The appendix includes ArcGIS maps and data compilation information regarding the report.

  11. Visualization of Stream Temperature of Logan River by exploring Water...

    • search.dataone.org
    • hydroshare.org
    • +1more
    Updated Dec 30, 2023
    Cite
    Bhuwan Ghimire (2023). Visualization of Stream Temperature of Logan River by exploring Water Temperature Dataset Using Python [Dataset]. https://search.dataone.org/view/sha256%3Acc28c28eb10ae34bd6301bca67a8cf2244b8c65a6609128c6f026719c92027ba
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Bhuwan Ghimire
    Time period covered
    Jan 1, 2014 - Dec 31, 2021
    Description

This resource contains environmental data (stream temperature) for different monitoring sites of the Logan River Observatory, stored in an SQLite database. The monitoring sites with SiteIDs 1, 2, 3, 9 and 10 of the Logan River Observatory are considered for the evaluation and visualization of monthly average stream temperature, whose Variable ID is 1. The Python code included in this resource can access the SQLite database file, and the retrieved data can then be analyzed to examine the average monthly stream temperature at the different monitoring sites of the Logan River Observatory.
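A sketch of the kind of query such code could run; the SiteIDs (1, 2, 3, 9, 10) and VariableID (1) come from the description, while the file, table, and column names are assumptions (the resource's actual schema may differ):

```python
# Sketch: monthly average stream temperature per monitoring site.
import sqlite3
import pandas as pd

conn = sqlite3.connect("LoganRiver.sqlite")  # placeholder file name
query = """
    SELECT SiteID,
           strftime('%m', LocalDateTime) AS month,
           AVG(DataValue) AS avg_temp
    FROM DataValues
    WHERE VariableID = 1
      AND SiteID IN (1, 2, 3, 9, 10)
    GROUP BY SiteID, month
    ORDER BY SiteID, month
"""
monthly = pd.read_sql(query, conn)
print(monthly.head())
```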

  12. Classification Analysis Using Python

    • kaggle.com
    Updated Jul 3, 2023
    Cite
    Nibedita Sahu (2023). Classification Analysis Using Python [Dataset]. https://www.kaggle.com/datasets/nibeditasahu/classification-analysis-using-python
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 3, 2023
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Nibedita Sahu
    License

Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Iris dataset is a classic and widely used dataset in machine learning for classification tasks. It consists of measurements of different iris flowers, including sepal length, sepal width, petal length, and petal width, along with their corresponding species. With a total of 150 samples, the dataset is balanced and serves as an excellent choice for understanding and implementing classification algorithms. This notebook explores the dataset, preprocesses the data, builds a decision tree classification model, and evaluates its performance, showcasing the effectiveness of decision trees in solving classification problems.
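A minimal sketch of that workflow, using scikit-learn's bundled copy of the Iris dataset (this is not the notebook's own code):

```python
# Train and evaluate a decision tree classifier on Iris.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```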

  13. Data from: GreEn-ER - Electricity Consumption Data of a Tertiary Building

    • search.datacite.org
    • data.mendeley.com
    Updated Sep 20, 2020
    Cite
    Gustavo Martin Nascimento (2020). GreEn-ER - Electricity Consumption Data of a Tertiary Building [Dataset]. http://doi.org/10.17632/h8mmnthn5w
    Explore at:
    Dataset updated
    Sep 20, 2020
    Dataset provided by
DataCite (https://www.datacite.org/)
    Mendeley
    Authors
    Gustavo Martin Nascimento
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset provides electricity consumption data collected from the building management system of GreEn-ER. This building, located in Grenoble, hosts the Grenoble-INP Ense³ engineering school and the G2ELab (Grenoble Electrical Engineering Laboratory). It brings together in one place the teaching and research actors around new energy technologies. The electricity consumption of the building is highly monitored, with more than 300 meters. The data from each meter is available in one csv file, which contains two columns: one contains the timestamp and the other the electricity consumption in kWh. The sampling rate for all data is 10 min, and data are available for 2017 and 2018. The dataset also contains data on the external temperature for 2017 and 2018. The files are structured as follows:

    • The main folder, called "Data", contains 2 sub-folders, each corresponding to one year (2017 and 2018).
    • Each sub-folder contains 3 other sub-folders, each corresponding to a sector of the building.
    • The main folder "Data" also contains the csv files with the electricity consumption data of the whole building, and a file called "Temp.csv" with the temperature data.
    • The separator used in the csv files is ";".
    • The sampling rate is 10 min and the unit of consumption is kWh. Each sample corresponds to the energy consumed in those 10 minutes, so to retrieve the mean power over the period corresponding to each sample, the value must be multiplied by 6.
    • Four Jupyter Notebook files (a format that allows combining text, graphics and code in Python) are also available. These files allow exploring all the data within the dataset.
    • These Jupyter Notebook files contain all the metadata necessary for understanding the system, like drawings of the system design, of the building, etc.
    • Each file is named by the number of its meter. These numbers can be retrieved in tables and drawings available in the Jupyter Notebooks.
    • A couple of csv files with the system design are also available. They are called "TGBT1_n.csv", "TGBT2_n.csv" and "PREDIS-MHI_n.csv".
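A sketch of loading one meter's file as described above; the ";" separator and the x6 kWh-to-kW conversion come from the description, while the path is a placeholder following the stated folder layout:

```python
import pandas as pd

df = pd.read_csv("Data/2017/Sector1/1234.csv", sep=";",
                 parse_dates=[0], index_col=0)
df.columns = ["energy_kWh"]                  # energy used per 10-minute sample
df["mean_power_kW"] = df["energy_kWh"] * 6   # mean power over those 10 minutes
print(df.head())
```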

  14. Data from: Reinforcement-based processes actively regulate motor exploration...

    • dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Mar 14, 2024
    Cite
    Adam Roth; Jan Calalo; Rakshith Lokesh; Seth Sullivan; Stephen Grill; John Jeka; Katinka van der Kooij; Michael Carter; Joshua Cashaback (2024). Reinforcement-based processes actively regulate motor exploration along redundant solution manifolds [Dataset]. http://doi.org/10.5061/dryad.ngf1vhj10
    Explore at:
    Dataset updated
    Mar 14, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Adam Roth; Jan Calalo; Rakshith Lokesh; Seth Sullivan; Stephen Grill; John Jeka; Katinka van der Kooij; Michael Carter; Joshua Cashaback
    Time period covered
    Sep 12, 2023
    Description

From a baby’s babbling to a songbird practicing a new tune, exploration is critical to motor learning. A hallmark of exploration is the emergence of random walk behaviour along solution manifolds, where successive motor actions are not independent but rather become serially dependent. Such exploratory random walk behaviour is ubiquitous across species, neural firing, gait patterns, and reaching behaviour. Past work has suggested that exploratory random walk behaviour arises from an accumulation of movement variability and a lack of error-based corrections. Here we test a fundamentally different idea—that reinforcement-based processes regulate random walk behaviour to promote continual motor exploration to maximize success. Across three human-reaching experiments, we manipulated the size of both the visually displayed target and an unseen reward zone, as well as the probability of reinforcement feedback. Our empirical and modelling results parsimoniously support the notion that explorato...

Data was collected using a Kinarm and processed using Kinarm's Matlab scripts. The output of the Matlab scripts was then processed using Python (3.8.13) and stored in custom Python objects.

Reinforcement-Based Processes Actively Regulate Motor Exploration Along Redundant Solution Manifolds

    https://doi.org/10.5061/dryad.ngf1vhj10

    All files are compressed using the Python package dill. Each file contains a custom Python object that has data attributes and analysis methods. For a complete list of methods and attributes, see Exploration_Subject.py in the repository https://github.com/CashabackLab/Exploration-Along-Solution-Manifolds-Data

    Files can be read into a Python script via the class method "from_pickle" inside the Exploration_Subject class.
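A sketch of loading one of these files directly with dill; the file name is a placeholder, and the documented path is the from_pickle class method, so direct dill usage is an assumption about the file layout:

```python
import dill

# Placeholder file name; the supported loading path per the README is the
# from_pickle class method of Exploration_Subject.
with open("subject_01.pkl", "rb") as f:
    subject = dill.load(f)
print(type(subject))  # custom object exposing data attributes and methods
```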

  15. Follow-up Attention: An Empirical Study of Developer and Neural Model Code...

    • figshare.com
    zip
    Updated Aug 15, 2024
    Cite
    Matteo Paltenghi (2024). Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration - Raw Eye Tracking Data [Dataset]. http://doi.org/10.6084/m9.figshare.23599251.v1
    Explore at:
Available download formats: zip
    Dataset updated
    Aug 15, 2024
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Matteo Paltenghi
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Dataset of 92 valid eye tracking sessions from 25 participants working in VS Code and answering 15 different code understanding questions (e.g., what is the output, side effects, algorithmic complexity, concurrency, etc.) on source code written in 3 programming languages: Python, C++, C#.

  16. Multimodal Tactile Texture Dataset

    • data.mendeley.com
    Updated Aug 15, 2023
    Cite
    Bruno Monteiro Rocha Lima (2023). Multimodal Tactile Texture Dataset [Dataset]. http://doi.org/10.17632/n666tk4mw9.1
    Explore at:
    Dataset updated
    Aug 15, 2023
    Authors
    Bruno Monteiro Rocha Lima
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The objective of this dataset is to provide a comprehensive collection of data that explores the recognition of tactile textures in dynamic exploration scenarios. The dataset was generated using a tactile-enabled finger with a multi-modal tactile sensing module. By incorporating data from pressure, gravity, angular rate, and magnetic field sensors, the dataset aims to facilitate research on machine learning methods for texture classification.

The data is stored in pickle files, which can be read using the pandas library in Python. The data files are organized in a specific folder structure and contain multiple readings for each texture and exploratory velocity. The dataset contains raw data of the recorded tactile measurements for 12 different textures and 3 different exploratory velocities, stored in pickle files.

    • Pickles_30 - Folder containing pickle files with tactile data at an exploratory velocity of 30 mm/s.
    • Pickles_40 - Folder containing pickle files with tactile data at an exploratory velocity of 40 mm/s.
    • Pickles_45 - Folder containing pickle files with tactile data at an exploratory velocity of 45 mm/s.
    • Texture_01 to Texture_12 - Folders containing pickle files for each texture, labelled as texture_01, texture_02, and so on.
    • Full_baro - Folder containing pickle files with barometer data for each texture.
    • Full_imu - Folder containing pickle files with IMU (Inertial Measurement Unit) data for each texture.

    The "reading-pickle-file.ipynb" file is a script for reading and plotting the dataset.

  17. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2021
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and to make itemset suggestions to customers, so that it can increase customer engagement, improve the customer experience, and identify customer behaviour. I solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

Association rule mining is most often used when you are planning to find associations between different objects in a set. It works well when you are looking for frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between the items.

    An Example of Association Rules

Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both.

    • Rule: bought computer mouse => bought mouse mat
    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(computer mouse) = 0.08/0.10 = 0.8
    • lift = confidence / P(mouse mat) = 0.8/0.09 ≈ 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
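These numbers are easy to verify; a quick sanity check in Python (the analysis in this entry itself uses R):

```python
# Support, confidence, and lift for the toy example above.
n_customers = 100
n_mouse, n_mat, n_both = 10, 9, 8

support = n_both / n_customers                   # P(mouse & mat) = 0.08
confidence = support / (n_mouse / n_customers)   # 0.08 / 0.10 = 0.8
lift = confidence / (n_mat / n_customers)        # 0.8 / 0.09 ≈ 8.9
print(support, confidence, round(lift, 1))
```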

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
• File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load required libraries. Shortly I describe all libraries.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
• arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.


    Data Pre-processing

Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.


Next, we will clean our data frame by removing missing values.


    To apply Association Rule mining, we need to convert dataframe into transaction data to make all items that are bought together in one invoice will be in ...

  18. Earthquake Early Warning Dataset

    • figshare.com
    txt
    Updated Nov 20, 2019
    Cite
    Kevin Fauvel; Daniel Balouek-Thomert; Diego Melgar; Pedro Silva; Anthony Simonet; Gabriel Antoniu; Alexandru Costan; Véronique Masson; Manish Parashar; Ivan Rodero; Alexandre Termier (2019). Earthquake Early Warning Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.9758555.v3
    Explore at:
Available download formats: txt
    Dataset updated
    Nov 20, 2019
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Kevin Fauvel; Daniel Balouek-Thomert; Diego Melgar; Pedro Silva; Anthony Simonet; Gabriel Antoniu; Alexandru Costan; Véronique Masson; Manish Parashar; Ivan Rodero; Alexandre Termier
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset is composed of GPS station (1 file) and seismometer (1 file) multivariate time series (MTS) data associated with three types of events (normal activity / medium earthquakes / large earthquakes).

    • Files format: plain text
    • Files creation date: 02/09/2019
    • Data type: multivariate time series
    • Number of dimensions: 3 (east-west, north-south and up-down)
    • Time series length: 60 (one data point per second)
    • Period: 2001-2018
    • Geographic location: -62 ≤ latitude ≤ 73, -179 ≤ longitude ≤ 25

    Data collection

    • Large earthquakes: GPS station and seismometer data are obtained from the archive [1], which includes 29 large earthquakes. In order to adopt a homogeneous labeling method, the dataset is limited to the data available from the American Incorporated Research Institutions for Seismology (IRIS), leaving 14 of the 29 large earthquakes.
      • GPS stations (14 events): High Rate Global Navigation Satellite System (HR-GNSS) displacement data (1-5 Hz). Raw observations have been processed with a precise point positioning algorithm [2] to obtain displacement time series in geodetic coordinates. Undifferenced GNSS ambiguities were fixed to integers to improve accuracy, especially over the low-frequency band of tens of seconds [3]. Coordinates have then been rotated to a local east-west, north-south and up-down system.
      • Seismometers (14 events): seismometer strong-motion data (1-10 Hz). Channel files specify the units, sample rates, and gains of each channel.
    • Normal activity / medium earthquakes:
      • GPS stations (255 events: 255 normal activity): HR-GNSS normal-activity displacement data (1 Hz). GPS data outside of large earthquake periods can be considered normal activity (noise). Data is downloaded from [4], an archive maintained by the University of Oregon which stores a representative extract of GPS noise: real-time three-component positions for 240 stations in the western U.S., from California to Alaska, spanning from October 2018 to the present day. The raw GPS data (observations of phase and range to visible satellites) are processed with an algorithm called FastLane [5] and converted to 1 Hz sampled positions. Normal-activity MTS are randomly sampled from the archive to match the number of seismometer events and to keep a ratio above 30% between the number of large-earthquake MTS and normal activity, so as not to encounter a class imbalance issue.
      • Seismometers (255 events: 170 normal activity, 85 medium earthquakes): seismometer strong-motion data (1-10 Hz). Time series data collected with the International Federation of Digital Seismograph Networks (FDSN) client available in the Python package ObsPy [6]. Channel information specifies the units, sample rates, and gains of each channel. The number of medium earthquakes is calculated from the ratio of medium to large earthquakes during the past 10 years in the region. A ratio above 30% is kept between the number of 60-second MTS corresponding to earthquakes (medium + large) and the total number of MTS (earthquakes + normal activity) to prevent a class imbalance issue.

    The number of GPS stations and seismometers for each event varies (tens to thousands).

    Preprocessing

    • Conversion (seismometers): data are available as a digital signal, which is specific to each sensor. Each instrument's digital signal is therefore converted to its physical signal (acceleration) to obtain comparable seismometer data.
    • Aggregation (GPS stations and seismometers): data aggregation by second (mean).

    Variables

    • event_id: unique ID of an event. The dataset is composed of 269 events.
    • event_time: timestamp of the event occurrence
    • event_magnitude: magnitude of the earthquake (Richter scale)
    • event_latitude: latitude of the event recorded (degrees)
    • event_longitude: longitude of the event recorded (degrees)
    • event_depth: distance below Earth's surface where the earthquake happened (km)
    • mts_id: unique multivariate time series ID. The dataset is composed of 2,072 MTS from GPS stations and 13,265 MTS from seismometers.
    • station: sensor name (GPS station or seismometer)
    • station_latitude: sensor latitude (degrees)
    • station_longitude: sensor longitude (degrees)
    • timestamp: timestamp of the multivariate time series
    • dimension_E: east-west component of the sensor signal (cm/s/s)
    • dimension_N: north-south component of the sensor signal (cm/s/s)
    • dimension_Z: up-down component of the sensor signal (cm/s/s)
    • label: label associated with the event. There are 3 labels: normal activity (GPS stations: 255 events, seismometers: 170 events) / medium earthquake (GPS stations: 0 events, seismometers: 85 events) / large earthquake (GPS stations: 14 events, seismometers: 14 events). Earthquake early warning (EEW) relies on detecting the primary wave (P-wave) before the secondary (damaging) wave arrives. P-waves follow a propagation model (IASP91 [7]). Each MTS is therefore labeled based on the P-wave arrival time at each sensor (seismometer or GPS station), calculated with the propagation model.

    References

    [1] Ruhl, C. J., Melgar, D., Chung, A. I., Grapenthin, R. and Allen, R. M. 2019. Quantifying the value of real-time geodetic constraints for earthquake early warning using a global seismic and geodetic data set. Journal of Geophysical Research: Solid Earth 124:3819-3837.
    [2] Geng, J., Bock, Y., Melgar, D., Crowell, B. W., and Haase, J. S. 2013. A new seismogeodetic approach applied to GPS and accelerometer observations of the 2012 Brawley seismic swarm: Implications for earthquake early warning. Geochemistry, Geophysics, Geosystems 14:2124-2142.
    [3] Geng, J., Jiang, P., and Liu, J. 2017. Integrating GPS with GLONASS for high-rate seismogeodesy. Geophysical Research Letters 44:3139-3146.
    [4] http://tunguska.uoregon.edu/rtgnss/data/cwu/mseed/
    [5] Melgar, D., Melbourne, T., Crowell, B., Geng, J., Szeliga, W., Scrivner, C., Santillan, M. and Goldberg, D. 2019. Real-Time High-Rate GNSS Displacements: Performance Demonstration During the 2019 Ridgecrest, CA Earthquakes (Version 1.0) [Data set]. Zenodo.
    [6] https://docs.obspy.org/packages/obspy.clients.fdsn.html
    [7] Kennett, B. L. N. 1991. IASPEI 1991 Seismological Tables. Terra Nova 3:122.
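For the seismometer side, collection goes through ObsPy's FDSN client; a sketch of that kind of request, where the network/station/channel codes and the time window are placeholders rather than values from this dataset:

```python
# Fetch 60 s of strong-motion data, convert counts to physical units, and
# aggregate to one value per second (mean), mirroring the preprocessing.
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

client = Client("IRIS")
t0 = UTCDateTime("2019-07-06T03:19:53")  # placeholder start time
st = client.get_waveforms(network="CI", station="PASC", location="*",
                          channel="HN?", starttime=t0, endtime=t0 + 60,
                          attach_response=True)
st.remove_sensitivity()  # digital counts -> physical signal (acceleration)

for tr in st:
    sps = int(tr.stats.sampling_rate)
    n = (len(tr.data) // sps) * sps
    per_second = tr.data[:n].reshape(-1, sps).mean(axis=1)
    print(tr.id, per_second[:5])
```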

  19. HEAPO Dataset

    • paperswithcode.com
    Updated Mar 20, 2025
    Cite
    (2025). HEAPO Dataset [Dataset]. https://paperswithcode.com/dataset/heapo
    Explore at:
    Dataset updated
    Mar 20, 2025
    Description

    Heat pumps are essential for decarbonizing residential heating but consume substantial electrical energy, impacting operational costs and grid demand. Many systems run inefficiently due to planning flaws, operational faults, or misconfigurations. While optimizing performance requires skilled professionals, labor shortages hinder large-scale interventions. However, digital tools and improved data availability create new service opportunities for energy efficiency, predictive maintenance, and demand-side management. To support research and practical solutions, we present an open-source dataset of electricity consumption from 1,408 households with heat pumps and smart electricity meters in the canton of Zurich, Switzerland, recorded at 15-minute and daily resolutions between 2018-11-03 and 2024-03-21. The dataset includes household metadata, weather data from 8 stations, and ground truth data from 410 field visit protocols collected by energy consultants during system optimizations. Additionally, the dataset includes a Python-based data loader to facilitate seamless data processing and exploration.

  20. Exploration of Process-Structure Linkages in Simulated Additive...

    • dataverse.harvard.edu
    Updated Sep 8, 2015
    Cite
    Theron Rodgers (2015). Exploration of Process-Structure Linkages in Simulated Additive Manufacturing Microstructures [Dataset]. http://doi.org/10.7910/DVN/KJMK9Z
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    Theron Rodgers
    License

CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In this data set, we provide microstructure results from the simulation of additive manufacturing processes with the SPPARKS Monte Carlo code. The dataset will be used in our entry to the Materials Science and Engineering Data Challenge. The parameters varied during the study, and their extents are listed in the table below. All simulations were performed on a 300 x 300 x 200 rectangular lattice. All length and timescales are defined within the model and refer to no actual physical system. This release contains the input and output data files, as well as a Python script for the generation of Paraview-compatible files.
