Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please also see the latest version of the repository: |
The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersecting highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time consuming and not accessible to all. Here, we provide a “user manual” for our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV) (link), a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published study, “Systematic analysis of 200 YFP traps reveals common discordance between mRNA and protein across the nervous system” (eprint link). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source Python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Modern research projects incorporate data from several sources, and new insights are increasingly driven by the ability to interpret data in the context of other data. Glue is an interactive environment built on top of the standard Python science stack to visualize relationships within and between datasets. With Glue, users can load and visualize multiple related datasets simultaneously. Users specify the logical connections that exist between data, and Glue transparently uses this information as needed to enable visualization across files. This functionality makes it trivial, for example, to interactively overplot catalogs on top of images. The central philosophy behind Glue is that the structure of research data is highly customized and problem-specific. Glue aims to accommodate this and simplify the "data munging" process, so that researchers can more naturally explore what their data have to say. The result is a cleaner scientific workflow, faster interaction with data, and an easier avenue to insight.
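As a hedged illustration (not taken verbatim from the Glue documentation, and assuming the glue-viz Python API: glue.core.Data, DataCollection, LinkSame, and the Qt application class), a minimal sketch of linking a catalog to an image so that selections propagate between the two datasets:

```python
import numpy as np
from glue.core import Data, DataCollection
from glue.core.link_helpers import LinkSame

# Toy "image" dataset and a "catalog" of source positions.
image = Data(label="image", intensity=np.random.random((128, 128)))
catalog = Data(label="catalog",
               x=np.random.uniform(0, 128, 50),
               y=np.random.uniform(0, 128, 50),
               flux=np.random.random(50))

dc = DataCollection([image, catalog])

# Declare that the catalog coordinates correspond to the image pixel axes,
# so that selections propagate between the two datasets.
dc.add_link(LinkSame(image.pixel_component_ids[1], catalog.id["x"]))
dc.add_link(LinkSame(image.pixel_component_ids[0], catalog.id["y"]))

# Launch the interactive application (requires a desktop/Qt session;
# in newer releases this class lives in the separate glue-qt package).
from glue.app.qt import GlueApplication
GlueApplication(dc).start()
```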
This is a sample project highlighting some basic methodologies in working with the DataCite public data file and Data Citation Corpus on Redivis.
Using the transform interface, we extract all records associated with DOIs for Stanford datasets on Redivis. We then make a simple plot using a python notebook to see DOI issuance over time. The nested nature of some of the public data file fields makes exploration a bit challenging; future work could break this dataset into multiple related tables for easier analysis.
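A minimal sketch of the kind of plot described, assuming the transform output has been exported with one row per DOI and a "registered" timestamp column (the file and column names are illustrative, not the actual Redivis schema):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative export of the transform output: one row per DOI,
# with a "registered" timestamp column.
dois = pd.read_csv("stanford_redivis_dois.csv", parse_dates=["registered"])

issuance = (dois.set_index("registered")
                .resample("M")   # monthly counts of newly registered DOIs
                .size())

issuance.plot(title="Stanford-on-Redivis DOI issuance over time")
plt.ylabel("DOIs registered per month")
plt.tight_layout()
plt.show()
```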
We can also join with the Data Citation Corpus to find all citations referencing Stanford-on-Redivis DOIs (the citation corpus is a work in progress, and doesn't currently capture many of the citations in the literature).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods generally require scripting skills and are implemented using various packages and syntax, so implementing a full suite of methods is out of reach for all except experienced data scientists. Moreover, imputation is often treated as an exercise separate from exploratory data analysis, but it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is built in Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
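ImputEHR itself is a graphical tool, but a minimal scripting sketch of one of the approaches it wraps (iterative imputation with a gradient-boosted tree estimator, here via scikit-learn; the column names are illustrative) looks like this:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy EHR-like table with missing values (column names are illustrative).
ehr = pd.DataFrame({
    "age":         [34, 51, np.nan, 62, 45],
    "bmi":         [22.1, np.nan, 30.4, 27.8, np.nan],
    "systolic_bp": [118, 135, 142, np.nan, 125],
})

# Iterative ("chained") imputation with a gradient-boosted tree regressor,
# one flavour of the model-based approaches described above.
imputer = IterativeImputer(estimator=HistGradientBoostingRegressor(),
                           max_iter=10, random_state=0)
completed = pd.DataFrame(imputer.fit_transform(ehr), columns=ehr.columns)
print(completed)
```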
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The explosion in biological data generation challenges the available technologies and methodologies for data interrogation. Moreover, highly rich and complex datasets together with diverse linked data are difficult to explore when provided as flat files. Here we provide a way to systematically filter and analyse a dataset with more than 18 thousand data points using Zegami, a solution for interactive data visualisation and exploration. The primary data we use are derived from our study, “Systematic analysis of 200 YFP gene traps reveals common discordance between mRNA and protein across the nervous system”, which is submitted elsewhere. This manual provides the raw image data together with annotations and associated data, and explains, with specific examples, how to use Zegami to explore all these data types together. We also provide the open-source Python code used to annotate the figures.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
📌**Context**
The Healthcare Workforce Mental Health Dataset is designed to explore workplace mental health challenges in the healthcare industry, an environment known for high stress and burnout rates.
This dataset enables users to analyze key trends related to:
💠 Workplace Stressors: Examining the impact of heavy workloads, poor work environments, and emotional demands.
💠 Mental Health Outcomes: Understanding how stress and burnout influence job satisfaction, absenteeism, and turnover intention.
💠 Educational & Analytical Applications: A valuable resource for data analysts, students, and career changers looking to practice skills in data exploration and data visualization.
To help users gain deeper insights, this dataset is fully compatible with a Power BI Dashboard, available as part of a complete analytics bundle for enhanced visualization and reporting.
📌**Source**
This dataset was synthetically generated using the following methods:
💠 Python & Data Science Techniques: Probabilistic modeling to simulate realistic data distributions, with industry-informed variable relationships based on healthcare workforce studies (a minimal illustrative sketch of this approach follows the Source list below).
💠 Guidance & Validation Using AI (ChatGPT): Assisted in refining dataset realism and logical mappings.
💠 Industry Research & Reports: Based on insights from WHO, CDC, OSHA, and academic studies on workplace stress and mental health in healthcare settings.
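A minimal sketch of the probabilistic-generation idea referenced above, assuming hypothetical variable names and coefficients (this is not the actual generator used for the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000  # number of synthetic healthcare workers

# Variable names and coefficients are illustrative only.
weekly_hours = rng.normal(45, 8, n).clip(20, 80)
emotional_demand = rng.beta(2, 2, n)        # 0..1 scale
perceived_support = rng.beta(2.5, 2, n)     # 0..1 scale

# Burnout rises with workload and emotional demand, falls with support.
burnout = (0.03 * (weekly_hours - 40) + 0.8 * emotional_demand
           - 0.6 * perceived_support + rng.normal(0, 0.15, n)).clip(0, 1)
turnover_intention = (burnout + rng.normal(0, 0.1, n)) > 0.7

df = pd.DataFrame({
    "weekly_hours": weekly_hours.round(1),
    "emotional_demand": emotional_demand.round(2),
    "perceived_support": perceived_support.round(2),
    "burnout_score": burnout.round(2),
    "turnover_intention": turnover_intention,
})
print(df.head())
```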
📌**Inspiration**
This dataset was inspired by ongoing discussions in healthcare regarding burnout, mental health, and staff retention. The goal is to bridge the gap between raw data and actionable insights by providing a structured, analyst-friendly dataset.
For those who want a ready-to-use reporting solution, a Power BI Dashboard Template is available, designed for interactive data exploration, workforce insights, and stress factor analysis.
📌**Important Note** This dataset is synthetic and intended for educational purposes only. It is not real-world employee data and should not be used for actual decision-making or policy implementation.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The present dataset includes the SonarQube issues uncovered as part of our exploratory research targeting code complexity issues in junior developer code written in the Python or Java programming languages. The dataset also includes the actual rule configurations and thresholds used for the Python and Java languages during source code analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our dataset comprises 1000 Twitter-style tweets generated using the Python programming language. The dataset was stored in a CSV file and generated using various modules. The random module was used to generate random IDs and text, while the faker module was used to generate random user names and dates. Additionally, the textblob module was used to assign a sentiment to each tweet.
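A minimal sketch of this kind of generation pipeline, assuming illustrative column names, ID ranges, and a TextBlob-based labelling rule (the exact script used for the dataset is not reproduced here):

```python
import csv
import random
from faker import Faker
from textblob import TextBlob

fake = Faker()

def label(text: str) -> str:
    """Bucket TextBlob polarity into a coarse sentiment label (thresholds are illustrative)."""
    p = TextBlob(text).sentiment.polarity
    return "positive" if p > 0.1 else "negative" if p < -0.1 else "neutral"

with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "user", "date", "text", "sentiment"])
    for _ in range(1000):
        text = fake.sentence(nb_words=12)            # random tweet text
        writer.writerow([
            random.randint(10**17, 10**18 - 1),      # random tweet ID
            fake.user_name(),                        # random user name
            fake.date_between("-1y", "today"),       # random date
            text,
            label(text),                             # sentiment assigned via TextBlob
        ])
```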
This systematic approach ensures that the dataset is well-balanced and represents different types of tweets, user behavior, and sentiment. It is essential to have a balanced dataset to ensure that the analysis and visualization of the dataset are accurate and reliable. By generating tweets with a range of sentiments, we have created a diverse dataset that can be used to analyze and visualize sentiment trends and patterns.
In addition to generating the tweets, we have also prepared a visual representation of the dataset. This visualization provides an overview of its key features, such as the frequency distribution of the sentiment categories, the distribution of tweets over time, and the user names associated with the tweets. It will aid in the initial exploration of the dataset and enable us to identify any patterns or trends that may be present.
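A minimal plotting sketch along these lines, reusing the illustrative tweets.csv produced above:

```python
import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_csv("tweets.csv", parse_dates=["date"])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
tweets["sentiment"].value_counts().plot(kind="bar", ax=axes[0],
                                        title="Sentiment distribution")
tweets.set_index("date").resample("W").size().plot(ax=axes[1],
                                                   title="Tweets per week")
plt.tight_layout()
plt.show()
```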
https://creativecommons.org/publicdomain/zero/1.0/
Public "Titanic" dataset for data exploration, preprocessing and benchmarking basic classification/regression models.
Github: https://github.com/mwaskom/seaborn-data/blob/master/titanic.csv
Playground for visualizations, preprocessing, feature engineering, model pipelining, and more.
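A minimal sketch of this kind of workflow, using the copy of the dataset bundled with seaborn and a simple logistic-regression baseline (feature choices are illustrative):

```python
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# seaborn bundles the same titanic.csv referenced above.
titanic = sns.load_dataset("titanic")

# Minimal preprocessing: pick a few features, one-hot encode, drop missing rows.
features = pd.get_dummies(
    titanic[["pclass", "sex", "age", "fare"]], drop_first=True
).dropna()
target = titanic.loc[features.index, "survived"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))
```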
This submission includes the final project report of the Snake River Plain Play Fairway Analysis project as well as a separate appendix to the final report. The final report outlines the application of Play Fairway Analysis (PFA) to geothermal exploration, specifically within the Snake River Plain volcanic province. The goals of the report are to use PFA to lower the risk and cost of geothermal exploration and to stimulate development of geothermal power resources in Idaho. Further use of this report could include the application of PFA to geothermal exploration throughout the geothermal industry. The report utilizes ArcGIS and Python for data analysis, which were used to develop a systematic workflow that automates the analysis. The appendix includes ArcGIS maps and data compilation information for the report.
This resource contains environmental data (stream temperature) for different monitoring sites of the Logan River Observatory, stored in an SQLite database. The monitoring sites with SiteIDs 1, 2, 3, 9 and 10 of the Logan River Observatory are considered for the evaluation and visualization of monthly average stream temperature (Variable ID 1). The Python code included in this resource can access the SQLite database file, and the retrieved data can be analyzed to examine the average monthly stream temperature at the different monitoring sites of the Logan River Observatory.
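A hedged sketch of such a query, assuming an ODM-style table and column layout (DataValues, SiteID, VariableID, LocalDateTime, DataValue); the actual schema should be checked against the bundled database file:

```python
import sqlite3
import pandas as pd

# Table and column names below follow a typical ODM-style layout and are
# assumptions -- adjust them to match the schema in the SQLite file.
conn = sqlite3.connect("LoganRiverObservatory.sqlite")

query = """
    SELECT SiteID,
           strftime('%Y-%m', LocalDateTime) AS month,
           AVG(DataValue) AS avg_temp_c
    FROM DataValues
    WHERE VariableID = 1
      AND SiteID IN (1, 2, 3, 9, 10)
    GROUP BY SiteID, month
    ORDER BY SiteID, month
"""
monthly_temp = pd.read_sql_query(query, conn)
conn.close()
print(monthly_temp.head())
```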
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Iris dataset is a classic and widely used dataset in machine learning for classification tasks. It consists of measurements of different iris flowers, including sepal length, sepal width, petal length, and petal width, along with their corresponding species. With a total of 150 samples, the dataset is balanced and serves as an excellent choice for understanding and implementing classification algorithms. This notebook explores the dataset, preprocesses the data, builds a decision tree classification model, and evaluates its performance, showcasing the effectiveness of decision trees in solving classification problems.
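The notebook itself is not reproduced here, but a minimal equivalent sketch with scikit-learn looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the 150-sample Iris dataset and split it into train/test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit a small decision tree and evaluate it on held-out data.
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```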
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides electricity consumption data collected from the building management system of GreEn-ER. This building, located in Grenoble, hosts the Grenoble-INP Ense³ Engineering School and the G2ELab (Grenoble Electrical Engineering Laboratory), bringing together in one place the teaching and research actors working on new energy technologies. The electricity consumption of the building is closely monitored with more than 300 meters. The data from each meter are available in one csv file containing two columns: one with the timestamp and the other with the electricity consumption in kWh. The sampling rate for all data is 10 min. Data are available for 2017 and 2018, along with the external temperature for both years. The files are structured as follows:
- The main folder called "Data" contains 2 sub-folders, each corresponding to one year (2017 and 2018).
- Each sub-folder contains 3 further sub-folders, each corresponding to a sector of the building.
- The main folder "Data" also contains the csv files with the electricity consumption data of the whole building and a file called "Temp.csv" with the temperature data.
- The separator used in the csv files is ";".
- The sampling rate is 10 min and the unit of consumption is kWh, so each sample corresponds to the energy consumed during those 10 minutes. To retrieve the mean power over a sample period, multiply the value by 6.
- Four Jupyter Notebook files, a format that allows combining text, graphics and Python code, are also available. These files allow exploring all the data within the dataset.
- The Jupyter Notebooks contain all the metadata necessary for understanding the system, such as drawings of the system design and of the building.
- Each file is named by the number of its meter. These numbers can be retrieved in tables and drawings available in the Jupyter Notebooks.
- A few csv files describing the system design are also available: "TGBT1_n.csv", "TGBT2_n.csv" and "PREDIS-MHI_n.csv".
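A minimal sketch of reading one meter file and applying the kWh-to-mean-power conversion described above (the file path and column label are illustrative):

```python
import pandas as pd

# File path and column label are illustrative; adjust to an actual meter file.
meter = pd.read_csv("Data/2017/Sector1/12345.csv", sep=";",
                    parse_dates=[0], index_col=0)
meter.columns = ["energy_kWh"]              # one value per 10-minute interval

# Mean power over each 10-minute interval: multiply the energy by 6 (kWh -> kW).
meter["mean_power_kW"] = meter["energy_kWh"] * 6

# Daily energy consumption, for a quick first look.
print(meter["energy_kWh"].resample("D").sum().head())
```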
From a baby’s babbling to a songbird practicing a new tune, exploration is critical to motor learning. A hallmark of exploration is the emergence of random walk behaviour along solution manifolds, where successive motor actions are not independent but rather become serially dependent. Such exploratory random walk behaviour is ubiquitous across species, neural firing, gait patterns, and reaching behaviour. Past work has suggested that exploratory random walk behaviour arises from an accumulation of movement variability and a lack of error-based corrections. Here we test a fundamentally different idea: that reinforcement-based processes regulate random walk behaviour to promote continual motor exploration to maximize success. Across three human-reaching experiments, we manipulated the size of both the visually displayed target and an unseen reward zone, as well as the probability of reinforcement feedback. Our empirical and modelling results parsimoniously support the notion that explorato...
Data was collected using a Kinarm and processed using Kinarm's Matlab scripts. The output of the Matlab scripts was then processed using Python (3.8.13) and stored in custom Python objects.
# Reinforcement-Based Processes Actively Regulate Motor Exploration Along Redundant Solution Manifolds
https://doi.org/10.5061/dryad.ngf1vhj10
All files are compressed using the Python package dill. Each file contains a custom Python object that has data attributes and analysis methods. For a complete list of methods and attributes, see Exploration_Subject.py in the repository https://github.com/CashabackLab/Exploration-Along-Solution-Manifolds-Data
Files can be read into a Python script via the class method "from_pickle" inside the Exploration_Subject class.
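A hedged sketch of loading one of these files (the file name is illustrative, and the from_pickle signature is assumed; see Exploration_Subject.py in the linked repository for the actual interface):

```python
import dill

# Path is illustrative; each file contains one custom Exploration_Subject object.
with open("subject_01.pkl", "rb") as f:
    subject = dill.load(f)

# Alternatively, via the documented class method (signature assumed):
# from Exploration_Subject import Exploration_Subject
# subject = Exploration_Subject.from_pickle("subject_01.pkl")

# Quick look at the object's public attributes and methods.
print(type(subject), [a for a in dir(subject) if not a.startswith("_")][:10])
```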
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of 92 valid eye tracking sessions from 25 participants working in VS Code and answering 15 different code understanding questions (e.g., what is the output, side effects, algorithmic complexity, concurrency, etc.) on source code written in 3 programming languages: Python, C++, C#.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The objective of this dataset is to provide a comprehensive collection of data that explores the recognition of tactile textures in dynamic exploration scenarios. The dataset was generated using a tactile-enabled finger with a multi-modal tactile sensing module. By incorporating data from pressure, gravity, angular rate, and magnetic field sensors, the dataset aims to facilitate research on machine learning methods for texture classification.
The data is stored in pickle files, which can be read using the pandas library in Python. The data files are organized in a specific folder structure and contain multiple readings for each texture and exploratory velocity. The dataset contains the raw recorded tactile measurements for 12 different textures and 3 different exploratory velocities.
- Pickles_30 - Folder containing pickle files with tactile data at an exploratory velocity of 30 mm/s.
- Pickles_40 - Folder containing pickle files with tactile data at an exploratory velocity of 40 mm/s.
- Pickles_45 - Folder containing pickle files with tactile data at an exploratory velocity of 45 mm/s.
- Texture_01 to Texture_12 - Folders containing pickle files for each texture, labelled texture_01, texture_02, and so on.
- Full_baro - Folder containing pickle files with barometer data for each texture.
- Full_imu - Folder containing pickle files with IMU (Inertial Measurement Unit) data for each texture.
The "reading-pickle-file.ipynb" file is a script for reading and plotting the dataset.
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all the transactions that happened over a period of time. The retailer will use the results to grow the business: by suggesting relevant itemsets to customers, we can increase customer engagement, improve the customer experience, and identify customer behaviour. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most useful when you want to find associations between different objects in a set, for example frequent patterns in a transaction database. It can tell you which items customers frequently buy together and allows the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat and 8 bought both. For the rule "bought mouse mat => bought computer mouse":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(mouse mat) = 0.08 / 0.09 ≈ 0.89
- lift = confidence / P(computer mouse) = 0.89 / 0.10 = 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
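The walkthrough below uses R; as a hedged Python counterpart (using the mlxtend package, which is an assumption rather than the original workflow), the same support/confidence/lift metrics can be computed on toy transactions:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions (item names illustrative).
transactions = [
    ["computer mouse", "mouse mat"],
    ["computer mouse", "keyboard"],
    ["mouse mat"],
    ["computer mouse", "mouse mat", "keyboard"],
]

# One-hot encode the transactions, then mine frequent itemsets and rules.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

frequent = apriori(onehot, min_support=0.25, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```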
Number of Attributes: 7
First, we need to load the required libraries; each library is described briefly.
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
Next, we clean the data frame and remove missing values.
To apply association rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice will be in ...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is composed of GPS stations (1 file) and seismometers (1 file) multivariate time series (MTS) data associated with three types of events (normal activity / medium earthquakes / large earthquakes).
Files Format: plain text
Files Creation Date: 02/09/2019
Data Type: multivariate time series
Number of Dimensions: 3 (east-west, north-south and up-down)
Time Series Length: 60 (one data point per second)
Period: 2001-2018
Geographic Location: -62 ≤ latitude ≤ 73, -179 ≤ longitude ≤ 25
Data Collection
- Large Earthquakes: GPS station and seismometer data are obtained from the archive [1]. This archive includes 29 large earthquakes. In order to adopt a homogeneous labeling method, the dataset is limited to the data available from the American Incorporated Research Institutions for Seismology - IRIS (14 large earthquakes remaining out of 29).
  > GPS stations (14 events): High Rate Global Navigation Satellite System (HR-GNSS) displacement data (1-5 Hz). Raw observations have been processed with a precise point positioning algorithm [2] to obtain displacement time series in geodetic coordinates. Undifferenced GNSS ambiguities were fixed to integers to improve accuracy, especially over the low frequency band of tens of seconds [3]. Coordinates were then rotated to a local east-west, north-south and up-down system.
  > Seismometers (14 events): seismometer strong motion data (1-10 Hz). Channel files specify the units, sample rates, and gains of each channel.
- Normal Activity / Medium Earthquakes:
  > GPS stations (255 events: 255 normal activity): HR-GNSS normal activity displacement data (1 Hz). GPS data outside of large earthquake periods can be considered as normal activity (noise). Data is downloaded from [4], an archive maintained by the University of Oregon which stores a representative extract of GPS noise. It is an archive of real-time three-component positions for 240 stations in the western U.S. from California to Alaska, spanning from October 2018 to the present day. The raw GPS data (observations of phase and range to visible satellites) are processed with an algorithm called FastLane [5] and converted to 1 Hz sampled positions. Normal activity MTS are randomly sampled from the archive to match the number of seismometer events and to keep a ratio above 30% between the number of large earthquake MTS and normal activity MTS, in order not to encounter a class imbalance issue.
  > Seismometers (255 events: 170 normal activity, 85 medium earthquakes): seismometer strong motion data (1-10 Hz). Time series data were collected with the International Federation of Digital Seismograph Networks (FDSN) client available in the Python package ObsPy [6]. Channel information specifies the units, sample rates, and gains of each channel. The number of medium earthquakes is calculated from the ratio of medium over large earthquakes during the past 10 years in the region. A ratio above 30% is kept between the number of 60-second MTS corresponding to earthquakes (medium + large) and the total number of MTS (earthquakes + normal activity) to prevent a class imbalance issue.
The number of GPS stations and seismometers for each event varies (tens to thousands).
Preprocessing:
- Conversion (seismometers): data are available as a digital signal, which is specific to each sensor. Therefore, each instrument's digital signal is converted to its physical signal (acceleration) to obtain comparable seismometer data.
- Aggregation (GPS stations and seismometers): data aggregation by second (mean).
Variables:
- event_id: unique ID of an event. The dataset is composed of 269 events.
- event_time: timestamp of the event occurrence
- event_magnitude: magnitude of the earthquake (Richter scale)
- event_latitude: latitude of the event recorded (degrees)
- event_longitude: longitude of the event recorded (degrees)
- event_depth: distance below Earth's surface where the earthquake happened (km)
- mts_id: unique multivariate time series ID. The dataset is composed of 2,072 MTS from GPS stations and 13,265 MTS from seismometers.
- station: sensor name (GPS station or seismometer)
- station_latitude: sensor (GPS station or seismometer) latitude (degrees)
- station_longitude: sensor (GPS station or seismometer) longitude (degrees)
- timestamp: timestamp of the multivariate time series
- dimension_E: East-West component of the sensor (GPS station or seismometer) signal (cm/s/s)
- dimension_N: North-South component of the sensor (GPS station or seismometer) signal (cm/s/s)
- dimension_Z: Up-Down component of the sensor (GPS station or seismometer) signal (cm/s/s)
- label: label associated with the event. There are 3 labels: normal activity (GPS stations: 255 events, seismometers: 170 events) / medium earthquake (GPS stations: 0 events, seismometers: 85 events) / large earthquake (GPS stations: 14 events, seismometers: 14 events). Earthquake early warning (EEW) relies on the detection of the primary wave (P-wave) before the secondary (damaging) wave arrives. P-waves follow a propagation model (IASP91 [7]). Therefore, each MTS is labeled based on the P-wave arrival time at each sensor (seismometers, GPS stations) calculated with the propagation model.
[1] Ruhl, C. J., Melgar, D., Chung, A. I., Grapenthin, R. and Allen, R. M. 2019. Quantifying the value of real-time geodetic constraints for earthquake early warning using a global seismic and geodetic data set. Journal of Geophysical Research: Solid Earth 124:3819-3837.
[2] Geng, J., Bock, Y., Melgar, D., Crowell, B. W., and Haase, J. S. 2013. A new seismogeodetic approach applied to GPS and accelerometer observations of the 2012 Brawley seismic swarm: Implications for earthquake early warning. Geochemistry, Geophysics, Geosystems 14:2124-2142.
[3] Geng, J., Jiang, P., and Liu, J. 2017. Integrating GPS with GLONASS for high-rate seismogeodesy. Geophysical Research Letters 44:3139-3146.
[4] http://tunguska.uoregon.edu/rtgnss/data/cwu/mseed/
[5] Melgar, D., Melbourne, T., Crowell, B., Geng, J., Szeliga, W., Scrivner, C., Santillan, M. and Goldberg, D. 2019. Real-Time High-Rate GNSS Displacements: Performance Demonstration During the 2019 Ridgecrest, CA Earthquakes (Version 1.0) [Data set]. Zenodo.
[6] https://docs.obspy.org/packages/obspy.clients.fdsn.html
[7] Kennett, B. L. N. 1991. IASPEI 1991 Seismological Tables. Terra Nova 3:122-122.
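A hedged sketch of pulling strong-motion waveforms through the ObsPy FDSN client [6] and converting counts to acceleration (the network, station, and channel codes and the timestamp are placeholders, not values from this dataset):

```python
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

# Network/station/channel codes and the time window are illustrative.
client = Client("IRIS")
t0 = UTCDateTime("2018-01-23T09:31:40")   # placeholder origin time
stream = client.get_waveforms(network="AK", station="*", location="*",
                              channel="HN?", starttime=t0, endtime=t0 + 60)

# Convert digital counts to physical units (acceleration) using the
# instrument response, then inspect the 60-second window.
inventory = client.get_stations(network="AK", station="*", channel="HN?",
                                starttime=t0, endtime=t0 + 60, level="response")
stream.remove_response(inventory=inventory, output="ACC")
print(stream)
```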
Heat pumps are essential for decarbonizing residential heating but consume substantial electrical energy, impacting operational costs and grid demand. Many systems run inefficiently due to planning flaws, operational faults, or misconfigurations. While optimizing performance requires skilled professionals, labor shortages hinder large-scale interventions. However, digital tools and improved data availability create new service opportunities for energy efficiency, predictive maintenance, and demand-side management. To support research and practical solutions, we present an open-source dataset of electricity consumption from 1,408 households with heat pumps and smart electricity meters in the canton of Zurich, Switzerland, recorded at 15-minute and daily resolutions between 2018-11-03 and 2024-03-21. The dataset includes household metadata, weather data from 8 stations, and ground truth data from 410 field visit protocols collected by energy consultants during system optimizations. Additionally, the dataset includes a Python-based data loader to facilitate seamless data processing and exploration.
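A hedged sketch of the kind of processing the bundled data loader facilitates, assuming illustrative file and column names rather than the dataset's actual layout:

```python
import pandas as pd

# File and column names are placeholders -- consult the bundled Python data
# loader and metadata for the actual layout.
load = pd.read_csv("household_1234_15min.csv", parse_dates=["timestamp"],
                   index_col="timestamp")

# Aggregate 15-minute smart-meter readings (kWh) to daily consumption and
# join daily mean outdoor temperature from one of the weather stations.
daily_kwh = load["energy_kwh"].resample("D").sum()
weather = pd.read_csv("weather_station_1.csv", parse_dates=["timestamp"],
                      index_col="timestamp")
daily = daily_kwh.to_frame("energy_kwh").join(
    weather["temperature_c"].resample("D").mean())
print(daily.head())
```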
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this data set, we provide microstructure results from the simulation of additive manufacturing processes with the SPPARKS Monte Carlo code. The dataset will be used in our entry to the Materials Science and Engineering Data Challenge. The parameters varied during the study, and their extents are listed in the table below. All simulations were performed on a 300 x 300 x 200 rectangular lattice. All length and timescales are defined within the model and refer to no actual physical system. This release contains the input and output data files, as well as a Python script for the generation of Paraview-compatible files.