Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Copies of Anaconda 3 Jupyter Notebooks and Python scripts for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data were analyzed both holistically, using cleaned and standardized survey results, and by library type clusters. To streamline data analysis in certain locations, an offshoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. The Jupyter Notebooks/Python scripts available for this project include COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type; clustered files are available as part of the Dataverse for this project).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scientific and management challenges in the water domain require synthesis of diverse data. Many data analysis tasks are difficult because datasets are large and complex; standard data formats are not always agreed upon or mapped to efficient structures for analysis; scientists may lack training for tackling large and complex datasets; and it can be difficult to share, collaborate around, and reproduce scientific work. Overcoming barriers to accessing, organizing, and preparing datasets for analyses can transform the way water scientists work. Building on the HydroShare repository’s cyberinfrastructure, we have advanced two Python packages that make data loading, organization, and curation for analysis easier, reducing time spent in choosing appropriate data structures and writing code to ingest data. These packages enable automated retrieval of data from HydroShare and the USGS’s National Water Information System (NWIS) (i.e., a Python equivalent of USGS’ R dataRetrieval package), loading data into performant structures that integrate with existing visualization, analysis, and data science capabilities available in Python, and writing analysis results back to HydroShare for sharing and publication. While these Python packages can be installed for use within any Python environment, we will demonstrate how the technical burden for scientists associated with creating a computational environment for executing analyses can be reduced and how sharing and reproducibility of analyses can be enhanced through the use of these packages within CUAHSI’s HydroShare-linked JupyterHub server.
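A minimal sketch of the kind of automated NWIS retrieval described above, using the USGS dataretrieval Python package; the package choice, site number, and parameter code are illustrative assumptions rather than part of this resource:

```python
# Hypothetical example: pull daily streamflow from NWIS into a pandas DataFrame.
# Assumes `pip install dataretrieval`; site and parameter code are examples only.
import dataretrieval.nwis as nwis

df = nwis.get_record(
    sites="10109000",       # example USGS gage number
    service="dv",           # daily values
    parameterCd="00060",    # discharge, cubic feet per second
    start="2020-01-01",
    end="2020-12-31",
)
print(df.head())
```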
This HydroShare resource includes all of the materials presented in a workshop at the 2023 CUAHSI Biennial Colloquium.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has been updated.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
The Python codes were conceived to work with ASCII .txt files containing XYZ arrays, both as input and output, which makes the codes highly compatible and universally usable. Code A provides an example of conversion from the .s94 data format to the requested ASCII .txt format. Image analysis software generally allows exporting source files as .txt files with XYZ arrays, sometimes placing a text header before the data values to indicate the data scales. The script (code A) converts raw STM files (.s94) into XYZ-type ASCII files that can be opened by the WSxM software. The script (code B) reads the XYZ-type ASCII files and performs the flattening and equalizing filters, operating on an entire input file folder. The script (code C) was conceived with the possibility of optimizing the number of clusters. The script (code D) reads a sample of images, from the first one up to a user-selected number N, calculates the maximum extension of the Z-value distribution for every image, and returns an average extension value. The script (code E) corrects the drift affecting STM images.
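As an illustration of the XYZ-type ASCII workflow described above (not the deposited codes themselves), the following sketch reads a three-column x, y, z text file with NumPy and applies a simple line-by-line flattening filter; the headerless three-column layout and file names are assumptions:

```python
# Illustrative sketch: line-by-line flattening of an XYZ ASCII file.
# Assumes three whitespace-separated columns (x, y, z) and no header.
import numpy as np

def flatten_xyz(path, out_path):
    x, y, z = np.loadtxt(path, unpack=True)
    for row_y in np.unique(y):                  # treat each scan line separately
        mask = y == row_y
        coeffs = np.polyfit(x[mask], z[mask], 1)
        z[mask] -= np.polyval(coeffs, x[mask])  # remove per-row tilt and offset
    np.savetxt(out_path, np.column_stack((x, y, z)), fmt="%.6e")

flatten_xyz("image_raw.txt", "image_flattened.txt")  # example file names
```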
https://creativecommons.org/publicdomain/zero/1.0/
This dataset brings you the Iris Dataset in several data formats (see more details in the next sections).
You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that read all these formats:
Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.
Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris
Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
The file downloaded is iris.data and is formatted as a comma delimited file.
This small data collection was created to help you test your skills with ingesting various data formats.
This file was processed to convert the data into the following formats:
* csv - comma separated values format
* tsv - tab separated values format
* parquet - parquet format
* feather - feather format
* parquet.gzip - compressed parquet format
* h5 - hdf5 format
* pickle - Python binary object file - pickle format
* xlsx - Excel format
* npy - Numpy (Python library) binary format
* npz - Numpy (Python library) binary compressed format
* rds - Rds (R specific data format) binary format
I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.
Use these data formats to test your skills in ingesting data in various formats.
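A hedged example of how several of the formats listed above can be read with common Python libraries; the iris.<extension> file names are assumptions about how the collection is organized, and some readers need the optional dependencies noted in the comments:

```python
# Reading the same Iris data from several formats (file names are assumed).
import numpy as np
import pandas as pd

df_csv     = pd.read_csv("iris.csv")
df_tsv     = pd.read_csv("iris.tsv", sep="\t")
df_parquet = pd.read_parquet("iris.parquet")   # requires pyarrow or fastparquet
df_feather = pd.read_feather("iris.feather")   # requires pyarrow
df_hdf     = pd.read_hdf("iris.h5")            # requires PyTables; single key assumed
df_pickle  = pd.read_pickle("iris.pickle")
df_excel   = pd.read_excel("iris.xlsx")        # requires openpyxl
arr_npy    = np.load("iris.npy", allow_pickle=True)
npz_files  = np.load("iris.npz", allow_pickle=True)
```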
https://dataintelo.com/privacy-and-policy
The Python Compiler market size was valued at USD 341.2 million in 2023, and it is projected to grow at a robust compound annual growth rate (CAGR) of 10.8% to reach USD 842.6 million by 2032. The market growth is primarily driven by the increasing adoption of Python as a programming language across various industries, due to its simplicity and versatility.
The growth of the Python Compiler market can be attributed to several key factors. Firstly, the rising prominence of Python in data science, machine learning, and artificial intelligence domains is a significant driver. Python’s extensive libraries and frameworks make it an ideal choice for data processing and algorithm development, leading to increased demand for efficient Python compilers. This widespread application is spurring investments and advancements in compiler technologies to support increasingly complex computational tasks. Additionally, the open-source nature of Python encourages innovation and customization, further fueling market expansion.
Secondly, the educational sector's growing emphasis on coding and computer science education is another pivotal growth factor. Python is often chosen as the introductory programming language in educational institutions due to its readability and straightforward syntax. This trend is creating a steady demand for Python compilers that are user-friendly and suitable for educational purposes. As more schools and universities integrate Python into their curriculums, the market for Python compilers is expected to grow correspondingly, supporting a new generation of programmers and developers.
Furthermore, the increasing adoption of Python by small and medium enterprises (SMEs) is propelling the market forward. SMEs are leveraging Python for various applications, including web development, automation, and data analysis, due to its cost-effectiveness and ease of use. Python’s versatility allows businesses to streamline their operations and develop robust solutions without significant financial investment. This has led to a burgeoning demand for both on-premises and cloud-based Python compilers that can cater to the diverse needs of SMEs across different sectors.
Regionally, the Python Compiler market is witnessing notable growth in North America and the Asia Pacific. North America remains a key market due to the early adoption of advanced technologies and a strong presence of tech giants and startups alike. In contrast, the Asia Pacific region is experiencing rapid growth thanks to its expanding technological infrastructure and burgeoning IT industry. Countries like India and China are emerging as significant players due to their large pool of skilled developers and increasing investment in tech education and innovation.
In the Python Compiler market, the component segment is divided into software and services. The software segment encompasses the actual compiler tools and integrated development environments (IDEs) that developers use to write and optimize Python code. This segment is crucial as it directly impacts the efficiency and performance of Python applications. The demand for advanced compiler software is on the rise due to the need for high-performance computing in areas like machine learning, artificial intelligence, and big data analytics. Enhanced features such as real-time error detection, optimization techniques, and seamless integration with other development tools are driving the adoption of sophisticated Python compiler software.
The services segment includes support, maintenance, consulting, and training services associated with Python compilers. As organizations increasingly adopt Python for critical applications, the need for professional services to ensure optimal performance and scalability is growing. Consulting services help businesses customize and optimize their Python environments to meet specific needs, while training services are essential for upskilling employees and staying competitive in the tech-driven market. Additionally, support and maintenance services ensure that the compilers continue to operate efficiently and securely, minimizing downtime and enhancing productivity.
Within the software sub-segment, integrated development environments (IDEs) like PyCharm, Spyder, and Jupyter Notebooks are gaining traction. These IDEs not only provide robust compiling capabilities but also offer features like debugging, syntax highlighting, and version control, which streamline the development process. The increasing complexity of software develo
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Recent advances in the sensitivity and speed of mass spectrometers coupled with improved sample preparation methods have enabled the field of single cell proteomics to proliferate. While heavy development is occurring in the label free space, dramatic improvements in throughput are provided by multiplexing with tandem mass tags. Hundreds or thousands of single cells can be analyzed with this method, yielding large data sets which may contain poor data arising from loss of material during cell sorting or poor digestion, labeling, and lysis. To date, no tools have been described that can assess data quality prior to data processing. We present herein a lightweight python script and accompanying graphic user interface that can rapidly quantify reporter ion peaks within each MS/MS spectrum in a file. With simple summary reports, we can identify single cell samples that fail to pass a set quality threshold, thus reducing analysis time waste. In addition, this tool, Diagnostic Ion Data Analysis Reduction (DIDAR), will create reduced MGF files containing only spectra possessing a user-specified number of single cell reporter ions. By reducing the number of spectra that have excessive zero values, we can speed up sample processing with little loss in data completeness as these spectra are removed in later stages in data processing workflows. DIDAR and the DIDAR GUI are compatible with all modern operating systems and are available at: https://github.com/orsburn/DIDARSCPQC. All files described in this study are available at www.massive.ucsd.edu as accession MSV000088887.
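To illustrate the underlying idea (a generic sketch, not DIDAR itself), spectra can be filtered by counting how many reporter ion peaks fall within a small m/z tolerance; the reporter masses, tolerance, and file names below are assumptions:

```python
# Generic sketch: keep only MS/MS spectra with at least `min_hits` reporter ions.
import numpy as np
from pyteomics import mgf

REPORTERS = [126.1277, 127.1248, 128.1344, 129.1315, 130.1411, 131.1382]  # TMT 6-plex, approx.
TOL = 0.003  # m/z tolerance (assumed)

def has_reporters(spectrum, min_hits=3):
    mz = np.asarray(spectrum["m/z array"])
    hits = sum(np.any(np.abs(mz - r) <= TOL) for r in REPORTERS)
    return hits >= min_hits

with mgf.read("input.mgf") as reader:            # example input file
    kept = [s for s in reader if has_reporters(s)]

mgf.write(kept, output="reduced.mgf")            # reduced MGF with passing spectra
```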
PyPlant is a simple coroutine-based framework for writing data processing pipelines. Its goal is to simplify caching of intermediate results in the pipeline and to avoid re-running expensive early stages when only the later stages have changed. Given a set of Python functions that consume and produce data, it automatically runs them in a correct order and caches intermediate results. When the pipeline is executed again, only the necessary parts are re-run. Importantly, PyPlant was designed with the following considerations in mind: Simple: quick to learn, no custom language or workflow design programs; start prototyping right away. DRY: function code is metadata; no need to write execution graphs or external metadata, it just works (tm). Automatic: no need to manually re-run outdated parts. Large data: handles data that doesn't fit into memory and persists it between runs. PyPlant can be installed from PyPI: pip install pyplant. For documentation, see README.md.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Given the wide diversity in applications of biological mass spectrometry, custom data analyses are often needed to fully interpret the results of an experiment. Such bioinformatics scripts necessarily include similar basic functionality to read mass spectral data from standard file formats, process it, and visualize it. To facilitate this task and avoid reimplementing this basic functionality, spectrum_utils is a Python package for mass spectrometry data processing and visualization. Its high-level functionality enables developers to quickly prototype ideas for computational mass spectrometry projects in only a few lines of code. Notably, the data processing functionality is highly optimized for computational efficiency to be able to deal with the large volumes of data that are generated during mass spectrometry experiments. The visualization functionality makes it possible to easily produce publication-quality figures as well as interactive spectrum plots for inclusion on web pages. spectrum_utils is available for Python 3.6+, includes extensive online documentation and examples, and can be easily installed using conda. It is freely available as open source under the Apache 2.0 license at https://github.com/bittremieux/spectrum_utils.
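A minimal usage sketch, assuming the MsmsSpectrum constructor and plotting helper as described in the spectrum_utils documentation; the peak values are made up:

```python
# Build a spectrum, apply a simple processing step, and plot it.
import matplotlib.pyplot as plt
import numpy as np
import spectrum_utils.spectrum as sus
import spectrum_utils.plot as sup

spectrum = sus.MsmsSpectrum(
    "example_scan_1",                                     # identifier
    718.36,                                               # precursor m/z
    2,                                                    # precursor charge
    np.array([147.11, 204.09, 276.16, 403.23, 518.26]),   # fragment m/z
    np.array([0.3, 1.0, 0.6, 0.8, 0.4]),                  # intensities
)
spectrum = spectrum.filter_intensity(min_intensity=0.05)  # example processing step

sup.spectrum(spectrum)                                    # matplotlib-based spectrum plot
plt.savefig("example_spectrum.png", dpi=300)
```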
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599
Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599
The folder named “submission” contains the following:
* ijgis.yml: This file lists all the Python libraries and dependencies required to run the code. Use the ijgis.yml file to create a Python project and environment, and make sure you activate the environment before running the code.
* pythonProject: This folder contains several .py files and subfolders, each with specific functionality as described below.
* One script produces a .png file for each column of the raw gaze and IMU recordings, color-coded with logged events, and writes the results as .csv files.
* overlapping_sliding_window_loop.py: The function plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out; if you want to see visually what has been changed compared to the original data, you can uncomment this line. The outputs are .csv files in the results folder.
* Another part contains three main code blocks, including (iii) one for the XGBoost code with correct hyperparameter tuning. Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2, Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
* A final script produces a .csv file containing the inferred labels.
The data is licensed under CC-BY; the code is licensed under MIT.
Python is a free computer language that prioritizes human readability and general application. It is one of the easier computer languages to learn and start with, especially with no prior programming knowledge. I have been using Python for Excel spreadsheet automation, data analysis, and data visualization, and it has allowed me to better focus on learning how to automate my data analysis workload. I am currently examining the North Carolina Department of Environmental Quality (NCDEQ) database for water quality sampling for the Town of Nags Head, NC. It spans over 26 years (1997-2023) and currently lists a total of 41 different testing site locations. You can see at the bottom of image 2 below that I have 148,204 testing data points for the entirety of the NCDEQ testing for the state. From this large dataset, 34,759 data points are from Dare County (Nags Head) specifically, subdivided into testing sites.
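A hypothetical sketch of the kind of pandas filtering this workflow involves; the CSV path and the "County" and "Station" column names are assumptions, not the actual NCDEQ schema:

```python
# Subset statewide NCDEQ results to Dare County and count samples per site.
import pandas as pd

samples = pd.read_csv("ncdeq_water_quality.csv")     # statewide results (~148,204 rows)
dare = samples[samples["County"] == "Dare"]          # Dare County subset (~34,759 rows)
per_site = dare.groupby("Station").size().sort_values(ascending=False)
print(per_site.head(10))                             # most frequently sampled sites
```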
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Raw CITE-seq, ASAP-seq, and ATAC-seq FASTQ files that were processed using CountASAP, a recently developed software package. These data were used in a manuscript highlighting the functionality and benchmarking of this software. These raw data should make it possible to recreate every figure in the manuscript.
Preprint Link:
https://www.biorxiv.org/content/10.1101/2024.05.20.595042v1
Github Link:
This resource contains a video recording for a presentation given as part of the National Water Quality Monitoring Council conference in April 2021. The presentation covers the motivation for performing quality control for sensor data, the development of PyHydroQC, a Python package with functions for automating sensor quality control including anomaly detection and correction, and the performance of the algorithms applied to data from multiple sites in the Logan River Observatory.
The initial abstract for the presentation: Water quality sensors deployed to aquatic environments make measurements at high frequency and commonly include artifacts that do not represent the environmental phenomena targeted by the sensor. Sensors are subject to fouling from environmental conditions, often exhibit drift and calibration shifts, and report anomalies and erroneous readings due to issues with datalogging, transmission, and other unknown causes. The suitability of data for analyses and decision making often depends on subjective and time-consuming quality control processes consisting of manual review and adjustment of data. Data-driven and machine learning techniques have the potential to automate identification and correction of anomalous data, streamlining the quality control process. We explored documented approaches and selected several for implementation in a reusable, extensible Python package designed for anomaly detection for aquatic sensor data. Implemented techniques include regression approaches that estimate values in a time series, flag a point as anomalous if the difference between the sensor measurement and the estimated value exceeds a threshold, and offer replacement values for correcting anomalies. Additional algorithms that scaffold the central regression approaches include rules-based preprocessing, thresholds for determining anomalies that adjust with data variability, and the ability to detect and correct anomalies using forecasted and backcasted estimation. The techniques were developed and tested based on several years of data from aquatic sensors deployed at multiple sites in the Logan River Observatory in northern Utah, USA. Performance was assessed based on labels and corrections applied previously by trained technicians. In this presentation, we describe the techniques for detection and correction, report their performance, illustrate the workflow for applying them to high frequency aquatic sensor data, and demonstrate the possibility for additional approaches to help increase automation of aquatic sensor data post processing.
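As a generic illustration of the regression-based detection and correction described above (not the PyHydroQC API), a rolling estimate can flag points whose residual exceeds a variability-adjusted threshold and offer the estimate as a replacement value:

```python
# Generic sketch: rolling-median estimate, residual threshold, and correction.
import pandas as pd

def flag_and_correct(series, window=24, n_sigmas=4.0):
    estimate = series.rolling(window, center=True, min_periods=1).median()
    residual = series - estimate
    threshold = n_sigmas * residual.rolling(window, center=True, min_periods=1).std()
    anomalous = residual.abs() > threshold
    corrected = series.where(~anomalous, estimate)   # replace flagged points
    return pd.DataFrame({"observed": series, "anomalous": anomalous,
                         "corrected": corrected})
```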
The EPA GitHub repository PAU4Chem, as described in the README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flow transfers, estimating releases, and identifying potential occupational exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The Chemicals_in_categories.csv file contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study, as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and for tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly-available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset was built using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly-available sources: EPA, NOAA, OSHA, and the Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). Also, all GitHub repositories describe the Python libraries required for running their code, how to use them, the output files obtained after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).
This dataset contains:
The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.
A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.
A number of Python scripts that are used in above notebooks.
Two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.
An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv.
Datasets corresponding to the different figures.
One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020
One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).
A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project. We have computed:
travel-time arrival predictions from every source to all stations (df_stations...pkl)
misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, comparing every time waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)
addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type
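The pickled dataframes listed above can be inspected with pandas; a minimal sketch (the exact columns are not documented here):

```python
# Load one of the pickled Pandas dataframes and look at its structure.
import pandas as pd

df_misfits = pd.read_pickle("df_misfits_cc.28s.pkl")
print(df_misfits.columns.tolist())
print(df_misfits.head())
```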
How to use this dataset:
To set up the conda environment:
make sure you have Anaconda/Miniconda installed
make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on Salvus. You can do the analyses and create the figures without it, but you'll have to hack around in the scripts to build workarounds.
Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes, in that case download an older salvus version.
Additionally in your conda env, install basemap and cartopy:
conda-env create -n salvus_0_12 -f environment.yml
conda install -c conda-forge basemap
conda install -c conda-forge cartopy
Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.
To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter notebook; it suffices to run the notebook in its entirety.
Figure 1: separate notebook, Fig1_event_98.py
Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py
Figures 3-7: Figures_perturbation_study.py
Figures 8-10: Figures_toy_inversions.py
To recreate the dataframes in DATA: This can be done using the example notebook Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py . The same can easily be extended to the position shift and other perturbations you might want to investigate.
To recreate the complete Salvus project: This can be done using:
the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)
the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py
For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.
References:
Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469
Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020
Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902
Thesis Artifacts for: Analyzing Environmental Data with the Titan Platform. A detailed description can be found in the README.md. The corresponding thesis can be found here. Abstract: The natural environment is one of our biggest treasures ― if not our biggest treasure ― and it must be preserved. Therefore, understanding our natural environment is becoming increasingly important as we struggle to respond to the implications of climate change and new diseases. Climate change is the greatest challenge faced by almost every species. With the help of environmental data, we can understand most of the complex interrelationships present in the climate and consequently take preventive measures. However, with the rise of new technologies, the collection of environmental data has become very rapid and complex, and therefore it is difficult or even impossible to process the data using traditional methods. To overcome this problem, it is important to extract pieces of information from the data as soon as they are produced. We used Titan, a software platform that applies stream processing, to analyze environmental data collected across Germany. Titan comes with a set of solutions that facilitate the processing of huge amounts of data and allow software components to consume and supply data. In this work, three data sources were connected directly to the Titan platform to collect diverse types of data. Once these connections were established, we analyzed the incoming data on the Titan platform using a set of data flows that consist of modular components connected to each other. Each modular component is responsible for a single task, such as data collection, data filtering, or data transformation. Moreover, we connected an external database to the platform to save the results of our analysis. Finally, we used a visualization tool to create a dashboard for each flow implemented on the platform. We evaluated the scalability of the Titan platform based on the number of incoming records. To do so, we continuously increased the workload on the Titan platform and determined the minimum number of component instances necessary to process the data without creating high data traffic across the platform.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise due to software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution that handles image extraction, cell segmentation, and data preprocessing and incorporates machine-learning-enabled, multi-scale spatial analysis, operated through a user-friendly and interactive interface.
The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and tonsillitis sample that were acquired with the Akoya PhenocyclerFusion platform. The dataset can be used to test the workflow and establish it on a user's system or to familiarize oneself with the pipeline.
Objective: Daily COVID-19 data reported by the World Health Organization (WHO) may provide the basis for political ad hoc decisions including travel restrictions. Data reported by countries, however, is heterogeneous and metrics to evaluate its quality are scarce. In this work, we analyzed COVID-19 case counts provided by WHO and developed tools to evaluate country-specific reporting behaviors.
Methods: In this retrospective cross-sectional study, COVID-19 data reported daily to WHO from 3rd January 2020 until 14th June 2021 were analyzed. We proposed the concepts of binary reporting rate and relative reporting behavior and performed descriptive analyses for all countries with these metrics. We developed a score to evaluate the consistency of incidence and binary reporting rates. Further, we performed spectral clustering of the binary reporting rate and relative reporting behavior to identify salient patterns in these metrics.
Results: Our final analysis included 222 countries and regions…
Data collection: COVID-19 data were downloaded from WHO. Using a public repository, we added the countries' full names to the WHO data set, merging the two data sets on the two-letter abbreviations for each country. The provided COVID-19 data cover January 2020 until June 2021. We uploaded the final data set used for the analyses of this paper.
Data processing: We processed the data using a Jupyter Notebook with a Python kernel and publicly available external libraries. This upload contains the required Jupyter Notebook (reporting_behavior.ipynb) with all analyses and some additional work, a README, and the conda environment yml (env.yml).
Any text editor, including Microsoft Excel and its free alternatives, can open the uploaded CSV file. Any web browser and some code editors (such as the freely available Visual Studio Code) can display the uploaded Jupyter Notebook if the required Python environment is set up correctly.
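A hedged sketch of one reading of the binary reporting rate (the fraction of days in the study period on which a country reported new cases); the column names follow the public WHO CSV layout and are assumptions about the uploaded file:

```python
# Compute a per-country binary reporting rate from the WHO daily case counts.
import pandas as pd

who = pd.read_csv("WHO-COVID-19-global-data.csv", parse_dates=["Date_reported"])
n_days = who["Date_reported"].nunique()
binary_rate = (
    who.groupby("Country")["New_cases"]
       .apply(lambda s: (s.notna() & (s != 0)).sum() / n_days)
       .sort_values()
)
print(binary_rate.tail())   # countries reporting on the largest share of days
```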
Python has become one of the most popular programming languages, with a wide variety of use cases. In 2022, Python is most used for web development and data analysis, with ** percent and ** percent respectively.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Data processing code for the HU method, which applies the Ursell number as an index for assessing the wave dissipation capacity of coastal and estuarine forested wetlands. The data processing procedure can be done with Python or MATLAB scripts, with a readme file. Before using the scripts, users need to prepare measured data of significant wave height, water depth, and peak wave period from field measurements or lab experiments. The distance between each station should also be prepared. Instructions for setting up field measuring transects and the mechanism of the HU method can be found in the deposited file 'practice workflow.docx'. The Python codes are supported by Zhitong Jiang from the School of Ecology, Environment and Resources, Guangdong University of Technology.
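As a generic illustration of the index itself (not the deposited HU-method scripts), the Ursell number can be computed from the measured significant wave height H, water depth h, and peak period T, with the wavelength L obtained from the linear dispersion relation:

```python
# Ursell number U = H * L**2 / h**3, with L from linear wave theory.
import math

G = 9.81  # gravitational acceleration, m/s^2

def wavelength(T, h, tol=1e-6):
    L = G * T**2 / (2 * math.pi)                  # deep-water first guess
    for _ in range(100):                          # fixed-point iteration
        L_new = G * T**2 / (2 * math.pi) * math.tanh(2 * math.pi * h / L)
        if abs(L_new - L) < tol:
            break
        L = L_new
    return L

def ursell_number(H, h, T):
    return H * wavelength(T, h) ** 2 / h ** 3

print(ursell_number(H=0.5, h=2.0, T=6.0))         # example values
```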