1000 simulated data sets stored in a list of R dataframes used in support of Reisetter et al. (submitted) 'Mixture model normalization for non-targeted gas chromatography / mass spectrometry metabolomics data'. These are results after normalization using median scaling as described in Reisetter et al.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite the popularity of k-means clustering, feature scaling before applying it is an essential yet often neglected step. In this study, scaling features beforehand with one of five methods (Z-score standardization, Min-Max normalization, Percentile transformation, Maximum absolute scaling, or RobustScaler) was compared with using the raw (i.e., non-scaled) data when applying k-means clustering to datasets whose features have different or the same units. The results of an experimental study show that, for features with different units, scaling them before k-means clustering yielded better accuracy, precision, recall, and F-score values than using the raw data. Meanwhile, when features in the dataset had the same unit, scaling them beforehand provided results similar to using the raw data. Thus, scaling the features beforehand is a very important step for datasets with different units, improving the clustering results and accuracy. Of the five feature-scaling methods applied to the datasets with different units, Z-score standardization and Percentile transformation provided similar performance, superior to the other methods and to using the raw data. While Maximum absolute scaling performed slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.
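As an illustration of the comparison described above (a minimal sketch, not the study's own code), all five scalers are available in scikit-learn; here the Percentile transformation is approximated with QuantileTransformer, which maps features to their percentile ranks:

from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   QuantileTransformer, MaxAbsScaler, RobustScaler)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
X[:, 1] *= 1000.0  # give the two features very different scales ("units")

scalers = {
    "raw": None,
    "z-score": StandardScaler(),
    "min-max": MinMaxScaler(),
    "percentile": QuantileTransformer(n_quantiles=100, random_state=0),
    "max-abs": MaxAbsScaler(),
    "robust": RobustScaler(),
}

for name, scaler in scalers.items():
    Xs = X if scaler is None else scaler.fit_transform(X)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)
    # Agreement with the true grouping; scaled variants typically beat "raw" here
    print(name, round(adjusted_rand_score(y_true, labels), 3))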
This dataset comprises an array of Mel Frequency Cepstral Coefficients (MFCCs) that have undergone feature scaling, representing a variety of human actions. Feature scaling, or data normalization, is a preprocessing technique used to standardize the range of features in the dataset. For MFCCs, this process helps ensure all coefficients contribute equally to the learning process, preventing features with larger scales from overshadowing those with smaller scales.
In this dataset, the audio signals correspond to diverse human actions such as walking, running, jumping, and dancing. The MFCCs are calculated via a series of signal processing stages, which capture key characteristics of the audio signal in a manner that closely aligns with human auditory perception. The coefficients are then standardized or scaled using methods such as MinMax Scaling or Standardization, thereby normalizing their range. Each normalized MFCC vector corresponds to a segment of the audio signal.
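A minimal sketch of this pipeline, assuming librosa for MFCC extraction and scikit-learn for scaling (neither tool is named in the dataset description, and the file name is hypothetical):

import librosa
from sklearn.preprocessing import MinMaxScaler

y, sr = librosa.load("walking.wav", sr=None)          # hypothetical action clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # shape: (13, n_frames)

# Min-Max scale each coefficient across time so no coefficient dominates
mfcc_scaled = MinMaxScaler().fit_transform(mfcc.T).T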
The dataset is meticulously designed for tasks including human action recognition, classification, segmentation, and detection based on auditory cues. It serves as an essential resource for training and evaluating machine learning models focused on interpreting human actions from audio signals. This dataset proves particularly beneficial for researchers and practitioners in fields such as signal processing, computer vision, and machine learning, who aim to craft algorithms for human action analysis leveraging audio signals.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides processed and normalized/standardized indices for the management tool 'Change Management' (often encompassing Change Management Programs). Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding Change Management dataset in the Management Tool Source Data (Raw Extracts) Dataverse.
Data Files and Processing Methodologies:
Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "change management programs" + "change management" + "change management business". Processing: None; utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.
Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Change Management Programs + Change Management. Processing: Annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.
Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: Absolute monthly publication counts matching Change Management-related keywords [("change management programs" OR ...) AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications. Deduplicated via DOIs. Processing: Monthly relative share calculated (Change Mgmt Count / Total Count); the monthly relative share series was then normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.
Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: Original usability percentages (%) from Bain surveys for specific years: Change Management Programs (2002, 2004, 2010, 2012, 2014, 2017, 2022). Processing: Original usability percentages normalized relative to their historical peak (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (approx.).
Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Change Management Programs (2002-2022). Processing: Standardization (Z-scores) using Z = (X - 3.0) / 0.891609; Index Scale Transformation: Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center = 50, Range ∈ [1, 100]). Frequency: Biennial (approx.).
File Naming Convention: Files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer.
For original extraction details (specific keywords, URLs, etc.), refer to the corresponding Change Management dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
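For illustration, the satisfaction transformation stated above (Z = (X - 3.0) / 0.891609, then Index = 50 + Z * 22) can be reproduced in a few lines of Python; this is a sketch of the published formulas, not the project's own code:

def satisfaction_index(score, mu=3.0, sigma=0.891609):
    """Map a 1-5 Bain satisfaction score onto the index (center = 50)."""
    z = (score - mu) / sigma   # standardization step
    return 50.0 + z * 22.0     # index scale transformation

print(satisfaction_index(5.0))  # ~99.3, near the top of the stated [1, 100] range
print(satisfaction_index(1.0))  # ~0.7, near the bottom of the stated range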
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This training dataset was calculated using the mechanistic modeling approach. See “Big data training data for artificial intelligence-based Li-ion diagnosis and prognosis” (Journal of Power Sources, Volume 479, 15 December 2020, 228806) and “Analysis of Synthetic Voltage vs. Capacity Datasets for Big Data Diagnosis and Prognosis” (Energies, under review) for more details.
The V vs. Q dataset was compiled with a resolution of 0.01 for the triplets and C/25 charges. This accounts for more than 5,000 different paths. Each path was simulated with at most 0.85% increases for each step. The training dataset, therefore, contains more than 700,000 unique voltage vs. capacity curves.
Four variables are included; see the read-me file for details and an example of how to use them.
Cell info: contains information on the setup of the mechanistic model
Qnorm: normalized capacity scale for all voltage curves
pathinfo: index of the simulated conditions for all voltage curves
volt: voltage data; each column corresponds to the voltage simulated under the conditions of the corresponding line in pathinfo
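The read-me file is the authoritative usage guide; as a rough sketch of the variable layout only, with synthetic placeholder arrays standing in for the real data:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic placeholders matching the described layout:
# Qnorm is the shared capacity axis; each column of volt is one simulated path.
Qnorm = np.linspace(0.0, 1.0, 101)
volt = 3.0 + 1.2 * np.sqrt(Qnorm)[:, None] * np.linspace(0.9, 1.0, 5)[None, :]

plt.plot(Qnorm, volt)  # one V vs. Q curve per line of pathinfo
plt.xlabel("Normalized capacity")
plt.ylabel("Voltage (V)")
plt.show()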
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two large-scale, automatically-created datasets of medical concept mentions, linked to the Unified Medical Language System (UMLS).
WikiMed
Derived from Wikipedia data. Mappings of Wikipedia page identifiers to UMLS Concept Unique Identifiers (CUIs) were extracted by crosswalking Wikipedia, Wikidata, Freebase, and the NCBI Taxonomy to reach existing mappings to UMLS CUIs. This created a 1:1 mapping of approximately 60,500 Wikipedia pages to UMLS CUIs. Links to these pages were then extracted as mentions of the corresponding UMLS CUIs.
WikiMed contains:
393,618 Wikipedia page texts
1,067,083 mentions of medical concepts
57,739 unique UMLS CUIs
Manual evaluation of 100 random samples of WikiMed found 91% accuracy in the automatic annotations at the level of UMLS CUIs, and 95% accuracy in terms of semantic type.
PubMedDS
Derived from biomedical literature abstracts from PubMed. Mentions were automatically identified using distant supervision based on Medical Subject Heading (MeSH) headers assigned to the papers in PubMed, and recognition of medical concept mentions using the high-performance scispaCy model. MeSH header codes are included as well as their mappings to UMLS CUIs.
PubMedDS contains:
13,197,430 abstract texts
57,943,354 medical concept mentions
44,881 unique UMLS CUIs
Comparison with existing manually-annotated datasets (NCBI Disease Corpus, BioCDR, and MedMentions) found 75-90% precision in automatic annotations. Please note this dataset is not a comprehensive annotation of medical concept mentions in these abstracts (only mentions located through distant supervision from MeSH headers were included), but is intended as data for concept normalization research.
Due to its size, PubMedDS is distributed as 30 individual files of approximately 1.5 million mentions each.
Data format
Both datasets use JSON format with one document per line. Each document has the following structure:
{ "_id": "A unique identifier of each document", "text": "Contains text over which mentions are ", "title": "Title of Wikipedia/PubMed Article", "split": "[Not in PubMedDS] Dataset split: ", "mentions": [ { "mention": "Surface form of the mention", "start_offset": "Character offset indicating start of the mention", "end_offset": "Character offset indicating end of the mention", "link_id": "UMLS CUI. In case of multiple CUIs, they are concatenated using '|', i.e., CUI1|CUI2|..." }, {} ] }
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains files produced by fMRIPrep that make it possible to transform the fMRI data between different spaces. For instance, any results obtained in a subject's individual anatomical space can be transformed into the MNI standard space, allowing results to be compared between subjects or even with other datasets. Part of THINGS-data: a multimodal collection of large-scale datasets for investigating object representations in brain and behavior. See related materials in the collection at: https://doi.org/10.25452/figshare.plus.c.6161151
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Description:
The myusabank.csv dataset contains daily financial data for a fictional bank (MyUSA Bank) over a two-year period. It includes various key financial metrics such as interest income, interest expense, average earning assets, net income, total assets, shareholder equity, operating expenses, operating income, market share, and stock price. The data is structured to simulate realistic scenarios in the banking sector, including outliers, duplicates, and missing values for educational purposes.
Potential Student Tasks:
Data Cleaning and Preprocessing
Exploratory Data Analysis (EDA)
Calculating Key Performance Indicators (KPIs) (see the sketch below)
Building Tableau Dashboards
Forecasting and Predictive Modeling
Business Insights and Reporting
Educational Goals:
The dataset aims to provide hands-on experience in data preprocessing, analysis, and visualization within the context of banking and finance. It encourages students to apply data science techniques to real-world financial data, enhancing their skills in data-driven decision-making and strategic analysis.
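As one example of the cleaning and KPI tasks, a hedged pandas sketch using textbook banking formulas; the column names are illustrative, since the actual header layout is defined by myusabank.csv itself:

import pandas as pd

df = pd.read_csv("myusabank.csv")  # column names below are illustrative

df = df.drop_duplicates()              # remove the planted duplicate rows
df = df.dropna(subset=["net_income"])  # or impute, depending on the exercise

# Standard banking KPIs (textbook definitions, not prescribed by the dataset):
df["nim"] = (df["interest_income"] - df["interest_expense"]) / df["avg_earning_assets"]
df["roa"] = df["net_income"] / df["total_assets"]
df["roe"] = df["net_income"] / df["shareholder_equity"]
df["efficiency_ratio"] = df["operating_expenses"] / df["operating_income"]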
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the average performance metric values for k-means clustering of datasets having features with different (D1–D5) or the same (S1–S5) units.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides processed and normalized/standardized indices for the management tool 'Customer Segmentation', including the closely related concept of Market Segmentation. Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding Customer Segmentation dataset in the Management Tool Source Data (Raw Extracts) Dataverse.
Data Files and Processing Methodologies:
Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "customer segmentation" + "market segmentation" + "customer segmentation marketing". Processing: None; utilizes the original base-100 normalized Google Trends index. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.
Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Customer Segmentation + Market Segmentation. Processing: Annual relative frequency series normalized (peak year = 100). Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.
Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: Absolute monthly publication counts matching Customer Segmentation-related keywords [("customer segmentation" OR ...) AND (...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly Crossref publications. Deduplicated via DOIs. Processing: Monthly relative share calculated (Segmentation Count / Total Count); the monthly relative share series was then normalized (peak month's share = 100). Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.
Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: Original usability percentages (%) from Bain surveys for specific years: Customer Segmentation (1999, 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2017). Note: not reported in 2022 survey data. Processing: Original usability percentages normalized relative to their historical peak (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (approx.).
Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Customer Segmentation (1999-2017). Note: not reported in 2022 survey data. Processing: Standardization (Z-scores) using Z = (X - 3.0) / 0.891609; Index Scale Transformation: Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center = 50, Range ∈ [1, 100]). Frequency: Biennial (approx.).
File Naming Convention: Files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer.
For original extraction details (specific keywords, URLs, etc.), refer to the corresponding Customer Segmentation dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
Authors: Karen Simonyan, Andrew Zisserman
https://arxiv.org/abs/1409.1556
https://imgur.com/uLXrKxe.jpg" alt="VGG Architecture">
A pre-trained model has been previously trained on a dataset and contains the weights and biases that represent the features of that dataset. Learned features are often transferable to different data. For example, a model trained on a large dataset of bird images will contain learned features, such as edges or horizontal lines, that would be transferable to your dataset.
Pre-trained models are beneficial to us for many reasons. By using a pre-trained model you are saving time. Someone else has already spent the time and compute resources to learn a lot of features and your model will likely benefit from it.
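For instance, a minimal transfer-learning sketch with torchvision's pre-trained VGG16 (torchvision is an assumption here; the description does not prescribe a framework):

import torch
from torchvision import models

# Load VGG16 with ImageNet weights; freeze the convolutional feature extractor.
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for p in vgg16.features.parameters():
    p.requires_grad = False

# Replace the final classifier layer to fine-tune on a new task (e.g., 10 classes).
vgg16.classifier[6] = torch.nn.Linear(4096, 10)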
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The performance results for k-means clustering, and for the hypothesis test of homogeneity between the true grouped data and the clusters obtained after feature scaling, on datasets containing features with the same unit.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Urban areas play a very important role in global climate change. There is an increasing interest in comprehending global urban areas with adequate geographic detail for global climate change mitigation. Accurate and frequent urban area information is fundamental to comprehending urbanization processes and land use/cover change, as well as the impact of global climate and environmental change. Defense Meteorological Satellite Program/Operational Line Scan System (DMSP/OLS) night-light (NTL) imagery contributes powerfully to the spatial characterization of global cities; however, its application potential is seriously limited by its coarse resolution. In this paper, we generate an annual Normalized Difference Urban Index (NDUI) to characterize global urban areas at 30 m resolution from 2000 to 2021 by combining Landsat 5/7/8 Normalized Difference Vegetation Index (NDVI) composites and DMSP/OLS NTL images on the Google Earth Engine (GEE) platform. With the capability to delineate urban boundaries and, at the same time, to present sufficient spatial details within urban areas, the NDUI datasets have the potential for urbanization studies at regional and global scales.
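The cited paper defines the exact NDUI formulation; as a rough sketch only, assuming the common normalized-difference form in which NTL is first rescaled to [0, 1]:

import numpy as np

def ndui(ntl, ndvi, eps=1e-9):
    # ASSUMPTION: normalized-difference combination of rescaled NTL and NDVI;
    # consult the paper for the authoritative formula and preprocessing.
    ntl01 = (ntl - ntl.min()) / (ntl.max() - ntl.min() + eps)
    return (ntl01 - ndvi) / (ntl01 + ndvi + eps)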
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The electric grid is a key enabling infrastructure for the ambitious transition towards carbon neutrality as we grapple with climate change. With deepening penetration of renewable energy resources and electrified transportation, the reliable and secure operation of the electric grid becomes increasingly challenging. In this paper, we present PSML, a first-of-its-kind open-access multi-scale time-series dataset, to aid in the development of data-driven machine learning (ML) based approaches towards reliable operation of future electric grids. The dataset is generated through a novel transmission + distribution (T+D) co-simulation designed to capture the increasingly important interactions and uncertainties of the grid dynamics, containing electric load, renewable generation, weather, voltage and current measurements at multiple spatio-temporal scales. Using PSML, we provide state-of-the-art ML baselines on three challenging use cases of critical importance to achieve: (i) early detection, accurate classification and localization of dynamic disturbance events; (ii) robust hierarchical forecasting of load and renewable energy with the presence of uncertainties and extreme events; and (iii) realistic synthetic generation of physical-law-constrained measurement time series. We envision that this dataset will enable advances for ML in dynamic systems, while simultaneously allowing ML researchers to contribute towards carbon-neutral electricity and mobility.
Data Navigation
Please download and unzip the archive, and place it somewhere accessible for later reproduction of the benchmark results, data loading, and performance evaluation of the proposed methods.
wget https://zenodo.org/record/5130612/files/PSML.zip?download=1
7z x 'PSML.zip?download=1' -o./
Minute-level Load and Renewable
Minute-level PMU Measurements
Millisecond-level PMU Measurements
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides processed and normalized/standardized indices for the management tool group 'Supply Chain Management' (SCM), including related concepts like Supply Chain Integration. Derived from five distinct raw data sources, these indices are specifically designed for comparative longitudinal analysis, enabling the examination of trends and relationships across different empirical domains (web search, literature, academic publishing, and executive adoption). The data presented here represent transformed versions of the original source data, aimed at achieving metric comparability. Users requiring the unprocessed source data should consult the corresponding SCM dataset in the Management Tool Source Data (Raw Extracts) Dataverse.
Data Files and Processing Methodologies:
Google Trends File (Prefix: GT_): Normalized Relative Search Interest (RSI). Input Data: Native monthly RSI values from Google Trends (Jan 2004 - Jan 2025) for the query "supply chain management" + "supply chain logistics" + "supply chain". Processing: None; the dataset utilizes the original Google Trends index, which is base-100 normalized against the peak search interest for the specified terms and period. Output Metric: Monthly Normalized RSI (Base 100). Frequency: Monthly.
Google Books Ngram Viewer File (Prefix: GB_): Normalized Relative Frequency. Input Data: Annual relative frequency values from Google Books Ngram Viewer (1950-2022, English corpus, no smoothing) for the query Supply Chain Management + Supply Chain Integration + Supply Chain. Processing: The annual relative frequency series was normalized by setting the year with the maximum value to 100 and scaling all other years proportionally. Output Metric: Annual Normalized Relative Frequency Index (Base 100). Frequency: Annual.
Crossref.org File (Prefix: CR_): Normalized Relative Publication Share Index. Input Data: Absolute monthly publication counts matching SCM-related keywords [("supply chain management" OR ...) AND ("management" OR ...) - see raw data for full query] in titles/abstracts (1950-2025), alongside total monthly publication counts in Crossref. Data deduplicated via DOIs. Processing: For each month, the relative share of SCM-related publications (SCM Count / Total Crossref Count for that month) was calculated. This monthly relative share series was then normalized by setting the month with the maximum relative share to 100 and scaling all other months proportionally. Output Metric: Monthly Normalized Relative Publication Share Index (Base 100). Frequency: Monthly.
Bain & Co. Survey - Usability File (Prefix: BU_): Normalized Usability Index. Input Data: Original usability percentages (%) from Bain surveys for specific years: Supply Chain Integration (1999, 2000, 2002); Supply Chain Management (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Processing: Semantic Grouping: data points for "Supply Chain Integration" and "Supply Chain Management" were treated as a single conceptual series for SCM. Normalization: the combined series of original usability percentages was normalized relative to its own highest observed historical value across all included years (Max % = 100). Output Metric: Biennial Estimated Normalized Usability Index (Base 100 relative to historical peak). Frequency: Biennial (approx.).
Bain & Co. Survey - Satisfaction File (Prefix: BS_): Standardized Satisfaction Index. Input Data: Original average satisfaction scores (1-5 scale) from Bain surveys for specific years: Supply Chain Integration (1999, 2000, 2002); Supply Chain Management (2004, 2006, 2008, 2010, 2012, 2014, 2017, 2022). Processing: Semantic Grouping: data points for "Supply Chain Integration" and "Supply Chain Management" were treated as a single conceptual series for SCM. Standardization (Z-scores): original scores (X) were standardized using Z = (X - μ) / σ, with μ = 3.0 and σ ≈ 0.891609. Index Scale Transformation: Z-scores were transformed via Index = 50 + (Z * 22). Output Metric: Biennial Standardized Satisfaction Index (Center = 50, Range ∈ [1, 100]). Frequency: Biennial (approx.).
File Naming Convention: Files generally follow the pattern PREFIX_Tool_Processed.csv or similar, where the PREFIX indicates the data source (GT_, GB_, CR_, BU_, BS_). Consult the parent Dataverse description (Management Tool Comparative Indices) for general context and the methodological disclaimer. For original extraction details (specific keywords, URLs, etc.), refer to the corresponding SCM dataset in the Raw Extracts Dataverse. Comprehensive project documentation provides full details on all processing steps.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Untargeted liquid chromatography–mass spectrometry metabolomics studies are typically performed under roughly identical experimental settings. Measurements acquired with different LC-MS protocols or following extended time intervals harbor significant variation in retention times and spectral abundances due to altered chromatographic, spectrometric, and other factors, raising many data analysis challenges. We developed a computational workflow for merging and harmonizing metabolomics data acquired under disparate LC-MS conditions. Plasma metabolite profiles were collected from two sets of maternal subjects three years apart using distinct instruments and LC-MS procedures. Metabolomics features were aligned using metabCombiner to generate lists of compounds detected across all experimental batches. We applied data set-specific normalization methods to remove interbatch and interexperimental variation in spectral intensities, enabling statistical analysis on the assembled data matrix. Bioinformatics analyses revealed large-scale metabolic changes in maternal plasma between the first and third trimesters of pregnancy and between maternal plasma and umbilical cord blood. We observed increases in steroid hormones and free fatty acids from the first trimester to term of gestation, along with decreases in amino acids coupled to increased levels in cord blood. This work demonstrates the viability of integrating nonidentically acquired LC-MS metabolomics data and its utility in unconventional metabolomics study designs.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90 except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite having an assumption of the majority of genes being unchanged, the DESeq2 scaling factors normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
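For reference, a minimal numpy sketch of two of the simpler methods as commonly defined (CPM and Upper Quartile); this is not the study's implementation:

import numpy as np

def cpm(counts):
    """Counts per million: scale each sample (column) by its library size."""
    return counts / counts.sum(axis=0) * 1e6

def upper_quartile(counts):
    """Divide each sample by the 75th percentile of its nonzero counts."""
    uq = np.array([np.percentile(c[c > 0], 75) for c in counts.T])
    return counts / uq * uq.mean()

rng = np.random.default_rng(0)
counts = rng.negative_binomial(5, 0.1, size=(1000, 6)).astype(float)  # genes x samples
normed = upper_quartile(counts)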
With increasing population growth and land-use change, urban communities in the desert southwest are progressively looking to remote basins to supplement existing water supplies. Recent applications for groundwater appropriations from Dixie Valley, Nevada, a primarily undeveloped basin neighboring the Carson Desert to the east, have prompted a reevaluation of the quantity of naturally discharging groundwater.
The objective of this study was to develop a new, independent estimate of groundwater discharge by evapotranspiration (ET) from Dixie Valley using a combination of eddy-covariance evapotranspiration measurements and multispectral satellite imagery. Mean annual groundwater ET (ETg) was estimated during October 2009-2011 at four eddy-covariance sites. Two sites were located in phreatophytic shrubland dominated by greasewood and two were located on a playa. Estimates were scaled to the basin level by combining remotely sensed imagery with field reconnaissance and site-scale ETg estimates.
The Enhanced Vegetation Index (EVI) was calculated for 10 Landsat 5 Thematic Mapper scenes and combined with brightness temperature (TB) in an effort to reduce confounding (high) EVI values resulting from forbs and cheatgrass in sparsely vegetated areas, and from biological soil crusts on surfaces ranging from bare soil to densely vegetated areas. The resulting EVI/TB images represented by this dataset were used to calculate ET units and scale actual and potential ETg to the basin level.
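The EVI input can be computed from Landsat surface reflectance bands with the usual coefficients (a generic sketch; the study's specific EVI/TB combination is described above):

import numpy as np

def evi(nir, red, blue):
    # Standard EVI coefficients: G = 2.5, C1 = 6, C2 = 7.5, L = 1
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)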
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
The dataset simulates student learning behavior during educational sessions, specifically capturing physiological, emotional, and activity-related data. It integrates data collected from multiple IoT sensors, including wearable devices (for tracking movement and physiological states), cameras (for analyzing facial expressions), and motion sensors (for activity tracking). The dataset contains 1,200 student-session records and is structured to represent diverse learning environments, capturing various engagement levels and emotional states.
Here’s a breakdown of the dataset and its features:
Session_ID: Unique identifier for each session. Type: Integer
Student_ID: A unique identifier for each student participating in the session. Type: Integer
HRV (Heart Rate Variability): A physiological measure of heart rate variability, indicating the variability between consecutive heartbeats, which can provide insights into stress or engagement levels. Type: Continuous (normalized values)
Skin_Temperature: Skin temperature during the session, used to infer physiological responses to learning (such as stress or excitement). Type: Continuous (normalized values)
Expression_Joy: A feature extracted from facial expression analysis, representing the level of joy detected on the student's face. Type: Continuous (value between 0 and 1)
Expression_Confusion: A feature extracted from facial expression analysis, representing the level of confusion detected on the student's face. Type: Continuous (value between 0 and 1)
Steps: The number of steps the student has taken during the session, serving as an indicator of activity level. Type: Integer
Emotion: Categorized emotional state of the student during the session, derived from facial expression and engagement analysis. Values: Interest, Boredom, Confusion, Happiness. Type: Categorical
Engagement_Level: A rating scale from 1 to 5 that measures the level of engagement of the student during the session. Type: Integer (1 to 5)
Session_Duration: The total duration of the session in minutes, capturing how long the student was engaged in the learning activity. Type: Integer (15 to 60 minutes)
Learning_Phase: The phase of the learning session. Values: Introduction, Practice, Conclusion. Type: Categorical
Start_Time: The timestamp of when the learning session started. Type: DateTime
End_Time: The timestamp of when the learning session ended. Type: DateTime
Learning_Outcome: The result of the learning session, based on the student's engagement level and session duration. Values: Successful, Unsuccessful, Partially Successful. Type: Categorical
HRV_Frequency_Feature: A frequency-domain feature derived from the Fourier Transform of the HRV signal, capturing periodic fluctuations in heart rate during the session. Type: Continuous
Skin_Temperature_Frequency_Feature: A frequency-domain feature derived from the Fourier Transform of the skin temperature signal, capturing periodic variations in temperature. Type: Continuous
Emotion_Label: A numeric label corresponding to the Emotion column, used for machine learning model training. Values: 0 to 3 (corresponding to Interest, Boredom, Confusion, Happiness). Type: Integer
Learning_Phase_Label: A numeric label corresponding to the Learning_Phase column, used for machine learning model training. Values: 0 to 2 (corresponding to Introduction, Practice, Conclusion). Type: Integer
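The dataset does not document the exact transform behind the frequency-domain features; a generic numpy sketch of one such feature (power at the dominant frequency of a detrended signal) might look like:

import numpy as np

def dominant_frequency_power(signal, fs):
    """Frequency and power of the dominant nonzero frequency of a signal."""
    spectrum = np.abs(np.fft.rfft(signal - np.mean(signal))) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    k = np.argmax(spectrum[1:]) + 1  # skip the DC bin
    return freqs[k], spectrum[k]

# Hypothetical HRV trace sampled at 4 Hz with a slow periodic component
t = np.arange(240) / 4.0
hrv = 0.8 + 0.05 * np.sin(2 * np.pi * 0.1 * t)
f_peak, power = dominant_frequency_power(hrv, fs=4.0)  # f_peak = 0.1 Hz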
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Synthetic end-use specific electric household load profiles for four weather years in 29 European countries. For 2011, the profiles are normalized to an annual sum of 1000 to enable the user to scale them to a preferred annual consumption. For other years, a value other than 1000 is caused by the influence of weather and indicates a higher or lower consumption (e.g. higher consumption for space heating in cold years). For user convenience, we provide heatpump COPs and annual consumption values.
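Because the 2011 profiles sum to 1000, scaling one to a preferred annual consumption is a linear rescale; a short pandas sketch with hypothetical file and column names:

import pandas as pd

profile = pd.read_csv("load_profile_2011.csv")  # hypothetical file/column names
target_annual_kwh = 3500.0

# 2011 profiles sum to 1000, so a linear rescale sets the annual consumption.
profile["load_kwh"] = profile["load"] * target_annual_kwh / 1000.0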