Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Handwritten Digits Pixel Dataset is a collection of numerical data representing handwritten digits from 0 to 9. Unlike image datasets that store actual image files, this dataset contains pixel intensity values arranged in a structured tabular format, making it ideal for machine learning and data analysis applications.
The dataset contains handwritten digit samples with the following distribution:
(Note: Actual distribution counts would be calculated from your specific dataset)
import pandas as pd
# Load the dataset
df = pd.read_csv('/kaggle/input/handwritten_digits_pixel_dataset/mnist.csv')
# Separate features and labels
X = df.drop('label', axis=1)
y = df['label']
# Normalize pixel values
X_normalized = X / 255.0
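Continuing the snippet above, a minimal sketch of training a baseline classifier on the normalized pixels; the 80/20 split and the logistic-regression model are illustrative choices and not part of the dataset description:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Hold out 20% of the rows for evaluation (illustrative split, not prescribed by the dataset)
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=42, stratify=y)
# A simple multinomial baseline; any classifier could be substituted here
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))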
The original Ames data used in the competition House Prices: Advanced Regression Techniques (predicting sale price) has been edited and engineered to suit a beginner, so a model can be applied without worrying too much about missing data while focusing on the features.
The train data has shape 1460x80 and the test data has shape 1458x79, with the feature 'SalePrice' to be predicted for the test set. The train data contains both categorical and numerical features.
Detailed information about the data can be obtained from the Data Description file included among the other data files.
a. Handling Missing Values: Some variables such as 'PoolQC', 'MiscFeature', and 'Alley' have over 90% missing values. However, the data description implies that a missing value indicates the absence of that feature in a particular house; further inspection of the dataset and data description confirms that most missing data simply means the feature does not exist for that house.
Similarly, missing values in features such as 'GarageType', 'GarageYrBuilt', and 'BsmtExposure' indicate that the house has no garage or basement, and the corresponding attributes such as 'GarageCars', 'GarageArea', and 'BsmtCond' are set to 0.
A house is likely to have a front lawn area similar to the other houses in its neighborhood, so these missing values are imputed with the median value of that neighborhood.
Missing values in features such as 'SaleType' and 'KitchenCond' have been imputed with the mode of the feature.
b. Dropping Variables: 'Utilities' attribute should be dropped from the data frame because almost all the houses have all public Utilities (E,G,W,& S) available.
c. Further exploration: The feature 'Electrical' has one missing value. The first intuition would be to drop the row, but on further inspection the missing value belongs to a house built in 2006, and after the 1970s all the houses have Standard Circuit Breakers & Romex ('SBrkr') installed. So the value can be inferred from this observation.
d. Transformation: Some variables are really categorical but were represented numerically, such as 'MSSubClass', 'OverallCond', and 'YearSold'/'MonthSold', as they are discrete in nature. These have been transformed to categorical variables.
e. Normalizing the 'SalePrice' Variable: During EDA it was discovered that the sale price of homes is right-skewed. Normalizing it reduces the skewness and the (linear) models fit better. This transformation is left for the user.
Finally, the train and test sets were split and the sale price was appended to the train set.
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.
After these transformations the data can easily be fitted to a model once the features are label encoded and normalized to reduce skewness. The main variable to be predicted is 'SalePrice' for the TestData csv file.
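A minimal sketch of the workflow described above (label encoding, a log transform of 'SalePrice' to reduce skewness, then a linear model); the file names 'TrainData.csv' and 'TestData.csv' are assumptions and may differ from the actual files in the dataset:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Hypothetical file names; adjust to the actual csv files provided
train = pd.read_csv('TrainData.csv')
test = pd.read_csv('TestData.csv')
# Label-encode categorical columns jointly so train and test share the same codes
combined = pd.concat([train.drop('SalePrice', axis=1), test], keys=['train', 'test'])
for col in combined.select_dtypes(include='object').columns:
    combined[col] = combined[col].astype('category').cat.codes
X, X_test = combined.xs('train'), combined.xs('test')
# Log-transform the target to reduce the right skew noted in the EDA
y = np.log1p(train['SalePrice'])
model = LinearRegression().fit(X, y)
predicted_sale_price = np.expm1(model.predict(X_test))  # back-transform to dollars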
The technological advances in mass spectrometry allow us to collect more comprehensive data with higher quality and increasing speed. With the rapidly increasing amount of data generated, the need for streamlining analyses becomes more apparent. Proteomics data are known to be often affected by systemic bias from unknown sources, and failing to adequately normalize the data can lead to erroneous conclusions. To allow researchers to easily evaluate and compare different normalization methods via a user-friendly interface, we have developed “proteiNorm”. The current implementation of proteiNorm accommodates preliminary filters on the peptide and sample levels, followed by an evaluation of several popular normalization methods and visualization of the missing values. The user then selects an adequate normalization method and one of several imputation methods for the subsequent comparison of different differential expression methods and estimation of statistical power. The application of proteiNorm and the interpretation of its results are demonstrated on two tandem mass tag multiplex (TMT6plex and TMT10plex) and one label-free spike-in mass spectrometry example data sets. The three data sets reveal how the normalization methods perform differently on different experimental designs and the need to evaluate normalization methods for each mass spectrometry experiment. With proteiNorm, we provide a user-friendly tool to identify an adequate normalization method and to select an appropriate method for differential expression analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reference genes used in normalizing qRT-PCR data are critical for the accuracy of gene expression analysis. However, many traditional reference genes used in zebrafish early development are not appropriate because of their variable expression levels during embryogenesis. In the present study, we used our previous RNA-Seq dataset to identify novel reference genes suitable for gene expression analysis during zebrafish early developmental stages. We first selected 197 most stably expressed genes from an RNA-Seq dataset (29,291 genes in total), according to the ratio of their maximum to minimum RPKM values. Among the 197 genes, 4 genes with moderate expression levels and the least variation throughout 9 developmental stages were identified as candidate reference genes. Using four independent statistical algorithms (delta-CT, geNorm, BestKeeper and NormFinder), the stability of qRT-PCR expression of these candidates was then evaluated and compared to that of actb1 and actb2, two commonly used zebrafish reference genes. Stability rankings showed that two genes, namely mobk13 (mob4) and lsm12b, were more stable than actb1 and actb2 in most cases. To further test the suitability of mobk13 and lsm12b as novel reference genes, they were used to normalize three well-studied target genes. The results showed that mobk13 and lsm12b were more suitable than actb1 and actb2 with respect to zebrafish early development. We recommend mobk13 and lsm12b as new optimal reference genes for zebrafish qRT-PCR analysis during embryogenesis and early larval stages.
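For readers who want to apply the same screening idea to their own expression tables, a minimal pandas sketch of the max/min-ratio criterion described above; the file name 'rpkm_by_stage.csv' and its layout (genes as rows, developmental stages as columns) are assumptions, not part of the published dataset:
import pandas as pd
# Hypothetical input: rows = genes, columns = RPKM per developmental stage
rpkm = pd.read_csv('rpkm_by_stage.csv', index_col=0)
# Stability proxy: ratio of maximum to minimum RPKM per gene
# (a small constant avoids division by zero for unexpressed genes)
ratio = rpkm.max(axis=1) / (rpkm.min(axis=1) + 1e-9)
# Keep the most stably expressed genes (the study retained 197 of 29,291)
stable_genes = ratio.nsmallest(197).index
print(list(stable_genes[:10]))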
The values in this raster are unit-less scores ranging from 0 to 1 that represent normalized dollars per acre damage claims from antelope on Wyoming lands. This raster is one of 9 inputs used to calculate the "Normalized Importance Index."
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Gene expression analysis is an essential part of biological and medical investigations. Quantitative real-time PCR (qPCR) is characterized by excellent sensitivity, dynamic range, and reproducibility, and is still regarded as the gold standard for quantifying transcript abundance. Parallelization of qPCR, such as with the microfluidic Taqman Fluidigm Biomark platform, enables evaluation of multiple transcripts in samples treated under various conditions. Despite advanced technologies, correct evaluation of the measurements remains challenging. The most widely used methods for evaluating reference gene stability and for calculating gene expression are geNorm and ΔΔCt, respectively. They rely on one or several stable reference genes (RGs) for normalization, thus potentially causing biased results. We therefore applied multivariable regression with a tailored error model to overcome the necessity of stable RGs.
Results: We developed an RG-independent data normalization approach based on a tailored linear error model for parallel qPCR data, called LEMming. It uses the assumption that the mean Ct values within samples of similarly treated groups are equal. Performance of LEMming was evaluated on three data sets with different stability patterns of RGs and compared to the results of geNorm normalization. Data set 1 showed that both methods give similar results if stable RGs are available. Data set 2 included RGs which are stable according to geNorm criteria but became differentially expressed in normalized data evaluated by a t-test; geNorm-normalized data showed an effect of a shifted mean per gene per condition, whereas LEMming-normalized data did not. Comparing the decrease in standard deviation from raw data to geNorm and to LEMming, the latter was superior. In data set 3, stable RGs were available according to geNorm's average expression stability and pairwise variation, but t-tests of the raw data contradicted this. Normalization with RGs resulted in distorted data contradicting the literature, while LEMming-normalized data did not.
Conclusions: If RGs are coexpressed but are not independent of the experimental conditions, the stability criteria based on inter- and intragroup variation fail. The linear error model developed, LEMming, overcomes the dependency on RGs for parallel qPCR measurements, besides resolving biases of both technical and biological nature in qPCR. However, to distinguish systematic errors per treated group from a global treatment effect, an additional measurement is needed; quantification of total cDNA content per sample helps to identify systematic errors.
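This is not the LEMming implementation (a multivariable regression with a tailored error model), but a minimal sketch of the normalization assumption stated above, namely that mean Ct values within samples of similarly treated groups are equal; the file and column layout are hypothetical:
import pandas as pd
# Hypothetical wide table: rows = genes, columns = samples, values = raw Ct
ct = pd.read_csv('qpcr_ct_values.csv', index_col=0)
# Center every sample (column) on its own mean Ct, enforcing equal per-sample means;
# this mirrors the assumption described above without relying on reference genes
ct_centered = ct - ct.mean(axis=0)
print(ct_centered.head())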
The hurricane risk index is simply the product of the cumulative hurricane strikes per coastal county and the CDC Overall Social Vulnerability Index (SVI) for the given county. We normalize the hurricane strikes data to match the SVI data classification scheme (i.e., max value at 1); however, using the raw or normalized values of hurricane strikes has no impact on the spatial pattern of the risk index. Therefore, a risk index of value 1 indicates the county has the highest hurricane strikes of all the counties and is the most vulnerable county in the nation according to the SVI index. Because the analysis is over multiple states, we use the ‘United States’ SVI dataset at the county level. Values of the index are unevenly distributed so we classify intervals using the Jenks method and the first break at 0.08 is roughly equal to the median index value. For counties north of North Carolina, the low hurricane risk is most dependent on the low number of hurricane strikes. The vast majority of the counties fall in the lowest risk category and any in the second lowest category are there because of high social vulnerability.
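A minimal sketch of the index construction as described (strikes normalized to a maximum of 1, then multiplied by the county's SVI); the input file and column names are hypothetical:
import pandas as pd
# Hypothetical per-county table with cumulative hurricane strikes and the overall SVI
df = pd.read_csv('county_hurricane_svi.csv')
# Normalize strikes so the maximum value is 1, matching the SVI classification scheme
df['strikes_norm'] = df['strikes'] / df['strikes'].max()
# Risk index = normalized strikes x overall social vulnerability (both in [0, 1])
df['risk_index'] = df['strikes_norm'] * df['svi']
# A value of 1 means the county has the most strikes and the highest SVI in the nation
print(df.sort_values('risk_index', ascending=False).head())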
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of key environmental and physiological parameters influencing plant growth, including temperature, humidity, solar radiation, pH, total dissolved solids (TDS), leaf green area, and plant height. These features were collected across 24 recorded instances from January 2, 2025, to February 3, 2025. The data were preprocessed to remove inconsistencies and normalize values to ensure model stability and robustness. The dataset includes two sets of Thai Basil images for the 24 instances, as well as CSV data for the features.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Normalization
# Generate a resting state (rs) timeseries (ts)
# Install / load package to make fake fMRI ts
# install.packages("neuRosim")
library(neuRosim)
# Generate a ts
ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
# 3dDetrend -normalize
# R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
# Do for the full timeseries
ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
# Do this again for a shorter version of the same timeseries
ts.shorter.length <- length(ts.normalised.long)/4
ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
# By looking at the summaries, it can be seen that the values of the shorter normalized timeseries are larger in magnitude (the quartiles and extremes increase), because the sum-of-squares constraint is spread over fewer samples
summary(ts.normalised.long)
summary(ts.normalised.short)
# Plot results for the long and short ts
# Truncate the longer ts for plotting only
ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
# Give the plot a title
title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
# Add zero line
lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
# 3dDetrend -normalize -polort 0 for long timeseries
lines(ts.normalised.long.made.shorter, col='blue');
# 3dDetrend -normalize -polort 0 for short timeseries
lines(ts.normalised.short, col='red');
Standardization/modernization
New afni_proc.py command line
afni_proc.py \
-subj_id "$sub_id_name_1" \
-blocks despike tshift align tlrc volreg mask blur scale regress \
-radial_correlate_blocks tcat volreg \
-copy_anat anatomical_warped/anatSS.1.nii.gz \
-anat_has_skull no \
-anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
-anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
-anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
-anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
-anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
-anat_follower_erode fsvent fswm \
-dsets media_?.nii.gz \
-tcat_remove_first_trs 8 \
-tshift_opts_ts -tpattern alt+z2 \
-align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
-tlrc_base "$basedset" \
-tlrc_NL_warp \
-tlrc_NL_warped_dsets \
anatomical_warped/anatQQ.1.nii.gz \
anatomical_warped/anatQQ.1.aff12.1D \
anatomical_warped/anatQQ.1_WARP.nii.gz \
-volreg_align_to MIN_OUTLIER \
-volreg_post_vr_allin yes \
-volreg_pvra_base_index MIN_OUTLIER \
-volreg_align_e2a \
-volreg_tlrc_warp \
-mask_opts_automask -clfrac 0.10 \
-mask_epi_anat yes \
-blur_to_fwhm -blur_size $blur \
-regress_motion_per_run \
-regress_ROI_PC fsvent 3 \
-regress_ROI_PC_per_run fsvent \
-regress_make_corr_vols aeseg fsvent \
-regress_anaticor_fast \
-regress_anaticor_label fswm \
-regress_censor_motion 0.3 \
-regress_censor_outliers 0.1 \
-regress_apply_mot_types demean deriv \
-regress_est_blur_epits \
-regress_est_blur_errts \
-regress_run_clustsim no \
-regress_polort 2 \
-regress_bandpass 0.01 1 \
-html_review_style pythonic
We used similar command lines to generate the ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will make the code used to generate all derivative files available on our github site (https://github.com/lab-lab/nndb). We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, averaging ~40 minutes but variable in length (hence the issue with 3dDetrend -normalize illustrated above). A discussion on the AFNI message board with one of our team (starting here: https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256) led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.
Which timeseries file you use is up to you, but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words:
* Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
* Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
* For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
* For censored data:
  * Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
  * If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.
In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.
Effect on results
IMPORTANT! PLEASE READ DISCLAIMER BEFORE USING DATA. This dataset backcasts estimated modeled savings for a subset of 2007-2012 completed projects in the Home Performance with ENERGY STAR® Program against normalized savings calculated by an open source energy efficiency meter available at https://www.openee.io/. Open source code uses utility-grade metered consumption to weather-normalize the pre- and post-consumption data using standard methods with no discretionary independent variables. The open source energy efficiency meter allows private companies, utilities, and regulators to calculate energy savings from energy efficiency retrofits with increased confidence and replicability of results. This dataset is intended to lay a foundation for future innovation and deployment of the open source energy efficiency meter across the residential energy sector, and to help inform stakeholders interested in pay for performance programs, where providers are paid for realizing measurable weather-normalized results. To download the open source code, please visit the website at https://github.com/openeemeter/eemeter/releases
D I S C L A I M E R: Normalized Savings using open source OEE meter. Several data elements, including Evaluated Annual Electric Savings (kWh), Evaluated Annual Gas Savings (MMBtu), Pre-retrofit Baseline Electric (kWh), Pre-retrofit Baseline Gas (MMBtu), Post-retrofit Usage Electric (kWh), and Post-retrofit Usage Gas (MMBtu), are direct outputs from the open source OEE meter.
Home Performance with ENERGY STAR® Estimated Savings. Several data elements, including Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, and Estimated First Year Energy Savings, represent contractor-reported savings derived from energy modeling software calculations and not actual realized energy savings. The accuracy of the Estimated Annual kWh Savings and Estimated Annual MMBtu Savings for projects has been evaluated by an independent third party. The results of the Home Performance with ENERGY STAR impact analysis indicate that, on average, actual savings amount to 35 percent of the Estimated Annual kWh Savings and 65 percent of the Estimated Annual MMBtu Savings. For more information, please refer to the Evaluation Report published on NYSERDA’s website at: http://www.nyserda.ny.gov/-/media/Files/Publications/PPSER/Program-Evaluation/2012ContractorReports/2012-HPwES-Impact-Report-with-Appendices.pdf.
This dataset includes the following data points for a subset of projects completed in 2007-2012: Contractor ID, Project County, Project City, Project ZIP, Climate Zone, Weather Station, Weather Station-Normalization, Project Completion Date, Customer Type, Size of Home, Volume of Home, Number of Units, Year Home Built, Total Project Cost, Contractor Incentive, Total Incentives, Amount Financed through Program, Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, Estimated First Year Energy Savings, Evaluated Annual Electric Savings (kWh), Evaluated Annual Gas Savings (MMBtu), Pre-retrofit Baseline Electric (kWh), Pre-retrofit Baseline Gas (MMBtu), Post-retrofit Usage Electric (kWh), Post-retrofit Usage Gas (MMBtu), Central Hudson, Consolidated Edison, LIPA, National Grid, National Fuel Gas, New York State Electric and Gas, Orange and Rockland, Rochester Gas and Electric.
How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the improvement of -omics and next-generation sequencing (NGS) methodologies, along with the lowered cost of generating these types of data, the analysis of high-throughput biological data has become standard both for forming and testing biomedical hypotheses. Our knowledge of how to normalize datasets to remove latent undesirable variances has grown extensively, making for standardized data that are easily compared between studies. Here we present the CAncer bioMarker Prediction Pipeline (CAMPP), an open-source R-based wrapper (https://github.com/ELELAB/CAncer-bioMarker-Prediction-Pipeline -CAMPP) intended to aid bioinformatics software users with data analyses. CAMPP is called from a terminal command line and is supported by a user-friendly manual. The pipeline may be run on a local computer and requires little or no knowledge of programming. To avoid issues relating to R-package updates, an renv.lock file is provided to ensure R-package stability. Data management includes missing value imputation, data normalization, and distributional checks. CAMPP performs (I) k-means clustering, (II) differential expression/abundance analysis, (III) elastic-net regression, (IV) correlation and co-expression network analyses, (V) survival analysis, and (VI) protein-protein/miRNA-gene interaction networks. The pipeline returns tabular files and graphical representations of the results. We hope that CAMPP will assist in streamlining bioinformatic analysis of quantitative biological data, whilst ensuring an appropriate bio-statistical framework.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 8: Supplemental Table 7. Immunofluorescence of IFNAR1. Tables of IFNAR1 immunofluorescence intensity, as well as area-normalized mean intensity, for segmented individual cells. Datasets were randomly subset to normalize n values.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains facial expression data (specifically smiling vs. non-smiling) represented in pixel format, stored inside CSV files. It is designed for training and evaluating machine learning and deep learning models for facial expression recognition (FER).
The dataset includes two files:
train.csv
Contains labeled pixel-based image data for training.
test.csv
Contains unlabeled or labeled pixel-based image data for testing/evaluation.
Each CSV file stores image data in the following structure:
pixels → A string or sequence of pixel values (grayscale), typically flattened into a single row per image.
label (in training file only) → Indicates whether the image represents Smile / Non-Smile (or other classes if applicable).
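A minimal sketch of parsing the layout described above; the space-separated encoding of the 'pixels' string is an assumption and may need adjusting to the actual files:
import numpy as np
import pandas as pd
train = pd.read_csv('train.csv')
# Each row stores a flattened grayscale image; space-separated pixel values assumed here
pixels = train['pixels'].apply(lambda s: np.asarray(str(s).split(), dtype=np.float32))
X = np.stack(pixels.to_list())
y = train['label'].to_numpy()
# Scale pixel intensities to [0, 1] before feeding a model
X = X / 255.0
print(X.shape, y.shape)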
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Per-fold number of features on Álvez dataset, weighted and unweighted models.
This data set represents the average normalized atmospheric (wet) deposition, in kilograms, of Total Inorganic Nitrogen for the year 2002 compiled for every catchment of NHDPlus for the conterminous United States. Estimates of Total Inorganic Nitrogen deposition are based on National Atmospheric Deposition Program (NADP) measurements (B. Larsen, U.S. Geological Survey, written commun., 2007). De-trending methods applied to the year 2002 are described in Alexander and others, 2001. NADP site selection met the following criteria: stations must have records from 1995 to 2002 and have a minimum of 30 observations. The NHDPlus Version 1.1 is an integrated suite of application-ready geospatial datasets that incorporates many of the best features of the National Hydrography Dataset (NHD) and the National Elevation Dataset (NED). The NHDPlus includes a stream network (based on the 1:100,000-scale NHD), improved networking, naming, and value-added attributes (VAAs). NHDPlus also includes elevation-derived catchments (drainage areas) produced using a drainage enforcement technique first widely used in New England, and thus referred to as "the New England Method." This technique involves "burning in" the 1:100,000-scale NHD and when available building "walls" using the National Watershed Boundary Dataset (WBD). The resulting modified digital elevation model (HydroDEM) is used to produce hydrologic derivatives that agree with the NHD and WBD. Over the past two years, an interdisciplinary team from the U.S. Geological Survey (USGS), the U.S. Environmental Protection Agency (USEPA), and contractors found that this method produces the best quality NHD catchments using an automated process (USEPA, 2007). The NHDPlus dataset is organized by 18 Production Units that cover the conterminous United States. The NHDPlus version 1.1 data are grouped by the U.S. Geological Survey's Major River Basins (MRBs, Crawford and others, 2006). MRB1, covering the New England and Mid-Atlantic River basins, contains NHDPlus Production Units 1 and 2. MRB2, covering the South Atlantic-Gulf and Tennessee River basins, contains NHDPlus Production Units 3 and 6. MRB3, covering the Great Lakes, Ohio, Upper Mississippi, and Souris-Red-Rainy River basins, contains NHDPlus Production Units 4, 5, 7 and 9. MRB4, covering the Missouri River basins, contains NHDPlus Production Units 10-lower and 10-upper. MRB5, covering the Lower Mississippi, Arkansas-White-Red, and Texas-Gulf River basins, contains NHDPlus Production Units 8, 11 and 12. MRB6, covering the Rio Grande, Colorado and Great Basin River basins, contains NHDPlus Production Units 13, 14, 15 and 16. MRB7, covering the Pacific Northwest River basins, contains NHDPlus Production Unit 17. MRB8, covering California River basins, contains NHDPlus Production Unit 18.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
README — Code and data
Project: LOCALISED
Work Package 7, Task 7.1
Paper: A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union
What this repo does
-------------------
Builds the Transition‑Risk Index (TRI) for EU manufacturing at NUTS‑2 × NACE Rev.2, and reproduces the article’s Figures 3–6:
• Exposure (emissions by region/sector)
• Vulnerability (composite index)
• Risk = Exposure ⊗ Vulnerability
Outputs include intermediate tables, the final analysis dataset, and publication figures.
Folder of interest
------------------
Code and data/
├─ Code/ # R scripts (run in order 1A → 5)
│ └─ Create Initial Data/ # scripts to (re)build Initial data/ from Eurostat API with imputation
├─ Initial data/ # Eurostat inputs imputed for missing values
├─ Derived data/ # intermediates
├─ Final data/ # final analysis-ready tables
└─ Figures/ # exported figures
Quick start
-----------
1) Open R (or RStudio) and set the working directory to “Code and data/Code”.
Example: setwd(".../Code and data/Code")
2) Initial data/ contains the required Eurostat inputs referenced by the scripts.
To reproduce the inputs in Initial data/, run the scripts in Code/Create Initial Data/.
These scripts download the required datasets from the respective API and impute missing values; outputs are written to ../Initial data/.
3) Run scripts sequentially (they use relative paths to ../Raw data, ../Derived data, etc.):
1A-non-sector-data.R → 1B-sector-data.R → 1C-all-data.R → 2-reshape-data.R → 3-normalize-data-by-n-enterpr.R → 4-risk-aggregation.R → 5A-results-maps.R, 5B-results-radar.R
What each script does
---------------------
Create Initial Data — Recreate inputs
• Download source tables from the Eurostat API or the Localised DSP, apply light cleaning, and impute missing values.
• Write the resulting inputs to Initial data/ for the analysis pipeline.
1A / 1B / 1C — Build the unified base
• Read individual Eurostat datasets (some sectoral, some only regional).
• Harmonize, aggregate, and align them into a single analysis-ready schema.
• Write aggregated outputs to Derived data/ (and/or Final data/ as needed).
2 — Reshape and enrich
• Reshapes the combined data and adds metadata.
• Output: Derived data/2_All_data_long_READY.xlsx (all raw indicators in tidy long format, with indicator names and values).
3 — Normalize (enterprises & min–max)
• Divide selected indicators by number of enterprises.
• Apply min–max normalization to [0.01, 0.99].
• Exposure keeps real zeros (zeros remain zero).
• Write normalized tables to Derived data/ or Final data/.
4 — Aggregate indices
• Vulnerability: build dimension scores (Energy, Labour, Finance, Supply Chain, Technology).
– Within each dimension: equal‑weight mean of directionally aligned, [0.01,0.99]‑scaled indicators.
– Dimension scores are re‑scaled to [0.01,0.99].
• Aggregate Vulnerability: equal‑weight mean of the five dimensions.
• TRI (Risk): combine Exposure (E) and Vulnerability (V) via a weighted geometric rule with α = 0.5 in the baseline.
– Policy‑intuitive properties: high E & high V → high risk; imbalances penalized (non‑compensatory).
• Output: Final data/ (main analysis tables).
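The pipeline itself is written in R, but the two formulas above can be illustrated with a short sketch (Python used here only for illustration); the [0.01, 0.99] min–max scaling follows script 3, while E^α · V^(1−α) is one common form of a weighted geometric rule and the exact form used in script 4 may differ:
import numpy as np

def minmax_scale(x, lo=0.01, hi=0.99):
    # Min-max scale a vector to [lo, hi], as described for script 3
    x = np.asarray(x, dtype=float)
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

def tri(exposure, vulnerability, alpha=0.5):
    # One common weighted geometric rule: high E and high V give high risk,
    # and imbalances between the two are penalized (non-compensatory)
    return exposure ** alpha * vulnerability ** (1 - alpha)

# Toy region-sector cells, purely illustrative
E = minmax_scale([10.0, 250.0, 1200.0])   # e.g. emissions per enterprise
V = minmax_scale([0.2, 0.8, 0.5])         # composite vulnerability
print(tri(E, V))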
5A / 5B — Visualize results
• 5A: maps and distribution plots for Exposure, Vulnerability, and Risk → Figures 3 & 4.
• 5B: comparative/radar profiles for selected countries/regions/subsectors → Figures 5 & 6.
• Outputs saved to Figures/.
Data flow (at a glance)
-----------------------
Initial data → (1A–1C) Aggregated base → (2) Tidy long file → (3) Normalized indicators → (4) Composite indices → (5) Figures
     └─ outputs: Derived data/  ·  2_All_data_long_READY.xlsx  ·  Final data/ & Figures/
Assumptions & conventions
-------------------------
• Geography: EU NUTS‑2 regions; Sector: NACE Rev.2 manufacturing subsectors.
• Equal weights by default where no evidence supports alternatives.
• All indicators directionally aligned so that higher = greater transition difficulty.
• Relative paths assume working directory = Code/.
Reproducing the article
-----------------------
• Optionally run the codes from the Code/Create Initial Data subfolder
• Run 1A → 5B without interruption to regenerate:
– Figure 3: Exposure, Vulnerability, Risk maps (total manufacturing).
– Figure 4: Vulnerability dimensions (Energy, Labour, Finance, Supply Chain, Technology).
– Figure 5: Drivers of risk—highest vs. lowest risk regions (example: Germany & Greece).
– Figure 6: Subsector case (e.g., basic metals) by selected regions.
• Final tables for the paper live in Final data/. Figures export to Figures/.
Requirements
------------
• R (version per your environment).
• Install any missing packages listed at the top of each script (e.g., install.packages("...")).
Troubleshooting
---------------
• “File not found”: check that the previous script finished and wrote its outputs to the expected folder.
• Paths: confirm getwd() ends with /Code so relative paths resolve to ../Raw data, ../Derived data, etc.
• Reruns: optionally clear Derived data/, Final data/, and Figures/ before a clean rebuild.
Provenance & citation
---------------------
• Inputs: Eurostat and related sources cited in the paper and headers of the scripts.
• Methods: OECD composite‑indicator guidance; IPCC AR6 risk framing (see paper references).
• If you use this code, please cite the article:
A Systemic Framework for Assessing the Risk of Decarbonization to Regional Manufacturing Activities in the European Union.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Zurich Summer v1.0" dataset is a collection of 20 chips (crops), taken from a QuickBird acquisition of the city of Zurich (Switzerland) in August 2002. QuickBird images are composed by 4 channels (NIR-R-G-B) and were pansharpened to the PAN resolution of about 0.62 cm GSD. We manually annotated 8 different urban and periurban classes : Roads, Buildings, Trees, Grass, Bare Soil, Water, Railways and Swimming pools. The cumulative number of class samples is highly unbalanced, to reflect real world situations. Note that annotations are not perfect, are not ultradense (not every pixel is annotated) and there might be some errors as well. We performed annotations by jointly selecting superpixels (SLIC) and drawing (freehand) over regions which we could confidently assign an object class.
The dataset is composed of 20 image - ground truth pairs, in GeoTIFF format. Images are distributed in raw DN values. We provide a rough-and-dirty MATLAB script (preprocess.m) to:
i) extract basic statistics from images (min, max, mean and average std) which should be used to globally normalize the data (note that class distribution of the chips is highly uneven, so single-frame normalization would shift distribution of classes).
ii) Visualize raw DN images (with unsaturated values) and a corresponding stretched version (good for illustration purposes). It also saves a raw and adjusted image version in MATLAB format (.mat) in a local subfolder.
iii) Convert RGB annotations to index mask (CLASS \in {1,...,C}) (via rgb2label.m provided).
iv) Convert index mask to georeferenced RGB annotations (via rgb2label.m provided). Useful if you want to see the final maps of the tiles in some GIS software (coordinate system copied from original geotiffs).
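The provided preprocessing script is MATLAB, but the global normalization of point i) (statistics computed over all chips rather than per frame) can be sketched as follows; the Python code, the in-memory chip format, and the choice of mean/std standardization are assumptions for illustration only:
import numpy as np

def global_normalize(chips):
    # `chips` is assumed to be a list of (H, W, 4) arrays of raw DN values,
    # loaded from the GeoTIFFs with any reader of your choice
    stacked = np.concatenate([c.reshape(-1, c.shape[-1]) for c in chips], axis=0)
    # One global mean and std per band across ALL chips, so that the uneven
    # class distribution of individual frames does not shift the statistics
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)
    return [(c - mean) / std for c in chips], mean, std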
Some requests from us
We encourage researchers to report the ID of images used for training / validation / test (e.g. train: zh1 to zh7, validation zh8 to zh12 and test zh13 to zh20). The purpose of distributing datasets is to encourage reproducibility of experiments.
Acknowledgements
We release this data after a kind agreement obtained with DigitalGlobe, co. This data can be redistributed freely, provided that this document and the corresponding license are part of the distribution. Ideally, since the dataset could be updated over time, I suggest distributing it via the official link from which this archive has been downloaded.
We would like to thank (a lot) Nathan Longbotham @ DigitalGlobe and the whole DG team for their help in granting the distribution of the dataset.
We release this dataset hoping that it will help researchers working on semantic classification / segmentation of remote sensing data, both in comparing against other state-of-the-art methods that use this dataset and in testing models on a larger and more complete set of images (with respect to most benchmarks available in our community). As you can imagine, preparing everything has been tedious work. Just for you.
If you are using the data, please cite the following work:
Volpi, M. & Ferrari, V.; Semantic segmentation of urban scenes by learning local class interactions, In IEEE CVPR 2015 Workshop "Looking from above: when Earth observation meets vision" (EARTHVISION), Boston, USA, 2015.
This Level 1 (L1) dataset contains the Version 2.1 geo-located Delay Doppler Maps (DDMs) calibrated into Power Received (Watts) and Bistatic Radar Cross Section (BRCS) expressed in units of meters squared from the Delay Doppler Mapping Instrument aboard the CYGNSS satellite constellation. This version supersedes Version 2.0. Other useful scientific and engineering measurement parameters include the DDM of Normalized Bistatic Radar Cross Section (NBRCS), the Delay Doppler Map Average (DDMA) of the NBRCS near the specular reflection point, and the Leading Edge Slope (LES) of the integrated delay waveform. The L1 dataset contains a number of other engineering and science measurement parameters, including sets of quality flags/indicators, error estimates, and bias estimates as well as a variety of orbital, spacecraft/sensor health, timekeeping, and geolocation parameters. At most, 8 netCDF data files (each file corresponding to a unique spacecraft in the CYGNSS constellation) are provided each day; under nominal conditions, there are typically 6-8 spacecraft retrieving data each day, but this can be maximized to 8 spacecraft under special circumstances in which higher than normal retrieval frequency is needed (i.e., during tropical storms and/or hurricanes). Latency is approximately 6 days (or better) from the last recorded measurement time. The Version 2.1 release represents the second science-quality release. Improvements in the Version 2.1 data release include:
1) data are now available when the CYGNSS satellites are rolled away from nadir during orbital high beta-angle periods, resulting in a significant amount of additional data;
2) corrections to coordinate frames result in more accurate estimates of receiver antenna gain at the specular point;
3) improved calibration for analog-to-digital conversion results in better consistency between CYGNSS satellite measurements at nearly the same location and time;
4) improved GPS EIRP and transmit antenna pattern calibration results in significantly reduced PRN-dependence in the observables;
5) improved estimation of the location of the specular point within the DDM;
6) an altitude-dependent scattering area is used to normalize the scattering cross section (v2.0 used a simpler scattering area model that varied with incidence and azimuth angles but not altitude);
7) corrections added for noise floor-dependent biases in scattering cross section and leading edge slope of delay waveform observed in the v2.0 data.
Users should also note that the receiver antenna pattern calibration is not applied per-DDM-bin in this v2.1 release.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset provides grayscale pixel values for brain tumor MRI images, stored in a CSV format for simplified access and ease of use. The goal is to create a "MNIST-like" dataset for brain tumors, where each row in the CSV file represents the pixel values of a single image in its original resolution. This format makes it convenient for researchers and developers to quickly load and analyze MRI data for brain tumor detection, classification, and segmentation tasks without needing to handle large image files directly.
Brain tumor classification and segmentation are critical tasks in medical imaging, and datasets like these are valuable for developing and testing machine learning and deep learning models. While there are several publicly available brain tumor image datasets, they often consist of large image files that can be challenging to process. This CSV-based dataset addresses that by providing a compact and accessible format. Potential use cases include:
- Tumor Classification: Identifying different types of brain tumors, such as glioma, meningioma, and pituitary tumors, or distinguishing between tumor and non-tumor images.
- Tumor Segmentation: Applying pixel-level classification and segmentation techniques for tumor boundary detection.
- Educational and Rapid Prototyping: Ideal for educational purposes or quick experimentation without requiring large image processing capabilities.
This dataset is structured as a single CSV file where each row represents an image, and each column represents a grayscale pixel value. The pixel values are stored as integers ranging from 0 (black) to 255 (white).
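A minimal sketch of loading the CSV and scaling pixel values; the file name and the presence and name of a label column are assumptions, since the exact schema is not spelled out above:
import numpy as np
import pandas as pd
# Hypothetical file name for the single CSV described above
df = pd.read_csv('brain_tumor_pixels.csv')
# Separate an optional label column (name assumed) from the pixel columns
label_col = 'label' if 'label' in df.columns else None
y = df[label_col].to_numpy() if label_col else None
X = (df.drop(columns=[label_col]) if label_col else df).to_numpy(dtype=np.float32)
# Scale grayscale intensities from [0, 255] to [0, 1]; rows of differing original
# resolutions would need per-row handling rather than one dense matrix
X = X / 255.0
print(X.shape)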
This dataset is intended for research and educational purposes only. Users are encouraged to cite and credit the original data sources if using this dataset in any publications or projects. This is a derived CSV version aimed to simplify access and usability for machine learning and data science applications.
This data set represents the average normalized atmospheric (wet) deposition, in kilograms, of Nitrate (NO3) for the year 2002 compiled for every catchment of NHDPlus for the conterminous United States. Estimates of NO3 deposition are based on National Atmospheric Deposition Program (NADP) measurements (B. Larsen, U.S. Geological Survey, written commun., 2007). De-trending methods applied to the year 2002 are described in Alexander and others, 2001. NADP site selection met the following criteria: stations must have records from 1995 to 2002 and have a minimum of 30 observations. The NHDPlus Version 1.1 is an integrated suite of application-ready geospatial datasets that incorporates many of the best features of the National Hydrography Dataset (NHD) and the National Elevation Dataset (NED). The NHDPlus includes a stream network (based on the 1:100,000-scale NHD), improved networking, naming, and value-added attributes (VAAs). NHDPlus also includes elevation-derived catchments (drainage areas) produced using a drainage enforcement technique first widely used in New England, and thus referred to as "the New England Method." This technique involves "burning in" the 1:100,000-scale NHD and when available building "walls" using the National Watershed Boundary Dataset (WBD). The resulting modified digital elevation model (HydroDEM) is used to produce hydrologic derivatives that agree with the NHD and WBD. Over the past two years, an interdisciplinary team from the U.S. Geological Survey (USGS), the U.S. Environmental Protection Agency (USEPA), and contractors found that this method produces the best quality NHD catchments using an automated process (USEPA, 2007). The NHDPlus dataset is organized by 18 Production Units that cover the conterminous United States. The NHDPlus version 1.1 data are grouped by the U.S. Geological Survey's Major River Basins (MRBs, Crawford and others, 2006). MRB1, covering the New England and Mid-Atlantic River basins, contains NHDPlus Production Units 1 and 2. MRB2, covering the South Atlantic-Gulf and Tennessee River basins, contains NHDPlus Production Units 3 and 6. MRB3, covering the Great Lakes, Ohio, Upper Mississippi, and Souris-Red-Rainy River basins, contains NHDPlus Production Units 4, 5, 7 and 9. MRB4, covering the Missouri River basins, contains NHDPlus Production Units 10-lower and 10-upper. MRB5, covering the Lower Mississippi, Arkansas-White-Red, and Texas-Gulf River basins, contains NHDPlus Production Units 8, 11 and 12. MRB6, covering the Rio Grande, Colorado and Great Basin River basins, contains NHDPlus Production Units 13, 14, 15 and 16. MRB7, covering the Pacific Northwest River basins, contains NHDPlus Production Unit 17. MRB8, covering California River basins, contains NHDPlus Production Unit 18.