Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset published by Luecken et al. 2021, which contains data from human bone marrow measured through joint profiling of single-nucleus RNA and Antibody-Derived Tags (ADTs) using the 10X 3' Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0.
File Description:
- cite_quality_control.h5mu: Filtered cell-by-feature MuData object after quality control.
- cite_normalization.h5mu: MuData object of normalized data using DSB (denoised and scaled by background) normalization.
- cite_doublet_removal_xdbt.h5mu: MuData of data after doublet removal based on known cell type markers. Cells were removed if they were double positive for mutually exclusive markers with a DSB value >2.5.
- cite_dimensionality_reduction.h5mu: MuData of data after dimensionality reduction.
- cite_batch_correction.h5mu: MuData of data after batch correction.
Citation: Luecken, M. D. et al. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2021).
Original data link: https://openproblems.bio/neurips_docs/data/dataset/
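The rescaling step of DSB normalization can be sketched in a few lines: ADT counts are log-transformed and z-scored per protein against the ambient background estimated from empty droplets. This is a minimal illustration under stated assumptions, not the full published method (which also removes a per-cell technical component); the function name and toy data are hypothetical.

```python
import numpy as np

def dsb_normalize(adt_counts, background_counts):
    """Minimal sketch of the rescaling step of DSB normalization:
    log-transform ADT counts, then z-score each protein against the
    ambient background estimated from empty droplets. The published
    method additionally removes a per-cell technical component
    (not shown here).
    """
    log_cells = np.log1p(adt_counts)
    log_bg = np.log1p(background_counts)
    mu = log_bg.mean(axis=0)  # per-protein background mean
    sd = log_bg.std(axis=0)   # per-protein background spread
    return (log_cells - mu) / sd

# hypothetical toy data: 3 cells x 2 proteins, 100 empty droplets;
# protein 0 is truly expressed, protein 1 sits at background level
rng = np.random.default_rng(0)
cells = rng.poisson(lam=[50, 5], size=(3, 2))
empties = rng.poisson(lam=[5, 5], size=(100, 2))
z = dsb_normalize(cells, empties)
```

Under this scheme, a DSB value >2.5 (the doublet-removal cutoff above) corresponds to a signal well above the background spread.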
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data and supplementary information for the paper entitled "Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages" to be published at EMNLP 2015: Conference on Empirical Methods in Natural Language Processing — September 17–21, 2015 — Lisboa, Portugal.
ABSTRACT: Previous studies have shown that health reports in social media, such as DailyStrength and Twitter, have potential for monitoring health conditions (e.g. adverse drug reactions, infectious diseases) in particular communities. However, in order for a machine to understand and make inferences on these health conditions, the ability to recognise when laymen's terms refer to a particular medical concept (i.e. text normalisation) is required. To achieve this, we propose to adapt an existing phrase-based machine translation (MT) technique and a vector representation of words to map between a social media phrase and a medical concept. We evaluate our proposed approach using a collection of phrases from tweets related to adverse drug reactions. Our experimental results show that the combination of a phrase-based MT technique and the similarity between word vector representations outperforms the baselines that apply only either of them by up to 55%.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data normalization is a crucial step in gene expression analysis, as it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate existing normalization methods, different metrics, or the same metric applied to different datasets, yield inconsistent results, particularly for single-cell RNA sequencing (scRNA-seq) data. In the worst cases, a method evaluated as the best by one metric is evaluated as the poorest by another metric, or a method evaluated as the best using one dataset is evaluated as the poorest using another dataset. This raises an open question: principles need to be established to guide the evaluation of normalization methods. In this study, we propose the principle that a normalization method evaluated as the best by one metric should also be evaluated as the best by another metric (the consistency of metrics), and that a method evaluated as the best using scRNA-seq data should also be evaluated as the best using bulk RNA-seq data or microarray data (the consistency of datasets). We then designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it, together with another metric, mSCC, to evaluate 14 commonly used normalization methods using both scRNA-seq data and bulk RNA-seq data, satisfying the consistency of metrics and the consistency of datasets. Our findings pave the way for future studies on the normalization of gene expression data and its evaluation. The raw gene expression data, normalization methods, and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to select the best method for the normalization of their gene expression data, based on the evaluation of different methods (particularly some data-driven methods or their own methods) under the principles of the consistency of metrics and the consistency of datasets.
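An AUCVC-style metric can be sketched as follows: for each coefficient-of-variation threshold, take the fraction of genes whose CV falls below it, then integrate that curve over the threshold grid. The threshold grid and gene filtering here are assumptions for illustration; the exact definitions in the NormExpression package may differ.

```python
import numpy as np

def aucvc(expr, thresholds=None):
    """Sketch of Area Under normalized CV threshold Curve (AUCVC).

    expr: (genes x samples) matrix of normalized expression.
    For each threshold t, take the fraction of genes whose coefficient
    of variation (sd/mean) is <= t, then integrate that curve over the
    threshold grid. A better normalization leaves more genes with low
    CV, giving a larger area.
    """
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)  # assumed grid
    mean = expr.mean(axis=1)
    sd = expr.std(axis=1)
    safe = np.where(mean > 0, mean, 1.0)
    cv = np.where(mean > 0, sd / safe, np.inf)
    frac = np.array([(cv <= t).mean() for t in thresholds])
    # trapezoidal integration over the threshold grid
    return float(np.sum((frac[1:] + frac[:-1]) / 2.0 * np.diff(thresholds)))
```

A perfectly uniform matrix (every gene constant across samples) attains the maximum area of 1.0; noisier data scores lower.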
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Enriched GO terms for deconvolution. This file is in a tab-separated format and contains the top 200 GO terms that were enriched in the set of DE genes unique to deconvolution. The identifier and name of each term is shown along with the total number of genes associated with the term, the number of associated genes that are also DE, the expected number under the null hypothesis, and the Fisher p value. (13 KB PDF)
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Data used in the experiments described in:
Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany.
https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf
(https://www.linguistics.rub.de/konvens16/)
Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (.orig.txt) and the data with hand-normalised words (.norm.txt). The files are aligned by lines.
There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (<1850)
- goo300k-gaj: historical Slovene, easy case (1850 - 1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)
The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (http://nl.ijs.si/janes/english/).
The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
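The character-level encoding described above can be reproduced with a pair of helpers; the function names and example strings below are hypothetical, and masking of non-relevant tokens with '¿' is assumed to happen before encoding.

```python
def to_char_tokens(line: str) -> str:
    """Encode a line in the dataset's character-separated format:
    spaces inserted between characters, '_' substituted for the
    original space character. Masking of non-relevant tokens (URLs,
    hashtags) with the inverted question mark is assumed to happen
    beforehand and is not reimplemented here.
    """
    return " ".join("_" if ch == " " else ch for ch in line)

def from_char_tokens(line: str) -> str:
    """Invert the encoding back to plain text."""
    return "".join(" " if ch == "_" else ch for ch in line.split(" "))
```

Because the .orig.txt and .norm.txt files are aligned by lines, decoding both sides with the same helper recovers aligned raw/normalized text pairs.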
Normalised Digital Surface Model - 1m resolution. The dataset contains the Normalised Digital Surface Model for the Washington Area. Voids exist in the data due to data redaction conducted under the guidance of the United States Secret Service. All lidar data returns and collected data were removed from the dataset based on the redaction footprint shapefile generated in 2017.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Surface reflectance is a critical physical variable that affects the energy budget in land-atmosphere interactions, feature recognition and classification, and climate change research. This dataset uses the relative radiometric normalization method, taking the Landsat-8 Operational Land Imager (OLI) surface reflectance products as the reference image to normalize the GF-1 satellite WFV sensor cloud-free images of Shandong Province in 2018. Relative radiometric normalization processing mainly includes atmospheric correction, image resampling, image registration, masking, extraction of no-change pixels, and calculation of normalization coefficients. After relative radiometric normalization, for the no-change pixels of each GF-1 WFV image and its reference image, R2 is above 0.7295 and RMSE is below 0.0172. The surface reflectance accuracy of GF-1 WFV images is improved, so they can be used in combination with Landsat data to provide data support for remote sensing quantitative inversion. This dataset is in GeoTIFF format, and the spatial resolution of the imagery is 16 m.
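The core of relative radiometric normalization, fitting a per-band linear map on the no-change (pseudo-invariant) pixels and applying it to the whole band, can be sketched as follows. The atmospheric correction, registration, and pixel-selection steps are not reproduced, and the function name and synthetic data are illustrative only.

```python
import numpy as np

def radiometric_normalize(target, reference, no_change_mask):
    """Fit a per-band linear map (gain, offset) on the no-change
    pixels so the target image (e.g. GF-1 WFV) matches the reference
    reflectance (e.g. Landsat-8 OLI), then apply it to the full band.

    target, reference: 1-D reflectance arrays for one band
    no_change_mask:    boolean mask marking no-change pixels
    """
    gain, offset = np.polyfit(target[no_change_mask],
                              reference[no_change_mask], 1)
    return gain, offset, gain * target + offset

# synthetic check: target differs from reference by a known linear map
target = np.linspace(0.1, 0.5, 20)
reference = 0.9 * target + 0.02
gain, offset, normalized = radiometric_normalize(
    target, reference, np.ones(20, dtype=bool))
```

The R2 and RMSE figures quoted above would be computed between `normalized` and `reference` on the no-change pixels.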
This is a white beam data set from V to normalize the relative detector performance. See the ARCS_175852.md file for more information.
ALL DATA IN THIS RECORD ARE REDUNDANT, I.E., THEY WERE OBTAINED DIRECTLY FROM OTHER DATA IN THIS FILE, USUALLY BY EXTRAPOLATION OR INTEGRATION. THE FOLLOWING COMMENTS ARE TAKEN FROM THE PI N COMPILATION OF R.L. KELLY. THEY ARE THAT COMPILATION'S COMPLETE SET OF COMMENTS FOR PAPERS RELATED TO THE SAME EXPERIMENT (DESIGNATED BUSZA69) AS THE CURRENT PAPER. (THE IDENTIFIER PRECEDING THE REFERENCE AND COMMENT FOR EACH PAPER IS FOR CROSS-REFERENCING WITHIN THESE COMMENTS ONLY AND DOES NOT NECESSARILY AGREE WITH THE SHORT CODE USED ELSEWHERE IN THE PRESENT COMPILATION.) /// BELLAMY65 [E. H. BELLAMY, PROC. ROY. SOC. (LONDON) 289, 509 (1965)] -- /// BUSZA67 [W. BUSZA, NC 52A, 331 (1967)] -- PI- P DCS FROM 2K ELASTIC EVENTS AT EACH OF 5 MOMENTA BETWEEN 1.72 AND 2.46 GEV/C. DONE AT NIMROD WITH OPTICAL SPARK CHAMBERS. THE APPARATUS IS DESCRIBED IN BELLAMY65, THE RESULTS IN BUSZA67. /// BUSZA69 [W. BUSZA, PR 180, 1339 (1969)] -- PI+ P DCS AT 10 MOMENTA BETWEEN 1.72 AND 2.80 GEV/C, AND PI- P DCS AT 5 MOMENTA BETWEEN 2.17 AND 2.80 GEV/C. THE DATA REPORTED IN BUSZA67 ARE ALSO REPEATED HERE. THE NEW MEASUREMENTS WERE DONE WITH AN IMPROVED VERSION OF THE APPARATUS USED BY BUSZA67. THE PI- DATA (INCLUDING BUSZA67) ARE NORMALIZED TO FORWARD DISPERSION RELATIONS; THE PI+ DATA HAVE THEIR OWN EXPERIMENTAL NORMALIZATION BUT NO NE IS GIVEN. WE HAVE INCREASED THE ERROR OF THE MOST FORWARD PI+ POINT AT 1.72 GEV/C BECAUSE OF AN AMBIGUOUS FOOTNOTE CONCERNING THIS POINT. /// COMMENTS FROM LOVELACE71 COMPILATION OF THESE DATA -- LOVELACE71 CLAIMS SOME USE WAS MADE OF FORWARD DISPERSION RELATIONS TO NORMALIZE THE PI+ DATA AS WELL AS THE PI-. THE FOLLOWING NORMALIZATION ERRORS AND RENORMALIZATION FACTORS ARE RECOMMENDED FOR THE PI+ P AND PI- P DIFFERENTIAL CROSS SECTIONS -- PLAB=1720 MEV/C -- NE(PI+ P)=INFIN, NE(PI- P)=INFIN. PLAB=1890 MEV/C -- RF(PI+ P)=1.245, RF(PI- P)=0.941. PLAB=2070 MEV/C -- NE(PI+ P)=INFIN, RF(PI- P)=1.224. PLAB=2170 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1.
PLAB=2270 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=INFIN. PLAB=2360 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2460 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=INFIN. PLAB=2560 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2650 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. PLAB=2800 MEV/C -- NE(PI+ P)=0.1, NE(PI- P)=0.1. /// COMMENTS ON MODIFICATIONS TO LOVELACE71 COMPILATION BY KELLY -- WE HAVE TAKEN ALL PI- NES TO BE INFINITE, AND ALL PI+ NES TO BE UNKNOWN. ALSO ONE MINOR MISTAKE IN THE PI- (PI+) DATA AT 2.36 (2.65) GEV/C HAS BEEN CORRECTED.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of bulk RNA sequencing (RNA-Seq) data is a valuable tool to understand transcription at the genome scale. Targeted sequencing of RNA has emerged as a practical means of assessing the majority of the transcriptomic space with less reliance on large resources for consumables and bioinformatics. TempO-Seq is a templated, multiplexed RNA-Seq platform that interrogates a panel of sentinel genes representative of genome-wide transcription. Nuances of the technology require proper preprocessing of the data. Various methods have been proposed and compared for normalizing bulk RNA-Seq data, but there has been little to no investigation of how the methods perform on TempO-Seq data. We simulated count data into two groups (treated vs. untreated) at seven fold-change (FC) levels (including no change) using control samples from human HepaRG cells run on TempO-Seq, and normalized the data using seven normalization methods. Upper Quartile (UQ) performed the best with regard to maintaining FC levels as detected by a limma contrast between treated vs. untreated groups. For all FC levels, specificity of the UQ normalization was greater than 0.84 and sensitivity greater than 0.90, except for the no change and +1.5 levels. Furthermore, K-means clustering of the simulated genes normalized by UQ agreed the most with the FC assignments [adjusted Rand index (ARI) = 0.67]. Despite its assumption that the majority of genes are unchanged, the DESeq2 scaling factors normalization method performed reasonably well, as did the simple normalization procedures counts per million (CPM) and total counts (TC). These results suggest that for two-class comparisons of TempO-Seq data, UQ, CPM, TC, or DESeq2 normalization should provide reasonably reliable results at absolute FC levels ≥2.0. These findings will help guide researchers to normalize TempO-Seq gene expression data for more reliable results.
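Upper Quartile normalization, the best performer above, can be sketched as follows. This loosely follows the common formulation (scale each sample by the 75th percentile of its nonzero counts); rescaling the factors to a mean of 1 is one of several conventions, and edgeR's implementation differs in detail.

```python
import numpy as np

def upper_quartile_normalize(counts):
    """Scale each sample (column) by the 75th percentile of its
    nonzero counts, with factors rescaled to a mean of 1 (one of
    several conventions; edgeR's formulation differs in detail).

    counts: (genes x samples) raw count matrix
    """
    uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
    factors = uq / uq.mean()
    return counts / factors

# hypothetical toy matrix: sample 2 is sample 1 sequenced at double depth
base = np.arange(1.0, 11.0)
counts = np.column_stack([base, 2.0 * base])
norm = upper_quartile_normalize(counts)
```

After normalization, the depth difference between the two toy samples is removed, so their normalized profiles coincide.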
IMPORTANT! PLEASE READ DISCLAIMER BEFORE USING DATA. This dataset backcasts estimated modeled savings for a subset of 2007-2012 completed projects in the Home Performance with ENERGY STAR® Program against normalized savings calculated by an open source energy efficiency meter available at https://www.openee.io/. Open source code uses utility-grade metered consumption to weather-normalize the pre- and post-consumption data using standard methods with no discretionary independent variables. The open source energy efficiency meter allows private companies, utilities, and regulators to calculate energy savings from energy efficiency retrofits with increased confidence and replicability of results. This dataset is intended to lay a foundation for future innovation and deployment of the open source energy efficiency meter across the residential energy sector, and to help inform stakeholders interested in pay-for-performance programs, where providers are paid for realizing measurable weather-normalized results. To download the open source code, please visit the website at https://github.com/openeemeter/eemeter/releases D I S C L A I M E R: Normalized Savings using open source OEE meter. Several data elements, including Evaluated Annual Electric Savings (kWh), Evaluated Annual Gas Savings (MMBtu), Pre-retrofit Baseline Electric (kWh), Pre-retrofit Baseline Gas (MMBtu), Post-retrofit Usage Electric (kWh), and Post-retrofit Usage Gas (MMBtu), are direct outputs from the open source OEE meter. Home Performance with ENERGY STAR® Estimated Savings. Several data elements, including Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, and Estimated First Year Energy Savings, represent contractor-reported savings derived from energy modeling software calculations and not actual realized energy savings. The accuracy of the Estimated Annual kWh Savings and Estimated Annual MMBtu Savings for projects has been evaluated by an independent third party.
The results of the Home Performance with ENERGY STAR impact analysis indicate that, on average, actual savings amount to 35 percent of the Estimated Annual kWh Savings and 65 percent of the Estimated Annual MMBtu Savings. For more information, please refer to the Evaluation Report published on NYSERDA’s website at: http://www.nyserda.ny.gov/-/media/Files/Publications/PPSER/Program-Evaluation/2012ContractorReports/2012-HPwES-Impact-Report-with-Appendices.pdf. This dataset includes the following data points for a subset of projects completed in 2007-2012: Contractor ID, Project County, Project City, Project ZIP, Climate Zone, Weather Station, Weather Station-Normalization, Project Completion Date, Customer Type, Size of Home, Volume of Home, Number of Units, Year Home Built, Total Project Cost, Contractor Incentive, Total Incentives, Amount Financed through Program, Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, Estimated First Year Energy Savings, Evaluated Annual Electric Savings (kWh), Evaluated Annual Gas Savings (MMBtu), Pre-retrofit Baseline Electric (kWh), Pre-retrofit Baseline Gas (MMBtu), Post-retrofit Usage Electric (kWh), Post-retrofit Usage Gas (MMBtu), Central Hudson, Consolidated Edison, LIPA, National Grid, National Fuel Gas, New York State Electric and Gas, Orange and Rockland, Rochester Gas and Electric. How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov.
This data set represents the average normalized atmospheric (wet) deposition, in kilograms, of Total Inorganic Nitrogen for the year 2002 compiled for every catchment of NHDPlus for the conterminous United States. Estimates of Total Inorganic Nitrogen deposition are based on National Atmospheric Deposition Program (NADP) measurements (B. Larsen, U.S. Geological Survey, written commun., 2007). De-trending methods applied to the year 2002 are described in Alexander and others, 2001. NADP site selection met the following criteria: stations must have records from 1995 to 2002 and have a minimum of 30 observations. The NHDPlus Version 1.1 is an integrated suite of application-ready geospatial datasets that incorporates many of the best features of the National Hydrography Dataset (NHD) and the National Elevation Dataset (NED). The NHDPlus includes a stream network (based on the 1:100,000-scale NHD), improved networking, naming, and value-added attributes (VAAs). NHDPlus also includes elevation-derived catchments (drainage areas) produced using a drainage enforcement technique first widely used in New England, and thus referred to as "the New England Method." This technique involves "burning in" the 1:100,000-scale NHD and, when available, building "walls" using the National Watershed Boundary Dataset (WBD). The resulting modified digital elevation model (HydroDEM) is used to produce hydrologic derivatives that agree with the NHD and WBD. Over the past two years, an interdisciplinary team from the U.S. Geological Survey (USGS), the U.S. Environmental Protection Agency (USEPA), and contractors found that this method produces the best quality NHD catchments using an automated process (USEPA, 2007). The NHDPlus dataset is organized by 18 Production Units that cover the conterminous United States. The NHDPlus version 1.1 data are grouped by the U.S. Geological Survey's Major River Basins (MRBs, Crawford and others, 2006).
MRB1, covering the New England and Mid-Atlantic River basins, contains NHDPlus Production Units 1 and 2. MRB2, covering the South Atlantic-Gulf and Tennessee River basins, contains NHDPlus Production Units 3 and 6. MRB3, covering the Great Lakes, Ohio, Upper Mississippi, and Souris-Red-Rainy River basins, contains NHDPlus Production Units 4, 5, 7 and 9. MRB4, covering the Missouri River basins, contains NHDPlus Production Units 10-lower and 10-upper. MRB5, covering the Lower Mississippi, Arkansas-White-Red, and Texas-Gulf River basins, contains NHDPlus Production Units 8, 11 and 12. MRB6, covering the Rio Grande, Colorado and Great Basin River basins, contains NHDPlus Production Units 13, 14, 15 and 16. MRB7, covering the Pacific Northwest River basins, contains NHDPlus Production Unit 17. MRB8, covering California River basins, contains NHDPlus Production Unit 18.
GINI Index Data consists of information based on primary household survey data obtained from government statistical agencies and World Bank country departments. In economics, the GINI index (sometimes expressed as a GINI ratio, GINI coefficient or a normalized GINI index) is a measure of statistical dispersion intended to represent the income or wealth distribution of a nation's residents, and is the most commonly used measure of inequality.
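The GINI index itself has a simple closed form over sorted values (0 for perfect equality, approaching 1 as one resident holds all income); a minimal sketch:

```python
import numpy as np

def gini(values):
    """Gini coefficient from sorted values:
    G = (2 * sum(i * x_i) - (n + 1) * sum(x)) / (n * sum(x)),
    with x sorted ascending and i = 1..n.
    0 means perfect equality; (n - 1)/n means one holder owns all.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    index = np.arange(1, n + 1)
    return (2 * np.sum(index * x) - (n + 1) * np.sum(x)) / (n * np.sum(x))
```

For example, four residents with equal incomes give G = 0, while one resident holding all income among four gives G = 0.75.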
This dataset contains gridded daily Normalized Difference Vegetation Index (NDVI) derived from the NOAA Climate Data Record (CDR) of Advanced Very High Resolution Radiometer (AVHRR) Surface Reflectance. The data record spans from 1981 to 10 days before the present using data from eight NOAA polar orbiting satellites: NOAA-7, -9, -11, -14, -16, -17, -18 and -19. The data are projected on a 0.05 degree x 0.05 degree global grid. This dataset is one of the Land Surface CDR Version 5 products produced by the NASA Goddard Space Flight Center (GSFC) and the University of Maryland (UMD). Improvements for Version 5 include using the improved surface reflectance data, correcting the data for known errors in time, latitude, and longitude variables, as well as improvements in the global and variable attribute definitions. The dataset is in the netCDF-4 file format following ACDD and CF Conventions. The dataset is accompanied by algorithm documentation, data flow diagram and source code for the NOAA CDR Program.
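NDVI is computed per pixel from the near-infrared and red surface reflectance bands; a minimal sketch, where the small epsilon guard is an implementation choice and not part of the CDR algorithm:

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).
    Dense green vegetation approaches +1; bare soil, water and snow
    fall near or below 0. The eps guard against division by zero is
    an implementation choice, not part of the CDR algorithm.
    """
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)
```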
https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Composite Leading Indicators: Composite Leading Indicator (CLI) Normalized for United States (USALOLITONOSTSAM) from Jan 1955 to Jan 2024 about leading indicator and USA.
This data set represents the average normalized atmospheric (wet) deposition, in kilograms, of Ammonium (NH4) for the year 2002 compiled for every catchment of NHDPlus for the conterminous United States. Estimates of NH4 deposition are based on National Atmospheric Deposition Program (NADP) measurements (B. Larsen, U.S. Geological Survey, written commun., 2007). De-trending methods applied to the year 2002 are described in Alexander and others, 2001. NADP site selection met the following criteria: stations must have records from 1995 to 2002 and have a minimum of 30 observations. The NHDPlus Version 1.1 is an integrated suite of application-ready geospatial datasets that incorporates many of the best features of the National Hydrography Dataset (NHD) and the National Elevation Dataset (NED). The NHDPlus includes a stream network (based on the 1:100,000-scale NHD), improved networking, naming, and value-added attributes (VAAs). NHDPlus also includes elevation-derived catchments (drainage areas) produced using a drainage enforcement technique first widely used in New England, and thus referred to as "the New England Method." This technique involves "burning in" the 1:100,000-scale NHD and, when available, building "walls" using the National Watershed Boundary Dataset (WBD). The resulting modified digital elevation model (HydroDEM) is used to produce hydrologic derivatives that agree with the NHD and WBD. Over the past two years, an interdisciplinary team from the U.S. Geological Survey (USGS), the U.S. Environmental Protection Agency (USEPA), and contractors found that this method produces the best quality NHD catchments using an automated process (USEPA, 2007). The NHDPlus dataset is organized by 18 Production Units that cover the conterminous United States. The NHDPlus version 1.1 data are grouped by the U.S. Geological Survey's Major River Basins (MRBs, Crawford and others, 2006).
MRB1, covering the New England and Mid-Atlantic River basins, contains NHDPlus Production Units 1 and 2. MRB2, covering the South Atlantic-Gulf and Tennessee River basins, contains NHDPlus Production Units 3 and 6. MRB3, covering the Great Lakes, Ohio, Upper Mississippi, and Souris-Red-Rainy River basins, contains NHDPlus Production Units 4, 5, 7 and 9. MRB4, covering the Missouri River basins, contains NHDPlus Production Units 10-lower and 10-upper. MRB5, covering the Lower Mississippi, Arkansas-White-Red, and Texas-Gulf River basins, contains NHDPlus Production Units 8, 11 and 12. MRB6, covering the Rio Grande, Colorado and Great Basin River basins, contains NHDPlus Production Units 13, 14, 15 and 16. MRB7, covering the Pacific Northwest River basins, contains NHDPlus Production Unit 17. MRB8, covering California River basins, contains NHDPlus Production Unit 18.
https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Composite Leading Indicators: Reference Series (GDP) Normalized for United Kingdom (GBRLORSGPNOSTSAM) from Feb 1955 to Aug 2023 about leading indicator, United Kingdom, and GDP.
https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Composite Leading Indicators: Reference Series (GDP) Normalized for China (CHNLORSGPNOSTSAM) from Jan 1978 to Dec 2023 about leading indicator, China, and GDP.
https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Composite Leading Indicators: Reference Series (GDP) Normalized for Korea (KORLORSGPNOSTSAM) from Feb 1960 to Nov 2023 about leading indicator, Korea, and GDP.