7 datasets found
  1. mnist_784

    • openml.org
    Updated Sep 29, 2014
    Cite
    Yann LeCun; Corinna Cortes; Christopher J.C. Burges (2014). mnist_784 [Dataset]. https://www.openml.org/d/554
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    48 scholarly articles cite this dataset (per Google Scholar)
    Dataset updated
    Sep 29, 2014
    Authors
    Yann LeCun; Corinna Cortes; Christopher J.C. Burges
    Description

    Author: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
    Source: MNIST Website - Date unknown
    Please cite:

    The MNIST database of handwritten digits with 784 features; raw data available at http://yann.lecun.com/exdb/mnist/. It can be split into a training set of the first 60,000 examples and a test set of 10,000 examples.

    It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.

    With some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass. If you do this kind of pre-processing, you should report it in your publications. The MNIST database was constructed from NIST's Special Database 3 (SD-3) and Special Database 1 (SD-1). NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this is that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test among the complete set of samples. Therefore it was necessary to build a new database by mixing NIST's datasets.

    The MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. Our test set was composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training set contained examples from approximately 250 writers. We made sure that the sets of writers of the training set and test set were disjoint. SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 are available, and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. Only a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3) is available on this site. The full 60,000 sample training set is available.
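
    For convenience, a minimal Python sketch for loading this dataset from OpenML and reproducing the conventional 60,000/10,000 split described above (it assumes scikit-learn is installed; the split indices follow the description, everything else is illustrative):

```python
# Minimal sketch: load mnist_784 from OpenML and apply the conventional split.
from sklearn.datasets import fetch_openml

# Downloads the dataset behind https://www.openml.org/d/554 (cached locally by scikit-learn).
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# The first 60,000 examples form the training set, the last 10,000 the test set.
X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]

print(X_train.shape, X_test.shape)  # (60000, 784) (10000, 784)
```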

  2. GATE simulated cylindrical PET with NEMA-like phantom

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 15, 2024
    + more versions
    Cite
    Wettenhovi, Ville-Veikko (2024). GATE simulated cylindrical PET with NEMA-like phantom [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12743217
    Dataset updated
    Jul 15, 2024
    Dataset authored and provided by
    Wettenhovi, Ville-Veikko
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is GATE-simulated data from a cylindrical PET scanner (based on the GATE cylindrical PET example) and a NEMA-like phantom. Included are the sinograms created by the OMEGA software, for both TOF and non-TOF cases, as mat-files, as well as normalization correction coefficients. You can open these mat-files in MATLAB, Octave, Python, Julia, or practically any other language. This data can also be used as test data for OMEGA. Also included are the ground truth image (i.e. the source image), the attenuation image as created by GATE, and the original ROOT files. The ROOT file package also contains the original macros that give details on the scanner and the phantom.

    The sinogram data contains several different sinograms: raw_SinM is the raw sinogram with no modifications; SinM has normalization and randoms corrections pre-applied (not available for TOF data); SinDelayed contains the delayed coincidences; SinTrues the trues; SinRandoms the true randoms; SinScatter the true scattered photons; appliedCorrections shows the corrections applied to SinM; and RandProp and ScatterProp show whether variance reduction or smoothing was applied to the delayed coincidences or to the (not present) scatter estimation data. The attenuation data is already correctly scaled and is saved as a MetaImage file.

    Also included is the ground truth, or original source, image. This is saved as the variable C. RA is the randoms image while SC is the scatter image. The latter two are in singles mode, i.e. they show the locations of the photons that either were random (two different events) or scattered along the way.

    The normalization data works for other measurements with the same scanner as long as the sinogram dimensions remain the same. OMEGA will automatically use the normalization data if it's present in the mat-files folder.
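
    As a hedged sketch, the mat-files can be inspected in Python roughly as below; the file name is hypothetical, the variable names come from the description above, and if the files were saved in MATLAB's v7.3 (HDF5) format, h5py would be needed instead of scipy:

```python
# Minimal sketch: inspect one of the sinogram mat-files (file name is hypothetical).
from scipy.io import loadmat

data = loadmat("sinogram_nontof.mat")  # replace with the actual mat-file name

# List the stored variables (raw_SinM, SinM, SinDelayed, SinTrues, ... per the description).
for key, value in data.items():
    if not key.startswith("__"):
        print(key, getattr(value, "shape", type(value)))

raw_sinogram = data["raw_SinM"]  # uncorrected sinogram
```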

  3. A Python Code for Statistical Mirroring

    • data.mendeley.com
    Updated Oct 14, 2024
    + more versions
    Cite
    Kabir Bindawa Abdullahi (2024). A Python Code for Statistical Mirroring [Dataset]. http://doi.org/10.17632/ppfvc65m2v.4
    Dataset updated
    Oct 14, 2024
    Authors
    Kabir Bindawa Abdullahi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistical mirroring is the measure of the proximity or deviation of transformed data points from a specified location estimate within a given distribution [2]. Within the framework of Kabirian-based optinalysis [1], statistical mirroring is conceptualized as the isoreflectivity of the transformed data points to a defined statistical mirror. This statistical mirror is an amplified location estimate of the distribution, achieved through a specified size or length. The location estimate may include parameters such as the mean, median, mode, maximum, minimum, or a reference value [2]. The process of statistical mirroring comprises two distinct phases:

    a) Preprocessing phase [2]: This involves applying preprocessing transformations, such as compulsory theoretical ordering, with or without centering the data. It also encompasses tasks like statistical mirror design and optimizations within the established optinalytic construction. These optimizations include selecting an efficient pairing style, central normalization, and establishing an isoreflective pair between the preprocessed data and its designed statistical mirror.

    b) Optinalytic model calculation phase [1]: This phase is focused on computing estimates based on Kabirian-based isomorphic optinalysis models.

    References:
    [1] K.B. Abdullahi, Kabirian-based optinalysis: A conceptually grounded framework for symmetry/asymmetry, similarity/dissimilarity, and identity/unidentity estimations in mathematical structures and biological sequences, MethodsX 11 (2023) 102400. https://doi.org/10.1016/j.mex.2023.102400
    [2] K.B. Abdullahi, Statistical mirroring: A robust method for statistical dispersion estimation, MethodsX 12 (2024) 102682. https://doi.org/10.1016/j.mex.2024.102682

  4. TCGA Glioblastoma Multiforme (GBM) Gene Expression

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 27, 2023
    Cite
    Swati Baskiyar (2023). TCGA Glioblastoma Multiforme (GBM) Gene Expression [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8187688
    Dataset updated
    Jul 27, 2023
    Dataset authored and provided by
    Swati Baskiyar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract:

    The Cancer Genome Atlas (TCGA) was a large-scale collaborative project initiated by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). It aimed to comprehensively characterize the genomic and molecular landscape of various cancer types. This dataset contains information about GBM, an aggressive and highly malignant brain tumor that arises from glial cells, characterized by rapid growth and infiltrative behavior. The gene expression profile was measured experimentally using the Affymetrix HT Human Genome U133a microarray platform by the Broad Institute of MIT and Harvard University cancer genomic characterization center. The Sample IDs serve as unique identifiers for each sample.

    Inspiration:

    This dataset was uploaded to U-BRITE for the GTKB project.

    Instruction:

    The log2(x) normalization was removed, and z-normalization was performed on the dataset using a Python script.
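
    The instruction above is terse, so here is a hedged sketch of what that preprocessing could look like in Python; the file name and orientation (genes as rows, samples as columns) are assumptions, and the exact log2 offset used by the source matrix is not stated here:

```python
# Hedged sketch: undo a log2 transform and z-normalize each gene (assumptions noted above).
import numpy as np
import pandas as pd

expr = pd.read_csv("gbm_expression.tsv", sep="\t", index_col=0)  # hypothetical file name

linear = np.power(2.0, expr)  # remove the log2(x) normalization
# Per-gene (per-row) z-scores across samples.
z = linear.sub(linear.mean(axis=1), axis=0).div(linear.std(axis=1), axis=0)

z.to_csv("gbm_expression_znorm.tsv", sep="\t")
```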

    Acknowledgments:

    Goldman, M.J., Craft, B., Hastie, M. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0546-8

    The Cancer Genome Atlas Research Network., Weinstein, J., Collisson, E. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 45, 1113–1120 (2013). https://doi.org/10.1038/ng.2764

    U-BRITE last update: 07/13/2023

  5. Onset of mining operations

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 17, 2024
    Cite
    Remelgado, Ruben (2024). Onset of mining operations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8214548
    Dataset updated
    Mar 17, 2024
    Dataset provided by
    Remelgado, Ruben
    Meyer, Carsten
    Description

    Motivation

    Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is essential for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet this dataset is temporally static. To address this limitation, we mined the Landsat archive to infer the first observable year of mining.

    Approach

    For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.

    After completing the extraction, we estimated mean spectral profiles for each acquisition date, once for the samples “inside” the mining area and once for those “outside” it. In this process, we masked pixels afflicted by clouds and cloud shadows using Landsat's quality information.

    Using the time-series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed to emphasize differences in the shape of the spectral profiles rather than in specific values, which can be related to radiometric inaccuracies or simply to differences in acquisition dates. This resulted in an RMSE time-series for each mining site. A rough illustration of this step is sketched below.
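
    The following is not the authors' code, only a hedged illustration of the normalization-plus-RMSE comparison for one acquisition date; the min-max normalization scheme is an assumption, since the text does not specify which one was used:

```python
# Hedged sketch: normalized RMSE between mean "inside" and "outside" spectral profiles.
import numpy as np

def normalized_rmse(inside, outside):
    """Min-max normalize each mean profile, then compute the RMSE between them."""
    inside = np.asarray(inside, dtype=float)
    outside = np.asarray(outside, dtype=float)
    norm = lambda p: (p - p.min()) / (p.max() - p.min())
    return float(np.sqrt(np.mean((norm(inside) - norm(outside)) ** 2)))

# Example with hypothetical 6-band mean surface-reflectance values for one date.
rmse = normalized_rmse([0.05, 0.08, 0.07, 0.30, 0.25, 0.18],
                       [0.04, 0.07, 0.06, 0.35, 0.22, 0.15])
print(rmse)
```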

    We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time-series with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees, whereas the outskirts of the mine would remain forested. In this example, the accumulated values would tilt upwards. However, if a mining exploration was accompanied by the removal of vegetation along its outskirts where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.

    To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we used the Area Under the Curve (AUC). An elbow is characterized by a convex shape of a time-series, which makes the AUC greater than 50%. However, if the shape of the curve is concave, the knee is the most adequate metric. We limited the detection of shifts to time-series with at least 100 time steps. When below this threshold, we assumed the mine (or the conditions to sustain it) had been present since 1990.
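
    A minimal sketch of this shift-detection step is given below; it assumes the kneebow API (Rotor.fit_rotate with get_elbow_index/get_knee_index) and uses a simple mean of the normalized cumulative curve as an AUC proxy, so the thresholds mirror the description but the details are illustrative:

```python
# Hedged sketch: pick the knee or elbow of a cumulative RMSE series (details illustrative).
import numpy as np
from kneebow.rotor import Rotor  # assumed API: fit_rotate(), get_elbow_index(), get_knee_index()

def first_mining_year(years, rmse, min_steps=100):
    """Return the year of the detected shift in a per-site RMSE time-series."""
    if len(rmse) < min_steps:
        return years[0]  # too few time steps: assume the mine was present since 1990

    cum = np.cumsum(rmse)  # noise-damped, direction-preserving series
    data = np.column_stack([np.arange(len(cum)), cum])

    # Mean of the min-max normalized curve approximates the AUC on a unit interval;
    # AUC > 0.5 (convex curve) -> elbow, otherwise -> knee, as in the description.
    norm = (cum - cum.min()) / (cum.max() - cum.min())
    auc = norm.mean()

    rotor = Rotor()
    rotor.fit_rotate(data)
    idx = rotor.get_elbow_index() if auc > 0.5 else rotor.get_knee_index()
    return years[idx]
```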

    Content

    This repository contains the infrastructure used to infer the start of a mining operation, which is organized as follows:

    00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.

    01_analysis - Contains several outputs of our analysis:

    xy.tar.gz - Sample locations for each mining site.

    sr.tar.gz - Spectral profiles for each sample location.

    mine_start.csv - First year when we detected the start of mining.

    02_code - Includes all code used in our analysis.

    requirements.txt - Python module requirements that can be fed to pip to replicate our study.

    config.yml - Configuration file, including information on the Landsat products used.

  6. Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and...

    • frontiersin.figshare.com
    docx
    Updated Jan 15, 2025
    + more versions
    Cite
    Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang (2025). Data Sheet 7_Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.docx [Dataset]. http://doi.org/10.3389/frai.2024.1473837.s008
    Explore at:
    Available download formats: docx
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Frontiers
    Authors
    Xuehui Fan; Ruixue Ye; Yan Gao; Kaiwen Xue; Zeyu Zhang; Jing Xu; Jingpu Zhao; Jun Feng; Yulong Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.

    Methods: Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The imbalanced-learn library’s SMOTE technique was used to address the sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.

    Results: XGBoost achieved the highest overall accuracy of 80.21%, with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading.

    Conclusion: This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
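
    To make the described pipeline concrete, here is a hedged Python sketch following the Methods section (pandas preprocessing, 80/20 split, SMOTE on the training data only, then XGBoost); the file and column names are hypothetical, and the hyperparameters are library defaults rather than those used in the study:

```python
# Hedged sketch of the described pipeline; file/column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

df = pd.read_csv("rehab_outpatients.csv").dropna()   # hypothetical export; simplistic NA handling
X = pd.get_dummies(df.drop(columns=["preference"]))  # one-hot encode the categorical variables
y = LabelEncoder().fit_transform(df["preference"])   # target: preferred institution tier

# 80/20 train/test split, as in the study design.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Rebalance only the training data with SMOTE.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = XGBClassifier(eval_metric="logloss")
model.fit(X_res, y_res)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```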

  7. Open source software for HTS data analysis and their characteristics.

    • figshare.com
    xls
    Updated Jan 5, 2024
    Cite
    Carolina Nunes; Jasper Anckaert; Fanny De Vloed; Jolien De Wyn; Kaat Durinck; Jo Vandesompele; Frank Speleman; Vanessa Vermeirssen (2024). Open source software for HTS data analysis and their characteristics. [Dataset]. http://doi.org/10.1371/journal.pone.0296322.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Carolina Nunes; Jasper Anckaert; Fanny De Vloed; Jolien De Wyn; Kaat Durinck; Jo Vandesompele; Frank Speleman; Vanessa Vermeirssen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Open source software for HTS data analysis and their characteristics.
