10 datasets found
  1. WLCI - Important Agricultural Lands Assessment (Input Raster: Normalized Antelope Damage Claims)

    • catalog.data.gov
    • data.usgs.gov
    • +2 more
    Updated Oct 30, 2025
    Cite
    U.S. Geological Survey (2025). WLCI - Important Agricultural Lands Assessment (Input Raster: Normalized Antelope Damage Claims) [Dataset]. https://catalog.data.gov/dataset/wlci-important-agricultural-lands-assessment-input-raster-normalized-antelope-damage-claim
    Dataset updated
    Oct 30, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    The values in this raster are unit-less scores ranging from 0 to 1 that represent normalized dollars per acre damage claims from antelope on Wyoming lands. This raster is one of 9 inputs used to calculate the "Normalized Importance Index."

  2. 🔢🖊️ Digital Recognition: MNIST Dataset

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    Wasiq Ali (2025). 🔢🖊️ Digital Recognition: MNIST Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/digital-mnist-dataset
    Available download formats: zip (2278207 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    Wasiq Ali
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Handwritten Digits Pixel Dataset - Documentation

    Overview

    The Handwritten Digits Pixel Dataset is a collection of numerical data representing handwritten digits from 0 to 9. Unlike image datasets that store actual image files, this dataset contains pixel intensity values arranged in a structured tabular format, making it ideal for machine learning and data analysis applications.

    Dataset Description

    Basic Information

    • Format: CSV (Comma-Separated Values)
    • Total Samples: [Number of rows based on your dataset]
    • Features: 784 pixel columns (28Ɨ28 pixels) + 1 label column
    • Label Range: Digits 0-9
    • Pixel Value Range: 0-255 (grayscale intensity)

    File Structure

    Column Description

    • label: The target variable representing the digit (0-9)
    • pixel columns: 784 columns named in the format [row]x[column]
    • Each pixel column contains integer values from 0-255 representing grayscale intensity

    Data Characteristics

    Label Distribution

    The dataset contains handwritten digit samples with the following distribution:

    • Digit 0: [X] samples
    • Digit 1: [X] samples
    • Digit 2: [X] samples
    • Digit 3: [X] samples
    • Digit 4: [X] samples
    • Digit 5: [X] samples
    • Digit 6: [X] samples
    • Digit 7: [X] samples
    • Digit 8: [X] samples
    • Digit 9: [X] samples

    (Note: Actual distribution counts would be calculated from your specific dataset)

    Data Quality

    • Missing Values: No missing values detected
    • Data Type: All values are integers
    • Normalization: Pixel values range from 0-255 (can be normalized to 0-1 for ML models)
    • Consistency: Uniform 28Ɨ28 grid structure across all samples

    Technical Specifications

    Data Preprocessing Requirements

    • Normalization: Scale pixel values from 0-255 to 0-1 range
    • Reshaping: Convert 1D pixel arrays to 2D 28Ɨ28 matrices for visualization
    • Train-Test Split: Recommended 80-20 or 70-30 split for model development

    Recommended Machine Learning Approaches

    Classification Algorithms:

    • Random Forest
    • Support Vector Machines (SVM)
    • Neural Networks
    • K-Nearest Neighbors (KNN)

    Deep Learning Architectures:

    • Convolutional Neural Networks (CNNs)
    • Multi-layer Perceptrons (MLPs)

    Dimensionality Reduction:

    • PCA (Principal Component Analysis)
    • t-SNE for visualization

    Usage Examples

    Loading the Dataset

    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('/kaggle/input/handwritten_digits_pixel_dataset/mnist.csv')
    
    # Separate features and labels
    X = df.drop('label', axis=1)
    y = df['label']
    
    # Normalize pixel values
    X_normalized = X / 255.0
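    A minimal sketch, continuing from the loading snippet above, of the remaining preprocessing steps (reshaping a flattened 784-pixel row back to a 28Ɨ28 matrix for visualization and an 80-20 train-test split), followed by one of the suggested classifiers (K-Nearest Neighbors). The neighbour count and random seed are illustrative choices, not part of the dataset documentation.

    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    
    # Reshape one flattened row (784 values) into a 28x28 matrix and display it
    first_digit = X_normalized.iloc[0].to_numpy().reshape(28, 28)
    plt.imshow(first_digit, cmap='gray')
    plt.title(f"Label: {y.iloc[0]}")
    plt.show()
    
    # Recommended 80-20 train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X_normalized, y, test_size=0.2, random_state=42, stratify=y)
    
    # K-Nearest Neighbors, one of the recommended classification algorithms
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)
    print("Test accuracy:", accuracy_score(y_test, knn.predict(X_test)))
    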
    
  3. Data from: LVMED: Dataset of Latvian text normalisation samples for the medical domain

    • repository.clarin.lv
    Updated May 30, 2023
    Cite
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis (2023). LVMED: Dataset of Latvian text normalisation samples for the medical domain [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/85
    Dataset updated
    May 30, 2023
    Authors
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).

    Training dataset: 64,665 sentence pairs. Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.

    All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
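
    As an illustration only, a minimal sketch of reading the sentence pairs with pandas; the file name and the column names ("source" for the abbreviated sentence, "target" for the normalized one) are assumptions, since the actual CSV schema is not described here.

    import pandas as pd
    
    # Hypothetical file and column names; adjust to the actual CSV layout
    pairs = pd.read_csv("lvmed_train.csv")
    
    # Each row is assumed to pair a sentence containing abbreviations with its fully written-out form
    for _, row in pairs.head(3).iterrows():
        print("abbreviated:", row["source"])
        print("normalized: ", row["target"])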

  4. Potential Control Location - Integrated

    • usfs.hub.arcgis.com
    Updated Apr 16, 2024
    Cite
    U.S. Forest Service (2024). Potential Control Location - Integrated [Dataset]. https://usfs.hub.arcgis.com/maps/usfs::potential-control-location-integrated
    Dataset updated
    Apr 16, 2024
    Dataset provided by
    U.S. Department of Agriculture Forest Service (http://fs.fed.us/)
    Authors
    U.S. Forest Service
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    PCL boundaries are part of a wildfire transmission analysis that comprises the Tier I success metric within the Wildfire Crisis Strategy Prioritization Framework for the Mount Hood National Priority Landscape (NPL). This transmission analysis models the origin points of fires that burn communities, key infrastructure, and drinking water sources, how they move across the landscape, and what potential control lines they cross. Summarizing these results within Potential Operational Delineations (PODs) will allow prioritization of area-based treatments within PODs. Summarizing transmission along PCL boundaries will help support strategic decision-making, suppression effectiveness, and reduce firefighter exposure during wildfire response.

    The analysis utilizes fire simulation output from the Pacific Northwest Quantitative Wildland Fire Risk Assessment (QWRA) (McEvoy et al., 2023) conducted by Pyrologix, specifically the ignitions and associated perimeters from the Large Fire Simulator (FSIM), and select values data compiled for the QWRA Highly Valued Resources and Assets (HVRAs).

    PODs were selected from the Forest Service national feature service dataset on 3/21/2024. FSIM ignitions and perimeter data were based on the 12/15/2022 event set. HVRA point data was converted to a grid representing number of structures per grid cell (sum). HVRA line and polygon data was also converted to a grid where each grid cell was given a value of one. Grid resolution was ninety meters. Only ignitions-perimeters originating within PODs that intersected the analysis area were utilized.

    Potential Control Location Prioritization

    Potential Control Lines were prioritized by simply summing the perimeter HVRA counts (the number of values that intersect each perimeter) from those perimeters that intersected each PCL, and then normalizing/rescaling using the same process as above. It should be pointed out that there is no consideration of flow direction in this process; i.e., a PCL received the HVRA perimeter sum even if the HVRAs were not between the ignition and the PCL.

    Potential Operational Delineation Prioritization

    For each HVRA value, grid zonal statistics were performed on each perimeter to obtain the sum of total impacted, regardless of land ownership or analysis area boundary. The sum of structures per ignition (perimeter) was then divided by the total number of ignition iterations to normalize across Fire Occurrence Areas.

    Finally, the number of structures per ignition iteration were summed by POD, and then divided by the total number of ignitions within the POD to normalize across PODS and multiplied by ten thousand (majority FOA iterations) to rescale the value to represent a relative number of HVRA impacted per uncontrolled fire per POD. This value was then rescaled again to between zero and one so that HVRA could be combined based on relative importance and/or HVRA type.
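
    As a minimal sketch of the normalization chain described above, the arithmetic (sum of structures per POD, division by the number of ignitions within the POD, multiplication by ten thousand, and a final 0-1 rescale) can be written as follows; the per-POD totals are hypothetical values used only for illustration.

    import numpy as np
    
    # Hypothetical per-POD totals: structures impacted per ignition iteration, summed by POD
    structures_per_pod = np.array([12.5, 340.0, 87.2, 4.0, 51.9])
    ignitions_per_pod = np.array([210, 1800, 640, 95, 400])
    
    # Normalize across PODs and rescale to a relative number of HVRA impacted per uncontrolled fire
    relative_impact = structures_per_pod / ignitions_per_pod * 10_000
    
    # Final rescale to 0-1 so HVRA can be combined by relative importance and/or HVRA type
    rescaled = (relative_impact - relative_impact.min()) / (relative_impact.max() - relative_impact.min())
    print(rescaled.round(3))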

    2023 PNW QWRA Methods Report: https://oe.oregonexplorer.info/externalcontent/wildfire/PNW_QWRA_2023Methods.pdf

    Primary Data Contact: Ian Rickert, Regional Fire Planner, Forest Service R6/R10, ian.rickert@usda.gov

    Citations:

    McEvoy, Andy; Dunn, Christopher; Rickert, Ian. 2023 PNW Quantitative Wildfire Risk Assessment Methods (2023). [Unpublished Manuscript].

  5. Dataset for: Redefining Multi-Target Weather Forecasting with a novel Deep...

    • figshare.com
    csv
    Updated Aug 27, 2025
    Cite
    Md Anamul Kabir; Chatak Chakma (2025). Dataset for: Redefining Multi-Target Weather Forecasting with a novel Deep Learning model: Hierarchical Temporal Convolutional Long Short-Term Memory with Attention (HTC-LSTM-Attn) in Bangladesh [Dataset]. http://doi.org/10.6084/m9.figshare.29876786.v2
    Available download formats: csv
    Dataset updated
    Aug 27, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Md Anamul Kabir; Chatak Chakma
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Bangladesh
    Description

    The climate data set was compiled monthly for Bangladesh, from January 1961 to December 2022; it was generated from the Central Climate Information Management System of the BARC. The initial data consisted of measurements from 35 weather stations, covering a multitude of weather parameters that include solar radiation, potential evaporation (PE), evapotranspiration (ETo), maximum temperature, rainfall, humidity, wind speed, cloud cover, and sunshine duration. In the scope of the research titled "Redefining Multi-Target Weather Forecasting with a Novel Deep Learning Model: HTC-LSTM-Attn in Bangladesh," the dataset underwent several pre-processing steps to ensure its quality and suitability for deep learning-based forecasting, including:

    • Data consolidation: merging of multiple CSV files (solar radiation, PET, sunshine, wind speed, cloud coverage, humidity, rainfall, and temperature) into one dataset keyed by station code, year, and month.
    • Station filtering: eleven stations were excluded due to incomplete or unreliable records, retaining 24 stations representing various climate regions.
    • Outlier treatment: anomalies were detected by the interquartile range (IQR) method, and such values were replaced with the nearest valid value for the same station.
    • Missing value imputation: k-nearest neighbours (k=5) was applied for gap-filling.
    • Feature engineering: seasonal indicators, lag features, and rolling averages were added to account for temporal dependencies.
    • Feature selection: redundancy was reduced by removing highly correlated variables (Pearson's r > 0.9).
    • Normalization: numerical columns were scaled to between 0 and 1 using statistics calculated over the training set (a sketch of the outlier and scaling steps follows below).

    Usage: The processed dataset is optimized for deep learning-based weather forecasting models such as HTC-LSTM-Attn, but it can also be used for climate trend analysis, seasonal prediction, and meteorological research.
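
    A minimal sketch of two of the steps listed above, IQR-based outlier treatment and 0-1 scaling with training-set statistics; the column name and values are hypothetical, since the actual BARC file layout is not reproduced here.

    import numpy as np
    import pandas as pd
    
    # Hypothetical monthly series for one station; the real data are keyed by station code, year, and month
    df = pd.DataFrame({"rainfall": [210.0, 180.5, 5000.0, 195.2, 170.8, 220.1]})
    
    # Detect anomalies with the interquartile range (IQR) method
    q1, q3 = df["rainfall"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = (df["rainfall"] < q1 - 1.5 * iqr) | (df["rainfall"] > q3 + 1.5 * iqr)
    
    # Replace outliers with the nearest valid value (approximated here by forward/backward fill)
    df.loc[outliers, "rainfall"] = np.nan
    df["rainfall"] = df["rainfall"].ffill().bfill()
    
    # Scale to 0-1 using statistics calculated over the training portion only
    train = df["rainfall"].iloc[:4]
    df["rainfall_scaled"] = (df["rainfall"] - train.min()) / (train.max() - train.min())
    print(df)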

  6. Naturalistic Neuroimaging Database

    • openneuro.org
    Updated Apr 20, 2021
    Cite
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper (2021). Naturalistic Neuroimaging Database [Dataset]. http://doi.org/10.18112/openneuro.ds002837.v1.1.3
    Dataset updated
    Apr 20, 2021
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Sarah Aliko; Jiawen Huang; Florin Gheorghiu; Stefanie Meliss; Jeremy I Skipper
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Overview

    • The Naturalistic Neuroimaging Database (NNDb v2.0) contains datasets from 86 human participants doing the NIH Toolbox and then watching one of 10 full-length movies during functional magnetic resonance imaging (fMRI). The participants were all right-handed, native English speakers, with no history of neurological/psychiatric illnesses, with no hearing impairments, unimpaired or corrected vision, and taking no medication. Each movie was stopped in 40-50 minute intervals or when participants asked for a break, resulting in 2-6 runs of BOLD-fMRI. A 10-minute high-resolution defaced T1-weighted anatomical MRI scan (MPRAGE) is also provided.
    • The NNDb V2.0 is now on Neuroscout, a platform for fast and flexible re-analysis of (naturalistic) fMRI studies. See: https://neuroscout.org/

    v2.0 Changes

    • Overview
      • We have replaced our own preprocessing pipeline with that implemented in AFNI’s afni_proc.py, thus changing only the derivative files. This introduces a fix for an issue with our normalization (i.e., scaling) step and modernizes and standardizes the preprocessing applied to the NNDb derivative files. We have done a bit of testing and have found that results in both pipelines are quite similar in terms of the resulting spatial patterns of activity but with the benefit that the afni_proc.py results are 'cleaner' and statistically more robust.
    • Normalization

      • Emily Finn and Clare Grall at Dartmouth and Rick Reynolds and Paul Taylor at AFNI, discovered and showed us that the normalization procedure we used for the derivative files was less than ideal for timeseries runs of varying lengths. Specifically, the 3dDetrend flag -normalize makes 'the sum-of-squares equal to 1'. We had not thought through that an implication of this is that the resulting normalized timeseries amplitudes will be affected by run length, increasing as run length decreases (and maybe this should go in 3dDetrend’s help text). To demonstrate this, I wrote a version of 3dDetrend’s -normalize for R so you can see for yourselves by running the following code:
      # Generate a resting state (rs) timeseries (ts)
      # Install / load package to make fake fMRI ts
      # install.packages("neuRosim")
      library(neuRosim)
      # Generate a ts
      ts.rs <- simTSrestingstate(nscan=2000, TR=1, SNR=1)
      # 3dDetrend -normalize
      # R command version for 3dDetrend -normalize -polort 0 which normalizes by making "the sum-of-squares equal to 1"
      # Do for the full timeseries
      ts.normalised.long <- (ts.rs-mean(ts.rs))/sqrt(sum((ts.rs-mean(ts.rs))^2));
      # Do this again for a shorter version of the same timeseries
      ts.shorter.length <- length(ts.normalised.long)/4
      ts.normalised.short <- (ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))/sqrt(sum((ts.rs[1:ts.shorter.length]- mean(ts.rs[1:ts.shorter.length]))^2));
      # By looking at the summaries, it can be seen that the median values become  larger
      summary(ts.normalised.long)
      summary(ts.normalised.short)
      # Plot results for the long and short ts
      # Truncate the longer ts for plotting only
      ts.normalised.long.made.shorter <- ts.normalised.long[1:ts.shorter.length]
      # Give the plot a title
      title <- "3dDetrend -normalize for long (blue) and short (red) timeseries";
      plot(x=0, y=0, main=title, xlab="", ylab="", xaxs='i', xlim=c(1,length(ts.normalised.short)), ylim=c(min(ts.normalised.short),max(ts.normalised.short)));
      # Add zero line
      lines(x=c(-1,ts.shorter.length), y=rep(0,2), col='grey');
      # 3dDetrend -normalize -polort 0 for long timeseries
      lines(ts.normalised.long.made.shorter, col='blue');
      # 3dDetrend -normalize -polort 0 for short timeseries
      lines(ts.normalised.short, col='red');
      
    • Standardization/modernization

      • The above individuals also encouraged us to implement the afni_proc.py script over our own pipeline. It introduces at least three additional improvements: First, we now use Bob’s @SSwarper to align our anatomical files with an MNI template (now MNI152_2009_template_SSW.nii.gz) and this, in turn, integrates nicely into the afni_proc.py pipeline. This seems to result in a generally better or more consistent alignment, though this is only a qualitative observation. Second, all the transformations / interpolations and detrending are now done in fewer steps compared to our pipeline. This is preferable because, e.g., there is less chance of inadvertently reintroducing noise back into the timeseries (see Lindquist, Geuter, Wager, & Caffo 2019). Finally, many groups are advocating using tools like fMRIPrep or afni_proc.py to increase standardization of analysis practices in our neuroimaging community. This presumably results in less error, less heterogeneity and more interpretability of results across studies. Along these lines, the quality control (‘QC’) html pages generated by afni_proc.py are a real help in assessing data quality and almost a joy to use.
    • New afni_proc.py command line

      • The following is the afni_proc.py command line that we used to generate blurred and censored timeseries files. The afni_proc.py tool comes with extensive help and examples. As such, you can quickly understand our preprocessing decisions by scrutinising the below. Specifically, the following command is most similar to Example 11 for ‘Resting state analysis’ in the help file (see https://afni.nimh.nih.gov/pub/dist/doc/program_help/afni_proc.py.html):

      afni_proc.py \
        -subj_id "$sub_id_name_1" \
        -blocks despike tshift align tlrc volreg mask blur scale regress \
        -radial_correlate_blocks tcat volreg \
        -copy_anat anatomical_warped/anatSS.1.nii.gz \
        -anat_has_skull no \
        -anat_follower anat_w_skull anat anatomical_warped/anatU.1.nii.gz \
        -anat_follower_ROI aaseg anat freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
        -anat_follower_ROI aeseg epi freesurfer/SUMA/aparc.a2009s+aseg.nii.gz \
        -anat_follower_ROI fsvent epi freesurfer/SUMA/fs_ap_latvent.nii.gz \
        -anat_follower_ROI fswm epi freesurfer/SUMA/fs_ap_wm.nii.gz \
        -anat_follower_ROI fsgm epi freesurfer/SUMA/fs_ap_gm.nii.gz \
        -anat_follower_erode fsvent fswm \
        -dsets media_?.nii.gz \
        -tcat_remove_first_trs 8 \
        -tshift_opts_ts -tpattern alt+z2 \
        -align_opts_aea -cost lpc+ZZ -giant_move -check_flip \
        -tlrc_base "$basedset" \
        -tlrc_NL_warp \
        -tlrc_NL_warped_dsets \
          anatomical_warped/anatQQ.1.nii.gz \
          anatomical_warped/anatQQ.1.aff12.1D \
          anatomical_warped/anatQQ.1_WARP.nii.gz \
        -volreg_align_to MIN_OUTLIER \
        -volreg_post_vr_allin yes \
        -volreg_pvra_base_index MIN_OUTLIER \
        -volreg_align_e2a \
        -volreg_tlrc_warp \
        -mask_opts_automask -clfrac 0.10 \
        -mask_epi_anat yes \
        -blur_to_fwhm -blur_size $blur \
        -regress_motion_per_run \
        -regress_ROI_PC fsvent 3 \
        -regress_ROI_PC_per_run fsvent \
        -regress_make_corr_vols aeseg fsvent \
        -regress_anaticor_fast \
        -regress_anaticor_label fswm \
        -regress_censor_motion 0.3 \
        -regress_censor_outliers 0.1 \
        -regress_apply_mot_types demean deriv \
        -regress_est_blur_epits \
        -regress_est_blur_errts \
        -regress_run_clustsim no \
        -regress_polort 2 \
        -regress_bandpass 0.01 1 \
        -html_review_style pythonic

      We used similar command lines to generate ‘blurred and not censored’ and the ‘not blurred and not censored’ timeseries files (described more fully below). We will provide the code used to make all derivative files available on our github site (https://github.com/lab-lab/nndb).

      We made one choice above that is different enough from our original pipeline that it is worth mentioning here. Specifically, we have quite long runs, with the average being ~40 minutes but this number can be variable (thus leading to the above issue with 3dDetrend’s -normalise). A discussion on the AFNI message board with one of our team (starting here, https://afni.nimh.nih.gov/afni/community/board/read.php?1,165243,165256#msg-165256), led to the suggestion that '-regress_polort 2' with '-regress_bandpass 0.01 1' be used for long runs. We had previously used only a variable polort with the suggested 1 + int(D/150) approach. Our new polort 2 + bandpass approach has the added benefit of working well with afni_proc.py.

      Which timeseries file you use is up to you but I have been encouraged by Rick and Paul to include a sort of PSA about this. In Paul’s own words:

      • Blurred data should not be used for ROI-based analyses (and potentially not for ICA? I am not certain about standard practice).
      • Unblurred data for ISC might be pretty noisy for voxelwise analyses, since blurring should effectively boost the SNR of active regions (and even good alignment won't be perfect everywhere).
      • For uncensored data, one should be concerned about motion effects being left in the data (e.g., spikes in the data).
      • For censored data:
        • Performing ISC requires the users to unionize the censoring patterns during the correlation calculation.
        • If wanting to calculate power spectra or spectral parameters like ALFF/fALFF/RSFA etc. (which some people might do for naturalistic tasks still), then standard FT-based methods can't be used because sampling is no longer uniform. Instead, people could use something like 3dLombScargle+3dAmpToRSFC, which calculates power spectra (and RSFC params) based on a generalization of the FT that can handle non-uniform sampling, as long as the censoring pattern is mostly random and, say, only up to about 10-15% of the data.

      In sum, think very carefully about which files you use. If you find you need a file we have not provided, we can happily generate different versions of the timeseries upon request and can generally do so in a week or less.

    • Effect on results

      • From numerous tests on our own analyses, we have qualitatively found that results using our old vs the new afni_proc.py preprocessing pipeline do not change all that much in terms of general spatial patterns. There is, however, an
  7. Raw and normalized X-ray fluorescence (XRF) scannings of IODP Expedition...

    • doi.pangaea.de
    • search.dataone.org
    zip
    Updated May 2, 2012
    Cite
    Mitchell W Lyle; Ann E Holbourn; Thomas Westerhold; Ed C Hathorne; Shinya Yamamoto; Annette Olivarez Lyle; T J Gorgas; Katsunori Kimoto (2012). Raw and normalized X-ray fluorescence (XRF) scannings of IODP Expedition 321, Site U1338 [Dataset]. http://doi.org/10.1594/PANGAEA.780290
    Available download formats: zip
    Dataset updated
    May 2, 2012
    Dataset provided by
    PANGAEA
    Authors
    Mitchell W Lyle; Ann E Holbourn; Thomas Westerhold; Ed C Hathorne; Shinya Yamamoto; Annette Olivarez Lyle; T J Gorgas; Katsunori Kimoto
    License

    Attribution 3.0 (CC BY 3.0) (https://creativecommons.org/licenses/by/3.0/)
    License information was derived automatically

    Description

    We used X-ray fluorescence (XRF) scanning on Site U1338 sediments from Integrated Ocean Drilling Program Expedition 321 to measure sediment geochemical compositions at 2.5 cm resolution for the 450 m of the Site U1338 spliced sediment column. This spatial resolution is equivalent to ~2 k.y. age sampling in the 0-5 Ma section and ~1 k.y. resolution from 5 to 17 Ma. Here we report the data and describe data acquisition conditions to measure Al, Si, K, Ca, Ti, Fe, Mn, and Ba in the solid phase. We also describe a method to convert the data from volume-based raw XRF scan data to a normalized mass measurement ready for calibration by other geochemical methods. Both the raw and normalized data are reported along the Site U1338 splice.

  8. Animal 10

    • kaggle.com
    zip
    Updated Sep 28, 2023
    Cite
    Pham Tuyet (2023). Animal 10 [Dataset]. https://www.kaggle.com/datasets/phamtuyet/animal-10/suggestions
    Available download formats: zip (1372313431 bytes)
    Dataset updated
    Sep 28, 2023
    Authors
    Pham Tuyet
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Class index mapping: { 0: 'butterfly', 1: 'cat', 2: 'chicken', 3: 'cow', 4: 'dog', 5: 'elephant', 6: 'horse', 7: 'sheep', 8: 'spider', 9: 'squirrel' }

    Load data (PyTorch):

    ```
    import numpy as np
    import torch
    from torch.utils.data import Dataset as DT


    class Dataset(DT):
        def __init__(self, mode="train", size=224, augment=False, augment_rate=0.2,
                     normalize=False, random_state=42):
            super(Dataset, self).__init__()

            self.mode = mode
            self.size = size
            self.augment = augment
            self.augment_rate = augment_rate
            self.normalize = normalize
            self.random_state = random_state
            self.X = []
            self.Y = []
            self.labels = {
                0: 'butterfly',
                1: 'cat',
                2: 'chicken',
                3: 'cow',
                4: 'dog',
                5: 'elephant',
                6: 'horse',
                7: 'sheep',
                8: 'spider',
                9: 'squirrel'
            }

            self.load_data()

        def load_data(self):
            # Load the pre-converted NumPy arrays for the requested split
            if self.mode == "train":
                self.X = np.load("./data/trainX.npy")
                self.Y = np.load("./data/trainY.npy")
            else:
                self.X = np.load("./data/testX.npy")
                self.Y = np.load("./data/testY.npy")
            print(self.mode, "====", self.X.shape[0])

        def __len__(self):
            return self.X.shape[0]

        def __getitem__(self, index):
            # Return a (C, H, W) float tensor scaled to [0, 1] and a one-hot label
            image = self.X[index]
            label = [0] * 10
            label[self.Y[index]] = 1
            return (torch.tensor(image, dtype=torch.float32).permute(2, 0, 1) / 255.0,
                    torch.tensor(label, dtype=torch.float32))

        def sample(self, index=0):
            # Return the raw image array and one-hot label without tensor conversion
            image = self.X[index]
            label = [0] * 10
            label[self.Y[index]] = 1
            return image, label
    ```
    
  9. Predictive Analysis: Vehicle Prices

    • kaggle.com
    zip
    Updated Jul 29, 2024
    Cite
    Leonardo Galdino (2024). Predictive Analysis: Vehicle Prices [Dataset]. https://www.kaggle.com/datasets/leonardogaldinno/predictive-analysis-vehicle-price
    Available download formats: zip (375879 bytes)
    Dataset updated
    Jul 29, 2024
    Authors
    Leonardo Galdino
    License

    Open Data Commons Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)

    Description

    Project Description:

    In this project, I developed a linear regression model to predict car prices based on key features such as fuel tank capacity, width, length, and year of manufacture. The goal was to understand how these factors influence car prices and to assess the effectiveness of the model in making accurate predictions.

    Key Features:

    • Fuel Tank Capacity: The capacity of the car’s fuel tank.
    • Width: The width of the car.
    • Length: The length of the car.
    • Year: The year of manufacture of the car.

    Target Variable:

    Price: The price of the car, which is the primary variable being predicted.

    Methodology:

    Data Preparation:

    • Extracted relevant features and the target variable from the dataset.
    • Split the data into training and testing sets to evaluate model performance.

    Model Training:

    • Built and trained a Linear Regression model using the training dataset.
    • Evaluated the model using Mean Absolute Error (MAE) and R-squared (R²) metrics to gauge prediction accuracy and model fit.

    Feature Scaling:

    • Applied standard scaling to normalize the feature values, transforming them to have a mean of 0 and standard deviation of 1.
    • Retrained the model with the scaled data and reassessed its performance using the same evaluation metrics.

    Evaluation:

    • Compared the model's performance before and after scaling to determine the impact of feature normalization. MAE was used to measure the average prediction error, while R² indicated how well the model explained the variance in car prices.
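
    A minimal sketch of the methodology above (train-test split, standard scaling, linear regression, and MAE/R² evaluation); the file name and column names are assumptions based on the listed features, not the dataset's actual headers.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, r2_score
    
    # Hypothetical file and column names based on the features described above
    df = pd.read_csv("vehicle_prices.csv")
    X = df[["fuel_tank_capacity", "width", "length", "year"]]
    y = df["price"]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Standard scaling: mean 0 and standard deviation 1, fit on the training data only
    scaler = StandardScaler().fit(X_train)
    X_train_scaled, X_test_scaled = scaler.transform(X_train), scaler.transform(X_test)
    
    # Linear regression evaluated with MAE and R-squared
    model = LinearRegression().fit(X_train_scaled, y_train)
    predictions = model.predict(X_test_scaled)
    print("MAE:", mean_absolute_error(y_test, predictions))
    print("R²:", r2_score(y_test, predictions))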

    Visualization:

    • Created scatter plots to visualize the relationship between actual and predicted prices.
    • Added a reference line representing perfect predictions to highlight the model's accuracy.

    Results:

    • The project demonstrated the effectiveness of linear regression in predicting car prices.
    • Feature scaling improved model performance, as reflected in the evaluation metrics.
    • Visualizations provided a clear understanding of the model’s accuracy and areas for improvement.

    Technologies Used:

    • Python, with libraries such as pandas, NumPy, scikit-learn, and matplotlib for data manipulation, model building, and visualization.
  10. OCEAN model of Psychology

    • kaggle.com
    zip
    Updated Jun 11, 2025
    Cite
    Niyati Savant (2025). OCEAN model of Psychology [Dataset]. https://www.kaggle.com/datasets/niyatisavant/ocean-model-of-psychology/discussion
    Available download formats: zip (3157929 bytes)
    Dataset updated
    Jun 11, 2025
    Authors
    Niyati Savant
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Original Source

    https://www.kaggle.com/datasets/tunguz/big-five-personality-test

    The Transformation process

    https://colab.research.google.com/drive/1ZsS76ZsRjcL1tg_YvqEB_WlzvlmsiinP?usp=sharing

    Issues with original Dataset

    - Lack of Labels: The original dataset did not categorize the responses into specific personality traits, making it impossible to directly train a supervised machine learning model.
    - Complexity in Interpretation: Although raw scores range from 1 to 5, they were not directly interpretable as personality traits since different numbers of positively and negatively keyed questions meant the maximum score for each trait was different.

    Transformation Process:

    To overcome these challenges, I undertook the following process to convert this un-labelled data into a labelled format:
    - Scoring Mechanism: I calculated scores for each of the five personality traits based on the respondent's answers to relevant questions. For each trait, a total score was computed by summing the individual question scores, taking into account whether the question was positively or negatively keyed.
    - Normalization and Scaling: To ensure consistency and comparability across traits, I applied a Min-Max Scaler to normalize the scores to a range of 0 to 1. This step was crucial for creating uniform labels that could be used effectively in machine learning models (a sketch follows below).
    - Label Assignment: Based on the scaled scores, I assigned labels to each respondent, categorizing them from the highest to lowest for each personality trait.
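
    A minimal sketch of the scoring, Min-Max scaling, and label assignment steps described above; the trait totals are hypothetical, and labelling each respondent by their highest-scaled trait is one possible reading of the label assignment step rather than the notebook's exact procedure.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler
    
    # Hypothetical raw trait totals (sums of positively and negatively keyed item scores per respondent)
    scores = pd.DataFrame({
        "openness": [34, 41, 28],
        "conscientiousness": [30, 22, 39],
        "extraversion": [18, 45, 33],
        "agreeableness": [40, 37, 25],
        "neuroticism": [27, 31, 44],
    })
    
    # Min-Max scaling to 0-1 so traits with different maximum possible scores become comparable
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(scores), columns=scores.columns)
    
    # Assign each respondent a label based on their highest scaled trait (illustrative choice)
    labels = scaled.idxmax(axis=1)
    print(scaled.round(2))
    print(labels)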

    Application of the Transformed Data:

    The labelled data can play a pivotal role in training machine learning algorithms to predict personality traits from new user responses. With the transformed dataset, the user can:
    - Develop a supervised learning model: the labels enable classification algorithms, such as Logistic Regression and Support Vector Machines, to predict personality traits with high accuracy.
    - Cluster for insights: clustering algorithms can also be applied to the scaled data to uncover patterns and group users with similar personality profiles, enhancing the interpretability of the model outputs.
