13 datasets found
  1. FastQuantileLayerKeras

    • kaggle.com
    zip
    Updated Jan 14, 2021
    Cite
    Erik (2021). FastQuantileLayerKeras [Dataset]. https://www.kaggle.com/snippsy/fastquantilelayerkeras
    Explore at:
    Available download formats: zip (9909 bytes)
    Dataset updated
    Jan 14, 2021
    Authors
    Erik
    Description

    Equivalent to sklearn.preprocessing.QuantileTransformer as a Keras layer.
  2. Perovskite Solar Cells Ageing Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jul 26, 2023
    Cite
    Noor Titan Putri Hartono; Hans Köbler; Paolo Graniero; Mark Khenkin; Rutger Schlatmann; Carolin Ulbrich; Antonio Abate (2023). Perovskite Solar Cells Ageing Dataset [Dataset]. http://doi.org/10.5281/zenodo.8185883
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Noor Titan Putri Hartono; Hans Köbler; Paolo Graniero; Mark Khenkin; Rutger Schlatmann; Carolin Ulbrich; Antonio Abate
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 2,245 cleaned ageing test traces (time vs. MPPT PCE, i.e. maximum power point tracking power conversion efficiency) for perovskite solar cells with various device stacks and architectures, in pickle (.pkl) format.

    The dataset can be loaded in Python with the following commands:

    import pickle5 as pickle
    import pandas as pd 
    import numpy as np
    
    with open('20230303_mySeriesDrop.pkl', "rb") as fh:
      mySeriesDrop = pickle.load(fh)

    The following command returns a specific row (here row 0) of the dataset:

    mySeriesDrop[0]

    The next steps are scaling/normalisation (for instance with sklearn.preprocessing.MaxAbsScaler) and smoothing (for instance with a Savitzky-Golay filter), as sketched below.
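    For illustration, here is a minimal sketch of those two steps (not the authors' exact pipeline). It assumes each entry of mySeriesDrop is a one-dimensional sequence of PCE values; the Savitzky-Golay window length and polynomial order are placeholders.

    import numpy as np
    from sklearn.preprocessing import MaxAbsScaler
    from scipy.signal import savgol_filter

    trace = mySeriesDrop[0]                                 # one ageing trace
    values = np.asarray(trace, dtype=float).reshape(-1, 1)  # column vector for the scaler

    scaled = MaxAbsScaler().fit_transform(values).ravel()   # scale by the maximum absolute value
    smoothed = savgol_filter(scaled, window_length=51, polyorder=3)  # window must not exceed the trace length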

    The code to run the complete analysis, including self-organising map clustering, can be accessed here: https://doi.org/10.5281/zenodo.8181602.

  3. Diabetes_Dataset_1.1

    • kaggle.com
    zip
    Updated Nov 2, 2023
    Cite
    KIRANMAYI G 777 (2023). Diabetes_Dataset_1.1 [Dataset]. https://www.kaggle.com/datasets/kiranmayig777/diabetes-dataset-1-1/code
    Explore at:
    Available download formats: zip (779755 bytes)
    Dataset updated
    Nov 2, 2023
    Authors
    KIRANMAYI G 777
    Description

    import pandas as pd
    import numpy as np

    PERFORMING EDA

    data.head()
    data.info()

    attributes_data = data.iloc[:, 1:]
    attributes_data

    attributes_data.describe()
    attributes_data.corr()

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Calculate correlation matrix
    correlation_matrix = attributes_data.corr()
    plt.figure(figsize=(18, 10))

    # Create a heatmap
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
    plt.show()

    CHECKING IF DATASET IS LINEAR OR NON-LINEAR

    # Calculate correlations between target and predictor columns
    correlations = data.corr()['Diabetes_binary'].drop('Diabetes_binary')

    # Create a bar chart
    plt.figure(figsize=(10, 6))
    correlations.plot(kind='bar')
    plt.xlabel('Predictor Columns')
    plt.ylabel('Correlation values')
    plt.title('Correlation between Diabetes_binary and Predictors')
    plt.show()

    CHECKING FOR NULL AND MISSING VALUES, CLEANING THEM

    # Count the number of null values in each column
    print(data.isnull().sum())

    # Check for missing values in all columns
    print(data.isna().sum())

    LASSO

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV, KFold

    X = data.iloc[:, 1:]
    y = data.iloc[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # GridSearchCV is used to find the optimal combination of hyperparameters for a given model,
    # so in the end we can select the best parameters from the listed hyperparameters.
    parameters = {"alpha": np.arange(0.00001, 10, 500)}
    kfold = KFold(n_splits=10, shuffle=True, random_state=42)
    lassoReg = Lasso()
    lasso_cv = GridSearchCV(lassoReg, param_grid=parameters, cv=kfold)
    lasso_cv.fit(X, y)
    print("Best Params {}".format(lasso_cv.best_params_))

    column_names = list(data)
    column_names = column_names[1:]
    column_names

    lassoModel = Lasso(alpha=0.00001)
    lassoModel.fit(X_train, y_train)
    lasso_coeff = np.abs(lassoModel.coef_)  # making all coefficients positive
    plt.bar(column_names, lasso_coeff, color='orange')
    plt.xticks(rotation=90)
    plt.grid()
    plt.title("Feature Selection Based on Lasso")
    plt.xlabel("Features")
    plt.ylabel("Importance")
    plt.ylim(0, 0.16)
    plt.show()

    RFE

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    from sklearn.feature_selection import RFECV
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier()
    rfecv = RFECV(estimator=model, step=1, cv=20, scoring="accuracy")
    rfecv = rfecv.fit(X_train, y_train)

    num_features_selected = len(rfecv.ranking_)

    # Cross-validation scores (note: rfecv.ranking_ holds feature rankings, not CV scores)
    cv_scores = rfecv.ranking_

    # Plotting the number of features vs. score
    plt.figure(figsize=(10, 6))
    plt.xlabel("Number of features selected")
    plt.ylabel("Score (accuracy)")
    plt.plot(range(1, num_features_selected + 1), cv_scores, marker='o', color='r')
    plt.xticks(range(1, num_features_selected + 1))  # Set x-ticks to integers
    plt.grid()
    plt.title("RFECV: Number of Features vs. Score (accuracy)")
    plt.show()

    print("The optimal number of features:", rfecv.n_features_)
    print("Best features:", X_train.columns[rfecv.support_])

    PCA

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = data.drop(["Diabetes_binary"], axis=1)
    y = data["Diabetes_binary"]

    df1 = pd.DataFrame(data=data, columns=data.columns)
    print(df1)

    scaling = StandardScaler()
    scaling.fit(df1)
    Scaled_data = scaling.transform(df1)
    principal = PCA(n_components=3)
    principal.fit(Scaled_data)
    x = principal.transform(Scaled_data)
    print(x.shape)

    principal.components_

    plt.figure(figsize=(10, 10))
    plt.scatter(x[:, 0], x[:, 1], c=data['Diabetes_binary'], cmap='plasma')
    plt.xlabel('pc1')
    plt.ylabel('pc2')

    print(principal.explained_variance_ratio_)

    T-SNE

    from sklearn.manifold import TSNE
    from numpy import reshape
    import seaborn as sns

    tsne = TSNE(n_components=3, verbose=1, random_state=42)
    z = tsne.fit_transform(X)

    df = pd.DataFrame()
    df["y"] = y
    df["comp-1"] = z[:, 0]
    df["comp-2"] = z[:, 1]
    df["comp-3"] = z[:, 2]
    sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                    palette=sns.color_palette("husl", 2),
                    data=df).set(title="Diabetes data T-SNE projection")

  4. Data Set for Probabilistic Indoor Temperature Forecasting

    • zenodo.org
    bin
    Updated Oct 16, 2024
    Cite
    Roman Kempf; Marcel Arpogaus; Tim Baur; Gunnar Schubert (2024). Data Set for Probabilistic Indoor Temperature Forecasting [Dataset]. http://doi.org/10.5281/zenodo.11911791
    Explore at:
    Available download formats: bin
    Dataset updated
    Oct 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Roman Kempf; Marcel Arpogaus; Tim Baur; Gunnar Schubert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Dataset Manifest

    This text provides a description of the dataset used for model training and evaluation in our study "A Tutorial on Deep Learning for Probabilistic Indoor Temperature Forecasting". The dataset consists of various simulated thermal and environmental parameters for different room configurations. Below, you will find a table detailing each column in the dataset along with its description and unit of measurement.

    1.1. Columns Description

    Column Name | Description | Unit
    time | Time stamp of the measurement | -
    ZweiPersonenBuero.TAir | Air temperature inside a two-person office | °C
    heatStat.Heat.Q_flow | Heating rate in the room | W
    weaDat.AirPressure | Atmospheric pressure | Pa
    weaDat.AirTemp | Outside air temperature | °C
    weaDat.SkyRadiation | Longwave sky radiation | W/m²
    weaDat.TerrestrialRadiation | Terrestrial radiation | W/m²
    weaDat.WaterInAir | Absolute humidity | g/kg
    VAir | Air volume in the room |
    AExt0 | Exterior wall area facing the south | m²
    AExt1 | Exterior wall area facing the north | m²
    AInt | Total interior wall area | m²
    AFloor | Floor area of the room | m²
    AWin0 | Window area facing the south | m²
    AWin1 | Window area facing the north | m²
    azi0 | Azimuth (direction) of the first exterior wall | rad
    azi1 | Azimuth (direction) of the second exterior wall | rad
    id | Unique identifier for the room configuration | -
    is_holiday | Indicator whether the day is a holiday (1 for yes, 0 for no) | -

    1.2. Note on Multi-Value Columns

    For rooms with multiple exterior walls (rooms 15-30):

    • AExt: {Exterior wall 1 area, Exterior wall 2 area}
    • AWin: {Window area on exterior wall 1, Window area on exterior wall 2}
    • azi: {Azimuth of exterior wall 1, Azimuth of exterior wall 2}

    Example:

    • AExt = {10, 15}
    • AWin = {2, 0}
    • azi = {0, 3.1415}

    This indicates two exterior walls with areas of 10 m² and 15 m² facing south (0 rad) and north (3.1415 rad), respectively. The south-facing wall has a window of 2 m², while the north-facing wall has no window.

    1.3. Data Sources

    • Room Model: Simulated using the reduced-order package of the Modelica Buildings Library.
    • Weather Data: Provided by the German Meteorological Service (DWD) in Test Reference Year (TRY) format.

    This comprehensive dataset provides crucial parameters required to train and evaluate thermal models for different room configurations. The simulation data ensures a diverse range of environmental and occupancy conditions, enhancing the robustness of the models.

    1.4. Data scaling

    The data set contains the raw data as well as the scaled data used for training and testing the model. The scaling was carried out using the StandardScaler package.
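    A hedged sketch of that scaling step, assuming scikit-learn's StandardScaler and a raw table with the columns listed above (the file name and the selected columns are purely illustrative):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Hypothetical file name; the actual files in this record are binary
    raw = pd.read_csv("room_data_raw.csv", parse_dates=["time"])

    feature_cols = ["ZweiPersonenBuero.TAir", "heatStat.Heat.Q_flow", "weaDat.AirTemp"]
    scaler = StandardScaler()
    raw[feature_cols] = scaler.fit_transform(raw[feature_cols])  # zero mean, unit variance per column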

    1.5. Weather data license

    This data set contains weather data recorded by the DWD under license „Datenlizenz Deutschland – Namensnennung – Version 2.0" (URL). The data is provided by "Bundesinstitut für Bau-, Stadt- und Raumforschung". The data can be downloaded from here. We use data from the year 2015 from Heilbronn. We have added the weather data to the data set unchanged.

  5. Facial_expresson

    • huggingface.co
    Updated Aug 11, 2024
    Cite
    Suraj Malhari Jadhav (2024). Facial_expresson [Dataset]. https://huggingface.co/datasets/surajjadhav11/Facial_expresson
    Explore at:
    Dataset updated
    Aug 11, 2024
    Authors
    Suraj Malhari Jadhav
    Description

    To use the Facial_expresson dataset:

    1. Clone it.
    2. Create a Python 3.9 environment.
    3. Install tensorflow, cv2 (opencv-python), sklearn (scikit-learn) and keras with pip.
    4. Run everything in the same kernel environment (it takes time).
    5. The process starts by collecting and preprocessing datasets of facial expressions captured in different contexts.

  6. Household Energy Consumption

    • kaggle.com
    zip
    Updated Apr 5, 2025
    Cite
    Samharison (2025). Household Energy Consumption [Dataset]. https://www.kaggle.com/samxsam/household-energy-consumption
    Explore at:
    Available download formats: zip (748210 bytes)
    Dataset updated
    Apr 5, 2025
    Authors
    Samharison
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🏡 Household Energy Consumption - April 2025 (90,000 Records)

    📌 Overview

    This dataset presents detailed energy consumption records from various households over the month of April 2025. With 90,000 rows and features such as temperature, household size, air conditioning usage, and peak hour consumption, it is well suited to time-series analysis, machine learning, and sustainability research.

    Column Name | Data Type Category | Description
    Household_ID | Categorical (Nominal) | Unique identifier for each household
    Date | Datetime | The date of the energy usage record
    Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh
    Household_Size | Numerical (Discrete) | Number of individuals living in the household
    Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius
    Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No)
    Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh

    📂 Dataset Summary

    • Rows: 90,000
    • Time Range: April 1, 2025 – April 30, 2025
    • Data Granularity: Daily per household
    • Location: Simulated global coverage
    • Format: CSV (Comma-Separated Values)

    📚 Libraries Used for Working with household_energy_consumption_2025.csv

    🔍 1. Data Manipulation & Analysis

    Library | Purpose
    pandas | Reading, cleaning, and transforming tabular data
    numpy | Numerical operations, working with arrays

    📊 2. Data Visualization

    Library | Purpose
    matplotlib | Creating static plots (line, bar, histograms, etc.)
    seaborn | Statistical visualizations, heatmaps, boxplots, etc.
    plotly | Interactive charts (time series, pie, bar, scatter, etc.)

    📈 3. Machine Learning / Modeling

    Library | Purpose
    scikit-learn | Preprocessing, regression, classification, clustering
    xgboost / lightgbm | Gradient boosting models for better accuracy

    🧹 4. Data Preprocessing

    Library | Purpose
    sklearn.preprocessing | Encoding categorical features, scaling, normalization
    datetime / pandas | Date-time conversion and manipulation

    🧪 5. Model Evaluation

    Library | Purpose
    sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc.

    ✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.
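    As a minimal illustration of that workflow, here is a hedged quick-start sketch; it assumes the CSV ships under the file name shown above and uses only the columns listed in the table:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("household_energy_consumption_2025.csv", parse_dates=["Date"])

    # Average daily consumption across all households
    daily = df.groupby("Date")["Energy_Consumption_kWh"].mean()
    daily.plot(title="Average daily energy consumption (April 2025)")
    plt.ylabel("kWh")
    plt.show()

    # Compare households with and without air conditioning
    print(df.groupby("Has_AC")["Energy_Consumption_kWh"].mean())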

    📈 Potential Use Cases

    This dataset is ideal for a wide variety of analytics and machine learning projects:

    🔮 Forecasting & Time Series Analysis

    • Predict future household energy consumption based on previous trends and weather conditions.
    • Identify seasonal and daily consumption patterns.

    💡 Energy Efficiency Analysis

    • Analyze differences in energy consumption between households with and without air conditioning.
    • Compare energy usage efficiency across varying household sizes.

    🌡️ Climate Impact Studies

    • Investigate how temperature affects electricity usage across households.
    • Model the potential impact of climate change on residential energy demand.

    🔌 Peak Load Management

    • Build models to predict and manage energy demand during peak hours.
    • Support research on smart grid technologies and dynamic pricing.

    🧠 Machine Learning Projects

    • Supervised learning (regression/classification) to predict energy consumption.
    • Clustering households by usage patterns for targeted energy programs.
    • Anomaly detection in energy usage for fault detection.

    🛠️ Example Starter Projects

    • Time-series forecasting using Facebook Prophet or ARIMA
    • Regression modeling using XGBoost or LightGBM
    • Classification of AC vs. non-AC household behavior
    • Energy-saving recommendation systems
    • Heatmaps of temperature vs. energy usage
  7. Prediction of Personality Traits using the Big 5 Framework

    • zenodo.org
    csv, text/x-python
    Updated Feb 2, 2023
    Cite
    Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
    Explore at:
    Available download formats: text/x-python, csv
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Neelima Brahmbhatt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The methodology is the core component of any research-related work: it documents the methods used to obtain the results. Here, the whole implementation is done in Python. The work involves the following steps:

    1. Acquire Personality Dataset

    Kaggle hosts a collection of machine learning datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive on-line personality test constructed from the IPIP, and can be downloaded as a zip file by clicking on the link provided. The dataset consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is then carried out to check for inconsistent behaviour or trends.

    2. Data preprocessing

    After data acquisition, the next step is to clean and preprocess the data. The available features are numerical. The target value is a five-level personality label consisting of serious, lively, responsible, dependable & extraverted. The preprocessed dataset is split into training and testing sets by passing the feature values, target values, and test size to the train_test_split method of the scikit-learn package. After splitting, the training data is used to fit the Logistic Regression & SVM models, and the test data is used to estimate the accuracy of the trained models.

    3. Feature Extraction

    The following items were presented on one page and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

            EXT1 I am the life of the party.
            EXT2  I don't talk a lot.
            EXT3  I feel comfortable around people.
            EXT4  I am quiet around strangers.
            EST1  I get stressed out easily.
            EST2  I get irritated easily.
            EST3  I worry about things.
            EST4  I change my mood a lot.
            AGR1  I have a soft heart.
            AGR2  I am interested in people.
            AGR3  I insult people.
            AGR4  I am not really interested in others.
            CSN1  I am always prepared.
            CSN2  I leave my belongings around.
            CSN3  I follow a schedule.
            CSN4  I make a mess of things.
            OPN1  I have a rich vocabulary.
            OPN2  I have difficulty understanding abstract ideas.
            OPN3  I do not have a good imagination.
            OPN4  I use difficult words.

    4. Training the Model

    Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets, a training set and a testing set, here 80% for training and 20% for testing, and train the model on the training set. In this project the models were trained using linear_model.LogisticRegression() & svm.SVC() from the sklearn package.

    5. Personality Prediction Output

    After training, the Logistic Regression & SVM models are evaluated on the test data using cohen_kappa_score & accuracy_score, as sketched below.
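    A hedged sketch of steps 4 and 5 (not the author's exact script), assuming a feature matrix X and target vector y prepared from the preprocessed dataset:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, cohen_kappa_score

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    for model in (LogisticRegression(max_iter=1000), SVC()):
        model.fit(X_train, y_train)         # train on the 80% split
        y_pred = model.predict(X_test)      # evaluate on the 20% split
        print(type(model).__name__,
              "accuracy:", accuracy_score(y_test, y_pred),
              "kappa:", cohen_kappa_score(y_test, y_pred))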

  8. AE Credit ID Encoded Dataset [FP16]

    • kaggle.com
    zip
    Updated May 26, 2022
    + more versions
    Cite
    Anthony Chiu (2022). AE Credit ID Encoded Dataset [FP16] [Dataset]. https://www.kaggle.com/datasets/kingychiu/ae-credit-id-encoded-dataset-fp16/discussion
    Explore at:
    Available download formats: zip (4242611283 bytes)
    Dataset updated
    May 26, 2022
    Authors
    Anthony Chiu
    Description

    Dataset Creation Code:

    import pandas as pd
    import numpy as np
    import gc
    from sklearn.preprocessing import LabelEncoder
    import pickle
    pickle.HIGHEST_PROTOCOL = 4
    
    BASE_PATH = "./datasets/amex-default-prediction"
    # Index file
    train_data_index = pd.read_csv(f"{BASE_PATH}/train_labels.csv")
    test_data_index = pd.read_csv(f"{BASE_PATH}/sample_submission.csv")
    
    print(train_data_index.shape, test_data_index.shape)
    
    all_ids = np.concatenate([train_data_index["customer_ID"], test_data_index["customer_ID"]])
    print(len(all_ids))
    
    # Train an id encoder and save it.
    id_encoder = LabelEncoder()
    id_encoder.fit(all_ids)
    np.save("id_encodings.npy", id_encoder.classes_)
    
    # Make sure we can load it back
    loaded_encoder = LabelEncoder()
    loaded_encoder.classes_ = np.load("id_encodings.npy", allow_pickle=True)
    assert (id_encoder.classes_ == loaded_encoder.classes_).all()
    
    # Make sure we can reverse it (1-index)
    print(loaded_encoder.inverse_transform([1, 2]))
    print(train_data_index["customer_ID"].values[0: 2])
    del loaded_encoder
    
    
    
    train_data_index["customer_ID"] = id_encoder.transform(train_data_index["customer_ID"])
    test_data_index["customer_ID"] = id_encoder.transform(test_data_index["customer_ID"])
    
    # Encode the index files
    train_data_index.to_pickle("id_encoded_train_labels.pkl", protocol=4)
    test_data_index.to_pickle("id_encoded_sample_submission.pkl", protocol=4)
    
    
    del train_data_index
    del test_data_index
    gc.collect()
    
    # Test files are too large for a Kaggle Notebook
    main_train = pd.read_csv(
      f"{BASE_PATH}/train_data.csv"
    )
    main_test = pd.read_csv(
      f"{BASE_PATH}/test_data.csv"
    )
    
    main_files = [
      main_train, 
      main_test,
    ]
    for main_file in main_files:
      print(main_file.shape)
      main_file["customer_ID"] = id_encoder.transform(main_file["customer_ID"])
    
    main_train.to_pickle("id_encoded_train_data.pkl", protocol=4)
    main_test.to_pickle("id_encoded_test_data.pkl", protocol=4)
    
    def reduce_mem_usage(df, use_fp16=False):
      """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.    
      """
      start_mem = df.memory_usage().sum() / 1024**2
      print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
      
      for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
          c_min = df[col].min()
          c_max = df[col].max()
          if str(col_type)[:3] == 'int':
            if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
              df[col] = df[col].astype(np.int8)
            elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
              df[col] = df[col].astype(np.int16)
            elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
              df[col] = df[col].astype(np.int32)
            elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
              df[col] = df[col].astype(np.int64) 
          else:
            if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
              if use_fp16:
                df[col] = df[col].astype(np.float16)
              else:
                df[col] = df[col].astype(np.float32)
            elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
              df[col] = df[col].astype(np.float32)
            else:
              df[col] = df[col].astype(np.float64)
        else:
          df[col] = df[col].astype('category')
    
      end_mem = df.memory_usage().sum() / 1024**2
      print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
      print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
      
      return df
    
    
    main_train = pd.read_pickle("id_encoded_train_data.pkl")
    main_test = pd.read_pickle("id_encoded_test_data.pkl")
    
    reduce_mem_usage(main_train)
    reduce_mem_usage(main_test)
    main_train.to_pickle("id_encoded_fp32_train_data.pkl", protocol=4)
    main_test.to_pickle("id_encoded_fp32_test_data.pkl", protocol=4)
    
    main_train = pd.read_pickle("id_encoded_train_data.pkl")
    main_test = pd.read_pickle("id_encoded_test_data.pkl")
    
    reduce_mem_usage(main_train, use_fp16=True)
    reduce_mem_usage(main_test, use_fp16=True)
    main_train.to_pickle("id_encoded_fp16_train_data.pkl", protocol=4)
    main_test.to_pickle("id_encoded_fp16_test_data.pkl", protocol=4)
    
    
  9. GoEmotions (UA) – Emotion Classification Dataset

    • kaggle.com
    zip
    Updated Nov 30, 2025
    Cite
    Oleksii Chumak (2025). GoEmotions (UA) – Emotion Classification Dataset [Dataset]. https://www.kaggle.com/datasets/oleksiichumak/goemotions-ua-emotion-classification-dataset
    Explore at:
    Available download formats: zip (4527621 bytes)
    Dataset updated
    Nov 30, 2025
    Authors
    Oleksii Chumak
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    GoEmotions Ukrainian Dataset

    Ukrainian translation of the GoEmotions dataset for emotion classification in text.

    Dataset Description

    This dataset is a high-quality Ukrainian translation of Google's GoEmotions dataset, which contains Reddit comments labeled with 28 emotion categories.

    Translation Methodology

    • Model: Helsinki-NLP/opus-mt-en-uk - specialized English-Ukrainian translation model
    • Post-processing: Manual quality tuning and refinement to ensure natural Ukrainian phrasing
    • Quality: 100% Ukrainian text with natural, context-aware translations

    Dataset Statistics

    • Total samples: 54,263 Reddit comments
    • Language: Ukrainian (translated from English)
    • Emotion categories: 27 + neutral (28 labels in total)
    • Splits: Train (43,410), Validation (5,426), Test (5,427)
    • Task type: Multi-label classification (texts can have multiple emotions)

    Emotion Categories

    The dataset includes 28 emotion categories:

    Category | Ukrainian | Category | Ukrainian
    admiration | захоплення | amusement | розвага
    anger | гнів | annoyance | роздратування
    approval | схвалення | caring | турбота
    confusion | розгубленість | curiosity | цікавість
    desire | бажання | disappointment | розчарування
    disapproval | несхвалення | disgust | відраза
    embarrassment | збентеження | excitement | збудження
    fear | страх | gratitude | вдячність
    grief | горе | joy | радість
    love | любов | nervousness | нервозність
    optimism | оптимізм | pride | гордість
    realization | усвідомлення | relief | полегшення
    remorse | каяття | sadness | сум
    surprise | здивування | neutral | нейтрально

    File Structure

    CSV Format

    The dataset is provided in CSV format with the following columns:

    text,text_uk,labels,id,split
    
    • text: Original English text
    • text_uk: Ukrainian translation
    • labels: List of emotion label indices (0-27, multi-label)
    • id: Unique identifier
    • split: Data split (train/validation/test)

    Example

    text,text_uk,labels,id,split
    "My favourite food is anything I didn't have to cook myself.","Моя улюблена їжа - це все, що я не мусив сам готувати.",[27],eebbqej,train
    

    Usage

    Loading the Dataset

    import pandas as pd
    
    # Load dataset
    df = pd.read_csv('goemotions_uk.csv')
    
    # Parse labels
    import ast
    df['labels'] = df['labels'].apply(ast.literal_eval)
    
    # Split by data split
    train_df = df[df['split'] == 'train']
    val_df = df[df['split'] == 'validation']
    test_df = df[df['split'] == 'test']
    

    Multi-label Classification

    from sklearn.preprocessing import MultiLabelBinarizer
    
    # Convert labels to multi-hot encoding
    mlb = MultiLabelBinarizer(classes=list(range(28)))
    mlb.fit([list(range(28))])
    
    train_labels = mlb.transform(train_df['labels'])
    val_labels = mlb.transform(val_df['labels'])
    test_labels = mlb.transform(test_df['labels'])
    

    With Transformers

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    
    # Use multilingual models
    model_name = "xlm-roberta-base" # or "TurkuNLP/bert-base-ukrainian-cased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
      model_name, 
      num_labels=28,
      problem_type="multi_label_classification"
    )
    
    # Tokenize Ukrainian text
    encodings = tokenizer(
      train_df['text_uk'].tolist(),
      truncation=True,
      padding=True,
      max_length=128
    )
    

    Applications

    This dataset is suitable for:

    • Emotion detection in Ukrainian social media and text
    • Sentiment analysis with fine-grained emotional categories
    • Multi-label text classification research
    • Ukrainian NLP model development and evaluation
    • Cross-lingual emotion recognition studies

    Citation

    If you use this dataset, please cite the original GoEmotions paper:

    @inproceedings{demszky2020goemotions,
     title={{GoEmotions: A Dataset of Fine-Grained Emotions}},
     author={Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
     booktitle={58th Annual Meeting of the Association for Computational Linguistics (ACL)},
     year={2020}
    }
    


  10. working with pipeline

    • kaggle.com
    Updated Sep 2, 2025
    Cite
    Fiza Aslam1 (2025). working with pipeline [Dataset]. https://www.kaggle.com/datasets/fizaaslam12/working-with-pipeline
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fiza Aslam1
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🚀 Feature Engineering with Scikit-Learn (Titanic Case Study)

    This dataset + notebooks demonstrate feature engineering and ML pipelines on the Titanic dataset.
    It includes both manual preprocessing (without pipelines) and end-to-end pipelines using Scikit-Learn.

    📌 About

    Feature Engineering is a crucial step in Machine Learning.
    In this project, I show:
    • Handling missing values with SimpleImputer
    • Encoding categorical variables with OneHotEncoder
    • Building models manually vs using Pipeline
    • Saving models and pipelines with pickle
    • Making predictions with and without pipelines
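    A hedged sketch of what such a pipeline can look like (an illustrative reconstruction, not the notebook's exact code; the model choice is an assumption and the columns follow the standard Titanic train.csv schema):

    import pickle
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("train.csv")
    X = df[["Age", "SibSp", "Parch", "Fare", "Sex", "Embarked"]]
    y = df["Survived"]

    preprocess = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["Age", "SibSp", "Parch", "Fare"]),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Sex", "Embarked"]),
    ])

    pipe = Pipeline([("preprocess", preprocess), ("model", LogisticRegression(max_iter=1000))])
    pipe.fit(X, y)

    # Persist the whole pipeline, preprocessing included
    with open("pipe.pkl", "wb") as f:
        pickle.dump(pipe, f)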

    📂 Content

    • train.csv → Titanic dataset
    • withpipeline.ipynb → End-to-end pipeline workflow
    • withoutpipeline.ipynb → Manual preprocessing workflow
    • predictusingpipeline.ipynb → Predictions with saved pipeline (pipe.pkl)
    • predictwithoutpipeline.ipynb → Predictions with classifier + encoders
    • models/
      • pipe.pkl → Complete ML pipeline (recommended for predictions)
      • clf.pkl → Classifier without pipeline
      • ohe_sex.pkl, ohe_embarked.pkl → Encoders for categorical features

    ⚡ Usage

    1️⃣ Load and Use Pipeline

    import pickle
    
    pipe = pickle.load(open("/kaggle/input/featureengineering/models/pipe.pkl", "rb"))
    sample = [[22, 1, 0, 7.25, 'male', 'S']]
    print(pipe.predict(sample))

    2️⃣ Predict Without Pipeline

    import pickle
    
    clf = pickle.load(open("/kaggle/input/featureengineering/models/clf.pkl", "rb"))
    ohe_sex = pickle.load(open("/kaggle/input/featureengineering/models/ohe_sex.pkl", "rb"))
    ohe_embarked = pickle.load(open("/kaggle/input/featureengineering/models/ohe_embarked.pkl", "rb"))
    
    # Preprocess input manually using the encoders, then predict with clf
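    A hedged sketch of that manual route (hypothetical: the exact feature order and encoder settings depend on how clf was trained; the column order below mirrors the pipeline sample above):

    import numpy as np

    sample = np.array([[22, 1, 0, 7.25, 'male', 'S']], dtype=object)
    numeric = sample[:, :4].astype(float)             # assumed order: Age, SibSp, Parch, Fare
    sex_enc = ohe_sex.transform(sample[:, [4]])       # one-hot encode 'Sex'
    emb_enc = ohe_embarked.transform(sample[:, [5]])  # one-hot encode 'Embarked'

    # OneHotEncoder may return sparse matrices; densify before stacking
    if hasattr(sex_enc, "toarray"):
        sex_enc = sex_enc.toarray()
    if hasattr(emb_enc, "toarray"):
        emb_enc = emb_enc.toarray()

    X = np.hstack([numeric, sex_enc, emb_enc])
    print(clf.predict(X))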

    🎯 Inspiration

    • Learn the difference between manual feature engineering and pipeline-based workflows
    • Understand how to avoid data leakage using Pipeline
    • Explore cross-validation with pipelines
    • Practice model persistence and deployment strategies

    ✅ Best Practice: Use pipe.pkl (the complete pipeline) for predictions: it automatically handles preprocessing + modeling in one step!
    
    
  11. Daily Machine Learning Practice

    • kaggle.com
    zip
    Updated Nov 9, 2025
    Cite
    Astrid Villalobos (2025). Daily Machine Learning Practice [Dataset]. https://www.kaggle.com/datasets/astridvillalobos/daily-machine-learning-practice
    Explore at:
    Available download formats: zip (1019861 bytes)
    Dataset updated
    Nov 9, 2025
    Authors
    Astrid Villalobos
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Daily Machine Learning Practice – 1 Commit per Day

    Author: Astrid Villalobos
    Location: Montréal, QC
    LinkedIn: https://www.linkedin.com/in/astridcvr/

    Objective
    The goal of this project is to strengthen Machine Learning and data analysis skills through small, consistent daily contributions. Each commit focuses on a specific aspect of data processing, feature engineering, or modeling using Python, Pandas, and Scikit-learn.

    Dataset
    Source: Kaggle – Sample Sales Data
    File: data/sales_data_sample.csv
    Variables: ORDERNUMBER, QUANTITYORDERED, PRICEEACH, SALES, COUNTRY, etc.
    Goal: Analyze e-commerce performance, predict sales trends, segment customers, and forecast demand.

    Project Rules
    Rule | Description
    🟩 1 Commit per Day | Minimum one line of code daily to ensure consistency and discipline
    🌍 Bilingual Comments | Code and documentation in English and French
    📈 Visible Progress | Daily green squares = daily learning

    🧰 Tech Stack
    Languages: Python
    Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
    Tools: Jupyter Notebook, GitHub, Kaggle

    Learning Outcomes
    By the end of this challenge:
    • Develop a stronger understanding of data preprocessing, modeling, and evaluation.
    • Build consistent coding habits through daily practice.
    • Apply ML techniques to real-world sales data scenarios.
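    A hedged starter sketch for this challenge, assuming the CSV lives at data/sales_data_sample.csv with the columns listed above (the encoding argument is a guess; adjust it if the file is plain UTF-8):

    import pandas as pd

    df = pd.read_csv("data/sales_data_sample.csv", encoding="latin-1")

    # Total sales per country, as a first look at e-commerce performance
    sales_by_country = df.groupby("COUNTRY")["SALES"].sum().sort_values(ascending=False)
    print(sales_by_country.head())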

  12. CroppedYaleFaces

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    Omar Rehan (2025). CroppedYaleFaces [Dataset]. https://www.kaggle.com/datasets/aiomarrehan/croppedyalefaces
    Explore at:
    Available download formats: zip (58366379 bytes)
    Dataset updated
    Nov 15, 2025
    Authors
    Omar Rehan
    Description

    Cropped Yale Face Dataset (Grayscale Images)

    The Cropped Yale Face Dataset is a widely used benchmark in computer vision and machine learning for face recognition tasks. It consists of grayscale images of human faces captured under varying lighting conditions and expressions. The dataset is well-suited for research in facial recognition, image preprocessing, and machine learning model evaluation.

    Dataset Overview

    Feature | Description
    Number of subjects | 38 individuals
    Number of images | 2,414 images
    Image size | 192 × 168 pixels
    Color | Grayscale (single channel)
    Variations | Lighting conditions, facial expressions, and slight head rotations
    Format | .pgm images (can be converted to .png or .jpg)
    Common usage | Face recognition, PCA/LDA experiments, image classification

    Example of Dataset Structure

    CroppedYale/
    ├── yaleB01/
    │  ├── yaleB01_P00A+000E+00.pgm
    │  ├── yaleB01_P00A+000E+05.pgm
    │  └── ...
    ├── yaleB02/
    │  └── ...
    └── ...
    
    • Each folder corresponds to a single subject.
    • File naming convention: yaleB<subject_id>_P<pose>A<azimuth>E<elevation>.pgm, where the A and E values give the light-source azimuth and elevation.
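    A hedged loader sketch that builds the images and labels arrays used in the snippets below (it assumes the archive has been extracted to ./CroppedYale with the folder layout shown above; Pillow reads .pgm files directly):

    from pathlib import Path
    import numpy as np
    from PIL import Image

    root = Path("CroppedYale")
    images, labels = [], []
    for subject_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for pgm in sorted(subject_dir.glob("*.pgm")):
            images.append(np.asarray(Image.open(pgm)))  # 192 x 168 grayscale array
            labels.append(subject_dir.name)             # e.g. "yaleB01"

    images = np.stack(images)
    labels = np.array(labels)
    print(images.shape, labels.shape)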

    Example Use Cases

    1. Face Recognition

    The dataset is perfect for evaluating facial recognition algorithms under controlled lighting and expression variations.

    from sklearn.decomposition import PCA
    from sklearn.svm import SVC
    import numpy as np
    
    # Load images and flatten
    X = images.reshape(len(images), -1)
    y = labels
    
    # Reduce dimensions using PCA
    pca = PCA(n_components=100)
    X_pca = pca.fit_transform(X)
    
    # Train classifier
    clf = SVC(kernel='linear')
    clf.fit(X_pca, y)
    

    2. Dimensionality Reduction

    Due to its moderate image size, the dataset is ideal for testing dimensionality reduction methods like PCA, LDA, or t-SNE.

    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt
    
    X_embedded = TSNE(n_components=2).fit_transform(X_pca)
    plt.scatter(X_embedded[:,0], X_embedded[:,1], c=y)
    plt.show()
    

    3. Lighting & Expression Robustness

    Researchers can use this dataset to study the effect of lighting conditions and facial expressions on recognition accuracy.

    • yaleB01_P00A+000E+00.pgm → frontal light source
    • yaleB01_P00A+000E+05.pgm → light source elevated by 5°
    • yaleB01_P00A+010E+00.pgm → light source shifted 10° to the side

    Key Advantages

    • Controlled environment: Minimal background noise, making it easier to focus on the face features.
    • Diverse lighting conditions: Excellent for testing illumination-invariant algorithms.
    • Compact size: Easy to load and experiment with on most machines without high computational cost.
    • Grayscale: Simplifies preprocessing while still retaining critical facial features.
  13. Nike, Adidas and Converse Shoes Images

    • kaggle.com
    zip
    Updated Aug 3, 2022
    Cite
    Iron486 (2022). Nike, Adidas and Converse Shoes Images [Dataset]. https://www.kaggle.com/datasets/die9origephit/nike-adidas-and-converse-imaged/code
    Explore at:
    Available download formats: zip (16354002 bytes)
    Dataset updated
    Aug 3, 2022
    Authors
    Iron486
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description


    The dataset contains two folders: one with the test data and one with the train data. The test/train split ratio is 0.14, with the test set containing 114 images and the train set containing 711 images. The images have a resolution of 240x240 pixels in the RGB color model. Both folders contain 3 classes:

    • Adidas
    • Converse
    • Nike

    Inspiration

    This dataset is ideal for multiclass classification with deep neural networks such as CNNs, or with simpler machine learning classifiers. You can use TensorFlow, its high-level Keras API, scikit-learn, PyTorch or other deep/machine learning libraries to build a model from scratch, or fetch pretrained models and fine-tune them, as sketched below. It is also possible to resize or preprocess the images with OpenCV and check whether the model's accuracy improves.
    Remember to upvote if you found the dataset useful :).
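    A hedged fine-tuning sketch with Keras (it assumes the archive extracts to train/ and test/ folders with one sub-folder per class; the backbone and hyperparameters are illustrative):

    import tensorflow as tf

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "train", image_size=(240, 240), batch_size=32)
    test_ds = tf.keras.utils.image_dataset_from_directory(
        "test", image_size=(240, 240), batch_size=32)

    # Frozen pretrained backbone; MobileNetV2 expects inputs scaled to [-1, 1]
    base = tf.keras.applications.MobileNetV2(
        input_shape=(240, 240, 3), include_top=False, weights="imagenet", pooling="avg")
    base.trainable = False

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),
        base,
        tf.keras.layers.Dense(3, activation="softmax"),  # Adidas, Converse, Nike
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_ds, validation_data=test_ds, epochs=5)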

    Collection methodology

    The dataset was obtained by downloading images from Google Images.

    Images in .webp format were converted to .jpg. The obtained images were randomly shuffled and resized so that all images had a resolution of 240x240 pixels. They were then split into train and test datasets and saved.

