CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
One of the most popular competitions on Kaggle is House Prices: Advanced Regression Techniques. The original data comes from the publication by Dean De Cock, "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project", Journal of Statistics Education, Volume 19, Number 3 (2011). Recently a 'demonstration' notebook, "First place is meaningless in this way!", was published that extracts the 'solution' from the full dataset. Now that the 'solution' is readily available, it is possible to reproduce the competition at home without any daily submission limit. This opens up the possibility of experimenting with advanced techniques such as pipelines with various estimators/models in the same notebook, extensive hyper-parameter tuning, and so on, all without the risk of 'upsetting' the public leaderboard. Simply download the solution.csv file, import it into your script or notebook, and evaluate the Root Mean Squared Error (RMSE) between the logarithm of your predicted values and the logarithm of the values in this file.
This dataset is the submission.csv file that will produce a public leaderboard score of 0.00000.
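A minimal sketch of that local evaluation in Python is given below; the column names ("Id", "SalePrice") and the submission filename follow the usual competition format and are assumptions, not part of this description.

```python
import numpy as np
import pandas as pd

solution = pd.read_csv("solution.csv")         # ground-truth sale prices
submission = pd.read_csv("my_submission.csv")  # your predictions, one row per Id

# Align on Id, then compute RMSE between the logs of predicted and true prices
merged = solution.merge(submission, on="Id", suffixes=("_true", "_pred"))
rmse_log = np.sqrt(np.mean(
    (np.log(merged["SalePrice_pred"]) - np.log(merged["SalePrice_true"])) ** 2
))
print(f"Local leaderboard-style score: {rmse_log:.5f}")
```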
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Game of Thrones - Classification (Decision) Tree’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/dalmacyali1905/game-of-thrones-classification-decision-tree on 30 September 2021.
--- Dataset description provided by original source is as follows ---
The dataset is split into two parts: train (80%) and test (20%).
Using the training set, the following models were constructed: Decision Tree (with pruning), Bagging, and Random Forest.
For the last two models, cross-validation was applied for hyperparameter tuning and a final model was obtained.
The performance measures of the resulting models were then compared and one model was suggested as the final model.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of XGBoost hyperparameter tuning optimization results.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description: Best XGBoost Model for Obesity Prediction
The dataset comprises a trained XGBoost model, "best_xgb_model.pkl," specifically designed for predicting obesity levels based on individual characteristics and behaviors. This model has undergone thorough optimization and hyperparameter tuning to achieve the highest accuracy in predicting obesity across different levels.
Features: - The dataset does not include raw features but encapsulates the learned patterns and relationships within the XGBoost model.
Target Variable: - Obesity Level: The model is trained to predict different levels of obesity based on a set of input features.
Model Characteristics: - Algorithm: Extreme Gradient Boosting (XGBoost) was selected as the algorithm of choice for its superior performance in predicting obesity levels. - Hyperparameter Tuning: The model's hyperparameters have been carefully tuned to achieve optimal performance in terms of accuracy, precision, recall, and F1 score.
Use Case: - The dataset is intended for deployment in applications where real-time predictions or batch predictions of obesity levels are required. - Health professionals, researchers, and organizations focused on obesity management can benefit from integrating this model into their systems.
File Information: - File Format: best_xgb_model.pkl - Model Loading: The model can be loaded using Python's joblib or pickle libraries.
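A minimal sketch of loading the model, assuming only what the file information above states (a pickled model loadable with joblib or pickle); the input-feature layout is not documented here, so the prediction call is left as a placeholder.

```python
# Load the pre-trained XGBoost model shipped as best_xgb_model.pkl.
import joblib  # pickle.load(open("best_xgb_model.pkl", "rb")) would also work

model = joblib.load("best_xgb_model.pkl")

# X_new must match the feature order and encoding used during training,
# which this dataset does not document - placeholder only.
# predictions = model.predict(X_new)
```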
Note: This dataset serves as a valuable tool for anyone seeking a pre-trained, optimized model for obesity prediction. Users can seamlessly integrate the model into their applications, making informed decisions related to health and lifestyle based on predicted obesity levels.
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones).
This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallows me from doing so, I will take this down if requested.)
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Aside from regression modelling, an interesting thing to do is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. classifying a score of 7 or higher as 'good/1' and the remainder as 'not good/0'. This allows you to practice hyper-parameter tuning on e.g. decision tree algorithms while looking at the ROC curve and the AUC value. Without doing any kind of feature engineering or overfitting you should be able to get an AUC of 0.88 (without even using a random forest algorithm).
KNIME is a great tool (GUI) that can be used for this.
1 - File Reader (for csv) to Linear Correlation node and to Interactive Histogram node for basic EDA.
2 - File Reader to Rule Engine node to turn the 10-point scale into a dichotomous variable (good wine vs. the rest); the rule to put in the Rule Engine is something like this:
- $quality$ > 6.5 => "good"
- TRUE => "bad"
3 - Rule Engine node output to Column Filter node input, to filter out your original 10-point feature (this prevents leakage).
4 - Column Filter node output to Partitioning node input (your standard train/test split, e.g. 75%/25%; choose 'random' or 'stratified').
5 - Partitioning node train-split output to Decision Tree Learner node input.
6 - Partitioning node test-split output to Decision Tree Predictor node input.
7 - Decision Tree Learner node model output to Decision Tree Predictor node model input.
8 - Decision Tree Predictor output to ROC node input (here you can evaluate your model based on the AUC value); a scikit-learn equivalent of this workflow is sketched below.
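For those working in Python instead of KNIME, a rough scikit-learn equivalent of the workflow above might look like the sketch below. The CSV filename and separator, the tree depth, and the random seeds are assumptions; the 6.5 cutoff and the stratified 75/25 split mirror the steps above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# File Reader equivalent (filename and separator assumed from the UCI layout)
df = pd.read_csv("winequality-red.csv", sep=";")

# Rule Engine equivalent: quality > 6.5 -> "good" (1), otherwise "bad" (0)
df["good"] = (df["quality"] > 6.5).astype(int)

# Column Filter equivalent: drop the original 10-point score to prevent leakage
X = df.drop(columns=["quality", "good"])
y = df["good"]

# Partitioning equivalent: stratified 75%/25% train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Decision Tree Learner / Predictor equivalent
tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

# ROC node equivalent: evaluate the model by AUC
auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.3f}")
```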
Use machine learning to determine which physicochemical properties make a wine 'good'!
This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallows me from doing so, I will take this down at first request. I am not the owner of this dataset.)
Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description
This dataset consists of paired high-resolution (HR) and low-resolution (LR) satellite images designed for 4x super-resolution tasks. The images are organized into two directories, one containing the LR images and one containing the HR images.
All images are geographically aligned and cover the same regions, ensuring pixel-to-pixel correspondence between LR and HR pairs.
Recommended Dataset Split
To ensure robust model training and evaluation, we propose the following 75-15-10 split:
- Training Set (75%): used to train the super-resolution model
- Validation Set (15%): used for hyperparameter tuning
- Test Set (10%): reserved for final evaluation (unseen data to measure model generalization)
Split Methodology:
- Stratified Sampling: if images represent diverse terrains (urban, rural, water), ensure each subset reflects this distribution.
- Non-overlapping Regions: prevent data leakage by splitting across geographically distinct areas (e.g., tiles from different zones).
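A minimal sketch of the proposed 75-15-10 split using scikit-learn; the directory names (LR, HR), the file extension, and the per-pair terrain labels used for stratification are assumptions, not part of the dataset description.

```python
from pathlib import Path
from sklearn.model_selection import train_test_split

# Build an index of geographically aligned LR/HR pairs (directory names assumed).
lr_paths = sorted(Path("LR").glob("*.png"))
hr_paths = sorted(Path("HR").glob("*.png"))
pairs = list(zip(lr_paths, hr_paths))
terrain = ["unknown"] * len(pairs)  # replace with real terrain labels for stratification

# 75% train, then split the remaining 25% into 15% validation and 10% test.
train_pairs, rest_pairs, _, rest_terrain = train_test_split(
    pairs, terrain, test_size=0.25, stratify=terrain, random_state=42)
val_pairs, test_pairs = train_test_split(
    rest_pairs, test_size=0.4, stratify=rest_terrain, random_state=42)

# For non-overlapping regions, split by tile/zone identifiers instead of
# individual images so that adjacent tiles never straddle two subsets.
```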
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains meta-mathematics questions and answers collected from the Mistral-7B question-answering system. The responses, types, and queries are all provided in order to help boost the performance of MetaMathQA while maintaining high accuracy. With its well-structured design, this dataset provides users with an efficient way to investigate various aspects of question answering models and further understand how they function. Whether you are a professional or beginner, this dataset is sure to offer invaluable insights into the development of more powerful QA systems!
Data Dictionary
The MetaMathQA dataset contains three columns: response, type, and query.
- Response: the response to the query given by the question answering system. (String)
- Type: the type of query provided as input to the system. (String)
- Query: the question posed to the system for which a response is required. (String)
Preparing data for analysis
Before diving into analysis, it is important to familiarize yourself with the kinds of data values present in each column and to check whether any preprocessing is needed, such as removing unwanted characters or filling in missing values, so that the data can be used without issue when training or testing your model later in your workflow.
##### Training Models using Mistral 7B
Mistral 7B is an open-source model that can be used alongside tabular (csv) datasets such as this 'MetaMathQA' dataset when building machine learning workflows quickly. After collecting and preprocessing the dataset, you can choose among various machine learning algorithms offered by popular libraries, such as Support Vector Machines (SVM), logistic regression, and decision trees; it is good practice to then use GridSearchCV and RandomizedSearchCV to further tune the selected algorithm configuration during the model-building stage. After the selection process, validate the performance of the chosen models using metrics such as accuracy, F1 score, precision, and recall.
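As a concrete illustration of the GridSearchCV-style tuning mentioned above, here is a hedged sketch using the columns from the data dictionary (response, type, query). Framing the task as predicting `type` from `query` with a TF-IDF plus logistic-regression pipeline is an illustrative assumption, not something stated by the dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("train.csv")  # columns: response, type, query

X_train, X_test, y_train, y_test = train_test_split(
    df["query"], df["type"], test_size=0.2, stratify=df["type"], random_state=0)

# TF-IDF features feeding a logistic-regression classifier
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning with GridSearchCV (RandomizedSearchCV works the same way)
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, scoring="f1_macro", cv=5)
grid.fit(X_train, y_train)

# Accuracy, precision, recall and F1 on the held-out split
print(classification_report(y_test, grid.predict(X_test)))
```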
##### Testing the models
After successfully completing the building phase, the right approach is to test the models robustly on the different evaluation metrics mentioned above. At the inference stage you can make predictions with the trained model on new test cases presented by domain experts, run quality-assurance checks against the baseline metrics, and assess the confidence of the results. Updating baseline scores and re-running experiments is the preferred methodology in AI workflows, since it keeps the overall impact of inexactness-induced errors low.
- Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.
- Developing understandings on the efficiency of certain language features in producing successful question-answering results for different types of queries.
- Optimizing search algorithms that surface relevant answer results based on types of queries
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:-------------------------------------------|
| response    | The response to the query. (String)        |
| type        | The type of query. (String)                |
| query       | The question posed to the system. (String) |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Datasets Description:
The datasets under discussion pertain to the red and white variants of Portuguese "Vinho Verde" wine. Detailed information is available in the reference by Cortez et al. (2009). These datasets encompass physicochemical variables as inputs and sensory variables as outputs. Notably, specifics regarding grape types, wine brand, and selling prices are absent due to privacy and logistical concerns.
Classification and Regression Tasks: One can interpret these datasets as being suitable for both classification and regression analyses. The classes are ordered, albeit imbalanced. For instance, the dataset contains a more significant number of normal wines compared to excellent or poor ones.
Dataset Contents: For a comprehensive understanding, readers are encouraged to review the work by Cortez et al. (2009). The input variables, derived from physicochemical tests, include:
1. Fixed acidity
2. Volatile acidity
3. Citric acid
4. Residual sugar
5. Chlorides
6. Free sulfur dioxide
7. Total sulfur dioxide
8. Density
9. pH
10. Sulphates
11. Alcohol
The output variable, based on sensory data, is denoted by: 12. Quality (score ranging from 0 to 10)
Usage Tips: A practical suggestion involves setting a threshold for the dependent variable, defining wines with a quality score of 7 or higher as 'good/1' and the rest as 'not good/0.' This facilitates meaningful experimentation with hyperparameter tuning using decision tree algorithms and analyzing ROC curves and AUC values.
Operational Workflow: To efficiently utilize the dataset, the following steps are recommended:
1. Utilize a File Reader (for csv) to a Linear Correlation node and an Interactive Histogram node for basic Exploratory Data Analysis (EDA).
2. Employ a File Reader to a Rule Engine node for transforming the 10-point scale into a dichotomous variable indicating 'good wine' and 'rest.'
3. Connect the Rule Engine node output to the input of a Column Filter node to filter out the original 10-point feature, thus preventing data leakage.
4. Connect the Column Filter node output to the input of a Partitioning node to execute a standard train/test split (e.g., 75%/25%, choosing 'random' or 'stratified').
5. Feed the Partitioning node train-split output into the input of a Decision Tree Learner node.
6. Connect the Partitioning node test-split output to the input of a Decision Tree Predictor node.
7. Link the Decision Tree Learner node output to the model input of the Decision Tree Predictor node.
8. Finally, connect the Decision Tree Predictor output to the input of a ROC node for model evaluation based on the AUC value.
Tools and Acknowledgments: For an efficient analysis, consider using KNIME, a valuable graphical user interface (GUI) tool. Additionally, the dataset is available on the UCI machine learning repository, and proper acknowledgment and citation of the dataset source by Cortez et al. (2009) are essential for use.
Mammography is the most effective method for breast cancer screening available today. However, the low positive predictive value of breast biopsy resulting from mammogram interpretation leads to approximately 70% unnecessary biopsies with benign outcomes. To reduce the high number of unnecessary breast biopsies, several computer-aided diagnoses (CAD) systems have been proposed in the last years. These systems help physicians in their decision to perform a breast biopsy on a suspicious lesion seen in a mammogram or to perform a short-term follow-up examination instead.
This data set can be used to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient's age. It contains a BI-RADS assessment, the patient's age and three BI-RADS attributes together with the ground truth (the severity field).
Attribute Information:
1. BI-RADS assessment: 1 to 5 (ordinal, non-predictive!)
2. Age: patient's age in years (integer)
3. Shape: mass shape: round=1, oval=2, lobular=3, irregular=4 (nominal)
4. Margin: mass margin: circumscribed=1, microlobulated=2, obscured=3, ill-defined=4, spiculated=5 (nominal)
5. Density: mass density: high=1, iso=2, low=3, fat-containing=4 (ordinal)
6. Severity: benign=0 or malignant=1 (binomial, goal field!)
Evaluation Task: Download the dataset from the attached file and perform the following tasks:
1. Build a statistical classification model to detect severity.
2. What considerations have been used for model selection?
3. What features would you want to create for your prediction model based on the data provided?
4. How have you performed hyper-parameter tuning and model optimization? What are the reasons for your decision choices for these steps?
5. What is your model evaluation criterion? What are the assumptions and limitations of your approach?
6. Determine whether the data is normally distributed, both visually and statistically.
7. Comment on the EDA of the variables in the data.
8. How are you detecting and treating outliers in the dataset for better convergence?
9. What techniques have been used for treating missing values to prepare features for model building?
10. What is the distribution of the target with respect to the categorical columns?
11. Comment on any other observations or recommendations based on your analysis.
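A minimal baseline sketch for a few of the tasks above (missing-value treatment, model building, hyper-parameter tuning, evaluation). The file name, header-less layout, and '?' missing-value marker follow the UCI version of this dataset and are assumptions here; the imputation strategy and tree depths are illustrative choices.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

cols = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]
df = pd.read_csv("mammographic_masses.data", names=cols, na_values="?")  # filename/format assumed

# BI-RADS is flagged as non-predictive above, so drop it from the features
X, y = df.drop(columns=["BI-RADS", "Severity"]), df["Severity"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Impute missing values, then tune a decision tree as a simple baseline
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
grid = GridSearchCV(pipe, {"tree__max_depth": [3, 5, 7, None]}, scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

print(classification_report(y_test, grid.predict(X_test)))
print("AUC:", roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1]))
```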