33 datasets found
  1. ML Performance on SkLearn Breast Cancer Data

    • kaggle.com
    zip
    Updated Aug 24, 2023
    Cite
    Masood ullah (2023). ML Performance on SkLearn Breast Cancer Data [Dataset]. https://www.kaggle.com/datasets/masoodullah/ml-performance-on-sklearn-breast-cancer-data
    Explore at:
    Available download formats: zip (2185 bytes)
    Dataset updated
    Aug 24, 2023
    Authors
    Masood ullah
    Description

    Get ready for an exciting adventure into the world of machine-learning models on Kaggle! Our dataset is like a puzzle waiting to be solved. We've designed it carefully, and it's all about Breast Cancer data. Imagine exploring a treasure trove of numbers that reveal how different models perform. See the magic of advanced methods and colorful graphs that show accuracy, precision, recall, and F1-score. This dataset isn't just numbers – it's an opportunity to challenge yourself, find hidden patterns, and prove your data skills. We've made it just for you, so you can uncover the secrets of machine learning and shine on Kaggle!

    The column descriptions are as follows (a short metric-computation sketch follows the list):

    1. Model Name: The name of the machine learning model used for prediction.
    2. Hyperparameters: The configuration settings used for the model, showcasing the versatility of model tuning.
    3. Accuracy: The proportion of correctly predicted instances out of the total instances, indicating overall model performance.
    4. Precision: The ratio of correctly predicted positive instances to all instances predicted as positive, reflecting model's accuracy in positive predictions.
    5. Recall: The ratio of correctly predicted positive instances to all actual positive instances, measuring model's ability to capture positive cases.
    6. F1-Score: The harmonic mean of precision and recall, providing a balanced assessment of model performance.
    7. Classification Report: A comprehensive summary of precision, recall, F1-score, and support for both classes (0 and 1), offering insights into class-specific performance.
    8. FPR (False Positive Rate): The ratio of incorrectly predicted negative instances to all actual negative instances, revealing the cost of false positives.
    9. TPR (True Positive Rate): Synonymous with recall, indicating the model's ability to identify positive instances.
    10. ROC AUC: The Area Under the Receiver Operating Characteristic curve, illustrating the trade-off between true positive rate and false positive rate.
    11. Precision-Recall Curve: A graphical representation of precision and recall values across different thresholds, aiding in model selection based on specific requirements.
    12. PR AUC (Precision-Recall AUC): The Area Under the Precision-Recall curve, a valuable metric for imbalanced datasets.
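
    The following is a minimal sketch of how metrics like these are typically computed with scikit-learn on the underlying breast cancer data; the model choice and split are illustrative assumptions, not the configurations recorded in this dataset.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, classification_report)

    # Train an example model on the sklearn breast cancer data.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]  # probabilities for ROC AUC

    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall (TPR):", recall_score(y_test, y_pred))
    print("F1-score:", f1_score(y_test, y_pred))
    print("ROC AUC:", roc_auc_score(y_test, y_prob))
    print(classification_report(y_test, y_pred))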
  2. Data from: imdb

    • huggingface.co
    Updated May 10, 2025
    Cite
    scikit-learn (2025). imdb [Dataset]. https://huggingface.co/datasets/scikit-learn/imdb
    Explore at:
    Croissant — a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    May 10, 2025
    Dataset authored and provided by
    scikit-learn
    License

    https://choosealicense.com/licenses/other/

    Description

    This is the sentiment analysis dataset based on IMDB reviews initially released by Stanford University. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/imdb.
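
    As a quick orientation, the dataset can be pulled with the Hugging Face datasets library; this is a minimal sketch, assuming the standard IMDB split and column layout, which may differ in this mirror.

    from datasets import load_dataset

    # Load the mirror hosted under the scikit-learn organisation on the Hub.
    imdb = load_dataset('scikit-learn/imdb')
    print(imdb)  # inspect available splits and columns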

  3. Digits_csv

    • kaggle.com
    zip
    Updated Dec 20, 2023
    Cite
    parsiya maha (2023). Digits_csv [Dataset]. https://www.kaggle.com/datasets/parsiyamaha/digits-csv
    Explore at:
    Available download formats: zip (80190 bytes)
    Dataset updated
    Dec 20, 2023
    Authors
    parsiya maha
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset is related to the sklearn library in Python.

    We have 1796 sample images.

    Classes: 0, 1, 2, ..., 7, 8, 9

    Image size: 64 values -> (8, 8)

    You can load this dataset with:

    from sklearn.datasets import load_digits

    dataset = load_digits()
    x = dataset.data    # flattened 64-pixel feature vectors
    y = dataset.target  # digit labels 0-9
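
    As a brief follow-up to the size note above, each 64-value row reshapes to an 8×8 image; a minimal sketch for viewing one sample:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits

    dataset = load_digits()
    plt.imshow(dataset.data[0].reshape(8, 8), cmap='gray')  # 64 values -> (8, 8)
    plt.title(f"Label: {dataset.target[0]}")
    plt.show()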

  4. AUTO-SKLEARN github.com/automl/AUTO-SKLEARN Price Prediction Data

    • coinbase.com
    Updated Dec 2, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). AUTO-SKLEARN github.com/automl/AUTO-SKLEARN Price Prediction Data [Dataset]. https://www.coinbase.com/price-prediction/base-auto-sklearn-githubcomautomlauto-sklearn-1135
    Explore at:
    Dataset updated
    Dec 2, 2025
    Variables measured
    Growth Rate, Predicted Price
    Measurement technique
    User-defined projections based on compound growth. This is not a formal financial forecast.
    Description

    This dataset contains the predicted prices of the asset AUTO-SKLEARN github.com/automl/AUTO-SKLEARN over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.
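
    The projection described above is plain compound growth; a minimal sketch of the calculation, where the starting price is a hypothetical input (the description does not specify one):

    start_price = 1.0   # hypothetical current price of the asset
    growth_rate = 0.05  # default 5% annual growth, user-adjustable within [-1.0, 1.0]

    # Predicted price for each of the next 16 years under compound growth.
    for year in range(1, 17):
        print(f"Year {year}: {start_price * (1 + growth_rate) ** year:.4f}")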

  5. One Classifier Ignores a Feature

    • data.niaid.nih.gov
    Updated Apr 29, 2022
    Cite
    Maier, Karl (2022). One Classifier Ignores a Feature [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6502642
    Explore at:
    Dataset updated
    Apr 29, 2022
    Authors
    Maier, Karl
    License

    Attribution 4.0 (CC BY 4.0) — https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data sets are used in a controlled experiment, where two classifiers should be compared. train_a.csv and explain.csv are slices from the original data set. train_b.csv contains the same instances as in train_a.csv, but with feature x1 set to 0 to make it unusable to classifier B.

    The original data set was created and split using this Python code:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Generate a 2-feature, 2-class data set and scale it up.
    X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                               n_informative=2, n_clusters_per_class=1,
                               class_sep=0.75, random_state=0)
    X *= 100

    # Classifier A: trained on the original features.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    lm = LogisticRegression()
    lm.fit(X_train, y_train)
    clf_a = lm

    # Classifier B: trained with feature x1 zeroed out, so it cannot use it.
    clf_b = LogisticRegression()
    X2 = X.copy()
    X2[:, 0] = 0
    X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
    clf_b.fit(X2_train, y2_train)

    # Explanation set: the held-out test slice.
    X_explain = X_test
    y_explain = y_test

  6. Metatasks for Auto-Sklearn 1 - ROC AUC and Balanced Accuracy

    • figshare.com
    bin
    Updated Jul 1, 2023
    Cite
    Lennart Purucker (2023). Metatasks for Auto-Sklearn 1 - ROC AUC and Balanced Accuracy [Dataset]. http://doi.org/10.6084/m9.figshare.23613627.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lennart Purucker
    License

    Attribution 4.0 (CC BY 4.0) — https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Prediction Data of Base Models from Auto-Sklearn 1 on 71 classification datasets from the AutoML Benchmark for Balanced Accuracy and ROC AUC.

    The files of this figshare item include data that was collected for the paper:

    Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoML, Lennart Purucker, Lennart Schneider, Marie Anastacio, Joeran Beel, Bernd Bischl, Holger Hoos, Second International Conference on Automated Machine Learning, 2023.

    The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.

    In detail, the data contains the predictions of base models on validation and test as produced by running Auto-Sklearn 1 for 4 hours. Such prediction data is included for each model produced by Auto-Sklearn 1 on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.

    The data was collected by code created for the paper and is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23613624.

    Its usage is intended for but not limited to using assembled to evaluate post hoc ensembling methods for AutoML.

    Details

    The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.

    The link resolves to a directory containing the following:

    example_metatasks: contains an example metatask for test purposes before committing to downloading all files.
    metatasks_roc_auc.zip: the Metatasks obtained by running Auto-Sklearn 1 for ROC AUC.
    metatasks_bacc.zip: the Metatasks obtained by running Auto-Sklearn 1 for Balanced Accuracy.

    The size after unzipping the entire file is:

    metatasks_roc_auc.zip: ~450 GB
    metatasks_bacc.zip: ~330 GB

    We suggest extracting only files that are of interest from the .zip archive, as these can be much smaller in size and might suffice for experiments.

    The metatask .zip files contain 2 subdirectories for Metatasks produced based on TopN or SiloTopN pruning (see paper for details). In each of these subdirectories, 2 files per metatask exist: one .json file with metadata information and a .hdf or .csv file containing the prediction data. The details on how this should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metatasks, we advise looking at the file content and metadata individually or parsing them by using Metatasks first.
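
    For a first look at one extracted metatask outside the assembled framework, a minimal sketch; the file names here are hypothetical placeholders for the per-metatask .json and .csv files described above:

    import json
    import pandas as pd

    # Hypothetical file names — substitute the actual per-metatask files.
    with open('metatask_metadata.json') as f:
        metadata = json.load(f)
    predictions = pd.read_csv('metatask_predictions.csv')  # use pd.read_hdf(...) for .hdf files

    print(metadata.keys())
    print(predictions.head())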

  7. Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

    • data.mendeley.com
    Updated Nov 18, 2020
    Cite
    TaeKeun Yoo (2020). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.1
    Explore at:
    Dataset updated
    Nov 18, 2020
    Authors
    TaeKeun Yoo
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0) — https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD1, Ik Hee Ryu, MD, MS2, Tae Keun Yoo, MD2, Jung Sub Kim MD2, In Sik Lee, MD, PhD2, Jin Kook Kim MD2, Wakako Ando CO3, Nobuyuki Shoji, MD, PhD3, Tomofusa, Yamauchi, MD, PhD4, Hitoshi Tabuchi, MD, PhD4. Author Affiliation: 1Visual Physiology, School of Allied Health Sciences, Kitasato University, Kanagawa, Japan, 2B&VIIT Eye Center, Seoul, Korea, 3Department of Ophthalmology, School of Medicine, Kitasato University, Kanagawa, Japan, 4Department of Ophthalmology, Tsukazaki Hospital, Hyogo, Japan.

    We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

    This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor

    # Connect data in your Google Drive.
    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    # Change the path for the custom data.
    # In this case, we used ICL vault prediction using preop measurements.
    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    Optimal features (sorted by importance):

    1. ICL size
    2. ICL power
    3. LV
    4. CLR
    5. ACD
    6. ATA
    7. MSE
    8. Age
    9. Pupil size
    10. WTW
    11. CCT
    12. ACW

    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    # Split the dataset into train and test data.
    # For a simple validation test, we split the data 8:2.
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

    # Optimal parameter search could be performed in this section.
    parameters = {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 500,
                  'criterion': 'mae', 'min_samples_split': 10, 'max_features': 'sqrt',
                  'max_depth': 6, 'max_leaf_nodes': None}

    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_
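
    A possible follow-up (not part of the original listing): scoring the held-out predictions with mean absolute error, which matches the 'mae' criterion used for training.

    from sklearn.metrics import mean_absolute_error

    # Average absolute difference between predicted and actual vault values.
    mae = mean_absolute_error(test_y, RF_predictions)
    print(f"Test MAE: {mae:.2f}")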

  8. vgrgrgerger

    • huggingface.co
    Updated Sep 2, 2024
    Cite
    fhffht (2024). vgrgrgerger [Dataset]. https://huggingface.co/datasets/long88889/vgrgrgerger
    Explore at:
    Dataset updated
    Sep 2, 2024
    Dataset authored and provided by
    fhffht
    Description

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Load the Iris data set.
    from sklearn.datasets import load_iris

    iris = load_iris()
    iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
    iris_df['species'] = iris.target
    iris_df['species'] = iris_df['species'].apply(lambda x: iris.target_names[x])

  9. 🌆 City Lifestyle Segmentation Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset
    Explore at:
    Available download formats: zip (11274 bytes)
    Dataset updated
    Nov 15, 2025
    Authors
    UmutUygurr
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    🌆 About This Dataset

    This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

    🎯 Perfect For:

    • 📊 K-Means, DBSCAN, Agglomerative Clustering
    • 🔬 PCA & t-SNE Dimensionality Reduction
    • 🗺️ Geospatial Visualization (Plotly, Folium)
    • 📈 Correlation Analysis & Feature Engineering
    • 🎓 Educational Projects (Beginner to Intermediate)

    📦 What's Inside?

    Feature | Description | Range
    10 Features | Economic, environmental & social indicators | Realistically scaled
    300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions
    Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready
    No Missing Values | Clean, preprocessed data | Ready for analysis
    4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated

    🔥 Key Features

    Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
    Regional Diversity: Each region has distinct economic and environmental characteristics
    Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
    Beginner-Friendly: No data cleaning required, includes example code
    Documented: Comprehensive README with methodology and use cases

    🚀 Quick Start Example

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    
    # Load and prepare
    df = pd.read_csv('city_lifestyle_dataset.csv')
    X = df.drop(['city_name', 'country'], axis=1)
    X_scaled = StandardScaler().fit_transform(X)
    
    # Cluster
    kmeans = KMeans(n_clusters=5, random_state=42)
    df['cluster'] = kmeans.fit_predict(X_scaled)
    
    # Analyze
    print(df.groupby('cluster').mean())
    

    🎓 Learning Outcomes

    After working with this dataset, you will be able to:

    1. Apply K-Means, DBSCAN, and Hierarchical Clustering
    2. Use PCA for dimensionality reduction and visualization (see the sketch after this list)
    3. Interpret correlation matrices and feature relationships
    4. Create geographic visualizations with cluster assignments
    5. Profile and name discovered clusters based on characteristics
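
    A minimal PCA sketch continuing from the quick-start example above (it assumes X_scaled and df['cluster'] from that block):

    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    # Project the scaled features onto the first two principal components.
    coords = PCA(n_components=2).fit_transform(X_scaled)
    plt.scatter(coords[:, 0], coords[:, 1], c=df['cluster'], cmap='tab10', s=15)
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('City clusters in PCA space')
    plt.show()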

    📚 Ideal For These Projects

    • 🏆 Kaggle Competitions: Practice clustering techniques
    • 📝 Academic Projects: Urban planning, sociology, environmental science
    • 💼 Portfolio Work: Showcase ML skills to employers
    • 🎓 Learning: Hands-on practice with unsupervised learning
    • 🔬 Research: Urban lifestyle segmentation studies

    🌍 Expected Clusters

    Cluster | Characteristics | Example Cities
    Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore
    Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities
    Developing Centers | Mid income, high density, poor air | Emerging markets
    Low-Income Suburban | Low infrastructure, income | Rural areas
    Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs

    🛠️ Technical Details

    • Format: CSV (UTF-8)
    • Size: ~300 rows × 10 columns
    • Missing Values: 0%
    • Data Types: 2 categorical, 8 numerical
    • Target Variable: None (unsupervised)
    • Correlation Strength: Pre-validated (r: 0.4 to 0.8)

    📖 What Makes This Dataset Special?

    Unlike random synthetic data, this dataset was carefully engineered with:

    • ✨ Realistic correlation structures based on urban research
    • 🌍 Regional characteristics matching real-world patterns
    • 🎯 Optimal cluster separability (validated via silhouette scores)
    • 📚 Comprehensive documentation and starter code

    🏅 Use This Dataset If You Want To:

    ✓ Learn clustering without data cleaning hassles
    ✓ Practice PCA and dimensionality reduction
    ✓ Create beautiful geographic visualizations
    ✓ Understand feature correlation in real-world contexts
    ✓ Build a portfolio project with clear business insights

    📊 Acknowledgments

    This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

    Happy Clustering! 🎉

  10. Diabetes_Dataset_1.1

    • kaggle.com
    zip
    Updated Nov 2, 2023
    Cite
    KIRANMAYI G 777 (2023). Diabetes_Dataset_1.1 [Dataset]. https://www.kaggle.com/datasets/kiranmayig777/diabetes-dataset-1-1/code
    Explore at:
    Available download formats: zip (779755 bytes)
    Dataset updated
    Nov 2, 2023
    Authors
    KIRANMAYI G 777
    Description

    import pandas as pd
    import numpy as np

    # (Assumes `data` has already been loaded, e.g. data = pd.read_csv(...).)

    PERFORMING EDA

    data.head()
    data.info()

    attributes_data = data.iloc[:, 1:]
    attributes_data

    attributes_data.describe()
    attributes_data.corr()

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Calculate the correlation matrix and create a heatmap.
    correlation_matrix = attributes_data.corr()
    plt.figure(figsize=(18, 10))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
    plt.show()

    CHECKING IF DATASET IS LINEAR OR NON-LINEAR

    # Calculate correlations between the target and predictor columns.
    correlations = data.corr()['Diabetes_binary'].drop('Diabetes_binary')

    # Create a bar chart.
    plt.figure(figsize=(10, 6))
    correlations.plot(kind='bar')
    plt.xlabel('Predictor Columns')
    plt.ylabel('Correlation values')
    plt.title('Correlation between Diabetes_binary and Predictors')
    plt.show()

    CHECKING FOR NULL AND MISSING VALUES, CLEANING THEM

    # Count the number of null/missing values in each column.
    print(data.isnull().sum())
    print(data.isna().sum())

    LASSO

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV, KFold

    X = data.iloc[:, 1:]
    y = data.iloc[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # GridSearchCV is used to find the optimal combination of hyperparameters
    # for a given model, so we can select the best parameters from the listed
    # hyperparameters.
    parameters = {"alpha": np.linspace(0.00001, 10, 500)}  # 500 candidate alphas
    kfold = KFold(n_splits=10, shuffle=True, random_state=42)
    lassoReg = Lasso()
    lasso_cv = GridSearchCV(lassoReg, param_grid=parameters, cv=kfold)
    lasso_cv.fit(X, y)
    print("Best Params {}".format(lasso_cv.best_params_))

    column_names = list(data)
    column_names = column_names[1:]
    column_names

    lassoModel = Lasso(alpha=0.00001)
    lassoModel.fit(X_train, y_train)
    lasso_coeff = np.abs(lassoModel.coef_)  # making all coefficients positive
    plt.bar(column_names, lasso_coeff, color='orange')
    plt.xticks(rotation=90)
    plt.grid()
    plt.title("Feature Selection Based on Lasso")
    plt.xlabel("Features")
    plt.ylabel("Importance")
    plt.ylim(0, 0.16)
    plt.show()

    RFE

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    from sklearn.feature_selection import RFECV
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier()
    rfecv = RFECV(estimator=model, step=1, cv=20, scoring="accuracy")
    rfecv = rfecv.fit(X_train, y_train)

    num_features_selected = len(rfecv.ranking_)

    # Feature rankings (1 = selected).
    cv_scores = rfecv.ranking_

    # Plotting the number of features vs. score.
    plt.figure(figsize=(10, 6))
    plt.xlabel("Number of features selected")
    plt.ylabel("Score (accuracy)")
    plt.plot(range(1, num_features_selected + 1), cv_scores, marker='o', color='r')
    plt.xticks(range(1, num_features_selected + 1))  # Set x-ticks to integers
    plt.grid()
    plt.title("RFECV: Number of Features vs. Score(accuracy)")
    plt.show()

    print("The optimal number of features:", rfecv.n_features_)
    print("Best features:", X_train.columns[rfecv.support_])

    PCA

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline  # notebook magic
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = data.drop(["Diabetes_binary"], axis=1)
    y = data["Diabetes_binary"]

    df1 = pd.DataFrame(data=data, columns=data.columns)
    print(df1)

    # Standardize, then project onto the first three principal components.
    scaling = StandardScaler()
    scaling.fit(df1)
    Scaled_data = scaling.transform(df1)
    principal = PCA(n_components=3)
    principal.fit(Scaled_data)
    x = principal.transform(Scaled_data)
    print(x.shape)

    principal.components_

    plt.figure(figsize=(10, 10))
    plt.scatter(x[:, 0], x[:, 1], c=data['Diabetes_binary'], cmap='plasma')
    plt.xlabel('pc1')
    plt.ylabel('pc2')

    print(principal.explained_variance_ratio_)

    T-SNE

    from sklearn.manifold import TSNE
    from numpy import reshape
    import seaborn as sns

    tsne = TSNE(n_components=3, verbose=1, random_state=42)
    z = tsne.fit_transform(X)

    df = pd.DataFrame()
    df["y"] = y
    df["comp-1"] = z[:, 0]
    df["comp-2"] = z[:, 1]
    df["comp-3"] = z[:, 2]
    sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                    palette=sns.color_palette("husl", 2),
                    data=df).set(title="Diabetes data T-SNE projection")

  11. 1234567

    • huggingface.co
    Cite
    MUKTINATH KUMAR, 1234567 [Dataset]. https://huggingface.co/datasets/mcurry20/1234567
    Explore at:
    Authors
    MUKTINATH KUMAR
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    import pandas as pd
    from sklearn.model_selection import KFold
    from sklearn.metrics import accuracy_score, f1_score
    import re
    import math
    from collections import defaultdict, Counter

    Load and preprocess data

    def load_data(file_path):
        data = []
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                label, text = line.strip().split('\t')
                label = label.lower()
                text = re.sub(r'[^\w\s]', '', text.lower())  # remove punctuation
    … See the full description on the dataset page: https://huggingface.co/datasets/mcurry20/1234567.

  12. Prediction of Personality Traits using the Big 5 Framework

    • zenodo.org
    csv, text/x-python
    Updated Feb 2, 2023
    Cite
    Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
    Explore at:
    Available download formats: text/x-python, csv
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Neelima Brahmbhatt
    License

    Attribution 4.0 (CC BY 4.0) — https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The methodology is the core component of any research-related work; it shows the methods used to obtain the results. Here, the whole research implementation is done using Python. The steps involved in the research work are as follows:

    1. Acquire Personality Dataset

    The Kaggle machine learning dataset is a collection of datasets and data generators used by the machine learning community for analysis purposes. The personality prediction dataset was acquired from the Kaggle website. This dataset was collected (2016-2018) through an interactive online personality test constructed from the IPIP. The personality prediction dataset can be downloaded in zip file format by clicking on the link available. The personality prediction file consists of two subject CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and the final label output. The dataset also has multivariate characteristics. Here, data preprocessing is done to check for inconsistent behaviors or trends.

    2. Data preprocessing

    After data acquisition, the next step is to clean and preprocess the data. The dataset available has numerical-type features. The target value is a five-level personality consisting of serious, lively, responsible, dependable & extraverted. The preprocessed dataset is further split into training and testing datasets. This is achieved by passing the feature values, target values, and test size to the train-test split method of the scikit-learn package. After splitting the data, the training data is passed to the Logistic Regression & SVM models for training, then the test data is used to evaluate the accuracy of the trained model.

    3. Feature Extraction

    The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

            EXT1 I am the life of the party.
            EXT2  I don't talk a lot.
            EXT3  I feel comfortable around people.
            EXT4  I am quiet around strangers.
            EST1  I get stressed out easily.
            EST2  I get irritated easily.
            EST3  I worry about things.
            EST4  I change my mood a lot.
            AGR1  I have a soft heart.
            AGR2  I am interested in people.
            AGR3  I insult people.
            AGR4  I am not really interested in others.
            CSN1  I am always prepared.
            CSN2  I leave my belongings around.
            CSN3  I follow a schedule.
            CSN4  I make a mess of things.
            OPN1  I have a rich vocabulary.
            OPN2  I have difficulty understanding abstract ideas.
            OPN3  I do not have a good imagination.
            OPN4  I use difficult words.

    4. Training the Model

    Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets, a training set and a testing set: 80% for training and 20% for testing. You train the model using the training set. In this model, we trained our dataset using linear_model.LogisticRegression() & svm.SVC() from the sklearn package.

    5. Personality Prediction Output

    After training the designed models, the testing of Logistic Regression & SVM is performed using cohen_kappa_score & accuracy_score; a minimal sketch of this train-and-evaluate flow follows.
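
    This is a minimal sketch of the flow described above, assuming the feature matrix X and the five-level personality target y have already been extracted from the CSV files:

    from sklearn.model_selection import train_test_split
    from sklearn import linear_model, svm
    from sklearn.metrics import accuracy_score, cohen_kappa_score

    # 80/20 split as described in the methodology.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    for clf in (linear_model.LogisticRegression(max_iter=1000), svm.SVC()):
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(type(clf).__name__,
              "accuracy:", accuracy_score(y_test, y_pred),
              "kappa:", cohen_kappa_score(y_test, y_pred))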

  13. Open Australian Legal Embeddings

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Umar Butler (2023). Open Australian Legal Embeddings [Dataset]. https://www.kaggle.com/datasets/umarbutler/open-australian-legal-embeddings/code
    Explore at:
    Croissant — a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Umar Butler
    Area covered
    Australia
    Description

    Open Australian Legal Embeddings ⚖️

    The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.

    Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.

    The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.

    To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.

    Usage 👩‍💻

    The below code snippet illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:

    ```python
    import itertools
    import sklearn.metrics.pairwise

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('BAAI/bge-small-en-v1.5')
    instruction = 'Represent this sentence for searching relevant passages: '

    # Set streaming to False if you wish to load the entire dataset into memory
    # (unadvised unless you have at least 64 GB of RAM).
    oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True)

    # Sample the first 100,000 embeddings.
    sample = list(itertools.islice(oale, 100000))

    # Embed a query.
    query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

    # Identify the most similar embedding to the query.
    similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
    most_similar_index = similarities.argmax()
    most_similar = sample[most_similar_index]

    # Print the most similar text.
    print(most_similar['text'])
    ```

    To speed up the loading of the Embeddings, you may wish to install orjson.

    Structure 🗂️

    The Embeddings are stored in data/embeddings.jsonl, a json lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.

    The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, which is a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).

    Creation 🧪

    All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks up to 512-tokens-long (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. These chunks included a header embedding documents' titles, jurisdictions and types in the following format:

    Title: {title}
    Jurisdiction: {jurisdiction}
    Type: {type}
    {text}

    The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.

    The resulting embeddings were serialised as json-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.

    The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...

  14. rsna_small_for_faster_experimentation

    • kaggle.com
    zip
    Updated Dec 27, 2022
    Cite
    Mohammed Hasan goni (2022). rsna_small_for_faster_experimentation [Dataset]. https://www.kaggle.com/datasets/hasangoni/rsna-small-for-faster-experimentation/code
    Explore at:
    Available download formats: zip (177242332 bytes)
    Dataset updated
    Dec 27, 2022
    Authors
    Mohammed Hasan goni
    Description

    The original dataset can be found in this competition. I found PNG ROI images here. I then created a subset of that dataset, only 10% of the data, to get faster iteration per epoch.

  15. Super.Complex: A supervised machine learning pipeline for molecular complex...

    • explore.openaire.eu
    Updated Jun 2, 2021
    Cite
    Meghana V. Palukuri; Edward M. Marcotte (2021). Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks: Experiment data [Dataset]. http://doi.org/10.5281/zenodo.4814943
    Explore at:
    Dataset updated
    Jun 2, 2021
    Authors
    Meghana V. Palukuri; Edward M. Marcotte
    Description

    Details of experiments are given in the paper titled 'Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks.' For additional details, please see https://sites.google.com/view/supercomplex/super-complex-v3-0. Supporting code is available on GitHub at: https://github.com/marcottelab/super.complex

    Details of files provided for each experiment are given below.

    Toy network experiment

    Input data:
    • Toy network, available as a weighted edge list. Format: node1 node2 edge-weight
    • All raw toy communities, available as node lists. Format: node1 node2 node3 .. (each line represents a community)

    Intermediate output results:
    • Training toy communities, available as node lists. Format: node1 node2 node3 .. (each line represents a community)
    • Testing toy communities, available as node lists. Format: node1 node2 node3 .. (each line represents a community)
    • Training and testing toy community feature matrices, available as spreadsheets of rows of feature vectors, each corresponding to a positive or negative community. Format: density, number of nodes, degree statistics (maximum, mean, median, variance), clustering coefficient (CC) statistics (maximum, mean, variance), edge weight statistics (mean, maximum, variance), degree correlation (DC) statistics (mean, variance, maximum), 3 singular values of the subgraph's adjacency matrix, label (positive or negative community, indicated by 1 or 0 respectively)

    Output results:
    • Trained toy community fitness function, available as pickled files of a data pre-processor and a machine learning model from sklearn, which can be imported in Python using the pickle module (see the sketch after this section)
    • Learned toy communities, available as node lists. Format: node1 node2 node3 .. nodeN score (each line represents a community; the score is the community fitness function of the community)
    • Learned toy communities, available as edge lists. Format: node1 node2 edge-weight (a blank line, i.e. two newline characters, separates the edges of one community from another community's edges)

    hu.MAP experiment

    Input data:
    • hu.MAP PPI (protein-protein interaction) network, available as a weighted edge list. Format: gene_ID1 gene_ID2 edge-weight
    • All raw human protein complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. (each line represents a protein complex)

    Intermediate output results:
    • Training complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. (each line represents a protein complex)
    • Testing complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. (each line represents a protein complex)
    • Training and testing data, i.e. feature matrices of CORUM complexes (with edge weights from the hu.MAP PPI network), available as spreadsheets of rows of feature vectors, each corresponding to a positive or negative protein complex, with the same format as the toy community feature matrices above

    Output results:
    • Trained community fitness function of CORUM complexes (with edge weights from hu.MAP), available as pickled files of a data pre-processor and a machine learning model from sklearn
    • Learned protein complexes from hu.MAP PPI network, available as node lists. Format: Excel file, where...
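
    As noted above, each trained community fitness function ships as a pickled sklearn pre-processor plus model; a minimal loading sketch, where the file names are hypothetical placeholders:

    import pickle

    # Hypothetical file names — substitute the actual paths from this item.
    with open('preprocessor.pkl', 'rb') as f:
        preprocessor = pickle.load(f)
    with open('model.pkl', 'rb') as f:
        model = pickle.load(f)

    # One row in the feature-matrix format described above (18 values).
    feature_vector = [[0.5, 10, 6, 3.2, 3, 1.1, 0.9, 0.4, 0.02,
                       0.7, 1.0, 0.05, 0.1, 0.01, 0.3, 4.2, 1.7, 0.8]]
    print(model.predict(preprocessor.transform(feature_vector)))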

  16. roots-tsne-data

    • huggingface.co
    Updated May 16, 2023
    Cite
    Christopher Akiki (2023). roots-tsne-data [Dataset]. https://huggingface.co/datasets/christopher/roots-tsne-data
    Explore at:
    Croissant — a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    May 16, 2023
    Authors
    Christopher Akiki
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    What follows is research code. It is by no means optimized for speed, efficiency, or readability.

    Data loading, tokenizing and sharding

    import os
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.decomposition import TruncatedSVD
    from tqdm.notebook import tqdm
    from openTSNE import TSNE
    import datashader as ds
    import colorcet as cc

    from dask.distributed import Client
    import dask.dataframe as dd
    import dask_ml
    import … See the full description on the dataset page: https://huggingface.co/datasets/christopher/roots-tsne-data.

  17. Airline-Delay-Prediction

    • kaggle.com
    zip
    Updated Apr 5, 2025
    Cite
    Ahmed Mostafa (2025). Airline-Delay-Prediction [Dataset]. https://www.kaggle.com/datasets/ahmed4mostafa/air-line
    Explore at:
    Available download formats: zip (22905 bytes)
    Dataset updated
    Apr 5, 2025
    Authors
    Ahmed Mostafa
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Airline Delay Prediction Dataset
    A Machine Learning-Ready Dataset for Flight Delay Analysis and Predictive Modeling

    📌 Dataset Overview

    This dataset provides historical flight data curated to analyze and predict airline delays using machine learning. It includes key features such as flight schedules, weather conditions, and delay causes, making it ideal for:

    🚀 ML model training (binary classification: delayed/not delayed).

    📈 Trend analysis (e.g., weather impact, airline performance).

    🎯 Academic research or industry applications.

    📂 Data Specifications

    Format: CSV (ready for pandas/scikit-learn).

    Size: [X] thousand records (covers [Year Range]).

    Variables:

    Flight details: Departure/arrival times, airline, aircraft type.

    Delay causes: Weather, technical issues, security, etc.

    Weather data: Temperature, visibility, wind speed.

    Target variable: Delay status (e.g., Delayed: Yes/No or Delay_minutes).

    🎯 Potential Use Cases

    1. Predictive Modeling (a fuller hedged sketch appears at the end of this entry):

       from sklearn.ensemble import RandomForestClassifier
       model = RandomForestClassifier().fit(X_train, y_train)

    2. Airline Performance Benchmarking.
    3. Weather-Delay Correlation Analysis.

    🔍 Why Use This Dataset?

    Clean & Preprocessed: Minimal missing values, outliers handled.

    Feature-Rich: Combines flight + weather data for robust analysis.

    Benchmark Ready: Compatible with Kaggle kernels for easy experimentation.
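
    This is a minimal end-to-end sketch of use case 1; the file name and the Delayed target column are hypothetical, since the listing leaves the exact schema unspecified:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    df = pd.read_csv('airline_delays.csv')            # hypothetical file name
    X = pd.get_dummies(df.drop(columns=['Delayed']))  # hypothetical target column
    y = df['Delayed']

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = RandomForestClassifier().fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))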

  18. Llama 3.1 8B Correct Labels

    • kaggle.com
    zip
    Updated Aug 26, 2025
    Cite
    Jatin Mehra_666 (2025). Llama 3.1 8B Correct Labels [Dataset]. https://www.kaggle.com/datasets/jatinmehra666/llama-3-1-8b-correct-labels
    Explore at:
    Available download formats: zip (11853454078 bytes)
    Dataset updated
    Aug 26, 2025
    Authors
    Jatin Mehra_666
    Description

    Training code:

    ```python
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    import os
    import pandas as pd
    import numpy as np

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    TEMP_DIR = "tmp"
    os.makedirs(TEMP_DIR, exist_ok=True)
    train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')

    # Fill missing Misconception values with 'NA'
    train.Misconception = train.Misconception.fillna('NA')

    # Create a combined target label (Category:Misconception)
    train['target'] = train.Category + ":" + train.Misconception

    # Encode target labels to numerical format
    le = LabelEncoder()
    train['label'] = le.fit_transform(train['target'])
    n_classes = len(le.classes_)  # Number of unique target classes
    print(f"Train shape: {train.shape} with {n_classes} target classes")
    print("Train head:")
    train.head()

    # Identify each question's correct answer as the most frequent MC_Answer
    # among rows whose Category starts with 'True'.
    idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
    correct = train.loc[idx].copy()
    correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
    correct = correct.sort_values('c', ascending=False)
    correct = correct.drop_duplicates(['QuestionId'])
    correct = correct[['QuestionId', 'MC_Answer']]
    correct['is_correct'] = 1  # Mark these as correct answers

    # Merge 'is_correct' flag into the main training DataFrame
    train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
    train.is_correct = train.is_correct.fillna(0)

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"

    model = AutoModelForSequenceClassification.from_pretrained(
        Model_Name, num_labels=n_classes, torch_dtype=torch.bfloat16,
        device_map="balanced", cache_dir=TEMP_DIR)

    tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)

    def format_input(row):
        x = "Yes"
        if not row['is_correct']:
            x = "No"
        return (
            f"Question: {row['QuestionText']} "
            f"Answer: {row['MC_Answer']} "
            f"Correct? {x} "
            f"Student Explanation: {row['StudentExplanation']}"
        )

    train['text'] = train.apply(format_input, axis=1)
    print("Example prompt for our LLM:")
    print()
    print(train.text.values[0])

    from datasets import Dataset

    # Split data into training and validation sets
    train_df, val_df = train_test_split(train, test_size=0.2, random_state=42)

    # Convert to Hugging Face Dataset
    COLS = ['text', 'label']

    # Create clean DataFrame with the full training data
    train_df_clean = train[COLS].copy()  # Use 'train' instead of 'train_df'

    # Ensure labels are proper integers
    train_df_clean['label'] = train_df_clean['label'].astype(np.int64)

    # Reset index to ensure clean DataFrame structure
    train_df_clean = train_df_clean.reset_index(drop=True)

    # Create dataset with the full training data
    train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)

    def tokenize(batch):
        """Tokenizes a batch of text inputs."""
        return tokenizer(batch["text"], truncation=True, max_length=256)

    # Apply tokenization to the full dataset
    train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])

    # Add a new padding token
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    # Resize the model's token embeddings to match the new tokenizer
    model.resize_token_embeddings(len(tokenizer))

    # Set the pad token id in the model's config
    model.config.pad_token_id = tokenizer.pad_token_id

    # 2. Clear HF cache after loading
    from huggingface_hub import scan_cache_dir

    # Then clear cache to free ~16GB
    cache_info = scan_cache_dir()
    cache_info.delete_revisions(*[repo.revisions for repo in cache_info.repos]).execute()

    # --- Training Arguments ---
    from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
    import tempfile
    import shutil

    # Ensure temp directories exist
    os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
    os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)

    # --- Training Arguments (FIXED) ---
    training_args = TrainingArguments(
        output_dir=f"{TEMP_DIR}/training_output/",
        do_train=True,
        do_eval=False,
        save_strategy="no",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=5e-5,
        logging_dir=f"{TEMP_DIR}/logs/",
        logging_steps=500,
        bf16=True,
        fp16=False,
        report_to="none",
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        dataloader_pin_memory=False,
        gradient_checkpointing=True,
    )

    # --- Custom Metric Computation (MAP@3) ---
    def compute_map3(eval_pred):
        """Computes Mean Average Precision at 3 (MAP@3) for evaluation."""
        logits, labels = eval_pred
        probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()
        # Get top 3 predicted class indi...
    ```
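
    The listing truncates mid-function here. For reference, a minimal MAP@3 computation over the softmax probabilities might continue along these lines; this is an assumption, not the author's exact code:

    import numpy as np

    def map3(probs, labels):
        # Top-3 class indices per row, highest probability first.
        top3 = np.argsort(-probs, axis=1)[:, :3]
        # Score 1/rank if the true label appears in the top 3, else 0.
        scores = [1.0 / (list(row).index(lab) + 1) if lab in row else 0.0
                  for row, lab in zip(top3, labels)]
        return float(np.mean(scores))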
    
  19. Replication Package: Unboxing Default Argument Breaking Changes in 1 + 2...

    • zenodo.org
    application/gzip
    Updated Jul 15, 2024
    Cite
    João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Daniel Prates; Arthur Bonifácio; Ghizlane El Boussaidi (2024). Replication Package: Unboxing Default Argument Breaking Changes in 1 + 2 Data Science Libraries in Python [Dataset]. http://doi.org/10.5281/zenodo.11584961
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Daniel Prates; Arthur Bonifácio; Ghizlane El Boussaidi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package

    This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".

    Requirements

    We recommend the following requirements to replicate our study:

    1. Internet access
    2. At least 100GB of space
    3. Docker installed
    4. Git installed

    Package Structure

    We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:

    • data-analysis, an R-based Container we used to run our data analysis.
    • data-collection, a Python Container we used to collect Scikit's default arguments and detect them in client applications.
    • database, a Postgres Container we used to store clients' data, obtained from Grotov et al.
    • storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared in both containers.
    • docker-compose.yml, the Docker file that configures all containers used in the package.

    In the remainder of this document, we describe how to set up each container properly.

    Using VSCode to Setup the Package

    We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both data-analysis and data-collection containers. This way you can directly access and run each container inside it without any specific configuration.

    You first need to set up the containers

    $ cd /replication/package/folder
    $ docker-compose build
    $ docker-compose up
    # Wait docker creating and running all containers
    

    Then, you can open them in Visual Studio Code:

    1. Open VSCode in project root folder
    2. Access the command palette and select "Dev Container: Reopen in Container"
      1. Select either Data Collection or Data Analysis.
    3. Start working

    If you want/need a more customized organization, the remainder of this file describes it in detail.

    Longest Road: Manual Package Setup

    Database Setup

    The database container will automatically restore the dump in dump_matroskin.tar on its first launch. To set up and run the container, you should:

    Build an image:

    $ cd ./database
    $ docker build --tag 'dabc-database' .
    $ docker image ls
    REPOSITORY  TAG    IMAGE ID    CREATED     SIZE
    dabc-database latest  b6f8af99c90d  50 minutes ago  18.5GB
    

    Create and enter inside the container:

    $ docker run -it --name dabc-database-1 dabc-database
    $ docker exec -it dabc-database-1 /bin/bash
    root# psql -U postgres -h localhost -d jupyter-notebooks
    jupyter-notebooks=# \dt
           List of relations
     Schema |    Name    | Type | Owner
    --------+-------------------+-------+-------
     public | Cell       | table | root
     public | Code_cell     | table | root
     public | Md_cell      | table | root
     public | Notebook     | table | root
     public | Notebook_features | table | root
     public | Notebook_metadata | table | root
     public | repository    | table | root
    

    If you got the tables list as above, your database is properly setup.

    It is important to mention that this database is extended from the one provided by Grotov et al. Basically, we added three columns to the table Notebook_features (API_functions_calls, defined_functions_calls, and other_functions_calls) containing the function calls performed by each client in the database.

    Data Collection Setup

    This container is responsible for collecting the data to answer our research questions. It has the following structure:

    • dabcs.py, extract DABCs from Scikit Learn source code, and export them to a CSV file.
    • dabcs-clients.py, extract function calls from clients and export them to a CSV file. We rely on a modified version of Matroskin to leverage the function calls. You can find the tool's source code in the `matroskin` directory.
    • Makefile, commands to set up and run both dabcs.py and dabcs-clients.py
    • matroskin, the directory containing the modified version of matroskin tool. We extended the library to collect the function calls performed on the client notebooks of Grotov's dataset.
    • storage, a docker volume where the data-collection should save the exported data. This data will be used later in Data Analysis.
    • requirements.txt, Python dependencies adopted in this module.

    Note that the container will automatically configure this module for you, e.g., install dependencies, configure matroskin, download scikit learn source code, etc. For this, you must run the following commands:

    $ cd ./data-collection
    $ docker build --tag "data-collection" .
    $ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
    $ docker exec -it data-collection-1 /bin/bash
    $ ls
    Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py
    

    If you see the project files, the container is configured correctly.

    Data Analysis Setup

    We use this container to conduct the analysis of the data produced by the Data Collection container. It has the following structure:

    • dependencies.R, an R script containing the dependencies used in our data analysis.
    • data-analysis.Rmd, the R notebook we used to perform our data analysis.
    • datasets, a docker volume pointing to the storage directory.

    Execute the following commands to run this container:

    $ cd ./data-analysis
    $ docker build --tag "data-analysis" .
    $ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-analysis/datasets/ data-analysis
    $ docker exec -it data-analysis-1 /bin/bash
    $ ls
    data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile
    

    If you see the project files, the container is configured correctly.

    A note on storage shared folder

    As mentioned, the storage folder is mounted as a volume and shared between the data-collection and data-analysis containers. We compressed the content of this folder due to space constraints. Therefore, before starting work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the Makefile inside the storage folder.

    $ make unzip # extract files
    $ ls
    clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv
    $ make zip # compress files
    $ ls
    csv-files.tar.gz Makefile
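
    Once extracted, a quick sanity check of the CSVs from Python might look like the sketch below (pandas is an assumption here; the exact columns depend on the exported schema):

    import pandas as pd

    # Confirm each extracted CSV loads and report its shape.
    for name in ["dabcs.csv", "clients-dabcs.csv", "clients-validation.csv"]:
        df = pd.read_csv(name)
        print(f"{name}: {df.shape[0]} rows, {df.shape[1]} columns")
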
  20. Aluminum alloy industrial materials defect

    • figshare.com
    zip
    Updated Dec 3, 2024
    Ying Han; Yugang Wang (2024). Aluminum alloy industrial materials defect [Dataset]. http://doi.org/10.6084/m9.figshare.27922929.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ying Han; Yugang Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study comes from the preliminary competition dataset of the 2018 Guangdong Industrial Intelligent Manufacturing Big Data Intelligent Algorithm Competition organized by Tianchi Feiyue Cloud (https://tianchi.aliyun.com/competition/entrance/231682/introduction). We curated the dataset, removing images that do not meet the requirements of our experiment, and organized all of it into training and testing sets. The images are all 2560×1960 pixels. Before training, all defects are labeled with labelimg and saved as json files; the json files are then converted to txt files. Finally, detection and classification are run on the organized defect dataset.

    Description of the data and file structure

    This is a project based on an enhanced YOLOv8 algorithm for aluminum defect classification and detection. All code has been tested on Windows machines with Anaconda and CUDA-enabled GPUs; the instructions below assume such a system.

    Files and variables

    File: defeat_dataset.zip

    Setup

    Please follow the steps below to set up the project:

    1. Download the project repository, defeat_dataset.zip, from the following location.
    2. Unzip the downloaded archive and move the defeat_dataset folder into the project's main folder.
    3. Make sure your defeat_dataset folder now contains the subfolder quexian_dataset.
    4. Within the folder you should find various subfolders such as addquexian-13, quexian_dataset, new_dataset-13, etc.

    Software

    Set up the Python environment:

    1. Download and install Anaconda.
    2. Open the Anaconda Prompt (on Windows, click Start, search for "Anaconda Prompt", and open it).
    3. Create a new conda environment with Python 3.8 (you can name it whatever you like; for example, yolov8) and activate it:

    $ conda create -n yolov8 python=3.8
    $ conda activate yolov8

    4. Download and install Visual Studio Code.
    5. Install PyTorch based on your system. For Windows/Linux users with a CUDA GPU:

    $ conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge

    6. Install the remaining libraries:

    $ conda install -c anaconda scikit-learn=0.24.1
    $ conda install astropy=4.2.1
    $ conda install -c anaconda pandas=1.2.4
    $ conda install -c conda-forge matplotlib=3.5.3
    $ conda install scipy=1.10.1

    Repeatability

    For PyTorch, it is a well-known fact that there is no guarantee of fully reproducible results across PyTorch versions, individual commits, or platforms. In addition, results may not be reproducible between CPU and GPU executions, even with the same seed. All results in the Analysis Notebook that involve only model evaluation are fully reproducible; however, when the model is trained on a GPU, results vary across machines.

    Access information

    Other publicly accessible locations of the data: https://tianchi.aliyun.com/dataset/public/

    Data was derived from the following sources: https://tianchi.aliyun.com/dataset/140666

    Data availability statement

    The ten base defect classes used in this study come from the Guangdong Industrial Wisdom Big Data Innovation Competition - Intelligent Algorithm Competition rematch; the dataset download link is https://tianchi.aliyun.com/competition/entrance/231682/information?lang=en-us. The official website provides 4,356 images, including single-defect images, multiple-defect images, and defect-free images. We selected only the single-defect and multiple-defect images, 3,233 images in total. The ten defects are non-conductive, effacement, miss bottom corner, orange peel, varicolored, jet, lacquer bubble, jump into a pit, divulge the bottom, and blotch. Each image contains one or more defects, and the defect images all have a resolution of 2560×1920.

    Surveying the literature, we found that most experiments use these ten defect types, so we chose three additional types that differ clearly from the original ten and occur in sufficient numbers, making them suitable for our experiments. The three newly added classes come from the preliminary dataset of the Guangdong Industrial Wisdom Big Data Intelligent Algorithm Competition, downloadable from https://tianchi.aliyun.com/dataset/140666. It contains 3,000 images in total, of which 109, 73, and 43 show the defects bruise, camouflage, and coating cracking, respectively. Finally, the ten defect types from the rematch and the three types selected from the preliminary round are fused into the new dataset examined here.

    In processing the dataset, we tried different division ratios, such as 8:2, 7:3, and 7:2:1. The experimental results differed little across ratios, so we divide the dataset 7:2:1: 70% training, 20% validation, and 10% testing. The random seed is set to 0 so that each training run of the model yields consistent results.

    The mean Average Precision (mAP) was measured three times on the dataset. The runs differed very little; for accuracy, we report the average of the highest and lowest results: the highest was 71.5% and the lowest 71.1%, giving an average detection accuracy of 71.3% for the final experiment.

    All data and images utilized in this research are from publicly available sources, and the original creators have given their consent for these materials to be published in open-access formats.

    The settings for the remaining parameters are as follows: epochs: 200, patience: 50, batch: 16, imgsz: 640, pretrained: true, optimizer: SGD, close_mosaic: 10, iou: 0.7, momentum: 0.937, weight_decay: 0.0005, box: 7.5, cls: 0.5, dfl: 1.5, pose: 12.0, kobj: 1.0, save_dir: runs/train.
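
    These settings map onto the Ultralytics YOLOv8 training API. The sketch below is an approximation only: the study trains an enhanced YOLOv8 variant, and defeat.yaml is a hypothetical dataset config pointing at the 7:2:1 split.

    from ultralytics import YOLO

    # Stock YOLOv8 stand-in for the study's enhanced variant.
    model = YOLO("yolov8n.pt")  # pretrained: true

    # Reported training settings; "defeat.yaml" is a hypothetical data config.
    model.train(
        data="defeat.yaml",
        epochs=200, patience=50, batch=16, imgsz=640,
        optimizer="SGD", close_mosaic=10, iou=0.7,
        momentum=0.937, weight_decay=0.0005,
        box=7.5, cls=0.5, dfl=1.5,  # loss gains as reported
        seed=0,                     # the fixed seed noted above
    )
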
    The defeat_dataset.zip archive is mentioned in the Supporting information section of our manuscript; the underlying data are held at Figshare, DOI: 10.6084/m9.figshare.27922929. The results_images.zip contains the experimental results graphs, and images_1.zip and images_2.zip contain all the images needed to build the manuscript (manuscript.tex).
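
    For reference, the 7:2:1 division with seed 0 described above can be approximated with two chained scikit-learn splits (a sketch; the file names are hypothetical stand-ins for the real images):

    from sklearn.model_selection import train_test_split

    # Hypothetical stand-ins for the 3,233 single- and multi-defect images.
    images = [f"img_{i:04d}.jpg" for i in range(3233)]

    # Carve off the 10% test set first, then split the remainder 7:2
    # (2/9 of the remaining 90% is 20% of the whole).
    trainval, test = train_test_split(images, test_size=0.1, random_state=0)
    train, val = train_test_split(trainval, test_size=2 / 9, random_state=0)
    print(len(train), len(val), len(test))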
