Get ready for an exciting adventure into the world of machine-learning models on Kaggle! Our dataset is like a puzzle waiting to be solved. We've designed it carefully, and it's all about Breast Cancer data. Imagine exploring a treasure trove of numbers that reveal how different models perform. See the magic of advanced methods and colorful graphs that show accuracy, precision, recall, and F1-score. This dataset isn't just numbers – it's an opportunity to challenge yourself, find hidden patterns, and prove your data skills. We've made it just for you, so you can uncover the secrets of machine learning and shine on Kaggle!
The column description includes:
https://choosealicense.com/licenses/other/
This is the sentiment analysis dataset based on IMDB reviews initially released by Stanford University. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/imdb.
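A hedged sketch of loading this copy of the dataset with the Hugging Face datasets library; the repository name is taken from the dataset page above, but the exact split and column names may differ from the original Stanford release:

```python
from datasets import load_dataset

# Repository name from the dataset page; inspect the printed DatasetDict for the actual splits and columns.
imdb = load_dataset("scikit-learn/imdb")
print(imdb)
```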
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset is based on the digits data from the scikit-learn (sklearn) Python library.
There are 1,796 sample images.
Classes: 0 1 2 3 ... 7 8 9 (the ten digits).
Image size: 64 features per sample, i.e. an 8×8 pixel grid.
You can load this dataset as follows:

```python
from sklearn.datasets import load_digits

dataset = load_digits()
x = dataset.data    # note: the original snippet referenced `datasets.data`, but the variable is `dataset`
y = dataset.target
```
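As a quick check of the 8×8 layout, each 64-value row can be viewed as an image (a usage sketch, not part of the original snippet):

```python
import matplotlib.pyplot as plt

# load_digits also exposes the samples pre-reshaped to 8x8 under `dataset.images`
plt.imshow(dataset.images[0], cmap='gray')
plt.title(f"Label: {dataset.target[0]}")
plt.show()
```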
This dataset contains the predicted prices of the asset AUTO-SKLEARN github.com/automl/AUTO-SKLEARN over the next 16 years. The prices are initially calculated with a default 5 percent annual growth rate; after page load, a sliding-scale component lets the user adjust the growth rate to their own positive or negative projections. The maximum adjustable growth rate is +100 percent and the minimum is -100 percent.
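As a rough illustration of the underlying arithmetic, the sketch below compounds a price over 16 years at an adjustable annual growth rate clamped to the stated -100%/+100% bounds; the starting price of 100.0 is a hypothetical placeholder:

```python
def project_prices(start_price: float, growth_rate: float = 0.05, years: int = 16) -> list:
    """Compound a starting price annually; growth_rate is clamped to [-1.0, 1.0] (i.e. -100%..+100%)."""
    rate = max(-1.0, min(1.0, growth_rate))
    return [start_price * (1 + rate) ** year for year in range(years + 1)]

print(project_prices(100.0))          # default 5% annual growth
print(project_prices(100.0, -0.10))   # a user-adjusted negative projection
```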
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data sets are used in a controlled experiment, where two classifiers should be compared. train_a.csv and explain.csv are slices from the original data set. train_b.csv contains the same instances as in train_a.csv, but with feature x1 set to 0 to make it unusable to classifier B.
The original data set was created and split using this Python code:
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, class_sep=0.75, random_state=0)
X *= 100

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
lm = LogisticRegression()
lm.fit(X_train, y_train)
clf_a = lm

clf_b = LogisticRegression()
X2 = X.copy()
X2[:, 0] = 0
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
clf_b.fit(X2_train, y2_train)

X_explain = X_test
y_explain = y_test
```
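For completeness, a minimal sketch of how the described CSV files could be written from these splits; the column names x1, x2 and y are assumptions, and the published files may differ:

```python
import pandas as pd

def to_csv(features, labels, path):
    df = pd.DataFrame(features, columns=['x1', 'x2'])
    df['y'] = labels
    df.to_csv(path, index=False)

to_csv(X_train, y_train, 'train_a.csv')      # slice used to train classifier A
to_csv(X2_train, y2_train, 'train_b.csv')    # same instances, but x1 zeroed out for classifier B
to_csv(X_explain, y_explain, 'explain.csv')  # held-out slice used for explanation
```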
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction Data of Base Models from Auto-Sklearn 1 on 71 classification datasets from the AutoML Benchmark for Balanced Accuracy and ROC AUC.
The files of this figshare item include data that was collected for the paper:
Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoML, Lennart Purucker, Lennart Schneider, Marie Anastacio, Joeran Beel, Bernd Bischl, Holger Hoos, Second International Conference on Automated Machine Learning, 2023.
The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.
In detail, the data contains the predictions of base models on validation and test as produced by running Auto-Sklearn 1 for 4 hours. Such prediction data is included for each model produced by Auto-Sklearn 1 on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.
The data was collected by code created for the paper and is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23613624.
Its usage is intended for but not limited to using assembled to evaluate post hoc ensembling methods for AutoML.
Details: The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.
The link resolves to a directory containing the following:
example_metatasks: contains an example metatask for test purposes before committing to downloading all files.
metatasks_roc_auc.zip: The Metatasks obtained by running Auto-Sklearn 1 for ROC AUC.
metatasks_bacc.zip: The Metatasks obtained by running Auto-Sklearn 1 for Balanced Accuracy.
The sizes after unzipping are approximately:
metatasks_roc_auc.zip: ~450 GB
metatasks_bacc.zip: ~330 GB
We suggest extracting only files that are of interest from the .zip archive, as these can be much smaller in size and might suffice for experiments.
The metatask .zip files contain 2 subdirectories for Metatasks produced based on TopN or SiloTopN pruning (see the paper for details). In each of these subdirectories, 2 files exist per metatask: one .json file with metadata information and one .hdf or .csv file containing the prediction data. The details on how this should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metatasks, we advise looking at the file content and metadata individually or parsing them by using Metatasks first.
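Since the archives are large and only some files may be needed, a minimal sketch of selectively extracting members with Python's standard zipfile module; the member-path filter below is hypothetical, so inspect the archive listing first:

```python
import zipfile

with zipfile.ZipFile('metatasks_bacc.zip') as archive:
    # List the archive and pull out only the files of interest, e.g. one pruning variant.
    wanted = [name for name in archive.namelist() if name.startswith('TopN/')]
    archive.extractall(path='metatasks_bacc_subset', members=wanted)
```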
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD1, Ik Hee Ryu, MD, MS2, Tae Keun Yoo, MD2, Jung Sub Kim MD2, In Sik Lee, MD, PhD2, Jin Kook Kim MD2, Wakako Ando CO3, Nobuyuki Shoji, MD, PhD3, Tomofusa Yamauchi, MD, PhD4, Hitoshi Tabuchi, MD, PhD4. Author Affiliation: 1Visual Physiology, School of Allied Health Sciences, Kitasato University, Kanagawa, Japan, 2B&VIIT Eye Center, Seoul, Korea, 3Department of Ophthalmology, School of Medicine, Kitasato University, Kanagawa, Japan, 4Department of Ophthalmology, Tsukazaki Hospital, Hyogo, Japan.
We hypothesize that machine learning on preoperative biometric data obtained by AS-OCT may be clinically useful for predicting the actual ICL vault. Therefore, we built a machine learning model using Random Forest regression to predict the ICL vault after surgery.
This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).
This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.
Python version:
```python
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

from google.colab import auth
auth.authenticate_user()
from google.colab import drive
drive.mount('/content/gdrive')

dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
dataset.head()

y = dataset['Vault_1M']
X = dataset.drop(['Vault_1M'], axis=1)

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

parameters = {'bootstrap': True,
              'min_samples_leaf': 3,
              'n_estimators': 500,
              'criterion': 'mae',  # a comma was missing here in the original; 'mae' is named 'absolute_error' in newer scikit-learn
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6,
              'max_leaf_nodes': None}

RF_model = RandomForestRegressor(**parameters)
RF_model.fit(train_X, train_y)
RF_predictions = RF_model.predict(test_X)
importance = RF_model.feature_importances_
```
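A brief sketch of how the held-out predictions could be scored; the paper's own evaluation metrics may differ:

```python
from sklearn.metrics import mean_absolute_error, r2_score

print("MAE:", mean_absolute_error(test_y, RF_predictions))
print("R^2:", r2_score(test_y, RF_predictions))
```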
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris data set
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df['species'] = iris_df['species'].apply(lambda x: iris.target_names[x])
```
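With seaborn and matplotlib already imported, a quick way to eyeball class separation (a usage sketch, not part of the original snippet):

```python
sns.pairplot(iris_df, hue='species')
plt.show()
```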
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.
| Aspect | Description | Notes |
|---|---|---|
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |
✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases
```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load and prepare
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)

# Cluster
kmeans = KMeans(n_clusters=5, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze (numeric_only avoids errors on the string columns in recent pandas)
print(df.groupby('cluster').mean(numeric_only=True))
```
After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization (see the sketch below)
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
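A minimal sketch of item 2, projecting the scaled features to two principal components and colouring by the K-Means labels from the snippet above (assumes `X_scaled` and `df['cluster']` already exist):

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

plt.scatter(components[:, 0], components[:, 1], c=df['cluster'], cmap='viridis', s=20)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Cities projected onto the first two principal components')
plt.show()
```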
| Cluster | Characteristics | Example Cities |
|---|---|---|
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |
Unlike random synthetic data, this dataset was carefully engineered with: - ✨ Realistic correlation structures based on urban research - 🌍 Regional characteristics matching real-world patterns - 🎯 Optimal cluster separability (validated via silhouette scores) - 📚 Comprehensive documentation and starter code
✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights
This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.
Happy Clustering! 🎉
```python
import pandas as pd
import numpy as np
```
PERFORMING EDA
```python
# `data` is assumed to be the diabetes DataFrame, already loaded (e.g. with pd.read_csv).
data.head()
data.info()

attributes_data = data.iloc[:, 1:]  # all predictor columns (the first column is the target)
attributes_data

attributes_data.describe()
attributes_data.corr()

import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = attributes_data.corr()
plt.figure(figsize=(18, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
```
CHECKING IF DATASET IS LINEAR OR NON-LINEAR
```python
correlations = data.corr()['Diabetes_binary'].drop('Diabetes_binary')

plt.figure(figsize=(10, 6))
correlations.plot(kind='bar')
plt.xlabel('Predictor Columns')
plt.ylabel('Correlation values')
plt.title('Correlation between Diabetes_binary and Predictors')
plt.show()
```
CHECKING FOR NULL AND MISSING VALUES, CLEANING THEM
```python
print(data.isnull().sum())
print(data.isna().sum())
```
LASSO
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, KFold

X = data.iloc[:, 1:]
y = data.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 500 candidate alphas between 0.00001 and 10
# (the original np.arange(0.00001, 10, 500) uses 500 as the step and yields a single value)
parameters = {"alpha": np.linspace(0.00001, 10, 500)}

kfold = KFold(n_splits=10, shuffle=True, random_state=42)

lassoReg = Lasso()

lasso_cv = GridSearchCV(lassoReg, param_grid=parameters, cv=kfold)

lasso_cv.fit(X, y)

print("Best Params {}".format(lasso_cv.best_params_))

column_names = list(data)
column_names = column_names[1:]
column_names

lassoModel = Lasso(alpha=0.00001)
lassoModel.fit(X_train, y_train)
lasso_coeff = np.abs(lassoModel.coef_)  # making all coefficients positive
plt.bar(column_names, lasso_coeff, color='orange')
plt.xticks(rotation=90)
plt.grid()
plt.title("Feature Selection Based on Lasso")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.ylim(0, 0.16)
plt.show()
```
RFE
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
rfecv = RFECV(estimator=model, step=1, cv=20, scoring="accuracy")
rfecv = rfecv.fit(X_train, y_train)

# Mean cross-validated accuracy per candidate number of features
# (the original read rfecv.rankin_/rfecv.ranking_, which holds feature ranks, not CV scores).
cv_scores = rfecv.cv_results_["mean_test_score"]
num_features_selected = len(cv_scores)

plt.figure(figsize=(10, 6))
plt.xlabel("Number of features selected")
plt.ylabel("Score (accuracy)")
plt.plot(range(1, num_features_selected + 1), cv_scores, marker='o', color='r')
plt.xticks(range(1, num_features_selected + 1))  # Set x-ticks to integers
plt.grid()
plt.title("RFECV: Number of Features vs. Score (accuracy)")
plt.show()

print("The optimal number of features:", rfecv.n_features_)
print("Best features:", X_train.columns[rfecv.support_])
```
PCA
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = data.drop(["Diabetes_binary"], axis=1)
y = data["Diabetes_binary"]

df1 = pd.DataFrame(data=data, columns=data.columns)
print(df1)

scaling = StandardScaler()
scaling.fit(df1)
Scaled_data = scaling.transform(df1)
principal = PCA(n_components=3)
principal.fit(Scaled_data)
x = principal.transform(Scaled_data)
print(x.shape)

principal.components_

plt.scatter(x[:, 0], x[:, 1], c=data['Diabetes_binary'], cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')

print(principal.explained_variance_ratio_)
```
T-SNE
```python
from sklearn.manifold import TSNE
from numpy import reshape
import seaborn as sns

tsne = TSNE(n_components=3, verbose=1, random_state=42)
z = tsne.fit_transform(X)

df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:, 0]
df["comp-2"] = z[:, 1]
df["comp-3"] = z[:, 2]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                palette=sns.color_palette("husl", 2),
                data=df).set(title="Diabetes data T-SNE projection")
```
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
```python
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, f1_score
import re
import math
from collections import defaultdict, Counter

# Load and preprocess data
def load_data(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            label, text = line.strip().split('\t')
            label = label.lower()
            text = re.sub(r'[^\w\s]', '', text.lower())  # remove punctuation
            …
```
See the full description on the dataset page: https://huggingface.co/datasets/mcurry20/1234567.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The methodology is the core component of any research work: it describes the methods used to obtain the results. Here, the entire implementation is done in Python. The work involves the following steps:
1. Acquire Personality Dataset
Kaggle hosts a collection of datasets and data generators used by the machine-learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive on-line personality test constructed from the IPIP (International Personality Item Pool). The dataset can be downloaded as a zip file by clicking the link provided. It consists of two CSV files (test.csv & train.csv). The test.csv file has no missing values, 7 attributes, and a final label output, and the data has multivariate characteristics. Data preprocessing is then performed to check for inconsistent behaviour or trends.
2. Data preprocessing
After data acquisition, the next step is to clean and preprocess the data. The available dataset has numerical features. The target value is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is then split into training and testing sets by passing the feature values, target values, and test size to scikit-learn's train_test_split. After splitting, the training data is used to fit the Logistic Regression and SVM models, and the test data is used to estimate the accuracy of the trained models.
3. Feature Extraction
The following items were presented on one page and each was rated on a five point scale using radio buttons. The order on page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree
EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I get irritated easily.
EST3 I worry about things.
EST4 I change my mood a lot.
AGR1 I have a soft heart.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I am not really interested in others.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I follow a schedule.
CSN4 I make a mess of things.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I do not have a good imagination.
OPN4 I use difficult words.
4. Training the Model
Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets, a training set and a testing set: here, 80% for training and 20% for testing. You train the model on the training set. In this work we trained on the dataset using linear_model.LogisticRegression() and svm.SVC() from the sklearn package.
5. Personality Prediction Output
After training, the Logistic Regression and SVM models are evaluated on the test data using cohen_kappa_score and accuracy_score, as sketched below.
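A minimal sketch of the train-and-evaluate loop described above, assuming `X` and `y` hold the questionnaire features and personality labels (variable names are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)), ("SVM", SVC())]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, y_pred),
          "kappa:", cohen_kappa_score(y_test, y_pred))
```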
The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.
Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.
The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.
To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.
The below code snippet illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:

```python
import itertools
import sklearn.metrics.pairwise

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
instruction = 'Represent this sentence for searching relevant passages: '

# Set streaming to False if you wish to load the entire dataset into memory (unadvised unless you have at least 64 GB of RAM).
oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True)

sample = list(itertools.islice(oale, 100000))

query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
most_similar_index = similarities.argmax()
most_similar = sample[most_similar_index]

print(most_similar['text'])
```
To speed up the loading of the Embeddings, you may wish to install orjson.
The Embeddings are stored in data/embeddings.jsonl, a json lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.
The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, which is a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).
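A short sketch of reading the three JSON Lines files in parallel with orjson; the per-line structure of texts.jsonl and metadatas.jsonl is assumed to be a JSON-encoded string and object, respectively:

```python
import orjson

with open('data/embeddings.jsonl', 'rb') as emb_f, \
     open('data/texts.jsonl', 'rb') as txt_f, \
     open('data/metadatas.jsonl', 'rb') as meta_f:
    for emb_line, txt_line, meta_line in zip(emb_f, txt_f, meta_f):
        embedding = orjson.loads(emb_line)   # list of 384 floats
        text = orjson.loads(txt_line)
        metadata = orjson.loads(meta_line)
        break  # inspect just the first record
```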
All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks up to 512-tokens-long (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. These chunks included a header embedding documents' titles, jurisdictions and types in the following format:
```
Title: {title}
Jurisdiction: {jurisdiction}
Type: {type}
{text}
```
The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.
The resulting embeddings were serialised as json-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.
The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...
The original dataset can be found in this competition. I found the PNG ROI images here, and then created a subset of that dataset (only 10% of the data) to get faster iterations per epoch.
Details of the experiments are given in the paper titled 'Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks.' For additional details, please see https://sites.google.com/view/supercomplex/super-complex-v3-0. Supporting code is available on GitHub at: https://github.com/marcottelab/super.complex

Details of the files provided for each experiment are given below.

Toy network experiment

Input data:
- Toy network, available as a weighted edge list. Format: node1 node2 edge-weight
- All raw toy communities, available as node lists. Format: node1 node2 node3 ... Each line represents a community.

Intermediate output results:
- Training toy communities, available as node lists. Format: node1 node2 node3 ... Each line represents a community.
- Testing toy communities, available as node lists. Format: node1 node2 node3 ... Each line represents a community.
- Training toy communities feature matrix, available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative community. Format: density, number of nodes, degree statistics (maximum, mean, median and variance), clustering coefficient (CC) statistics (maximum, mean and variance), edge weight statistics (mean, maximum and variance), degree correlation (DC) statistics (mean, variance and maximum), 3 singular values of the subgraph's adjacency matrix, label (positive or negative community, indicated by 1 or 0 respectively).
- Testing toy communities feature matrix, available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative community, with the same format as the training feature matrix.

Output results:
- Trained toy community fitness function, available as pickled files of a data pre-processor and a machine learning model from sklearn, which can be imported in Python using the pickle module, e.g. import pickle; model = pickle.load(open(filename, 'rb'))
- Learned toy communities, available as node lists. Format: node1 node2 node3 ... nodeN Score. Each line represents a community; the score is the community fitness function of the community.
- Learned toy communities, available as edge lists. Format: node1 node2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one community from another community's edges.

hu.MAP experiment

Input data:
- hu.MAP PPI (protein-protein interaction) network, available as a weighted edge list. Format: gene_ID1 gene_ID2 edge-weight
- All raw human protein complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 ... Each line represents a protein complex.

Intermediate output results:
- Training complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 ... Each line represents a protein complex.
- Testing complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 ... Each line represents a protein complex.
- Training data, i.e. the feature matrix of CORUM complexes (with edge weights from the hu.MAP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics (maximum, mean, median and variance), clustering coefficient (CC) statistics (maximum, mean and variance), edge weight statistics (mean, maximum and variance), degree correlation (DC) statistics (mean, variance and maximum), 3 singular values of the subgraph's adjacency matrix, label (positive or negative protein complex, indicated by 1 or 0 respectively).
- Testing data, i.e. the feature matrix of CORUM complexes (with edge weights from the hu.MAP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex, with the same format as the training data.

Output results:
- Trained community fitness function of CORUM complexes (with edge weights from hu.MAP), available as pickled files of a data pre-processor and a machine learning model from sklearn, which can be imported in Python using the pickle module, e.g. import pickle; model = pickle.load(open(filename, 'rb'))
- Learned protein complexes from the hu.MAP PPI network, available as node lists. Format: Excel file, where...
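A hedged sketch of loading the pickled fitness-function files described above; the file names are hypothetical placeholders, and the repository documents the actual paths:

```python
import pickle

# Hypothetical file names; substitute the pickled pre-processor and sklearn model shipped with each experiment.
with open('preprocessor.pkl', 'rb') as f:
    preprocessor = pickle.load(f)
with open('fitness_model.pkl', 'rb') as f:
    fitness_model = pickle.load(f)

# A candidate community's feature vector (ordered as in the Format description) could then be scored, e.g.:
# score = fitness_model.predict_proba(preprocessor.transform([feature_vector]))
```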
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
What follows is research code. It is by no means optimized for speed, efficiency, or readability.
Data loading, tokenizing and sharding
```python
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from tqdm.notebook import tqdm
from openTSNE import TSNE
import datashader as ds
import colorcet as cc

from dask.distributed import Client
import dask.dataframe as dd
import dask_ml
…
```
See the full description on the dataset page: https://huggingface.co/datasets/christopher/roots-tsne-data.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Airline Delay Prediction Dataset: A Machine Learning-Ready Dataset for Flight Delay Analysis and Predictive Modeling
📌 Dataset Overview
This dataset provides historical flight data curated to analyze and predict airline delays using machine learning. It includes key features such as flight schedules, weather conditions, and delay causes, making it ideal for:
🚀 ML model training (binary classification: delayed/not delayed).
📈 Trend analysis (e.g., weather impact, airline performance).
🎯 Academic research or industry applications.
📂 Data Specifications
Format: CSV (ready for pandas/scikit-learn).
Size: [X] thousand records (covers [Year Range]).
Variables:
Flight details: Departure/arrival times, airline, aircraft type.
Delay causes: Weather, technical issues, security, etc.
Weather data: Temperature, visibility, wind speed.
Target variable: Delay status (e.g., Delayed: Yes/No or Delay_minutes).
🎯 Potential Use Cases
1. Predictive Modeling:
```python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier().fit(X_train, y_train)
```
2. Airline Performance Benchmarking.
3. Weather-Delay Correlation Analysis.

🔍 Why Use This Dataset?
Clean & Preprocessed: Minimal missing values, outliers handled.
Feature-Rich: Combines flight + weather data for robust analysis.
Benchmark Ready: Compatible with Kaggle kernels for easy experimentation.
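An end-to-end sketch under assumptions: the file name, the Delayed target column, and the feature selection below are placeholders for whatever the actual CSV provides:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv('airline_delays.csv')                    # hypothetical file name
y = (df['Delayed'] == 'Yes').astype(int)                  # hypothetical binary target column
X = df.select_dtypes('number').drop(columns=['Delay_minutes'], errors='ignore')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```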
Training code:
```python
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import os
import pandas as pd
import numpy as np

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
TEMP_DIR = "tmp"
os.makedirs(TEMP_DIR, exist_ok=True)
train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')

train.Misconception = train.Misconception.fillna('NA')
train['target'] = train.Category + ":" + train.Misconception

le = LabelEncoder()
train['label'] = le.fit_transform(train['target'])
n_classes = len(le.classes_)  # Number of unique target classes
print(f"Train shape: {train.shape} with {n_classes} target classes")
print("Train head:")
train.head()

idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
correct = train.loc[idx].copy()
correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
correct = correct.sort_values('c', ascending=False)
correct = correct.drop_duplicates(['QuestionId'])
correct = correct[['QuestionId', 'MC_Answer']]
correct['is_correct'] = 1  # Mark these as correct answers

train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
train.is_correct = train.is_correct.fillna(0)

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"

model = AutoModelForSequenceClassification.from_pretrained(Model_Name,
                                                           num_labels=n_classes,
                                                           torch_dtype=torch.bfloat16,
                                                           device_map="balanced",
                                                           cache_dir=TEMP_DIR)

tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)

def format_input(row):
    x = "Yes"
    if not row['is_correct']:
        x = "No"
    return (
        f"Question: {row['QuestionText']} "
        f"Answer: {row['MC_Answer']} "
        f"Correct? {x} "
        f"Student Explanation: {row['StudentExplanation']}"
    )

train['text'] = train.apply(format_input, axis=1)
print("Example prompt for our LLM:")
print()
print(train.text.values[0])

from datasets import Dataset

COLS = ['text', 'label']

train_df_clean = train[COLS].copy()  # Use 'train' instead of 'train_df'
train_df_clean['label'] = train_df_clean['label'].astype(np.int64)
train_df_clean = train_df_clean.reset_index(drop=True)
train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)

def tokenize(batch):
    """Tokenizes a batch of text inputs."""
    return tokenizer(batch["text"], truncation=True, max_length=256)

train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

import os
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
cache_info.delete_revisions(*[repo.revisions for repo in cache_info.repos]).execute()

from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
import tempfile
import shutil

os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)
training_args = TrainingArguments(
output_dir=f"{TEMP_DIR}/training_output/",
do_train=True,
do_eval=False,
save_strategy="no",
num_train_epochs=3,
per_device_train_batch_size=16,
learning_rate=5e-5,
logging_dir=f"{TEMP_DIR}/logs/",
logging_steps=500,
bf16=True,
fp16=False,
report_to="none",
warmup_ratio=0.1,
lr_scheduler_type="cosine",
dataloader_pin_memory=False,
gradient_checkpointing=True,
)

def compute_map3(eval_pred):
    """
    Computes Mean Average Precision at 3 (MAP@3) for evaluation.
    """
    logits, labels = eval_pred
    probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()

    # Get top 3 predicted class indi...
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication Package
This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".
Requirements
We recommend the following requirements to replicate our study:
Package Structure
We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:
- data-analysis, an R-based container we used to run our data analysis.
- data-collection, a Python container we used to collect Scikit's default arguments and detect them in client applications.
- database, a Postgres container we used to store clients' data, obtained from Grotov et al.
- storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared by both containers.
- docker-compose.yml, the Docker file that configures all containers used in the package.

In the remainder of this document, we describe how to set up each container properly.
Using VSCode to Setup the Package
We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both data-analysis and data-collection containers. This way you can directly access and run each container inside it without any specific configuration.
You first need to set up the containers
$ cd /replication/package/folder
$ docker-compose build
$ docker-compose up
# Wait docker creating and running all containers
Then, you can open them in Visual Studio Code:
If you want/need a more customized organization, the remainder of this file describes it in detail.
Longest Road: Manual Package Setup
Database Setup
The database container will automatically restore the dump in dump_matroskin.tar on its first launch. To set up and run the container, you should:
Build an image:
$ cd ./database
$ docker build --tag 'dabc-database' .
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
dabc-database latest b6f8af99c90d 50 minutes ago 18.5GB
Create and enter inside the container:
$ docker run -it --name dabc-database-1 dabc-database
$ docker exec -it dabc-database-1 /bin/bash
root# psql -U postgres -h localhost -d jupyter-notebooks
jupyter-notebooks=# \dt
List of relations
Schema | Name | Type | Owner
--------+-------------------+-------+-------
public | Cell | table | root
public | Code_cell | table | root
public | Md_cell | table | root
public | Notebook | table | root
public | Notebook_features | table | root
public | Notebook_metadata | table | root
public | repository | table | root
If you got the tables list as above, your database is properly setup.
It is important to mention that this database is extended from the one provided by Grotov et al. Basically, we added three columns to the table Notebook_features (API_functions_calls, defined_functions_calls, and other_functions_calls) containing the function calls performed by each client in the database.
Data Collection Setup
This container is responsible for collecting the data to answer our research questions. It has the following structure:
- dabcs.py, extracts DABCs from the Scikit Learn source code and exports them to a CSV file.
- dabcs-clients.py, extracts function calls from clients and exports them to a CSV file. We rely on a modified version of Matroskin to leverage the function calls. You can find the tool's source code in the matroskin directory.
- Makefile, commands to set up and run both dabcs.py and dabcs-clients.py.
- matroskin, the directory containing the modified version of the matroskin tool. We extended the library to collect the function calls performed in the client notebooks of Grotov's dataset.
- storage, a docker volume where data-collection should save the exported data. This data will be used later in Data Analysis.
- requirements.txt, Python dependencies adopted in this module.

Note that the container will automatically configure this module for you, e.g., install dependencies, configure matroskin, download the scikit-learn source code, etc. For this, you must run the following commands:
$ cd ./data-collection
$ docker build --tag "data-collection" .
$ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
$ docker exec -it data-collection-1 /bin/bash
$ ls
Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py
If you see project files, it means the container is configured accordingly.
Data Analysis Setup
We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:
- dependencies.R, an R script containing the dependencies used in our data analysis.
- data-analysis.Rmd, the R notebook we used to perform our data analysis.
- datasets, a docker volume pointing to the storage directory.

Execute the following commands to run this container:
$ cd ./data-analysis
$ docker build --tag "data-analysis" .
$ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-collection/datasets/ data-analysis
$ docker exec -it data-analysis-1 /bin/bash
$ ls
data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile
If you see project files, it means the container is configured accordingly.
A note on storage shared folder
As mentioned, the storage folder is mounted as a volume and shared between the data-collection and data-analysis containers. We compressed the contents of this folder due to space constraints. Therefore, before starting work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the Makefile inside the storage folder.
$ make unzip # extract files
$ ls
clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv
$ make zip # compress files
$ ls
csv-files.tar.gz Makefile
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study comes from the preliminary competition dataset of the 2018 Guangdong Industrial Intelligent Manufacturing Big Data Intelligent Algorithm Competition organized by Tianchi Feiyue Cloud (https://tianchi.aliyun.com/competition/entrance/231682/introduction). We selected from the dataset, removing images that do not meet the requirements of our experiment, and split all data for training and testing. The image resolution is 2560×1960 pixels. Before training, all defects need to be labeled using labelImg and saved as json files; the json files are then converted to txt files. Finally, the organized defect dataset is used for detection and classification.

Description of the data and file structure
This is a project based on an enhanced YOLOv8 algorithm for aluminum defect classification and detection tasks. All code has been tested on Windows computers with Anaconda and CUDA-enabled GPUs. The following instructions allow users to run the code in this repository on a Windows + CUDA GPU system.

Files and variables
File: defeat_dataset.zip
Description:

Setup
Please follow the steps below to set up the project.

Download the project repository
1. Download the project repository defeat_dataset.zip from the location given below.
2. Unzip it and navigate to the project folder; it should contain a subfolder: quexian_dataset

Download the data
1. Download the data: defeat_dataset.zip
2. Unzip the downloaded data and move the 'defeat_dataset' folder into the project's main folder.
3. Make sure that your defeat_dataset folder now contains a subfolder: quexian_dataset.
4. Within the folder you should find various subfolders such as addquexian-13, quexian_dataset, new_dataset-13, etc.

Software
Set up the Python environment:
1. Download and install Anaconda.
2. Once Anaconda is installed, open the Anaconda Prompt. On Windows, click Start, search for Anaconda Prompt, and open it.
3. Create a new conda environment with Python 3.8. You can name it whatever you like, for example yolov8: conda create -n yolov8 python=3.8
4. Activate the created environment. If the name is yolov8, enter: conda activate yolov8
5. Download and install Visual Studio Code.
6. Install PyTorch based on your system. For Windows/Linux users with a CUDA GPU: conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
7. Install some necessary libraries:
- Install scikit-learn: conda install anaconda scikit-learn=0.24.1
- Install astropy: conda install astropy=4.2.1
- Install pandas: conda install anaconda pandas=1.2.4
- Install Matplotlib: conda install conda-forge matplotlib=3.5.3
- Install scipy: conda install scipy=1.10.1

Repeatability
For PyTorch, it is a well-known fact that there is no guarantee of fully reproducible results between PyTorch versions, individual commits, or different platforms. In addition, results may not be reproducible between CPU and GPU executions, even if the same seed is used. All results in the Analysis Notebook that involve only model evaluation are fully reproducible.
However, when it comes to updating the model on the GPU, the results of model training on different machines vary.

Access information
Other publicly accessible locations of the data: https://tianchi.aliyun.com/dataset/public/
Data was derived from the following sources: https://tianchi.aliyun.com/dataset/140666

Data availability statement
The ten datasets used in this study come from the Guangdong Industrial Wisdom Big Data Innovation Competition - Intelligent Algorithm Competition Rematch, and the dataset download link is https://tianchi.aliyun.com/competition/entrance/231682/information?lang=en-us. The official website provides 4,356 images, including single-defect images, multi-defect images and defect-free images. We selected only the single-defect and multi-defect images, which are 3,233 images in total. The ten defects are non-conductive, effacement, miss bottom corner, orange peel, varicolored, jet, lacquer bubble, jump into a pit, divulge the bottom and blotch. Each image contains one or more defects, and the resolution of the defect images is 2560×1920.

By investigating the literature, we found that most experiments were done with 10 types of defects, so we chose three additional defect types that differ most from these ten and are sufficiently numerous, making them suitable for the experiments. The three newly added defect types come from the preliminary dataset of the Guangdong Industrial Wisdom Big Data Intelligent Algorithm Competition, which can be downloaded from https://tianchi.aliyun.com/dataset/140666. It contains 3,000 images in total, among which 109, 73 and 43 images show the defects bruise, camouflage and coating cracking, respectively. Finally, the 10 defect types from the rematch and the 3 defect types selected from the preliminary round are fused into a new dataset, which is examined in this study.

In processing the dataset, we tried different division ratios, such as 8:2, 7:3 and 7:2:1. After testing, we found that the experimental results did not differ much between division ratios. Therefore, we divide the dataset in the ratio 7:2:1: the training set accounts for 70%, the validation set for 20%, and the testing set for 10%. The random number seed is set to 0 to ensure that the results are consistent every time the model is trained.

Finally, the mean Average Precision (mAP) metric was measured on the dataset a total of three times. The results differed very little each time; for the accuracy of the experimental results, we took the average of the highest and lowest results. The highest was 71.5% and the lowest 71.1%, resulting in an average detection accuracy of 71.3% for the final experiment.

All data and images utilized in this research are from publicly available sources, and the original creators have given their consent for these materials to be published in open-access formats.

The settings for the other parameters are as follows: epochs: 200, patience: 50, batch: 16, imgsz: 640, pretrained: true, optimizer: SGD, close_mosaic: 10, iou: 0.7, momentum: 0.937, weight_decay: 0.0005, box: 7.5, cls: 0.5, dfl: 1.5, pose: 12.0, kobj: 1.0, save_dir: runs/train

The defeat_dataset (ZIP) is mentioned in the Supporting information section of our manuscript. The underlying data are held at Figshare, DOI: 10.6084/m9.figshare.27922929. The results_images.zip contains the experimental results graphs. The images_1.zip and images_2.zip contain all the images needed to generate the manuscript.tex manuscript.
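As an illustration of how the listed hyperparameters map onto a YOLOv8 training run, a hedged sketch using the ultralytics API; the weights file and dataset YAML are placeholders, the project's own modified YOLOv8 and scripts should take precedence:

```python
from ultralytics import YOLO

# Placeholder weights; the project uses an enhanced YOLOv8 variant rather than the stock model.
model = YOLO('yolov8n.pt')
model.train(
    data='defect_dataset.yaml',  # hypothetical dataset config describing the 7:2:1 split
    epochs=200, patience=50, batch=16, imgsz=640,
    optimizer='SGD', momentum=0.937, weight_decay=0.0005,
    close_mosaic=10, iou=0.7, box=7.5, cls=0.5, dfl=1.5,
    pose=12.0, kobj=1.0, seed=0, pretrained=True,
)
```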