License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
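Where a quick baseline helps, here is a minimal sketch using the variables above; it assumes the standard Kaggle file names (train.csv, test.csv) and column spellings (Pclass, Sex, SibSp, Parch, Survived, PassengerId).

```python
# A minimal baseline sketch, assuming the standard Kaggle file names and
# column spellings (Pclass, Sex, SibSp, Parch, Survived, PassengerId).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train[features])       # one-hot encodes Sex
X_test = pd.get_dummies(test[features])

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X, train["Survived"])

pd.DataFrame({"PassengerId": test["PassengerId"],
              "Survived": model.predict(X_test)}).to_csv("submission.csv", index=False)
```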
This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public-supply water use for the period 2000-2020. This data release contains model input feature datasets, the Python code used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public-supply water use estimates and statistics files for HUC12s are available on this child item landing page; estimates and statistics for WSAs are available in public_water_use_model.zip.

This page includes the following files:
- PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public-supply total water use from 2000-2020 by HUC12, in million gallons per day
- PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public-supply groundwater use for 2000-2020 by HUC12, in million gallons per day
- PS_HUC12_SW_2000_2020.csv - a csv file with estimated monthly public-supply surface water use for 2000-2020 by HUC12, in million gallons per day
- STAT_PS_HUC12_Tot_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public-supply total water use from 2000-2020
- STAT_PS_HUC12_GW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public-supply groundwater use for 2000-2020
- STAT_PS_HUC12_SW_2000_2020.csv - a csv file with statistics by HUC12 for the estimated monthly public-supply surface water use for 2000-2020
- public_water_use_model.zip - a zip file containing input datasets, scripts, and output datasets for the public-supply water use machine learning model
- version_history_MLmodel.txt - a txt file describing changes in this version

Notes: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC.
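As a quick sanity check, one way to load the HUC12 estimates with pandas (a sketch; only the HUC12 identifier column is assumed, so inspect the header for the actual layout):

```python
import pandas as pd

# Keep HUC12 codes as strings so leading zeros are not lost; all other
# column names should be checked against the actual file header.
tot = pd.read_csv("PS_HUC12_Tot_2000_2020.csv", dtype={"HUC12": str})
print(tot.shape)
print(tot.head())
```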
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Overview
This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.
Key Points
- Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraud.
- Multiple Feature Files: Combine them by matching on id or Group.
- Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
- Goal: Predict which transactions in test_share.csv might be fraudulent.
Files
- train.csv: transactions with the Target column (0 = Clean, 1 = Fraud; ~1% fraud).
- test_share.csv: same structure as train.csv but without the Target column.
- Geo_scores.csv: location-based scores (match on id).
- Lambda_wts.csv: proprietary weights for grouping (match on Group).
- Qset_tats.csv: network turn-around times (match on id).
- instance_scores.csv: vulnerability scores (match on id).

Suggested workflow: merge the feature files (Geo_scores.csv, Lambda_wts.csv, etc.) into train.csv by matching on id or Group, train and validate on train.csv, then predict on test_share.csv or your own external data. Possible Tools:
- Python: pandas, NumPy, scikit-learn
- Imbalance Handling: SMOTE, Random Oversampler, or class weights
- Metrics: Precision, Recall, F1-score, ROC-AUC, etc.
Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!
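As a sketch of the suggested workflow (file and column names follow the description above, but verify them against the actual headers; the merge keys and class_weight choice are illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
geo = pd.read_csv("Geo_scores.csv")
lam = pd.read_csv("Lambda_wts.csv")

# Left-join the extra feature files onto the training data
df = train.merge(geo, on="id", how="left").merge(lam, on="Group", how="left")

X = df.drop(columns=["id", "Group", "Target"], errors="ignore").fillna(0)
y = df["Target"]

# class_weight="balanced" is one simple way to handle the ~1% fraud rate
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X, y)
```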
Tags: fraud-detection, classification, imbalanced-data, financial-transactions, machine-learning, python, beginner-friendly
License: CC BY-NC-SA 4.0
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Models from experiments referenced in the paper "Training CNNs with Low-Rank Filters for Efficient Image Classification", https://arxiv.org/abs/1511.06744
Model names differ from those in the paper, but the csv files for each set of experiments relate the name used in the paper to the actual name of the model here:
This data package includes data and scripts from the manuscript “Denoising autoencoder for reconstructing sensor observation data and predicting evapotranspiration: noisy and missing values repair and uncertainty quantification”. The study addressed common challenges in environmental sensing and modeling, including uncertain input data, missing sensor observations, and high-dimensional datasets with interrelated but redundant variables. Point-scale meteorological and soil sensor observations were perturbed with noise and missing values, and denoising autoencoder (DAE) neural networks were developed to reconstruct the perturbed data and further predict evapotranspiration (ET). This study concluded that (1) the reconstruction quality of each variable depends on its cross-correlation and alignment to the underlying data structure, (2) uncertainties from the models were overall stronger than those from the data corruption, and (3) there was a tradeoff between reducing bias and reducing variance when evaluating the uncertainty of the machine learning models.

This package includes:
(1) Four ipython scripts (.ipynb): “DAE_train.ipynb” trains and evaluates DAE neural networks, “DAE_predict.ipynb” makes predictions from the trained DAE models, “ET_train.ipynb” trains and evaluates ET prediction neural networks, and “ET_predict.ipynb” makes predictions from trained ET models.
(2) One python file (.py): “methods.py” includes all user-defined functions and python code used in the ipython scripts.
(3) A “sub_models” folder that includes five trained DAE neural networks (in pytorch format, .pt), which can be used to ingest input data before it is fed to the downstream ET models in “ET_train.ipynb” or “ET_predict.ipynb”.
(4) Two data files (.csv). Daily meteorological, vegetation, and soil data is in “df_data.csv”, and “df_meta.csv” contains the location and time information for “df_data.csv”. Each row (index) in “df_meta.csv” corresponds to the same row in “df_data.csv”. These data files are formatted to follow the data structure requirements of the ipython scripts and can be used in them directly; they have been shuffled chronologically to train machine learning models.

The meteorological and soil data was collected using point sensors between 2019 and 2023 at (4.a) three shrub-dominated field sites in East River, Colorado (named “ph1”, “ph2” and “sg5” in “df_meta.csv”, where “ph1” and “ph2” were located at PumpHouse Hillslopes, and “sg5” was at Snodgrass Mountain meadow) and (4.b) one outdoor, mesoscale, herbaceous-dominated experiment in Berkeley, California (named “tb” in “df_meta.csv”, short for the Smartsoils Testbed at Lawrence Berkeley National Lab).

- See "df_data_dd.csv" and "df_meta_dd.csv" for variable descriptions, and the Methods section for additional data processing steps. See "flmd.csv" and "README.txt" for brief file descriptions.
- All ipython scripts and python files are written in and require PYTHON language software.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ransomware has been considered a significant threat to most enterprises for the past few years. In scenarios where users can access all files on a shared server, one infected host can lock access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2,500 hours of 'not infected' traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). The paper also validates the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.
This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each in a separate folder.
The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.
Each folder (for example N10S10/) contains:
- tree.py -> Python script with the Tree model.
- ensemble.json -> JSON file with the information about the Ensemble model.
- NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2, or 3).
- N10S10.csv -> All samples used for training each model in this folder, in CSV format for use in the BigML application.
- zeroDays.csv -> All zero-day samples used for testing each model in this folder, in CSV format for use in the BigML application.
- userSamples_test -> All samples used for validating each model in this folder, in CSV format for use in the BigML application.
- userSamples_train -> User samples used for training the models.
- ransomware_train -> Ransomware samples used for training the models.
- scaler.scaler -> Standard scaler from the Python library, used to scale the samples.
- zeroDays_notFiltered -> Folder with the zero-day samples.
In the N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3, and NFS traffic traces. It contains more binaries than presented in the article because some of them are not "unseen" binaries (their families are present in the training set).
The files containing samples (NxSy.csv, zeroDays.csv, and userSamples_test.csv) are structured as follows:
- Each line is one sample.
- Each sample has 3*T features plus the label (1 for an 'infected' sample, 0 otherwise).
- The features are separated by ',' because it is a CSV file.
- The last column is the label of the sample.
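A minimal loading sketch under the layout just described (no header row is assumed):

```python
import pandas as pd

# No header row is assumed; the last column is the label.
df = pd.read_csv("N10S10/N10S10.csv", header=None)
X = df.iloc[:, :-1]   # 3*T features per sample
y = df.iloc[:, -1]    # 1 = 'infected', 0 = 'not infected'
print(X.shape, y.value_counts())
```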
Additionally, we have placed two pcap files in the root directory. These are the traces used to compare the two SMB versions.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Context The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).
The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:
A clean, pre-defined 80/20 train-test split.
Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.
A flat directory structure (train/, test/) for simplified file access.
File Content The dataset is organized into a single top-level folder and two CSV files:
train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.
test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.
Caltech-256_Train_Test/: The primary data folder.
train/: This directory contains 80% of the images from all 257 categories, intended for model training.
test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
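A sketch of wiring the manifest into a PyTorch Dataset (paths and transforms are placeholders to adapt):

```python
# Sketch of using the manifest files with a PyTorch Dataset; paths and
# transforms are assumptions to adapt to your setup.
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class Caltech256Manifest(Dataset):
    def __init__(self, csv_path, transform=None):
        self.df = pd.read_csv(csv_path)   # columns: image_path, label
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(row["image_path"]).convert("RGB")
        if self.transform:
            img = self.transform(img)
        return img, row["label"]

train_ds = Caltech256Manifest("train.csv")
```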
Data Split The dataset has been partitioned into a standard 80% training and 20% testing split. The split should be assumed to be stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion across the respective sets.
Acknowledgements & Original Source This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.
Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data
Citation: Griffin, G., Holub, A. D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
At a time when machine learning and deep learning are booming, it is important to understand that all this knowledge is of little use if we cannot apply it to different areas and benefit humanity.
This dataset will help you put your existing knowledge to great use. Its main purpose is to apply that knowledge to the field of medical science and ease the work of physicians. The dataset has 132 parameters from which 42 different types of diseases can be predicted.
All the best !
The complete dataset consists of two CSV files: one for training and one for testing your model.
Each CSV file has 133 columns: 132 of these columns are symptoms that a person experiences, and the last column is the prognosis.
These symptoms map to the 42 diseases to which you can classify a given set of symptoms.
You are required to train your model on the training data and test it on the testing data.
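A minimal sketch of that workflow (the file names Training.csv/Testing.csv and the prognosis column name are assumptions; check your download):

```python
# Minimal sketch, assuming 132 symptom columns plus a final 'prognosis'
# column in both files; file names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("Training.csv")
test = pd.read_csv("Testing.csv")

X_train, y_train = train.drop(columns=["prognosis"]), train["prognosis"]
X_test, y_test = test.drop(columns=["prognosis"]), test["prognosis"]

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```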
Just make your best effort to make the world a better place by applying all the knowledge you have to different fields.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset will help you put your existing knowledge to great use. It has 132 parameters from which 42 different types of diseases can be predicted. The dataset consists of two CSV files: one for training and the other for testing your model. Each CSV file has 133 columns: 132 are symptoms that a person experiences, and the last column is the prognosis. These symptoms map to the 42 diseases to which you can classify a given set of symptoms. You are required to train your model on the training data and test it on the testing data.
Category: Machine Learning
Tags: medicine, disease, healthcare, ML, machine learning
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets and experiment results presented in our arxiv paper:
B. Hoffman, M. Cusimano, V. Baglione, D. Canestrari, D. Chevallier, D. DeSantis, L. Jeantet, M. Ladds, T. Maekawa, V. Mata-Silva, V. Moreno-González, A. Pagano, E. Trapote, O. Vainio, A. Vehkaoja, K. Yoda, K. Zacarian, A. Friedlaender, "A benchmark for computational analysis of animal behavior, using animal-borne tags," 2023.
Standardized code to implement, train, and evaluate models can be found at https://github.com/earthspecies/BEBE/.
Please note the licenses in each dataset folder.
Zip folders beginning with "formatted": These are the datasets we used to run the experiments reported in the benchmark paper.
Zip folders beginning with "raw": These are the unprocessed datasets used in BEBE. Code to process these raw datasets into the formatted ones used by BEBE can be found at https://github.com/earthspecies/BEBE-datasets/.
Zip folders beginning with "experiments": Results of the cross-validation experiments reported in the paper, as well as hyperparameter optimization. Confusion matrices for all experiments can also be found here. Note that dt, rf, and svm refer to the feature set from Nathan et al., 2012.
Results used in Fig. 4 of the arxiv paper (deep neural networks vs. classical models):
- {dataset}_harnet_nogyr
- {dataset}_CRNN
- {dataset}_CNN
- {dataset}_dt
- {dataset}_rf
- {dataset}_svm
- {dataset}_wavelet_dt
- {dataset}_wavelet_rf
- {dataset}_wavelet_svm
Results used in Fig. 5D of the arxiv paper (full data setting). If the dataset contains gyroscope data (HAR, jeantet_turtles, vehkaoja_dogs):
- {dataset}_harnet_nogyr
- {dataset}_harnet_random_nogyr
- {dataset}_harnet_unfrozen_nogyr
- {dataset}_RNN_nogyr
- {dataset}_CRNN_nogyr
- {dataset}_rf_nogyr
Otherwise:
- {dataset}_harnet_nogyr
- {dataset}_harnet_unfrozen_nogyr
- {dataset}_harnet_random_nogyr
- {dataset}_RNN_nogyr
- {dataset}_CRNN
- {dataset}_rf
Results used in Fig. 5E of the arxiv paper (reduced data setting). If the dataset contains gyroscope data (HAR, jeantet_turtles, vehkaoja_dogs):
- {dataset}_harnet_low_data_nogyr
- {dataset}_harnet_random_low_data_nogyr
- {dataset}_harnet_unfrozen_low_data_nogyr
- {dataset}_RNN_low_data_nogyr
- {dataset}_wavelet_RNN_low_data_nogyr
- {dataset}_CRNN_low_data_nogyr
- {dataset}_rf_low_data_nogyr
Otherwise:
- {dataset}_harnet_low_data_nogyr
- {dataset}_harnet_random_low_data_nogyr
- {dataset}_harnet_unfrozen_low_data_nogyr
- {dataset}_RNN_low_data_nogyr
- {dataset}_wavelet_RNN_low_data_nogyr
- {dataset}_CRNN_low_data
- {dataset}_rf_low_data
CSV files: we also include summaries of the experimental results in experiments_summary.csv, experiments_by_fold_individual.csv, and experiments_by_fold_behavior.csv.

experiments_summary.csv - results averaged over individuals and behavior classes:
- dataset (str): name of dataset
- experiment (str): name of model with experiment setting
- fig4 (bool): True if dataset+experiment was used in figure 4 of the arxiv paper
- fig5d (bool): True if dataset+experiment was used in figure 5d of the arxiv paper
- fig5e (bool): True if dataset+experiment was used in figure 5e of the arxiv paper
- f1_mean (float): mean of macro-averaged F1 score, averaged over individuals in test folds
- f1_std (float): standard deviation of macro-averaged F1 score, computed over individuals in test folds
- prec_mean, prec_std (float): analogous for precision
- rec_mean, rec_std (float): analogous for recall

experiments_by_fold_individual.csv - results per individual in the test folds:
- dataset (str): name of dataset
- experiment (str): name of model with experiment setting
- fig4 (bool): True if dataset+experiment was used in figure 4 of the arxiv paper
- fig5d (bool): True if dataset+experiment was used in figure 5d of the arxiv paper
- fig5e (bool): True if dataset+experiment was used in figure 5e of the arxiv paper
- fold (int): test fold index
- individual (int): individuals are numbered zero-indexed, starting from fold 1
- f1 (float): macro-averaged F1 score for this individual
- precision (float): macro-averaged precision for this individual
- recall (float): macro-averaged recall for this individual

experiments_by_fold_behavior.csv - results per behavior class, for each test fold:
- dataset (str): name of dataset
- experiment (str): name of model with experiment setting
- fig4 (bool): True if dataset+experiment was used in figure 4 of the arxiv paper
- fig5d (bool): True if dataset+experiment was used in figure 5d of the arxiv paper
- fig5e (bool): True if dataset+experiment was used in figure 5e of the arxiv paper
- fold (int): test fold index
- behavior_class (str): name of behavior class
- f1 (float): F1 score for this behavior, averaged over individuals in the test fold
- precision (float): precision for this behavior, averaged over individuals in the test fold
- recall (float): recall for this behavior, averaged over individuals in the test fold
- train_ground_truth_label_counts (int): number of timepoints labeled with this behavior class in the training set
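For example, a short sketch pulling the Fig. 4 rows out of the summary file, using the columns documented above:

```python
import pandas as pd

summary = pd.read_csv("experiments_summary.csv")
fig4 = summary[summary["fig4"]]  # rows used in Fig. 4
cols = ["dataset", "experiment", "f1_mean", "f1_std"]
print(fig4.sort_values("f1_mean", ascending=False)[cols])
```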
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The methodology is the core component of any research work: it documents the methods used to obtain the results. Here, the entire implementation is done in Python. The steps involved are as follows:
1. Acquire Personality Dataset
Kaggle hosts a collection of datasets and data generators that are used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP (International Personality Item Pool). The dataset can be downloaded as a zip file simply by clicking the link provided. It consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output; the dataset has multivariate characteristics. Data preprocessing is done to check for inconsistent behaviors or trends.
2. Data preprocessing
After data acquisition, the next step is to clean and preprocess the data. The available dataset has numerical features. The target value is a five-level personality label: serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is split into training and testing sets by passing the feature values, target values, and test size to the train-test split method of the scikit-learn package. After the split, the training data is used to fit the Logistic Regression and SVM models, and the test data is used to estimate the accuracy of the trained models.
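A sketch of that split step (scikit-learn assumed; the label column name is a placeholder):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")           # file name per the dataset description
X = df.drop(columns=["Personality"])    # label column name is a placeholder
y = df["Personality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```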
3. Feature Extraction
The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1 = Disagree, 3 = Neutral, 5 = Agree.
EXT1 I am the life of the party.
EXT2 I don't talk a lot.
EXT3 I feel comfortable around people.
EXT4 I am quiet around strangers.
EST1 I get stressed out easily.
EST2 I get irritated easily.
EST3 I worry about things.
EST4 I change my mood a lot.
AGR1 I have a soft heart.
AGR2 I am interested in people.
AGR3 I insult people.
AGR4 I am not really interested in others.
CSN1 I am always prepared.
CSN2 I leave my belongings around.
CSN3 I follow a schedule.
CSN4 I make a mess of things.
OPN1 I have a rich vocabulary.
OPN2 I have difficulty understanding abstract ideas.
OPN3 I do not have a good imagination.
OPN4 I use difficult words.
4. Training the Model
Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets: a training set and a testing set, with 80% for training and 20% for testing. You train the model using the training set. In this work, we trained on our dataset using linear_model.LogisticRegression() and svm.SVC() from the sklearn package.
5. Personality Prediction Output
After training, the Logistic Regression and SVM models are evaluated using cohen_kappa_score and accuracy_score.
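A sketch of the training and evaluation steps described in steps 4 and 5, reusing the split from the earlier sketch:

```python
from sklearn import linear_model, svm
from sklearn.metrics import accuracy_score, cohen_kappa_score

# X_train, X_test, y_train, y_test come from the train_test_split sketch above
for clf in (linear_model.LogisticRegression(max_iter=1000), svm.SVC()):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(type(clf).__name__,
          "accuracy:", accuracy_score(y_test, pred),
          "kappa:", cohen_kappa_score(y_test, pred))
```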
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Transition metal complexes have played crucial roles in various homogeneous catalytic processes due to their exceptional versatility. This adaptability stems not only from the central metal ions but also from the vast array of choices of the ligand spheres, which form an enormously large chemical space. For example, Rh complexes, with a well-designed ligand sphere, are known to be efficient in catalyzing the C-H activation process in alkanes. To investigate the structure-property relation of the Rh complex and identify the optimal ligand that minimizes the calculated reaction energy ΔE of an alkane C-H activation, we have applied a Δ-Machine Learning method trained on various features to study 1,743 pairs of reactants (Rh(PLP)(Cl)(CO)) and intermediates (Rh(PLP)(Cl)(CO)(H)(propyl)). Our findings demonstrate that the models exhibit robust predictive performance when trained on features derived from electron density (R2 = 0.816), and SOAPs (R2 = 0.819), a set of position-based descriptors. Leveraging the model trained on xTB-SOAPs that only depend on the xTB-equilibrium structures, we propose an efficient and accurate screening procedure to explore the extensive chemical space of bisphosphine ligands. By applying this screening procedure, we identify ten newly selected reactant-intermediate pairs with an average ΔE of 33.2 kJ mol-1, remarkably lower than the average ΔE of the original data set of 68.0 kJ mol-1. This underscores the efficacy of our screening procedure in pinpointing structures with significantly lower energy levels.
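As a hedged illustration of the screening idea (not the authors' code), a regressor fit on per-molecule feature vectors to predict ΔE; file and column names are hypothetical:

```python
# Illustrative sketch: fit a kernel regressor on per-molecule feature
# vectors (e.g., SOAP descriptors) to predict the reaction energy dE.
# "features.csv" and the "delta_E" column are hypothetical names.
import pandas as pd
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("features.csv")
X = df.drop(columns=["delta_E"]).values
y = df["delta_E"].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = KernelRidge(kernel="rbf", alpha=1e-3).fit(X_tr, y_tr)
print("R2:", r2_score(y_te, model.predict(X_te)))
```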
The dataset contains three file types:
Version 1.0:
xyz files of the final optimized Rh-phosphine complexes; one set for the starting materials denoted as "molecule-XXXX_4-times" and one set for the intermediates after C-H activation denoted as "molecule-XXXX_6-times"
Gaussian16 log files for the optimization process; one set for the starting materials denoted as "molecule-XXXX_4-times" and one set for the intermediates after C-H activation denoted as "molecule-XXXX_6-times"
csv files containing the per-molecule features used for training the different machine learning models. The names of the csv files indicate which property was predicted and which model was used
New in version 1.1 (other data is unchanged):
Gaussian16 log files for the ten newly identified bisphosphine ligands; one set for the product material denoted as "LXX_6-times-axial" and one set for the transition state for the C-H activation denoted as "LXX_C-H-activation_TS"
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aims: The population pharmacokinetic (PPK) model-based machine learning (ML) approach offers a novel perspective on individual concentration prediction. This study aimed to establish a PPK-based ML model for predicting tacrolimus (TAC) concentrations in Chinese renal transplant recipients.

Methods: Conventional TAC monitoring data from 127 Chinese renal transplant patients were divided into training (80%) and testing (20%) datasets. A PPK model was developed using the training group data. ML models were then established based on individual pharmacokinetic data derived from the PPK basic model. The prediction performances of the PPK-based ML model and the Bayesian forecasting approach were compared using data from the test group.

Results: The final PPK model, incorporating hematocrit and CYP3A5 genotypes as covariates, was successfully established. Individual predictions of TAC using the PPK basic model, postoperative date, CYP3A5 genotype, and hematocrit showed improved rankings in ML model construction. XGBoost, based on the TAC PPK, exhibited the best prediction performance.

Conclusion: The PPK-based machine learning approach emerges as a superior option for predicting TAC concentrations in Chinese renal transplant recipients.
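As an illustration only (not the study's code), a sketch of an XGBoost regressor on PPK-derived predictors like those named above; file and column names are hypothetical:

```python
# Illustrative sketch of an XGBoost regressor on PPK-derived features;
# "tac_ppk_features.csv" and all column names are hypothetical.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("tac_ppk_features.csv")
X = df[["ppk_pred", "postop_day", "cyp3a5", "hematocrit"]]
y = df["tac_conc"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.05).fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```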
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
DATASET
This dataset is part of the research work titled "A dataset to train intrusion detection systems based on machine learning models for electrical substations". The dataset has been meticulously curated to support the development and evaluation of machine learning models tailored for detecting cyber intrusions in the context of electrical substations. It is intended to facilitate research and advancements in cybersecurity for critical infrastructure, specifically focusing on real-world scenarios within electrical substation environments. We encourage its use for experimentation and benchmarking in related areas of study.
The following sections list the content of the dataset generated.
Data
  raw
    iec61850
      attack-free-data
        capture61850-attackfree.pcap (from real substation)
        capture61850-attackfree_PTP.pcap
        capture61850-attackfree_normalfault.pcap
      attack-data
        capture61850-floodattack_withfault.pcap
        capture61850-floodattack_withoutfault.pcap
        capture61850-fuzzyattack_withfault.pcap
        capture61850-fuzzyattack_withoutfault.pcap
        capture61850-replay.pcap
        capture61850-ptpattack.pcap
    iec104
      attack-free-data
        capture104-attackfree.pcap (from real substation)
      attack-data
        capture104-dosattack.pcap
        capture104-floodattack.pcap
        capture104-fuzzyattack.pcap
        capture104-iec104starvationattack.pcap
        capture104-mitmattack.pcap
        capture104-ntpddosattack.pcap
        capture104-portscanattack.pcap
  processed
    iec61850
      attack-free-data
        capture61850-attackfree.csv
        capture61850-attackfree_PTP.csv
        capture61850-attackfree_normalfault.csv
      attack-data
        capture61850-floodattack_withfault.csv
        capture61850-floodattack_withoutfault.csv
        capture61850-fuzzyattack_withfault.csv
        capture61850-fuzzyattack_withoutfault.csv
        capture61850-replay.csv
        capture61850-ptpattack.csv
      headers_iec61850[all].txt
    iec104
      attack-free-data
        capture104-attackfree.csv
      attack-data
        capture104-dosattack.csv
        capture104-floodattack.csv
        capture104-fuzzyattack.csv
        capture104-iec104starvationattack.csv
        capture104-mitmattack.csv
        capture104-ntpddosattack.csv
        capture104-portscanattack.csv
      headers_iec104[all].txt
Description
The file names encode the following:
- file type: capture61850 or capture104, depending on whether the file contains network captures of the IEC 61850 or IEC 104 protocol.
- attack: attack-free (attackfree) or the attack name, appended to the file name.
- function: optional; details about the functionality captured (normalfault) or a specific protocol capture (PTP).
- file extension: PCAP (network capture) or CSV (flow file).
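A sketch of loading one processed flow file with the column names from the accompanying headers file (the header file's exact format is an assumption):

```python
import pandas as pd

# Assumes the headers file lists whitespace-separated column names and the
# CSV itself has no header row; verify both against the actual files.
with open("processed/iec104/headers_iec104[all].txt") as f:
    columns = f.read().split()

df = pd.read_csv("processed/iec104/attack-data/capture104-dosattack.csv",
                 header=None, names=columns)
print(df.head())
```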
Results
results
  test1-iec104
    model-test1-iec104.pkl
    test1-iec104.log
  test1-iec61850
    model-test1-iec61850.pkl
    test1-iec61850.log
  test2-iec61850
    model-test2-iec61850.pkl
    test2-iec61850.log
Description
The outcomes of different test executions are available as follows:
test1-iec104: IEC 104 protocol, for all attacks and the attack-free scenario
test1-iec61850: IEC 61850 protocol, for the fuzzy attack with fault injection and the attack-free scenario
test2-iec61850: IEC 61850 protocol, for the fuzzy attack under normal operation and the attack-free scenario
Each test consists of the model results in Python pickle format (with a .pkl extension) and a detailed description of the execution conditions in an output log file (with a .log extension).
Source Code
Tools to process network captures from IEC 61850 and IEC 104 can be found in the GitHub repository.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
The Yelp Reviews Polarity dataset is a collection of Yelp reviews that have been labeled as positive or negative. This dataset is well suited to natural language processing tasks such as sentiment analysis.
For more datasets, click here.
This Yelp reviews dataset is a great natural language processing dataset for anyone looking to get started with text classification. The data is split into two files: train.csv and test.csv. The training set contains 7,000 reviews with labels (0 = negative, 1 = positive), and the test set contains 3,000 unlabeled reviews.
To get started with this dataset, download the two CSV files and put them in the same directory. Then, open up train.csv in your favorite text editor or spreadsheet software (I like using Microsoft Excel). Next, take a look at the first few rows of data to get a feel for what you're working with:
| text | label |
| --- | --- |
| So there is no way for me to plug it in here in the US unless I go by... | 0 |
- This dataset could be used to train a machine learning model to classify Yelp reviews as positive or negative.
- This dataset could be used to train a machine learning model to predict the star rating of a Yelp review based on the text of the review.
- This dataset could be used to build a natural language processing system that generates fake Yelp reviews.
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv
| Column name | Description |
|:--------------|:----------------------------------|
| text | The text of the review. (string) |
| label | The label of the review. (string) |
File: test.csv
| Column name | Description |
|:--------------|:----------------------------------|
| text | The text of the review. (string) |
| label | The label of the review. (string) |
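A minimal classification sketch against train.csv, using the columns documented above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("train.csv")
X_tr, X_te, y_tr, y_te = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=0)

# TF-IDF unigrams and bigrams feeding a logistic regression baseline
vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_tr), y_tr)
print(classification_report(y_te, clf.predict(vec.transform(X_te))))
```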
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validation, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
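Following those naming conventions, a minimal loading sketch:

```python
import pandas as pd

train = pd.read_csv("train_data.csv")
val = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")
print(train.shape, val.shape, test.shape)
```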
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter, with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
This dataset is a merged dataset created from the data provided in the competition "Store Sales - Time Series Forecasting". The other datasets provided there apart from train and test (for example holidays_events, oil, stores, etc.) could not be used directly in the final prediction. In my understanding, through EDA of the merged dataset we can get a clearer picture of the other factors that might also affect the final prediction of grocery sales. Therefore, I created this merged dataset and posted it here for further analysis.
##### Data Description
Data Field Information (copied from the description provided with the original dataset)
Train.csv
- id: store id
- date: date of the sale
- store_nbr: identifies the store at which the products are sold.
- family: identifies the type of product sold.
- sales: gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- onpromotion: gives the total number of items in a product family that were being promoted at a store on a given date.
- Store metadata, including city, state, type, and cluster; cluster is a grouping of similar stores.
- Holidays and Events, with metadata. NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government; a transferred day is more like a normal day than a holiday. To find the day it was celebrated, look for the corresponding row where the type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days of type Bridge are extra days added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by a type Work Day, a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge. Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
- dcoilwtico: daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

Note: There is a transactions column in the training dataset which displays the sales transactions on that particular date.

Test.csv
- The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
- The dates in the test data are for the 15 days after the last date in the training data.

Note: There is no transactions column in the test dataset as there is in the training dataset. Therefore, while building the model, you might exclude this column and use it only for EDA.
submission.csv - A sample submission file in the correct format.
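A small EDA sketch on the merged training file (the file name and the transactions column name follow the description above; adapt as needed):

```python
import pandas as pd

df = pd.read_csv("Train.csv", parse_dates=["date"])       # merged training file
# The transactions column is absent from the test set (per the note above),
# so drop it before modeling and keep it only for EDA.
df = df.drop(columns=["transactions"], errors="ignore")
print(df.groupby("family")["sales"].mean().sort_values(ascending=False).head())
```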
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The datasets used in this research work relate to the aims of Sustainable Development Goal 7 (SDG 7). They were used to train and test a machine learning model based on an artificial neural network, along with other machine learning regression models, for predicting scores for the realization of SDG 7 aims. The train dataset was created from data covering 2013 to 2021 and includes 261 samples; the test dataset includes 29 samples. Source data from 2013 to 2022 are available in 10 XLSX and CSV files. The train and test datasets are available as XLSX and CSV files. A detailed description of the data is available in a PDF file.
This data package is associated with the publication "Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning" submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024). River sediment respiration observations are expensive and labor intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results for the purposes of reproducibility of the results in the associated manuscript and community reuse of the ML models trained in this project.

One of the key challenges in this work was to find an optimum configuration for machine learning models to work with this feature-rich (i.e., 100+ possible input variables) data set. Here, we used a two-tiered approach to managing the analysis of this complex data set: 1) a stacked ensemble of ML models that can automatically optimize hyperparameters to accelerate the process of model selection and tuning, and 2) feature permutation importance to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, thus making this implementation a potential template for other applications.

This data package is associated with the GitHub repository found at https://github.com/parallelworks/sl-archive-whondrs. A static copy of the GitHub repository is included in this data package as an archived version at the time of publishing this data package (March 2023). However, we recommend accessing these files via GitHub for full functionality. Please see the file-level metadata (flmd; "sl-archive-whondrs_flmd.csv") for a list of all files contained in this data package and descriptions of each. Please see the data dictionary (dd; "sl-archive-whondrs_dd.csv") for a list of all column headers contained within the comma separated value (csv) files in this data package and descriptions of each.

The GitHub repository is organized into five top-level directories: (1) "input_data" holds the training data for the ML models; (2) "ml_models" holds machine learning models trained on the data in "input_data"; (3) "scripts" contains data preprocessing and postprocessing scripts and intermediate results specific to this data set that bookend the ML workflow; (4) "examples" contains visualizations of the results in this repository, including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) "output_data" holds the overall results of the ML model on that branch.

Each trained ML model resides on its own branch in the repository; this means that inputs and outputs can differ branch-to-branch. Furthermore, depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts, and their intermediate results, can also differ branch-to-branch. The "main-*" branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details.
There is also one hidden directory, ".github/workflows". It contains information for running the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.
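To illustrate the feature permutation importance technique named above (on synthetic stand-in data, not the archived workflow):

```python
# Sketch of feature permutation importance with scikit-learn; synthetic data
# stands in for the WHONDRS features, so this is the technique only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=500)  # 2 informative features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean.round(3))  # features 0 and 3 should dominate
```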