MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
dataset_info:
  features:
    - name: repo_name
      dtype: string
    - name: path
      dtype: string
    - name: copies
      dtype: string
    - name: size
      dtype: string
    - name: content
      dtype: string
    - name: license
      dtype: string
  splits:
    - name: train
      num_bytes: 3147402833.3951
      num_examples: 241075
    - name: valid
      num_bytes: 17472318.29500301
      num_examples: 1312
  download_size: 966099631
  dataset_size: 3164875151.690103
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/train-*
      - split: valid
        path: data/valid-*
license: …
See the full description on the dataset page: https://huggingface.co/datasets/AlaGrine/codeparrot-sklearn.
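For reference, a minimal sketch (assuming the standard Hugging Face datasets API) of loading the two splits listed in the metadata above:

from datasets import load_dataset

# Loads both splits declared in the dataset card (train and valid).
ds = load_dataset("AlaGrine/codeparrot-sklearn")
print(ds["train"].num_rows)  # 241075 according to the metadata above
print(ds["valid"].num_rows)  # 1312 according to the metadata above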
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
32,579 texts in total: 14,012 NOT hateful texts and 18,567 HATEFUL texts. All duplicate values were removed. The data was split using sklearn into 80% train and 20% temporary test (stratified on the label); the temporary test set was then split 50/50 into validation and test (stratified on the label), giving an 80/10/10 split. Train set label distribution: 0 ==> 11,210, 1 ==> 14,853, 26,063 in total. Validation set label distribution: 0 ==> 1,401, 1 ==> 1,857, 3,258 in total. Test set label distribution: 0 ==> 1,401, 1 ==> 1,857, 3,258 in… See the full description on the dataset page: https://huggingface.co/datasets/christinacdl/hate_speech_dataset.
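As a hedged illustration only (not the dataset author's exact code; a pandas DataFrame df with a "label" column and random_state=42 are assumptions), the 80/10/10 stratified split described above can be reproduced with two calls to scikit-learn's train_test_split:

from sklearn.model_selection import train_test_split

# 80% train / 20% temporary test, stratified on the label.
train_df, temp_df = train_test_split(
    df, test_size=0.20, stratify=df["label"], random_state=42)

# Split the temporary test set 50/50 into validation and test, again stratified.
valid_df, test_df = train_test_split(
    temp_df, test_size=0.50, stratify=temp_df["label"], random_state=42)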
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data sets are used in a controlled experiment in which two classifiers are compared. train_a.csv and explain.csv are slices of the original data set. train_b.csv contains the same instances as train_a.csv, but with feature x1 set to 0 so that it is unusable by classifier B.
The original data set was created and split using this Python code:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Generate a two-feature, two-class data set and scale it up.
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, class_sep=0.75, random_state=0)
X *= 100

# Classifier A: trained on the unmodified features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
lm = LogisticRegression()
lm.fit(X_train, y_train)
clf_a = lm

# Classifier B: trained on a copy with feature x1 zeroed out.
clf_b = LogisticRegression()
X2 = X.copy()
X2[:, 0] = 0
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.5, random_state=0)
clf_b.fit(X2_train, y2_train)

# The explain set corresponds to the held-out half.
X_explain = X_test
y_explain = y_test
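For completeness, a hedged sketch (not part of the original description; the column names x1, x2 and y are assumptions) of how the three CSV files could be written from the arrays above:

import pandas as pd

# Classifier A's training slice, classifier B's zeroed-x1 slice, and the explain slice.
cols = ["x1", "x2"]
pd.DataFrame(X_train, columns=cols).assign(y=y_train).to_csv("train_a.csv", index=False)
pd.DataFrame(X2_train, columns=cols).assign(y=y2_train).to_csv("train_b.csv", index=False)
pd.DataFrame(X_explain, columns=cols).assign(y=y_explain).to_csv("explain.csv", index=False)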
Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD; Ik Hee Ryu, MD, MS; Tae Keun Yoo, MD; Jung Sub Kim, MD; In Sik Lee, MD, PhD; Jin Kook Kim, MD; Wakako Ando, CO; Nobuyuki Shoji, MD, PhD; Tomofusa Yamauchi, MD, PhD; Hitoshi Tabuchi, MD, PhD.
We hypothesize that machine learning of preoperative biometric data obtained by AS-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built a machine learning model using random forest regression to predict the ICL vault after surgery.
This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).
This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.
Python version:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Authenticate and mount Google Drive when running in Colab.
from google.colab import auth
auth.authenticate_user()
from google.colab import drive
drive.mount('/content/gdrive')

# Load the AS-OCT dataset.
dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
dataset.head()

# Target: vault at 1 month; features: all remaining columns.
y = dataset['Vault_1M']
X = dataset.drop(['Vault_1M'], axis=1)

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

parameters = {'bootstrap': True,
              'min_samples_leaf': 3,
              'n_estimators': 500,
              'criterion': 'mae',
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6,
              'max_leaf_nodes': None}

RF_model = RandomForestRegressor(**parameters)
RF_model.fit(train_X, train_y)
RF_predictions = RF_model.predict(test_X)
importance = RF_model.feature_importances_
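A hedged continuation (not from the published code) showing how the hold-out fit of the regressor above might be summarized:

from sklearn.metrics import mean_absolute_error, r2_score

# Hold-out error on the 20% test split created above.
print("MAE:", mean_absolute_error(test_y, RF_predictions))
print("R2:", r2_score(test_y, RF_predictions))

# Rank the preoperative AS-OCT predictors by importance.
for name, score in sorted(zip(X.columns, importance), key=lambda t: t[1], reverse=True):
    print(name, round(float(score), 3))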
A machine learning barometer (using Random Forest Regression) to calculate equilibration pressure for majoritic garnets. Updated 04/02/21 (21/01/21) (10/12/20).

The barometer code
The barometer is provided as python scripts (.py) and Jupyter Notebook (.ipynb) files. These are completely equivalent to one another, and which is used depends on the user's preference. Separate instructions are provided for each.

Data files included in this repository are:
• "Majorite_database_04022021.xlsm" (Excel sheet of literature majoritic garnet compositions - inclusions (up to date as of 04/02/2021) and experiments (up to date as of 03/07/2020). This data includes all compositions that are close to majoritic, but some are borderline; filtering as described in the paper accompanying this barometer is performed in the python script prior to any data analysis or fitting.)
• "lit_maj_nat_030720.txt" (python script input file of experimental literature majoritic garnet compositions - taken from the dataset above)
• "di_incs_040221.txt" (python script input file of a literature compilation of majoritic garnet inclusions observed in natural diamonds - taken from the dataset above)

The barometer as Jupyter Notebooks - including integrated Caret validation (added 21/01/2021)
For those less familiar with Python, running the barometer as a Notebook is somewhat more intuitive than running the scripts below. It also has the benefit of including the RFR validation using Caret within a single integrated notebook. The Jupyter Notebook requires a suitable Python3 environment (with the pandas, numpy, matplotlib, sklearn, rpy2 and pickle packages + dependencies). We recommend installing the latest Anaconda python distribution (found here: https://docs.anaconda.com/anaconda/install/) and creating a custom environment containing the required packages to run the Jupyter Notebook (as both python3 and R must be active in the environment). Instructions on this procedure can be found here (https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), or to assist we have provided a copy of the environment used to produce the scripts (barom-spec-file.txt). An identical conda environment (called myenv) can be created and used by:
1) copying barom-spec-file.txt to a suitable location (i.e. your home directory)
2) running the command: conda create --name myenv --file barom-spec-file.txt
3) entering this environment: conda activate myenv
4) running an instance of Jupyter Notebook by typing: jupyter notebook

Two Notebooks are provided:
• calculate_pressures_notebook.ipynb (equivalent to calculate_pressures.py described below)
• rfr_majbar_10122020_notebook.ipynb (equivalent to rfr_majbar_10122020.py described below, but also including integrated Caret validation performed using the rpy2 package in a single notebook environment)

The barometer as scripts (10/12/2020)
The scripts below need to be run in a suitable Python3 environment (with the pandas, numpy, matplotlib, sklearn and pickle packages + dependencies).
For inexperienced users we recommend installing the latest Anaconda python distribution (found here: https://docs.anaconda.com/anaconda/install/) and running in Spyder (a GUI scripting environment provided with Anaconda).

Note - if running python 3.7 (or earlier) you will need to install the pickle5 package to use the provided barometer files and comment/uncomment the appropriate lines in the "calculate_pressures.py" (lines 16/17) and "rfr_majbar_10122020.py" (lines 26/27) scripts. The user may additionally need to download and install the required packages if they are not provided with the Anaconda distribution (pandas, numpy, matplotlib, scikit-learn and pickle). This will be obvious because, when run, the script will return an error similar to "No module named XXXX". Packages can either be installed using the Anaconda package manager or on the command line / terminal via commands such as: conda install -c conda-forge pickle5. Appropriate command-line installation commands can be obtained by searching the Anaconda cloud at anaconda.org for each required package.

A python script (.py) is provided to calculate pressures for any majoritic garnet using the barometer calibrated in Thomson et al. (2021):
• calculate_pressures.py takes an input file of any majoritic garnet compositions (an example input file is provided, "example_test_data.txt", containing inclusion compositions reported by Zedgenizov et al., 2014, Chemical Geology, 363, pp 114-124).
• It employs the published RFR model and scaler - both provided as pickle files (pickle_model_20201210.pkl, scaler_20201210.pkl).

The user simply edits the input file name in the provided .py script and then runs the script in a suitable python3 environment (requires the pandas, numpy, sklearn and pickle packages). The script initially filters the data for majoritic compositions (according to the criteria used for barometer calibration) and predicts pressures for these compositions. It writes out pressures and 2 x std_dev in pressure estimates alongside the input data into "out_pressures_test.txt".

If this script produces any errors or warnings it is likely because the serialised pickle files provided are not compatible with the python build being used (this is a common issue with serialised ML models). Please first try installing the pickle5 package and commenting/uncommenting lines 16/17. If this is unsuccessful, run the full barometer calibration script below (using the same input files as in Thomson et al. (2021), which are provided) to produce pickle files compatible with the python build on the local machine (action 5 of the script below). Subsequently edit the filenames called in the "calculate_pressures.py" script (lines 22 & 27) to match the new barometer calibration files and re-run the calculate pressure script.

The output (predicted pressures) for the test dataset provided (using the published calibration) should be similar to the following results:

P (GPa)   error (GPa)
17.0      0.4
16.6      0.3
19.5      1.3
21.8      1.3
12.8      0.3
14.3      0.4
14.7      0.4
14.4      0.6
12.1      0.6
14.6      0.5
17.0      1.0
14.6      0.6
11.9      0.7
14.0      0.5
16.8      0.8

Full RFR barometer calibration script - rfr_majbar_10122020.py
The RFR barometer calibration script used and described in Thomson et al. (2021). This script performs the following actions:
1) Filters input data and outputs this filtered data as a .txt file (which is the input expected for the RFR validation script using the R package Caret)
2) Fits 1000 RFR models, each using a randomly selected training dataset (70% of the input data)
3) Performs leave-one-out validation
4) Plots figure 5 from Thomson et al. (2021)
5) Fits one single RFR barometer using all input data (saves this and the scaler as .pkl files with a datestamp for use in the calculate_pressures.py script)
6) Calculates the pressure for all literature inclusion compositions over 100 iterations with randomly distributed compositional uncertainties added - provides the mean pressure and 2 std deviations, written alongside the input inclusion compositions, as a .txt output file "diout.txt"
7) Plots the global distribution of majoritic inclusion pressures

The RFR barometer can be easily updated to include (or exclude) additional experimental compositions by modifying the literature data input files provided.

RFR validation using Caret in R (script titled "RFR_validation_03072020.R")
Additional validation tests of the RFR barometer are completed using the Caret package in R. This requires the filtered experimental dataset file "data_filteredforvalidation.txt" (which is generated by the rfr_majbar_10122020.py script if required for a new dataset) and performs bootstrap, K-fold and leave-one-out validation. It outputs validation stats for 5, 7 and 9 input variables (elements).

Please email Andrew Thomson (a.r.thomson@ucl.ac.uk) if you have any questions or queries.
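As a hedged sketch only (the real calculate_pressures.py also filters compositions and propagates uncertainties; the tab delimiter and column handling here are assumptions), the core prediction step with the published pickle files looks roughly like this:

import pickle
import pandas as pd

# Load the published RFR barometer and its input scaler.
with open("pickle_model_20201210.pkl", "rb") as f:
    rfr_model = pickle.load(f)
with open("scaler_20201210.pkl", "rb") as f:
    scaler = pickle.load(f)

# Read garnet compositions (example_test_data.txt is provided in the repository).
compositions = pd.read_csv("example_test_data.txt", sep="\t")

# Scale the composition columns and predict equilibration pressures (GPa).
X_scaled = scaler.transform(compositions.values)
pressures = rfr_model.predict(X_scaled)

pd.DataFrame({"P_GPa": pressures}).to_csv("out_pressures_test.txt", sep="\t", index=False)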
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is derived from argilla/ultrafeedback-binarized-preferences-cleaned using the following processing:

import random

import pandas as pd
from datasets import Dataset, load_dataset
from sklearn.model_selection import GroupKFold

data_df = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train").to_pandas()
rng = random.Random(42)

def get_assistant_text(messages):
    t = ""
    for msg in messages:
        if msg["role"] == "assistant": …

See the full description on the dataset page: https://huggingface.co/datasets/rbiswasfc/ultrafeedback-binary-classification.
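The snippet above is truncated. Purely as a hedged illustration (not the dataset author's actual processing; the "prompt" grouping column and 5 folds are assumptions), GroupKFold is typically used to assign fold ids so that all rows sharing the same group stay in the same fold:

from sklearn.model_selection import GroupKFold

# Assign a fold id per row, keeping rows with the same prompt together.
data_df = data_df.reset_index(drop=True)
gkf = GroupKFold(n_splits=5)
for fold, (_, valid_idx) in enumerate(gkf.split(data_df, groups=data_df["prompt"])):
    data_df.loc[valid_idx, "fold"] = fold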
The Titanic dataset is a well-known dataset that provides information on the passengers who were onboard the fateful voyage of the RMS Titanic. The data includes details such as the passenger's name, age, gender, ticket class, fare paid, and information on their family members. The dataset also includes a column called "Survived" which indicates whether a passenger survived the disaster or not.
There are a total of 891 rows in the dataset, with 12 columns. Some of the key columns in the dataset include:
• PassengerId: a unique identifier for each passenger
• Survived: a binary variable that indicates whether the passenger survived (1) or did not survive (0) the disaster
• Pclass: the ticket class of the passenger (1 = first class, 2 = second class, 3 = third class)
• Name: the name of the passenger
• Sex: the gender of the passenger (male or female)
• Age: the age of the passenger (some values are missing)
• SibSp: the number of siblings or spouses the passenger had on board
• Parch: the number of parents or children the passenger had on board
• Ticket: the ticket number of the passenger
• Fare: the fare paid by the passenger
• Cabin: the cabin number of the passenger (some values are missing)
• Embarked: the port at which the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
Overall, the key challenges I encountered when working on the Titanic dataset were: how to handle missing values and imbalanced classes, encode categorical variables, reduce the dimensionality of the dataset, and identify and handle noise in the data.
Here are a few tips that I found helpful when getting started in the Titanic dataset competition (a minimal worked sketch follows the list):
1. Get familiar with the dataset
2. Pre-process the data
3. Split the data into training and test sets
4. Try out a few different algorithms
5. Tune the hyperparameters
6. Evaluate the model
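The sketch below covers steps 2-6, assuming the standard Kaggle train.csv with the columns listed earlier; the preprocessing and model choices are illustrative, not prescriptive:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("train.csv")  # assumed Kaggle file name
num_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
cat_cols = ["Sex", "Embarked"]
X, y = df[num_cols + cat_cols], df["Survived"]

# Step 2: impute missing values and one-hot encode categorical columns.
pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
model = Pipeline([("pre", pre), ("clf", RandomForestClassifier(random_state=0))])

# Steps 3-6: split, fit, and evaluate (hyperparameter tuning omitted for brevity).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))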
Here are a few resources that I found helpful as I started working on the competition:
• Kaggle's Titanic tutorial
• scikit-learn documentation
• Pandas documentation
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Single slight lightning-talk

Machine learning is a central component in many data-driven research areas, but its adoption is limited by the often complex choice of data processing, model, and hyper-parameter settings. The goal of this project is to create software tools that enable automatic machine learning, that is, solving predictive analytics tasks without requiring the user to explicitly specify the algorithm or model hyper-parameters used for prediction. The software developed in this project will enable a wider use of machine learning by providing tools to apply machine learning without requiring knowledge of the details of the algorithms involved.

The project extends the existing scikit-learn project, a machine learning library for Python which is widely used in academic research across disciplines. The project will add features to this library to lower the amount of expert knowledge required to apply models to a new problem, and to facilitate interaction with automated machine learning systems. The project will also create a separate software package that includes models for automatic supervised learning, with a very simple interface requiring minimal user interaction. In contrast to existing research projects, this project focuses on creating easy-to-use tools that can be used by researchers without extensive training in machine learning or computer science.