8 datasets found

h
tips
huggingface.co
Updated Jun 25, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
scikit-learn (2022). tips [Dataset]. https://huggingface.co/datasets/scikit-learn/tips
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 25, 2022
Dataset authored and provided by
scikit-learn
Description
A Waiter's Tips

The following description was retrieved from Kaggle page. Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/tips.
Metatasks for Auto-Sklearn 1 - ROC AUC and Balanced Accuracy
figshare.com
bin
Updated Jul 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lennart Purucker (2023). Metatasks for Auto-Sklearn 1 - ROC AUC and Balanced Accuracy [Dataset]. http://doi.org/10.6084/m9.figshare.23613627.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.23613627.v1
Dataset updated
Jul 1, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Lennart Purucker
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Prediction Data of Base Models from Auto-Sklearn 1 on 71 classification datasets from the AutoML Benchmark for Balanced Accuracy and ROC AUC.

The files of this figshare item include data that was collected for the paper:

Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoML, Lennart Purucker, Lennart Schneider, Marie Anastacio, Joeran Beel, Bernd Bischl, Holger Hoos, Second International Conference on Automated Machine Learning, 2023.

The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.

In detail, the data contains the predictions of base models on validation and test as produced by running Auto-Sklearn 1 for 4 hours. Such prediction data is included for each model produced by Auto-Sklearn 1 on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.

The data was collected by code created for the paper and is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23613624.

Its usage is intended for but not limited to using assembled to evaluate post hoc ensembling methods for AutoML.

Details The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.

The link resolves to a directory containing the following:

example_metatasks: contains an example metatask for test purposes before committing to downloading all files.
metatasks_roc_auc.zip: The Metatasks obtained by running Auto-Sklearn 1 for ROC AUC. metatasks_bacc.zip: The Metatasks obtained by running Auto-Sklearn 1 for Balanced Accuracy.

The size after unzipping the entire file is:

metatasks_roc_auc.zip: ~450GB metatasks_bacc.zip: ~330GB

We suggest extracting only files that are of interest from the .zip archive, as these can be much smaller in size and might suffice for experiments.

The metatask .zip files contain 2 subdirectories for Metatasks produced based on TopN or SiloTopN pruning (see paper for details). In each of these subdirectories, 2 files per metatask exist. One .json file with metadata information and a .hdf or .csv file containing the prediction data. The details on how this should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metataks, we advise looking at the file content and metadata individually or parsing them by using Metatasks first.
d
Released datasets for An Improved Framework for Scaling Party Positions from...
search.dataone.org
dataverse.harvard.edu
Updated Sep 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nguyen, Hung H. V. (2024). Released datasets for An Improved Framework for Scaling Party Positions from Texts with Transformer [Dataset]. http://doi.org/10.7910/DVN/LJZKLL
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/LJZKLL
Dataset updated
Sep 24, 2024
Dataset provided by
Harvard Dataverse
Authors
Nguyen, Hung H. V.
Description
Released datasets for the pre-print: An Improved Framework for Scaling Party Positions from Texts with Transformer. Replication data can be found on the project's Github page: https://github.com/hungnguyen167/ContextScale Note: Position scores in both datasets have been rescaled by topic using MinMaxScaler (range from 0 to 1, with 0 being the most extreme left) from sklearn.
What you see is what you get: Delineating the urban jobs-housing spatial...
figshare.com
zip
Updated Feb 12, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yao Yao; Jiaqi Zhang; Chen Qian; Yu Wang; Shuliang Ren; Zehao Yuan; Qingfeng Guan (2021). What you see is what you get: Delineating the urban jobs-housing spatial distribution at a parcel scale by using street view imagery [Dataset]. http://doi.org/10.6084/m9.figshare.12960212.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12960212.v1
Dataset updated
Feb 12, 2021
Dataset provided by
Figsharehttp://figshare.com/
Authors
Yao Yao; Jiaqi Zhang; Chen Qian; Yu Wang; Shuliang Ren; Zehao Yuan; Qingfeng Guan
License
https://www.gnu.org/copyleft/gpl.htmlhttps://www.gnu.org/copyleft/gpl.html
Description
The compressed package (Study_code.zip) contains the code files implemented by an under review paper ("What you see is what you get: Delineating urban jobs-housing spatial distribution at a parcel scale by using street view imagery based on deep learning technique").The compressed package (input_land_parcel_with_attributes.zip) is the sampled mixed "jobs-housing" attributes data of the study area with multiple probability attributes (Only working, Only living, working and living) at the land parcel scale.The compressed package (input_street_view_images.zip) is the surrounding street view data near sampled land parcels (input_land_parcel_with_attributes.zip) with the pixel size of 240*160 obtained from Tencent map (https://map.qq.com/).The compressed package (output_results.zip) contains the result vector files (Jobs-housing pattern distribution and error distribution) and file description (Readme.txt).This project uses some Python open source libraries (Numpy, Pandas, Selenium, Gdal, Pytorch and sklearn). This project complies with the GPL license.Numpy (https://numpy.org/) is an open source numerical calculation tool developed by Travis Oliphant. Used in this project for matrix operation. This library complies with the BSD license.Pandas (https://pandas.pydata.org/) is an open source library, providing high-performance, easy-to-use data structures and data analysis tools. This library complies with the BSD license.Selenium(https://www.selenium.dev/) is a suite of tools for automating web browsers.Used in this project for getting street view images.This library complies with the BSD license.Gdal(https://gdal.org/) is a translator library for raster and vector geospatial data formats.Used in this project for processing geospatial data.This library complies with the BSD license.Pytorch(https://pytorch.org/) is an open source machine learning framework that accelerates the path from research prototyping to production deployment.Used in this project for deep learning.This library complies with the BSD license.sklearn(https://scikit-learn.org/) is an open source machine learning tool for python.Used in this project for comparing precision metrics.This library complies with the BSD license.
Clustering Exercises
kaggle.com
Updated Apr 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joonas (2022). Clustering Exercises [Dataset]. https://www.kaggle.com/datasets/joonasyoon/clustering-exercises
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 29, 2022
Dataset provided by
Kaggle
Authors
Joonas
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

https://i.imgur.com/ZUX61cD.png" alt="Overview">

Context

The method of disuniting similar data is called clustering. you can create dummy data for classifying clusters by method from sklearn package but it needs to put your effort into job.

For users who making hard test cases for example of clustering, I think this dataset helps them.

Try out to select a meaningful number of clusters, and dividing the data into clusters. Here are exercises for you.

Dataset

All csv files contain a lots of x, y and color, and you can see above figures.

If you want to use position as type of integer, scale it and round off to integer as like x = round(x * 100).

Furthermore, here is GUI Tool to generate 2D points for clustering. you can make your dataset with this tool. https://www.joonas.io/cluster-paint

Stay tuned for further updates! also if any idea, you can comment me.
Replication Package: Unboxing Default Argument Breaking Changes in Scikit...
zenodo.org
application/gzip
Updated Aug 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Ghizlane El Boussaidi; Marco Tulio Valente; João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Ghizlane El Boussaidi; Marco Tulio Valente (2023). Replication Package: Unboxing Default Argument Breaking Changes in Scikit Learn [Dataset]. http://doi.org/10.5281/zenodo.8132450
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8132450
Dataset updated
Aug 23, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Ghizlane El Boussaidi; Marco Tulio Valente; João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Ghizlane El Boussaidi; Marco Tulio Valente
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Replication Package

This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".

Requirements

We recommend the following requirements to replicate our study:

Internet access

At least 100GB of space

Docker installed

Git installed

Package Structure

We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:

data-analysis, an R-based Container we used to run our data analysis.

data-collection, a Python Container we used to collect Scikit's default arguments and detect them in client applications.

database, a Postgres Container we used to store clients' data, obtainer from Grotov et al.

storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared in both containers.

docker-compose.yml, the Docker file that configures all containers used in the package.

In the remainder of this document, we describe how to set up each container properly.

Using VSCode to Setup the Package

We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both data-analysis and data-collection containers. This way you can directly access and run each container inside it without any specific configuration.

You first need to set up the containers

$ cd /replication/package/folder $ docker-compose build $ docker-compose up # Wait docker creating and running all containers

Then, you can open them in Visual Studio Code:

Open VSCode in project root folder

Access the command palette and select "Dev Container: Reopen in Container"

Select either Data Collection or Data Analysis.

Start working

If you want/need a more customized organization, the remainder of this file describes it in detail.

Longest Road: Manual Package Setup

Database Setup

The database container will automatically restore the dump in dump_matroskin.tar in its first launch. To set up and run the container, you should:

Build an image:

$ cd ./database $ docker build --tag 'dabc-database' . $ docker image ls REPOSITORY TAG IMAGE ID CREATED SIZE dabc-database latest b6f8af99c90d 50 minutes ago 18.5GB

Create and enter inside the container:

$ docker run -it --name dabc-database-1 dabc-database $ docker exec -it dabc-database-1 /bin/bash root# psql -U postgres -h localhost -d jupyter-notebooks jupyter-notebooks=# \dt List of relations Schema | Name | Type | Owner --------+-------------------+-------+------- public | Cell | table | root public | Code_cell | table | root public | Md_cell | table | root public | Notebook | table | root public | Notebook_features | table | root public | Notebook_metadata | table | root public | repository | table | root

If you got the tables list as above, your database is properly setup.

It is important to mention that this database is extended from the one provided by Grotov et al.. Basically, we added three columns in the table Notebook_features (API_functions_calls, defined_functions_calls, andother_functions_calls) containing the function calls performed by each client in the database.

Data Collection Setup

This container is responsible for collecting the data to answer our research questions. It has the following structure:

dabcs.py, extract DABCs from Scikit Learn source code, and export them to a CSV file.

dabcs-clients.py, extract function calls from clients and export them to a CSV file. We rely on a modified version of Matroskin to leverage the function calls. You can find the tool's source code in the `matroskin`` directory.

Makefile, commands to set up and run both dabcs.py and dabcs-clients.py

matroskin, the directory containing the modified version of matroskin tool. We extended the library to collect the function calls performed on the client notebooks of Grotov's dataset.

storage, a docker volume where the data-collection should save the exported data. This data will be used later in Data Analysis.

requirements.txt, Python dependencies adopted in this module.

Note that the container will automatically configure this module for you, e.g., install dependencies, configure matroskin, download scikit learn source code, etc. For this, you must run the following commands:

$ cd ./data-collection $ docker build --tag "data-collection" . $ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection $ docker exec -it data-collection-1 /bin/bash $ ls Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py

If you see project files, it means the container is configured accordingly.

Data Analysis Setup

We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:

dependencies.R, an R script containing the dependencies used in our data analysis.

data-analysis.Rmd, the R notebook we used to perform our data analysis

datasets, a docker volume pointing to the storage directory.

Execute the following commands to run this container:

$ cd ./data-analysis $ docker build --tag "data-analysis" . $ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-collection/datasets/ data-analysis $ docker exec -it data-analysis-1 /bin/bash $ ls data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile

If you see project files, it means the container is configured accordingly.

A note on storage shared folder

As mentioned, the storage folder is mounted as a volume and shared between data-collection and data-analysis containers. We compressed the content of this folder due to space constraints. Therefore, before starting working on Data Collection or Data Analysis, make sure you extracted the compressed files. You can do this by running the Makefile inside storage folder.

$ make unzip # extract files $ ls clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv $ make zip # compress files $ ls csv-files.tar.gz Makefile
Z
Perovskite Solar Cells Ageing Dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Khenkin, Mark (2023). Perovskite Solar Cells Ageing Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8185882
Explore at:
Dataset updated
Jul 26, 2023
Dataset provided by
Khenkin, Mark
Köbler, Hans
Graniero, Paolo
Hartono, Noor Titan Putri
Ulbrich, Carolin
Schlatmann, Rutger
Abate, Antonio
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains cleaned 2,245 ageing test traces (time vs. MPPT PCE/ maximum power point tracking power conversion efficiency) for perovskite solar cells with various device stacks and architectures in the pickle (.pkl) format.

The dataset can be loaded with the following commands on Python.

import pickle5 as pickle import pandas as pd import numpy as np

with open('20230303_mySeriesDrop.pkl', "rb") as fh: mySeriesDrop = pickle.load(fh)

The following command can be used to call a specific row (row 0) within the dataset.

mySeriesDrop[0]

The next steps to use the dataset is using scaling/ normalisation (for instance using sklearn.preprocessing.MaxAbsScaler) and smoothing (for instance using Savitzky-Golay filter).

The code to run the complete analysis, including self-organising map clustering, can be accessed here: https://doi.org/10.5281/zenodo.8181602.
m
Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...
data.mendeley.com
narcis.nl
Updated Jan 11, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TaeKeun Yoo (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
Explore at:
Unique identifier
https://doi.org/10.17632/ffn745r57z.2
Dataset updated
Jan 11, 2021
Authors
TaeKeun Yoo
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim MD, In Sik Lee, MD, PhD, Jin Kook Kim MD, Wakako Ando CO, Nobuyuki Shoji, MD, PhD, Tomofusa, Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

Python version:

from sklearn.model_selection import train_test_split import pandas as pd import numpy as np from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import RandomForestRegressor

connect data in your google drive

from google.colab import auth auth.authenticate_user() from google.colab import drive drive.mount('/content/gdrive')

Change the path for the custom data

In this case, we used ICL vault prediction using preop measurement

dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv') dataset.head()

optimal features (sorted by importance) :

1. ICL size 2. ICL power 3. LV 4. CLR 5. ACD 6. ATA

7. MSE 8.Age 9. Pupil size 10. WTW 11. CCT 12. ACW

y = dataset['Vault_1M'] X = dataset.drop(['Vault_1M'], axis = 1)

Split the dataset to train and test data, if necessary.

For example, we can split data to 8:2 as a simple validation test

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

In our study, we already defined the training (B&VIIT Eye Center, n=1455) and test (Kitasato University, n=290) dataset, this code was not necessary to perform our analysis.

Optimal parameter search could be performed in this section

parameters = {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 500, 'criterion': 'mae' 'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6, 'max_leaf_nodes': None}

RF_model = RandomForestRegressor(**parameters) RF_model.fit(train_X, train_y) RF_predictions = RF_model.predict(test_X) importance = RF_model.feature_importances_
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

scikit-learn (2022). tips [Dataset]. https://huggingface.co/datasets/scikit-learn/tips

tips

scikit-learn/tips

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 25, 2022

Dataset authored and provided by

scikit-learn

Description

A Waiter's Tips

The following description was retrieved from Kaggle page. Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/tips.

Clear search

Close search

Google apps

Main menu

tips

Metatasks for Auto-Sklearn 1 - ROC AUC and Balanced Accuracy

Released datasets for An Improved Framework for Scaling Party Positions from...

What you see is what you get: Delineating the urban jobs-housing spatial...

Clustering Exercises

Overview

Context

Dataset

Replication Package: Unboxing Default Argument Breaking Changes in Scikit...

Perovskite Solar Cells Ageing Dataset

Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

connect data in your google drive

Change the path for the custom data

In this case, we used ICL vault prediction using preop measurement

optimal features (sorted by importance) :

1. ICL size 2. ICL power 3. LV 4. CLR 5. ACD 6. ATA

7. MSE 8.Age 9. Pupil size 10. WTW 11. CCT 12. ACW

Split the dataset to train and test data, if necessary.

For example, we can split data to 8:2 as a simple validation test

In our study, we already defined the training (B&VIIT Eye Center, n=1455) and test (Kitasato University, n=290) dataset, this code was not necessary to perform our analysis.

Optimal parameter search could be performed in this section

tips

scikit-learn/tips