2 datasets found
  1. Replication Package: Unboxing Default Argument Breaking Changes in Scikit Learn

    • zenodo.org
    application/gzip
    Updated Aug 23, 2023
    Cite
    João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Ghizlane El Boussaidi; Marco Tulio Valente (2023). Replication Package: Unboxing Default Argument Breaking Changes in Scikit Learn [Dataset]. http://doi.org/10.5281/zenodo.8132450
    Dataset updated: Aug 23, 2023
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: João Eduardo Montandon; Luciana Lourdes Silva; Cristiano Politowski; Ghizlane El Boussaidi; Marco Tulio Valente
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package

    This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".

    Requirements

    We recommend the following to replicate our study:

    1. Internet access
    2. At least 100GB of space
    3. Docker installed
    4. Git installed

    Package Structure

    We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:

    • data-analysis, an R-based Container we used to run our data analysis.
    • data-collection, a Python Container we used to collect Scikit's default arguments and detect them in client applications.
    • database, a Postgres Container we used to store clients' data, obtained from Grotov et al.
    • storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared by both containers.
    • docker-compose.yml, the Docker Compose file that configures all containers used in the package.

    In the remainder of this document, we describe how to set up each container properly.

    Using VSCode to Set Up the Package

    We selected VSCode as our IDE because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both the data-analysis and data-collection containers, so you can open and run each container directly from VSCode without any additional configuration.

    You first need to set up the containers:

    $ cd /replication/package/folder
    $ docker-compose build
    $ docker-compose up
    # Wait for Docker to create and start all containers
    

    Then, you can open them in Visual Studio Code:

    1. Open VSCode in the project root folder
    2. Access the command palette and select "Dev Containers: Reopen in Container"
      1. Select either Data Collection or Data Analysis.
    3. Start working

    If you want or need a more customized setup, the remainder of this file describes the manual configuration in detail.

    Longest Road: Manual Package Setup

    Database Setup

    The database container will automatically restore the dump in dump_matroskin.tar on its first launch. To set up and run the container, you should:

    Build an image:

    $ cd ./database
    $ docker build --tag 'dabc-database' .
    $ docker image ls
    REPOSITORY  TAG    IMAGE ID    CREATED     SIZE
    dabc-database latest  b6f8af99c90d  50 minutes ago  18.5GB
    

    Create and enter the container:

    $ docker run -it --name dabc-database-1 dabc-database
    $ docker exec -it dabc-database-1 /bin/bash
    root# psql -U postgres -h localhost -d jupyter-notebooks
    jupyter-notebooks=# \dt
            List of relations
     Schema |       Name        | Type  | Owner
    --------+-------------------+-------+-------
     public | Cell              | table | root
     public | Code_cell         | table | root
     public | Md_cell           | table | root
     public | Notebook          | table | root
     public | Notebook_features | table | root
     public | Notebook_metadata | table | root
     public | repository        | table | root
    

    If you see the table list above, your database is properly set up.

    It is important to mention that this database extends the one provided by Grotov et al. Basically, we added three columns to the table Notebook_features (API_functions_calls, defined_functions_calls, and other_functions_calls) containing the function calls performed by each client in the database.
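
    For reference, the sketch below shows one way to query these added columns from outside psql. It is a minimal, illustrative sketch only: it assumes the psycopg2 package is installed and that the container's port 5432 is reachable (for example, published with -p 5432:5432, or the script is run inside the container); any password configured by the container is not shown here.

    # query_notebook_features.py -- illustrative sketch, not part of the package.
    # Assumes psycopg2 is installed and the dabc-database container is reachable
    # with the same connection parameters used in the psql session above.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="jupyter-notebooks", user="postgres")
    with conn, conn.cursor() as cur:
        # The three columns added on top of Grotov et al.'s schema; the quoted,
        # mixed-case names follow the description above -- adjust the quoting if
        # the columns were created with lowercase names.
        cur.execute(
            'SELECT "API_functions_calls", "defined_functions_calls", '
            '"other_functions_calls" FROM "Notebook_features" LIMIT 5'
        )
        for row in cur.fetchall():
            print(row)
    conn.close()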

    Data Collection Setup

    This container is responsible for collecting the data to answer our research questions. It has the following structure:

    • dabcs.py, extracts DABCs from the Scikit Learn source code and exports them to a CSV file.
    • dabcs-clients.py, extracts function calls from clients and exports them to a CSV file. We rely on a modified version of Matroskin to collect the function calls; the tool's source code is in the matroskin directory.
    • Makefile, commands to set up and run both dabcs.py and dabcs-clients.py
    • matroskin, the directory containing the modified version of the matroskin tool. We extended the library to collect the function calls performed in the client notebooks of Grotov's dataset.
    • storage, a docker volume where the data-collection should save the exported data. This data will be used later in Data Analysis.
    • requirements.txt, Python dependencies adopted in this module.

    Note that the container will automatically configure this module for you, e.g., install dependencies, configure matroskin, download the Scikit Learn source code, etc. To set it up, run the following commands:

    $ cd ./data-collection
    $ docker build --tag "data-collection" .
    $ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
    $ docker exec -it data-collection-1 /bin/bash
    $ ls
    Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py
    

    If you see the project files, the container is configured correctly.

    Data Analysis Setup

    We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:

    • dependencies.R, an R script containing the dependencies used in our data analysis.
    • data-analysis.Rmd, the R notebook we used to perform our data analysis.
    • datasets, a docker volume pointing to the storage directory.

    Execute the following commands to run this container:

    $ cd ./data-analysis
    $ docker build --tag "data-analysis" .
    $ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-analysis/datasets/ data-analysis
    $ docker exec -it data-analysis-1 /bin/bash
    $ ls
    data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile
    

    If you see the project files, the container is configured correctly.

    A note on storage shared folder

    As mentioned, the storage folder is mounted as a volume and shared between the data-collection and data-analysis containers. We compressed the content of this folder due to space constraints. Therefore, before starting work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the Makefile inside the storage folder (a short loading sketch follows the listing below).

    $ make unzip # extract files
    $ ls
    clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv
    $ make zip # compress files
    $ ls
    csv-files.tar.gz Makefile
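
    After extraction, the CSVs can be inspected with a few lines of Python. This is a minimal sketch only; it assumes pandas is available in the environment (for example, inside the data-collection container) and makes no assumption about the column layout of the exported files.

    # inspect_storage_csvs.py -- illustrative sketch, not part of the package.
    # Run from the storage directory after `make unzip`; column names are
    # whatever the collection scripts exported, none are assumed here.
    import pandas as pd

    for name in ["dabcs.csv", "clients-dabcs.csv", "clients-validation.csv"]:
        df = pd.read_csv(name)
        print(f"{name}: {len(df)} rows, columns: {list(df.columns)}")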
  2. Aluminum alloy industrial materials defect

    • figshare.com
    zip
    Updated Dec 3, 2024
    Cite
    Ying Han; Yugang Wang (2024). Aluminum alloy industrial materials defect [Dataset]. http://doi.org/10.6084/m9.figshare.27922929.v3
    Dataset updated: Dec 3, 2024
    Dataset provided by: figshare
    Authors: Ying Han; Yugang Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study comes from the preliminary competition dataset of the 2018 Guangdong Industrial Intelligent Manufacturing Big Data Intelligent Algorithm Competition organized by Tianchi Feiyue Cloud (https://tianchi.aliyun.com/competition/entrance/231682/introduction). We curated the dataset, removing images that do not meet the requirements of our experiment, and split all data into training and testing sets. The images are all 2560×1960 pixels. Before training, all defects need to be labeled using labelimg and saved as json files. Then, all json files are converted to txt files. Finally, the organized defect dataset is detected and classified.

    Description of the data and file structure

    This is a project based on an enhanced YOLOv8 algorithm for aluminum defect classification and detection tasks. All code has been tested on Windows computers with Anaconda and CUDA-enabled GPUs. The following instructions allow users to run the code in this repository on a Windows system with a CUDA GPU.

    Files and variables

    File: defeat_dataset.zip

    Setup

    Please follow the steps below to set up the project.

    Download the project repository

    1. Download the project repository defeat_dataset.zip from the location below.
    2. Unzip it and navigate to the project folder; it should contain a subfolder: quexian_dataset.

    Download the data

    1. Download the data: defeat_dataset.zip.
    2. Unzip the downloaded data and move the 'defeat_dataset' folder into the project's main folder.
    3. Make sure that your defeat_dataset folder now contains a subfolder: quexian_dataset.
    4. Within the folder you should find various subfolders such as addquexian-13, quexian_dataset, new_dataset-13, etc.

    Software

    Set up the Python environment:

    1. Download and install Anaconda.
    2. Once Anaconda is installed, open the Anaconda Prompt. On Windows, click Start, search for Anaconda Prompt, and open it.
    3. Create a new conda environment with Python 3.8. You can name it whatever you like, for example yolov8: conda create -n yolov8 python=3.8
    4. Activate the created environment. If the name is yolov8, enter: conda activate yolov8
    5. Download and install Visual Studio Code.
    6. Install PyTorch based on your system. For Windows/Linux users with a CUDA GPU: conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
    7. Install the remaining libraries:
      • scikit-learn: conda install -c anaconda scikit-learn=0.24.1
      • astropy: conda install astropy=4.2.1
      • pandas: conda install -c anaconda pandas=1.2.4
      • Matplotlib: conda install -c conda-forge matplotlib=3.5.3
      • scipy: conda install scipy=1.10.1

    Repeatability

    For PyTorch, it is a well-known fact that there is no guarantee of fully reproducible results between PyTorch versions, individual commits, or different platforms. In addition, results may not be reproducible between CPU and GPU executions, even if the same seed is used. All results in the Analysis Notebook that involve only model evaluation are fully reproducible. However, when it comes to updating the model on the GPU, the results of model training vary between machines.

    Access information

    Other publicly accessible locations of the data: https://tianchi.aliyun.com/dataset/public/
    Data was derived from the following source: https://tianchi.aliyun.com/dataset/140666

    Data availability statement

    The ten defect types used in this study come from the Guangdong Industrial Wisdom Big Data Innovation Competition - Intelligent Algorithm Competition rematch; the dataset download link is https://tianchi.aliyun.com/competition/entrance/231682/information?lang=en-us. The official website provides 4,356 images, including single defect images, multiple defect images, and no defect images. We selected only the single defect and multiple defect images, 3,233 images in total. The ten defects are non-conductive, effacement, miss bottom corner, orange peel, varicolored, jet, lacquer bubble, jump into a pit, divulge the bottom, and blotch. Each image contains one or more defects, and the resolution of the defect images is 2560×1920.

    By investigating the literature, we found that most experiments were done with these 10 defect types, so we chose three additional defect types that differ from the original ten and have more samples, making them suitable for our experiments. The three newly added defect types come from the preliminary dataset of the Guangdong Industrial Wisdom Big Data Intelligent Algorithm Competition, which can be downloaded from https://tianchi.aliyun.com/dataset/140666. It contains 3,000 images in total, among which 109, 73, and 43 images show the defects bruise, camouflage, and coating cracking, respectively. Finally, the 10 defect types from the rematch and the 3 defect types selected from the preliminary round are fused into the new dataset examined in this study.

    In processing the dataset, we tried different division ratios, such as 8:2, 7:3, and 7:2:1. After testing, we found that the experimental results did not differ much between division ratios. Therefore, we divide the dataset in the ratio 7:2:1: the training set accounts for 70%, the validation set for 20%, and the testing set for 10%. The random number seed is set to 0 to ensure that the results are consistent every time the model is trained.

    Finally, the mean Average Precision (mAP) was measured on the dataset a total of three times. The results differed very little each time; for the accuracy of the experimental results, we took the average of the highest and lowest results. The highest was 71.5% and the lowest was 71.1%, resulting in an average detection accuracy of 71.3% for the final experiment.

    All data and images utilized in this research are from publicly available sources, and the original creators have given their consent for these materials to be published in open-access formats.

    The settings for the other parameters are as follows (a sketch of how they map onto a training call follows below): epochs: 200, patience: 50, batch: 16, imgsz: 640, pretrained: true, optimizer: SGD, close_mosaic: 10, iou: 0.7, momentum: 0.937, weight_decay: 0.0005, box: 7.5, cls: 0.5, dfl: 1.5, pose: 12.0, kobj: 1.0, save_dir: runs/train.

    The defeat_dataset.zip is mentioned in the Supporting information section of our manuscript. The underlying data are held at Figshare, DOI: 10.6084/m9.figshare.27922929. The results_images.zip contains the experimental results graphs. The images_1.zip and images_2.zip contain all the images needed to generate the manuscript from manuscript.tex.
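
    For orientation only, the reported settings map onto an Ultralytics YOLOv8 training call roughly as sketched below. The weights file and the data configuration file name are placeholders (assumptions), and the enhanced YOLOv8 variant described in this work is not reproduced here.

    # train_sketch.py -- illustrative sketch using the reported hyperparameters.
    # Assumes the ultralytics package is installed; defect.yaml is a hypothetical
    # dataset config pointing at the 7:2:1 split described above, and yolov8n.pt
    # is placeholder weights, not the enhanced model from this work.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")
    model.train(
        data="defect.yaml",
        epochs=200,
        patience=50,
        batch=16,
        imgsz=640,
        pretrained=True,
        optimizer="SGD",
        close_mosaic=10,
        iou=0.7,
        momentum=0.937,
        weight_decay=0.0005,
        box=7.5,
        cls=0.5,
        dfl=1.5,
        pose=12.0,
        kobj=1.0,
        seed=0,  # matches the fixed random seed described above
    )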
