12 datasets found
  1. Data from: Generative Multi-Purpose Sampler for Weighted M-estimation

    • tandf.figshare.com
    txt
    Updated Jan 17, 2024
    Cite
    Minsuk Shin; Shijie Wang; Jun S. Liu (2024). Generative Multi-Purpose Sampler for Weighted M-estimation [Dataset]. http://doi.org/10.6084/m9.figshare.24776863.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 17, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Minsuk Shin; Shijie Wang; Jun S. Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To overcome computational bottlenecks of various data perturbation procedures such as the bootstrap and cross-validation, we propose the Generative Multi-purpose Sampler (GMS), which directly constructs a generator function to produce solutions of weighted M-estimators from a set of given weights and tuning parameters. The GMS is implemented by a single optimization procedure without having to repeatedly evaluate the minimizers of weighted losses, and is thus capable of significantly reducing the computational time. We demonstrate that the GMS framework enables the implementation of various statistical procedures that would be infeasible in a conventional framework, such as iterated bootstrap procedures and cross-validation for penalized likelihood. To construct a computationally efficient generator function, we also propose a novel form of neural network called the weight multiplicative multilayer perceptron to achieve fast convergence. An R package called GMS is provided, which runs under PyTorch to implement the proposed methods and allows the user to provide a customized loss function tailored to their own models of interest. Supplementary materials for this article are available online.
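    The core idea maps to a small generator-training loop: a network takes a vector of observation weights as input and outputs the corresponding weighted M-estimator, and is trained by minimizing the expected weighted loss over randomly drawn weights. The sketch below illustrates that idea for weighted least squares only; it is not the GMS package's implementation, and all names (GeneratorMLP, n, p, the Exponential weight scheme) are illustrative assumptions.

    ```python
    # Minimal, illustrative sketch of the GMS idea for weighted least squares.
    # The actual GMS package uses a weight-multiplicative MLP and supports tuning parameters.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    n, p = 200, 5
    X = torch.randn(n, p)
    beta_true = torch.randn(p)
    y = X @ beta_true + 0.1 * torch.randn(n)

    class GeneratorMLP(nn.Module):
        """Maps an n-vector of observation weights to a p-vector of coefficients."""
        def __init__(self, n, p, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, p))
        def forward(self, w):
            return self.net(w)

    gen = GeneratorMLP(n, p)
    opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

    for step in range(2000):
        # Bayesian-bootstrap-style weights: i.i.d. Exponential(1), mean 1.
        w = torch.distributions.Exponential(1.0).sample((64, n))
        beta_hat = gen(w)                         # (64, p) candidate estimators
        resid = y.unsqueeze(0) - beta_hat @ X.T   # (64, n) residuals per weight vector
        loss = (w * resid.pow(2)).mean()          # expected weighted loss
        opt.zero_grad(); loss.backward(); opt.step()

    # After training, one forward pass per weight vector replaces one re-fit,
    # so many bootstrap replicates become cheap.
    w_boot = torch.distributions.Exponential(1.0).sample((1000, n))
    boot_betas = gen(w_boot).detach()
    ```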

  2. Data from: Solar flare forecasting based on magnetogram sequences learning...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Dec 4, 2023
    Cite
    Grim, Luís Fernando Lopes; Sampaio Gradvohl, André Leon (2023). Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10246576
    Explore at:
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Universidade Estadual de Campinas (UNICAMP)
    Authors
    Grim, Luís Fernando Lopes; Sampaio Gradvohl, André Leon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source codes and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training deep learning models with GPU support and automatic back-propagation, to load the MViTv2_S models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples, each a sequence of 16 three-channel images resized to 224 × 224 pixels and normalized to the range 0 to 1 (a minimal backbone-loading sketch appears at the end of this entry).

    Most of the papers in our literature survey split the original dataset chronologically, and some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. We adopt a hybrid split: the first 50,000 samples are used for 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. We can thus evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data).

    We developed three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF MViT), was trained only on the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we apply oversampling only on the training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both the training and validation sets. We also trained a model oversampling the entire dataset, called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may bias the results positively.

    GitHub version

    The .zip hosted here contains all files from the project, including the checkpoint and output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.

    Folders Structure

    In the root directory of the project, we have two folders:

    magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; however, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar ARs. It is essential to notice that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).

    Seq_Magnetogram: contains the references for source images with the corresponding labels for the next 24 h and 48 h in the M24 and M48 sub-folders, respectively.

    M24/M48: both present the following sub-folder structure:

    Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test. There are also two files in root:

    inst_packages.sh: installs the packages and dependencies needed to run the models. download_MViTS.py: downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.

    The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to verify head info and check the number of samples categorized by label in the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files, each of which contains a sequence of images pointing to the magnetogram_jpg folder. All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...) and checkpoint files (sample-FLARE...ckpt). Executed model training codes generate the output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores logs of trained models.

    Naming pattern for the files:

    magnetogram_jpg: follows the format "hmi.sharp_720s...magnetogram.fits.jpg", and Seqs16: follows the format "hmi.sharp_720s...to.", where:

    hmi: the instrument that captured the image.
    sharp_720s: the database source of SDO/HMI.
    the SHARP region identifier, which can contain one or more solar ARs classified by NOAA.
    the date-time at which the instrument captured the image, in the format yyyymmdd_hhnnss_TAI (y: year, m: month, d: day, h: hours, n: minutes, s: seconds).
    the date-time at which the sequence starts, following the same format.

    the date-time at which the sequence ends, following the same format.

    Reference text files in M24 and M48 or inside the SF_MViT... folders follow the format "flare_Mclass_.txt", where:

    Seq16 if it refers to a sequence, or void if it refers directly to images.

    "24h" or "48h".

    "TrainVal" or "Test". The next field refers to the Train/Val split.

    void, or "_over" after the extension (...txt_over), meaning a temporary input reference that was over-sampled by a training model.

    All SF_MViT... folders:

    Model training codes: "SF_MViT_M+_", where:

    void, or "oT" (over Train), or "oTV" (over Train and Val), or "oTV_Test" (over Train, Val and Test);

    "24h" or "48h";

    "oneSplit" for a specific split, or "allSplits" if running all splits;

    void (default) to run on 1 GPU, or "2gpu" to run on 2-GPU systems.

    Job submission files: "jobMViT_", where:

    points to the queue in the Lovelace environment hosted at CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace).

    Temporary inputs: "Seq16_flare_Mclass_.txt", where:

    train or val;

    void, or "_over" after the extension (...txt_over), meaning a temporary input reference that was over-sampled by a training model.

    Outputs: "saida_MViT_Adam_10-7", where:

    k0 to k4, the corresponding split of the output, or void if the output is from all splits.

    Error files: "err_MViT_Adam_10-7", where:

    k0 to k4, the corresponding split of the error log file, or void if the error file is from all splits.

    Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=-valid_loss=-Wloss_k=.ckpt", where:

    epoch number of the checkpoint;

    corresponding valid loss;

    0 to 4.
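    As a point of reference for the training setup described above (MViTv2_S with Kinetics-400 weights, batches of ten 16-frame, 3-channel, 224 × 224 clips), a minimal loading sketch using torchvision is shown below. It is an assumption that the backbone was loaded this way; the head replacement and the binary flare / no-flare class count are illustrative.

    ```python
    # Minimal sketch: MViTv2_S backbone with Kinetics-400 weights, adapted to a
    # binary flare-forecasting head. Assumes torchvision >= 0.13; illustrative only.
    import torch
    import torch.nn as nn
    from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

    model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)

    # Replace the 400-class Kinetics head with a 2-class head (flare / no-flare).
    in_features = model.head[-1].in_features
    model.head[-1] = nn.Linear(in_features, 2)

    # One batch as described: 10 samples, 3 channels, 16 frames, 224 x 224, values in [0, 1].
    clips = torch.rand(10, 3, 16, 224, 224)
    logits = model(clips)   # shape: (10, 2)
    ```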

  3. MS_NonMS_Classification(flair + mask)

    • kaggle.com
    zip
    Updated Jul 21, 2025
    Cite
    farah_mo (2025). MS_NonMS_Classification(flair + mask) [Dataset]. https://www.kaggle.com/datasets/farahmo/ms-nonms-classificationfllair
    Explore at:
    Available download formats: zip (1433950034 bytes)
    Dataset updated
    Jul 21, 2025
    Authors
    farah_mo
    Description

    This dataset provides preprocessed FLAIR MRI scans and their corresponding masks, designed to classify Multiple Sclerosis (MS) and Non-MS brain diseases.

    • Non-MS dataset (original source): dataverse
    • Non-MS dataset exported to Kaggle: WMH Dataset
    • MS dataset source: ISBI

    Preprocessing:-

    All data were prepared consistently for training deep learning models. MRI sequence: FLAIR images. Masks: corresponding binary lesion masks highlighting relevant regions. Preprocessing included intensity normalization, resizing, and alignment. Data are structured into MS and Non-MS folders, each containing paired (FLAIR, mask) volumes.

    Model & Training:- We trained a 3D CNN with spatial and channel attention using PyTorch.

    • Inputs: FLAIR image + corresponding mask.
    • Cross-validation: 5-fold cross validation for robust performance evaluation.
    • Metrics: Accuracy, precision, recall, F1-score, and loss curves for training and validation.
    • Checkpoints: Best models saved during training.

    Framework & Implementation:-

    • Framework: PyTorch
    • Model: 3D CNN with spatial & channel attention
    • Evaluation Strategy: 5-fold cross validation + ensemble predictions
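
    The dataset card does not include the model code, so as orientation only, the sketch below shows one common way to add channel and spatial attention (CBAM-style) to a 3D convolutional block in PyTorch; the architecture actually used for this dataset may differ.

    ```python
    # Illustrative CBAM-style channel + spatial attention for 3D volumes
    # (not the authors' exact model). Input/output shape: (B, C, D, H, W).
    import torch
    import torch.nn as nn

    class ChannelAttention3D(nn.Module):
        def __init__(self, channels, reduction=8):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                     nn.Linear(channels // reduction, channels))
        def forward(self, x):
            b, c = x.shape[:2]
            avg = x.mean(dim=(2, 3, 4))            # global average pool -> (B, C)
            mx = x.amax(dim=(2, 3, 4))             # global max pool -> (B, C)
            scale = torch.sigmoid(self.mlp(avg) + self.mlp(mx)).view(b, c, 1, 1, 1)
            return x * scale

    class SpatialAttention3D(nn.Module):
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv3d(2, 1, kernel_size, padding=kernel_size // 2)
        def forward(self, x):
            avg = x.mean(dim=1, keepdim=True)      # (B, 1, D, H, W)
            mx = x.amax(dim=1, keepdim=True)
            scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
            return x * scale

    block = nn.Sequential(
        nn.Conv3d(2, 16, 3, padding=1), nn.BatchNorm3d(16), nn.ReLU(),
        ChannelAttention3D(16), SpatialAttention3D(),
    )
    volume = torch.rand(1, 2, 32, 128, 128)        # e.g. paired (FLAIR, mask) channels
    print(block(volume).shape)                     # torch.Size([1, 16, 32, 128, 128])
    ```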

    Potential Use Cases:-

    • Exploring multi-input classification (FLAIR + mask) for neuroimaging.
    • MS vs Non-MS classification using lesion-focused features.
    • Comparing single-input vs dual-input architectures.
    • Benchmarking advanced attention-based 3D CNNs.

  4. Average score of 5-fold cross validation results of proposed models.

    • figshare.com
    xls
    Updated Jun 6, 2023
    Cite
    Cihun-Siyong Alex Gong; Chih-Hui Simon Su; Kuo-Wei Chao; Yi-Chu Chao; Chin-Kai Su; Wei-Hang Chiu (2023). Average score of 5-fold cross validation results of proposed models. [Dataset]. http://doi.org/10.1371/journal.pone.0259140.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Cihun-Siyong Alex Gong; Chih-Hui Simon Su; Kuo-Wei Chao; Yi-Chu Chao; Chin-Kai Su; Wei-Hang Chiu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Average score of 5-fold cross validation results of proposed models.

  5. Single-Molecule Localization Microscopy (SMLM) 2D Digits 123 and TOL letters...

    • data-staging.niaid.nih.gov
    Updated Apr 9, 2025
    Cite
    Umney, Oliver; Curd, Alistair (2025). Single-Molecule Localization Microscopy (SMLM) 2D Digits 123 and TOL letters and grid dataset processed for machine-learning [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14246302
    Explore at:
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    University of Leeds
    Authors
    Umney, Oliver; Curd, Alistair
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The digits and letters dataset was adapted from Huijben, Teun Adrianus Petrus Maria; Heydarian, Hamidreza; Rieger, B. (Bernd); Stallinga, S. (Sjoerd); Jungmann, R. (Ralf) et al. (2021): Single-Molecule Localization Microscopy (SMLM) 2D Digits 123 and TOL letters datasets. Version 1. 4TU.ResearchData dataset. https://doi.org/10.4121/14074091.v1, under CC BY-NC 4.0.

    clusternet_hcf.tar.gz contains the files for ClusterNet-HCF

    clusternet_lcf.tar.gz contains the files for ClusterNet-LCF

    The salient folders in these are:

    config - configuration files

    preprocessed - train and test files in Apache .parquet format - these files could be used to re-train a different network using the pipeline

    models - contains the trained model

    output - results

    processed - train, validation and test files in PyTorch Geometric format

    scripts

    Note:

    the test files are the RESERVED TEST SET FILES

    the train and validation set together constitute the data that was used in cross-validation

    the clusters contain the handcrafted features.

    To reproduce/visualise results, follow the instructions at https://github.com/oubino/locpix_points/
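
    For quick inspection of the archives described above, a minimal sketch is given below. The exact file names inside the preprocessed/ and processed/ folders are not specified in this description, so the paths used are placeholders; the locpix_points pipeline linked above is the authoritative way to reproduce results.

    ```python
    # Minimal inspection sketch; file names below are placeholders, not the
    # actual names inside clusternet_hcf.tar.gz / clusternet_lcf.tar.gz.
    import tarfile
    import pandas as pd  # reading .parquet requires pyarrow or fastparquet

    # Unpack one of the archives.
    with tarfile.open("clusternet_hcf.tar.gz") as tar:
        tar.extractall("clusternet_hcf")

    # Preprocessed train/test files are Apache Parquet; pandas can read them directly.
    df = pd.read_parquet("clusternet_hcf/preprocessed/some_file.parquet")  # placeholder path
    print(df.columns.tolist(), len(df))
    ```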

  6. Z

    Replication Data for: Geometric Transformers for Protein Interface Contact...

    • data.niaid.nih.gov
    Updated Jun 21, 2022
    Cite
    Morehead, Alex; Chen Chen; Jianlin Cheng (2022). Replication Data for: Geometric Transformers for Protein Interface Contact Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5546774
    Explore at:
    Dataset updated
    Jun 21, 2022
    Dataset provided by
    University of Missouri
    Authors
    Morehead, Alex; Chen Chen; Jianlin Cheng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains replication data for the paper titled "Geometric Transformers for Protein Interface Contact Prediction". The dataset consists of pickled Python dictionaries containing pairs of DGLGraphs that can be used to train and validate protein interface contact prediction models. It also contains our best model checkpoints saved as PyTorch LightningModules. Our GitHub repository, DeepInteract, linked in the "Additional notes" metadata section below, provides more details on how we use these files as examples for cross-validation.
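
    As a reading hint only (the file names and dictionary keys are not specified in this description, so the ones below are placeholders), pickled dictionaries of DGLGraph pairs can be inspected roughly as follows; the DeepInteract repository documents the actual loading code.

    ```python
    # Rough inspection sketch for a pickled dictionary of DGLGraph pairs.
    # The file name and key structure are placeholders; see the DeepInteract repo
    # for the real data-loading pipeline.
    import pickle
    import dgl  # DGL must be installed so the pickled DGLGraph objects can be reconstructed

    with open("complex_pairs.pkl", "rb") as f:   # placeholder file name
        data = pickle.load(f)

    for key, value in list(data.items())[:3]:
        # Each entry is expected to hold a pair of DGLGraphs (one per chain).
        print(key, type(value))
    ```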

  7. InceptionV3

    • kaggle.com
    zip
    Updated Dec 12, 2017
    Cite
    PyTorch (2017). InceptionV3 [Dataset]. https://www.kaggle.com/pytorch/inceptionv3
    Explore at:
    Available download formats: zip (100980456 bytes)
    Dataset updated
    Dec 12, 2017
    Dataset authored and provided by
    PyTorch
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    InceptionV3

    Rethinking the Inception Architecture for Computer Vision

    Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set, demonstrating substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error on the validation set (3.6% error on the test set) and 17.3% top-1 error on the validation set.

    Authors: Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna
    https://arxiv.org/abs/1512.00567

    InceptionV3 Architecture

    https://4.bp.blogspot.com/-TMOLlkJBxms/Vt3HQXpE2cI/AAAAAAAAA8E/7X7XRFOY6Xo/s1600/image03.png

    What is a Pre-trained Model?

    A pre-trained model has been previously trained on a dataset and contains the weights and biases that represent the features of whichever dataset it was trained on. Learned features are often transferable to different data. For example, a model trained on a large dataset of bird images will contain learned features, such as edges or horizontal lines, that would be transferable to your dataset.

    Why use a Pre-trained Model?

    Pre-trained models are beneficial to us for many reasons. By using a pre-trained model you are saving time. Someone else has already spent the time and compute resources to learn a lot of features and your model will likely benefit from it.
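
    As a concrete illustration of the transfer-learning workflow described above (not tied to this specific Kaggle weights file, which predates the current torchvision weights API), InceptionV3 can be loaded with pretrained ImageNet weights and refitted to a new task roughly as follows.

    ```python
    # Illustrative transfer-learning sketch with torchvision's InceptionV3
    # (downloads weights from torchvision rather than this Kaggle file).
    import torch
    import torch.nn as nn
    from torchvision.models import inception_v3, Inception_V3_Weights

    model = inception_v3(weights=Inception_V3_Weights.IMAGENET1K_V1)

    # Freeze the pretrained feature extractor and replace the classifier heads.
    for param in model.parameters():
        param.requires_grad = False
    num_classes = 10                                # your task's class count (illustrative)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, num_classes)

    model.eval()
    with torch.no_grad():
        x = torch.rand(1, 3, 299, 299)              # InceptionV3 expects 299 x 299 inputs
        print(model(x).shape)                       # torch.Size([1, 10])
    ```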

  8. Accuracy and loss function values for the validation set corresponding to...

    • plos.figshare.com
    xls
    Updated Oct 13, 2025
    Cite
    Bo Jiang; Junjiao Hu; Xiaofan Chen; Xiong Wu; Kai Deng; Haitao Yang; Weijun Situ; Shan Jiang (2025). Accuracy and loss function values for the validation set corresponding to each model fold. [Dataset]. http://doi.org/10.1371/journal.pone.0333209.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 13, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Bo Jiang; Junjiao Hu; Xiaofan Chen; Xiong Wu; Kai Deng; Haitao Yang; Weijun Situ; Shan Jiang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accuracy and loss function values for the validation set corresponding to each model fold.

  9. Model weights and training, validation, and test set images and masks for...

    • zenodo.org
    Updated Feb 28, 2025
    Cite
    Kylen Solvik; Yaffa Truelove; JENNIFER BALCH; Michael Lathuilliere; Thiago Fontenelle; Andrea Castanho; Michael Coe; Christina Shintani; CARLOS Souza Jr; Marcia Nunes Macedo (2025). Model weights and training, validation, and test set images and masks for "Uncovering a million small dams in Brazil using deep learning" [Dataset]. http://doi.org/10.5281/zenodo.14927197
    Explore at:
    Dataset updated
    Feb 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kylen Solvik; Yaffa Truelove; JENNIFER BALCH; Michael Lathuilliere; Thiago Fontenelle; Andrea Castanho; Michael Coe; Christina Shintani; CARLOS Souza Jr; Marcia Nunes Macedo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated masks and Sentinel-1/-2 images split into training, validation, and test sets. Used for training a convolutional neural network for small reservoir mapping.

    - manet_sentinel.ckpt: PyTorch model checkpoint file containing model weights.

    - annotations.zip: Contains binary reservoir masks (0 is non-reservoir, 1 is reservoir) split into training, validation, and test sets.

    - images.zip: Contains Sentinel-1/-2 images split into training, validation, and test sets with the following bands:

    1. Blue
    2. Green
    3. Red
    4. Near-infrared
    5. Sentinel-1 SAR VV
    6. Sentinel-1 SAR VH
    7. NDVI
    8. NDWI
    9. Gao's NDWI
    10. MNDWI
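
    A minimal sketch for inspecting the checkpoint and assembling a 10-band input in the order listed above is shown below. Whether the .ckpt stores a Lightning-style state_dict, and the spatial size used, are not stated here, so those parts are assumptions.

    ```python
    # Sketch: inspect manet_sentinel.ckpt and stack the 10 bands in the listed order.
    # Assumes (not stated above) a Lightning-style checkpoint with a 'state_dict' entry.
    import numpy as np
    import torch

    ckpt = torch.load("manet_sentinel.ckpt", map_location="cpu")
    state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
    print(list(state_dict.keys())[:5])   # peek at the first few weight names

    # Band order from the dataset description (indices 0-9).
    bands = ["blue", "green", "red", "nir", "s1_vv", "s1_vh", "ndvi", "ndwi", "gao_ndwi", "mndwi"]
    # image_bands: placeholder dict mapping band name -> 2D array read from images.zip
    image_bands = {name: np.zeros((256, 256), dtype=np.float32) for name in bands}
    x = torch.from_numpy(np.stack([image_bands[b] for b in bands]))   # (10, H, W)
    x = x.unsqueeze(0)                                                # (1, 10, H, W) model input
    ```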

  10. Derivative metrics from confusion matrices generated from the test set for...

    • figshare.com
    xls
    Updated Oct 13, 2025
    Cite
    Bo Jiang; Junjiao Hu; Xiaofan Chen; Xiong Wu; Kai Deng; Haitao Yang; Weijun Situ; Shan Jiang (2025). Derivative metrics from confusion matrices generated from the test set for each model fold. [Dataset]. http://doi.org/10.1371/journal.pone.0333209.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 13, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Bo Jiang; Junjiao Hu; Xiaofan Chen; Xiong Wu; Kai Deng; Haitao Yang; Weijun Situ; Shan Jiang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Derivative metrics from confusion matrices generated from the test set for each model fold.

  11. Derivative metrics from confusion matrices and McNemar’s test results for...

    • plos.figshare.com
    xls
    Updated Oct 13, 2025
    Cite
    Bo Jiang; Junjiao Hu; Xiaofan Chen; Xiong Wu; Kai Deng; Haitao Yang; Weijun Situ; Shan Jiang (2025). Derivative metrics from confusion matrices and McNemar’s test results for each category test in Model A (fold 4) and Model B (fold 1). [Dataset]. http://doi.org/10.1371/journal.pone.0333209.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 13, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Bo Jiang; Junjiao Hu; Xiaofan Chen; Xiong Wu; Kai Deng; Haitao Yang; Weijun Situ; Shan Jiang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Derivative metrics from confusion matrices and McNemar’s test results for each category test in Model A (fold 4) and Model B (fold 1).

  12. Data from: USA News Dataset

    • kaggle.com
    zip
    Updated Aug 11, 2021
    Cite
    Vinayak Shanawad (2021). USA News Dataset [Dataset]. https://www.kaggle.com/datasets/vinayakshanawad/us-news-dataset/code
    Explore at:
    Available download formats: zip (49380427 bytes)
    Dataset updated
    Aug 11, 2021
    Authors
    Vinayak Shanawad
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Problem Description

    Construct two types of models -- (A) a deep learning classifier, such as an LSTM or similar model, to predict the category of a news article given its title and abstract, and (B) a recommendation system to recommend posts that a user is most likely to click.

    The dataset consists of two files -- (1) user_news_clicks.csv, and (2) news_text.csv.

    Model A, the deep learning classifier, only requires the news_text.csv dataset. The goal is to predict the ‘category’ label using the ‘title’ and ‘abstract’ columns. Model B, the recommendation system, only requires user_news_clicks.csv, but you can use news_text.csv in addition if you’d like, though it is not necessary for this exercise. The goal is to recommend news articles that users are likely to click.

    Data Description

    In news_text.csv, each record consists of three attributes and a target variable:
    - Category - there are many news categories available in this dataset; as requested, we only need 3 categories: news, sports and finance
    - news_id - identification number of the news article
    - title - title of the news article
    - abstract - abstract of the news article

    In user_news_clicks.csv, each record consists of two attributes and a target variable:
    - click - whether the user has clicked the article or not
    - user_id - identification number of the user
    - item - identification number of an item

    Goals

    • Design the deep learning classifier and the recommendation system models
    • Build and train the models using a Python deep learning library such as TensorFlow or PyTorch
    • Test the model’s performance using a set of metrics
    • Report on the performance of the model

    Instructions

    NOTE: We do not need to use the entire dataset if resources are limited; feel free to sample.
    - For Model A, use only the top 3 categories -- namely news, sports, and finance -- for model training and validation.
    - Code and build models A and B using a Python library such as PyTorch or TensorFlow.
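
    As a starting point for Model A only (column semantics taken from the data description above; the tokenizer, vocabulary size, and hyperparameters are illustrative, not prescribed by the dataset), a minimal PyTorch LSTM classifier over the concatenated title and abstract could look like this.

    ```python
    # Minimal sketch for Model A: LSTM classifier over title + abstract -> category.
    # Vocabulary handling and hyperparameters are illustrative assumptions.
    import torch
    import torch.nn as nn

    class NewsLSTMClassifier(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * hidden_dim, num_classes)   # 3 classes: news, sports, finance

        def forward(self, token_ids):                          # token_ids: (B, T)
            emb = self.embedding(token_ids)                    # (B, T, E)
            _, (h_n, _) = self.lstm(emb)                       # h_n: (2, B, H)
            features = torch.cat([h_n[0], h_n[1]], dim=1)      # concatenate both directions
            return self.fc(features)                           # (B, 3)

    # Example forward pass on dummy token ids for "title + abstract" text.
    model = NewsLSTMClassifier(vocab_size=20000)
    dummy_batch = torch.randint(1, 20000, (8, 120))
    print(model(dummy_batch).shape)    # torch.Size([8, 3])
    ```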
