4 datasets found
  1. Complete code and datasets for "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships"

    • zenodo.org
    bin, pdf, zip
    Updated Mar 13, 2025
    Cite
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán (2025). Complete code and datasets for "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships" [Dataset]. http://doi.org/10.5281/zenodo.15002575
    Explore at:
    Available download formats: bin, zip, pdf
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    Arxiv
    Authors
    Johan David Rodriguez Portela; Rubén Francisco Manrique Piramanrique; Nicolás Perez Terán
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships

    This is the complete code, models, and datasets for the paper ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships.

    Installation

    This repository is a Poetry project, so it can be installed by running the following command from a shell in the repository folder:

    poetry install

    As this repository is script-based, the README.md file contains all the commands used to generate the dataset and train the models.

    ----------------------------------------------------------------------------------------------

    Core code

    The core code used for all the experiments is in the folder auto-nli, and all calls to it, with the parameters used, are listed in README.md.

    ----------------------------------------------------------------------------------------------

    Parameters

    All the parameters to create datasets and train models with the core code are found in the folder parameters.

    ----------------------------------------------------------------------------------------------

    Models

    Model types

    For the BERT-based models, all in PyTorch, two types of Hugging Face models were used for training; the same base model is also required when loading a dataset, because of its tokenizer:

    Model folder

    The model folder contains all the trained models for the paper. There are three types of models:

    • baseline: An XGBoost model that can be loaded with pickle.
    • roberta: BERTIN-based models in PyTorch. You can load them with the model_path
    • xlmroberta: XLM-RoBERTa-based models in PyTorch. You can load them with the model_path

    Models with the suffix _annot were trained with the premise (first sentence) only. Apart from the PyTorch model folder, each model result folder (ex: ) contains the test results for the test set and the stress test sets (ex: ).

    Load model

    Models are found in the folder model; all of them are PyTorch models that can be loaded with the Hugging Face interface:

    from transformers import AutoModel

    # Placeholder path: point this at one of the trained model folders inside `model`
    model = AutoModel.from_pretrained('<path_to_model_folder>')
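
    Because loading a dataset also requires the tokenizer of the corresponding base model (see Model types above), the matching tokenizer can be fetched through the same interface. A hedged sketch, assuming the publicly available BERTIN and XLM-RoBERTa base checkpoints; substitute the checkpoints referenced in the parameters folder:

    from transformers import AutoTokenizer

    # Assumed base checkpoints, shown for illustration only.
    roberta_tokenizer = AutoTokenizer.from_pretrained('bertin-project/bertin-roberta-base-spanish')
    xlmroberta_tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')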

    ----------------------------------------------------------------------------------------------

    Dataset

    labeled_final_dataset.jsonl

    This file is included outside the ZIP that contains all other files. It holds the final test dataset: 974 examples whose human majority label matches the original linking-phrase label.

    Other datasets:

    The datasets can be found in the folder data, which is divided into the following subfolders:

    base_dataset

    The splits to train, validate and test the models.

    splits_data

    Train-val-test splits extracted from each corpus. They are used to generate base_dataset.

    sentence_data

    Pairs of sentences found in each corpus. They are used to generate splits_data.

    Dataset dictionary

    This repository contains the splits that resulted from the research project "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships". All the splits are in JSONL format and have the same fields per example:

    • sentence_1: First sentence of the pair.
    • sentence_2: Second sentence of the pair.
    • connector: Linking phrase used to extract the pair.
    • connector_type: NLI label, one of "contrasting", "entailment", "reasoning" or "neutral".
    • extraction_strategy: "linking_phrase" for "contrasting", "entailment" and "reasoning"; "none" for "neutral".
    • distance: How many sentences before the connector sentence_1 appears.
    • sentence_1_position: Sentence index of sentence_1 in the source document.
    • sentence_1_paragraph: Paragraph index of sentence_1 in the source document.
    • sentence_2_position: Sentence index of sentence_2 in the source document.
    • sentence_2_paragraph: Paragraph index of sentence_2 in the source document.
    • id: Unique identifier for the example.
    • dataset: Source corpus of the pair. Corpus metadata, including the source, can be found in dataset_metadata.xlsx.
    • genre: Writing genre of the dataset.
    • domain: Domain of the dataset.

    Example:

    {"sentence_1":"sefior Bcajavides no es moderado, tampoco lo convertirse e\u00f1 declarada divergencia de miras polileido en griego","sentence_2":"era mayor claricomentarios, as\u00ed de los peri\u00f3dicos como de los homes dado \u00e1 la voluntad de los hombres, sin que sobreticas","connector":"por consiguiente,","connector_type":"reasoning","extraction_strategy":"linking_phrase","distance":1.0,"sentence_1_paragraph":4,"sentence_1_position":86,"sentence_2_paragraph":4,"sentence_2_position":87,"id":"esnews_spanish_pd_news_531537","dataset":"esnews_spanish_pd_news","genre":"news","domain":"spanish_public_domain_news"}

    Dataset load

    To load a dataset split as a PyTorch object for training, validation or testing, use the custom dataset class:

    import os

    from auto_nli.model.bert_based.dataset import BERTDataset

    # Placeholder arguments: the split file name and parameter values are elided here;
    # see README.md and the parameters folder for the values used in the paper.
    dataset = BERTDataset(
        os.path.join(dataset_folder, '<split_file>.jsonl'),
        max_len=...,
        model_type=...,
        only_premise=...,
        max_samples=...,
    )
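
    Assuming BERTDataset implements the standard torch.utils.data.Dataset interface, the loaded split can then be batched with a regular PyTorch DataLoader; a minimal sketch (batch size is illustrative):

    from torch.utils.data import DataLoader

    # Wrap the custom dataset in a standard DataLoader for training or evaluation loops.
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    for batch in loader:
        ...  # forward pass, loss computation, etc.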

    ----------------------------------------------------------------------------------------------

    Notebooks

    The folder notebooks contains a collection of Jupyter notebooks used to preprocess datasets and visualize results.

  2. Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation

    • redu.unicamp.br
    • data.niaid.nih.gov
    • +1more
    Updated Jul 15, 2024
    Cite
    Repositório de Dados de Pesquisa da Unicamp (2024). Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation [Dataset]. http://doi.org/10.25824/redu/IH0AH0
    Explore at:
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Repositório de Dados de Pesquisa da Unicamp
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
    Description

    Source code and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training deep learning models with GPU support and automatic back-propagation, to load the MViTv2_S models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we used the PyTorch Lightning module. The inputs were batches of 10 samples, each a sequence of 16 3-channel images resized to 224 × 224 pixels and normalized from 0 to 1.

    Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split: the first 50,000 samples go through 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. We can thus evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data).

    We developed three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF_MViT), was trained only with the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF_MViT_oT), we apply oversampling only on the training data, keeping the original validation set. In the third model, Solar Flare MViT over Train and Validation (SF_MViT_oTV), we apply oversampling in both the training and validation sets. We also trained a model oversampling the entire dataset, called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may bias the results positively.

    GitHub version

    The .zip hosted here contains all files from the project, including the checkpoint and output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.

    Folder structure

    In the root directory of the project, we have two folders:

    • magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes; however, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar active regions (ARs). Note that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).
    • Seq_Magnetogram: contains the references to the source images with the corresponding labels for the next 24 h and 48 h, in the M24 and M48 sub-folders respectively.

    M24/M48: both contain the following sub-folders: Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test.

    There are also two files in the root:

    • inst_packages.sh: installs the packages and dependencies needed to run the models.
    • download_MViTS.py: downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.

    The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to verify head info and check the number of samples per label in the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files in which each file lists a sequence of images pointing to the magnetogram_jpg folder.

    All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MViT... and MViT_S...), error (err_MViT...) and checkpoint files (sample-FLARE...ckpt). Running the model training codes generates the output, error and checkpoint files. There is also a folder called "lightning_logs" that stores the logs of trained models.

    Naming pattern for the files:

    • magnetogram_jpg follows the format "hmi.sharp_720s...magnetogram.fits.jpg" and Seqs16 follows the format "hmi.sharp_720s...to.", where: hmi is the instrument that captured the image; sharp_720s is the database source of SDO/HMI; the SHARP region identifier can contain one or more solar ARs classified by NOAA; the capture date-time is in the format yyyymmdd_hhnnss_TAI (y: year, m: month, d: day, h: hours, n: minutes, s: seconds); the sequence start and end date-times follow the same format.
    • Reference text files in M24 and M48 or inside the SF_MViT... folders follow the format "flare_Mclass_.txt", where: the prefix is Seq16 if the file refers to a sequence, or empty if it refers directly to images; the horizon is "24h" or "48h"; the set is "TrainVal" or "Test", with a further placeholder referring to the Train/Val split; "_over" appended after the extension (...txt_over) means a temporary input reference that was over-sampled by a training model.
    • Model training codes: "SF_MViT_M+_", where: the oversampling tag is empty, "oT" (over Train), "oTV" (over Train and Val) or "oTV_Test" (over Train, Val and Test); the horizon is "24h" or "48h"; the split mode is "oneSplit" for a specific split or "allSplits" if it runs all splits; the GPU tag is empty by default (1 GPU) or "2gpu" to run on 2-GPU systems.
    • Job submission files: "jobMViT_", where the suffix points to the queue in the Lovelace environment hosted at CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace).
    • Temporary inputs: "Seq16_flare_Mclass_.txt", where: the set is train or val; "_over" appended after the extension (...txt_over) means a temporary input reference that was over-sampled by a training model.
    • Outputs: "saida_MViT_Adam_10-7", where the fold tag is k0 to k4, indicating the corresponding split, or empty if the output is from all splits.
    • Error files: "err_MViT_Adam_10-7", where the fold tag is k0 to k4, indicating the corresponding split, or empty if the error file is from all splits.
    • Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=-valid_loss=-Wloss_k=.ckpt", where the placeholders are the epoch number of the checkpoint, the corresponding validation loss, and the fold (0 to 4).
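
    As a hedged sketch (assuming the standard torchvision distribution of the backbone, which is presumably what download_MViTS.py caches), the pre-trained MViTv2_S with Kinetics-400 weights can be loaded and fed a batch shaped like the inputs described above:

    import torch
    from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

    # Load MViTv2_S pre-trained on Kinetics-400; torchvision stores the weights in the local cache.
    model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
    model.eval()

    # Dummy batch matching the description: 10 samples, 3 channels, 16 frames, 224 x 224 pixels.
    x = torch.rand(10, 3, 16, 224, 224)
    with torch.no_grad():
        out = model(x)
    print(out.shape)  # (10, 400): the Kinetics-400 classification head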

  3. mini-imagenet

    • huggingface.co
    Updated Nov 21, 2024
    Cite
    PyTorch Image Models (2024). mini-imagenet [Dataset]. https://huggingface.co/datasets/timm/mini-imagenet
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 21, 2024
    Dataset authored and provided by
    PyTorch Image Models
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Description

    A mini version of ImageNet-1k with 100 of the 1000 classes present. Unlike some 'mini' variants, this one includes the original images at their original sizes; many such subsets downsample to 84x84 or other smaller resolutions.

    Data Splits

    • Train: 50,000 samples from the ImageNet-1k train split
    • Validation: 10,000 samples from the ImageNet-1k train split
    • Test: 5,000 samples from the ImageNet-1k validation split (all 50 samples per class)

    … See the full description on the dataset page: https://huggingface.co/datasets/timm/mini-imagenet.
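
    As a hedged illustration (not part of the dataset card), the dataset can be loaded with the Hugging Face datasets library using the repository id from the URL above:

    from datasets import load_dataset

    # Loads timm/mini-imagenet from the Hugging Face Hub (id taken from the dataset page URL).
    ds = load_dataset("timm/mini-imagenet")
    print({split: len(ds[split]) for split in ds})  # expected sizes: train 50000, validation 10000, test 5000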

  4. resisc45

    • huggingface.co
    Updated Jun 12, 2024
    + more versions
    Cite
    PyTorch Image Models (2024). resisc45 [Dataset]. https://huggingface.co/datasets/timm/resisc45
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 12, 2024
    Dataset authored and provided by
    PyTorch Image Models
    License

    https://choosealicense.com/licenses/unknown/

    Description

    The RESISC45 dataset is a publicly available benchmark for Remote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU). It contains 31,500 images covering 45 scene classes, with 700 images in each class. The dataset does not have any default splits; the train, validation, and test splits were based on these definitions here… See the full description on the dataset page: https://huggingface.co/datasets/timm/resisc45.

