84 datasets found
  1. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    Available download formats: txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
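
    As a quick illustration of the format described above, here is a minimal sketch for inspecting the evaluation set. The field names ("db_id", "question", "query") follow the standard Spider JSON layout; verify them against the Spider GitHub page before relying on them.

    import json

    # Load the Spider-Realistic evaluation set (508 examples, 19 databases).
    with open("spider-realistic.json") as f:
        examples = json.load(f)

    print(len(examples))  # expected: 508

    # Standard Spider fields (assumed; check the Spider repo for the schema).
    first = examples[0]
    print(first.get("db_id"), "|", first.get("question"))
    print(first.get("query"))  # the SQL query, unchanged from Spider dev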

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al. (2018), and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  2. Data from: BuildingsBench: A Large-Scale Dataset of 900K Buildings and...

    • gimi9.com
    Updated Dec 4, 2024
    + more versions
    Cite
    (2024). BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting [Dataset]. https://www.gimi9.com/dataset/data-gov_buildingsbench-a-large-scale-dataset-of-900k-buildings-and-benchmark-for-short-term-load-f/
    Explore at:
    Dataset updated
    Dec 4, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The BuildingsBench datasets consist of:

    - Buildings-900K: a large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
    - 7 real residential and commercial building datasets for benchmarking two downstream tasks that evaluate generalization: zero-shot STLF and transfer learning for STLF.

    Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale, diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series derived from the NREL End-Use Load Profiles (EULP) dataset (see the link to this database further below). The EULP, however, was not originally developed for STLF; rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Like the EULP, Buildings-900K is a collection of Parquet files and follows nearly the same Parquet dataset organization. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).

    BuildingsBench also provides an evaluation benchmark built from various open-source residential and commercial real building energy consumption datasets. The evaluation datasets, provided alongside Buildings-900K below, are collections of CSV files containing annual energy consumption; altogether they total less than 1 GB. They are:

    - ElectricityLoadDiagrams20112014
    - Building Data Genome Project-2
    - Individual household electric power consumption (Sceaux)
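
    Since Buildings-900K ships as Parquet files with one consumption time series per building, a single file can be examined directly with pandas. This is a minimal sketch; the path below is hypothetical and must be adapted to the layout of the downloaded archive.

    import pandas as pd

    # Hypothetical path into the Buildings-900K Parquet collection; the real
    # directory layout mirrors the NREL EULP organization.
    df = pd.read_parquet("buildings-900k/building_0001.parquet")

    # Each file holds a single building's energy-consumption time series.
    print(df.head())
    print(df.describe())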

  3. Datasets & utils for paper USING PRE-TRAINED MODELS TO PARTIALLY AUTOMATE...

    • zenodo.org
    zip
    Updated Jun 11, 2021
    Cite
    Simone Masiero (2021). Datasets & utils for paper USING PRE-TRAINED MODELS TO PARTIALLY AUTOMATE CODE REVIEW ACTIVITIES [Dataset]. http://doi.org/10.5281/zenodo.4812785
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 11, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Simone Masiero
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Raw and processed datasets and configuration files for pre-training and fine-tuning T5 models.

    • Pre-training dataset: obtained by mining Stack Overflow and CodeSearchNet data.
    • Fine-tuning datasets: we fine-tune our T5 small model on different datasets obtained by mining code review data from Gerrit and GitHub repositories.
      • Fine-tuning dataset v1 (Small): the dataset used by Tufano et al., with abstracted code and raw comments.
      • Fine-tuning dataset v2 (Small): the dataset used by Tufano et al., with non-abstracted code and cleaned comments.
      • Fine-tuning dataset (Large): our new large dataset.
  4. Copernicus-Pretrain

    • huggingface.co
    Updated Jul 4, 2024
    Cite
    Copernicus-Pretrain [Dataset]. https://huggingface.co/datasets/wangyi111/Copernicus-Pretrain
    Explore at:
    Dataset updated
    Jul 4, 2024
    Authors
    Yi Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Copernicus-Pretrain

    Copernicus-Pretrain is a large-scale Earth observation (EO) pretraining dataset with 18.7M aligned images covering all major Sentinel missions (S1, S2, S3, S5P). Officially named Copernicus-Pretrain, it is also referred to as SSL4EO-S ("S" for Sentinel), as it extends SSL4EO-S12 to the whole Sentinel series.

      Dataset Details
    

    Copernicus-Pretrain contains 18.7M aligned images from all major Sentinel missions in operation (Sentinel-1 SAR, Sentinel-2… See the full description on the dataset page: https://huggingface.co/datasets/wangyi111/Copernicus-Pretrain.

  5. M5Product Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Sep 8, 2021
    Cite
    Xiao Dong; Xunlin Zhan; Yangxin Wu; Yunchao Wei; Michael C. Kampffmeyer; XiaoYong Wei; Minlong Lu; YaoWei Wang; Xiaodan Liang (2021). M5Product Dataset [Dataset]. https://paperswithcode.com/dataset/m5product
    Explore at:
    Dataset updated
    Sep 8, 2021
    Authors
    Xiao Dong; Xunlin Zhan; Yangxin Wu; Yunchao Wei; Michael C. Kampffmeyer; XiaoYong Wei; Minlong Lu; YaoWei Wang; Xiaodan Liang
    Description

    The M5Product dataset is a large-scale multi-modal pre-training dataset with coarse and fine-grained annotations for E-products.

    • 6 Million multi-modal samples, 5k properties with 24 Million values

    • 5 modalities: image, text, table, video, audio

    • 6 Million category annotations with 6k classes

    • Wide data sources (provided by 1 million merchants)

  6. Data from: PASS: An ImageNet replacement for self-supervised pretraining...

    • zenodo.org
    • data.niaid.nih.gov
    csv, tar
    Updated Jun 5, 2022
    Cite
    Yuki Markus Asano; Christian Rupprecht; Andrew Zisserman; Andrea Vedaldi (2022). PASS: An ImageNet replacement for self-supervised pretraining without humans [Dataset]. http://doi.org/10.5281/zenodo.5528345
    Explore at:
    Available download formats: tar, csv
    Dataset updated
    Jun 5, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yuki Markus Asano; Christian Rupprecht; Andrew Zisserman; Andrea Vedaldi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Computer vision has long relied on ImageNet and other large datasets of images sampled from the Internet for pretraining models. However, these datasets have ethical and technical shortcomings, such as containing personal information taken without consent, unclear license usage, biases, and, in some cases, even problematic image content. On the other hand, state-of-the-art pretraining is nowadays obtained with unsupervised methods, meaning that labelled datasets such as ImageNet may not be necessary, or perhaps not even optimal, for model pretraining. We thus propose an unlabelled dataset PASS: Pictures without humAns for Self-Supervision. PASS only contains images with CC-BY license and complete attribution metadata, addressing the copyright issue. Most importantly, it contains no images of people at all, and also avoids other types of images that are problematic for data protection or ethics. We show that PASS can be used for pretraining with methods such as MoCo-v2, SwAV and DINO. In the transfer learning setting, it yields similar downstream performances to ImageNet pretraining even on tasks that involve humans, such as human pose estimation. PASS does not make existing datasets obsolete, as for instance it is insufficient for benchmarking. However, it shows that model pretraining is often possible while using safer data, and it also provides the basis for a more robust evaluation of pretraining methods.

    A simple download script is here: https://github.com/yukimasano/PASS/blob/main/download.sh
    Visit our webpage here: https://www.robots.ox.ac.uk/~vgg/research/pass/

  7. pretraining

    • huggingface.co
    Cite
    Dustin, pretraining [Dataset]. https://huggingface.co/datasets/AIGym/pretraining
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Dustin
    Description

    AIGym Pretraining Dataset

      Dataset Description
    

    The AIGym Pretraining Dataset is a large, diverse corpus assembled for pretraining language models and related natural language processing tasks. This dataset aggregates content from four distinct source datasets available on the Hugging Face Hub:

    anothy1/fineweb-edu-cleaned-simplified

    Content: Educational text (originally in the revised_text column).
    Processing: The text is renamed to text and any list structure is… See the full description on the dataset page: https://huggingface.co/datasets/AIGym/pretraining.

  8. MMLU Dataset

    • paperswithcode.com
    Updated Jan 5, 2025
    + more versions
    Cite
    MMLU Dataset [Dataset]. https://paperswithcode.com/dataset/mmlu
    Explore at:
    Dataset updated
    Jan 5, 2025
    Authors
    Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt
    Description

    MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model’s blind spots.
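
    A minimal sketch of loading one MMLU subject through the Hugging Face datasets library; the "cais/mmlu" repository id, config name, and field names are assumptions to verify against the copy you actually use.

    from datasets import load_dataset

    # Pull a single subject's test split (config name assumed).
    mmlu = load_dataset("cais/mmlu", "high_school_mathematics", split="test")

    sample = mmlu[0]
    print(sample["question"])
    for i, choice in enumerate(sample["choices"]):
        print(f"  {chr(65 + i)}. {choice}")
    print("answer:", sample["answer"])  # integer index into choices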

  9. The pretraining information from ViT-Large.

    • plos.figshare.com
    xls
    Updated Mar 6, 2024
    + more versions
    Cite
    Yaoming Yang; Zhili Cai; Shuxia Qiu; Peng Xu (2024). The pretraining information from ViT-Large. [Dataset]. http://doi.org/10.1371/journal.pone.0299265.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Yaoming Yang; Zhili Cai; Shuxia Qiu; Peng Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Computer-aided diagnosis systems based on deep learning algorithms have shown potential applications in the rapid diagnosis of diabetic retinopathy (DR). Given the superior performance of Transformers over convolutional neural networks (CNNs) on natural images, we attempted to develop a new model to classify referable DR based on a limited number of large-size retinal images by using a Transformer. Vision Transformer (ViT) with Masked Autoencoders (MAE) was applied in this study to improve the classification performance of referable DR. We collected over 100,000 publicly available fundus retinal images larger than 224×224 and then pre-trained ViT on these retinal images using MAE. The pre-trained ViT was applied to classify referable DR, and its performance was compared with that of ViT pre-trained on ImageNet. Pre-training with over 100,000 retinal images using MAE improves classification performance more than pre-training with ImageNet. The accuracy, area under curve (AUC), highest sensitivity, and highest specificity of the present model are 93.42%, 0.9853, 0.973, and 0.9539, respectively. This study shows that MAE can provide more flexibility with the input image and substantially reduce the number of images required. Meanwhile, the pretraining dataset scale in this study is much smaller than ImageNet, and pre-trained weights from ImageNet are also not required.

  10. TxT360

    • huggingface.co
    Updated Jul 24, 2024
    Cite
    LLM360 (2024). TxT360 [Dataset]. https://huggingface.co/datasets/LLM360/TxT360
    Explore at:
    Dataset updated
    Jul 24, 2024
    Dataset authored and provided by
    LLM360
    License

    ODC-BY: https://choosealicense.com/licenses/odc-by/

    Description

    TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend

      We introduce TxT360 (Trillion eXtracted Text), the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 commonly used non-web data sources (e.g., FreeLaw, PG-19), providing pretraining teams with a recipe to easily adjust data weighting, obtain the largest high-quality open-source dataset, and train the most performant models.
    
    
    
      TxT360 Compared to Common Pretraining… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.
    
  11. Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
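
    A minimal sketch of the kind of fidelity check described above: comparing one continuous parameter between a real and a synthetic sample with a two-sample t-test. The arrays below are random stand-ins, not VitalDB or GPT-4o outputs.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    real_age = rng.normal(58, 14, size=6000)       # hypothetical real values
    synthetic_age = rng.normal(59, 15, size=6166)  # hypothetical synthetic values

    # Welch's two-sample t-test (no equal-variance assumption).
    t_stat, p_value = stats.ttest_ind(real_age, synthetic_age, equal_var=False)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
    # p >= 0.05 suggests no statistically significant difference for this parameter.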

  12. lsg-roberta-large pretrained weights

    • kaggle.com
    Updated Nov 16, 2022
    Cite
    Kɔuq Wang (2022). lsg-roberta-large pretrained weights [Dataset]. https://www.kaggle.com/datasets/gmhost/lsgrobertalarge-pretrained-weights/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kɔuq Wang
    Description

    Dataset

    This dataset was created by kwang


  13. Transfer learning with generative models for object detection on limited...

    • zenodo.org
    zip
    Updated Jul 29, 2024
    Cite
    Matteo Paiano; Stefano Martina (2024). Transfer learning with generative models for object detection on limited datasets [Dataset]. http://doi.org/10.5281/zenodo.13121950
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo Paiano; Stefano Martina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The provided datasets are used for the analysis in the work "Transfer learning with generative models for object detection on limited datasets" (https://doi.org/10.1088/2632-2153/ad65b5). The availability of data is limited in some fields, especially for object detection tasks, where it is necessary to have correctly labeled bounding boxes around each object. A notable example of such data scarcity is found in the domain of marine biology, where it is useful to develop methods to automatically detect submarine species for environmental monitoring. To address this data limitation, state-of-the-art machine learning strategies employ two main approaches. The first involves pretraining models on existing datasets before generalizing to the specific domain of interest. The second is to create synthetic datasets specifically tailored to the target domain using methods like copy-paste techniques or ad-hoc simulators. The first strategy often faces a significant domain shift, while the second demands custom solutions crafted for the specific task.

    In response to these challenges, we propose a transfer learning framework that is valid for a generic scenario. In this framework, generated images help to improve the performance of an object detector in a few-real-data regime. This is achieved through a diffusion-based generative model that was pretrained on large generic datasets. In contrast to the state of the art, we find that it is not necessary to fine-tune the generative model on the specific domain of interest. We believe this is an important advance because it mitigates the labor-intensive task of manually labeling images for object detection. We validate our approach focusing on fishes in an underwater environment, and on the more common domain of cars in an urban setting. Our method achieves detection performance comparable to models trained on thousands of images, using only a few hundred input images. Our results pave the way for new generative-AI-based protocols for machine learning applications in various domains, ranging for instance from geophysics to biology and medicine.

    The provided datasets are built with the help of GLIGEN and the existing NuImages, OzFish and DeepFish datasets. The file "CarGenerated.zip" contains images generated with GLIGEN, with bounding boxes provided around cars in an urban environment. The file "fishes_on_bkg.zip" provides fish images generated with fishes from DeepFish inpainted with GLIGEN onto generated backgrounds. The file "fish_text.zip" contains images completely generated with GLIGEN containing fishes with annotated bounding boxes. Finally, the file "oz_masked_512.zip" contains a simpler dataset of copy-paste images of DeepFish fishes on OzFish backgrounds. All the files contain the images saved in different folders for training and validation, plus an index file called gt_fish.csv for the bounding boxes.
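
    A minimal sketch of reading the bounding-box index that ships with each archive; the folder layout and column names are assumptions to check against the unpacked files.

    import pandas as pd

    # Hypothetical path inside one of the unpacked archives; adjust as needed.
    boxes = pd.read_csv("fishes_on_bkg/train/gt_fish.csv")

    # Inspect the actual schema before building a data loader around it.
    print(boxes.columns.tolist())
    print(boxes.head())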

  14. CPIA Dataset_Part05: A Comprehensive Pathological Image Analysis Dataset for...

    • scidb.cn
    Updated Mar 20, 2024
    + more versions
    Cite
    Nan Ying; Yanli Lei; Tianyi Zhang; Shangqing Lyu; Sicheng Chen; Zeyu Liu; Yu Zhao; Yunlu Feng; Guanglei Zhang (2024). CPIA Dataset_Part05: A Comprehensive Pathological Image Analysis Dataset for Self-supervised Learning Pre-training [Dataset]. http://doi.org/10.57760/sciencedb.14727
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Nan Ying; Yanli Lei; Tianyi Zhang; Shangqing Lyu; Sicheng Chen; Zeyu Liu; Yu Zhao; Yunlu Feng; Guanglei Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pathological image analysis is a crucial field in computer-aided diagnosis. Transfer learning using models initialized on natural images has improved downstream pathological performance. However, the lack of sophisticated, domain-specific pathological initialization hinders their potential. Self-supervised learning (SSL) enables pre-training without sample-level labels, overcoming the challenge of expensive annotations. Thus, this field calls for a comprehensive dataset, similar to ImageNet in computer vision. This paper presents a large-scale comprehensive pathological image analysis (CPIA) dataset for SSL pre-training. The CPIA dataset contains 148,962,579 images, covering over 48 organs/tissues and approximately 100 kinds of diseases, and includes two main data types: whole slide images (WSIs) and characteristic regions of interest (ROIs). We also establish a multi-scale pathological data processing workflow that incorporates the diagnostic habits of senior pathologists. The CPIA dataset facilitates a comprehensive pathological understanding and enables pattern discovery explorations. Additionally, to launch the CPIA dataset, several state-of-the-art (SOTA) baselines for SSL pre-training and downstream evaluation are specially conducted. This is Part05 of the CPIA dataset, including the CPIA-Mini and partial CPIA dataset. The related code and information are available at https://github.com/zhanglab2021/CPIA_Dataset.

  15. the-stack

    • huggingface.co
    • opendatalab.com
    Updated Oct 27, 2022
    + more versions
    Cite
    the-stack [Dataset]. https://huggingface.co/datasets/bigcode/the-stack
    Explore at:
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    BigCode
    License

    Other: https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack

      Changelog

    - v1.0: Initial release of The Stack. Included 30 programming languages and 18 permissive licenses. Note: three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3 TB in size.

    - v1.1: The three weak-copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
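
    A minimal sketch of streaming a single-language slice of The Stack so the multi-terabyte corpus need not be downloaded in full; the data_dir convention is taken from the dataset card and should be verified there.

    from datasets import load_dataset

    # Stream the Python subset instead of materializing ~3 TB on disk.
    ds = load_dataset(
        "bigcode/the-stack",
        data_dir="data/python",
        split="train",
        streaming=True,
    )

    for i, example in enumerate(ds):
        print(example["content"][:200])  # "content" holds the source file text
        if i == 2:
            break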

  16. WIT Dataset

    • paperswithcode.com
    • huggingface.co
    Updated Jun 14, 2023
    + more versions
    Cite
    Krishna Srinivasan; Karthik Raman; Jiecao Chen; Michael Bendersky; Marc Najork (2023). WIT Dataset [Dataset]. https://paperswithcode.com/dataset/wit
    Explore at:
    Dataset updated
    Jun 14, 2023
    Authors
    Krishna Srinivasan; Karthik Raman; Jiecao Chen; Michael Bendersky; Marc Najork
    Description

    Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

    Key Advantages

    A few unique advantages of WIT:

    - The largest multimodal dataset (at the time of writing) by the number of image-text examples.
    - Massively multilingual (the first of its kind), with coverage for over 100 languages.
    - A diverse collection of concepts and real-world entities.
    - Challenging real-world test sets.

  17. ImageNet-32 Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jul 24, 2018
    Cite
    Patryk Chrabaszcz; Ilya Loshchilov; Frank Hutter (2018). ImageNet-32 Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet-32
    Explore at:
    Dataset updated
    Jul 24, 2018
    Authors
    Patryk Chrabaszcz; Ilya Loshchilov; Frank Hutter
    Description

    ImageNet-32 is a large dataset of small images, the down-sampled version of ImageNet. It is composed of 1,281,167 training images and 50,000 test images with 1,000 labels.

  18. CA-SUM pretrained models

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 20, 2022
    Cite
    Balaouras, Georgios (2022). CA-SUM pretrained models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6562991
    Explore at:
    Dataset updated
    May 20, 2022
    Dataset provided by
    Balaouras, Georgios
    Apostolidis, Evlampios
    Patras, Ioannis
    Mezaris, Vasileios
    Description

    This dataset contains pretrained models of the CA-SUM network architecture for video summarization, that is presented in our work titled “Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames”, in Proc. ACM ICMR 2022.

    Method overview:

    In our ICMR 2022 paper we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches that relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frame dependencies, and the limited ability to parallelize the training process of RNN-based network architectures, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling frame dependencies based on global attention, our method integrates a concentrated attention mechanism that is able to focus on non-overlapping blocks in the main diagonal of the attention matrix, and to enrich the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated frames of the video. In this way, our method makes better estimates about the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations using two benchmarking datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to human preferences. An ablation study that focuses on the introduced components, namely the use of concentrated attention in combination with attention-based estimates of the frames' uniqueness and diversity, shows their relative contributions to the overall summarization performance.

    File format:

    The "pretrained_models.zip" file provided on the present Zenodo page contains a set of pretrained models of the CA-SUM network architecture. After downloading and unpacking this file, in the created "pretrained_models" folder you will find two sub-directories, one for each of the benchmarking datasets used in our experimental evaluations (SumMe and TVSum). Within each of these sub-directories we provide the pretrained model (.pt file) for each data split (split0-split4), where the naming of the provided .pt file indicates the training epoch and the value of the length regularization factor of the selected pretrained model.

    The models have been trained in a full-batch mode (i.e., batch size is equal to the number of training samples) and were automatically selected after the end of the training process, based on a methodology that relies on transductive inference (described in Section 4.2 of [1]). Finally, the data-splits we used for performing inference on the provided pretrained models, and the source code that can be used for training your own models of the proposed CA-SUM network architecture, can be found at: https://github.com/e-apostolidis/CA-SUM.
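
    A minimal sketch of loading one of the provided .pt files with PyTorch; the file name below is hypothetical, following the epoch/regularization naming scheme described above, and the checkpoint should be paired with the model definition from the CA-SUM repository.

    import torch

    # Hypothetical checkpoint path; substitute a real file from the unpacked
    # "pretrained_models" folder.
    state = torch.load(
        "pretrained_models/SumMe/split0/epoch-390_reg-0.5.pt",
        map_location="cpu",
    )

    # Inspect what the checkpoint holds before wiring it into the CA-SUM
    # model from https://github.com/e-apostolidis/CA-SUM.
    print(type(state))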

    License and Citation:

    These resources are provided for academic, non-commercial use only. If you find these resources useful in your work, please cite the following publication where they are introduced:

    E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras. 2022, “Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames”, Proc. of the 2022 Int. Conf. on Multimedia Retrieval (ICMR ’22), June 2022, Newark, NJ, USA. https://doi.org/10.1145/3512527.3531404 Software available at: https://github.com/e-apostolidis/CA-SUM

  19. xlm-roberta-large-squad2-pretrained

    • kaggle.com
    zip
    Updated Sep 30, 2021
    Cite
    Splend1dChan(燦爛) (2021). xlm-roberta-large-squad2-pretrained [Dataset]. https://www.kaggle.com/datasets/a24998667/xlm-roberta-large-squad2-pretrained/suggestions
    Explore at:
    Available download formats: zip (10,337,941,335 bytes)
    Dataset updated
    Sep 30, 2021
    Authors
    Splend1dChan(燦爛)
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Splend1dChan(燦爛)

    Released under CC0: Public Domain


  20. Pretrained models + simulations for our HESSD submission "Towards learning...

    • beta.hydroshare.org
    • hydroshare.org
    • + 1 more
    zip
    Updated Nov 12, 2021
    Cite
    Frederik Kratzert (2021). Pretrained models + simulations for our HESSD submission "Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets" [Dataset]. http://doi.org/10.4211/hs.83ea5312635e44dc824eeb99eda12f06
    Explore at:
    Available download formats: zip (895.7 MB)
    Dataset updated
    Nov 12, 2021
    Dataset provided by
    HydroShare
    Authors
    Frederik Kratzert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contains all models trained for our publication "Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets", as well as the evaluated model simulations. The set contains 48 runs in total, stemming from 3 different models (trained with 8 repetitions) and two different loss functions.
