84 datasets found
  1. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    Available download formats: txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
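
    As a quick illustration of the format described above, here is a minimal sketch for inspecting the evaluation set. The field names ("db_id", "question", "query") follow the standard Spider JSON layout; verify them against the Spider GitHub page before relying on them.

    import json

    # Load the Spider-Realistic evaluation set (508 examples, 19 databases).
    with open("spider-realistic.json") as f:
        examples = json.load(f)

    print(len(examples))  # expected: 508

    # Standard Spider fields (assumed; check the Spider repo for the schema).
    first = examples[0]
    print(first.get("db_id"), "|", first.get("question"))
    print(first.get("query"))  # the SQL query, unchanged from Spider dev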

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al. (2018), and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  2. Data from: BuildingsBench: A Large-Scale Dataset of 900K Buildings and...

    • gimi9.com
    Updated Dec 4, 2024
    + more versions
    Cite
    (2024). BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting [Dataset]. https://www.gimi9.com/dataset/data-gov_buildingsbench-a-large-scale-dataset-of-900k-buildings-and-benchmark-for-short-term-load-f/
    Explore at:
    Dataset updated
    Dec 4, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The BuildingsBench datasets consist of:

    - Buildings-900K: a large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
    - 7 real residential and commercial building datasets for benchmarking two downstream tasks that evaluate generalization: zero-shot STLF and transfer learning for STLF.

    Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale, diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series derived from the NREL End-Use Load Profiles (EULP) dataset (see the link to this database further below). The EULP, however, was not originally developed for STLF; rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Like the EULP, Buildings-900K is a collection of Parquet files and follows nearly the same Parquet dataset organization. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).

    BuildingsBench also provides an evaluation benchmark built from various open-source residential and commercial real building energy consumption datasets. The evaluation datasets, provided alongside Buildings-900K below, are collections of CSV files containing annual energy consumption; altogether they total less than 1 GB. They are:

    - ElectricityLoadDiagrams20112014
    - Building Data Genome Project-2
    - Individual household electric power consumption (Sceaux)
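
    Since Buildings-900K ships as Parquet files with one consumption time series per building, a single file can be examined directly with pandas. This is a minimal sketch; the path below is hypothetical and must be adapted to the layout of the downloaded archive.

    import pandas as pd

    # Hypothetical path into the Buildings-900K Parquet collection; the real
    # directory layout mirrors the NREL EULP organization.
    df = pd.read_parquet("buildings-900k/building_0001.parquet")

    # Each file holds a single building's energy-consumption time series.
    print(df.head())
    print(df.describe())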

  3. Datasets & utils for paper USING PRE-TRAINED MODELS TO PARTIALLY AUTOMATE...

    • zenodo.org
    zip
    Updated Jun 11, 2021
    Cite
    Simone Masiero (2021). Datasets & utils for paper USING PRE-TRAINED MODELS TO PARTIALLY AUTOMATE CODE REVIEW ACTIVITIES [Dataset]. http://doi.org/10.5281/zenodo.4812785
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 11, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Simone Masiero
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Raw and processed datasets and configuration files for pre-training and fine-tuning T5 models.

    • Pre-training dataset: obtained by mining Stack Overflow and CodeSearchNet data.
    • Fine-tuning datasets: we fine-tune our T5 small model on different datasets obtained by mining code review data from Gerrit and GitHub repositories.
      • Fine-tuning dataset v1 (Small): the dataset used by Tufano et al., with abstracted code and raw comments.
      • Fine-tuning dataset v2 (Small): the dataset used by Tufano et al., with non-abstracted code and cleaned comments.
      • Fine-tuning dataset (Large): our new large dataset.
  4. Copernicus-Pretrain

    • huggingface.co
    Updated Jul 4, 2024
    Cite
    Copernicus-Pretrain [Dataset]. https://huggingface.co/datasets/wangyi111/Copernicus-Pretrain
    Explore at:
    Dataset updated
    Jul 4, 2024
    Authors
    Yi Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Copernicus-Pretrain

    Copernicus-Pretrain is a large-scale Earth observation (EO) pretraining dataset with 18.7M aligned images covering all major Sentinel missions (S1, S2, S3, S5P). Officially named Copernicus-Pretrain, it is also referred to as SSL4EO-S ("S" for Sentinel), as it extends SSL4EO-S12 to the whole Sentinel series.

      Dataset Details
    

    Copernicus-Pretrain contains 18.7M aligned images from all major Sentinel missions in operation (Sentinel-1 SAR, Sentinel-2… See the full description on the dataset page: https://huggingface.co/datasets/wangyi111/Copernicus-Pretrain.

  5. M5Product Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Sep 8, 2021
    Cite
    Xiao Dong; Xunlin Zhan; Yangxin Wu; Yunchao Wei; Michael C. Kampffmeyer; XiaoYong Wei; Minlong Lu; YaoWei Wang; Xiaodan Liang (2021). M5Product Dataset [Dataset]. https://paperswithcode.com/dataset/m5product
    Explore at:
    Dataset updated
    Sep 8, 2021
    Authors
    Xiao Dong; Xunlin Zhan; Yangxin Wu; Yunchao Wei; Michael C. Kampffmeyer; XiaoYong Wei; Minlong Lu; YaoWei Wang; Xiaodan Liang
    Description

    The M5Product dataset is a large-scale multi-modal pre-training dataset with coarse and fine-grained annotations for E-products.

    • 6 Million multi-modal samples, 5k properties with 24 Million values

    • 5 modalities: image, text, table, video, audio

    • 6 Million category annotations with 6k classes

    • Wide data sources (provided by 1 million merchants)

  6. Data from: PASS: An ImageNet replacement for self-supervised pretraining...

    • zenodo.org
    • data.niaid.nih.gov
    csv, tar
    Updated Jun 5, 2022
    Cite
    Yuki Markus Asano; Christian Rupprecht; Andrew Zisserman; Andrea Vedaldi (2022). PASS: An ImageNet replacement for self-supervised pretraining without humans [Dataset]. http://doi.org/10.5281/zenodo.5528345
    Explore at:
    Available download formats: tar, csv
    Dataset updated
    Jun 5, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yuki Markus Asano; Christian Rupprecht; Andrew Zisserman; Andrea Vedaldi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Computer vision has long relied on ImageNet and other large datasets of images sampled from the Internet for pretraining models. However, these datasets have ethical and technical shortcomings, such as containing personal information taken without consent, unclear license usage, biases, and, in some cases, even problematic image content. On the other hand, state-of-the-art pretraining is nowadays obtained with unsupervised methods, meaning that labelled datasets such as ImageNet may not be necessary, or perhaps not even optimal, for model pretraining. We thus propose an unlabelled dataset PASS: Pictures without humAns for Self-Supervision. PASS only contains images with CC-BY license and complete attribution metadata, addressing the copyright issue. Most importantly, it contains no images of people at all, and also avoids other types of images that are problematic for data protection or ethics. We show that PASS can be used for pretraining with methods such as MoCo-v2, SwAV and DINO. In the transfer learning setting, it yields similar downstream performances to ImageNet pretraining even on tasks that involve humans, such as human pose estimation. PASS does not make existing datasets obsolete, as for instance it is insufficient for benchmarking. However, it shows that model pretraining is often possible while using safer data, and it also provides the basis for a more robust evaluation of pretraining methods.

    A simple download script is here: https://github.com/yukimasano/PASS/blob/main/download.sh
    Visit our webpage here: https://www.robots.ox.ac.uk/~vgg/research/pass/

  7. pretraining

    • huggingface.co
    Cite
    Dustin, pretraining [Dataset]. https://huggingface.co/datasets/AIGym/pretraining
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Dustin
    Description

    AIGym Pretraining Dataset

      Dataset Description
    

    The AIGym Pretraining Dataset is a large, diverse corpus assembled for pretraining language models and related natural language processing tasks. This dataset aggregates content from four distinct source datasets available on the Hugging Face Hub:

    anothy1/fineweb-edu-cleaned-simplified

    Content: Educational text (originally in the revised_text column).
    Processing: The text is renamed to text and any list structure is… See the full description on the dataset page: https://huggingface.co/datasets/AIGym/pretraining.

  8. MMLU Dataset

    • paperswithcode.com
    Updated Jan 5, 2025
    + more versions
    Cite
    MMLU Dataset [Dataset]. https://paperswithcode.com/dataset/mmlu
    Explore at:
    Dataset updated
    Jan 5, 2025
    Authors
    Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt
    Description

    MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model’s blind spots.
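
    A minimal sketch of loading one MMLU subject through the Hugging Face datasets library; the "cais/mmlu" repository id, config name, and field names are assumptions to verify against the copy you actually use.

    from datasets import load_dataset

    # Pull a single subject's test split (config name assumed).
    mmlu = load_dataset("cais/mmlu", "high_school_mathematics", split="test")

    sample = mmlu[0]
    print(sample["question"])
    for i, choice in enumerate(sample["choices"]):
        print(f"  {chr(65 + i)}. {choice}")
    print("answer:", sample["answer"])  # integer index into choices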

  9. The pretraining information from ViT-Large.

    • plos.figshare.com
    xls
    Updated Mar 6, 2024
    + more versions
    Cite
    Yaoming Yang; Zhili Cai; Shuxia Qiu; Peng Xu (2024). The pretraining information from ViT-Large. [Dataset]. http://doi.org/10.1371/journal.pone.0299265.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Yaoming Yang; Zhili Cai; Shuxia Qiu; Peng Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Computer-aided diagnosis systems based on deep learning algorithms have shown potential applications in the rapid diagnosis of diabetic retinopathy (DR). Given the superior performance of Transformers over convolutional neural networks (CNNs) on natural images, we attempted to develop a new model to classify referable DR based on a limited number of large-size retinal images by using a Transformer. Vision Transformer (ViT) with Masked Autoencoders (MAE) was applied in this study to improve the classification performance of referable DR. We collected over 100,000 publicly available fundus retinal images larger than 224×224 and then pre-trained ViT on these retinal images using MAE. The pre-trained ViT was applied to classify referable DR, and its performance was compared with that of ViT pre-trained on ImageNet. Pre-training with over 100,000 retinal images using MAE improves classification performance more than pre-training with ImageNet. The accuracy, area under curve (AUC), highest sensitivity, and highest specificity of the present model are 93.42%, 0.9853, 0.973, and 0.9539, respectively. This study shows that MAE can provide more flexibility with the input image and substantially reduce the number of images required. Meanwhile, the pretraining dataset scale in this study is much smaller than ImageNet, and pre-trained weights from ImageNet are also not required.

  10. TxT360

    • huggingface.co
    Updated Jul 24, 2024
    Cite
    LLM360 (2024). TxT360 [Dataset]. https://huggingface.co/datasets/LLM360/TxT360
    Explore at:
    Dataset updated
    Jul 24, 2024
    Dataset authored and provided by
    LLM360
    License

    ODC-BY: https://choosealicense.com/licenses/odc-by/

    Description

    TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend

      We introduce TxT360 (Trillion eXtracted Text), the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 commonly used non-web data sources (e.g., FreeLaw, PG-19), providing pretraining teams with a recipe to easily adjust data weighting, obtain the largest high-quality open-source dataset, and train the most performant models.
    
    
    
      TxT360 Compared to Common Pretraining… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.
    
  11. Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
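
    A minimal sketch of the kind of fidelity check described above: comparing one continuous parameter between a real and a synthetic sample with a two-sample t-test. The arrays below are random stand-ins, not VitalDB or GPT-4o outputs.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    real_age = rng.normal(58, 14, size=6000)       # hypothetical real values
    synthetic_age = rng.normal(59, 15, size=6166)  # hypothetical synthetic values

    # Welch's two-sample t-test (no equal-variance assumption).
    t_stat, p_value = stats.ttest_ind(real_age, synthetic_age, equal_var=False)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
    # p >= 0.05 suggests no statistically significant difference for this parameter.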

  12. lsg-roberta-large pretrained weights

    • kaggle.com
    Updated Nov 16, 2022
    Cite
    Kɔuq Wang (2022). lsg-roberta-large pretrained weights [Dataset]. https://www.kaggle.com/datasets/gmhost/lsgrobertalarge-pretrained-weights/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kɔuq Wang
    Description

    Dataset

    This dataset was created by kwang


  13. Transfer learning with generative models for object detection on limited...

    • zenodo.org
    zip
    Updated Jul 29, 2024
    Cite
    Matteo Paiano; Stefano Martina (2024). Transfer learning with generative models for object detection on limited datasets [Dataset]. http://doi.org/10.5281/zenodo.13121950
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo Paiano; Stefano Martina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The provided datasets are used for the analysis in the work "Transfer learning with generative models for object detection on limited datasets" (https://doi.org/10.1088/2632-2153/ad65b5). The availability of data is limited in some fields, especially for object detection tasks, where it is necessary to have correctly labeled bounding boxes around each object. A notable example of such data scarcity is found in the domain of marine biology, where it is useful to develop methods to automatically detect submarine species for environmental monitoring. To address this data limitation, state-of-the-art machine learning strategies employ two main approaches. The first involves pretraining models on existing datasets before generalizing to the specific domain of interest. The second is to create synthetic datasets specifically tailored to the target domain using methods like copy-paste techniques or ad-hoc simulators. The first strategy often faces a significant domain shift, while the second demands custom solutions crafted for the specific task.

    In response to these challenges, we propose a transfer learning framework that is valid for a generic scenario. In this framework, generated images help to improve the performance of an object detector in a few-real-data regime. This is achieved through a diffusion-based generative model that was pretrained on large generic datasets. In contrast to the state of the art, we find that it is not necessary to fine-tune the generative model on the specific domain of interest. We believe this is an important advance because it mitigates the labor-intensive task of manually labeling images for object detection. We validate our approach focusing on fishes in an underwater environment, and on the more common domain of cars in an urban setting. Our method achieves detection performance comparable to models trained on thousands of images, using only a few hundred input images. Our results pave the way for new generative-AI-based protocols for machine learning applications in various domains, ranging for instance from geophysics to biology and medicine.

    The provided datasets are built with the help of GLIGEN and the existing NuImages, OzFish and DeepFish datasets. The file "CarGenerated.zip" contains images generated with GLIGEN, with bounding boxes provided around cars in an urban environment. The file "fishes_on_bkg.zip" provides fish images generated with fishes from DeepFish inpainted with GLIGEN onto generated backgrounds. The file "fish_text.zip" contains images completely generated with GLIGEN containing fishes with annotated bounding boxes. Finally, the file "oz_masked_512.zip" contains a simpler dataset of copy-paste images of DeepFish fishes on OzFish backgrounds. All the files contain the images saved in different folders for training and validation, plus an index file called gt_fish.csv for the bounding boxes.
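
    A minimal sketch of reading the bounding-box index that ships with each archive; the folder layout and column names are assumptions to check against the unpacked files.

    import pandas as pd

    # Hypothetical path inside one of the unpacked archives; adjust as needed.
    boxes = pd.read_csv("fishes_on_bkg/train/gt_fish.csv")

    # Inspect the actual schema before building a data loader around it.
    print(boxes.columns.tolist())
    print(boxes.head())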

  14. CPIA Dataset_Part05: A Comprehensive Pathological Image Analysis Dataset for...

    • scidb.cn
    Updated Mar 20, 2024
    + more versions
    Cite
    Nan Ying; Yanli Lei; Tianyi Zhang; Shangqing Lyu; Sicheng Chen; Zeyu Liu; Yu Zhao; Yunlu Feng; Guanglei Zhang (2024). CPIA Dataset_Part05: A Comprehensive Pathological Image Analysis Dataset for Self-supervised Learning Pre-training [Dataset]. http://doi.org/10.57760/sciencedb.14727
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Nan Ying; Yanli Lei; Tianyi Zhang; Shangqing Lyu; Sicheng Chen; Zeyu Liu; Yu Zhao; Yunlu Feng; Guanglei Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pathological image analysis is a crucial field in computer-aided diagnosis. Transfer learning using models initialized on natural images has improved downstream pathological performance. However, the lack of sophisticated, domain-specific pathological initialization hinders their potential. Self-supervised learning (SSL) enables pre-training without sample-level labels, overcoming the challenge of expensive annotations. Thus, this field calls for a comprehensive dataset, similar to ImageNet in computer vision. This paper presents a large-scale comprehensive pathological image analysis (CPIA) dataset for SSL pre-training. The CPIA dataset contains 148,962,579 images, covering over 48 organs/tissues and approximately 100 kinds of diseases, and includes two main data types: whole slide images (WSIs) and characteristic regions of interest (ROIs). We also establish a multi-scale pathological data processing workflow that incorporates the diagnostic habits of senior pathologists. The CPIA dataset facilitates a comprehensive pathological understanding and enables pattern discovery explorations. Additionally, to launch the CPIA dataset, several state-of-the-art (SOTA) baselines for SSL pre-training and downstream evaluation are specially conducted. This is Part05 of the CPIA dataset, including the CPIA-Mini and partial CPIA dataset. The related code and information are available at https://github.com/zhanglab2021/CPIA_Dataset.

  15. the-stack

    • huggingface.co
    • opendatalab.com
    Updated Oct 27, 2022
    + more versions
    Cite
    the-stack [Dataset]. https://huggingface.co/datasets/bigcode/the-stack
    Explore at:
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    BigCode
    License

    Other: https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack

      Changelog

    - v1.0: Initial release of The Stack. Included 30 programming languages and 18 permissive licenses. Note: three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3 TB in size.

    - v1.1: The three weak-copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
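
    A minimal sketch of streaming a single-language slice of The Stack so the multi-terabyte corpus need not be downloaded in full; the data_dir convention is taken from the dataset card and should be verified there.

    from datasets import load_dataset

    # Stream the Python subset instead of materializing ~3 TB on disk.
    ds = load_dataset(
        "bigcode/the-stack",
        data_dir="data/python",
        split="train",
        streaming=True,
    )

    for i, example in enumerate(ds):
        print(example["content"][:200])  # "content" holds the source file text
        if i == 2:
            break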

  16. WIT Dataset

    • paperswithcode.com
    • huggingface.co
    Updated Jun 14, 2023
    + more versions
    Cite
    Krishna Srinivasan; Karthik Raman; Jiecao Chen; Michael Bendersky; Marc Najork (2023). WIT Dataset [Dataset]. https://paperswithcode.com/dataset/wit
    Explore at:
    Dataset updated
    Jun 14, 2023
    Authors
    Krishna Srinivasan; Karthik Raman; Jiecao Chen; Michael Bendersky; Marc Najork
    Description

    Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

    Key Advantages

    A few unique advantages of WIT:

    - The largest multimodal dataset (at the time of writing) by the number of image-text examples.
    - Massively multilingual (the first of its kind), with coverage for over 100 languages.
    - A diverse collection of concepts and real-world entities.
    - Challenging real-world test sets.

  17. ImageNet-32 Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jul 24, 2018
    Cite
    Patryk Chrabaszcz; Ilya Loshchilov; Frank Hutter (2018). ImageNet-32 Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet-32
    Explore at:
    Dataset updated
    Jul 24, 2018
    Authors
    Patryk Chrabaszcz; Ilya Loshchilov; Frank Hutter
    Description

    ImageNet-32 is a large dataset of small images, the down-sampled version of ImageNet. It is composed of 1,281,167 training images and 50,000 test images with 1,000 labels.

  18. CA-SUM pretrained models

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 20, 2022
    Cite
    Balaouras, Georgios (2022). CA-SUM pretrained models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6562991
    Explore at:
    Dataset updated
    May 20, 2022
    Dataset provided by
    Balaouras, Georgios
    Apostolidis, Evlampios
    Patras, Ioannis
    Mezaris, Vasileios
    Description

    This dataset contains pretrained models of the CA-SUM network architecture for video summarization, that is presented in our work titled “Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames”, in Proc. ACM ICMR 2022.

    Method overview:

    In our ICMR 2022 paper we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches that relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frame dependencies, and the limited ability to parallelize the training process of RNN-based network architectures, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling frame dependencies based on global attention, our method integrates a concentrated attention mechanism that is able to focus on non-overlapping blocks in the main diagonal of the attention matrix, and to enrich the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated frames of the video. In this way, our method makes better estimates about the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations using two benchmarking datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to human preferences. An ablation study that focuses on the introduced components, namely the use of concentrated attention in combination with attention-based estimates of the frames' uniqueness and diversity, shows their relative contributions to the overall summarization performance.

    File format:

    The "pretrained_models.zip" file provided on the present Zenodo page contains a set of pretrained models of the CA-SUM network architecture. After downloading and unpacking this file, in the created "pretrained_models" folder you will find two sub-directories, one for each of the benchmarking datasets used in our experimental evaluations (SumMe and TVSum). Within each of these sub-directories we provide the pretrained model (.pt file) for each data split (split0-split4), where the naming of the provided .pt file indicates the training epoch and the value of the length regularization factor of the selected pretrained model.

    The models have been trained in a full-batch mode (i.e., batch size is equal to the number of training samples) and were automatically selected after the end of the training process, based on a methodology that relies on transductive inference (described in Section 4.2 of [1]). Finally, the data-splits we used for performing inference on the provided pretrained models, and the source code that can be used for training your own models of the proposed CA-SUM network architecture, can be found at: https://github.com/e-apostolidis/CA-SUM.
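
    A minimal sketch of loading one of the provided .pt files with PyTorch; the file name below is hypothetical, following the epoch/regularization naming scheme described above, and the checkpoint should be paired with the model definition from the CA-SUM repository.

    import torch

    # Hypothetical checkpoint path; substitute a real file from the unpacked
    # "pretrained_models" folder.
    state = torch.load(
        "pretrained_models/SumMe/split0/epoch-390_reg-0.5.pt",
        map_location="cpu",
    )

    # Inspect what the checkpoint holds before wiring it into the CA-SUM
    # model from https://github.com/e-apostolidis/CA-SUM.
    print(type(state))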

    License and Citation:

    These resources are provided for academic, non-commercial use only. If you find these resources useful in your work, please cite the following publication where they are introduced:

    E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras. 2022, “Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames”, Proc. of the 2022 Int. Conf. on Multimedia Retrieval (ICMR ’22), June 2022, Newark, NJ, USA. https://doi.org/10.1145/3512527.3531404 Software available at: https://github.com/e-apostolidis/CA-SUM

  19. xlm-roberta-large-squad2-pretrained

    • kaggle.com
    zip
    Updated Sep 30, 2021
    Cite
    Splend1dChan(燦爛) (2021). xlm-roberta-large-squad2-pretrained [Dataset]. https://www.kaggle.com/datasets/a24998667/xlm-roberta-large-squad2-pretrained/suggestions
    Explore at:
    Available download formats: zip (10,337,941,335 bytes)
    Dataset updated
    Sep 30, 2021
    Authors
    Splend1dChan(燦爛)
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Splend1dChan(燦爛)

    Released under CC0: Public Domain


  20. Pretrained models + simulations for our HESSD submission "Towards learning...

    • beta.hydroshare.org
    • hydroshare.org
    • + 1 more
    zip
    Updated Nov 12, 2021
    Cite
    Frederik Kratzert (2021). Pretrained models + simulations for our HESSD submission "Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets" [Dataset]. http://doi.org/10.4211/hs.83ea5312635e44dc824eeb99eda12f06
    Explore at:
    Available download formats: zip (895.7 MB)
    Dataset updated
    Nov 12, 2021
    Dataset provided by
    HydroShare
    Authors
    Frederik Kratzert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contains all models trained for our publication "Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets", as well as the evaluated model simulations. The set contains 48 runs in total, stemming from 3 different models (trained with 8 repetitions) and two different loss functions.
