16 datasets found

Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL
zenodo.org
bin, json, txt
Updated Aug 16, 2021
Cite
Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
Available download formats: txt, json, bin
Unique identifier
https://doi.org/10.5281/zenodo.5205322
Dataset updated
Aug 16, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

It contains the following files:

- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license

The Spider-Realistic dataset is based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
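
As a rough illustration of how the counts listed above could be checked, the sketch below loads a Spider-style JSON file and counts examples and distinct databases. It assumes each file is a JSON list of objects carrying a `db_id` field, as in the Spider format referenced above; the function names and the sample records are made up for illustration and are not part of the release.

```python
import json


def load_examples(path):
    """Read a Spider-style JSON file (assumed to be a list of example objects)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_examples(examples):
    """Return the number of examples and of distinct databases they refer to."""
    # "db_id" is the database key assumed from the Spider JSON format.
    databases = {example["db_id"] for example in examples}
    return {"examples": len(examples), "databases": len(databases)}


if __name__ == "__main__":
    # Invented in-memory stand-in; with the real files one would call
    # summarize_examples(load_examples("spider-realistic.json")) instead.
    sample = [
        {"db_id": "example_db_1", "question": "How many rows are there?",
         "query": "SELECT count(*) FROM t"},
        {"db_id": "example_db_1", "question": "List every name.",
         "query": "SELECT name FROM t"},
        {"db_id": "example_db_2", "question": "What is the largest value?",
         "query": "SELECT max(v) FROM u"},
    ]
    print(summarize_examples(sample))
```

If the format assumption holds, running this against the released files should reproduce the example and database counts given in the file list.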

This dataset is distributed under the CC BY-SA 4.0 license.

If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}

@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}

@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}

@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}

@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}

@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}

@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}

@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}

@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}

@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
spider-syn
huggingface.co
Updated Feb 24, 2024
Cite
AhernTech s.r.o. (2024). spider-syn [Dataset]. https://huggingface.co/datasets/aherntech/spider-syn
Dataset updated
Feb 24, 2024
Dataset authored and provided by
AhernTech s.r.o.
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for Spider-Syn

Spider-Syn is a human-curated variant of the Spider Text-to-SQL database. It was created to test the robustness of text-to-SQL models to synonym substitution. The source Git repo for Spider-Syn is located here: https://github.com/ygan/Spider-Syn. The data perturbation methods and objectives are described in ACL 2021: arXiv

Paper Abstract

Recently, there has been significant progress in… See the full description on the dataset page: https://huggingface.co/datasets/aherntech/spider-syn.
TURSpider
huggingface.co
Updated Sep 1, 2025
Cite
Ali Buğra Kanburoğlu (2025). TURSpider [Dataset]. https://huggingface.co/datasets/AliBugra/TURSpider
Dataset updated
Sep 1, 2025
Authors
Ali Buğra Kanburoğlu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for TURSpider

TURSpider is a human-curated variant of the Spider Text-to-SQL database. The source Git repo for TURSpider is located here: https://github.com/alibugra/TURSpider/

Paper Abstract

This paper introduces TURSpider, a novel Turkish Text-to-SQL dataset developed through human translation of the widely used Spider dataset, aimed at addressing the current lack of complex, cross-domain SQL datasets for the Turkish language. TURSpider incorporates a wide… See the full description on the dataset page: https://huggingface.co/datasets/AliBugra/TURSpider.
A global database of long-term changes in insect assemblages
knb.ecoinformatics.org
dataone.org
Updated Oct 1, 2020
Cite
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann (2020). A global database of long-term changes in insect assemblages [Dataset]. http://doi.org/10.5063/F11V5C9V
Unique identifier
https://doi.org/10.5063/F11V5C9V
Dataset updated
Oct 1, 2020
Dataset provided by
Knowledge Network for Biocomplexity
Authors
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann
Time period covered
Jan 1, 1925 - Jan 1, 2018
Variables measured
End, Link, Year, Realm, Start, CRUmnC, CRUmnK, Metric, Number, Period, and 62 more
Description
This data set, released under a CC-BY license, contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 166 data sources, representing a total of 1676 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. The data set consists of four linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses. This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available. The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research. The table 'DataSources' provides descriptive data at the dataset level: links to online repositories where the original data can be found, whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state).
This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'. In the table 'PlotData' more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and if there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status. The table 'SampleData' describes the exact source of the data (table X, figure X, etc), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here, but more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details. The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching to the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add. 
Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID' will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper. WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
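
The linkage and the slope-only warning above can be sketched in a few lines. This is a minimal illustration, not the analysis used in the paper: it joins rows on 'Plot_ID' and 'DataSource_ID', skips 'NA' values in 'Number', and fits an ordinary least-squares slope per plot on the raw numbers. The column names come from the description; the function names are hypothetical, and rows are assumed to be dictionaries such as those produced by `csv.DictReader`.

```python
from collections import defaultdict


def join_tables(abundance_rows, plot_rows, source_rows):
    """Attach plot-level and source-level columns to each abundance row."""
    plots = {row["Plot_ID"]: row for row in plot_rows}
    sources = {row["DataSource_ID"]: row for row in source_rows}
    joined = []
    for row in abundance_rows:
        merged = dict(sources.get(row["DataSource_ID"], {}))
        merged.update(plots.get(row["Plot_ID"], {}))
        merged.update(row)  # abundance columns take precedence on name clashes
        joined.append(merged)
    return joined


def slope_per_plot(rows):
    """Least-squares slope of Number against Year within each plot.

    Only slopes are returned, because intercepts are not comparable
    among plots according to the dataset's own warning.
    """
    series = defaultdict(list)
    for row in rows:
        if row["Number"] in ("NA", "", None):  # years padded with missing data
            continue
        series[row["Plot_ID"]].append((float(row["Year"]), float(row["Number"])))
    slopes = {}
    for plot_id, points in series.items():
        n = len(points)
        if n < 2:
            continue
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in points)
        if sxx == 0:
            continue  # all observations from one year: no trend to estimate
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
        slopes[plot_id] = sxy / sxx
    return slopes
```

In practice the rows would be read from 'InsectAbundanceBiomassData.csv', 'PlotData.csv' and 'DataSources.csv'; the data paper should be consulted for the transformations and models actually used.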
Data from: Climatic conditions and functional traits affect spider diets in...
data.niaid.nih.gov
datadryad.org
zip
Updated Feb 16, 2022
Cite
Klaus Birkhofer; El Aziz Djoudi; Benjamin Schnerch; Radek Michalko (2022). Climatic conditions and functional traits affect spider diets in agricultural and non-agricultural habitats worldwide [Dataset]. http://doi.org/10.5061/dryad.2bvq83brs
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2bvq83brs
Dataset updated
Feb 16, 2022
Dataset provided by
Mendel University in Brno
Brandenburg University of Technology Cottbus-Senftenberg
Authors
Klaus Birkhofer; El Aziz Djoudi; Benjamin Schnerch; Radek Michalko
License
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Description
Spiders are dominant predators in terrestrial ecosystems and feed on prey from the herbivore and detritivore subsystem (dual subsystem omnivory) as well as on other predators (intraguild predation). Little is known about how global change potentially affects the importance of different prey groups in predator diets. In this meta-analysis we identify the impact of climatic conditions, land-use types and functional traits of spider species on the relative importance of Hemiptera, Araneae and Collembola prey in spider diets. We use a dataset including 78 publications with 149 observational records of the diet composition of 96 spider species in agricultural and non-agricultural habitats in 24 countries worldwide. The importance of Hemiptera prey was not affected by climatic conditions and was particularly high in smaller spider species in agricultural habitats. Araneae prey was most important for actively hunting, larger spider species in non-agricultural habitats. Collembola prey was most important for small, actively hunting spider species in regions with higher temperature seasonality. Spider species with a higher importance of Araneae prey for their diet also had higher importances of Collembola and lower importances of Hemiptera prey. Future increases of temperature seasonality predicted for several regions worldwide may go along with an increasing importance of Collembola prey, which was also related to a higher importance of intraguild prey here. Two global change drivers predicted for many regions of the world (increasing climatic seasonality and ongoing conversion of non-agricultural to agricultural land) both hold the potential to increase the importance of Collembola prey in spider diets. The importance of Hemiptera and Araneae prey may however show contrasting responses to these two drivers. These complex potential effects of global change components and their impact on functional traits in spider communities highlight the importance of simultaneously considering multiple drivers of global change to better understand future predator-prey interactions.

Methods

This study is based on a global database about the diet composition of hunting and web-building spider species in natural ecosystems used in Birkhofer and Wolters (2012) with the addition of data from agricultural ecosystems and updates from Diehl et al. (2013b), Michalko & Pekár (2015a), Arvidsson et al. (2020) and Mezőfi et al. (2020). All data in the original publications are derived by direct visual records of prey or prey remains in spider species in field studies, not including data from molecular or experimental studies. Note that only subsets of data from the original database from non-agricultural (82 cursorial and web-building spider species in natural habitats: Birkhofer & Wolters 2012) or only for web-building spiders (63 spider species in agricultural, natural and forest habitats: Birkhofer et al. 2018) were previously published. The database includes 118 unique publications that reported 310 datasets about diet compositions in individual spider species worldwide. All datasets that were based on fewer than 20 recorded prey items per spider species in individual studies or that addressed spider species in forest habitats were excluded for this study. The selection of a minimum of 20 records was based on the fact that spiders in each study could theoretically reach the maximum diet breadth, including prey from all 20 prey orders that were originally recorded across datasets (see Birkhofer & Wolters 2012). The selection of non-forest habitats was based on the aim to compare habitats that share a major structural characteristic by not being dominated by dense, natural tree cover. Forests further have a very different invertebrate community compared to grasslands and arable fields (e.g.
Birkhofer et al. 2015), which would limit a comparison of diets between major habitat types. The remaining 78 publications provided 149 datasets on the diet composition of 96 spider species worldwide (Figure 1). This database was used to calculate the relative contribution of each prey order to the overall diet in each dataset as percentage value. The percentages of Hemiptera, Collembola and Araneae prey were then extracted to reflect the relative contribution of these prey orders to the diet of individual spider species.
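
The percentage calculation and the 20-item threshold described above could look roughly like this. It is a minimal sketch under stated assumptions: each dataset is represented as a mapping from prey order to the number of recorded prey items, and the function names are hypothetical rather than taken from the authors' code.

```python
FOCAL_ORDERS = ("Hemiptera", "Collembola", "Araneae")


def prey_percentages(counts, min_items=20):
    """Percentage contribution of each prey order to one dataset's diet.

    Returns None for datasets with fewer than `min_items` recorded prey
    items, mirroring the exclusion rule in the description.
    """
    total = sum(counts.values())
    if total < min_items:
        return None
    return {order: 100.0 * n / total for order, n in counts.items()}


def focal_shares(counts, min_items=20):
    """Extract the shares of the three focal prey orders (0.0 if absent)."""
    percentages = prey_percentages(counts, min_items)
    if percentages is None:
        return None
    return {order: percentages.get(order, 0.0) for order in FOCAL_ORDERS}
```

For a dataset with 10 Hemiptera, 5 Collembola and 25 Diptera items, `focal_shares` gives 25.0, 12.5 and 0.0 percent for Hemiptera, Collembola and Araneae respectively.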
InsectChange: A global database of long-term changes in insect, arachnid and...
knb.ecoinformatics.org
search.dataone.org
Updated Oct 2, 2023
Cite
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann; Minghua Shen (2023). InsectChange: A global database of long-term changes in insect, arachnid and Entognatha assemblages [Dataset]. https://knb.ecoinformatics.org/view/urn%3Auuid%3A9c946111-05e2-48c9-afb1-2783ee43d0ed
Dataset updated
Oct 2, 2023
Dataset provided by
Knowledge Network for Biocomplexity
Authors
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann; Minghua Shen
Time period covered
Jan 1, 1925 - Jan 1, 2020
Variables measured
End, Date, Link, Rank, Year, Class, Error, Genus, Level, Order, and 79 more
Description
UPDATE 2023: New tables have been added and old tables have been updated with new datasets. Table 'Rawdata 2023.csv' contains the data as extracted from the publications, and as provided by the data owners. This table is linked to the table 'Taxondata 2023' via the column 'Taxon', to the table 'PlotData 2023' via the column 'Plot_ID' and to the table 'SampleData 2023' via the column 'Sample_ID'. This is the data from which the table 'InsectAbundanceBiomassData' was produced, but it will not reproduce the exact same table as used in 2020 for the following reasons: 1) some mistakes have been corrected, 2) new datasets were added, 3) it contains other taxa than insects, arachnids and Entognatha: mollusks, worms and some vertebrates are still retained if they were present in the raw data and should be filtered out as needed, 4) other biodiversity metrics than abundance or biomass are included: density (abundance per fixed surface area), richness (number of species), Shannon (Shannon-Wiener index), Pielou (Pielou's evenness index), rarefiedRichness (expected number of species for a fixed number of individuals), ENSPIE (inverted Simpson diversity index = Effective number of species of the Probability of interspecific encounter), 5) some raw data are still missing from this table, because for our newer work the raw data needed rarefaction for equalizing sampling effort. One study that was obviously incorrect has been removed (Datasource_ID 70). Table 'Taxondata 2023.csv' contains a taxonomic backbone to resolve all taxa in the table 'rawData 2023' to higher taxonomy. Note that this taxonomy is not corrected for synonyms and taxonomic changes. Nevertheless, the higher taxa are correctly assigned.

This data set, released under a CC-BY license, contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data set consists of five linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses. This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available. The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research. The table 'DataSources' provides descriptive data at the dataset level: links to online repositories where the original data can be found, whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'. In the table 'PlotData' more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and if there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g.
climate change variables, land-use variables, protection status. The table 'SampleData' describes the exact source of the data (table X, figure X, etc), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here, but more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details. The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in t... Visit https://dataone.org/datasets/urn%3Auuid%3A9c946111-05e2-48c9-afb1-2783ee43d0ed for complete metadata about this dataset.
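
The 2023 update notes that non-target taxa (mollusks, worms, some vertebrates) remain in the raw data and should be filtered out as needed. One way to do that is to resolve each row's 'Taxon' through the taxonomic backbone and keep only the classes of interest. The sketch below is an assumption-laden illustration: the 'Taxon' link column is taken from the description, but the presence of a 'Class' column in the taxon table, the class names used, and the function name are assumptions to verify against the ReadMe.

```python
# Assumed class names for insects, arachnids and Entognatha; check the
# taxon table for the spellings actually used.
TARGET_CLASSES = frozenset({"Insecta", "Arachnida", "Entognatha"})


def filter_target_taxa(raw_rows, taxon_rows, target_classes=TARGET_CLASSES):
    """Keep raw-data rows whose taxon resolves to one of the target classes."""
    # Lookup from taxon name to its class, built from the taxonomic backbone.
    taxon_class = {row["Taxon"]: row.get("Class") for row in taxon_rows}
    # Rows whose taxon is missing from the backbone are dropped as well.
    return [row for row in raw_rows
            if taxon_class.get(row["Taxon"]) in target_classes]
```

Rows here are dictionaries, for example as read with `csv.DictReader` from 'Rawdata 2023.csv' and 'Taxondata 2023.csv'.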
Data from: Spider venom potency exhibits phylogenetic prey-specificity but...
datadryad.org
zip
Updated May 7, 2025
Cite
Keith Lyons; Michel Dugon; Kevin Healy (2025). Spider venom potency exhibits phylogenetic prey-specificity but does not trade-off with body size or silk use in prey capture [Dataset]. http://doi.org/10.5061/dryad.76hdr7t4j
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.76hdr7t4j
Dataset updated
May 7, 2025
Dataset provided by
Dryad
Authors
Keith Lyons; Michel Dugon; Kevin Healy
Time period covered
Jun 17, 2024
Description
To test our hypotheses, we collated data on venom potency, body size, silk use in prey capture (yes/no), venom yield, LD50 model species and natural diet from literature sources, searching Web of Science with the keywords “LD50”, “Venom”, “Spider”, “Arachnid” and “Yield”, and following key references and databases such as the World Spider Trait database. All data were stored and organised in Microsoft Excel (S1-S2). The data have been processed prior to analysis. See Supplementary document S6 for a detailed description of both the methodology used and data descriptions, in the form of column heading descriptions for both dataset files, S1 (Main dataset) and S2 (length to mass conversion data). To reproduce the results and phylogeny, see S3 and S4 (knowledge of the R coding language required). To see full model outputs (results) in the form of tables and figures, see S5. To recap: S1 and S2 are the datasets, S1 being the main dataset and S2 being separate data used to convert spider bo...
Data from: Pattern of seasonal variation in rates of predation between...
data.niaid.nih.gov
search.dataone.org
zip
Updated Oct 19, 2023
Cite
David Wise; Robin Mores; Jennifer Pajda-De La O; Matthew McCary (2023). Pattern of seasonal variation in rates of predation between spider families is temporally stable in a food web with widespread intraguild predation [Dataset]. http://doi.org/10.5061/dryad.dz08kps43
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.dz08kps43
Dataset updated
Oct 19, 2023
Dataset provided by
University of Illinois Chicago
Rice University
Authors
David Wise; Robin Mores; Jennifer Pajda-De La O; Matthew McCary
License
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Description
Intraguild predation (IGP) – predation between generalist predators (IGPredator and IGPrey) that potentially compete for a shared prey resource – is a common interaction module in terrestrial food webs. Understanding temporal variation in webs with widespread IGP is relevant to testing food web theory. We investigated temporal constancy in the structure of such a system: the spider-focused food web of the forest floor. Multiplex PCR was used to detect prey DNA in 3,300 adult spiders collected from the floor of a deciduous forest during spring, summer, and fall over four years. Because only spiders were defined as consumers, the web was tripartite, with 11 consumer nodes (spider families) and 22 resource nodes: 11 non-spider arthropod taxa (order- or family-level) and the 11 spider families. Most (99%) spider-spider predation was on spider IGPrey, and ~90% of these interactions were restricted to spider families within the same broadly defined foraging mode (cursorial or web-spinning spiders). Bootstrapped-derived confidence intervals (BCI’s) for two indices of web structure, restricted connectance and interaction evenness, overlapped broadly across years and seasons. A third index, % IGPrey (% IGPrey among all prey of spiders), was similar across years (~50%) but varied seasonally, with a summer rate (65%) ~1.8x higher than spring and fall. This seasonal pattern was consistent across years. Our results suggest that extensive spider predation on spider IGPrey that exhibits consistent seasonal variation in frequency, and that occurs primarily within two broadly defined spider-spider interaction pathways, must be incorporated into models of the dynamics of forest-floor food webs. Methods Study system We collected spiders and potential non-spider prey from the oak-dominated (Quercus alba and Q. rubra) Swallow Cliff Woods (41° 40.519’ N, 87° 51.437’ W) within the 320-ha Swallow Cliff nature preserve in Palos Township, Illinois (USA). 
The preserve, which is within the Chicago metropolitan region, is managed by the Cook County Forest Preserve District. Forests in this region are actively managed for several invasive plants (23), and the forest floor at Swallow Cliffs contains a thick leaf-litter layer with an abundant and diverse arthropod community. Collecting spiders and non-spider prey Our goal was to search the ground layer and low understory as thoroughly as possible, so that we would collect enough spiders from less-abundant families to yield the same number of spiders per family analyzed for prey DNA. We did not estimate spider densities. All collections were made between 1000 and 1600 hours. We collected from a different location each day. The size of the area searched each day was not measured and varied with the number of searchers. Collecting areas were widely distributed throughout Swallow Cliff Woods, but we did not subdivide the Woods into sampling regions. Most terrain was upland forest, but some collections were taken from a few scattered wet/marshy areas. The number of collecting days in each season was spring (31), summer (33), and fall (29) over the years 2009, 2010, 2011 and 2012; the number of days per year was 33, 12, 34 and 14, respectively. On each collecting day, we used both litter sifting and simple searching to capture spiders from several microhabitats. For litter sifting, we placed litter collected by hand into a flat tray (58 cm x 17 cm x 15 cm) with a screen bottom. This tray was shaken over a second tray of the same size with a solid bottom, allowing arthropods to fall through the screen to be collected by hand or aspirator. Sifted litter was returned to its original location. Spiders were also collected by hand from the litter surface, open areas in the litter, logs, low vegetation up to ~1m, and tree trunks up to ~2m. Individual spiders were placed in separate labelled vials. 
Of the spiders that were eventually analyzed for prey DNA (see below), 81% were captured from either leaf litter (70%) or adjacent bare ground/logs (11%). Thus, most spiders were collected from the litter layer broadly defined. The litter layer is a fairly distinct subsystem with respect to rates of migration of arthropod predators and prey (24). Nevertheless, we did not limit our definition of the “forest floor” to the litter layer because many spiders spin webs in vegetation close to the ground. Also, some cursorial species move back and forth between the ground and lower understory vegetation and tree trunks (for example, 84% of the Corinnidae, a guild of “foliage runners” (25), were collected from leaf litter). Therefore, we also analyzed spiders that had been collected from low vegetation (10%) and tree trunks (9%). All specimens were placed on ice within one hour of capture. On the same day, spiders collected for detection of consumed prey using PCR were taken to the laboratory where they were weighed and stored at -20°C in a 1.5-mL microcentrifuge tube containing 95% ethanol (EtOH). Spiders and non-spider prey (see below) intended for primer development or assay optimization (see below for details) were kept alive, weighed, placed individually into 60-mL glass vials, and provided with water ad libitum at room temperature. Spiders were identified to family and genus using identification guides (26-29). Voucher specimens (one adult male and female) for each genus (when available) were archived at The Field Museum (Chicago, Illinois). Over the four years, ~14,000 spiders (juveniles and adults) from 20 families were collected. Presence of prey DNA was tested for adult spiders from 11 abundant families (those with at least 300 adults) that live primarily on the forest floor. Spiders from six of these families (Corinnidae, Gnaphosidae, Lycosidae, Pisauridae, Salticidae, and Thomisidae) do not spin webs to capture prey (“cursorial” spiders). 
The other five families (Agelenidae, Dictynidae, Hahniidae, Linyphiidae, and Theridiidae) are “web spinners.” This dichotomy reflects basic differences in foraging behavior (16, 17), but the distinction is not absolute. The web spinners in our food web include genera of spiders that also forage for prey off their web (18). Non-spider arthropod prey were also collected for primer development. They were not sampled quantitatively, but were simply selected due to their apparent abundance in leaf litter and/or activity just above the litter layer, and their likely occurrence in the diets of at least one spider family (15-17, 30). Non-spider nodes of the food web were broadly defined taxonomically (at the Order level except for Gryllidae): flies (Diptera), moths/butterflies (Lepidoptera), springtails (Collembola), ants/bees/wasps (Hymenoptera), jumping bristletails (Archaeognatha), crickets (Gryllidae), pseudoscorpions (Pseudoscorpiones), harvestmen (Opiliones), beetles (Coleoptera), earwigs (Dermaptera), and pillbugs (Isopoda). Molecular techniques Primer development and optimization We utilized multiplex PCR to sequence DNA from at least ten spiders from each family and at least ten specimens from each non-spider prey taxon. Each spider was first starved for at least ten days to eliminate any gut-content DNA that may have been present. Specimens were then homogenized in 180 μL of phosphate-buffered saline (PBS) (Hoefer, San Francisco, CA). DNA was then extracted with a Qiagen DNEasy Tissue Kit (Valencia, CA) using the manufacturer’s protocol. Upon completion of DNA extraction, the 200μL of eluate was well-mixed, separated into 20μL aliquots, and stored at -20°C until analysis. The general arthropod primers LCO-1490 and HCO-2198 (31) were used to amplify DNA from the mitochondrial genome’s cytochrome oxidase I (COI) region. 
Eluate from DNA extractions was amplified and sequenced by The Field Museum (Chicago, IL) or Research Resources Center (RRC) at the University of Illinois, Chicago. Sequences were used to conduct BLASTN searches following the protocol developed by (32) using the databases GenBank and BOLD (the Barcode of Life Database). Following (33), database sequences were used only if they showed ≥97% match to submitted sequences. Sequences were aligned using the CLUSTALW or AMPLICON programs. Primers were designed with the assistance of the IDT (Integrated DNA Technologies, Coralville, IA) program PrimerQuest and tested for melting temperature and CG content using Sci-Tools OligoAnalyzer (IDT).

Spider gut-content testing. After a PCR assay was developed and optimized for a particular prey taxon (spider family or non-spider arthropod), frozen field-caught adult spiders were tested for the presence of the target-prey DNA. Spiders were thawed to room temperature and underwent DNA extraction and PCR amplification as described above. The entire spider was homogenized, except for the largest individuals, for which legs were removed to increase the prey/predator DNA ratio; coxae were left attached to the body when possible because spider guts often extend into the coxae (17). The homogenate was then mixed and 4 μL were added to a well (on a 96-well plate) that contained 21 μL of Master Mix. Every run also included positive, negative, and blank controls to ensure that target DNA was amplified and that no contamination existed on the run. Positive controls consisted of DNA specific to the target taxon in question, negative controls contained the PCR Master Mix without DNA template, and blank controls were created from MBG water. A sample was considered positive for target-prey DNA within the spider’s gut if the Ct value of the amplification curve was above the background threshold, if the shape of the curve was sigmoidal, and if the positive and negative controls were acceptable. 
Samples that did not show amplification were re-analyzed using arthropod-general primers (31) before identifying them as negative results; questionable samples (low amplification or a non-sigmoidal shape) were re-tested. For constructing the food web, adult
The importance of the six bioclimatic predictors used in habitat range...
plos.figshare.com
xls
Updated Jun 4, 2023
Yifu Wang; Nicolas Casajus; Christopher Buddle; Dominique Berteaux; Maxim Larrivée (2023). The importance of the six bioclimatic predictors used in habitat range models of Latrodectus variolus and Sphodros niger. [Dataset]. http://doi.org/10.1371/journal.pone.0201094.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0201094.t003
Dataset updated
Jun 4, 2023
Dataset provided by
PLOS ONE
Authors
Yifu Wang; Nicolas Casajus; Christopher Buddle; Dominique Berteaux; Maxim Larrivée
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The importance of the six bioclimatic predictors used in habitat range models of Latrodectus variolus and Sphodros niger.
Building behaviour does not drive rates of phenotypic evolution in spiders
figshare.mq.edu.au
researchdata.edu.au
bin
Updated Jun 1, 2023
Jonas Wolff (2023). Building behaviour does not drive rates of phenotypic evolution in spiders [Dataset]. http://doi.org/10.5061/dryad.tb2rbp015
Available download formats: bin
Unique identifier
https://doi.org/10.5061/dryad.tb2rbp015
Dataset updated
Jun 1, 2023
Dataset provided by
Macquarie University
Authors
Jonas Wolff
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This data set contains raw data tables, scripts and supplemental figures supporting the article Wolff et al. (2021, PNAS 118: e2102693118). In our study we assembled morphometric and ecological trait data of spiders from literature and de novo measurements and observations and used this data to infer the rates of morphological change over deep time in relation to web building behaviour.

Methods We built a database of morphometric and ecological data on a representative taxon sample of the order Araneae. We followed the taxon sample of the Araneae Tree of Life project (AToL) [2], which contains 932 terminals from all families valid at that time except Synaphridae (corresponding to ~2% of described species). This sample is representative of the phylogenetic and morphological diversity of spiders. AToL terminals that were not identified to species level and for which no image material was available were replaced with described species with a type locality close to the collection site (26.3% of the used sample; details in main article [1]). 11% of the AToL terminals were omitted as there was not enough information to determine a suitable replacement, resulting in a total of 828 included species.

The morphological data were assembled by extracting data from taxonomic descriptions using the WSC database [3], and measurements on images published in articles or online repositories (including Morphbank :: Biological Imaging, http://www.morphbank.net, where images of AToL specimens were deposited), with one to seven sources combined per species (for statistics of used sources see main article [1], and for a list see file “Dataset_morphometric-data-raw.xlsx”). As many spiders exhibit a significant sexual dimorphism, only data of adult females were used. We included only general traits, i.e., ones that were assumed to be affected by more than one niche property. For instance, body shape may be under selection from a mix of abiotic (e.g., temperature and microhabitat structure) and biotic (e.g., prey spectrum and predation) factors. The following measurements were recorded: body length; cephalothorax (prosoma) length; cephalothorax width; height of cephalothorax (carapace); length of mouth parts (i.e. cheliceral base segment); diameter of each eye type; length of front leg (excl. coxa, trochanter and pretarsus). From these the following six traits were calculated: (1) body size (=body length); (2) body shape (cephalothorax width / cephalothorax length); (3) relative cephalothorax height (cephalothorax height / (cephalothorax length + width)); (4) size of mouth parts (paturon length / cephalothorax height); (5) eye size (sum of diameters of all eye types / cephalothorax width); (6) relative leg length (length of front leg / cephalothorax width). From each trait the species mean was calculated (i.e., from the 1-7 data sources, for details see main article [1] and file “Dataset_morphometric-data-raw.xlsx”) and log-transformed, to build the species matrix for further analysis (file “Dataset_combined-trait-matrix.csv”).
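The six trait definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys, the example measurements, and the use of the natural logarithm are hypothetical choices, since the description does not name the columns of the raw file or the log base used by the authors.

```python
import math
from statistics import mean

def ratio_traits(m):
    """Return the six untransformed traits for one data source.

    `m` holds raw measurements in any consistent length unit; the key
    names are hypothetical, not the column names of the published file.
    """
    return {
        "body_size": m["body_length"],
        "body_shape": m["ceph_width"] / m["ceph_length"],
        "rel_ceph_height": m["ceph_height"] / (m["ceph_length"] + m["ceph_width"]),
        "mouthpart_size": m["paturon_length"] / m["ceph_height"],
        "eye_size": sum(m["eye_diameters"]) / m["ceph_width"],
        "rel_leg_length": m["front_leg_length"] / m["ceph_width"],
    }

def species_row(sources):
    """Average each trait over a species' data sources, then log-transform.

    Natural log is an assumption; the description only says "log-transformed".
    """
    per_source = [ratio_traits(m) for m in sources]
    return {k: math.log(mean([t[k] for t in per_source])) for k in per_source[0]}

# Two made-up sources for one species (values are illustrative only).
example_sources = [
    {"body_length": 10.0, "ceph_length": 4.0, "ceph_width": 2.0, "ceph_height": 1.5,
     "paturon_length": 0.75, "eye_diameters": [0.2, 0.2, 0.1, 0.1], "front_leg_length": 8.0},
    {"body_length": 12.0, "ceph_length": 4.0, "ceph_width": 2.0, "ceph_height": 1.5,
     "paturon_length": 0.75, "eye_diameters": [0.2, 0.2, 0.1, 0.1], "front_leg_length": 8.0},
]
row = species_row(example_sources)
```

Averaging before taking the logarithm follows the order given in the description (species mean first, then log-transformation); the published scripts remain the reference for the exact procedure.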

The ecological data matrix was built by assessing the literature on the same or closely related species, and in a few cases complemented by personal observations (for details, see notes in file “Dataset_ecological-data-raw.xlsx”). We used a binary coded category: state 0, non-builder; state 1, builder. We defined a species as a ‘builder’ (1) if individuals spend most of their life in a self-constructed web or burrow, i.e. foraging and reproduction take place on, in or from the artefact, and the artefact aids in prey capture, signalling and/or defence. In contrast, a ‘non-builder’ (0) does not build a capture web or a burrow; it may build a retreat, which, however, is only used in periods of inactivity and does not aid in prey capture.

Method References
[1] Wolff, J. O., Wierucka, K., Uhl, G., & Herberstein, M. E. Building behavior does not drive rates of phenotypic evolution in spiders. Proc Natl Acad Sci 118, e2102693118 (2021).
[2] Wheeler, W. C. et al. The spider tree of life: phylogeny of Araneae based on target‐gene analyses from an extensive taxon sampling. Cladistics 33, 574-616 (2017).
[3] Nentwig, W., Gloor, D. & Kropf, C. Taxonomic database: Spider taxonomists catch data on web. Nature 528, 479 (2015).

Usage Notes For a description of files and data please refer to the README file.
Data for "A trade-off between latitude and elevation is a possible driver...
figshare.com
txt
Updated Sep 11, 2020
Stefano Mammola; Thomas Hesselberg; Enrico Lunghi (2020). Data for "A trade-off between latitude and elevation is a possible driver for range segregation of broadly distributed cave-dwelling spiders" [Dataset]. http://doi.org/10.6084/m9.figshare.12687692.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.12687692.v1
Dataset updated
Sep 11, 2020
Dataset provided by
Figshare (http://figshare.com/)
Authors
Stefano Mammola; Thomas Hesselberg; Enrico Lunghi
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Tab-delimited (.txt) occurrence database used to generate the analysis in the study: Mammola S, Hesselberg T, Lunghi E (2020) A trade-off between latitude and elevation is a possible driver for range segregation of broadly distributed cave-dwelling spiders. Journal of Zoological Systematics and Evolutionary Research, accepted. Occurrences were thinned prior to analysis to control for spatial autocorrelation. The complete database of occurrence localities of Meta spiders in Europe is being published and made available in an associated data paper (Hesselberg et al., in preparation).
Text-to-SQL Verification Methods and Benchmark
figshare.com
zip
Updated Aug 31, 2025
Tarfah Alrashed; David R. Karger; Madhup Sukoon; Natasha Noy (2025). Text-to-SQL Verification Methods and Benchmark [Dataset]. http://doi.org/10.6084/m9.figshare.29896328.v5
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.29896328.v5
Dataset updated
Aug 31, 2025
Dataset provided by
figshare
Authors
Tarfah Alrashed; David R. Karger; Madhup Sukoon; Natasha Noy
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the source code, evaluation results, and a new verification benchmark developed for our research on LLM-generated SQL verification. We propose and evaluate two novel LLM-based SQL verification methods. To advance research in this area, we have constructed a dedicated LLM-generated text-to-SQL verification benchmark. Unlike traditional generation-focused benchmarks, which only contain "gold" SQL queries, our benchmark provides a mix of labeled correct and incorrect SQL queries for a given natural language question. This allows for a more comprehensive evaluation of verification algorithms, enabling the measurement of both False Acceptance Rate (FAR) and False Rejection Rate (FRR). This benchmark is derived from the development sets of three popular Text-to-SQL generation benchmarks: BIRD, Spider, and KaggleDBQA. We hope that by providing this dataset, we can contribute to the ongoing improvement of Text-to-SQL verification systems. The benchmark data includes the original natural language questions, database schemas, and our generated candidate SQL queries, each labeled as either correct or incorrect based on execution-based ground truth. We are also sharing the complete source code for our proposed SQL verification methods, as well as the implementation for the SQL Critique baseline method, which is commonly used in recent Text-to-SQL generation pipelines. The code allows for the reproduction of our verification experiments and the evaluation of other verification algorithms on our new benchmark.
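As a rough illustration of how the two error rates could be computed from labeled candidates, the sketch below assumes the usual definitions: FAR is the share of incorrect queries that a verifier accepts, and FRR is the share of correct queries that it rejects. The description does not spell out these definitions, and the function name and the example lists are hypothetical, not part of the published code.

```python
def far_frr(is_correct, accepted):
    """Return (FAR, FRR) for one verifier run.

    is_correct: ground-truth labels, True if the candidate SQL is correct.
    accepted:   verifier decisions, True if the verifier accepted the SQL.
    A rate is NaN when its denominator would be zero.
    """
    pairs = list(zip(is_correct, accepted))
    on_incorrect = [acc for ok, acc in pairs if not ok]
    on_correct = [acc for ok, acc in pairs if ok]
    far = (sum(on_incorrect) / len(on_incorrect)) if on_incorrect else float("nan")
    frr = (sum(not acc for acc in on_correct) / len(on_correct)) if on_correct else float("nan")
    return far, frr

# Made-up example: 2 correct and 3 incorrect candidate queries.
labels = [True, True, False, False, False]
decisions = [True, False, True, False, False]
far, frr = far_frr(labels, decisions)  # 1 of 3 incorrect accepted, 1 of 2 correct rejected
```

Both rates need both kinds of labeled candidates, which is why a benchmark containing only gold queries cannot measure FAR.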
Seed traits of seed within spider monkey, howler monkey feces, and dung...
data.niaid.nih.gov
search.dataone.org
zip
Updated Oct 21, 2022
Karen Marie Pedersen; Nico Blüthgen (2022). Seed traits of seed within spider monkey, howler monkey feces, and dung beetles' dung balls [Dataset]. http://doi.org/10.5061/dryad.cjsxksn6p
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.cjsxksn6p
Dataset updated
Oct 21, 2022
Dataset provided by
Technical University of Darmstadt
Authors
Karen Marie Pedersen; Nico Blüthgen
License
https://spdx.org/licenses/CC0-1.0.html
Description
These data files contain seed traits from three sources, 1) seed traits from "Seeds of Amazonian Plants", 2) Royal Botanic Gardens Kew Seed Information Database, and 3) seeds dissected from field collections of primate feces and dung balls from dung beetles. The dataset was used in the article published in Biotropica entitled "Seed size and pubescence facilitate secondary dispersal by dung beetles". The data mostly describes seed traits of morphospecies within the feces of brown-headed spider monkeys (Ateles fusciceps) and mantled howler monkeys (Alouatta palliata). Traits included in the data set are size, surface, length, width, shape, color, and dispersal by mammals. Methods 1. Description of methods used for collection/generation of data:

The data file "Seed Traits.csv" was generated from the key in: Cornejo, F., & Janovec, J. (2010). Seeds of Amazonian plants. Princeton University Press. The data file "dispersal syndromes KEW.csv" was generated from Royal Botanic Gardens Kew (2020). Seed Information Database (SID). Version 7.1. https://data.kew.org/sid/. The dispersal data for each non-wind-dispersed genus included in "Seeds of Amazonian Plants" was matched against the SID to extract dispersal information, and a logical variable ("YES/NO") was then created for mammal dispersal. Data in the remaining data files were generated from monkey fecal samples and dung beetle dung balls collected in the field.

Methods for processing the data:

The seed traits in the "Seed Traits.csv" file are all taken from the genus identification key, using the characters defined by the book: size, shape, color, and surface. Some genera have more than one combination of characters. For the data file "dispersal syndromes KEW.csv", dispersal data for each non-wind-dispersed genus included in "Seeds of Amazonian Plants" was matched against the SID to extract dispersal information, and a logical variable ("YES/NO") was then created for mammal dispersal. The remaining data files are from field-collected data. In the field, the monkey species that produced the feces was identified, and if the sample was a dung ball, the beetle was collected with the ball for identification. Fecal samples and dung balls were dissected to remove the seeds. The seeds were then grouped by morphospecies and identified to genus as well as possible. Seed length and width were measured and the seed surface was characterized.
SVM-Based Prediction of Propeptide Cleavage Sites in Spider Toxins...
figshare.com
ai
Updated Jun 3, 2023
Emily S. W. Wong; Margaret C. Hardy; David Wood; Timothy Bailey; Glenn F. King (2023). SVM-Based Prediction of Propeptide Cleavage Sites in Spider Toxins Identifies Toxin Innovation in an Australian Tarantula [Dataset]. http://doi.org/10.1371/journal.pone.0066279
Available download formats: ai
Unique identifier
https://doi.org/10.1371/journal.pone.0066279
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Emily S. W. Wong; Margaret C. Hardy; David Wood; Timothy Bailey; Glenn F. King
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Spider neurotoxins are commonly used as pharmacological tools and are a popular source of novel compounds with therapeutic and agrochemical potential. Since venom peptides are inherently toxic, the host spider must employ strategies to avoid adverse effects prior to venom use. It is partly for this reason that most spider toxins encode a protective proregion that upon enzymatic cleavage is excised from the mature peptide. In order to identify the mature toxin sequence directly from toxin transcripts, without resorting to protein sequencing, the propeptide cleavage site in the toxin precursor must be predicted bioinformatically. We evaluated different machine learning strategies (support vector machines, hidden Markov model and decision tree) and developed an algorithm (SpiderP) for prediction of propeptide cleavage sites in spider toxins. Our strategy uses a support vector machine (SVM) framework that combines both local and global sequence information. Our method is superior or comparable to current tools for prediction of propeptide sequences in spider toxins. Evaluation of the SVM method on an independent test set of known toxin sequences yielded 96% sensitivity and 100% specificity. Furthermore, we sequenced five novel peptides (not used to train the final predictor) from the venom of the Australian tarantula Selenotypus plumipes to test the accuracy of the predictor and found 80% sensitivity and 99.6% 8-mer specificity. Finally, we used the predictor together with homology information to predict and characterize seven groups of novel toxins from the deeply sequenced venom gland transcriptome of S. plumipes, which revealed structural complexity and innovations in the evolution of the toxins. The precursor prediction tool (SpiderP) is freely available on ArachnoServer (http://www.arachnoserver.org/spiderP.html), a web portal to a comprehensive relational database of spider toxins. 
All training data, test data, and scripts used are available from the SpiderP website.
Code and data associated with: Searching the web builds fuller picture of...
data.niaid.nih.gov
Updated Jul 17, 2024
Strine, Colin T. (2024). Code and data associated with: Searching the web builds fuller picture of arachnid trade [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5758540
Dataset updated
Jul 17, 2024
Dataset provided by
Orr, Michael C.
Strine, Colin T.
Fukushima, Caroline S.
Hughes, Alice C.
Marshall, Benjamin Michael
Cardoso, Pedro
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and code used in the paper: Searching the web builds fuller picture of arachnid trade. Throughout the methods we have indicated the stage of analysis at which each data component was used and the code script connected to it. We have numbered the code and data supplements to reflect as closely as possible the order in which data generation and summary were undertaken. The following provide additional details linked to each of the data files.

Data S1 - Website data: lang = language of the search engine used (ad hoc websites had language described after discovery); engine = the search engine used; page = the page on which the website appeared from the search engine; searchdate = search date in YYYY-mm-dd HH:MM:SS; link = link to the webpage, redacted to protect website identity; reviewdate = date reviewed for arachnids being sold and search strategy; sells = whether the website sells arachnids (1 == sells); allow = whether the site explicitly forbids automated searching (1 == allows, NA when search method was not fully automated, e.g., single page); type = the type of the website (e.g., trade, classified ads); order = whether arachnids were organised in a particular way; target = a refined target URL to start search; method = the search method chosen, see methods for details; refine = any refinement or filter that could constrain the scope of the website to be searched; spages = the number of pages required to cycle through to cover the entire stock (also separated by ; if multiple cycles were needed or multiple single pages could be easily collected); prelimCheck = whether the website passed initial checks for arachnid selling; notes = any details that might need special attention during searches; webID = code used for subsequent data summary.
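A minimal sketch of reading a table laid out like Data S1 is given below. The rows, the comma delimiter, the engine names and the webID values are all invented for illustration (the real file's format and contents are not shown in this description); only the column meanings, the timestamp pattern, the 1/NA coding and the ";" separator in spages are taken from the field list above. The published scripts are in R; Python is used here purely as an example.

```python
import csv
import io
from datetime import datetime

# Hypothetical rows using a subset of the Data S1 columns.
SAMPLE = """lang,engine,page,searchdate,sells,allow,spages,webID
english,google,1,2021-05-03 14:22:10,1,1,12,W001
german,bing,2,2021-05-04 09:01:45,1,NA,3;5,W002
english,google,3,2021-05-04 09:30:00,0,NA,NA,W003
"""

def parse_row(row):
    """Convert the string fields of one row into typed Python values."""
    return {
        "webID": row["webID"],
        # searchdate is documented as YYYY-mm-dd HH:MM:SS
        "searchdate": datetime.strptime(row["searchdate"], "%Y-%m-%d %H:%M:%S"),
        # sells: 1 == the website sells arachnids
        "sells": row["sells"] == "1",
        # allow: 1 == automated searching allowed, NA == not applicable
        "allow": None if row["allow"] == "NA" else row["allow"] == "1",
        # spages: one page count per search cycle, separated by ';'
        "spages": [] if row["spages"] in ("NA", "") else [int(p) for p in row["spages"].split(";")],
    }

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
selling_ids = [r["webID"] for r in rows if r["sells"]]
```

Keeping NA as `None` rather than `False` preserves the distinction between a site that forbids automated searching and one where the question did not apply.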

Data S2 - Raw keyword searches outputs: species keywords. sp = the modern species or genus that a keyword is associated with; page = the number of the page the keyword was detected on; keyw = the exact keyword that was detected; spORgen = whether the keyword was a species binomial or just genus; termsSurrounding = the words surrounding a genus keyword detection (only applies to Data S3); webID = the website ID.

Data S3 – Raw keyword searches outputs: genus keywords. sp = the modern species or genus that a keyword is associated with; page = the number of the page the keyword was detected on; keyw = the exact keyword that was detected; spORgen = whether the keyword was a species binomial or just genus; termsSurrounding = the words surrounding a genus keyword detection (multiple detections separated by ;); webID = the website ID.

Data S4 - Raw keyword search outputs: temporal sample. sp = the modern species or genus that a keyword is associated with; page = the number of the page the keyword was detected on; keyw = the exact keyword that was detected; spORgen = whether the keyword was a species binomial or just genus; termsSurrounding = the words surrounding a genus keyword detection (multiple detections separated by ;); webID = the website ID; timestamp.parse = the timestamp extracted from the archived web page; year = a simplified timestamp including only the year.

Data S5 - LEMIS data used. An arachnid filtered version of 74,75.

Data S6 - CITES trade database data used 76.

Data S7 - CITES appendices data used 77.

Data S8 - IUCN Redlist data used 78.

Data S9 - Compiled final dataset, with data deriving from WSC, Scorpion files, ITIS, WAM and the data collection process. speciesId = a numeric code, one per species; clade = the clade the species belongs to; family = the family the species belongs to; genus = the genus of the species; species = the species epithet; author = the species authority name; year = the species authority year; parentheses = whether parentheses are needed with the authority; distribution = WSC original distribution descriptions; invalid = whether the species is considered valid; source = the species source, either World Spider Catalogue, Scorpion files, ITIS or WAM; accName = the species binomial being used as our accepted name; allNames = the accepted species binomial and all synonyms; allGenera = the accepted genus, and all other genera the species has belonged to at one point; onlineTradeSnap = whether the species was detected via a match to the accName in the snapshot data; onlineTradeSnap_Any = whether the species was detected via any synonym in the snapshot data; onlineTradeSnap_genus = whether the genus was detected via a match to the genus in the snapshot data; onlineTradeSnap_genusAny = whether the genus was detected via any synonym in the snapshot data; onlineTradeTemp = whether the species was detected via a match to the accName in the temporal data; onlineTradeTemp_Any = whether the species was detected via any synonym in the temporal data; onlineTradeTemp_genus = whether the genus was detected via a match to the genus in the temporal data; onlineTradeTemp_genusAny = whether the genus was detected via any synonym in the temporal data; onlineTradeEither = whether the species was detected via a match to the accName in the temporal data or snapshot data; onlineTradeEither_Any = whether the species was detected via any synonym in the temporal data or snapshot data; LEMIStrade = whether the species was detected via a match to the accName in the LEMIS data; LEMIStrade_Any = whether 
the species was detected via any synonym in the LEMIS data; LEMIStrade_genus = whether the genus was detected via any synonym in the LEMIS data; LEMIStrade_genusAny = whether the genus was detected via any synonym in the LEMIS data; CITEStrade = whether the species was detected via a match to the accName in the CITES trade database data; CITEStrade_Any = whether the species was detected via any synonym in the CITES trade database data; CITEStrade_genus = whether the genus was detected via any synonym in the CITES trade database data; CITEStrade_genusAny = whether the genus was detected via any synonym in the CITES trade database data; CITESapp = the CITES appendix the species is listed under using an exact match to the accName; CITESapp_Any = the CITES appendix the species is listed under using any match to any of the species’ synonyms; redlist = the IUCN Redlist category the species is listed under using an exact match to the accName; redlist_Any = the IUCN Redlist category the species is listed under using any match to any of the species’ synonyms; extactMatchTraded = the species is detected in any of the trade sources via a match to the accName; anyMatchTraded = the species is detected in any of the trade sources via a match to any species’ synonym.

Data S10 - Forum listings of “What species are you currently keeping” from an online forum, posted between 9th September 2021 and 9th October 2021, to provide an idea of online discussions. Each user with a separate list is provided in a separate tab. Morph_collector is the same as poster1, but the potential cryptic species or morphs are noted separately to make them clearer.

Data S11 – Distribution information for spiders. Only two columns used in summaries: accName = the accepted name used throughout summaries; NAME = the country name the spider occurs in.

Data S12 - Distribution information for scorpions. species = the accepted name used throughout summaries; NAME = the country name the scorpion occurs in.

Code S1 - Search URL Extract.R

Code S2 - Retrieve web data.R

Code S3 - Temporal Classified Ads.R

Code S4 - Keyword Generation.R

Code S5 - Keyword Search.R

Code S6 - LEMIS filter and summary.R

Code S7 - Compiling results.R

Code S8 - Summary Figures.R

Code S9 - Temporal Figures.R

Code S10 - New description figure.R

Code S11 - Term exploration.R

Code S12 - LEMIS summary and mapping.R
srli_global_araneae
demo.gbif.org
gbif.org
Updated Mar 24, 2025
Biodiversity Data Journal (2025). srli_global_araneae [Dataset]. http://doi.org/10.15468/kjmhog
Unique identifier
https://doi.org/10.15468/kjmhog
Dataset updated
Mar 24, 2025
Dataset provided by
Global Biodiversity Information Facility (https://www.gbif.org/)
Biodiversity Data Journal
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
Jan 10, 6 - Jun 10, 37
Area covered
Earth
Description
A sample of 200 species of spiders was randomly selected from the World Spider Catalog 2018, an updated global database containing all recognized species names for the group. Species data were collected from all taxonomic bibliography available at the World Spider Catalog 2018 and complemented by data in other publications found through Google Scholar or other sources (https://www.biodiversitylibrary.org; https://login.webofknowledge.com; http://srs.britishspiders.org.uk; http://symbiota4.acis.ufl.edu/scan/portal; https://lepus.unine.ch; http://www.tuite.nl/iwg/Araneae/SpiBenelux/?species; https://atlas.arages.de; https://arachnology.cz/rad/araneae-1.html; http://biodiversityresearch.org/research/biogeography/iberia/).
These data were used in assessing the global threat status of spider species worldwide. This will serve as the basis for a future Sampled Red List Index (SRLI) for spiders. SRLI are typically employed to assess the conservation priorities and trends of large organismal groups, and are thus suited for assessing the conservation trends of large taxa as a whole.
Cite

Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322

Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

Explore at:

Available download formats: txt, json, bin

Unique identifier

https://doi.org/10.5281/zenodo.5205322

Dataset updated

Aug 16, 2021

Dataset provided by

Zenodo (http://zenodo.org/)

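Authors

Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
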
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

It contains the following files:

- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license

The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
For the format of each JSON file, please refer to the GitHub page of Spider: https://github.com/taoyds/spider.
For the database files, please refer to the official Spider release: https://yale-lily.github.io/spider.
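
As a quick sanity check on the file listing above, the following minimal sketch counts the examples and distinct databases in a Spider-style JSON file. It assumes the file is a JSON array of objects that each carry a `db_id` field, as in the Spider format; the local path `spider-realistic.json` and the helper names `load_examples` and `summarize` are illustrative and not part of the release.

```python
import json
from pathlib import Path


def load_examples(path):
    """Load a Spider-style JSON file (assumed to be a list of example objects)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(examples):
    """Return (number of examples, number of distinct databases)."""
    # Each example is assumed to name its database in the "db_id" field.
    db_ids = {ex["db_id"] for ex in examples}
    return len(examples), len(db_ids)


if __name__ == "__main__":
    path = Path("spider-realistic.json")  # assumed local download location
    if path.exists():
        n_examples, n_dbs = summarize(load_examples(path))
        print(f"{n_examples} examples over {n_dbs} databases")
```

If the counts printed for a downloaded copy differ from those listed in the README (508 examples over 19 databases for spider-realistic.json), the file is probably a different version or an incomplete download.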

This dataset is distributed under the CC BY-SA 4.0 license.

If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al. (2018), and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}

@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev}
}

@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}

@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}

@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}

@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}

@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}

@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}

@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}

@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
