Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
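For quick inspection of a ".csv" file, a minimal pandas sketch (the file name matches the one expected by the database-creation step below; the column names follow the table above):

import pandas as pd

# Column labels as listed above; the CSV files themselves have no header row.
columns = [
    "isogramy", "length", "word", "source_pos", "count", "vol_count",
    "count_per_million", "vol_count_as_percent", "is_palindrome", "is_tautonym",
]

# Tab-separated, no header row.
df = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=columns)

# Example: the ten most frequent second-order isograms.
print(df[df["isogramy"] == 2].nlargest(10, "count")[["word", "count"]])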
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

The SQLite database combines the data from all four of the plain text files and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but it is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files; for BNC, the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram Extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can easily be collated into a SQLite3 database, which allows the data to be queried directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
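As an illustration of querying the resulting database from Python, a hedged sketch using the standard sqlite3 module; the table name used here is a placeholder, since the actual table names are defined in "create-database.sql":

import sqlite3

conn = sqlite3.connect("isograms.db")

# NOTE: "ngrams" is a placeholder table name; check create-database.sql
# (or run ".tables" in the sqlite3 shell) for the actual table names.
query = """
    SELECT word, length, count
    FROM ngrams
    WHERE is_palindrome = 1
    ORDER BY count DESC
    LIMIT 10
"""
for word, length, count in conn.execute(query):
    print(word, length, count)

conn.close()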
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of German question-answer pairs along with their corresponding context [1]. It is designed to enhance and facilitate natural language processing (NLP) tasks in the German language [1]. The dataset includes two main files, train.csv and test.csv, each containing numerous entries of various contexts with associated questions and answers in German [1]. The contextual information can range from paragraphs to concise sentences, offering a well-rounded representation of different scenarios [1]. It serves as a valuable resource for training machine learning models to improve question-answering systems or other NLP applications specific to the German language [1].
The dataset consists of the following columns [1, 2]:
* id: An identifier for each entry [2].
* context: This column contains the context in which the question is being asked. It can be a paragraph, a sentence, or any other relevant information [1].
* question: The question related to the given context [2].
* answers: This column contains the answer or answers to the given question within the corresponding context [1]. The answers could be single or multiple [1].
* Label Count: Numerical ranges with corresponding counts [2].
The dataset is provided in CSV format [1, 3], comprising two main files: train.csv and test.csv [1]. Both files contain a significant number of question-answer pairs and their respective contexts [1]. While specific total row or record counts are not explicitly stated, the source material indicates substantial amounts of data [1]. For instance, certain label counts range from 36,419.00 to 45,662.00, with varying numbers of entries within those ranges, such as 529, 508, or 29 unique values for specific segments [2].
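A minimal pandas sketch for loading the two files (standard comma-separated CSV is assumed):

import pandas as pd

# Load the two provided files; columns include id, context, question, answers.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)
print(train[["context", "question", "answers"]].head())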
This dataset is ideal for a variety of applications and use cases, including [1]:
* Building question-answering systems in German.
* Training models for German language understanding and translation tasks.
* Developing information retrieval systems that can process German user queries and return relevant information from provided contexts.
* Enhancing NLP models for accuracy and robustness in German.
* Exploring state-of-the-art methodologies or developing novel approaches for natural language understanding in German [1].
The dataset's linguistic scope is specifically the German language [1]. Geographically, it is intended for global use [4]. There are no specific notes on time range or demographic availability within the provided sources.
CC0
The dataset is intended for [1]:
* Researchers working on advancements in machine learning techniques applied to natural language understanding in German.
* Developers building and refining NLP applications for the German language.
* Enthusiasts exploring and implementing machine learning models for language processing.
Original Data Source: German Question-Answer Context Dataset
General overview The following datasets are described by this metadata record, and are available for download from the provided URL.
#### Physical parameters raw log files
Raw log files
1) DATE=
2) Time= UTC+11
3) PROG=Automated program to control sensors and collect data
4) BAT=Amount of battery remaining
5) STEP=check aquation manual
6) SPIES=check aquation manual
7) PAR=Photoactive radiation
8) Levels=check aquation manual
9) Pumps=program for pumps
10) WQM=check aquation manual
#### Respiration/PAM chamber raw excel spreadsheets
Abbreviations in headers of datasets. Note: two data sets are provided in different formats, raw and cleaned (adj). These are the same data with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) dataframe will work with the R syntax below; alternatively, add code to do the cleaning in R.
Date: ISO 1986 - Check
Time: UTC+11 unless otherwise stated
DATETIME: UTC+11 unless otherwise stated
ID (of instrument in respiration chambers):
ID43=Pulse amplitude fluorescence measurement of control
ID44=Pulse amplitude fluorescence measurement of acidified chamber
ID=1 Dissolved oxygen
ID=2 Dissolved oxygen
ID3= PAR
ID4= PAR
PAR=Photoactive radiation (umols)
F0=minimal fluorescence from PAM
Fm=maximum fluorescence from PAM
Yield=(F0 – Fm)/Fm
rChl=an estimate of chlorophyll (note this is uncalibrated and is an estimate only)
Temp=temperature, degrees C
PAR=Photoactive radiation
PAR2=Photoactive radiation 2
DO=dissolved oxygen
%Sat=saturation of dissolved oxygen
Notes=the program of the underwater submersible logger, with the following abbreviations:
Notes-1) PAM=
Notes-2) PAM=gain level set (see aquation manual for more detail)
Notes-3) Acclimatisation=program of slowly introducing treatment water into chamber
Notes-4) Shutter start up 2 sensors+sample…=Shutter PAM's automatic set-up procedure (see aquation manual)
Notes-5) Yield step 2=PAM yield measurement and calculation of control
Notes-6) Yield step 5=PAM yield measurement and calculation of acidified
Notes-7) Abatus respiration DO and PAR step 1=program to measure dissolved oxygen and PAR (see aquation manual). Steps 1-4 are different stages of this program, including pump cycles, DO and PAR measurements.
8) Rapid light curve data
Pre LC: a yield measurement prior to the following measurement
After 10.0 sec at 0.5% to 8%: level of each of the 8 steps of the rapid light curve
Odessey PAR (only in some deployments): an extra measure of PAR (umols) using an Odessey data logger
Dataflow PAR: an extra measure of PAR (umols) using a Dataflow sensor
PAM PAR: this is copied from the PAR or PAR2 column
PAR all: this is the complete PAR file and should be used
Deployment: identifying which deployment the data came from
#### Respiration chamber biomass data
The data is chlorophyll a biomass from cores taken from the respiration chambers. The headers are: Depth (mm), Treat (acidified or control), Chl a (pigment and indicator of biomass), Core (5 cores were collected from each chamber, three were analysed for chl a). These are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.
#### Associated R script file for pump cycles of respiration chambers
Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to chambers. Determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the respiration chambers are delivering water as this will give incorrect results. The headers that get used in the attached/associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.
To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180 minute data blocks were determined. R squared values for fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions with calculation of net production rates: the first is that heterotrophic community members do not change their metabolism under OA; and the second is that the heterotrophic communities are similar between treatments.
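The published analysis was done in R (see the associated R script). Purely as an illustration of the block-wise regression idea, a Python sketch that fits a linear regression of dissolved oxygen against elapsed time within discrete 180-minute blocks (the file and column names here are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical columns: "ElapsedTimeMin" (minutes) and "DO" (dissolved oxygen).
df = pd.read_csv("chamber_oxygen.csv")  # placeholder file name

block = (df["ElapsedTimeMin"] // 180).astype(int)  # discrete 180-minute blocks
rates = []
for b, grp in df.groupby(block):
    # Slope of DO vs. elapsed time within the block = rate of change per minute.
    slope, intercept = np.polyfit(grp["ElapsedTimeMin"], grp["DO"], deg=1)
    rates.append({"block": b, "slope": slope, "intercept": intercept})

print(pd.DataFrame(rates))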
#### Combined dataset pH, temperature, oxygen, salinity, velocity for experiment
This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).
The headers are:
PAR: Photoactive radiation
relETR: F0/Fm x PAR
Notes: Stage/step of light curve
Treatment: Acidified or control
The associated light treatments in each stage are listed below. Each actinic light intensity is held for 10 seconds, then a saturating pulse is taken (see PAM methods).
After 10.0 sec at 0.5% = 1 umols PAR
After 10.0 sec at 0.7% = 1 umols PAR
After 10.0 sec at 1.1% = 0.96 umols PAR
After 10.0 sec at 1.6% = 4.32 umols PAR
After 10.0 sec at 2.4% = 4.32 umols PAR
After 10.0 sec at 3.6% = 8.31 umols PAR
After 10.0 sec at 5.3% = 15.78 umols PAR
After 10.0 sec at 8.0% = 25.75 umols PAR
This dataset appears to be missing data; note that the D5 rows potentially do not contain usable information.
See the word document in the download file for more information.
This version (V3) fixes a bug in Version 2 where the 1993 data did not properly deal with missing values, leading to enormous counts of crime being reported.

This is a collection of Offenses Known and Clearances By Arrest data from 1960 to 2016. The monthly zip files contain one data file per year (57 total, 1960-2016) as well as a codebook for each year. These files have been read into R using the ASCII and setup files from ICPSR (or from the FBI for 2016 data) using the package asciiSetupReader. The end of the zip folder's name says what data type (R, SPSS, SAS, Microsoft Excel CSV, feather, Stata) the data is in. Due to file size limits on open ICPSR, not all file types were included for all the data.

The files are lightly cleaned. What this means specifically is that column names and value labels are standardized. In the original data, column names were different between years (e.g. the December burglaries cleared column is "DEC_TOT_CLR_BRGLRY_TOT" in 1975 and "DEC_TOT_CLR_BURG_TOTAL" in 1977). The data here have standardized columns so you can compare between years and combine years together. The same thing is done for values inside of columns. For example, the state column gave state names in some years and abbreviations in others. For the code used to clean and read the data, please see my GitHub file here: https://github.com/jacobkap/crime_data/blob/master/R_code/offenses_known.R

The zip files labeled "yearly" contain yearly data rather than monthly. These also contain far fewer descriptive columns about the agencies in an attempt to decrease file size. Each zip folder contains two files: a data file in whatever format you choose and a codebook. The data file is aggregated yearly and has already combined every year 1960-2016. For the code I used to do this, see here: https://github.com/jacobkap/crime_data/blob/master/R_code/yearly_offenses_known.R

If you find any mistakes in the data or have any suggestions, please email me at jkkaplan6@gmail.com

As a description of what UCR Offenses Known and Clearances By Arrest data contains, the following is copied from ICPSR's 2015 page for the data: The Uniform Crime Reporting Program Data: Offenses Known and Clearances By Arrest dataset is a compilation of offenses reported to law enforcement agencies in the United States. Due to the vast number of categories of crime committed in the United States, the FBI has limited the type of crimes included in this compilation to those crimes which people are most likely to report to police and those crimes which occur frequently enough to be analyzed across time. Crimes included are criminal homicide, forcible rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft. Much information about these crimes is provided in this dataset. The number of times an offense has been reported, the number of reported offenses that have been cleared by arrests, and the number of cleared offenses which involved offenders under the age of 18 are the major items of information collected.
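As a hedged illustration of the column-name standardization described above (only the example pair from the text is shown; the standardized label used here is hypothetical, and the authoritative mapping is in the linked GitHub code):

import pandas as pd

# Example: harmonize year-specific column names to one standard label.
# Only the pair mentioned in the text is shown; the real mapping is much larger,
# and the standardized name below is a made-up placeholder.
rename_map = {
    "DEC_TOT_CLR_BRGLRY_TOT": "dec_tot_clr_burglary_total",  # 1975 spelling
    "DEC_TOT_CLR_BURG_TOTAL": "dec_tot_clr_burglary_total",  # 1977 spelling
}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    # Apply the mapping where known; otherwise just lowercase the column name.
    return df.rename(columns={c: rename_map.get(c, c.lower()) for c in df.columns})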
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Regional- and continental-scale models predicting variations in the magnitude and timing of streamflow are important tools for forecasting water availability as well as flood inundation extent and associated damages. Such models must define the geometry of stream channels through which flow is routed. These channel parameters, such as width, depth, and hydraulic resistance, exhibit substantial variability in natural systems. While hydraulic geometry relationships have been extensively studied in the United States, they remain unquantified for thousands of stream reaches across the country. Consequently, large-scale hydraulic models frequently take simplistic approaches to channel geometry parameterization. Over-simplification of channel geometries directly impacts the accuracy of streamflow estimates, with knock-on effects for water resource and hazard prediction.
Here, we present a hydraulic geometry dataset derived from long-term measurements at U.S. Geological Survey (USGS) stream gages across the conterminous United States (CONUS). This dataset includes (a) at-a-station hydraulic geometry parameters following the methods of Leopold and Maddock (1953), (b) at-a-station Manning's n calculated from the Manning equation, (c) daily discharge percentiles, and (d) downstream hydraulic geometry regionalization parameters based on HUC4 (Hydrologic Unit Code 4). This dataset is referenced in Heldmyer et al. (2022); further details and implications for CONUS-scale hydrologic modeling are available in that article (https://doi.org/10.5194/hess-26-6121-2022).
At-a-station Hydraulic Geometry
We calculated hydraulic geometry parameters using historical USGS field measurements at individual station locations. Leopold and Maddock (1953) derived the following power law relationships:
\(w=aQ^b\)
\(d=cQ^f\)
\(v=kQ^m\)
where Q is discharge, w is width, d is depth, v is velocity, and a, b, c, f, k, and m are at-a-station hydraulic geometry (AHG) parameters. We downloaded the complete record of USGS field measurements from the USGS NWIS portal (https://waterdata.usgs.gov/nwis/measurements). This raw dataset includes 4,051,682 individual measurements from a total of 66,841 stream gages within CONUS. Quantities of interest in AHG derivations are Q, w, d, and v. USGS field measurements do not include d--we therefore calculated d using d=A/w, where A is measured channel area. We applied the following quality control (QC) procedures in order to ensure the robustness of AHG parameters derived from the field data:
Application of the QC procedures described above removed 55,328 stream gages, many of which were short-term campaign gages at which very few field measurements had been recorded. We derived AHG parameters for the remaining 11,513 gages which passed our QC.
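For illustration, a minimal sketch of how AHG parameters can be fit from field measurements by least squares on log-transformed variables (the file and column names here are hypothetical; the authors' exact QC and fitting workflow is described above and in Heldmyer et al. 2022):

import numpy as np
import pandas as pd

# Hypothetical file/columns: discharge Q, width w, channel area A, velocity v,
# for the field measurements at a single gage.
meas = pd.read_csv("field_measurements_one_gage.csv")
meas["d"] = meas["A"] / meas["w"]  # depth from measured channel area, d = A / w

def power_law_fit(Q, y):
    """Fit y = coef * Q**exp by linear regression in log-log space."""
    exp, log_coef = np.polyfit(np.log(Q), np.log(y), deg=1)
    return np.exp(log_coef), exp

a, b = power_law_fit(meas["Q"], meas["w"])  # w = a * Q**b
c, f = power_law_fit(meas["Q"], meas["d"])  # d = c * Q**f
k, m = power_law_fit(meas["Q"], meas["v"])  # v = k * Q**m
print(a, b, c, f, k, m)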
At-a-station Manning's n
We calculated hydraulic resistance at each gage location by solving Manning's equation for Manning's n, given by
\(n = \frac{R^{2/3}S^{1/2}}{v}\)
where v is velocity, R is hydraulic radius and S is longitudinal slope. We used smoothed reach-scale longitudinal slopes from the NHDPlusv2 (National Hydrography Dataset Plus, version 2) ElevSlope data product. We note that NHDPlusv2 contains a minimum slope constraint of \(10^{-5}\) m/m--no reach may have a slope less than this value. Furthermore, NHDPlusv2 lacks slope values for certain reaches. As such, we could not calculate Manning's n for every gage, and some Manning's n values we report may be inaccurate due to the NHDPlusv2 minimum slope constraint. We report two Manning's n values, both of which take stream depth as an approximation for R. The first takes the median stream depth and velocity measurements from the USGS's database of manual flow measurements for each gage. The second uses stream depth and velocity calculated for a 50th percentile discharge (Q50; see below). Approximating R as stream depth is an assumption which is generally considered valid if the width-to-depth ratio of the stream is greater than 10—which was the case for the vast majority of field measurements. Thus, we report two Manning's n values for each gage, which are each intended to approximately represent median flow conditions.
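A small sketch of the Manning's n calculation as described, approximating the hydraulic radius R by depth (SI units assumed):

def mannings_n(depth_m: float, slope: float, velocity_ms: float) -> float:
    """n = R^(2/3) * S^(1/2) / v, with hydraulic radius R approximated by depth."""
    return depth_m ** (2.0 / 3.0) * slope ** 0.5 / velocity_ms

# Example with a median measured depth/velocity and an NHDPlusv2 reach slope.
print(mannings_n(depth_m=1.2, slope=1e-3, velocity_ms=0.8))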
Daily discharge percentiles
We downloaded full daily discharge records from 16,947 USGS stream gages through the NWIS online portal. The data includes records from both operational and retired gages. Records for operational gages were truncated at the end of the 2018 water year (September 30, 2018) in order to avoid use of preliminary data. To ensure the robustness of daily discharge percentiles, we applied the following QC:
We calculated discharge percentiles for each of the 10,871 gages which passed QC. Discharge percentiles were calculated at increments of 1% between Q1 and Q5, increments of 5% (e.g. Q10, Q15, Q20, etc.) between Q5 and Q95, increments of 1% between Q95 and Q99, and increments of 0.1% between Q99 and Q100 in order to provide higher resolution at the lowest and highest flows, which occur much less frequently.
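The percentile levels described above can be reproduced as follows (a sketch; the percentile estimation method is not stated in the text, so numpy's default interpolation is an assumption):

import numpy as np

# Percentile levels: 1% steps between Q1 and Q5 and between Q95 and Q99,
# 5% steps between Q5 and Q95, 0.1% steps between Q99 and Q100.
levels = np.unique(np.round(np.concatenate([
    np.arange(1, 5, 1),
    np.arange(5, 95, 5),
    np.arange(95, 99, 1),
    np.arange(99, 100.05, 0.1),
]), 1))

# For a QC'd daily discharge record daily_q (a 1-D array), the percentiles
# would then be: np.percentile(daily_q, levels)
print(levels)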
HG Regionalization
We regionalized AHG parameters from gage locations to all stream reaches in the conterminous United States. This downstream hydraulic geometry regionalization was performed using all gages with AHG parameters in each HUC4, as opposed to traditional downstream hydraulic geometry--which involves interpolation of parameters of interest to ungaged reaches on individual streams. We performed linear regressions on log-transformed drainage area and Q at a number of flow percentiles as follows:
\(\log(Q_i) = \beta_1\log(DA) + \beta_0\)
where \(Q_i\) is streamflow at percentile i, DA is drainage area, and \(\beta_1\) and \(\beta_0\) are regression parameters. We report \(\beta_1\), \(\beta_0\), and the r2 value of the regression relationship for Q percentiles Q10, Q25, Q50, Q75, Q90, Q95, Q99, and Q99.9. Further discussion and additional analysis of HG regionalization are presented in Heldmyer et al. (2022).
Dataset description
We present the HyG dataset in a comma-separated value (csv) format. Each row corresponds to a different USGS stream gage. Information in the dataset includes gage ID (column 1), gage location in latitude and longitude (columns 2-3), gage drainage area (from USGS; column 4), longitudinal slope of the gage's stream reach (from NHDPlusv2; column 5), AHG parameters derived from field measurements (columns 6-11), Manning's n calculated from median measured flow conditions (column 12), Manning's n calculated from Q50 (column 13), Q percentiles (columns 14-51), HG regionalization parameters and r2 values (columns 52-75), and geospatial information for the HUC4 in which the gage is located (from USGS; columns 76-87). Users are advised to exercise caution when opening the dataset. Certain software, including Microsoft Excel and Python, may drop the leading zeros in USGS gage IDs and HUC4 IDs if these columns are not explicitly imported as strings.
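Following the caution above, a pandas sketch that reads the ID-like columns as strings so leading zeros survive (the file name and column labels here are assumptions; check the header of the distributed csv):

import pandas as pd

# Read gage ID and HUC4 ID columns as strings so leading zeros are not dropped.
# The column labels below are assumptions about the distributed file.
hyg = pd.read_csv(
    "HyG.csv",
    dtype={"gage_id": str, "huc4": str},
)
print(hyg.dtypes.head())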
Errata
In version 1, drainage area was mistakenly reported in cubic meters but labeled in cubic kilometers. This error has been corrected in version 2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Avian predators avoid attacking fly-mimicking beetles: A field experiment on evasive mimicry using artificial prey

Many Neotropical beetles present coloration patterns mimicking red-eyed flies, which are presumably evasive mimicry models. However, the role of predators in selecting for evasive mimics in nature remains untested. In a field experiment, we used nontoxic plasticine replicas of a specialized fly-mimicking beetle species, which we placed on the host plants of the beetles. We show that replicas painted with reddish patches simulating the eyes of flesh flies experienced a much lower predation rate than control replicas. We found that beak marks were the most frequent signs of attack on plasticine replicas, underlining the potential selective pressure exerted by birds. Replicas that matched the size of the beetles suffered higher predation than smaller or larger replicas. The predation rate was also higher for beetle replicas exposed during the warm and wet season, when adult beetles occur. Our results support predator-mediated selection of mimic beetles, highlighting that reddish spots resembling flies' eyes comprise an important trait in reducing attack by avian predators.

This study describes the results of two field experiments using artificial beetle replicas made with nontoxic plasticine placed on host plants of a specialized fly-mimicking beetle species. Experiment #1 manipulated the coloration pattern of beetle replicas to test the prediction that replicas with red patches would suffer fewer attacks by avian predators than control replicas not resembling red-eyed flies. Experiment #2 evaluated three specific predictions. First, attack marks corresponding to beaks should be frequently observed on the beetle replicas. Second, we predicted that replicas of intermediate size (0.5 – 1.0 cm) should be attacked by birds more than those in the extreme size classes (0.25 – 2.0 and 4.0 cm). Third, we predicted that avian attacks on replicas should be higher at the peak of the wet season than during the peak of the dry season.

The file "Dataset_Avian predators avoid attacking fly-mimicking" contains all data.

The spreadsheet Experiment_#1_raw contains seven columns and 361 lines, where each line represents a single plasticine replica. The first column (#) designates the plasticine replica number within each sample unit. The second column (Site) designates the area where blocks were established. The third column (Block) designates the different locations where sample units (groups of 10 plasticine replicas of each treatment) were subject to the same spatial conditions. The fourth column (PlantTag) designates the individual mistletoes where replicas were placed. The fifth column (Treatment) designates the coloration pattern of the plasticine replicas, classified as control, bad (brown-eyed) and good (red-eyed). The sixth column (Attack) designates the occurrence of attack on replicas, classified according to the vestiges left on the plasticine (bird - V or U-shaped beak marks; insects - small mouthpart perforations; mammal - teeth marks; missing - plasticine removed).

The spreadsheet Experiment_#1 contains five columns and 37 lines, where each line represents a sample unit, i.e. a group of ten plasticine replicas of the same color pattern (Treatment) within a block. The first column (Site) designates the area where blocks were established. The second column (Block) designates the different locations of the groups of 10 plasticine replicas of each treatment. The third column (Treatment) designates the color pattern of the plasticine replicas. The fourth column (Total) designates the valid number of replicas in each sample unit. The fifth column (Attacked) designates the number of replicas with evidence of bird attack.

The spreadsheet Experiment_#2_raw contains eight columns and 1201 lines, where each line represents a single plasticine replica. The first column (#) designates the plasticine replica number within each sample unit. The second column (Site) designates the area where blocks were established. The third column (Block) designates the different locations where sample units (groups of 10 plasticine replicas of each treatment) were subject to the same spatial conditions. The fourth column (PlantTag) designates the individual trees where replicas were placed. The fifth column (Season) designates the period when the experiment was conducted. The sixth column (Size) designates the size class of the plasticine replicas. The seventh column (Attack) designates the occurrence of attack on replicas, classified according to the vestiges left on the plasticine (bird - V or U-shaped beak marks; insects - small mouthpart perforations; mammal - teeth marks; missing - plasticine removed).

The spreadsheet Experiment_#2 contains seven columns and 121 lines, where each line represents a sample unit, i.e. a group of ten plasticine replicas of the same size within a block in each period. The first column (Site) designates the area where blocks were established. The second column (Block) designates the different locations of the groups of 10 plasticine replicas of each treatment. The third column (Season) designates the period when the experiment was conducted. The fourth column (Size) designates the size class of the plasticine replicas. The fifth column (Total) designates the valid number of replicas in each sample unit. The sixth column (Attacked) designates the number of replicas with evidence of bird attack.

The spreadsheet Coordinates contains the blocks' geographical coordinates and elevation data.

We analyzed data using R software (R Core Team, 2019). We employed generalized linear mixed effects models (GLMMs, glmer for non-normal datasets, with the lme4 package in R) with fixed and random effects to analyze the datasets of replica attacks for experiments #1 and #2. In experiment #1, the explanatory variable Treatment encompassed replica coloration with three levels (control – no paint, bad – brown-eyed, and good – red-eyed) and was considered a fixed factor, with Site and Block assigned as random effects. The response variable was the proportion of replicas exclusively attacked by avian predators after 14 days of exposure, calculated by dividing the variable Attacked by the variable Total. For this analysis we used the data of the spreadsheet Experiment_#1, which was transformed into a TXT file named "Attack" to be used in R.

In experiment #2, the explanatory variables Size (replica size classes, five levels: 0.25 – 0.5 – 1.0 – 2.0 – 4.0 cm) and Season (two levels: dry and wet) and the interaction between these factors were considered fixed factors, whereas Site and Block were assigned as random effects accounting for the spatial heterogeneity of samples. The response variable was the proportion of replicas exclusively attacked by avian predators after 14 days of exposure, calculated by dividing the variable Attacked by the variable Total. For this analysis we used the data of the spreadsheet Experiment_#2, which was transformed into a TXT file named "Size" to be used in R.

The datasets of both experiments fitted a binomial error distribution of the response variable. We selected the minimal models after the removal of non-significant variables (P-value > 0.05). If we detected significant differences in variables with more than two levels (Size or Treatment), we performed contrast analysis to determine differences among levels. In both minimal models we checked the error distribution and data overdispersion.

Figures 3 and 4ab were generated in Excel using absolute counts or calculated means and standard errors. The raw data underlying Figures 3, 4a and 4b are on the spreadsheets with the same names (Figure 3, Figure 4a, Figure 4b) in the file "Dataset_Avian predators avoid attacking fly-mimicking", where the figures and the underlying formulas are located. Double-click on the figures to view how the data generated the figure. The figures were pasted into PowerPoint for the inclusion of letters and photos of the replicas. Slides were then exported as .jpeg files.
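The statistical models were fitted in R with lme4::glmer, as described above. Purely as a minimal illustration of how the response variable is constructed from the Experiment_#1 spreadsheet, a pandas sketch (the workbook file extension is assumed):

import pandas as pd

# Sheet "Experiment_#1": one row per sample unit (group of 10 replicas),
# with columns Site, Block, Treatment, Total, Attacked.
exp1 = pd.read_excel(
    "Dataset_Avian predators avoid attacking fly-mimicking.xlsx",  # extension assumed
    sheet_name="Experiment_#1",
)

# Response variable used in the GLMM: proportion of replicas attacked by birds.
exp1["prop_attacked"] = exp1["Attacked"] / exp1["Total"]
print(exp1.groupby("Treatment")["prop_attacked"].mean())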
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Latin Inscriptions in Space and Time (LIST) dataset is an aggregate of two epigraphic datasets, the Epigraphic Database Heidelberg (EDH; https://edh.ub.uni-heidelberg.de/; aggregated EDH on Zenodo) and the Epigraphic Database Clauss Slaby (EDCS; http://www.manfredclauss.de/; aggregated EDCS on Zenodo), created by the Social Dynamics in the Ancient Mediterranean Project (SDAM), 2019-2023, funded by the Aarhus University Forskningsfond Starting grant no. AUFF-E-2018-7-2.

The LIST dataset consists of 525,870 inscriptions, enriched by 65 attributes. 77,091 inscriptions overlap between the two source datasets (i.e. EDH and EDCS); 3,316 inscriptions are exclusively from EDH; 445,463 inscriptions are exclusively from EDCS. 511,973 inscriptions have valid geospatial coordinates (the geometry attribute). This information is also used to determine the urban context of each inscription, i.e. whether it is in the neighbourhood of (within a 5000 m buffer around) a large city, medium city, or small city, or rural (>5000 m to any type of city); see the attributes urban_context, urban_context_city, and urban_context_pop. 206,570 inscriptions have a numerical date of origin expressed by means of an interval or a singular year using the attributes not_before and not_after. The dataset also employs a machine learning model to classify the inscriptions covered exclusively by EDCS in terms of 22 categories employed by EDH; see Kaše, Heřmánková, Sobotkova 2021.
Formats
We publish the dataset in the parquet and geojson file formats. A description of the individual attributes is available in the Metadata.csv. Using the geopandas library, you can load the data directly from Zenodo into your Python environment with the following command: LIST = gpd.read_parquet("https://zenodo.org/record/8431323/files/LIST_v1-0.parquet?download=1"). In R, the sfarrow and sf libraries provide tools (st_read_parquet(), read_sf()) to load a parquet and a geojson file respectively after you have downloaded the datasets locally. The scripts used to generate the dataset are available via GitHub: https://github.com/sdam-au/LI_ETL
The origin of existing attributes is further described in columns ‘dataset_source’, ‘source’, and ‘description’ in the attached Metadata.csv.
Further reading on the dataset creation and methodology:
Reading on applications of the datasets in research:
Notes on spatial attributes
Machine-readable spatial point geometries are provided within the geojson and parquet formats, as well as 'Latitude' and 'Longitude' columns, which contain geospatial decimal coordinates where these are known. Additional attributes exist that contain textual references to the original location at different scales. The most reliable attribute with textual information on the place of origin is urban_context_city. This contains the ancient toponym of the largest city within a 5 km distance from the inscription findspot, using cities from Hanson's 2016 list. After these universal attributes, the remaining columns are source-dependent and exist only for either the EDH or the EDCS subset. The 'pleiades_id' column, for example, cross-references the inscription findspot to a geospatial location in the Pleiades, but only in the EDH subset. The 'place' attribute exists for data from EDCS (Ort) and contains ancient as well as modern place names referring to the findspot or region of provenance, separated by "/". This column requires additional cleaning before computational analysis. Attributes with the _clean suffix indicate that the text string has been stripped of symbols (such as ?), and most refer to aspects of provenance in the EDH subset of inscriptions.
List of all spatial attributes:
Disclaimer
The original data is provided by the third party indicated as the data source (see the ‘data_source’ column in the Metadata.csv). SDAM did not create the original data, vouch for its accuracy, or guarantee that it is the most recent data available from the data provider. For many or all of the data, the data is by its nature approximate and will contain some inaccuracies or missing values. The data may contain errors introduced by the data provider(s) and/or by SDAM. We always recommend checking the accuracy directly in the primary source, i.e. the editio princeps of the inscription in question.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List
available_habitat.txt - data file representing the available habitat
gps_locations.txt - data file representing the GPS location data
Rcode.R - R code to analyze the example data using the RSF likelihood for GPS fix success
all_files.zip - all files at once

Description
Rcode.R analyzes the example data, which is one of the simulated data sets with 90% GPS fix success contained in the Nielson et al. paper. There are two data files (available_habitat.txt and gps_locations.txt) representing the available habitat and the GPS location data, respectively. The example data can be analyzed by saving both data files to a working directory, opening an R session and copying and pasting all text below at the R command prompt. Select quantities are output to the terminal. The description of the columns in the data files is provided below.

In available_habitat.txt: (1) utmX = UTM easting coordinate of habitat unit; (2) utmY = UTM northing coordinate of habitat unit; (3) unit.id = habitat unit ID; (4) prcnt.sage = % Wyoming Big Sage; (5) elevation = elevation (km).

In gps_locations.txt: (1) unit.id = habitat unit of GPS location (missing = NA); (2) fix.attempt = sequential fix attempt number.

Column sums for available_habitat.txt (in order): utmX = 4.985109e+7; utmY = 8.483487e+8; unit.id = 1.7205e+4; prcnt.sage = 1.06875e+4; elevation = 3.795941e+2.

Column sums for gps_locations.txt (in order): unit.id = 42761 (note some "NA" values, which indicate missing); fix.attempt = 98790.
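As a quick way to verify a download against the stated column sums, a Python sketch (whitespace-delimited plain text with a header row is an assumption about the file format):

import pandas as pd

hab = pd.read_csv("available_habitat.txt", sep=r"\s+")  # delimiter assumed
gps = pd.read_csv("gps_locations.txt", sep=r"\s+")

# Compare against the column sums listed above; NA values are skipped by default.
print(hab[["utmX", "utmY", "unit.id", "prcnt.sage", "elevation"]].sum())
print(gps["unit.id"].sum(), gps["fix.attempt"].sum())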
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset presents the code written for the analysis and modelling for the Jellyfish Forecasting System for NESP TWQ Project 2.2.3. The Jellyfish Forecasting System (JFS) searches for robust statistical relationships between historical sting events (and observations) and local environmental conditions. These relationships are tested using data to quantify the underlying uncertainties. They then form the basis for forecasting risk levels associated with current environmental conditions.
The development of the JFS modelling and analysis is supported by the Venomous Jellyfish Database (sting events and specimen samples – November 2018) (NESP 2.2.3, CSIRO) with corresponding analysis of wind fields and tidal heights along the Queensland coastline. The code has been calibrated and tested for the study focus regions including Cairns (Beach, Island, Reef), Townsville (Beach, Island+Reef) and Whitsundays (Beach, Island+Reef).
The JFS uses the European Centre for Medium-Range Weather Forecasts (ECMWF) wind fields from the ERA-Interim daily product (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim). This daily product has global coverage at a spatial resolution of approximately 80 km. However, only 11 locations off the Queensland coast were extracted, covering the period 1-Jan-1985 to 31-Dec-2016. For the modelling, the data has been transformed into CSV files containing date, eastward wind (m/s) and northward wind (m/s) for each of the 11 geographical locations.
Hourly tidal height was calculated from tidal harmonics supplied by the Bureau of Meteorology (http://www.bom.gov.au/oceanography/projects/ntc/ntc.shtml) using the XTide software (http://www.flaterco.com/xtide/). Hourly tidal heights have been calculated for 7 sites along the Queensland coast (Albany Island, Cairns, Cardwell, Cooktown, Fife, Grenville, Townsville) for the period 1-Jan-1985 to 31-Dec-2017. Data has been transformed into CSV files, one for each of the 7 sites. Columns correspond to number of days since 1-Jan 1990 and tidal height (m).
Irukandji stings were then modelled using a generalised linear model (GLM). A GLM generalises ordinary linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value (McCullagh & Nelder 1989). For each region, we used a GLM with the number of Irukandji stings per day as the response variable. The GLM had a Poisson error structure and a log link function (Crawley 2005). For the Poisson GLMs, we inferred absences when stings were not recorded in the data for a day. We consider that there was reasonably consistent sampling effort in the database since 1985, but very patchy prior to this date. It should be noted that Irukandji are very patchy in time; for example, there was a single sting record in 2017 despite considerable effort trying to find stings in that year. Although the database might miss small and localised Irukandji sting events, we believe it captures larger infestation events.
We included six predictors in the models: Month, two wind variables, and three tidal variables. Month was a factor and arranged so that the summer was in the middle of the year (i.e., from June to May). The two wind variables were Speed and Direction. For each day within each region (Cairns, Townsville or Whitsundays), hourly wind-speed and direction was used. We derived cumulative wind Speed and Direction, working backwards from each day, with the current day being Day 1. We calculated cumulative winds from the current day (Day 1) to 14 days previously for every day in every Region and Area. To provide greater weighting for winds on more recent days, we used an inverse weighting for each day, where the weighting was given by 1/i for each day i. Thus, the Cumulative Speed for n days is given by:
\(\text{Cumulative Speed}_n = \frac{\sum_{i=1}^{n} \text{Speed}_i / i}{\sum_{i=1}^{n} 1/i}\)
For example, calculations for the 3-day cumulative wind speed are:
\(\frac{1/1 \times \text{Wind Day 1} + 1/2 \times \text{Wind Day 2} + 1/3 \times \text{Wind Day 3}}{1/1 + 1/2 + 1/3}\)
Similarly, we calculated the cumulative weighted wind Direction using the formula:
\(\text{Cumulative Direction}_n = \frac{\sum_{i=1}^{n} \text{Direction}_i / i}{\sum_{i=1}^{n} 1/i}\)
We used circular statistics in the R package circular to calculate the weighted cumulative mean, because a direction of 0° is the same as 360°. We initially used a smoother for this term in the model, but because of its non-linearity and the lack of winds from all directions, we found that it was better to use wind Direction as a factor with four levels (NW, NE, SE and SW). In some Regions and Areas, not all wind Directions were present.
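As a sketch of the weighted cumulative speed formula above (the original implementation is in R; the circular mean needed for Direction is not shown here):

import numpy as np

def cumulative_speed(speeds, n):
    """Inverse-day-weighted cumulative wind speed over the last n days.

    speeds[0] is the current day (Day 1), speeds[1] the day before, and so on.
    """
    days = np.arange(1, n + 1)
    weights = 1.0 / days
    return np.sum(np.asarray(speeds[:n]) * weights) / np.sum(weights)

# 3-day example matching the worked calculation above:
print(cumulative_speed([10.0, 6.0, 3.0], n=3))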
To assign each event to the tidal cycle, we used tidal data from the closest of the seven stations to calculate three tidal variables: (i) the tidal range each day (m); (ii) the tidal height (m); and (iii) whether the tide was incoming or outgoing. To estimate the three tidal variables, the time of day of the event was required. However, the Time of Day was only available for 780 observations, and the 291 missing observations were estimated assuming a random Time of Day, which will not influence the relationship but will keep these rows in the analysis. Tidal range was not significant in any models and will not be considered further.
To focus on times when Irukandji were present, months when stings never occurred in an area/region were excluded from the analysis – this is generally the winter months. For model selection, we used Akaike Information Criterion (AIC), which is an estimate of the relative quality of models given the data, to choose the most parsimonious model. We thus do not talk about significant predictors, but important ones, consistent with information theoretic approaches.
Limitations: It is important to note that while the presence of Irukandji is more likely on high risk days, the forecasting system should not be interpreted as predicting the presence of Irukandji or that stings will occur.
Format:
It is a text file with a .r extension, the default code format in R. This code runs on the csv datafile “VJD_records_EXTRACT_20180802_QLD.csv” that has latitude, longitude, date, and time of day for each Irukandji sting on the GBR. A subset of these data have been made publicly available through eAtlas, but not all data could be made publicly available because of permission issues. For more information about data permissions, please contact Dr Lisa Gershwin (lisa.gershwin@stingeradvisor.com).
Data Location:
This dataset is filed in the eAtlas enduring data repository at: data\custodian\2016-18-NESP-TWQ-2\2.2.3_Jellyfish-early-warning\data\ and https://github.com/eatlas/NESP_2.2.3_Jellyfish-early-warning
The data set in this accession contains 100 stations of hydrographic data collected in the northeast Atlantic, south of the Azores, aboard R/V ENDEAVOR, cruise #143. The dates of the data are May 1-19, 1987. Measurements of two dissolved chlorofluorocarbons, CCl3F (Freon 11) and CCl2F2 (Freon 12), were obtained at a number of stations along the cruise track. Data format: the first three columns are CTD pressure (dbar), depth (meters) and CTD temperature (Deg C) at which each water sample was collected. These columns are followed by the water sample salinity (o/oo), dissolved oxygen (ml/l), the calculated variable potential temperature (Deg C), Freon 11 (pmol/kg) and Freon 12 (pmol/kg). Missing values are indicated with -9.000. The data were provided by Dr. T. Joyce, Woods Hole Oceanographic Institution.