Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
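For quick inspection of a ".csv" file, a minimal pandas sketch (the file name matches the one expected by the database-creation step below; the column names follow the table above):

import pandas as pd

# Column labels as listed above; the CSV files themselves have no header row.
columns = [
    "isogramy", "length", "word", "source_pos", "count", "vol_count",
    "count_per_million", "vol_count_as_percent", "is_palindrome", "is_tautonym",
]

# Tab-separated, no header row.
df = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=columns)

# Example: the ten most frequent second-order isograms.
print(df[df["isogramy"] == 2].nlargest(10, "count")[["word", "count"]])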
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

The SQLite database combines the data from all four of the plain text files and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but it is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files; for BNC, the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram Extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can easily be collated into a SQLite3 database, which allows the data to be queried directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
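As an illustration of querying the resulting database from Python, a hedged sketch using the standard sqlite3 module; the table name used here is a placeholder, since the actual table names are defined in "create-database.sql":

import sqlite3

conn = sqlite3.connect("isograms.db")

# NOTE: "ngrams" is a placeholder table name; check create-database.sql
# (or run ".tables" in the sqlite3 shell) for the actual table names.
query = """
    SELECT word, length, count
    FROM ngrams
    WHERE is_palindrome = 1
    ORDER BY count DESC
    LIMIT 10
"""
for word, length, count in conn.execute(query):
    print(word, length, count)

conn.close()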
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of German question-answer pairs along with their corresponding context [1]. It is designed to enhance and facilitate natural language processing (NLP) tasks in the German language [1]. The dataset includes two main files, train.csv and test.csv, each containing numerous entries of various contexts with associated questions and answers in German [1]. The contextual information can range from paragraphs to concise sentences, offering a well-rounded representation of different scenarios [1]. It serves as a valuable resource for training machine learning models to improve question-answering systems or other NLP applications specific to the German language [1].
The dataset consists of the following columns [1, 2]:
* id: An identifier for each entry [2].
* context: This column contains the context in which the question is being asked. It can be a paragraph, a sentence, or any other relevant information [1].
* question: The question related to the given context [2].
* answers: This column contains the answer or answers to the given question within the corresponding context [1]. The answers could be single or multiple [1].
* Label Count: Numerical ranges with corresponding counts [2].
The dataset is provided in CSV format [1, 3], comprising two main files: train.csv and test.csv [1]. Both files contain a significant number of question-answer pairs and their respective contexts [1]. While specific total row or record counts are not explicitly stated, the source material indicates substantial amounts of data [1]. For instance, certain label counts range from 36,419.00 to 45,662.00, with varying numbers of entries within those ranges, such as 529, 508, or 29 unique values for specific segments [2].
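A minimal pandas sketch for loading the two files (standard comma-separated CSV is assumed):

import pandas as pd

# Load the two provided files; columns include id, context, question, answers.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)
print(train[["context", "question", "answers"]].head())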
This dataset is ideal for a variety of applications and use cases, including [1]:
* Building question-answering systems in German.
* Training models for German language understanding and translation tasks.
* Developing information retrieval systems that can process German user queries and return relevant information from provided contexts.
* Enhancing NLP models for accuracy and robustness in German.
* Exploring state-of-the-art methodologies or developing novel approaches for natural language understanding in German [1].
The dataset's linguistic scope is specifically the German language [1]. Geographically, it is intended for global use [4]. There are no specific notes on time range or demographic availability within the provided sources.
CC0
The dataset is intended for [1]:
* Researchers working on advancements in machine learning techniques applied to natural language understanding in German.
* Developers building and refining NLP applications for the German language.
* Enthusiasts exploring and implementing machine learning models for language processing.
Original Data Source: German Question-Answer Context Dataset
General overview The following datasets are described by this metadata record, and are available for download from the provided URL.
#### Physical parameters raw log files
Raw log files
1) DATE=
2) Time= UTC+11
3) PROG=Automated program to control sensors and collect data
4) BAT=Amount of battery remaining
5) STEP=check aquation manual
6) SPIES=check aquation manual
7) PAR=Photoactive radiation
8) Levels=check aquation manual
9) Pumps=program for pumps
10) WQM=check aquation manual
#### Respiration/PAM chamber raw excel spreadsheets
Abbreviations in headers of datasets. Note: two data sets are provided in different formats, raw and cleaned (adj). These are the same data with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) dataframe will work with the R syntax below; alternatively, add code to do the cleaning in R.
Date: ISO 1986 - Check
Time: UTC+11 unless otherwise stated
DATETIME: UTC+11 unless otherwise stated
ID (of instrument in respiration chambers):
ID43=Pulse amplitude fluorescence measurement of control
ID44=Pulse amplitude fluorescence measurement of acidified chamber
ID=1 Dissolved oxygen
ID=2 Dissolved oxygen
ID3= PAR
ID4= PAR
PAR=Photoactive radiation (umols)
F0=minimal fluorescence from PAM
Fm=maximum fluorescence from PAM
Yield=(F0 – Fm)/Fm
rChl=an estimate of chlorophyll (note this is uncalibrated and is an estimate only)
Temp=temperature, degrees C
PAR=Photoactive radiation
PAR2=Photoactive radiation 2
DO=dissolved oxygen
%Sat=saturation of dissolved oxygen
Notes=the program of the underwater submersible logger, with the following abbreviations:
Notes-1) PAM=
Notes-2) PAM=gain level set (see aquation manual for more detail)
Notes-3) Acclimatisation=program of slowly introducing treatment water into chamber
Notes-4) Shutter start up 2 sensors+sample…=Shutter PAM's automatic set-up procedure (see aquation manual)
Notes-5) Yield step 2=PAM yield measurement and calculation of control
Notes-6) Yield step 5=PAM yield measurement and calculation of acidified
Notes-7) Abatus respiration DO and PAR step 1=program to measure dissolved oxygen and PAR (see aquation manual). Steps 1-4 are different stages of this program, including pump cycles, DO and PAR measurements.
8) Rapid light curve data
Pre LC: a yield measurement prior to the following measurement
After 10.0 sec at 0.5% to 8%: level of each of the 8 steps of the rapid light curve
Odessey PAR (only in some deployments): an extra measure of PAR (umols) using an Odessey data logger
Dataflow PAR: an extra measure of PAR (umols) using a Dataflow sensor
PAM PAR: this is copied from the PAR or PAR2 column
PAR all: this is the complete PAR file and should be used
Deployment: identifying which deployment the data came from
#### Respiration chamber biomass data
The data is chlorophyll a biomass from cores taken from the respiration chambers. The headers are: Depth (mm), Treat (acidified or control), Chl a (pigment and indicator of biomass), Core (5 cores were collected from each chamber, three were analysed for chl a). These are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.
#### Associated R script file for pump cycles of respiration chambers
Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to chambers. Determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the respiration chambers are delivering water as this will give incorrect results. The headers that get used in the attached/associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.
To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180 minute data blocks were determined. R squared values for fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions with calculation of net production rates: the first is that heterotrophic community members do not change their metabolism under OA; and the second is that the heterotrophic communities are similar between treatments.
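The published analysis was done in R (see the associated R script). Purely as an illustration of the block-wise regression idea, a Python sketch that fits a linear regression of dissolved oxygen against elapsed time within discrete 180-minute blocks (the file and column names here are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical columns: "ElapsedTimeMin" (minutes) and "DO" (dissolved oxygen).
df = pd.read_csv("chamber_oxygen.csv")  # placeholder file name

block = (df["ElapsedTimeMin"] // 180).astype(int)  # discrete 180-minute blocks
rates = []
for b, grp in df.groupby(block):
    # Slope of DO vs. elapsed time within the block = rate of change per minute.
    slope, intercept = np.polyfit(grp["ElapsedTimeMin"], grp["DO"], deg=1)
    rates.append({"block": b, "slope": slope, "intercept": intercept})

print(pd.DataFrame(rates))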
#### Combined dataset pH, temperature, oxygen, salinity, velocity for experiment
This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).
The headers are:
PAR: Photoactive radiation
relETR: F0/Fm x PAR
Notes: Stage/step of light curve
Treatment: Acidified or control
The associated light treatments in each stage are listed below. Each actinic light intensity is held for 10 seconds, then a saturating pulse is taken (see PAM methods).
After 10.0 sec at 0.5% = 1 umols PAR
After 10.0 sec at 0.7% = 1 umols PAR
After 10.0 sec at 1.1% = 0.96 umols PAR
After 10.0 sec at 1.6% = 4.32 umols PAR
After 10.0 sec at 2.4% = 4.32 umols PAR
After 10.0 sec at 3.6% = 8.31 umols PAR
After 10.0 sec at 5.3% = 15.78 umols PAR
After 10.0 sec at 8.0% = 25.75 umols PAR
This dataset appears to be missing data; note that the D5 rows potentially do not contain usable information.
See the word document in the download file for more information.
This version (V3) fixes a bug in Version 2 where the 1993 data did not properly deal with missing values, leading to enormous counts of crime being reported.

This is a collection of Offenses Known and Clearances By Arrest data from 1960 to 2016. The monthly zip files contain one data file per year (57 total, 1960-2016) as well as a codebook for each year. These files have been read into R using the ASCII and setup files from ICPSR (or from the FBI for 2016 data) using the package asciiSetupReader. The end of the zip folder's name says what data type (R, SPSS, SAS, Microsoft Excel CSV, feather, Stata) the data is in. Due to file size limits on open ICPSR, not all file types were included for all the data.

The files are lightly cleaned. What this means specifically is that column names and value labels are standardized. In the original data, column names were different between years (e.g. the December burglaries cleared column is "DEC_TOT_CLR_BRGLRY_TOT" in 1975 and "DEC_TOT_CLR_BURG_TOTAL" in 1977). The data here have standardized columns so you can compare between years and combine years together. The same thing is done for values inside of columns. For example, the state column gave state names in some years and abbreviations in others. For the code used to clean and read the data, please see my GitHub file here: https://github.com/jacobkap/crime_data/blob/master/R_code/offenses_known.R

The zip files labeled "yearly" contain yearly data rather than monthly. These also contain far fewer descriptive columns about the agencies in an attempt to decrease file size. Each zip folder contains two files: a data file in whatever format you choose and a codebook. The data file is aggregated yearly and has already combined every year 1960-2016. For the code I used to do this, see here: https://github.com/jacobkap/crime_data/blob/master/R_code/yearly_offenses_known.R

If you find any mistakes in the data or have any suggestions, please email me at jkkaplan6@gmail.com

As a description of what UCR Offenses Known and Clearances By Arrest data contains, the following is copied from ICPSR's 2015 page for the data: The Uniform Crime Reporting Program Data: Offenses Known and Clearances By Arrest dataset is a compilation of offenses reported to law enforcement agencies in the United States. Due to the vast number of categories of crime committed in the United States, the FBI has limited the type of crimes included in this compilation to those crimes which people are most likely to report to police and those crimes which occur frequently enough to be analyzed across time. Crimes included are criminal homicide, forcible rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft. Much information about these crimes is provided in this dataset. The number of times an offense has been reported, the number of reported offenses that have been cleared by arrests, and the number of cleared offenses which involved offenders under the age of 18 are the major items of information collected.
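As a hedged illustration of the column-name standardization described above (only the example pair from the text is shown; the standardized label used here is hypothetical, and the authoritative mapping is in the linked GitHub code):

import pandas as pd

# Example: harmonize year-specific column names to one standard label.
# Only the pair mentioned in the text is shown; the real mapping is much larger,
# and the standardized name below is a made-up placeholder.
rename_map = {
    "DEC_TOT_CLR_BRGLRY_TOT": "dec_tot_clr_burglary_total",  # 1975 spelling
    "DEC_TOT_CLR_BURG_TOTAL": "dec_tot_clr_burglary_total",  # 1977 spelling
}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    # Apply the mapping where known; otherwise just lowercase the column name.
    return df.rename(columns={c: rename_map.get(c, c.lower()) for c in df.columns})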
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Regional- and continental-scale models predicting variations in the magnitude and timing of streamflow are important tools for forecasting water availability as well as flood inundation extent and associated damages. Such models must define the geometry of stream channels through which flow is routed. These channel parameters, such as width, depth, and hydraulic resistance, exhibit substantial variability in natural systems. While hydraulic geometry relationships have been extensively studied in the United States, they remain unquantified for thousands of stream reaches across the country. Consequently, large-scale hydraulic models frequently take simplistic approaches to channel geometry parameterization. Over-simplification of channel geometries directly impacts the accuracy of streamflow estimates, with knock-on effects for water resource and hazard prediction.
Here, we present a hydraulic geometry dataset derived from long-term measurements at U.S. Geological Survey (USGS) stream gages across the conterminous United States (CONUS). This dataset includes (a) at-a-station hydraulic geometry parameters following the methods of Leopold and Maddock (1953), (b) at-a-station Manning's n calculated from the Manning equation, (c) daily discharge percentiles, and (d) downstream hydraulic geometry regionalization parameters based on HUC4 (Hydrologic Unit Code 4). This dataset is referenced in Heldmyer et al. (2022); further details and implications for CONUS-scale hydrologic modeling are available in that article (https://doi.org/10.5194/hess-26-6121-2022).
At-a-station Hydraulic Geometry
We calculated hydraulic geometry parameters using historical USGS field measurements at individual station locations. Leopold and Maddock (1953) derived the following power law relationships:
\(w=aQ^b\)
\(d=cQ^f\)
\(v=kQ^m\)
where Q is discharge, w is width, d is depth, v is velocity, and a, b, c, f, k, and m are at-a-station hydraulic geometry (AHG) parameters. We downloaded the complete record of USGS field measurements from the USGS NWIS portal (https://waterdata.usgs.gov/nwis/measurements). This raw dataset includes 4,051,682 individual measurements from a total of 66,841 stream gages within CONUS. Quantities of interest in AHG derivations are Q, w, d, and v. USGS field measurements do not include d--we therefore calculated d using d=A/w, where A is measured channel area. We applied the following quality control (QC) procedures in order to ensure the robustness of AHG parameters derived from the field data:
Application of the QC procedures described above removed 55,328 stream gages, many of which were short-term campaign gages at which very few field measurements had been recorded. We derived AHG parameters for the remaining 11,513 gages which passed our QC.
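For illustration, a minimal sketch of how AHG parameters can be fit from field measurements by least squares on log-transformed variables (the file and column names here are hypothetical; the authors' exact QC and fitting workflow is described above and in Heldmyer et al. 2022):

import numpy as np
import pandas as pd

# Hypothetical file/columns: discharge Q, width w, channel area A, velocity v,
# for the field measurements at a single gage.
meas = pd.read_csv("field_measurements_one_gage.csv")
meas["d"] = meas["A"] / meas["w"]  # depth from measured channel area, d = A / w

def power_law_fit(Q, y):
    """Fit y = coef * Q**exp by linear regression in log-log space."""
    exp, log_coef = np.polyfit(np.log(Q), np.log(y), deg=1)
    return np.exp(log_coef), exp

a, b = power_law_fit(meas["Q"], meas["w"])  # w = a * Q**b
c, f = power_law_fit(meas["Q"], meas["d"])  # d = c * Q**f
k, m = power_law_fit(meas["Q"], meas["v"])  # v = k * Q**m
print(a, b, c, f, k, m)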
At-a-station Manning's n
We calculated hydraulic resistance at each gage location by solving Manning's equation for Manning's n, given by
\(n = \frac{R^{2/3}S^{1/2}}{v}\)
where v is velocity, R is hydraulic radius and S is longitudinal slope. We used smoothed reach-scale longitudinal slopes from the NHDPlusv2 (National Hydrography Dataset Plus, version 2) ElevSlope data product. We note that NHDPlusv2 contains a minimum slope constraint of \(10^{-5}\) m/m--no reach may have a slope less than this value. Furthermore, NHDPlusv2 lacks slope values for certain reaches. As such, we could not calculate Manning's n for every gage, and some Manning's n values we report may be inaccurate due to the NHDPlusv2 minimum slope constraint. We report two Manning's n values, both of which take stream depth as an approximation for R. The first takes the median stream depth and velocity measurements from the USGS's database of manual flow measurements for each gage. The second uses stream depth and velocity calculated for a 50th percentile discharge (Q50; see below). Approximating R as stream depth is an assumption which is generally considered valid if the width-to-depth ratio of the stream is greater than 10—which was the case for the vast majority of field measurements. Thus, we report two Manning's n values for each gage, which are each intended to approximately represent median flow conditions.
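A small sketch of the Manning's n calculation as described, approximating the hydraulic radius R by depth (SI units assumed):

def mannings_n(depth_m: float, slope: float, velocity_ms: float) -> float:
    """n = R^(2/3) * S^(1/2) / v, with hydraulic radius R approximated by depth."""
    return depth_m ** (2.0 / 3.0) * slope ** 0.5 / velocity_ms

# Example with a median measured depth/velocity and an NHDPlusv2 reach slope.
print(mannings_n(depth_m=1.2, slope=1e-3, velocity_ms=0.8))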
Daily discharge percentiles
We downloaded full daily discharge records from 16,947 USGS stream gages through the NWIS online portal. The data includes records from both operational and retired gages. Records for operational gages were truncated at the end of the 2018 water year (September 30, 2018) in order to avoid use of preliminary data. To ensure the robustness of daily discharge percentiles, we applied the following QC:
We calculated discharge percentiles for each of the 10,871 gages which passed QC. Discharge percentiles were calculated at increments of 1% between Q1 and Q5, increments of 5% (e.g. Q10, Q15, Q20, etc.) between Q5 and Q95, increments of 1% between Q95 and Q99, and increments of 0.1% between Q99 and Q100 in order to provide higher resolution at the lowest and highest flows, which occur much less frequently.
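The percentile levels described above can be reproduced as follows (a sketch; the percentile estimation method is not stated in the text, so numpy's default interpolation is an assumption):

import numpy as np

# Percentile levels: 1% steps between Q1 and Q5 and between Q95 and Q99,
# 5% steps between Q5 and Q95, 0.1% steps between Q99 and Q100.
levels = np.unique(np.round(np.concatenate([
    np.arange(1, 5, 1),
    np.arange(5, 95, 5),
    np.arange(95, 99, 1),
    np.arange(99, 100.05, 0.1),
]), 1))

# For a QC'd daily discharge record daily_q (a 1-D array), the percentiles
# would then be: np.percentile(daily_q, levels)
print(levels)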
HG Regionalization
We regionalized AHG parameters from gage locations to all stream reaches in the conterminous United States. This downstream hydraulic geometry regionalization was performed using all gages with AHG parameters in each HUC4, as opposed to traditional downstream hydraulic geometry--which involves interpolation of parameters of interest to ungaged reaches on individual streams. We performed linear regressions on log-transformed drainage area and Q at a number of flow percentiles as follows:
\(\log(Q_i) = \beta_1\log(DA) + \beta_0\)
where \(Q_i\) is streamflow at percentile i, DA is drainage area, and \(\beta_1\) and \(\beta_0\) are regression parameters. We report \(\beta_1\), \(\beta_0\), and the r2 value of the regression relationship for Q percentiles Q10, Q25, Q50, Q75, Q90, Q95, Q99, and Q99.9. Further discussion and additional analysis of HG regionalization are presented in Heldmyer et al. (2022).
Dataset description
We present the HyG dataset in a comma-separated value (csv) format. Each row corresponds to a different USGS stream gage. Information in the dataset includes gage ID (column 1), gage location in latitude and longitude (columns 2-3), gage drainage area (from USGS; column 4), longitudinal slope of the gage's stream reach (from NHDPlusv2; column 5), AHG parameters derived from field measurements (columns 6-11), Manning's n calculated from median measured flow conditions (column 12), Manning's n calculated from Q50 (column 13), Q percentiles (columns 14-51), HG regionalization parameters and r2 values (columns 52-75), and geospatial information for the HUC4 in which the gage is located (from USGS; columns 76-87). Users are advised to exercise caution when opening the dataset. Certain software, including Microsoft Excel and Python, may drop the leading zeros in USGS gage IDs and HUC4 IDs if these columns are not explicitly imported as strings.
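Following the caution above, a pandas sketch that reads the ID-like columns as strings so leading zeros survive (the file name and column labels here are assumptions; check the header of the distributed csv):

import pandas as pd

# Read gage ID and HUC4 ID columns as strings so leading zeros are not dropped.
# The column labels below are assumptions about the distributed file.
hyg = pd.read_csv(
    "HyG.csv",
    dtype={"gage_id": str, "huc4": str},
)
print(hyg.dtypes.head())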
Errata
In version 1, drainage area was mistakenly reported in cubic meters but labeled in cubic kilometers. This error has been corrected in version 2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Avian predators avoid attacking fly-mimicking beetles: A field experiment on evasive mimicry using artificial prey

Many Neotropical beetles present coloration patterns mimicking red-eyed flies, which are presumably evasive mimicry models. However, the role of predators in selecting for evasive mimics in nature remains untested. In a field experiment, we used nontoxic plasticine replicas of a specialized fly-mimicking beetle species, which we placed on the host plants of the beetles. We show that replicas painted with reddish patches simulating the eyes of flesh flies experienced a much lower predation rate than control replicas. We found that beak marks were the most frequent signs of attack on plasticine replicas, underlining the potential selective pressure exerted by birds. Replicas that matched the size of the beetles suffered higher predation than smaller or larger replicas. The predation rate was also higher for beetle replicas exposed during the warm and wet season, when adult beetles occur. Our results support predator-mediated selection of mimic beetles, highlighting that reddish spots resembling flies' eyes comprise an important trait in reducing attack by avian predators.

This study describes the results of two field experiments using artificial beetle replicas made with nontoxic plasticine placed on host plants of a specialized fly-mimicking beetle species. Experiment #1 manipulated the coloration pattern of beetle replicas to test the prediction that replicas with red patches would suffer fewer attacks by avian predators than control replicas not resembling red-eyed flies. Experiment #2 evaluated three specific predictions. First, attack marks corresponding to beaks should be frequently observed on the beetle replicas. Second, we predicted that replicas of intermediate size (0.5 – 1.0 cm) should be attacked by birds more than those in the extreme size classes (0.25 – 2.0 and 4.0 cm). Third, we predicted that avian attacks on replicas should be higher at the peak of the wet season than during the peak of the dry season.

The file "Dataset_Avian predators avoid attacking fly-mimicking" contains all data.

The spreadsheet Experiment_#1_raw contains seven columns and 361 lines, where each line represents a single plasticine replica. The first column (#) designates the plasticine replica number within each sample unit. The second column (Site) designates the area where blocks were established. The third column (Block) designates the different locations where sample units (groups of 10 plasticine replicas of each treatment) were subject to the same spatial conditions. The fourth column (PlantTag) designates the individual mistletoes where replicas were placed. The fifth column (Treatment) designates the coloration pattern of the plasticine replicas, classified as control, bad (brown-eyed) and good (red-eyed). The sixth column (Attack) designates the occurrence of attack on replicas, classified according to the vestiges left on the plasticine (bird - V or U-shaped beak marks; insects - small mouthpart perforations; mammal - teeth marks; missing - plasticine removed).

The spreadsheet Experiment_#1 contains five columns and 37 lines, where each line represents a sample unit, i.e. a group of ten plasticine replicas of the same color pattern (Treatment) within a block. The first column (Site) designates the area where blocks were established. The second column (Block) designates the different locations of the groups of 10 plasticine replicas of each treatment. The third column (Treatment) designates the color pattern of the plasticine replicas. The fourth column (Total) designates the valid number of replicas in each sample unit. The fifth column (Attacked) designates the number of replicas with evidence of bird attack.

The spreadsheet Experiment_#2_raw contains eight columns and 1201 lines, where each line represents a single plasticine replica. The first column (#) designates the plasticine replica number within each sample unit. The second column (Site) designates the area where blocks were established. The third column (Block) designates the different locations where sample units (groups of 10 plasticine replicas of each treatment) were subject to the same spatial conditions. The fourth column (PlantTag) designates the individual trees where replicas were placed. The fifth column (Season) designates the period when the experiment was conducted. The sixth column (Size) designates the size class of the plasticine replicas. The seventh column (Attack) designates the occurrence of attack on replicas, classified according to the vestiges left on the plasticine (bird - V or U-shaped beak marks; insects - small mouthpart perforations; mammal - teeth marks; missing - plasticine removed).

The spreadsheet Experiment_#2 contains seven columns and 121 lines, where each line represents a sample unit, i.e. a group of ten plasticine replicas of the same size within a block in each period. The first column (Site) designates the area where blocks were established. The second column (Block) designates the different locations of the groups of 10 plasticine replicas of each treatment. The third column (Season) designates the period when the experiment was conducted. The fourth column (Size) designates the size class of the plasticine replicas. The fifth column (Total) designates the valid number of replicas in each sample unit. The sixth column (Attacked) designates the number of replicas with evidence of bird attack.

The spreadsheet Coordinates contains the blocks' geographical coordinates and elevation data.

We analyzed data using R software (R Core Team, 2019). We employed generalized linear mixed effects models (GLMMs, glmer for non-normal datasets, with the lme4 package in R) with fixed and random effects to analyze the datasets of replica attacks for experiments #1 and #2. In experiment #1, the explanatory variable Treatment encompassed replica coloration with three levels (control – no paint, bad – brown-eyed, and good – red-eyed) and was considered a fixed factor, with Site and Block assigned as random effects. The response variable was the proportion of replicas exclusively attacked by avian predators after 14 days of exposure, calculated by dividing the variable Attacked by the variable Total. For this analysis we used the data of the spreadsheet Experiment_#1, which was transformed into a TXT file named "Attack" to be used in R.

In experiment #2, the explanatory variables Size (replica size classes, five levels: 0.25 – 0.5 – 1.0 – 2.0 – 4.0 cm) and Season (two levels: dry and wet) and the interaction between these factors were considered fixed factors, whereas Site and Block were assigned as random effects accounting for the spatial heterogeneity of samples. The response variable was the proportion of replicas exclusively attacked by avian predators after 14 days of exposure, calculated by dividing the variable Attacked by the variable Total. For this analysis we used the data of the spreadsheet Experiment_#2, which was transformed into a TXT file named "Size" to be used in R.

The datasets of both experiments fitted a binomial error distribution of the response variable. We selected the minimal models after the removal of non-significant variables (P-value > 0.05). If we detected significant differences in variables with more than two levels (Size or Treatment), we performed contrast analysis to determine differences among levels. In both minimal models we checked the error distribution and data overdispersion.

Figures 3 and 4ab were generated in Excel using absolute counts or calculated means and standard errors. The raw data underlying Figures 3, 4a and 4b are on the spreadsheets with the same names (Figure 3, Figure 4a, Figure 4b) in the file "Dataset_Avian predators avoid attacking fly-mimicking", where the figures and the underlying formulas are located. Double-click on the figures to view how the data generated the figure. The figures were pasted into PowerPoint for the inclusion of letters and photos of the replicas. Slides were then exported as .jpeg files.
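The statistical models were fitted in R with lme4::glmer, as described above. Purely as a minimal illustration of how the response variable is constructed from the Experiment_#1 spreadsheet, a pandas sketch (the workbook file extension is assumed):

import pandas as pd

# Sheet "Experiment_#1": one row per sample unit (group of 10 replicas),
# with columns Site, Block, Treatment, Total, Attacked.
exp1 = pd.read_excel(
    "Dataset_Avian predators avoid attacking fly-mimicking.xlsx",  # extension assumed
    sheet_name="Experiment_#1",
)

# Response variable used in the GLMM: proportion of replicas attacked by birds.
exp1["prop_attacked"] = exp1["Attacked"] / exp1["Total"]
print(exp1.groupby("Treatment")["prop_attacked"].mean())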
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Latin Inscriptions in Space and Time (LIST) dataset is an aggregate of two epigraphic datasets, the Epigraphic Database Heidelberg (EDH; https://edh.ub.uni-heidelberg.de/; aggregated EDH on Zenodo) and the Epigraphic Database Clauss Slaby (EDCS; http://www.manfredclauss.de/; aggregated EDCS on Zenodo), created by the Social Dynamics in the Ancient Mediterranean Project (SDAM), 2019-2023, funded by the Aarhus University Forskningsfond Starting grant no. AUFF-E-2018-7-2.

The LIST dataset consists of 525,870 inscriptions, enriched by 65 attributes. 77,091 inscriptions overlap between the two source datasets (i.e. EDH and EDCS); 3,316 inscriptions are exclusively from EDH; 445,463 inscriptions are exclusively from EDCS. 511,973 inscriptions have valid geospatial coordinates (the geometry attribute). This information is also used to determine the urban context of each inscription, i.e. whether it is in the neighbourhood of (within a 5000 m buffer around) a large city, medium city, or small city, or rural (>5000 m to any type of city); see the attributes urban_context, urban_context_city, and urban_context_pop. 206,570 inscriptions have a numerical date of origin expressed by means of an interval or a singular year using the attributes not_before and not_after. The dataset also employs a machine learning model to classify the inscriptions covered exclusively by EDCS in terms of 22 categories employed by EDH; see Kaše, Heřmánková, Sobotkova 2021.
Formats
We publish the dataset in the parquet and geojson file formats. A description of the individual attributes is available in the Metadata.csv. Using the geopandas library, you can load the data directly from Zenodo into your Python environment with the following command: LIST = gpd.read_parquet("https://zenodo.org/record/8431323/files/LIST_v1-0.parquet?download=1"). In R, the sfarrow and sf libraries provide tools (st_read_parquet(), read_sf()) to load a parquet and a geojson file respectively after you have downloaded the datasets locally. The scripts used to generate the dataset are available via GitHub: https://github.com/sdam-au/LI_ETL
The origin of existing attributes is further described in columns ‘dataset_source’, ‘source’, and ‘description’ in the attached Metadata.csv.
Further reading on the dataset creation and methodology:
Reading on applications of the datasets in research:
Notes on spatial attributes
Machine-readable spatial point geometries are provided within the geojson and parquet formats, as well as 'Latitude' and 'Longitude' columns, which contain geospatial decimal coordinates where these are known. Additional attributes exist that contain textual references to the original location at different scales. The most reliable attribute with textual information on the place of origin is urban_context_city. This contains the ancient toponym of the largest city within a 5 km distance from the inscription findspot, using cities from Hanson's 2016 list. After these universal attributes, the remaining columns are source-dependent and exist only for either the EDH or the EDCS subset. The 'pleiades_id' column, for example, cross-references the inscription findspot to a geospatial location in the Pleiades, but only in the EDH subset. The 'place' attribute exists for data from EDCS (Ort) and contains ancient as well as modern place names referring to the findspot or region of provenance, separated by "/". This column requires additional cleaning before computational analysis. Attributes with the _clean suffix indicate that the text string has been stripped of symbols (such as ?), and most refer to aspects of provenance in the EDH subset of inscriptions.
List of all spatial attributes:
Disclaimer
The original data is provided by the third party indicated as the data source (see the ‘data_source’ column in the Metadata.csv). SDAM did not create the original data, vouch for its accuracy, or guarantee that it is the most recent data available from the data provider. For many or all of the data, the data is by its nature approximate and will contain some inaccuracies or missing values. The data may contain errors introduced by the data provider(s) and/or by SDAM. We always recommend checking the accuracy directly in the primary source, i.e. the editio princeps of the inscription in question.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List
available_habitat.txt - data file representing the available habitat
gps_locations.txt - data file representing the GPS location data
Rcode.R - R code to analyze the example data using the RSF likelihood for GPS fix success
all_files.zip - all files at once

Description
Rcode.R analyzes the example data, which is one of the simulated data sets with 90% GPS fix success contained in the Nielson et al. paper. There are two data files (available_habitat.txt and gps_locations.txt) representing the available habitat and the GPS location data, respectively. The example data can be analyzed by saving both data files to a working directory, opening an R session and copying and pasting all text below at the R command prompt. Select quantities are output to the terminal. The description of the columns in the data files is provided below.

In available_habitat.txt: (1) utmX = UTM easting coordinate of habitat unit; (2) utmY = UTM northing coordinate of habitat unit; (3) unit.id = habitat unit ID; (4) prcnt.sage = % Wyoming Big Sage; (5) elevation = elevation (km).

In gps_locations.txt: (1) unit.id = habitat unit of GPS location (missing = NA); (2) fix.attempt = sequential fix attempt number.

Column sums for available_habitat.txt (in order): utmX = 4.985109e+7; utmY = 8.483487e+8; unit.id = 1.7205e+4; prcnt.sage = 1.06875e+4; elevation = 3.795941e+2.

Column sums for gps_locations.txt (in order): unit.id = 42761 (note some "NA" values, which indicate missing); fix.attempt = 98790.
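As a quick way to verify a download against the stated column sums, a Python sketch (whitespace-delimited plain text with a header row is an assumption about the file format):

import pandas as pd

hab = pd.read_csv("available_habitat.txt", sep=r"\s+")  # delimiter assumed
gps = pd.read_csv("gps_locations.txt", sep=r"\s+")

# Compare against the column sums listed above; NA values are skipped by default.
print(hab[["utmX", "utmY", "unit.id", "prcnt.sage", "elevation"]].sum())
print(gps["unit.id"].sum(), gps["fix.attempt"].sum())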
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset presents the code written for the analysis and modelling for the Jellyfish Forecasting System for NESP TWQ Project 2.2.3. The Jellyfish Forecasting System (JFS) searches for robust statistical relationships between historical sting events (and observations) and local environmental conditions. These relationships are tested using data to quantify the underlying uncertainties. They then form the basis for forecasting risk levels associated with current environmental conditions.
The development of the JFS modelling and analysis is supported by the Venomous Jellyfish Database (sting events and specimen samples – November 2018) (NESP 2.2.3, CSIRO) with corresponding analysis of wind fields and tidal heights along the Queensland coastline. The code has been calibrated and tested for the study focus regions including Cairns (Beach, Island, Reef), Townsville (Beach, Island+Reef) and Whitsundays (Beach, Island+Reef).
The JFS uses the European Centre for Medium-Range Weather Forecasts (ECMWF) wind fields from the ERA-Interim daily product (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim). This daily product has global coverage at a spatial resolution of approximately 80 km. However, only 11 locations off the Queensland coast were extracted, covering the period 1-Jan-1985 to 31-Dec-2016. For the modelling, the data has been transformed into CSV files containing date, eastward wind (m/s) and northward wind (m/s) for each of the 11 geographical locations.
Hourly tidal height was calculated from tidal harmonics supplied by the Bureau of Meteorology (http://www.bom.gov.au/oceanography/projects/ntc/ntc.shtml) using the XTide software (http://www.flaterco.com/xtide/). Hourly tidal heights have been calculated for 7 sites along the Queensland coast (Albany Island, Cairns, Cardwell, Cooktown, Fife, Grenville, Townsville) for the period 1-Jan-1985 to 31-Dec-2017. Data has been transformed into CSV files, one for each of the 7 sites. Columns correspond to number of days since 1-Jan 1990 and tidal height (m).
Irukandji stings were then modelled using a generalised linear model (GLM). A GLM generalises ordinary linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value (McCullagh & Nelder 1989). For each region, we used a GLM with the number of Irukandji stings per day as the response variable. The GLM had a Poisson error structure and a log link function (Crawley 2005). For the Poisson GLMs, we inferred absences when stings were not recorded in the data for a day. We consider that there was reasonably consistent sampling effort in the database since 1985, but very patchy prior to this date. It should be noted that Irukandji are very patchy in time; for example, there was a single sting record in 2017 despite considerable effort trying to find stings in that year. Although the database might miss small and localised Irukandji sting events, we believe it captures larger infestation events.
We included six predictors in the models: Month, two wind variables, and three tidal variables. Month was a factor and arranged so that the summer was in the middle of the year (i.e., from June to May). The two wind variables were Speed and Direction. For each day within each region (Cairns, Townsville or Whitsundays), hourly wind-speed and direction was used. We derived cumulative wind Speed and Direction, working backwards from each day, with the current day being Day 1. We calculated cumulative winds from the current day (Day 1) to 14 days previously for every day in every Region and Area. To provide greater weighting for winds on more recent days, we used an inverse weighting for each day, where the weighting was given by 1/i for each day i. Thus, the Cumulative Speed for n days is given by:
\(\text{Cumulative Speed}_n = \frac{\sum_{i=1}^{n} \text{Speed}_i / i}{\sum_{i=1}^{n} 1/i}\)
For example, calculations for the 3-day cumulative wind speed are:
\(\frac{1/1 \times \text{Wind Day 1} + 1/2 \times \text{Wind Day 2} + 1/3 \times \text{Wind Day 3}}{1/1 + 1/2 + 1/3}\)
Similarly, we calculated the cumulative weighted wind Direction using the formula:
\(\text{Cumulative Direction}_n = \frac{\sum_{i=1}^{n} \text{Direction}_i / i}{\sum_{i=1}^{n} 1/i}\)
We used circular statistics in the R package circular to calculate the weighted cumulative mean, because a direction of 0° is the same as 360°. We initially used a smoother for this term in the model, but because of its non-linearity and the lack of winds from all directions, we found that it was better to use wind Direction as a factor with four levels (NW, NE, SE and SW). In some Regions and Areas, not all wind Directions were present.
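As a sketch of the weighted cumulative speed formula above (the original implementation is in R; the circular mean needed for Direction is not shown here):

import numpy as np

def cumulative_speed(speeds, n):
    """Inverse-day-weighted cumulative wind speed over the last n days.

    speeds[0] is the current day (Day 1), speeds[1] the day before, and so on.
    """
    days = np.arange(1, n + 1)
    weights = 1.0 / days
    return np.sum(np.asarray(speeds[:n]) * weights) / np.sum(weights)

# 3-day example matching the worked calculation above:
print(cumulative_speed([10.0, 6.0, 3.0], n=3))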
To assign each event to the tidal cycle, we used tidal data from the closest of the seven stations to calculate three tidal variables: (i) the tidal range each day (m); (ii) the tidal height (m); and (iii) whether the tide was incoming or outgoing. To estimate the three tidal variables, the time of day of the event was required. However, the Time of Day was only available for 780 observations, and the 291 missing observations were estimated assuming a random Time of Day, which will not influence the relationship but will keep these rows in the analysis. Tidal range was not significant in any models and will not be considered further.
To focus on times when Irukandji were present, months when stings never occurred in an area/region were excluded from the analysis – this is generally the winter months. For model selection, we used Akaike Information Criterion (AIC), which is an estimate of the relative quality of models given the data, to choose the most parsimonious model. We thus do not talk about significant predictors, but important ones, consistent with information theoretic approaches.
Limitations: It is important to note that while the presence of Irukandji is more likely on high risk days, the forecasting system should not be interpreted as predicting the presence of Irukandji or that stings will occur.
Format:
It is a text file with a .r extension, the default code format in R. This code runs on the csv datafile “VJD_records_EXTRACT_20180802_QLD.csv” that has latitude, longitude, date, and time of day for each Irukandji sting on the GBR. A subset of these data have been made publicly available through eAtlas, but not all data could be made publicly available because of permission issues. For more information about data permissions, please contact Dr Lisa Gershwin (lisa.gershwin@stingeradvisor.com).
Data Location:
This dataset is filed in the eAtlas enduring data repository at: data\custodian\2016-18-NESP-TWQ-2\2.2.3_Jellyfish-early-warning\data\ and https://github.com/eatlas/NESP_2.2.3_Jellyfish-early-warning
The data set in this accession contains 100 stations of hydrographic data collected in the northeast Atlantic, south of the Azores, aboard R/V ENDEAVOR, cruise #143. The dates of the data are May 1-19, 1987. Measurements of two dissolved chlorofluorocarbons, CCl3F (Freon 11) and CCl2F2 (Freon 12), were obtained at a number of stations along the cruise track. Data format: the first three columns are CTD pressure (dbar), depth (meters) and CTD temperature (Deg C) at which each water sample was collected. These columns are followed by the water sample salinity (o/oo), dissolved oxygen (ml/l), the calculated variable potential temperature (Deg C), Freon 11 (pmol/kg) and Freon 12 (pmol/kg). Missing values are indicated with -9.000. The data were provided by Dr. T. Joyce, Woods Hole Oceanographic Institution.