Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

The SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files; for the BNC, the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can easily be collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
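As a quick illustration of querying the database from R, the following is a minimal sketch using RSQLite. The column names come from the table layout described above; the table name "ngrams" is an assumption, since the actual table names are defined in create-database.sql.

# Minimal sketch: top second-order isograms from the (assumed) "ngrams" table
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "isograms.db")
dbGetQuery(con, "
  SELECT word, length, count, is_palindrome
  FROM ngrams
  WHERE isogramy = 2
  ORDER BY count DESC
  LIMIT 20
")
dbDisconnect(con)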
The segment counts by social group and species or species group for the Waterfowl Breeding Population and Habitat Survey and associated segment effort information. Three data files are included with their associated metadata (html and xml formats). Segment counts are summed counts of waterfowl per segment and are separated into two files, described below, along with the effort table needed to analyze recent segment count information.
wbphs_segment_counts_1955to1999_forDistribution.csv represents the period prior to the collection of geolocated data. There is no associated effort file for these counts; segments with zero birds are included in the segment counts table, so effort can be inferred. There is no information to determine the proportion of each segment surveyed for this period, and it must be presumed they were surveyed completely. Number of rows in table = 1,988,290.
wbphs_segment_counts_forDistribution.csv contains positive segment records only, by species or species group, beginning with 2000. The wbphs_segment_effort_forDistribution.csv file is important for this segment counts file and can be used to infer zero-value segments, by species or species group. Number of rows in table = 381,402.
wbphs_segment_effort_forDistribution.csv contains the segment survey effort and location from the Waterfowl Breeding Population and Habitat Survey beginning with 2000. If a segment was not flown, it is absent from the table for the corresponding year. Number of rows in table = 67,874.
Also included here is a small R code file, createSingleSegmentCountTable.R, which can be run to format the 2000+ data to match the 1955-1999 format and combine the data over the two time periods. Please consult the metadata for an explanation of the fields and other information to understand the limitations of the data.
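For users working in R, the following is a minimal sketch of how zero-value segments could be inferred for the 2000+ counts by joining them onto the effort table. It is not createSingleSegmentCountTable.R itself, and the column names (year, segment, species, count) are assumptions; consult the metadata for the actual field names.

# Minimal sketch (assumed column names): expand the effort table by species and
# fill in zeros for surveyed segments with no positive count record
library(dplyr)
library(tidyr)

counts <- read.csv("wbphs_segment_counts_forDistribution.csv")
effort <- read.csv("wbphs_segment_effort_forDistribution.csv")

full_counts <- effort %>%
  crossing(species = unique(counts$species)) %>%
  left_join(counts, by = c("year", "segment", "species")) %>%
  mutate(count = replace_na(count, 0))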
https://spdx.org/licenses/CC0-1.0.html
Despite strong interest in how noise affects marine mammals, little is known about the most abundant and commonly exposed taxa. Social delphinids occur in groups of hundreds of individuals that travel quickly, change behavior ephemerally, and are not amenable to conventional tagging methods, posing challenges in quantifying noise impacts. We integrated drone-based photogrammetry, strategically placed acoustic recorders, and broad-scale visual observations to provide complementary measurements of different aspects of behavior for short- and long-beaked common dolphins. We measured behavioral responses during controlled exposure experiments (CEEs) of military mid-frequency (3-4 kHz) active sonar (MFAS) using simulated and actual Navy sonar sources. We used latent-state Bayesian models to evaluate response probability and persistence in exposure and post-exposure phases. Changes in sub-group movement and aggregation parameters were commonly detected during different phases of MFAS CEEs but not control CEEs. Responses were more evident in short-beaked common dolphins (n=14 CEEs), and a direct relationship between response probability and received level was observed. Long-beaked common dolphins (n=20) showed less consistent responses, although contextual differences may have limited which movement responses could be detected. These are the first experimental behavioral response data for these abundant dolphins to directly inform impact assessments for military sonars.
Methods
We used complementary visual and acoustic sampling methods at variable spatial scales to measure different aspects of common dolphin behavior in known and controlled MFAS exposure and non-exposure contexts. Three fundamentally different data collection systems were used to sample group behavior. A broad-scale visual sampling of subgroup movement was conducted using theodolite tracking from shore-based stations. Assessments of whole-group and sub-group sizes, movement, and behavior were conducted at 2-minute intervals from shore-based and vessel platforms using high-powered binoculars and standardized sampling regimes. Aerial UAS-based photogrammetry quantified the movement of a single focal subgroup. The UAS consisted of a large (1.07 m diameter) custom-built octocopter drone launched and retrieved by hand from vessel platforms. The drone carried a vertically gimballed camera (at least 16MP) and sensors that allowed precise spatial positioning, allowing spatially explicit photogrammetry to infer movement speed and directionality. Remote-deployed (drifting) passive acoustic monitoring (PAM) sensors were strategically deployed around focal groups to examine both basic aspects of subspecies-specific common dolphin acoustic (whistling) behavior and potential group responses in whistling to MFAS on variable temporal scales (Casey et al., in press). This integration allowed us to evaluate potential changes in movement, social cohesion, and acoustic behavior and their covariance associated with the absence or occurrence of exposure to MFAS. The collective raw data set consists of several GB of continuous broadband acoustic data and hundreds of thousands of photogrammetry images.
Three sets of quantitative response variables were analyzed from the different data streams: directional persistence and variation in speed of the focal subgroup from UAS photogrammetry; group vocal activity (whistle counts) from passive acoustic records; and number of sub-groups within a larger group being tracked by the shore station overlook. We fit separate Bayesian hidden Markov models (HMMs) to each set of response data, with the HMM assumed to have two states: a baseline state and an enhanced state that was estimated in sequential 5-s blocks throughout each CEE. The number of subgroups was recorded during periodic observations every 2 minutes and assumed constant across time blocks between observations. The number of subgroups was treated as missing data 30 seconds before each change was noted to introduce prior uncertainty about the precise timing of the change. For movement, two parameters relating to directional persistence and variation in speed were estimated by fitting a continuous time-correlated random walk model to spatially explicit photogrammetry data in the form of location tracks for focal individuals that were sequentially tracked throughout each CEE as a proxy for subgroup movement.
Movement parameters were assumed to be normally distributed. Whistle counts were treated as normally distributed but truncated to positive values, because negative counts are not possible. Subgroup counts were assumed to be Poisson distributed, as they were discrete, small values. In all cases, the response variable mean was modeled as a function of the HMM with a log link:
log(Response_t) = l0 + l1 * Z_t
where at each 5-s time block t, the hidden state took values of Z_t = 0 to identify one state with a baseline response level l0, or Z_t = 1 to identify an "enhanced" state, with l1 representing the enhancement of the quantitative value of the response variable. A flat uniform (-30, 30) prior distribution was used for l0 in each response model, and a uniform (0, 30) prior distribution was adopted for each l1 to constrain enhancements to be positive. For whistle and subgroup counts, the enhanced state indicated increased vocal activity and more subgroups. A common indicator variable was estimated for the latent state for both movement parameters, such that switching to the enhanced state described less directional persistence and more variation in velocity. Speed was derived as a function of these two parameters and was used here as a proxy for their joint responses, representing directional displacement over time.
To assess differences in the behavior states between experimental phases, the block-specific latent states were modeled as a function of phase-specific probabilities, Z_t ~ Bernoulli(p_phase[t]), to learn about the probability p_phase of being in an enhanced state during each phase. For each of the pre-exposure, exposure, and post-exposure phases, this probability was assigned a flat uniform (0, 1) prior. The model was programmed in R (R version 3.6.1; The R Foundation for Statistical Computing) with the nimble package (de Valpine et al. 2020) to estimate posterior distributions of model parameters using Markov chain Monte Carlo (MCMC) sampling. Inference was based on 100,000 MCMC samples following a burn-in of 100,000, with chain convergence determined by visual inspection of three MCMC chains and corroborated by convergence diagnostics (Brooks and Gelman, 1998). To compare behavior across phases, we compared the posterior distributions of the p_phase parameters for each response variable, specifically by monitoring the MCMC output to assess the "probability of response" as the proportion of iterations for which p_exposure was greater or less than p_pre-exposure, and the "probability of persistence" as the proportion of iterations for which p_post-exposure was greater or less than p_pre-exposure. These probabilities of response and persistence thus estimated the extent of separation (non-overlap) between the distributions of pairs of p_phase parameters: if the two distributions of interest were identical, then p=0.5, and if the two were non-overlapping, then p=1. Similarly, we estimated the average values of the response variables in each phase by predicting phase-specific functions of the parameters:
Mean.response_phase = exp(l0 + l1 * p_phase)
and simply derived average speed as the mean of the speed estimates for 5-second blocks in each phase.
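The following is a minimal sketch, in R with the nimble package, of the kind of two-state latent model described above for a single response variable. It is not the authors' code: the data values, block structure, and the choice of a normally distributed response are illustrative assumptions.

# Minimal sketch of a two-state (baseline/enhanced) latent model with
# phase-specific state probabilities and a log link on the response mean.
library(nimble)

hmmCode <- nimbleCode({
  l0 ~ dunif(-30, 30)            # baseline level (log scale)
  l1 ~ dunif(0, 30)              # positive enhancement in the enhanced state
  sigma ~ dunif(0, 10)           # residual SD (assumed)
  for (ph in 1:3) {              # 1 = pre-exposure, 2 = exposure, 3 = post-exposure
    pPhase[ph] ~ dunif(0, 1)     # probability of the enhanced state in each phase
  }
  for (t in 1:nBlocks) {         # sequential 5-s blocks within a CEE
    Z[t] ~ dbern(pPhase[phase[t]])     # latent state of block t
    log(mu[t]) <- l0 + l1 * Z[t]       # log-link response mean
    y[t] ~ dnorm(mu[t], sd = sigma)    # e.g. a movement parameter
  }
})

constants <- list(nBlocks = 120, phase = rep(1:3, each = 40))  # hypothetical design
data <- list(y = rnorm(120, mean = 1, sd = 0.3))               # simulated response
inits <- list(l0 = 0, l1 = 1, sigma = 1, pPhase = rep(0.5, 3), Z = rep(0, 120))

samples <- nimbleMCMC(code = hmmCode, constants = constants, data = data,
                      inits = inits, monitors = c("l0", "l1", "pPhase"),
                      niter = 20000, nburnin = 10000)

# "Probability of response": proportion of samples where pPhase[2] exceeds pPhase[1]
mean(samples[, "pPhase[2]"] > samples[, "pPhase[1]"])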
RNA-seq gene count datasets built using the raw data from 18 different studies. The raw sequencing data (.fastq files) were processed with Myrna to obtain tables of counts for each gene. For ease of statistical analysis, each count table was combined with sample phenotype data to form an R object of class ExpressionSet. The count tables, ExpressionSets, and phenotype tables are ready to use and freely available. By taking care of several preprocessing steps and combining many datasets into one easily accessible website, we make finding and analyzing RNA-seq data considerably more straightforward.
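For context, each study's ExpressionSet can be loaded and inspected in R with Bioconductor's Biobase package. The following is a minimal sketch; the .RData file name is a placeholder for whichever study is downloaded.

# Minimal sketch: load an ExpressionSet and pull out counts and phenotype data
library(Biobase)

loaded <- load("some_study_eset.RData")   # hypothetical file name; load() returns object name(s)
eset <- get(loaded[1])

counts <- exprs(eset)   # gene-by-sample count matrix
pheno  <- pData(eset)   # sample phenotype table
dim(counts)
head(pheno)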
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).
Resources in this dataset:
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format. File Name: PBMC7_AllCells.zip. Resource Description: Zipped folder containing the PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gz), gene names (features.tsv.gz), cell IDs (barcodes.tsv.gz). *The 'raw' count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are non-integer values in this matrix, but they should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata. File Name: PBMC7_AllCells_meta.csv. Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell; nFeature_RNA = the number of genes detected in a cell; Loupe = cell barcodes, corresponding to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells; prcntMito = percent mitochondrial reads in a cell; Scrublet = doublet probability score assigned to a cell; seurat_clusters = cluster ID assigned to a cell; PaperIDs = sample ID for a cell; celltypes = cell type ID assigned to a cell.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates. File Name: PBMC7_AllCells_PCAcoord.csv. Resource Description: .csv file containing the first 100 PCA coordinates for cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates. File Name: PBMC7_AllCells_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for all cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates. File Name: PBMC7_AllCells_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for all cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates. File Name: PBMC7_CD4only_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28).
A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in the publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates. File Name: PBMC7_CD4only_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in the publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates. File Name: PBMC7_GDonly_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in the publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates. File Name: PBMC7_GDonly_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in the publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information. File Name: UnfilteredGeneInfo.txt. Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. The 'Name' column corresponds to the name assigned to a feature in the dataset.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat. File Name: PBMC7.tar. Resource Description: .h5Seurat object of all cells in the PBMC dataset. The file needs to be untarred, then read into R using the function LoadH5Seurat().
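The following is a minimal sketch of reconstructing the dataset in R from the files described above, using Seurat and SeuratDisk. The file paths, the name of the .h5Seurat file inside PBMC7.tar, and the layout of the coordinate .csv files (cell IDs in the first column) are assumptions.

# Minimal sketch: rebuild a Seurat object from the 10X-formatted counts and metadata
library(Seurat)
library(SeuratDisk)

counts <- Read10X("PBMC7_AllCells/")                               # unzipped PBMC7_AllCells.zip
meta   <- read.csv("PBMC7_AllCells_meta.csv", row.names = "Loupe")
pbmc   <- CreateSeuratObject(counts = counts, meta.data = meta)

# Alternatively, load the full processed object directly (after untarring PBMC7.tar):
# pbmc <- LoadH5Seurat("PBMC7.h5Seurat")

# Re-attach the published t-SNE coordinates
tsne <- as.matrix(read.csv("PBMC7_AllCells_tSNEcoord.csv", row.names = 1))
pbmc[["tsne"]] <- CreateDimReducObject(embeddings = tsne, key = "tSNE_", assay = "RNA")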
This data package was produced by researchers working on the Shortgrass Steppe Long Term Ecological Research (SGS-LTER) Project, administered at Colorado State University. Long-term datasets and background information (proposals, reports, photographs, etc.) on the SGS-LTER project are contained in a comprehensive project collection within the Digital Collections of Colorado (http://digitool.library.colostate.edu/R/?func=collections&collection_id=3429). The data table and associated metadata document, which is generated in Ecological Metadata Language, may be available through other repositories serving the ecological research community and represent components of the larger SGS-LTER project collection. Additional information and referenced materials can be found at http://hdl.handle.net/10217/83448.
Rabbits are the most important small-mammal herbivores in shortgrass steppe and may significantly influence the physiognomy and population dynamics of herbaceous plants and woody shrubs. Rabbits also are the most important prey of mammalian carnivores such as coyotes and of large raptors such as golden eagles and great horned owls. Two hares (Lepus californicus, L. townsendii) and one cottontail rabbit (Sylvilagus audubonii) occur in shortgrass steppe. In 1994, we initiated long-term studies to track changes in relative abundance of rabbits on the Central Plains Experimental Range (CPER). On four nights each year (one night each season, usually on new-moon nights in January, April, July, October), we drove a 32-km route consisting of pasture two-track and gravel roads on the CPER. This was the same route as that driven for carnivore scat counts. Surveys began at twilight. Observers with two spotlights sat in the back of a 4WD pick-up driven at
Resources in this dataset: Resource Title: Website Pointer to html file. File Name: Web Page, url: https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sgs&identifier=136 Webpage with information and links to data files for download
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Your one-stop shop for all things active transport.
Active transport provides tangible benefits by increasing daily physical activity levels and reducing greenhouse gas emissions through a reduction in cars on the road. Other benefits include improved social well-being and a greater sense of community.
This data set contains links to the various data sets available on the Open Data Hub that relate to Active Transport.
* Pop Up Cycleway
* Cycling Propensity
* Cycling Count
* Cycle Network - City of Sydney
* Cycleway Data
* Sydney Spring Cycle 2017 - Road Closures
* Smart Pedestrian Project
* Active Transport: Walking
* Smart Cities Macquarie Park
* Walking Count Sites
* Eurobodalla Shire Council Cycleway
* UNSW Bicycling Dashboards
• Load and view a real-world dataset in RStudio
• Calculate “Measure of Frequency” metrics
• Calculate “Measure of Central Tendency” metrics
• Calculate “Measure of Dispersion” metrics
• Use R’s in-built functions for additional data quality metrics
• Create a custom R function to calculate descriptive statistics on any given dataset
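A minimal sketch of the kind of custom function the last objective describes is shown below; the example dataset (R's built-in mtcars) and the particular metrics chosen are illustrative.

# Minimal sketch: descriptive statistics for every numeric column of a data frame
describe_column <- function(x) {
  x <- x[!is.na(x)]
  c(n      = length(x),                 # measure of frequency
    mean   = mean(x),                   # measures of central tendency
    median = median(x),
    sd     = sd(x),                     # measures of dispersion
    IQR    = IQR(x),
    range  = diff(range(x)))
}

round(sapply(mtcars, describe_column), 2)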
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is a record of bird species recorded in Arabuko-Sokoke Forest. The data were collected using the point count method over a number of days, from 18 to 24 June 2017. Point count data for birds are mostly used for estimating population sizes and for detecting changes in population sizes; at times the method is also used to determine species diversity and prepare checklists. This dataset also includes a small list of owl species recorded specifically in the Cynometra forest within Arabuko-Sokoke Forest.
Please see the methods description in our associated publication.
The Santa Barbara Channel MBON tracks long-term patterns in species abundance and diversity. This dataset contains counts of fish (including cryptic fish, which are deliberately sought out) produced by integrating data from four contributing projects working in the kelp forests of the Santa Barbara Channel, USA.
This dataset includes three entities: two data tables and R code. The main data table contains counts of organisms, the area over which that number was counted, and the height above the bottom. Data were collected by human observation (divers using SCUBA) during regular surveys. The column labeled "count" records the number of organisms found in each plot/transect at a given timestamp. A second data table contains place names and geolocation for sampling sites. Information is sufficient for the calculation of fish density, which is left to the user. Sample R code is included (third entity) to illustrate generation of a basic table of areal density by taxa and sampling site. See Methods for information on integration and data processing.
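As a sketch of the density calculation left to the user, the R lines below divide each count by the area sampled and average by taxon and site. The file name and the area, site, and taxon column names are assumptions; only the "count" column name is given above.

# Minimal sketch (assumed column names): areal density by taxon and site
library(dplyr)

fish <- read.csv("mbon_fish_counts.csv")          # hypothetical file name
fish %>%
  mutate(density = count / area) %>%              # individuals per unit area
  group_by(site, taxon) %>%
  summarise(mean_density = mean(density, na.rm = TRUE), .groups = "drop")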
The four contributing projects are two research projects, the Santa Barbara Coastal LTER (SBC LTER) and the Partnership for Interdisciplinary Studies of Coastal Oceans (PISCO), plus the kelp forest monitoring program of the Santa Barbara Channel National Park and the San Nicolas Island monitoring program supported by USGS. Together, these projects have recorded data for more than 200 species at approximately 100 sites on both the mainland coast and the Santa Barbara Channel Islands. Sampling began in 1982 and is ongoing.
https://guides.library.uq.edu.au/deposit-your-data/license-reuse-data-agreement
The database uses data on shorebird counts from around Australia. The majority of the records are from BirdLife's Birdata database. We supplemented these data with bird surveys within the Coorong from David Paton and with data available from the South Australian Government (Paton, Paton, and Bailey 2016). Some shorebird areas, namely Eighty Mile Beach, Roebuck Bay, and Werribee/Avalon, did not have count-area-level data for a number of recent years (2019-2022) but had aggregated summary data available. Within the database, observations of the number of individuals per species are organized in "count areas", which are generally one high-tide roost or a segment of beach. Count areas are situated within "shorebird areas", which are the maximum areas in which individual birds are likely to move during the non-breeding season (Clemens, Herrod, and Weston 2014). The database contains >380,000 records from 448 shorebird areas around the country. Data on individual species' generation times were sourced from BirdLife's Data Zone ('BirdLife Data Zone' 2022).
For our analysis, we aggregated the data into independent count occasions for each shorebird area within each Australian summer, here termed a "season". To achieve this, we summarized the total number of each species observed at a shorebird area for each month during the summer months (October, November, December, January, and February). First, the database was subset to records that had complete fields for "shorebird area", "point count ID", "count", and "date". Data were then aggregated to find the maximum observations per species per count area per month. The maximum observations per count area per month were then summed per shorebird area. For shorebird areas with counts across multiple months, we used the top two counts per season as input for our data analysis. Finally, we only included shorebird areas that had at least 500 birds observed over the entire time series and had at least one count for at least half of the years in the entire time series (14 of the 29 years). Structured, regular monitoring began in 1993, so we used data from 1993-2022.
Modelling abundance and population trends - The objective was to estimate abundance and population trends of the targeted species at the national level, using the time-series data described above. Following the successful example of modelling population trends of shorebirds in Australia by Studds et al. (2017), we also used N-mixture models, which estimate the abundance of each species at each shorebird area each year, while accounting for imperfect detection of individuals as well as among-area differences, temporal trends, and over-dispersion in abundance. The model allowed us to estimate: (i) the abundance of each species at each shorebird area each year, (ii) the total abundance of each species across all areas each year, and (iii) the nationwide population index of each species, which shows "average" changes in the species' abundance across all shorebird areas. As N-mixture models tend to be highly complex with many parameters and thus require much information (i.e., data) for those parameters to be successfully estimated, we developed two types of N-mixture models with varying levels of complexity/assumptions: (i) a model assuming that detection probabilities at a given area vary among months within each year, and (ii) a model assuming that detection probabilities at a given area are constant across months within each year.
We first fitted model (i) above to all targeted species using the program JAGS (Hornik et al. 2003) through the R2jags package (Su and Yajima 2022) in R version 4.2.1 (R Core Team 2015). Model convergence was checked with R-hat values and trace plots. If the model did not converge, we next fitted model (ii) above and increased the number of iterations until the model converged. If neither model converged, we fitted a simpler model, which had the same structure but did not account for the imperfect detection of individuals (i.e., it assumed that all individuals are detectable). Using the model outputs, we then estimated the rate of change in abundance. For a given time frame (29 years, three generations, 1993-2013, 2013-2022) we calculated growth rates using generalized least squares regression to account for temporal autocorrelation. We then sampled 1000 growth rates from each regression result and calculated the mean and standard deviation based on these samples. We then took the difference in the samples and calculated the probability of the difference being larger than zero using the code:
100*length(difference_in_samples[difference_in_samples < 0])/length(difference_in_samples)
We used IUCN criterion A2 to assess how the species should be listed based on the estimated declines from our analysis. These thresholds are: 80% - critically endangered, 50% - endangered, and 30% - vulnerable (from the IUCN Red List of Threatened Species).
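The following is a minimal sketch, not the authors' script, of estimating a growth rate with generalized least squares while accounting for temporal autocorrelation, and of the sample-based probability comparison described above. The abundance index values are simulated, and the comparison period is hypothetical.

# Minimal sketch: GLS trend with AR(1) errors, then compare sampled growth rates
library(nlme)

idx <- data.frame(year = 1993:2022,
                  log_abundance = log(1000) - 0.02 * (0:29) + rnorm(30, 0, 0.1))

fit <- gls(log_abundance ~ year, data = idx, correlation = corAR1(form = ~ year))
growth_rate <- coef(fit)["year"]                     # mean annual rate of change

# Sample 1000 growth rates from each result and compare the two periods
samples_a <- rnorm(1000, growth_rate, sqrt(vcov(fit)["year", "year"]))
samples_b <- rnorm(1000, -0.01, 0.005)               # hypothetical second time frame
difference_in_samples <- samples_a - samples_b
100 * length(difference_in_samples[difference_in_samples < 0]) /
  length(difference_in_samples)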
The XMM-Newton survey of the Small Magellanic Cloud (SMC) yielded complete coverage of the bar and eastern wing in the 0.2-12.0 keV energy band. In addition to the main field (5.58 deg^2), available outer fields were included in the catalogue, yielding a total field area of 6.32 deg^2. This catalog comprises 3,053 unique X-ray point sources and sources with moderate extent that have been reduced from 5,236 individual detections found in 100 observations between April 2000 and April 2010 (the details of these exposures are given in Table B.1 of the reference paper). For 927 sources, there were detections at multiple epochs, with some SMC fields observed up to 36 times. The detected sources have a median position uncertainty of 1.3 arcseconds (1 sigma) and limiting fluxes down to ~1 x 10^-14 erg/s/cm^2 in the 0.2-4.5 keV band, corresponding to X-ray luminosities of ~5 x 10^33 erg/s for sources located in the SMC. Sources have been classified using X-ray hardness ratios, X-ray variability, and their multi-wavelength properties. In their paper, the authors discuss the statistical properties of the detected X-ray sources, like the spatial distribution, X-ray color diagrams, luminosity functions, and time variability. They have identified 49 SMC high-mass X-ray binaries (HMXB), four super-soft X-ray sources (SSS), 34 foreground stars, and 72 active galactic nuclei (AGN) behind the SMC. In addition, they found candidates for SMC HMXBs (45) and faint SSSs (8) as well as AGN (2092) and galaxy clusters (13). Notice that X-ray sources with high extent (>40 arcseconds), e.g. supernova remnants and galaxy clusters, have been previously presented by Haberl et al. (2012, A&A, 545, A128) and are not included in this table. To investigate the spectral behavior of all sources, the authors used hardness ratios HR_i (i = 1, 2, 3, 4), defined by HR_i = (R_{i+1} - R_i) / (R_{i+1} + R_i), where R_i is the count rate in energy band i as defined by:
Band Energy range
1 0.2-0.5 keV
2 0.5-1.0 keV
3 1.0-2.0 keV
4 2.0-4.5 keV
5 4.5-12.0 keV
To increase statistics, the authors also calculated average HR_i values, combining all available instruments and observations. HR_i is not given if both rates R_i and R_{i+1} are null or if the 1-sigma uncertainty Delta(HR_i) covers the complete HR interval from -1 to +1. To convert an individual count rate R_i of an energy band i into a setup-independent, observed flux F_i, the authors calculated energy conversion factors (ECFs) f_i = R_i/F_i, as described in Sect. A.3 of the reference paper. For the calculation, they assumed a universal spectrum for all sources, described by a power-law model with a photon index of 1.7 and photo-electric foreground absorption by the Galaxy of N_H,Gal = 6 x 10^20 cm^-2 (the average for the SMC main field in the H I map of Dickey & Lockman 1990, ARAA, 28, 215). In addition to the fluxes for each detection, the authors calculated flux upper limits F_UL for each observation and source, if the source was observed but not detected in an individual observation. As for the initial source detection, they used the emldetect task to fit sources, but kept the source positions fixed at the master positions and accepted all detection likelihoods in order to get an upper limit for the flux. This table was created by the HEASARC in October 2013 based on CDS Catalog J/A+A/558/A3 file smc_src.dat.
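The hardness-ratio definition above can be applied directly to a vector of band count rates; a minimal R sketch with hypothetical rates follows.

# Minimal sketch: HR_i = (R_{i+1} - R_i) / (R_{i+1} + R_i) for bands 1-5
hardness_ratios <- function(R) {
  sapply(1:4, function(i) (R[i + 1] - R[i]) / (R[i + 1] + R[i]))
}

hardness_ratios(c(0.002, 0.010, 0.015, 0.008, 0.003))   # hypothetical count rates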
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A mixed data frame (MDF) is a table collecting categorical, numerical, and count observations. The use of MDFs is widespread in statistics and the applications are numerous, from abundance data in ecology to recommender systems. In many cases, an MDF exhibits simultaneously main effects, such as row, column, or group effects, and interactions, for which a low-rank model has often been suggested. Although the literature on low-rank approximations is very substantial, with few exceptions existing methods do not allow one to incorporate main effects and interactions while providing statistical guarantees. The present work fills this gap. We propose an estimation method which recovers the main effects and the interactions simultaneously. We show that our method is near optimal under conditions which are met in our targeted applications. We also propose an optimization algorithm which provably converges to an optimal solution. Numerical experiments reveal that our method, mimi, performs well when the main effects are sparse and the interaction matrix has low rank. We also show that mimi compares favorably to existing methods, in particular when the main effects are significantly large compared to the interactions, and when the proportion of missing entries is large. The method is available as an R package on the Comprehensive R Archive Network. Supplementary materials for this article are available online.
This reference contains the R code for the analysis and summary of detections of Bachman's sparrow, bobwhite quail, and brown-headed nuthatch through 2020. Specifically, it generates the probability of detection and occupancy of the species based on call counts and calls elicited with playback. The code loads raw point count data (CSV files) and fire history data (CSV) and cleans/transforms them into a tidy format for occupancy analysis. It then creates the necessary data structure for occupancy analysis, performs the analysis for the three focal species, and provides functionality for generating tables and figures summarizing the key findings of the occupancy analysis. The raw data, point count locations, and other spatial data (shapefiles) are contained in the dataset.
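For orientation, the sketch below shows a single-season occupancy analysis of detection histories in R. The dataset description does not name the package used; the unmarked package shown here is one common choice, and the detection matrix is simulated rather than read from the point-count CSV files.

# Minimal sketch: constant-detection, constant-occupancy model with unmarked
library(unmarked)

y <- matrix(rbinom(300, 1, 0.3), nrow = 100, ncol = 3)   # sites x repeat visits (simulated)
umf <- unmarkedFrameOccu(y = y)

fit <- occu(~ 1 ~ 1, data = umf)        # detection formula first, then occupancy
backTransform(fit, type = "det")        # probability of detection
backTransform(fit, type = "state")      # occupancy probability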
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data from the traffic signalling system (SCATS). The dataset contains hourly traffic volumes for the Adelaide City Council jurisdiction.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LSC (Leicester Scientific Corpus)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.
The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
[Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1
Getting Started
This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on the quantification of the meaning of research texts and to make it available for use in Natural Language Processing projects.
LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file 'List_of_Categories.txt'.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file 'List_of_Research_Areas.txt'.
6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]
The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.
Data Processing
Step 1: Downloading the Data Online
The dataset was collected manually by exporting documents as Tab-delimited files online. All documents are available online.
Step 2: Importing the Dataset to R
The LSC was collected as TXT files. All documents were imported into R.
Step 3: Cleaning the Data from Documents with an Empty Abstract or without a Category
As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories were removed.
Step 4: Identification and Correction of Concatenated Words in Abstracts
Medicine-related publications in particular use 'structured abstracts'. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc. The tool used for extracting abstracts concatenates section headings with the first word of the section; for instance, we observe words such as ConclusionHigher and ConclusionsRT. The detection and identification of such words was done by sampling medicine-related publications with human intervention. Detected concatenated words were split into two words; for instance, the word 'ConclusionHigher' is split into 'Conclusion' and 'Higher'. The section headings in such abstracts are listed below:
Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy
Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After correction, the lengths of abstracts were calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as Microsoft Word's 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
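As an illustration of the Step 4 correction, the R sketch below splits a heading concatenated with the first word of its section. The list of headings and the regular expression are simplified assumptions, not the procedure actually used (which relied on sampling with human intervention).

# Minimal sketch: split a section heading fused with the following word
headings <- c("Background", "Methods", "Results", "Conclusion", "Conclusions")
pattern  <- paste0("\\b(", paste(headings, collapse = "|"), ")(?=[A-Z])")
split_headings <- function(text) gsub(pattern, "\\1 ", text, perl = TRUE)

split_headings("ConclusionHigher doses were associated with better outcomes.")
# "Conclusion Higher doses were associated with better outcomes."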
Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1
Publications can include a footer below the text of the abstract containing a copyright notice, permission policy, journal name, licence, authors' rights or conference name, added by conferences and journals. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text. For example, our casual observation yields that copyright notices such as 'Published by Elsevier Ltd.' are placed in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licences and permission policies identified by sampling of abstracts.
Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts
The cleaning procedure described in the previous step led to some abstracts having fewer words than our minimum length criterion (30 words); 474 such texts were removed.
Step 8: Saving the Dataset into CSV Format
Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields. To access the LSC for research purposes, please email ns433@le.ac.uk.
References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication manual. American Psychological Association, Washington, DC, 1983.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of species occurrence data from 1900 to the present for Australian terrestrial species, organised by IBRA region and CAPAD protection status. Counts are provided by species and IBRA region for:
1. The number of records showing evidence of the species' presence inside the region
2. The number of records showing evidence of the species' presence within CAPAD 2020 protected areas inside the region
3. The number of records showing evidence of the species' presence within the protected areas under indigenous management within the region
The first count includes all records in the second count, and the second count includes all records in the third count.
The counts in this dataset allow for some simple comparisons between the numbers of species recorded inside and outside protected areas in each region and comparison of the species in each region. Inclusion of EPBC statuses allows comparisons based on conservation status. Note however that such counts may not accurately reflect real differences in biodiversity. It is important to consider confounding factors (particularly variations in recording effort over time).
Occurrence records were aggregated and organised by the Atlas of Living Australia (ALA, https://ala.org.au/) and include survey and monitoring data collected and managed by the Integrated Marine Observing System (IMOS, https://imos.org.au/) and the Terrestrial Ecosystem Research Network (TERN, https://tern.org.au/).
To find out more about this dataset, visit: https://ecoassets.org.au/data/summary-data-protection-status-for-australian-terrestrial-species-occurrences/
DOI: https://doi.org/10.26197/ala.445c74bb-89f2-415e-9e5b-0e38e9c7ee1e
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Explorative visualization techniques provide a first summary of microbiome read count datasets through dimension reduction. A plethora of dimension reduction methods exists, but many of them focus primarily on sample ordination, failing to elucidate the role of the bacterial species. Moreover, implicit but often unrealistic assumptions underlying these methods fail to account for overdispersion and differences in sequencing depth, which are two typical characteristics of sequencing data. We combine log-linear models with a dispersion estimation algorithm and flexible response function modelling into a framework for unconstrained and constrained ordination. The method is able to cope with differences in dispersion between taxa and varying sequencing depths, to yield meaningful biological patterns. Moreover, it can correct for observed technical confounders, whereas other methods are adversely affected by these artefacts. Unlike distance-based ordination methods, the assumptions underlying our method are stated explicitly and can be verified using simple diagnostics. The combination of unconstrained and constrained ordination in the same framework is unique in the field and facilitates microbiome data exploration. We illustrate the advantages of our method on simulated and real datasets, while pointing out flaws in existing methods. The algorithms for fitting and plotting are available in the R-package RCM.