https://spdx.org/licenses/CC0-1.0.html
Effective population size (Ne) is a particularly useful metric for conservation as it affects genetic drift, inbreeding and adaptive potential within populations. Current guidelines recommend a minimum Ne of 50 and 500 to avoid short-term inbreeding and to preserve long-term adaptive potential, respectively. However, the extent to which wild populations reach these thresholds globally has not been investigated, nor has the relationship between Ne and human activities. Through a quantitative review, we generated a dataset with 4610 georeferenced Ne estimates from 3829 unique populations, extracted from 723 articles. These data show that certain taxonomic groups are less likely to meet the 50/500 thresholds and are disproportionately impacted by human activities; plant, mammal, and amphibian populations had a <54% probability of reaching Ne ≥ 50 and a <9% probability of reaching Ne ≥ 500. Populations listed as being of conservation concern according to the IUCN Red List had a smaller median Ne than unlisted populations, and this was consistent across all taxonomic groups. Ne was reduced in areas with a greater Global Human Footprint, especially for amphibians, birds, and mammals, although relationships varied between taxa. We also highlight several considerations for future work, including the role that gene flow and subpopulation structure play in the estimation of Ne in wild populations, and the need for finer-scale taxonomic analyses. Our findings provide guidance for more specific thresholds based on Ne and help prioritize assessment of populations from taxa most at risk of failing to meet conservation thresholds.
Methods
Literature search, screening, and data extraction
A primary literature search was conducted using the ISI Web of Science Core Collection and any articles that referenced two popular single-sample Ne estimation software packages: LDNe (Waples & Do, 2008) and NeEstimator v2 (Do et al., 2014). The initial search included 4513 articles published up to the search date of May 26, 2020. Articles were screened for relevance in two steps, first based on title and abstract, and then based on the full text. For each step, a consistency check was performed using 100 articles to ensure they were screened consistently between reviewers (n = 6). We required a kappa score (Collaboration for Environmental Evidence, 2020) of ≥ 0.6 in order to proceed with screening of the remaining articles. Articles were screened based on three criteria: (1) Is an estimate of Ne or Nb reported; (2) for a wild animal or plant population; (3) using a single-sample genetic estimation method. Further details on the literature search and article screening are found in the Supplementary Material (Fig. S1). We extracted data from all studies retained after both screening steps (title and abstract; full text). Each line of data entered in the database represents a single estimate from a population. Some populations had multiple estimates over several years, or from different estimation methods (see Table S1), and each of these was entered on a unique row in the database. Data on N̂e, N̂b, or N̂c were extracted from tables and figures using WebPlotDigitizer software version 4.3 (Rohatgi, 2020). A full list of data extracted is found in Table S2.
Data Filtering
After the initial data collation, correction, and organization, there was a total of 8971 Ne estimates (Fig. S1).
We used regression analyses to compare Ne estimates on the same populations, using different estimation methods (LD, Sibship, and Bayesian), and found that the R2 values were very low (R2 values of <0.1; Fig. S2 and Fig. S3). Given this inconsistency, and the fact that LD is the most frequently used method in the literature (74% of our database), we proceeded with only using the LD estimates for our analyses. We further filtered the data to remove estimates where no sample size was reported or no bias correction (Waples, 2006) was applied (see Fig. S6 for more details). Ne is sometimes estimated to be infinity or negative within a population, which may reflect that a population is very large (i.e., where the drift signal-to-noise ratio is very low), and/or that there is low precision with the data due to small sample size or limited genetic marker resolution (Gilbert & Whitlock, 2015; Waples & Do, 2008; Waples & Do, 2010). We retained infinite and negative estimates only if they reported a positive lower confidence interval (LCI), and we used the LCI in place of a point estimate of Ne or Nb. We chose to use the LCI as a conservative proxy for Ne in cases where a point estimate could not be generated, given its relevance for conservation (Fraser et al., 2007; Hare et al., 2011; Waples & Do, 2008; Waples, 2023). We also compared results using the LCI to a dataset where infinite or negative values were all assumed to reflect very large populations and replaced the estimate with an arbitrary large value of 9,999 (for reference, in the LCI dataset only 51 estimates, or 0.9%, had an Ne or Nb > 9999). Using this 9999 dataset, we found that the main conclusions from the analyses remained the same as when using the LCI dataset, with the exception of the HFI analysis (see discussion in supplementary material; Table S3, Table S4, Fig. S4, S5). We also note that point estimates with an upper confidence interval of infinity (n = 1358) were larger on average (mean = 1380.82, compared to 689.44 and 571.64, for estimates with no CIs or with an upper boundary, respectively). Nevertheless, we chose to retain point estimates with an upper confidence interval of infinity because accounting for them in the analyses did not alter the main conclusions of our study and would have significantly decreased our sample size (Fig. S7, Table S5). We also retained estimates from populations that were reintroduced or translocated from a wild source (n = 309), whereas those from captive sources were excluded during article screening (see above). In exploratory analyses, the removal of these data did not influence our results, and many of these populations are relevant to real-world conservation efforts, as reintroductions and translocations are used to re-establish or support small, at-risk populations. We removed estimates based on duplication of markers (keeping estimates generated from SNPs when studies used both SNPs and microsatellites), and duplication of software (keeping estimates from NeEstimator v2 when studies used it alongside LDNe). Spatial and temporal replication were addressed with two separate datasets (see Table S6 for more information): the full dataset included spatially and temporally replicated samples, while these two types of replication were removed from the non-replicated dataset. Finally, for all populations included in our final datasets, we manually extracted their protection status according to the IUCN Red List of Threatened Species.
Taxa were categorized as “Threatened” (Vulnerable, Endangered, Critically Endangered), “Nonthreatened” (Least Concern, Near Threatened), or “N/A” (Data Deficient, Not Evaluated).
Mapping and Human Footprint Index (HFI)
All populations were mapped in QGIS using the coordinates extracted from articles. The maps were created using a World Behrmann equal area projection. For the summary maps, estimates were grouped into grid cells with an area of 250,000 km2 (roughly 500 km x 500 km, but the dimensions of each cell vary due to distortions from the projection). Within each cell, we generated the count and median of Ne. We used the Global Human Footprint dataset (WCS & CIESIN, 2005) to generate a value of human influence (HFI) for each population at its geographic coordinates. The footprint ranges from zero (no human influence) to 100 (maximum human influence). Values were available in 1 km x 1 km grid cell size and were projected over the point estimates to assign a value of human footprint to each population. The human footprint values were extracted from the map into a spreadsheet to be used for statistical analyses. Not all geographic coordinates had a human footprint value associated with them (i.e., in the oceans and other large bodies of water), therefore marine fishes were not included in our HFI analysis. Overall, 3610 Ne estimates in our final dataset had an associated footprint value.
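This kind of point-over-raster overlay can be sketched in a few lines of R with the terra package; the file names and column names below are placeholders rather than the study's actual inputs.
library(terra)
hfi  <- rast("global_human_footprint.tif")            #Global Human Footprint raster (placeholder file name)
pops <- read.csv("ne_estimates.csv")                   #one row per Ne estimate, with lon/lat columns
pts  <- vect(pops, geom = c("lon", "lat"), crs = "EPSG:4326")
pts  <- project(pts, crs(hfi))                         #match the raster's projection
pops$hfi <- extract(hfi, pts)[, 2]                     #column 1 is an ID, column 2 the footprint value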
By US Open Data Portal, data.gov [source]
This dataset contains the age-standardized stroke mortality rate in the United States from 2013 to 2015, by state/territory, county, gender and race/ethnicity. The data source is the highly respected National Vital Statistics System. The rates are reported as a 3-year average and have been age-standardized. Moreover, county rates are spatially smoothed for further accuracy. The interactive map of heart disease and stroke produced by this dataset provides invaluable information about the geographic disparities in stroke mortality across America at different scales - county, state/territory and national. By using the adjustable filter settings provided in this interactive map, you can quickly explore demographic details such as gender (Male/Female) or race/ethnicity (e.g., Non-Hispanic White). Conquer your fear of the unknown with evidence! Investigate these locations now to inform meaningful action plans for greater public health resilience in America and find out if strokes remain a daily threat to millions of citizens. Updated regularly since 2020-02-26, so check it out now!
The US Age-Standardized Stroke Mortality Rates (2013-2015) by State/County/Gender/Race dataset provides valuable insights into stroke mortality rates among adults ages 35 and over in the USA between 2013 and 2015. This dataset contains age-standardized data from the National Vital Statistics System at the state, county, gender, and race level. Use this guide to learn how best to use this dataset for your purposes!
Understand the Data
This dataset provides information about stroke mortality rates among adult Americans aged 35+. The data were collected from 2013 to 2015 and are reported as three-year averages. County-level data can be viewed, though spatial smoothing techniques have been applied. The following columns of data are provided:
- Year – the year of the data collection
- LocationAbbr – the abbreviation of the location where the data was collected
- LocationDesc – a description of this location
- GeographicLevel – geographic level of granularity at which these numbers are recorded
- DataSource – source of these statistics
- Class – class or group into which these stats fall
- Topic – overall topic on which we have stats
- Data_Value – age-standardized value associated with each row
- Data_Value_Unit – units associated with each value
- Stratification1 – first stratification defined for a given row
- Stratification2 – second stratification defined for a given row
Additionally, several other footnote fields such as Data_Value_Type, Data_Value_Footnote_Symbol, StratificationCategory1 and StratificationCategory2 may be present.
Exploring Correlations
Now that you understand what the individual columns mean, it should take no time to analyze correlations within different categories using standard statistical methods like linear regressions or boxplots. If you want to compare different regions, you can use the LocationAbbr column with locations reduced to geographical levels such as State or Region. Alternatively, if you want comparisons across genders, refer to the column labelled Stratification1 along with your desired values within this column.
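As a minimal illustration of these comparisons (the column names follow this guide, but the value "State" for GeographicLevel is an assumption; check the actual csv-1.csv file first):
library(tidyverse)
stroke <- read_csv("csv-1.csv")
#Average age-standardized stroke mortality by state and gender
stroke %>%
  filter(GeographicLevel == "State", Stratification1 %in% c("Male", "Female")) %>%
  group_by(LocationAbbr, Stratification1) %>%
  summarise(mean_rate = mean(Data_Value, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(mean_rate))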
- Creating a visualization to show the relationship between stroke mortality and specific variations in race/ethnicity, gender, and geography.
- Comparing two or more states based on their average stroke mortality rate over time.
- Building a predictive model that disregards temporal biases to anticipate further changes in stroke mortality for certain communities or entire states across the US.
If you use this dataset in your research, please credit the original authors.
Data Source
Unknown License - Please check the dataset description for more information.
File: csv-1.csv
Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.
The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.
The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.
This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.
The AP is updating this dataset hourly at 45 minutes past the hour.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.
Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic
Filter cases by state here
Rank states by their status as current hotspots. Calculates the 7-day rolling average of new cases per capita in each state: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=481e82a4-1b2f-41c2-9ea1-d91aa4b3b1ac
Find recent hotspots within your state by running a query to calculate the 7-day rolling average of new cases per capita in each county: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=b566f1db-3231-40fe-8099-311909b7b687&showTemplatePreview=true
Join county-level case data to an earlier dataset released by AP on local hospital capacity here. To find out more about the hospital capacity dataset, see the full details.
Pull the 100 counties with the highest per-capita confirmed cases here
Rank all the counties by the highest per-capita rate of new cases in the past 7 days here. Be aware that because this ranks per-capita caseloads, very small counties may rise to the very top, so take into account raw caseload figures as well.
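If you are working with a local copy of the data, the per-capita 7-day rolling average behind these queries can be reproduced along the following lines; the column names (county, state, date, cases_total, population) are assumptions, not the AP's actual field names.
library(dplyr)
library(zoo)
rolling_hotspots <- function(covid) {
  covid %>%
    arrange(county, state, date) %>%
    group_by(county, state) %>%
    mutate(new_cases         = cases_total - lag(cases_total, default = first(cases_total)),
           new_per_100k      = new_cases / population * 1e5,
           avg_7day_per_100k = zoo::rollmeanr(new_per_100k, k = 7, fill = NA)) %>%
    ungroup()
}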
The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.
https://datawrapper.dwcdn.net/nRyaf/15/
Johns Hopkins timeseries data - Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count. - Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here
This data should be credited to Johns Hopkins University COVID-19 tracking project
Phase 1: ASK
1. Business Task * Cyclist is looking to increase their earnings, and wants to know if creating a social media campaign can influence "Casual" users to become "Annual" members.
2. Key Stakeholders: * The main stakeholder from Cyclist is Lily Moreno, who is the Director of Marketing and responsible for the development of campaigns and initiatives to promote their bike-share program. The other teams involved with this project will be Marketing & Analytics, and the Executive Team.
3. Business Task: * Comparing the two kinds of users and defining how they use the platform, what variables they have in common, what variables are different, and how they can get Casual users to become Annual members.
Phase 2: PREPARE:
1. Determine Data Credibility * Cyclist provided data from years 2013-2021 (through March 2021), all of which is first-hand data collected by the company.
2. Sort & Filter Data: * The stakeholders want to know how the current users are using their service, so I am focusing on using the data from 2020-2021 since this is the most relevant period of time to answer the business task.
#Installing packages
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
install.packages("readr", repos = "http://cran.us.r-project.org")
install.packages("janitor", repos = "http://cran.us.r-project.org")
install.packages("geosphere", repos = "http://cran.us.r-project.org")
install.packages("gridExtra", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(readr)
library(janitor)
library(geosphere)
library(gridExtra)
library(lubridate) #needed for the wday() calls used below
#Importing data & verifying the information within the dataset
all_tripdata_clean <- read.csv("/Data Projects/cyclist/cyclist_data_cleaned.csv")
glimpse(all_tripdata_clean)
summary(all_tripdata_clean)
Phase 3: PROCESS
1. Cleaning Data & Preparing for Analysis: * Once the data has been placed into one dataset and checked for errors, we begin cleaning the data. * Eliminating data that corresponds to the company servicing the bikes, and any ride with a negative or zero travel value. * New columns will be added to assist in the analysis and to provide accurate assessments of who is using the bikes.
#Eliminating any data that represents the company performing maintenance (HQ QR station), and trips with a negative duration
all_tripdata_clean <- all_tripdata_clean[!(all_tripdata_clean$start_station_name == "HQ QR" | difftime(all_tripdata_clean$ended_at, all_tripdata_clean$started_at, units = "mins") < 0),]
#Creating columns for the individual date components (date is created first so the other columns can be derived from it)
all_tripdata_clean$date <- as.Date(all_tripdata_clean$started_at)
all_tripdata_clean$day_of_week <- format(as.Date(all_tripdata_clean$date), "%A")
all_tripdata_clean$day <- format(as.Date(all_tripdata_clean$date), "%d")
all_tripdata_clean$month <- format(as.Date(all_tripdata_clean$date), "%m")
all_tripdata_clean$year <- format(as.Date(all_tripdata_clean$date), "%Y")
**Now I will begin calculating the length of rides being taken, distance traveled, and the mean amount of time & distance.**
#Calculating the ride length in miles & minutes
all_tripdata_clean$ride_length <- difftime(all_tripdata_clean$ended_at,all_tripdata_clean$started_at,units = "mins")
all_tripdata_clean$ride_distance <- distGeo(matrix(c(all_tripdata_clean$start_lng, all_tripdata_clean$start_lat), ncol = 2), matrix(c(all_tripdata_clean$end_lng, all_tripdata_clean$end_lat), ncol = 2))
all_tripdata_clean$ride_distance = all_tripdata_clean$ride_distance/1609.34 #converting to miles
#Calculating the mean time and distance based on the user groups
userType_means <- all_tripdata_clean %>%
  group_by(member_casual) %>%
  summarise(mean_time = mean(ride_length), mean_distance = mean(ride_distance))
Adding in calculations that will differentiate between bike types and which type of user is using each specific bike type.
#Calculations
with_bike_type <- all_tripdata_clean %>% filter(rideable_type=="classic_bike" | rideable_type=="electric_bike")
#Ride counts by user type, bike type, and weekday
with_bike_type %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, rideable_type, weekday) %>%
  summarise(totals = n(), .groups = "drop")
#Ride counts by user type and bike type
with_bike_type %>%
  group_by(member_casual, rideable_type) %>%
  summarise(totals = n(), .groups = "drop")
#Calculating the ride differential between user types across weekdays
all_tripdata_clean %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length), .groups = "drop") %>%
  arrange(member_casual, weekday)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wind Spacecraft:
The Wind spacecraft (https://wind.nasa.gov) was launched on November 1, 1994 and currently orbits the first Lagrange point between the Earth and sun. A comprehensive review can be found in Wilson et al. [2021]. It holds a suite of instruments from gamma ray detectors to quasi-static magnetic field instruments, Bo. The instruments used for this data product are the fluxgate magnetometer (MFI) [Lepping et al., 1995] and the radio receivers (WAVES) [Bougeret et al., 1995]. The MFI measures 3-vector Bo at ~11 samples per second (sps); WAVES observes electromagnetic radiation from ~4 kHz to >12 MHz which provides an observation of the upper hybrid line (also called the plasma line) used to define the total electron density and also takes time series snapshot/waveform captures of electric and magnetic field fluctuations, called TDS bursts herein.
WAVES Instrument:
The WAVES experiment [Bougeret et al., 1995] on the Wind spacecraft is composed of three orthogonal electric field antenna and three orthogonal search coil magnetometers. The electric fields are measured through five different receivers: Low Frequency FFT receiver called FFT (0.3 Hz to 11 kHz), Thermal Noise Receiver called TNR (4-256 kHz), Radio receiver band 1 called RAD1 (20-1040 kHz), Radio receiver band 2 called RAD2 (1.075-13.825 MHz), and the Time Domain Sampler (TDS). The electric field antenna are dipole antennas with two orthogonal antennas in the spin plane and one spin axis stacer antenna.
The TDS receiver allows one to examine the electromagnetic waves observed by Wind as time series waveform captures. There are two modes of operation, TDS Fast (TDSF) and TDS Slow (TDSS). TDSF returns 2048 data points for two channels of the electric field, typically Ex and Ey (i.e. spin plane components), with little to no gain below ~120 Hz (the data herein has been high pass filtered above ~150 Hz for this reason). TDSS returns four channels with three electric(magnetic) field components and one magnetic(electric) component. The search coils show a gain roll off ~3.3 Hz [e.g., see Wilson et al., 2010; Wilson et al., 2012; Wilson et al., 2013 and references therein for more details].
The original calibration of the electric field antenna found that the effective antenna lengths are roughly 41.1 m, 3.79 m, and 2.17 m for the X, Y, and Z antenna, respectively. The +Ex antenna was broken twice during the mission as of June 26, 2020. The first break occurred on August 3, 2000 around ~21:00 UTC and the second on September 24, 2002 around ~23:00 UTC. These breaks reduced the effective antenna length of Ex from ~41 m to 27 m after the first break and ~25 m after the second break [e.g., see Malaspina et al., 2014; Malaspina & Wilson, 2016].
TDS Bursts:
TDS bursts are waveform captures/snapshots of electric and magnetic field data. The data is triggered by the largest amplitude waves which exceed a specific threshold and are then stored in a memory buffer. The bursts are ranked according to a quality filter which mostly depends upon amplitude. Due to the age of the spacecraft and ubiquity of large amplitude electromagnetic and electrostatic waves, the memory buffer often fills up before dumping onto the magnetic tape drive. If the memory buffer is full, then the bottom ranked TDS burst is erased every time a new TDS burst is sampled. That is, the newest TDS burst sampled by the instrument is always stored and if it ranks higher than any other in the list, it will be kept. This results in the bottom ranked burst always being erased. Earlier in the mission, there were also so called honesty bursts, which were taken periodically to test whether the triggers were working properly. It was found that the TDSF triggered properly, but not the TDSS. So the TDSS was set to trigger off of the Ex signals.
A TDS burst from the Wind/WAVES instrument is always 2048 time steps for each channel. The sample rate for TDSF bursts ranges from 1875 samples/second (sps) to 120,000 sps. Every TDS burst is marked with a unique set of numbers (unique on any given date) to help distinguish it from others and to ensure any set of channels are appropriately connected to each other. For instance, during one spacecraft downlink interval there may be 95% of the TDS bursts with a complete set of channels (i.e., TDSF has two channels, TDSS has four) while the remaining 5% can be missing channels (just example numbers, not quantitatively accurate). During another downlink interval, those missing channels may be returned if they are not overwritten. During every downlink, the flight operations team at NASA Goddard Space Flight Center (GSFC) generate level zero binary files from the raw telemetry data. Those files are filled with data received on that date and the file name is labeled with that date. There is no attempt to sort the data within chronologically, so any given level zero file can have data from multiple dates within. Thus, it is often necessary to load upwards of five days of level zero files to find as many full channel sets as possible. The remaining unmatched channel sets comprise a much smaller fraction of the total.
All data provided here are from TDSF, so only two channels. Most of the time channel 1 will be associated with the Ex antenna and channel 2 with the Ey antenna. The data are provided in the spinning instrument coordinate basis with associated angles necessary to rotate into a physically meaningful basis (e.g., GSE).
TDS Time Stamps:
Each TDS burst is tagged with a time stamp called a spacecraft event time or SCET. The TDS datation time is sampled after the burst is acquired which requires a delay buffer. The datation time requires two corrections. The first correction arises from tagging the TDS datation with an associated spacecraft major frame in house keeping (HK) data. The second correction removes the delay buffer duration. Both inaccuracies are essentially artifacts of on ground derived values in the archives created by the WINDlib software (K. Goetz, Personal Communication, 2008) found at https://github.com/lynnbwilsoniii/Wind_Decom_Code.
The WAVES instrument's HK mode sends relevant low rate science back to ground once every spacecraft major frame. If multiple TDS bursts occur in the same major frame, it is possible for the WINDlib software to assign them the same SCETs. The reason being that this top-level SCET is only accurate to within +300 ms (in 120,000 sps mode) due to the issues described above (at lower sample rates, the error can be slightly larger). The time stamp uncertainty is a positive definite value because it results from digitization rounding errors. One can correct these issues to within +10 ms if using the proper HK data.
*** The data stored here have not corrected the SCETs! ***
The 300 ms uncertainty, due to the HK corrections mentioned above, results from WINDlib trying to recreate the time stamp after it has been telemetered back to ground. If a burst stays in the TDS buffer for extended periods of time (i.e., >2 days), the interpolation done by WINDlib can make mistakes in the 11th significant digit. The positive definite nature of this uncertainty is due to rounding errors associated with the onboard DPU (digital processing unit) clock rollover. The DPU clock is a 24 bit integer clock sampling at ∼50,018.8 Hz. The clock rolls over at ∼5366.691244092221 seconds, i.e., (16 × 2^24)/50,018.8. The sample rate is a temperature sensitive issue and thus subject to change over time. From a sample of 384 different points on 14 different days, a statistical estimate of the rollover time is 5366.691124061162 ± 0.000478370049 seconds (calculated by Lynn B. Wilson III, 2008). Note that the WAVES instrument team used UR8 times, which are the number of 86,400 second days from 1982-01-01/00:00:00.000 UTC.
The method to correct the SCETs to within +10 ms, were one to do so, is given as follows:
Retrieve the DPU clock times, SCETs, UR8 times, and DPU Major Frame Numbers from the WINDlib libraries on the VAX/ALPHA systems for the TDSS(F) data of interest.
Retrieve the same quantities from the HK data.
Match the HK event number with the same DPU Major Frame Number as the TDSS(F) burst of interest.
Find the difference in DPU clock times between the TDSS(F) burst of interest and the HK event with matching major frame number (Note: The TDSS(F) DPU clock time will always be greater than the HK DPU clock if they are the same DPU Major Frame Number and the DPU clock has not rolled over).
Convert the difference to a UR8 time and add this to the HK UR8 time. The new UR8 time is the corrected UR8 time to within +10 ms.
Find the difference between the new UR8 time and the UR8 time WINDlib associates with the TDSS(F) burst. Add the difference to the DPU clock time assigned by WINDlib to get the corrected DPU clock time (Note: watch for the DPU clock rollover).
Convert the new UR8 time to a SCET using either the IDL WINDlib libraries or TMLib (STEREO S/WAVES software) libraries of available functions. This new SCET is accurate to within +10 ms.
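A minimal R sketch of steps 4 and 5 above, assuming the DPU clock times are expressed in seconds and using the nominal rollover period quoted earlier; the variable names are placeholders, not WINDlib outputs.
dpu_rollover <- (16 * 2^24) / 50018.8       #~5366.691 s, nominal DPU clock rollover period
correct_ur8 <- function(dpu_tds, dpu_hk, ur8_hk) {
  d <- dpu_tds - dpu_hk                     #TDS burst DPU time minus matching HK event DPU time
  if (d < 0) d <- d + dpu_rollover          #account for a DPU clock rollover
  ur8_hk + d / 86400                        #UR8 times are in days of 86,400 s
}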
One can find a UR8 to UTC conversion routine at https://github.com/lynnbwilsoniii/wind_3dp_pros in the ~/LYNN_PRO/Wind_WAVES_routines/ folder.
Examples of good waveforms can be found in the notes PDF at https://wind.nasa.gov/docs/wind_waves.pdf.
Data Set Description
Each Zip file contains 300+ IDL save files; one for each day of the year with available data. This data set is not complete as the software used to retrieve and calibrate these TDS bursts did not have sufficient error handling to handle some of the more nuanced bit errors or major frame errors in some of the level zero files. There is currently (as of June 27, 2020) an effort (by Keith Goetz et al.) to generate the entire TDSF and TDSS data set in one repository to be put on SPDF/CDAWeb as CDF files. Once that data set is available, it will supersede this data set.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets for the case study on the spread of the vernacular term "based" across 4chan/pol/, Reddit, and Twitter. Data was gathered in November 2021. All files are anonymised as much as possible. They contain:
Table 2 below details the queries we carried out for the collection of the initial datasets. For all platforms, we chose to retain non-English languages since the diffusion of the term in other languages was also deemed relevant.
- Source: Twitter | query type: Twitter v2 API | query: (#based OR (based (pilled OR pill OR redpilled OR redpill OR chad OR virgin OR cringe OR cringy OR triggered OR trigger OR tbh OR lol OR lmao OR wtf OR swag OR nigga OR finna OR bitch OR rare) ) OR " is based" OR "that\'s based" OR "based as fuck" OR "based af" OR "too based" OR "fucking based" "extremely based" OR "totally based" OR "incredibly based" OR "very based" OR "so based" OR "pretty based" OR "quite based" OR "kinda based" OR "kind of based" OR "fairly based" OR "based ngl" OR "as based as" OR "thank you based " OR "stay based" OR "based god") -"based in"-"based off"-"based * off"-"based around"-"based * around"-"based on"-"based * on"-"based out of"-"based upon"-"based * upon"-"based at"-"based from"-"is based by"-"is based of"-"on which * is based"-"upon which * is based"-"which is based there"-"is based all over"-"based more on"-"plant based"-"text based"-"turn based"-"need based"-"evidence based"-"community based" -"web based" -is:retweet -is:nullcast
- Source: Reddit | query type: Pushshift API | query: based -"based in" -"based off" -"based around" -"based on" -"based them on" -"based it on" -"evidence based"
- Source: 4chan/pol/ | query type: PostgreSQL | query: lower(body) LIKE '%based%' AND lower(body) NOT SIMILAR TO '%(-based|debased|based in |based off |based around |based on |based them on|based it on|based her on|based him on|based only on|based completely on|based solely on|based purely on|based entirely on|based not on |based not simply on|based entirely around|based out of|based upon |based at |is based by |is based of|on which it is based|on which this is based|which is based there|is based all over|which it is based|is based of |based firmly on|based off |based solely off|based more on|plant based|text based|turn based|need based|evidence based|community based|home based|internet based|web based|physics based)%'
There were some data gaps for 4chan/pol/ and Reddit. /pol/ data was missing because of gaps in the archives (mostly due to outages). The following time periods are incomplete or missing entirely:
15 - 16 April 2019
14 - 15 December 2019
3 - 10 December 2020
29 March 2021
10 - 12 April 2021
16 - 18 August 2021
11 October 2021
The 4plebs archive moreover only started in November 2013, meaning the first two years of /pol/’s existence are missing.
The Pushshift API did not return posts for certain dates. We somewhat mitigated this by also retrieving data through the new Beta endpoint. However, the following time periods were still missing data:
1 - 30 September 2017
1 February - 31 March 2018
5 - 6 November 2020
23 March 2021 through 27 March 2021
10 - 13 April 2021
After the initial data collection, we carried out several rounds of filtering to get rid of remaining false positives. For 4chan/pol/, we only needed to do this filtering once (attaining 0.95 precision), while for Twitter we carried out eight rounds (0.92 precision). For Reddit, we formulated nearly 500 exclusions but failed to generate a precision over 0.9. We thus had to do more rigorous filtering. We observed that longer comments were more likely to be false positives, so we removed all comments over 350 characters long. We settled on this number on the basis of our first sample; almost no true positives were over 350 characters long. Furthermore, we removed all comments except for those wherein based was used as a standalone word (thus excluding e.g. “plant-based”), at the start or end of a sentence, in capitals, or in conjunction with certain keywords or in certain phrases (e.g. “kinda based”). We also deleted posts by bot accounts by (rather crudely) removing posts of usernames including ‘bot’ or ‘auto’. This finally led to a precision of 0.9. A rough sketch of these rules is given after the filter lists below.
Filter keyword lists used in these rounds:
-based|location based
@-mentions with “based” "on which "where "wherever #based #customer| alkaline based| anime based | are based near | astrology based | at the based of| b0Iuip5wnA| based economy| based game | based locally| based my name | based near | based not upon| based points| based purely off| based quite near | based solely off| based soy source| based upstairs| blast based| class based| clearly based of this| combat based| condition based| dos based| emotional based| eth based| fact based| gender based| he based his | he's based in | indian based| is based for fans| is based lies| is based near | is based not around | is based not on | is based once again on | is based there| is based within| issue based| jersey based| listen to 01 we rare| music based| oil based| on which it's based| page based 1000| paper based| park based | pc based| pic based| pill based regimen| puzzle based| sex based | she based her | she's based in | skill based| story based| they based their | they're based in| toronto based| trigger on a new yoga 2| u.s. based| universal press| us based| value based| we're based in | where you based?| you're based in |#alkaline #based|#apps #based|#based #acidic|#flash #based|#home #based|#miami #based|#piano #based|#value #based|american based|australia based|australian based|based my decision|based entirely around|based entirely on|based exactly on |based her announcement|based her decision|based her off|based him off|based his announcement|based his decision|based largely on|based less on|based mostly on|based my guess|based only around|based only on|based partly on|based partly upon|based purely on |based solely around|based solely on|based strictly on|based the announcement|based the decision|based their announcement|based their decision|based, not upon|battery based|behavior based|behaviour based|blockchain based|book based series|canon based|character based|cloud based|commision based|component based|computer based|confusion based|content based|depression based|dev based|dnd based|factually based|faith based|fear based|flash based|flintstones based|flour based|home based|homin based|i based my|interaction based|is based circa|is based competely on|is based entirely off|is based here|is based more on|is based outta|is based totally on |is based up here|is based way more on|live conferences with r3|living based of|london based|luck based|malex based|market based|miami based|needs based|nyc based|on which the film is based|opinion based|piano based|point based|potato based|premise is based|region based|religious based|science based|she is based there|slavery based show|softball based|thanks richard clark|u.k. based|uk based|vendor based|vodka based|volunteer based|water based|where he is based|where the disney film is based|where the military is based|who are based there|who is based there|wordpress cms |
Allowed all posts:
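A rough R sketch of the comment-level filtering described above, covering only the length cut-off, a standalone-word match, and the crude bot-name removal; the data frame reddit and its columns body and author are assumed names, and the real pipeline used many more rules than shown here.
standalone_based <- "(^|[^A-Za-z-])based([^A-Za-z-]|$)"       #'based' as a standalone word
keep <- nchar(reddit$body) <= 350 &                           #drop long comments (mostly false positives)
        grepl(standalone_based, reddit$body, ignore.case = TRUE) &
        !grepl("bot|auto", reddit$author, ignore.case = TRUE) #crude bot-account removal
reddit_filtered <- reddit[keep, ]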
Background
The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.
New reweighting policy
Following the new reweighting policy, ONS has reviewed the latest population estimates made available during 2019 and has decided not to carry out a 2019 LFS and APS reweighting exercise. Therefore, the next reweighting exercise will take place in 2020. This will incorporate the 2019 Sub-National Population Projection data (published in May 2020) and 2019 Mid-Year Estimates (published in June 2020). It is expected that reweighted Labour Market aggregates and microdata will be published towards the end of 2020/early 2021.
Secure Access QLFS household data
Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. For some quarters, users should note that all missing values in the data are set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. From the 2013 household datasets, the standard -8 and -9 missing categories have been reinstated.
Secure Access household datasets for the QLFS are available from 2002 onwards, and include additional, detailed variables not included in the standard 'End User Licence' (EUL) versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits.
Prospective users of a Secure Access version of the QLFS will need to fulfil additional requirements, commencing with the completion of an extra application form to demonstrate to the data owners exactly why they need access to the extra, more detailed variables, in order to obtain permission to use that version. Secure Access users must also complete face-to-face training and agree to Secure Access' User Agreement (see 'Access' section below). Therefore, users are encouraged to download and inspect the EUL version of the data prior to ordering the Secure Access version.
LFS Documentation
The documentation available from the Archive to accompany LFS datasets largely consists of each volume of the User Guide including the appropriate questionnaires for the years concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS LFS User Guidance pages before commencing analysis.
The study documentation presented in the Documentation section includes the most recent documentation for the LFS only, due to available space. Documentation for previous years is provided alongside the data for access and is also available upon request.
Review of imputation methods for LFS Household data - changes to missing values
A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method is primarily focused on ensuring the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger amounts of missing values ('-8'/'-9') will be present in the data for these personal characteristic variables than before. Therefore, if users need to carry out any time series analysis of households/families which also includes personal characteristic variables covering this time period, then it is advised to filter off 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.
Variables DISEA and LNGLST
Dataset A08 (Labour market status of disabled people) which ONS suspended due to an apparent discontinuity between April to June 2017 and July to September 2017 is now available. As a result of this apparent discontinuity and the inconclusive investigations at this stage, comparisons should be made with caution between April to June 2017 and subsequent time periods. However users should note that the estimates are not seasonally adjusted, so some of the change between quarters could be due to seasonality. Further recommendations on historical comparisons of the estimates will be given in November 2018 when ONS are due to publish estimates for July to September 2018.
An article explaining the quality assurance investigations that have been conducted so far is available on the ONS Methodology webpage. For any queries about Dataset A08 please email Labour.Market@ons.gov.uk.
Latest Edition Information
For the seventeenth edition (August 2025), one quarterly data file covering the time period July-September, 2024 has been added to the study.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Direct Download (Raster Data Gateway)
Objective: Assess impacts to vegetation from acute exposure to ozone (O3) during the growing season (April to September).
Data: Two exposure indices were calculated from hourly O3 data collected by the EPA to track effects from chronic O3 exposure (W126 index) and acute O3 exposure (N100). The N100 index is the sum of hours over 100 ppb in a given month and captures peak events of high or acute exposure. For both indices, hourly data were used to calculate monthly values. Monitoring sites were only included if at least 75% of the hourly observations were available, and monthly values were corrected to account for any missing data following the procedures of Lefohn, Knudsen, and Shadwick (2011). Corrected monthly values were totaled across the growing season to generate an annual value. Annual values were averaged over the most recently available 3-year period to generate a value for each site. Site values were then interpolated using inverse distance weighting to attribute values to landscapes within 40 km of a monitoring site. There is typically a 1-year lag between the TCA assessment year and the most recently available ozone data; for instance, for the 2024 TCA Assessment, ozone data from 2021-2023 were used to compute the 3-year average. For the 2023 TCA Assessment, 2022 ozone source data were not available, so the same data from the prior assessment were used. Source data ranges and assessment years (in parentheses) are: 2017-2019 (TCA Assessment 2020), 2018-2020 (TCA Assessment 2021), 2019-2021 (TCA Assessment 2022), 2019-2021 (TCA Assessment 2023), and 2021-2023 (TCA Assessment 2024).
Data Format: Point (site data) interpolated through inverse distance weighting and filtered to areas only within 40 km of an O3 monitor.
Units: N100 is expressed in hours.
Spatial Resolution: 1000 m (1 km).
Source data: EPA Hourly Ozone data.
Additional Resources:
- Details on Method Changes and Source Data Versions
- Overview of the Terrestrial Condition Assessment: TCA Hubsite or Landfire Office Hour Presentation
- Explore the results of the most recent assessment: TCA Interactive Data Viewer
- Learn more about the TCA KPI: TCA Dashboard
*If you have trouble viewing the Dashboard, please submit a Tableau Viewer Access Request.
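A minimal sketch of the N100 calculation described above, assuming an hourly table with one row per site, month and hour (NA where an observation is missing); the simple proportional adjustment shown stands in for, and does not reproduce, the Lefohn, Knudsen, and Shadwick (2011) correction.
library(dplyr)
n100_monthly <- o3_hourly %>%                        #assumed columns: site, month, o3_ppb
  group_by(site, month) %>%
  summarise(coverage = mean(!is.na(o3_ppb)),         #fraction of hourly observations available
            n100_raw = sum(o3_ppb > 100, na.rm = TRUE),  #hours over 100 ppb
            .groups = "drop") %>%
  filter(coverage >= 0.75) %>%                       #75% completeness criterion
  mutate(n100 = n100_raw / coverage)                 #simple stand-in for the missing-data correction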
Occupation data for 2021 and 2022 data files
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).
Latest edition information
For the third edition (October 2022), the 2022 weighting variable was added to the dataset, and the old 2020 weight removed.
https://creativecommons.org/publicdomain/zero/1.0/
Info
This is a dataset I was given to solve for an interview with a transactions company. It is perfect for practicing DAX measures. Dataset: an anonymized sample of credit card deposit attempts over a 12-month period. Main problem: it shows a longitudinally decreasing approval rate from 10/1/2020 to 9/26/2021. Note: this means that the approval rate for credit card deposit attempts has been declining over this time period.
TOOL
You can do this with any tool you like. I used PowerBI and I consider it one of the best tools for solving this exercise.
PARAMETER DESCRIPTIONS
- Appr? = deposit attempt outcome: '1' or '0' = approved or declined.
- CustomerID
- Co Website = online division to which the deposit attempt is directed.
- Processing Co = credit card processing company that is processing the transaction (NB: besides processing companies, a few fraud risk filters are also included here).
- Issuing Bank = bank that has issued the customer's credit card.
- Amount
- Attempt Timestamp
QUESTIONS (Qs 1-5 & 8 worth 10 points. Qs 6-7 worth 20 points. Total = 100 points)
1) What is the dataset's approval rate by quarter?
2) How many customers attempted a deposit of $50 in Sept 2021?
3) How much did the group identified in QUESTION 2 successfully deposit during the month?
4) Of the top 10 banks with the most deposit attempts between $150.00 and $999.99 in 2021, which has the highest approval rate?
5) Without performing any analysis, which two parameters would you suspect of causing the successive quarterly decrease in approval rate? Why?
6) Identify and describe 2 main causal factors of the decline in approval rates seen in Q3 2021 vs Q4 2020.
7) Choose one of the main factors identified in QUESTION 6. How much of the approval rate decline seen in Q3 2021 vs Q4 2020 is explained by this factor?
8) If you had more time, which other analyses would you like to perform on this dataset to identify additional causal factors beyond those identified in QUESTION 6?
POWERBI TIPS:
• Try to add the fewest columns possible. There is no problem with this data, but with big datasets more columns mean slower performance. Make DAX measures instead.
• Redefine each question: picture how to display it and build it in PowerBI, and write down what you'll do. Example: 1) What is the dataset's approval rate by quarter? = line graph, title = “Approval rate by quarter”, x-axis = quarters, y-axis = approval rate.
• Define each column's data type in PowerBI, not in the query. This error persists over the years: you may define the type in the query, but once you load it, it changes back to the default.
• In most datasets, add a calendar table. Very useful.
• GREAT TIP: apply as few filters as possible to the visual and use calculated measures instead. You will need them in the future as the questions become more complex.
• I use this rule for all my reports: measures starting with "Total" are unfiltered. This means that no matter what the filter is, they should always return the same value. You will use them a lot.
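Since the exercise allows any tool, question 1 can also be answered with a short R sketch; deposits, Appr and Attempt_Timestamp are assumed names based on the parameter list above, with Appr taken as numeric 0/1 and the timestamp already parsed as a date-time.
library(dplyr)
library(lubridate)
deposits %>%
  mutate(qtr = paste0(year(Attempt_Timestamp), " Q", quarter(Attempt_Timestamp))) %>%
  group_by(qtr) %>%
  summarise(approval_rate = mean(Appr), attempts = n(), .groups = "drop")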
What does the data show?
Wind-driven rain refers to falling rain blown by a horizontal wind so that it falls diagonally towards the ground and can strike a wall. The annual index of wind-driven rain is the sum of all wind-driven rain spells for a given wall orientation and time period. It’s measured as the volume of rain blown from a given direction in the absence of any obstructions, with the unit litres per square metre per year.
Wind-driven rain is calculated from hourly weather and climate data using an industry-standard formula from ISO 15927–3:2009, which is based on the product of wind speed and rainfall totals. Wind-driven rain is only calculated if the wind would strike a given wall orientation. A wind-driven rain spell is defined as a wet period separated by at least 96 hours with little or no rain (below a threshold of 0.001 litres per m2 per hour).
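As a simplified illustration of that calculation (not the full ISO 15927-3 formula, which applies further constants and exponents), a wall only accumulates wind-driven rain during hours when the wind blows towards it; the hourly column names below are assumptions.
wdr_annual <- function(hourly, wall_orientation) {
  angle   <- (hourly$wind_dir - wall_orientation) * pi / 180
  strikes <- cos(angle) > 0                    #wind would strike this wall orientation
  sum(hourly$wind_speed[strikes] * hourly$rain[strikes] * cos(angle[strikes]), na.rm = TRUE)
}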
The annual index of wind-driven rain is calculated for a baseline (historical) period of 1981-2000 (corresponding to 0.61°C warming) and for global warming levels of 2.0°C and 4.0°C above the pre-industrial period (defined as 1850-1900). The warming between the pre-industrial period and baseline is the average value from six datasets of global mean temperatures available on the Met Office Climate Dashboard: https://climate.metoffice.cloud/dashboard.html. Users can compare the magnitudes of future wind-driven rain with the baseline values.
What is a warming level and why are they used?
The annual index of wind-driven rain is calculated from the UKCP18 local climate projections which used a high emissions scenario (RCP 8.5) where greenhouse gas emissions continue to grow. Instead of considering future climate change during specific time periods (e.g., decades) for this scenario, the dataset is calculated at various levels of global warming relative to the pre-industrial (1850-1900) period. The world has already warmed by around 1.1°C (between 1850–1900 and 2011–2020), so this dataset allows for the exploration of greater levels of warming.
The global warming levels available in this dataset are 2°C and 4°C in line with recommendations in the third UK Climate Risk Assessment. The data at each warming level were calculated using 20 year periods over which the average warming was equal to 2°C and 4°C. The exact time period will be different for different model ensemble members. To calculate the value for the annual wind-driven rain index, an average is taken across the 20 year period. Therefore, the annual wind-driven rain index provides an estimate of the total wind-driven rain that could occur in each year, for a given level of warming.
We cannot provide a precise likelihood for particular emission scenarios being followed in the real world in the future. However, we do note that RCP8.5 corresponds to emissions considerably above those expected under current international policy agreements. The results are also expressed for several global warming levels because we do not yet know which level will be reached in the real climate; the warming level reached will depend on future greenhouse emission choices and the sensitivity of the climate system, which is uncertain. Estimates based on the assumption of current international agreements on greenhouse gas emissions suggest a median warming level in the region of 2.4-2.8°C, but it could either be higher or lower than this level.
What are the naming conventions and how do I explore the data?
Each row in the data corresponds to one of eight wall orientations – 0, 45, 90, 135, 180, 225, 270, 315 compass degrees. This can be viewed and filtered by the field ‘Wall orientation’.
The columns (fields) correspond to each global warming level and two baselines. They are named 'WDR' (Wind-Driven Rain), the warming level or baseline, and ‘upper’ ‘median’ or ‘lower’ as per the description below. For example, ‘WDR 2.0 median’ is the median value for the 2°C projection. Decimal points are included in field aliases but not field names; e.g., ‘WDR 2.0 median’ is ‘WDR_20_median’.
Please note that this data MUST be filtered with the ‘Wall orientation’ field before styling it by warming level. Otherwise it will not show the data you expect to see on the map. This is because there are several overlapping polygons at each location, for each different wall orientation.
To understand how to explore the data, see this page: https://storymaps.arcgis.com/stories/457e7a2bc73e40b089fac0e47c63a578
What do the ‘median’, ‘upper’, and ‘lower’ values mean?
Climate models are numerical representations of the climate system. To capture uncertainty in projections for the future, an ensemble, or group, of climate models are run. Each ensemble member has slightly different starting conditions or model set-ups. Considering all of the model outcomes gives users a range of plausible conditions which could occur in the future.
For this dataset, the model projections consist of 12 separate ensemble members. To select which ensemble members to use, annual wind-driven rain indices were calculated for each ensemble member and they were then ranked in order from lowest to highest for each location.
The ‘lower’ fields are the second lowest ranked ensemble member. The ‘upper’ fields are the second highest ranked ensemble member. The ‘median’ field is the central value of the ensemble.
This gives a median value, and a spread of the ensemble members indicating the range of possible outcomes in the projections. This spread of outputs can be used to infer the uncertainty in the projections. The larger the difference between the lower and upper fields, the greater the uncertainty.
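A minimal sketch of this ranking, using twelve made-up index values for one location and warming level:

# Twelve hypothetical annual wind-driven rain indices, one per ensemble member.
wdr_members <- c(410, 395, 430, 460, 405, 470, 445, 415, 425, 455, 400, 440)

ranked     <- sort(wdr_members)
wdr_lower  <- ranked[2]                    # 'lower'  = second lowest member
wdr_upper  <- ranked[length(ranked) - 1]   # 'upper'  = second highest member
wdr_median <- median(ranked)               # 'median' = central value of the ensemble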
‘Lower’, ‘median’ and ‘upper’ are also given for the baseline periods as these values also come from the model that was used to produce the projections. This allows a fair comparison between the model projections and recent past.
Data source
The annual wind-driven rain index was calculated from hourly values of rainfall, wind speed and wind direction generated from the UKCP Local climate projections. These projections were created with a 2.2km convection-permitting climate model. To aid comparison with other models and UK-based datasets, the UKCP Local model data were aggregated to a 5km grid on the British National Grid; the 5 km data were processed to generate the wind-driven rain data.
Useful links
Further information on the UK Climate Projections (UKCP). Further information on understanding climate data within the Met Office Climate Data Portal.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NOTE: The gzipped files in this upload have mistakenly been compressed twice. To decompress the files, please run e.g.: gunzip -c CO1_asv_counts_SE.tsv.gz | gunzip -c > CO1_asv_counts_SE.tsv
The Insect Biome Atlas project was supported by the Knut and Alice Wallenberg Foundation (dnr 2017.0088). The project analyzed the insect faunas of Sweden and Madagascar, and their associated microbiomes, mainly using DNA metabarcoding of Malaise trap samples collected in 2019 (Sweden) or 2019–2020 (Madagascar).
Please cite this version of the dataset as: Miraldo A, Iwaszkiewicz-Eggebrecht E, Sundh J, Lokeshwaran M, Granqvist E, Goodsell R, Andersson AF, Lukasik P, Roslin T, Tack A, Ronquist F. 2024. Processed ASV data from the Insect Biome Atlas Project, version 3. doi:10.17044/scilifelab.27202368.v3 or https://doi.org/10.17044/scilifelab.27202368.v3
This dataset contains the results from bioinformatic processing of version 1 of the amplicon sequence variant (ASV) data from the Insect Biome Atlas project (Miraldo et al. 2024), that is, the cytochrome oxidase subunit 1 (CO1) metabarcoding data from Malaise trap samples processed using the FAVIS mild lysis protocol (Iwaszkiewicz et al. 2023). The bioinformatic processing involved: (1) taxonomic assignment of ASVs, (2) chimera removal; (3) clustering into OTUs; (4) noise filtering and (5) cleaning. The clustering step involved resolution of the taxonomic annotation of the cluster and identification of a representative ASV. The noise filtering step involved removal of ASV clusters identified as potentially originating from nuclear mitochondrial DNA (NUMTs) or representing other types of error or noise. The cleaning step involved removal of ASV clusters present in >5% of negative control samples. ASV taxonomic assignments, ASV cluster designations, consensus taxonomies and summed counts of clusters in the sequenced samples are provided in compressed tab-separated files. Sequences of cluster representatives are provided in compressed FASTA format files. The bioinformatic processing pipeline is further described in Sundh et al. (2024). NB! All result files include ASVs and clusters that represent biological and synthetic spike-ins.
Methods
Taxonomic assignment
ASVs were taxonomically assigned using kmer-based methods implemented in a Snakemake workflow available here (https://github.com/insect-biome-atlas/happ). Specifically, ASVs were assigned a taxonomy using the SINTAX algorithm in vsearch (v2.21.2) using a CO1 database constructed from the Barcode Of Life Data System (Sundh 2022). ASVs assigned to Class 'Insecta' or 'Collembola' but unassigned at lower taxonomic ranks were then placed into a reference phylogeny of 49,325 insect species (represented by 49,338 sequences) using the phylogenetic placement tool EPA-NG, with subsequent taxonomic assignments using GAPPA. Assignments at the order level in this second pass were used to update the first kmer-based assignments, but only at the order level, leaving child ranks with the ‘unclassified’ prefix.
Chimera removal
The workflow first identifies chimeric ASVs in the input data using the ‘uchime_denovo’ method implemented in vsearch. This was done with a so-called ‘strict samplewise’ strategy where each sample was analysed separately (hence the ‘samplewise’ notation), only comparing ASVs present in the same sample. Further, ASVs had to be identified as chimeric in all samples where they were present (corresponding to the ‘strict’ notation) in order to be removed as chimeric.
ASV clustering
Non-chimeric sequences were then split by family-level taxonomic assignments and ASVs within each family were clustered in parallel using swarm (v3.1.0) with differences=15. Representative ASVs were selected for each generated cluster by taking the ASV with the highest relative abundance across all samples in a cluster. Counts were generated at the cluster level by summing over all ASVs in each cluster.
Consensus taxonomy
A consensus taxonomy was created for each cluster by taking into account the taxonomic assignments of all ASVs in a cluster as well as the total abundance of ASVs. For each cluster, starting at the most resolved taxonomic level, each unique taxonomic assignment was weighted by the sum of read counts of ASVs with that assignment. If a single weighted assignment made up 80% or more of all weighted assignments at that rank, that taxonomy was propagated to the ASV cluster, including parent rank assignments. If no taxonomic assignment was above the 80% threshold, the algorithm continued to the parent rank in the taxonomy. Taxonomic assignments at any available child ranks were set to the consensus assignment prefixed with ‘unresolved’.
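As a minimal, hedged sketch of this 80% weighted-consensus rule (the data frame asv_tax and its columns are hypothetical, not part of the released files):

# Hypothetical sketch of the consensus rule at a single rank for one cluster.
# 'asv_tax' is an assumed data frame with one row per ASV in the cluster:
# a taxonomic assignment at the rank being tested and its summed read count.
consensus_at_rank <- function(asv_tax, rank_col = "genus", threshold = 0.8) {
  w   <- tapply(asv_tax$count, asv_tax[[rank_col]], sum)  # weight assignments by reads
  top <- which.max(w)
  # Return the winning taxon if it carries >= 80% of the weight;
  # otherwise NA, signalling a fall-back to the parent rank.
  if (w[top] / sum(w) >= threshold) names(w)[top] else NA_character_
}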
Noise filtering and cleaning
The clustered data was further cleaned from NUMTs and other types of noise using the NEEAT algorithm, which takes taxonomic annotation, correlations in occurrence across samples (‘echo signal’) and evolutionary signatures into account, as well as cluster abundance (Sundh et al., 2024). We used default settings for all parameters in the evolutionary and distributional filtering steps, and removed clusters unassigned at the order level and with less than 3 reads summed across each dataset.
As a last clean-up step in the noise filtering, clusters containing at least one ASV present in more than 5% of blanks were removed. Further, we removed ASVs assigned to a reference sequence in the BOLD database annotated as Zoarces gillii (BOLD:AEB5125), a fish found between Japan and eastern Korea. Closer inspection revealed that this was a mis-annotated bacterial sequence, and ASVs assigned to this reference most likely represent bacterial sequences in our dataset. This record has been deleted from BOLD after our custom reference database was constructed.
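A simplified, hedged sketch of the blank-based cleaning step described above, collapsed to the cluster level; 'counts' is an assumed cluster-by-sample matrix and 'is_blank' an assumed logical vector flagging negative-control columns:

# Fraction of negative controls (blanks) in which each cluster appears.
blank_prevalence <- rowMeans(counts[, is_blank, drop = FALSE] > 0)
# Keep clusters present in at most 5% of blanks; the released data apply the
# rule at the ASV level, so this cluster-level version is only illustrative.
cleaned_counts <- counts[blank_prevalence <= 0.05, , drop = FALSE]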
The chimera filtering and ASV clustering methods have been implemented in a Snakemake workflow available here (https://github.com/insect-biome-atlas/happ). This workflow takes as input:
Cleaning of ASV clusters in controls and identification of spike-ins was done with a custom R script available here (https://github.com/insect-biome-atlas/utils).
Available data
Processed ASV data files
ASV taxonomic assignments, non-chimeric ASV cluster designations, consensus taxonomies, sequences of cluster representatives and summed counts of clusters in the sequenced samples are provided in compressed tab-separated files. Files are organized by country (Sweden and Madagascar), marked by the suffixes SE and MG, respectively.
Taxonomic assignments
The files asv_taxonomy_[SE|MG].tsv.gz are tab-separated files with taxonomic assignments using SINTAX+EPA-NG for all ASVs. Columns:
The files asv_taxonomy_sintax_[SE|MG].tsv.gz, asv_taxonomy_epang_[SE|MG].tsv.gz and asv_taxonomy_vsearch_[SE|MG].tsv.gz have the same structure, but contain results from assignments with SINTAX, EPA-NG and VSEARCH, respectively.
Cluster assignments
The files cluster_taxonomy_[SE|MG].tsv are tab-separated files containing all non-chimeric ASVs (that is, the ASVs passing the chimera-filtering step) with their corresponding taxonomic and cluster assignments. Columns:
Sequences of cluster representatives
The files cluster_reps_[SE|MG].fasta are text files in FASTA format with representative sequences for each cluster. The fasta headers have the format “>ASV_ID CLUSTER_NAME”.
Consensus taxonomy
The files cluster_consensus_taxonomy_[SE|MG].tsv are tab-separated files with consensus taxonomy of each generated ASV cluster. Columns are the same as in asv_taxonomy_[SE|MG].tsv.
Noise-filtered data
The files prefixed with 'noise_filtered' contain data that has been cleaned from NUMTs and other types of noise using the NEEAT algorithm. The files contain the same information as the cluster files, but only for clusters that passed the noise filtering step.
Cleaned noise-filtered data
The files prefixed with 'cleaned_noise_filtered' contain data that has been cleaned from NUMTs and other types of noise using the NEEAT algorithm, and further cleaned from clusters present in >5% of blanks. The files contain the same information as the cluster files, but only for clusters that passed the noise filtering and cleaning steps.
Additional files
The files removed_control_tax_[SE|MG].tsv.gz contain the ASV clusters removed from each dataset as part of cleaning.
The files spikeins_tax_[SE|MG].tsv.gz contain the taxonomic assignments of the biological spike-ins identified.
References:
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
There is a paucity of data on research funding levels for male reproductive health (MRH). We investigated the research funding for MRH and infertility by examining publicly accessible web databases from the UK and USA government funding agencies. Information on the funding was collected from the UKRI-GTR, the NIHR’s Open Data Summary, and the USA’s NIH RePORT web databases. Funded projects between January 2016 and December 2019 were recorded and funding support was divided into three research categories: (i) male-based; (ii) female-based; and (iii) not-specified. Between January 2016 and December 2019, UK agencies awarded a total of £11,767,190 to 18 projects for male-based research and £29,850,945 to 40 projects for female-based research. There was no statistically significant difference in the median funding grant awarded within the male-based and female-based categories (p=0.56, W=392). The USA NIH funded 76 projects totalling $59,257,746 for male-based research and 99 projects totalling $83,272,898 for female-based research. Again, there was no statistically significant difference in the median funding grant awarded between the two research categories (p=0.83, W=3834). This is the first study examining funding granted by the main government research agencies from the UK and USA for MRH. These results should stimulate further discussion of the challenges of tackling male infertility and reproductive health disorders and formulating appropriate investment strategies. Methods Experimental Design: Publicly accessible UK Research and Innovation (UKRI), National Institute for Health Research (NIHR), and National Institutes of Health (NIH) funding agency databases covering awards from January 2016 to December 2019 were examined (see Supplementary Table 1). Following the inclusion and exclusion criteria outlined within Supplementary Tables 2 and 3, funding data were collected on research proposals investigating infertility and reproductive health. For simplicity, these are referred to collectively as ‘infertility research’. As the primary focus of this research is on infertility, the data were divided into three main categories: (i) male-based, (ii) female-based, and (iii) not-specified (Supplementary Table 2). The first two groups covered projects whose primary aim, based on the information presented in the research abstracts, timeline summaries and/or impact statements, was male- or female-focussed. “Not-specified” includes research projects that have either not specified a primary focus towards either male or female or have explicitly stated a focus on both. The process was conducted and reviewed by E.G. with C.L.R.B. Total funding for all three groups, funding over time, and comparison with overall funding for a particular agency were examined. Briefly, E.G. retrieved the primary data and produced the first set of data for discussion with C.L.R.B. Both went through the complete list, discussed each study/project, and decided whether: (a) it should be included or not, and (b) which category it fell under (male-, female-, or not-specified). The abstracts, which were almost always available and provided by each research study, were all examined and scrutinised by both E.G. and C.L.R.B. together. If there was clear disagreement between E.G. and C.L.R.B., which was very rare, the project would not be included. UK Data Collection: From April 2018, the UK research councils, Innovate UK, and Research England have been reported under one organization, the UKRI (2019).
The councils independently fund research projects according to their respective visions and missions; however, until 2018/19, their annual funding expenditures were reported under the UKRI’s annual reports and budgets. The UKRI’s Gateway to Research (UKRI-GTR) web database allows users to analyse the information provided on taxpayer-funded research. Relevant search terms such as “male infertility” or “female reproductive health” (see Supplementary Table 2) were applied with appropriate database filters (Supplementary Table 1). The project award relevance was determined by assessing the objectives in project abstracts, timeline summaries, and planned impacts. Supplementary Tables 1, 2 and 3 provide the search filters and the reference criteria for inclusion/exclusion utilized for analysis. The UKRI-GTR provides the total funding granted to the projects within a designated period. Data obtained from the NIHR had minor differences. The NIHR has 6 datasets. The Open Data Summary View dataset was used as it provided details on funded projects, grants, summary abstracts, and project dates. Like the UKRI data, the NIHR Excel datasheet had specific search terms and filters applied to sift out irrelevant projects (Supplementary Tables 1-3). The UKRI councils and NIHR report their annual expenditure and budgets for 1st April to 31st March. Thus, projects fall under the funding period in which their research activities begin (e.g. if a project’s research activities run from May 20th, 2017, to March 20th, 2019, the project is categorized under the funding period 2017/18). The projects collected began their investigations between January 2016 and December 2019; therefore, 5 consecutive funding periods were examined (2015/16, 2016/17, 2017/18, 2018/19, and 2019/20). The UK data collection period ran from October 2020 to December 2020. USA Data Collection: The NIH has a Research Portfolio Online Reporting Tools (RePORT) site providing access to their research activities, such as previously funded research, active research projects, and information on NIH’s annual expenditures. The RePORT-Query database has similar features to the UKRI-GTR and NIHR databases, such as providing information on project abstracts, research impact, start- and end-dates, funding grants, and type of research. Like the UK data collection, appropriate search terms were entered with the database filters applied, following the same inclusion-exclusion criteria (Supplementary Tables 1, 2, and 3). The UK and US agencies present data on funded research under different calendar and funding periods because the US’ federal tax policy requires federal bodies to report all funding expenses under a fiscal year (FY). The NIH’s FY follows a calendar period from October 1st to September 30th (e.g., FY2016 comprises funding activity from October 1st, 2015, to September 30th, 2016). Projects running over one calendar period are reported several times under consecutive fiscal years and the funds are divided according to the annual period of the project’s activity. During data collection, 74 projects were found to be active with incomplete funding sums, as the NIH divides the grants according to the budgeting period of every FY. The NIH is in the process of granting funds for FY2021, so projects ending in 2020 or 2021 provide a complete funding sum. For the active projects ending after 2021, incomplete funding data is provided.
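To make the fiscal-year convention concrete, here is a small, hedged R helper implementing the 1 October to 30 September rule described above; the function name and usage are illustrative only, not part of the published methods.

# Assign an NIH fiscal year to a project start date: months October-December
# belong to the following year's FY, all other months to the current year's FY.
nih_fiscal_year <- function(start_date) {
  d <- as.Date(start_date)
  ifelse(as.integer(format(d, "%m")) >= 10,
         as.integer(format(d, "%Y")) + 1,
         as.integer(format(d, "%Y")))
}
nih_fiscal_year("2015-10-01")   # 2016 (FY2016 runs 1 Oct 2015 to 30 Sep 2016)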
It is assumed the funding will increase in value by the time the research ends in the future, but the final awarded sum is unknown. To remain consistent with the UK data, projects granted funding are totalled as one figure and recorded under the FY the project first began research, whether they are active or completed. Thus, US funding is referred to as “Current Total Funding”. When going through the RePORTER database, the NIH presents the same research project multiple times for every funded fiscal year, with consecutive project reference IDs. Therefore, for simplicity, we only included the first project reference ID. For more information on deciphering NIH project IDs, see https://era.nih.gov/files/Deciphering_NIH_Application.pdf. For the USA, the initial data collection period ran from October 2020 to December 2020 but then restarted for a brief period in January 2021 to add the remaining funding values for some of the active research projects. Data Analysis: The data were divided into three main groups and organized by the funding period or FY in which the project was first awarded. RStudio (Version 1.3.1093) was utilized for the data analysis. Box-and-whisker plots are presented with rounded P-values. Kruskal-Wallis and Wilcoxon rank-sum tests were used to assess statistical significance. The data were independently collected and could not be assumed to follow a normal distribution, so rank-based, non-parametric tests (Kruskal-Wallis and Wilcoxon rank-sum) were used. Research Project Details Included in the Collection Datasets: For both the UK and USA data, we included the following details:
The project (or study) titles
The Project IDs (also referred to as Project Reference or Project Number)
The project Start and End Dates
The project's Status (identified by the end dates or if explicitly stated in the database)
The Funding Organisation (for the UK) and Admin Institute (for the USA) that are funding the research
The project Category (i.e. Research Grants or Fellowships)
The Amount Granted (for the USA, the funding values were summed up to the most recent awarding date).
Rearranging/Processing Data for Analysis: After the data collection was completed, the data were processed into a simpler format in Notepad in order to perform the statistical analyses using RStudio. Only the essential details were included and organised so that RStudio would recognise and analyse the information effectively and efficiently. The project Type (male, female or not-specified), the funding sum for the respective research project Type, and the funding period (UK) / FY (USA) were included. These details were then arranged appropriately to produce box-and-whisker plots with P-values, perform the chosen statistical analysis tests, and produce the data statistics in RStudio. As mentioned earlier, the funding periods/fiscal years were assigned following the timeframes set out by the respective countries.
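As an illustrative sketch of the analysis described above (the data frame 'funding' and its column names are assumptions, not the authors' actual objects):

# Hypothetical data frame: one row per funded project, with the research
# category ("male", "female" or "not-specified") and the total amount awarded.
male_vs_female <- subset(funding, category %in% c("male", "female"))

wilcox.test(amount ~ category, data = male_vs_female)   # Wilcoxon rank-sum, two groups
kruskal.test(amount ~ category, data = funding)         # Kruskal-Wallis across all three groups
boxplot(amount ~ category, data = funding,
        ylab = "Funding awarded per project")           # box-and-whisker plot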
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of multi-resolution X-Ray micro-tomography images of two Bentheimer sandstone rock cores. The rock cores were first used experimentally in [1] with further modelling in [2]. This new dataset is used directly in the publication [3] - preprint available at https://arxiv.org/abs/2111.01270.
The original dataset from [1] is hosted on the BGS National Geoscience Data Centre, ID #130625, at dx.doi.org/10.5285/5f899de8-4085-4370-a45e-e613f27e8f1d, and there is also a subvolume image dataset, for easier download, available on the Digital Rocks Portal, project 229, DOI:10.17612/KT0B-SZ28 at digitalrocksportal.org/projects/229.
The images provided herein are from two distinct Bentheimer rock cores -- core 1 and core 2. The cores have a diameter of 12.35 mm, lengths of 73.2 mm and 64.7 mm, core-averaged porosities of 0.203 and 0.223, and permeabilities of 1.636 D and 0.681 D for cores 1 and 2, respectively. Core 2 has a clear low-permeability lamination occurring at 2/3 of the total core length, whereas core 1 has a general fining towards the outlet of the core, creating a reduction in porosity [1].
The images were acquired with a Zeiss Versa 510 X-Ray CT scanner. We acquired images of two subvolumes from each core, at locations 1/3rd (subvolume 1) and 2/3rds (subvolume 2) of the way along the core length, at resolutions of 2, 6 and 18 microns. We refer to the 2 micron images as high-resolution (HR), the 6 micron images as low-resolution (LR) and the 18 micron images as very-low-resolution (VLR). There are also super-resolution (SR) images created at 2 micron resolution from the LR images, using a deep-learning algorithm. There are also cubic interpolation images created from the LR image - these are labelled 'bicubic'. These have a resolution of 2 microns, and size equal to the HR and SR images. Details of the SR and LR Bicubic generation are found in [3]. The following scanning protocols were used for the direct imaging:
2 micron images: --We use a 4x microscope objective, an exposure time of 8s, 2x averaged binning, 9001 projections, a scan voltage of 80kV and a power of 7W. Each scan takes approximately 24 hours.
6 micron images: --We use a flat panel detector, an exposure time of 0.7s, 10x repeat frames, 1x averaged binning, 2401 projections, a scan voltage of 80kV and a power of 7W. The cone angle is 14.46 degrees and the fan angle is 22.2 degrees. Each scan takes approximately 1 hour.
18 micron images: --We use a 0.4x microscope objective, an exposure time of 1s, 10x repeat frames, 1x averaged binning, 2401 projections, a scan voltage of 80kV and a power of 7W. The cone angle is 12.65 degrees and the fan angle is 12.65 degrees. Each scan takes approximately 2 hours.
We present 4 sets of the images with different levels of processing. All images are mutually registered to each other. Each image filename has a Core#_Subvol#_resolution identifier, either with the actual resolution (e.g. 6) or the short form (e.g. LR). The following name endings are used (a short example of reading the .raw volumes follows this list):
(1) - '_16bit_LE.raw'. These are the .raw images of little-endian format. Preceding this filename is also the cubic image side length in voxels, e.g. _75cube. 12 images in total.
(2) - '_16bit_LE_normalised.raw'. These are the .raw images of little-endian format with normalised greyscale values following the procedure in [1]. Preceding this filename is also the cubic image side length in voxels, e.g. _75cube. 12 images in total.
(3) - 'Core1_Subvol1_HR' etc. These are the .tiff images of (2) above, which have been converted to 8 bit. Includes bicubic interpolation images and SR images, but no 18 micron images, since these were not used in the analysis of [3]. 16 images in total.
(4) - 'Core1_Subvol1_HR_filtered' etc. These are the .tiff images from (3) above, which have been filtered using non-local means filtering. More details are found in [3]. Note there are no SR images here since they are already essentially filtered, and included in (3) above. 12 images in total.
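For orientation, a hedged sketch of reading one of the '_16bit_LE.raw' volumes in R follows; the filename, cube side length and the assumption of unsigned 16-bit values are illustrative only, so check them against the actual file before use.

# Read a little-endian 16-bit .raw volume into a 3D array.
# Filename and side length are placeholders following the naming pattern above.
side <- 75
con  <- file("Core1_Subvol1_HR_75cube_16bit_LE.raw", "rb")
vox  <- readBin(con, what = "integer", n = side^3, size = 2,
                signed = FALSE, endian = "little")   # assumes unsigned 16-bit
close(con)
vol <- array(vox, dim = c(side, side, side))          # reshape to a cubic volume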
References
[1] Jackson, S.J., Lin, Q. and Krevor, S. 2020. Representative Elementary Volumes, Hysteresis, and Heterogeneity in Multiphase Flow from the Pore to Continuum Scale. Water Resources Research, 56(6), e2019WR026396
[2] Zahasky, C., Jackson, S.J., Lin, Q., and Krevor, S. 2020. Pore network model predictions of Darcy-scale multiphase flow heterogeneity validated by experiments. Water Resources Research, 56(6), e2019WR026708.
[3] Jackson, S.J, Niu, Y., Manoorkar, S., Mostaghimi, P. and Armstrong, R.T. 2021. Deep learning of multi-resolution X-Ray micro-CT images for multi-scale modelling. Under review, preprint available at https://arxiv.org/abs/2111.01270
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OFFICIAL
Important: Our technical support team is available to assist you during business hours only. Please keep in mind that we can only address technical difficulties during these hours. When using the product to make decisions, please take this into consideration.
Abstract
This spatial product shows ‘near real-time’ bushfire and prescribed burn extents for all jurisdictions who have the technical ability or appropriate licence conditions to provide this information into a national product.
This is a scientific product and should not be used for safety of life decisions. Please refer to jurisdictional emergency response agencies for incident warnings and information.
Currency
Date Created: October 2025
Modification Frequency: Every 15 Minutes
Data Extent
Coordinate Reference: WGS84
Spatial Extent: North: -9° South: -44° East: 154° West: 112°
Source Information
Previous project teams identified source data through jurisdictional websites and the Emergency Management LINK catalogue. Sources for the current iteration of this dataset have been confirmed by each jurisdiction through the Emergency Management Spatial Information Network (EMSINA) National and EMSINA Developers networks.
This Webservice contains authoritative data sourced from:
Australian Capital Territory - Emergency Service Agency (ESA)
New South Wales - Rural Fire Service (RFS)
Queensland - Queensland Fire and Emergency Service (QFES)
South Australia - Country Fire Service (CFS)
Tasmania - Tasmania Fire Service (TFS)
Victoria – Department of Energy, Environment and Climate Action (DEECA)
Western Australia – Department of Fire and Emergency Services (DFES) and Department of Biodiversity, Conservation and Attractions (DBCA)
This webservice does not contain data from:
Northern Territory – Bushfires NT
Known Limitations:
This dataset does not contain information from the Northern Territory Government.
This dataset contains a subset of the Queensland bushfire boundary data. The Queensland ‘Operational’ feed that is consumed within this National Database displays the last six (6) months of incident boundaries. In order to make this dataset best represent a ‘near real-time’ or current view of operational bushfire boundaries, Geoscience Australia has filtered the Queensland data to only incorporate the last one (1) week’s data.
Geoscience Australia is aware that duplicate data (features) may appear within this dataset. This duplicate data is commonly represented in the regions around state borders where it is operationally necessary for one jurisdiction to understand cross border situations. Care must be taken when summing the values to obtain a total area burnt.
The data within this aggregated national product is a spatial representation of the input data received from the custodian agencies. Therefore, data quality and data completion will vary. If you wish to access more information about specific jurisdictional data and/or data feature(s), it is strongly recommended that you contact the appropriate custodian.
Attribute Accuracy: The accuracy of the data attributes within this webservice is reliant on each jurisdictional source and the information they elect to publish into their Operational/Going Bushfire Boundary webservices.
Data Completeness: The completeness of the data within this webservice is reliant on each jurisdictional source and the information they elect to publish into their Operational/Going Bushfire Boundary webservices. In the case of Queensland’s data contribution, please see the ‘Known Limitations’ section for full details.
Schema: The following schema table covers all the core data fields. Note: Geoscience Australia has, where possible, attempted to align the data to the National Current Incident Extent Feeds Data Dictionary. However, this has not been possible in all cases. Geoscience Australia has not included attributes added automatically by spatial software processes in the table below.
Catalog entry: National Bushfire Extents - Near Real-Time
Lineage statement
Versions 1 and 2 (2019/20): This dataset was first built by EMSINA, Geoscience Australia, and Esri Australia staff in early January 2020 in response to the Black Summer Bushfires. The product was aimed at providing a nationally consistent dataset of bushfire boundaries. Version 1 was released publicly on 8 January 2020 through Esri AGOL software. Version 2 of the product was released in mid-February as EMSINA and Geoscience Australia began automating the product. The release of version 2 exhibited a reformatted attribute table to accommodate these new automation scripts. The product was continuously developed by the three entities above until early May 2020, when both the scripts and data were handed over to the National Bushfire Recovery Agency. The EMSINA Group formally ended their technical involvement with this project on June 30, 2020.
Version 3 (2020/21): A 2020/21 version of the National Operational Bushfire Boundaries dataset was agreed to by the Australian Government. It continued to extend upon EMSINA’s 2019/20 Version 2 product. This product was owned and managed by the Australian Government Department of Home Affairs, with Geoscience Australia identified as the technical partner responsible for development and delivery. Work on Version 3 began in August 2020 with delivery of this product occurring on 14 September 2020.
Version 4 (2021/22): A 2021/22 version of the National Operational Bushfire Boundaries dataset was produced by Geoscience Australia. This product was owned and managed by Geoscience Australia, who provided both development and delivery. Work on Version 4 began in August 2021 with delivery of this product occurring on 1 September 2021. The dataset was discontinued in May 2022 because of insufficient Government funding to sustain the Project.
Version 5 (2023/25): A 2023/25 version of the National Near Real-Time Bushfire Boundaries dataset is produced by Geoscience Australia under funding from the National Bushfire Intelligence Capability (NBIC) - CSIRO. NBIC and Geoscience Australia also partnered with the EMSINA Group to assist with accessing and delivering this dataset. This dataset was the first in which the jurisdictional attributes were aligned to AFAC’s draft National Going Bushfire Schema and Data Dictionary. Work on Version 5 began in August 2023 and was released in late 2023 under formal access arrangements with the States and Territories.
Version 6 (2025/26) - Current Version: A 2025/26 version of the National Near Real-Time Bushfire Extents dataset is produced by Geoscience Australia under project funding from the Department of Climate Change, Energy, the Environment and Water, and the National Emergency Management Agency, with contributions to the National Bushfire Intelligence Capability. This dataset is built directly off Version 5, incorporating improvements from AFAC's finalised National Going Bushfire Schema and Data Dictionary. Work on Version 6 started in September 2025 and was finalised and released in mid-October 2025. This iteration of the dataset is funded until 30 June 2026.
Data dictionary
Attribute name | Field Type | Description |
fire_id | String | ID attached to fire (e.g. incident ID, Event ID, Burn ID). |
fire_name | String | Incident name, if available. |
fire_type | String | Binary variable to describe whether a fire was a bushfire or prescribed burn. |
ignition_date | Date | The date of the ignition of a fire event. Date and time are captured in jurisdiction local time and converted to UTC. Please note when viewed in ArcGIS Online, the date is converted from UTC to your local time. |
capt_date | Date | The date the incident boundary was captured or updated. Date and time are captured in jurisdiction local time and converted to UTC. Please note when viewed in ArcGIS Online, the date is converted from UTC to your local time. |
capt_method | String | Categorical variable to describe the source of data used for defining the spatial extent of the fire. |
area_ha | Double | Burnt area in hectares. Currently a calculated field so that all area calculations are done in the same map projection. Jurisdictions supply area in the appropriate projection to match state incident reporting systems. |
perim_km | Double | Burnt perimeter in kilometres. Calculated field so that all area calculations are done in the same map projection. Jurisdiction preference is that supplied perimeter calculations are used for consistency with jurisdictional reporting. |
state | String | State custodian of the data. NOTE: Currently some states use and have in their feeds cross border data. |
agency | String | Agency that is responsible for the incident. |
date_retrieved | Date | The date and time that Geoscience Australia retrieved this data from the jurisdictions, stored as UTC. Please note when viewed in ArcGIS Online, the date is converted from UTC to your local time. |
Contact
Client Services at Geoscience Australia, clientservices@ga.gov.au
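To illustrate the kind of recency filtering described under Known Limitations, a hedged sketch follows; it assumes the layer's attribute table has been read into a data frame 'extents' with the fields above, that capt_date is already parsed as POSIXct in UTC, and that Queensland records carry the state code "QLD" (all assumptions).

# Keep all non-Queensland records, plus Queensland records captured within
# the last week, mirroring the one-week filter applied to the QLD feed.
one_week_ago <- Sys.time() - 7 * 24 * 3600
recent       <- subset(extents, state != "QLD" | capt_date >= one_week_ago)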
This is NOT a raw population dataset. We use our proprietary stack to combine detailed 'WorldPop' UN-adjusted, sex and age structured population data with a spatiotemporal OD matrix.
The result is a dataset where each record indicates how many people can be reached in a fixed timeframe (3 Hours in this case) from that record's location.
The dataset is broken down into sex and age bands at 5-year intervals, e.g. male 25-29 (m_25), and also contains a set of features detailing the representative percentage of the total that the count represents.
The dataset provides 76174 records, one for each sampled location. These are labelled with an h3 index at resolution 7 - this allows easy plotting and filtering in Kepler.gl / Deck.gl / Mapbox, or easy conversion to a centroid (lat/lng) or the representative geometry of the hexagonal cell for integration with your geospatial applications and analyses.
An h3 resolution of 7 corresponds to a hexagonal cell area of approximately 1.9928 sq miles (~5.1613 sq km).
Higher resolutions or alternate geographies are available on request.
More information on the h3 system is available here: https://eng.uber.com/h3/
WorldPop data provides a population count using a grid of 1 arc-second intervals and is available for every geography.
More information on the WorldPop data is available here: https://www.worldpop.org/
One of the main use cases historically has been in prospecting for site selection, comparative analysis and network validation by asset investors and logistics companies. The data structure makes it very simple to filter out areas which do not meet requirements such as being able to access 70% of the German population within 4 hours by truck, and to show only the areas which do exhibit this characteristic.
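A minimal sketch of that kind of prospecting filter, assuming the records have been loaded into a data frame 'reach' with a total-reachable-population column 'total'; the object, column name and population figure are placeholders.

# Keep only h3 cells from which at least 70% of an assumed national
# population (~83.2 million for Germany) can be reached within the timeframe.
de_pop       <- 83.2e6
reachable_70 <- subset(reach, total >= 0.7 * de_pop)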
Clients often combine different datasets either for different timeframes of interest, or to understand different populations, such as that of the unemployed, or those with particular qualifications within areas reachable as a commute.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GISE-51 is an open dataset of 51 isolated sound events based on the FSD50K dataset. The release also includes the GISE-51-Mixtures subset, a dataset of 5-second soundscapes with up to three sound events synthesized from GISE-51. The GISE-51 release attempts to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research and the freedom to adapt the included isolated sound events for domain-specific applications, which was not possible using existing large-scale weakly labelled datasets. The GISE-51 release also includes accompanying code for baseline experiments, which can be found at https://github.com/SarthakYadav/GISE-51-pytorch.
Citation
If you use the GISE-51 dataset and/or the released code, please cite our paper:
Sarthak Yadav and Mary Ellen Foster, "GISE-51: A scalable isolated sound events dataset", arXiv:2103.12306, 2021
Since GISE-51 is based on FSD50K, if you use GISE-51 kindly also cite the FSD50K paper:
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.
About GISE-51 and GISE-51-Mixtures
The following sections summarize key characteristics of the GISE-51 and the GISE-51-Mixtures datasets, including details left out from the paper.
GISE-51
See meta/lbl_map.csv for the complete vocabulary. silence_thresholds.txt lists class bins and their corresponding volume threshold. Files that were determined by sox to contain no audio at all were manually clipped. Code for performing silence filtering can be found in scripts/strip_silence_sox.py in the code repository.
GISE-51-Mixtures
LICENSE
All audio clips (i.e., found in isolated_events.tar.gz) used in the preparation of the Glasgow Isolated Events Dataset (GISE-51) are designated Creative Commons and were obtained from FSD50K. The source data in isolated_events.tar.gz is based on the FSD50K dataset, which is licensed as Creative Commons Attribution 4.0 International (CC BY 4.0) License.
GISE-51 dataset (including GISE-51-Mixtures) is a curated, processed and generated preparation, and is released under Creative Commons Attribution 4.0 International (CC BY 4.0) License. The license is specified in the LICENSE-DATASET file in license.tar.gz.
Baselines
Several sound event recognition experiments were conducted, establishing baseline performance on several prominent convolutional neural network architectures. The experiments are described in Section 4 of our paper, and the implementation for reproducing these experiments is available at https://github.com/SarthakYadav/GISE-51-pytorch.
Files
GISE-51 is available as a collection of several tar archives. All audio files are PCM 16-bit, 22050 Hz. The following lists the contents of these files in detail:
isolated_events.tar.gz: The core GISE-51 isolated events dataset containing train, val and eval subfolders.
meta.tar.gz: contains lbl_map.json
noises.tar.gz: contains background noises used for GISE-51-Mixtures soundscape generation
mixtures_jams.tar.gz: This file contains annotation files in .jams format that, alongside isolated_events.tar.gz and noises.tar.gz, can be reused to generate exact GISE-51-Mixtures soundscapes. (Optional: we provide the complete set of GISE-51-Mixtures soundscapes as independent tar archives.)
train.tar.gz: GISE-51-Mixtures train set, containing 60k synthetic soundscapes.
val.tar.gz: GISE-51-Mixtures val set, containing 10k synthetic soundscapes.
eval.tar.gz: GISE-51-Mixtures eval set, containing 10k synthetic soundscapes.
train_*.tar.gz: These are tar archives containing training mixtures with varying numbers of soundscapes, used primarily in Section 4.1 of the paper, which compares val mAP performance v/s the number of training soundscapes. A helper script is provided in the code release, prepare_mixtures_lmdb.sh, to prepare data for experiments in Section 4.1.
pretrained-models.tar.gz: Contains model checkpoints for all experiments conducted in the paper. More information on these checkpoints can be found in the code release README.
state_dicts for use with transfer learning experiments.
license.tar.gz: contains dataset license info.
silence_thresholds.txt: contains volume thresholds for various sound event bins used for silence filtering.
Contact
In case of queries and clarifications, feel free to contact Sarthak at s.yadav.2@research.gla.ac.uk. (Adding [GISE-51] to the subject of the email would be appreciated!)
B.1 Buildings Inventory
The Building Footprints data layer is an inventory of buildings in Southeast Michigan representing both the shape of the building and attributes related to the location, size, and use of the structure. The layer was first developed in 2010 using heads-up digitizing to trace the outlines of buildings from 2010 one-foot resolution aerial photography. This process was later repeated using six-inch resolution imagery in 2015 and 2020 to add recently constructed buildings to the inventory. Due to differences in spatial accuracy between the 2010 imagery and later imagery sources, footprint polygons delineated in 2010 may appear shifted compared with more recent imagery.
Building Definition
For the purposes of this data layer, a building is defined as a structure containing one or more housing units AND/OR at least 250 square feet of nonresidential job space. Detached garages, pole barns, utility sheds, and most structures on agricultural or recreational land uses are therefore not considered buildings as they do not contain housing units or dedicated nonresidential job space.
How Current is the Buildings Footprints Layer
The building footprints data layer is current as of April 2020. This date was chosen to align with the timing of the 2020 Decennial Census, so that accurate comparisons of housing unit change can be made to evaluate the quality of Census data.
Temporal Aspects
The building footprints data layer is designed to be temporal in nature, so that an accurate inventory of buildings at any point in time since the origination of the layer in April 2010 can be visualized. To facilitate this, when existing buildings are demolished, the demolition date is recorded but they are not removed from the inventory. To view only current buildings, you must filter the data layer using the expression WHERE DEMOLISHED IS NULL.
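As a hedged illustration of this temporal design, the sketch below assumes the attribute table has been exported to a data frame 'bld' with the Table B-1 fields (DEMOLISHED read as a Date, NA when null):

# Approximate point-in-time snapshot of the inventory: buildings built by the
# reference year and not yet demolished at the reference date. Buildings with
# an unknown YEAR_BUILT (coded 0) are retained by this simple filter.
as_of    <- as.Date("2015-04-01")
snapshot <- subset(bld,
                   YEAR_BUILT <= as.integer(format(as_of, "%Y")) &
                   (is.na(DEMOLISHED) | DEMOLISHED > as_of))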
B.2 Building Footprints Attributes
Table B-1 lists the current attributes of the building footprints data layer. Additional information about certain fields follows the attribute list.
Table B-1 Building Footprints Attributes
FIELD | TYPE | DESCRIPTION |
BUILDING_ID | Long Integer | Unique identification number assigned to each building. |
PARCEL_ID | Long Integer | Identification number of the parcel on which the building is located. |
APN | Varchar(24) | Tax assessing parcel number of the parcel on which the building is located. |
CITY_ID | Integer | SEMCOG identification number of the municipality, or for Detroit, master plan neighborhood, in which the building is located. |
BUILD_TYPE | Integer | Building type. Please see section B.3 for a detailed description of the types. |
RES_SQFT | Long Integer | Square footage devoted to residential use. |
NONRES_SQFT | Long Integer | Square footage devoted to nonresidential activity. |
YEAR_BUILT | Integer | Year structure was built. A value of 0 indicates the year built is unknown. |
DEMOLISHED | Date | Date structure was demolished. |
STORIES | Float(5.2) | Number of stories. For single-family residential this number is expressed in quarter fractions from 1 to 3 stories: 1.00, 1.25, 1.50, etc. |
MEDIAN_HGT | Integer | Median height of the building from LiDAR surveys, NULL if unknown. |
HOUSING_UNITS | Integer | Number of residential housing units in the building. |
GQCAP | Integer | Maximum number of group quarters residents, if any. |
SOURCE | Varchar(10) | Source of footprint polygon: NEARMAP, OAKLAND, SANBORN, SEMCOG or AUTOMATIC. |
ADDRESS | Varchar(100) | Street address of the building. |
ZIPCODE | Varchar(5) | USPS postal code for the building address. |
REF_NAME | Varchar(40) | Owner or business name of the building, if known. |
CITY_ID
Please refer to the SEMCOG CITY_ID Code List for a list identifying the code for each municipality AND City of Detroit master plan neighborhood.
RES_SQFT and NONRES_SQFT
Square footage evenly divisible by 100 is an estimate, based on size and/or type of building, where the true value is unknown.
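A one-line, hedged illustration of this convention (again assuming a data frame 'bld' of Table B-1 fields):

# Flag residential square footage values that are estimates (non-zero and
# evenly divisible by 100) rather than measured values.
bld$res_sqft_is_estimate <- bld$RES_SQFT > 0 & bld$RES_SQFT %% 100 == 0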
SOURCE
Footprints from OAKLAND County are derived from 2016 EagleView imagery. Footprints from SEMCOG are edits of shapes from another source. AUTOMATIC footprints are those created by algorithm to represent mobile homes in manufactured housing parks.
ADDRESS
Buildings with addresses on multiple streets will have each street address separated by the “ | “ symbol within the field.
B.3 Building Types
Each building footprint is assigned one of 26 building types to represent how the structure is currently being used. The overwhelming majority of buildings
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Important: Our technical support team is available to assist you during business hours only. Please keep in mind that we can only address technical difficulties during these hours. When using the product to make decisions, please take this into consideration.
Abstract
This spatial product shows consistent ‘near real-time’ bushfire and prescribed burn boundaries for all jurisdictions who have the technical ability or appropriate licence conditions to provide this information.
Currency
Maintenance of the underlying data is the responsibility of the custodian. Geoscience Australia has automated methods of regularly checking for changes in source data. Once detected, the dataset and feeds will be updated as soon as possible. NOTE: The update frequency of the underlying data from the jurisdictions varies and, in most cases, does not line up with this product’s update cycle.
Date created: November 2023
Modification frequency: Every 15 Minutes
Spatial Extent
West Bounding Longitude: 113° South Bounding Latitude: -44° East Bounding Longitude: 154° North Bounding Latitude: -10°
Source Information
The project team initially identified a list of potential source data through jurisdictional websites and the Emergency Management LINK catalogue. These were then confirmed by each jurisdiction through the EMSINA National and EMSINA Developers networks. This Webservice contains authoritative data sourced from:
Australian Capital Territory - Emergency Service Agency (ESA)
New South Wales - Rural Fire Service (RFS)
Queensland - Queensland Fire and Emergency Service (QFES)
South Australia - Country Fire Service (CFS)
Tasmania - Tasmania Fire Service (TFS)
Victoria – Department of Environment, Land, Water and Planning (DELWP)
Western Australia – Department of Fire and Emergency Services (DFES)
The completeness of the data within this webservice is reliant on each jurisdictional source and the information they elect to publish into their Operational Bushfire Boundary webservices.
Known Limitations:
This dataset does not contain information from the Northern Territory government.
This dataset contains a subset of the Queensland bushfire boundary data. The Queensland ‘Operational’ feed that is consumed within this National Database displays the last six (6) months of incident boundaries. In order to make this dataset best represent a ‘near-real-time’ or current view of operational bushfire boundaries, Geoscience Australia has filtered the Queensland data to only incorporate the last two (2) weeks’ data.
Geoscience Australia is aware that duplicate data (features) may appear within this dataset. This duplicate data is commonly represented in the regions around state borders where it is operationally necessary for one jurisdiction to understand cross border situations. Care must be taken when summing the values to obtain a total area burnt.
The data within this aggregated National product is a spatial representation of the input data received from the custodian agencies. Therefore, data quality and data completion will vary. If you wish to access more information about specific jurisdictional data and/or data feature(s), it is strongly recommended that you contact the appropriate custodian.
The accuracy of the data attributes within this webservice is reliant on each jurisdictional source and the information they elect to publish into their Operational Bushfire Boundary webservices.
Note: Geoscience Australia has, where possible, attempted to align the data to the (as of October 2023) draft National Current Incident Extent Feeds Data Dictionary. However, this has not been possible in all cases. Work to progress this alignment will be undertaken after the publication of this dataset, once this project enters a maintenance period.
Catalog entry: Bushfire Boundaries – Near Real-Time
Lineage Statement
Version 1 and 2 (2019/20):
This dataset was first built by EMSINA, Geoscience Australia, and Esri Australia staff in early January 2020 in response to the Black Summer Bushfires. The product was aimed at providing a nationally consistent dataset of bushfire boundaries. Version 1 was released publicly on 8 January 2020 through Esri AGOL software.
Version 2 of the product was released in mid-February as EMSINA and Geoscience Australia began automating the product. The release of version 2 exhibited a reformatted attributed table to accommodate these new automation scripts.
The product was continuously developed by the three entities above until early May 2020 when both the scripts and data were handed over to the National Bushfire Recovery Agency. The EMSINA Group formally ended their technical involvement with this project on June 30, 2020.
Version 3 (2020/21):
A 2020/21 version of the National Operational Bushfire Boundaries dataset was agreed to by the Australian Government. It continued to extend upon EMSINA’s 2019/20 Version 2 product. This product was owned and managed by the Australian Government Department of Home Affairs, with Geoscience Australia identified as the technical partners responsible for development and delivery.
Work on Version 3 began in August 2020 with delivery of this product occurring on 14 September 2020.
Version 4 (2021/22):
A 2021/22 version of the National Operational Bushfire Boundaries dataset was produced by Geoscience Australia. This product was owned and managed by Geoscience Australia, who provided both development and delivery.
Work on Version 4 began in August 2021 with delivery of this product occurring on 1 September 2021. The dataset was discontinued in May 2022 because of insufficient Government funding.
Version 5 (2023/25):
A 2023/25 version of the National Near-Real-Time Bushfire Boundaries dataset is produced by Geoscience Australia under funding from the National Bushfire Intelligence Capability (NBIC) - CSIRO. NBIC and Geoscience Australia have also partnered with the EMSINA Group to assist with accessing and delivering this dataset. This dataset is the first in which the jurisdictional attributes are aligned to AFAC’s National Bushfire Schema.
Work on Version 5 began in August 2023 and was released in late 2023 under formal access arrangements with the States and Territories.
Data Dictionary
Geoscience Australia has not included attributes added automatically by spatial software processes in the table below.
Attribute Name | Description |
fire_id | ID attached to fire (e.g. incident ID, Event ID, Burn ID). |
fire_name | Incident name, if available. |
fire_type | Binary variable to describe whether a fire was a bushfire or prescribed burn. |
ignition_date | The date of the ignition of a fire event. Date and time are in the local time zone of the State where the fire is located and stored as a string. |
capt_date | The date the incident boundary was captured or updated. Date and time are in the local time zone of the jurisdiction where the fire is located and stored as a string. |
capt_method | Categorical variable to describe the source of data used for defining the spatial extent of the fire. |
area_ha | Burnt area in hectares. Currently a calculated field so that all area calculations are done in the same map projection. Jurisdictions supply area in the appropriate projection to match state incident reporting systems. |
perim_km | Burnt perimeter in kilometres. Calculated field so that all area calculations are done in the same map projection. Jurisdiction preference is that supplied perimeter calculations are used for consistency with jurisdictional reporting. |
state | State custodian of the data. NOTE: Currently some states use and have in their feeds cross border data. |
agency | Agency that is responsible for the incident. |
date_retrieved | The date and time that Geoscience Australia retrieved this data from the jurisdictions, stored as UTC. Please note when viewed in ArcGIS Online, the date is converted from UTC to your local time. |
Contact Geoscience Australia, clientservices@ga.gov.au
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Using geospatial data of wildlife presence to predict a species' distribution across a geographic area is among the most common tools in management and conservation. The collection of high-quality presence-absence data through structured surveys is, however, expensive, and managers usually have access to larger amounts of low-quality presence-only data collected by citizen scientists, opportunistic observations, and culling returns for game species. Integrated Species Distribution Models (ISDMs) have been developed to make the most of the data available by combining the higher-quality, but usually scarcer and more spatially restricted presence-absence data, with the lower quality, unstructured, but usually more extensive presence-only datasets. Joint-likelihood ISDMs can be run in a Bayesian context using INLA (Integrated Nested Laplace Approximation) methods that allow the addition of a spatially structured random effect to account for data spatial autocorrelation. Here, we apply this innovative approach to fit ISDMs to empirical data, using presence-absence and presence-only data for the three prevalent deer species in Ireland: red, fallow and sika deer. We collated all deer data available for the past 15 years and fitted models predicting distribution and relative abundance at a 25 km2 resolution across the island. Models’ predictions were associated with spatial estimates of uncertainty, allowing us to assess the quality of the model and the effect that data scarcity has on the certainty of predictions. Furthermore, we checked the performance of the three species-specific models using two datasets: independent deer hunting returns and deer densities based on faecal pellet counts. Our work clearly demonstrates the applicability of spatially-explicit ISDMs to empirical data in a Bayesian context, providing a blueprint for managers to exploit unexplored and seemingly unusable data that can, when modelled with the proper tools, serve to inform management and conservation policies. Methods Presence absence (PA) data PA data for each species were obtained from Coillte based on surveys performed in a fraction of the 6,000 properties they manage (Table 1) by asking property managers (who visit the forests they manage on a regular basis) whether deer were present and, if so, what species. Properties range in size from less than one to around 2,900 ha, and to assign the PA value to a specific location, we calculated the centroid of each property using the function st_centroid() from the package sf in R (Pebesma 2018). The survey was mainly performed in 2010 and 2013, with further data collected between 2014 and 2016. Some properties were surveyed only once in the period 2010–2016, but for those that were surveyed more than once, the value for that location was considered “absence” if deer had never been detected in the property in any of the surveys, and “presence” in all other cases. In addition to these surveys, Coillte commissioned density surveys based on faecal pellet sampling in a subset of their properties between the years 2007 and 2020. Any non-zero densities in these data were considered “presences”, and all zeros were considered “absences”. These data were also summarised across years when a property had been repeatedly sampled and counted as presence if deer had been detected in any of the sampling years. PA data for NI were obtained from a survey carried out by the British Deer Society in 2016.
PA data for Northern Ireland (NI) were obtained from a survey carried out by the British Deer Society (BDS) in 2016. The survey divided the territory into 100 km2 grid cells, and deer presence or absence in each cell was assessed from public contributions, which were then reviewed and collated by BDS experts. Because 100 km2 grid cells are large, we did not assign the PA value of each cell to its centroid, as we did for the Coillte properties; instead, we randomly simulated positions within each cell and assigned the cell's presence or absence value to each of them. A sensitivity analysis was used to determine the number of positions needed to capture the environmental variability within each cell, which was set to five random positions per grid cell (see the sketch following the presence-only data section). After processing, we obtained a total of 920 PA records across NI.
Presence-only (PO) data
PO data were collected from various sources, mainly (but not only) citizen science initiatives. The National Biodiversity Data Centre (NBDC) is an Irish initiative that collates biodiversity data from different sources, from published studies to citizen contributions. From this repository, we obtained all contributions on the three species, a total of 1,430 records. To these, we added 164 records of deer in Ireland downloaded from iNaturalist, another citizen-contributed database that collects the same type of data. From the resulting dataset, we (1) removed all observations with a spatial resolution coarser than 1 km2; (2) visually inspected the data and comments and removed all observations that were obviously incorrect (e.g., located at sea, or where the comment specified a different species); (3) filtered out all fallow deer reported in Dublin's enclosed city park (Phoenix Park), since that population was introduced and is artificially maintained and disconnected from the rest of the populations in Ireland; and (4) filtered duplicate observations by retaining only one observation per user, location, and day. The Centre for Environmental Data and Recording (CEDaR) is a data repository for NI that operates in the same way as the NBDC. It provided 872 records of deer in NI, coming from different survey, scientific, and citizen science initiatives, from which we removed all records with a spatial resolution coarser than 1 km2. The locations and species of 469 deer culled between 2019 and 2021 in NI were obtained from the Agri-Food and Biosciences Institute. For observations without specific coordinates, we derived coordinates from the location name or postcode where provided. As part of a nationally funded initiative to improve deer monitoring in Ireland (SMARTDEER), we developed a bespoke online tool to facilitate the reporting of deer observations by the general public and relevant stakeholders (e.g., hunters, farmers, or foresters). Observations were reported in 2021 and 2022 by clicking on a map to indicate 1 km2 squares where deer had been observed. For each user and session, we calculated the total area covered by the selected squares and simulated a number of random positions proportional to that area, distributing them within the selected squares to generate exact positions. In total, the SMARTDEER tool allowed us to collect 4,078 presences across Ireland and NI.
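Both the BDS grid cells and the SMARTDEER squares were thus converted into sets of random positions carrying each polygon's value. The sketch below illustrates that step under stated assumptions: cells are available as polygons with a presence column, and the file name (bds_cells.gpkg) and column name are hypothetical.

```r
# Minimal sketch (hypothetical file/column names): spread each polygon's
# presence/absence value over a fixed number of random positions inside it,
# as described for the BDS 100 km2 cells (5 positions per cell).
library(sf)

cells <- st_read("bds_cells.gpkg")   # polygons with a 'presence' column (0/1)
n_pos <- 5                           # positions per cell, from the sensitivity analysis above

pa_points <- do.call(rbind, lapply(seq_len(nrow(cells)), function(i) {
  pts <- st_sample(cells[i, ], size = n_pos, exact = TRUE)   # random points within the polygon
  st_sf(presence = rep(cells$presence[i], n_pos), geometry = pts)
}))
```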
Covariate selection
Raster environmental covariates used in the models were obtained from the Copernicus Land Monitoring Service (© European Union, Copernicus Land Monitoring Service 2018, European Environment Agency EEA), whereas the vector layers (roads, paths) were obtained from the OpenStreetMap service (OpenStreetMap contributors, 2017. Planet dump [Data file from January 2022]. https://planet.openstreetmap.org). Vector layers were transformed into distance layers (distance to roads, distance to paths) using the distance() function from the package raster, and into density layers (density of roads and paths) using the rasterize() function of the same package (Hijmans 2021). All raster layers were resampled to the coarsest resolution among the covariates used, resulting in a 1 km2 resolution. A full description of the covariate selection process (including screening for collinearity) can be found in the supplementary material. The covariates eventually used in the model were elevation (m), slope (degrees), tree cover (%), small woody feature density (%), distance to forest edge (m; positive distances indicate a location outside a forest, negative distances a location within a forest), and human footprint index (Venter et al. 2016, 2018). All covariates were scaled by subtracting the mean and dividing by the standard deviation before entering the model (function scale() from the raster package).
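For illustration only, here is a minimal R sketch of the covariate preparation described above: deriving a distance-to-roads layer and a simple road-density layer from an OSM line layer on a 1 km2 template grid, then scaling the stack. The file names (elevation_1km.tif, roads_osm.shp) are hypothetical, and counting road features per cell is only one possible density proxy; the authors' exact workflow may differ.

```r
# Minimal sketch (hypothetical file names): distance and density layers from an
# OSM road layer on a 1 km2 template grid, followed by scaling of all covariates.
library(raster)
library(sf)

template <- raster("elevation_1km.tif")              # 1 km2 template raster (e.g., elevation)
roads    <- as(st_read("roads_osm.shp"), "Spatial")  # OSM road lines

# Density proxy: number of road features touching each cell
road_density <- rasterize(roads, template, fun = "count", background = 0)

# Distance: cells touched by a road get a value; distance() fills the remaining
# (NA) cells with the distance to the nearest non-NA (road) cell
road_cells    <- rasterize(roads, template, field = 1)
dist_to_roads <- distance(road_cells)

# Stack and scale (subtract mean, divide by SD) before entering the model
covariates <- stack(template, road_density, dist_to_roads)
names(covariates) <- c("elevation", "road_density", "dist_to_roads")
covariates_scaled <- scale(covariates)
```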
Taxa were categorized as "Threatened" (Vulnerable, Endangered, Critically Endangered), "Non-threatened" (Least Concern, Near Threatened), or "N/A" (Data Deficient, Not Evaluated).
Mapping and Human Footprint Index (HFI)
All populations were mapped in QGIS using the coordinates extracted from articles. The maps were created using a World Behrmann equal-area projection. For the summary maps, estimates were grouped into grid cells with an area of 250,000 km2 (roughly 500 km x 500 km, although the dimensions of each cell vary due to distortions from the projection). Within each cell, we calculated the number of Ne estimates and their median. We used the Global Human Footprint dataset (WCS & CIESIN, 2005) to generate a value of human influence (HFI) for each population at its geographic coordinates. The footprint ranges from zero (no human influence) to 100 (maximum human influence). Values were available at a 1 km x 1 km grid cell size and were overlaid on the population coordinates to assign a human footprint value to each population. The human footprint values were extracted from the map into a spreadsheet for use in statistical analyses. Not all geographic coordinates had an associated human footprint value (e.g., those in the oceans and other large bodies of water); therefore, marine fishes were not included in our HFI analysis. Overall, 3610 Ne estimates in our final dataset had an associated footprint value.
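The extraction step described above was performed in QGIS; purely as an illustration, the same point-extraction could be expressed in R along the following lines. The file and column names (global_human_footprint.tif, populations.csv, lon, lat) are hypothetical, and the population coordinates are assumed to be in the raster's coordinate reference system.

```r
# Illustrative sketch only (the authors used QGIS): extract the Global Human
# Footprint value at each population's coordinates and drop populations with
# no footprint value (e.g., marine fishes).
library(raster)

hfi  <- raster("global_human_footprint.tif")   # 1 km x 1 km HFI raster
pops <- read.csv("populations.csv")            # georeferenced Ne estimates (lon, lat columns)

pops$hfi <- extract(hfi, cbind(pops$lon, pops$lat))  # NA where no footprint value exists
pops_hfi <- pops[!is.na(pops$hfi), ]                 # populations retained for the HFI analysis
```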