Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compositional data, which is data consisting of fractions or probabilities, is common in many fields including ecology, economics, physical science and political science. If these data would otherwise be normally distributed, their spread can be conveniently represented by a multivariate normal distribution truncated to the non-negative space under a unit simplex. Here this distribution is called the simplex-truncated multivariate normal distribution. For calculations on truncated distributions, it is often useful to obtain rapid estimates of their integral, mean and covariance; these quantities characterising the truncated distribution will generally possess different values to the corresponding non-truncated distribution.
In the paper Adams, Matthew (2022) Integral, mean and covariance of the simplex-truncated multivariate normal distribution. PLoS One, 17(7), Article number: e0272014. https://eprints.qut.edu.au/233964/, three different approaches that can estimate the integral, mean and covariance of any simplex-truncated multivariate normal distribution are described and compared. These three approaches are (1) naive rejection sampling, (2) a method described by Gessner et al. that unifies subset simulation and the Holmes-Diaconis-Ross algorithm with an analytical version of elliptical slice sampling, and (3) a semi-analytical method that expresses the integral, mean and covariance in terms of integrals of hyperrectangularly-truncated multivariate normal distributions, the latter of which are readily computed in modern mathematical and statistical packages. Strong agreement is demonstrated between all three approaches, but the most computationally efficient approach depends strongly both on implementation details and the dimension of the simplex-truncated multivariate normal distribution.
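As an illustration of approach (1), the sketch below implements naive rejection sampling for a simplex-truncated multivariate normal distribution; the mean vector and covariance used are placeholders, and this is not the code released with the article.

```python
import numpy as np

def simplex_truncated_mvn_moments(mu, Sigma, n_draws=100_000, seed=0):
    """Naive rejection sampling: draw from N(mu, Sigma) and keep only the
    draws that fall inside the unit simplex {x : x_i >= 0, sum(x) <= 1}.

    Returns estimates of the truncation integral (acceptance probability)
    and the mean and covariance of the truncated distribution.
    """
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mu, Sigma, size=n_draws)
    inside = np.all(draws >= 0.0, axis=1) & (draws.sum(axis=1) <= 1.0)
    accepted = draws[inside]
    integral = inside.mean()              # estimated P(X in simplex)
    mean = accepted.mean(axis=0)          # truncated mean
    cov = np.cov(accepted, rowvar=False)  # truncated covariance
    return integral, mean, cov

# Example with placeholder parameters for a 3-dimensional distribution.
mu = np.array([0.3, 0.3, 0.2])
Sigma = 0.05 * np.eye(3)
integral, mean, cov = simplex_truncated_mvn_moments(mu, Sigma)
```

Rejection sampling of this kind becomes inefficient as the dimension grows, which is the motivation for approaches (2) and (3) in the paper.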
This dataset consists of all code and results for the associated article.
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/cc-by/cc-by_f24dc630aa52ab8c52a0ac85c03bc35e0abc850b4d7453bdc083535b41d5a5c3.pdf
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. This catalogue entry provides post-processed ERA5 hourly single-level data aggregated to daily time steps. In addition to the data selection options found on the hourly page, the following options can be selected for the daily statistic calculation:
(1) the daily aggregation statistic (daily mean, daily max, daily min, daily sum*); (2) the sub-daily frequency at which the original data are sampled (1 hour, 3 hours, 6 hours); and (3) the option to shift the aggregation to any local time zone, expressed as an offset from UTC (no shift means the statistic is computed from UTC+00:00).
*The daily sum is only available for the accumulated variables (see ERA5 documentation for more details). Users should be aware that the daily aggregation is calculated during the retrieval process and is not part of a permanently archived dataset. For more details on how the daily statistics are calculated, including demonstrative code, please see the documentation. For more details on the hourly data used to calculate the daily statistics, please refer to the ERA5 hourly single-level data catalogue entry and the documentation found therein.
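Purely as an illustration of the kind of daily aggregation described above (this is not the CDS implementation), a small pandas sketch with a placeholder hourly series:

```python
import numpy as np
import pandas as pd

# Hourly values for a single grid point (synthetic placeholder data).
times = pd.date_range("2020-01-01", periods=31 * 24, freq="h", tz="UTC")
hourly = pd.Series(15 + 10 * np.sin(2 * np.pi * times.hour / 24), index=times)

def daily_statistic(series, stat="mean", freq_hours=1, utc_offset_hours=0):
    """Aggregate an hourly series to daily values.

    freq_hours: sub-daily sampling of the original data (1, 3 or 6 hours).
    utc_offset_hours: shift applied so that 'days' follow a local time zone.
    """
    sampled = series[::freq_hours]                    # sub-daily sampling
    shifted = sampled.copy()
    shifted.index = shifted.index + pd.Timedelta(hours=utc_offset_hours)
    return getattr(shifted.resample("D"), stat)()     # mean / max / min / sum

daily_mean_local = daily_statistic(hourly, stat="mean", freq_hours=3,
                                   utc_offset_hours=7)
```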
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the anonymised transcripts of the interviews conducted between November and December 2021 at the department of Classical Philology and Italian Studies (FICLIT) at the University of Bologna. It further includes the qualitative data analysis of the interviews, carried out using a grounded theory approach and the open source software QualCoder version 2.9.
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.
This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.
Metadata (including data dictionary)
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)
Code Abstract
We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.
Description
“CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
“Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).
Required R packages:
• For running “CWVS_LMC.txt”:
• msm: Sampling from the truncated normal distribution
• mnormt: Sampling from the multivariate normal distribution
• BayesLogit: Sampling from the Polya-Gamma distribution
• For running “Results_Summary.txt”:
• plotrix: Plotting the posterior means and credible intervals
Instructions for Use
Reproducibility
What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
• Load the “Simulated_Dataset.RData” workspace
• Run the code contained in “CWVS_LMC.txt”
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”
Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set.
Data
The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.
Availability
Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.
Description
Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.
This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
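As an illustration of the weekly standardization described above, a minimal NumPy sketch with placeholder exposures (the matrix z in the provided workspace is already standardized, and the released analysis code is the R in “CWVS_LMC.txt”):

```python
import numpy as np

def standardize_exposures(z):
    """Standardize an (n individuals x m weeks) exposure matrix column-wise:
    subtract each week's median and divide by that week's interquartile range,
    mirroring the standardization described for the simulated dataset."""
    median = np.median(z, axis=0)
    q75, q25 = np.percentile(z, [75, 25], axis=0)
    return (z - median) / (q75 - q25)

# Placeholder exposures: 100 individuals, 36 weeks of pregnancy.
rng = np.random.default_rng(1)
z_raw = rng.gamma(shape=2.0, scale=5.0, size=(100, 36))
z_std = standardize_exposures(z_raw)
```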
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Q: What average temperatures are projected for the future if we reduce and stabilize global emissions of heat-trapping gases within the next two decades? A: Colors show projected daily average temperature for each month from the 2020s through the 2090s, based on a stabilized-emissions future. In this case, the stabilized-emissions future represents a specific Representative Concentration Pathway (RCP) called RCP 4.5. Q: Where do these measurements come from? A: Temperature projections in these images represent output from 32 global climate models that are all part of the Coupled Model Intercomparison Project Phase 5 (CMIP5). Projections labeled as “Stabilized emissions” represent a potential future in which global emissions peak around 2040, and then are reduced and stabilized. By 2100, the result of this pathway is climate forcing of 4.5 Watts per square meter at the top of the atmosphere. Based on the energy imbalance along this pathway, global climate models calculate temperature across Earth’s surface for future periods. The RCP 4.5 scenario is associated with warming of approximately 2°C above the modern climate normal. To produce regionally relevant projections, results from the global models were statistically downscaled using a method called Localized Constructed Analogs (LOCA). This technique uses observed local-scale weather and climate information to increase the spatial resolution of global-scale projections, and corrects for bias in the model simulations. Images of long-term averages from 1981 to 2010 (PRISM normals) show recent conditions; these maps provide a baseline for comparison with future projections. To produce the normals data, the PRISM group at Oregon State University gathered temperature and precipitation records from a range of federal, state, and international weather station networks, and then mapped them to a grid. To fill map areas between observation stations, the group used a digital elevation model as a predictor grid, and refined the model to account for local effects of mountains, distance from coasts, and other factors that affect climate in complex terrains. Q: What do the colors mean? A: Shades of blue show where average maximum temperature for the month was, or is projected to be, below 60°F during the period indicated. The darker the shade of blue, the lower the temperature. Areas shown in shades of orange and red had, or are projected to have, average maximum temperatures over 60°F. The darker the shade of orange or red, the higher the temperature. White or very light colors show where the average maximum temperature was, or is projected to be, near 60°F. Q: Why do these data matter? A: In order to meet future needs for energy, food, and public health, planners and other decision makers need to understand how temperatures are projected to change over the coming decades. As the climate system continues responding to the heat-trapping gases we have added to the atmosphere, temperatures will change at different rates in different regions. These images can help people get a sense of how much warming their region will experience each decade so they can plan ahead for new conditions. These data also provide people with a way to compare conditions projected for stabilized emissions with conditions projected for high emissions. Comparing the two potential futures may encourage people to take actions to reduce emissions. Q: How did you produce these snapshots?
A: We used a suite of Python scripts to process and visualize LOCA (Localized Constructed Analogs) data. The processing scripts averaged the daily values for each month in a given decade from all 32 global climate models that comprise the LOCA dataset. We then calculated the median of all models in each month of the decade. The visualization scripts produced maps of the results within the contiguous United States. For further information, see the README file or access the scripts on GitHub.
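A minimal NumPy sketch of that two-step reduction (per-model monthly averages over a decade, then the median across models), using placeholder arrays rather than the actual LOCA files; the released scripts on GitHub are the authoritative implementation:

```python
import numpy as np

# Placeholder array of daily values: (model, day, lat, lon). In the real
# workflow these would come from the 32 LOCA-downscaled CMIP5 models.
n_models, n_days, n_lat, n_lon = 32, 3650, 10, 20   # roughly one decade
rng = np.random.default_rng(0)
daily = rng.normal(60, 15, size=(n_models, n_days, n_lat, n_lon))
month_of_day = rng.integers(1, 13, size=n_days)     # placeholder month labels

decadal_monthly = []
for month in range(1, 13):
    sel = month_of_day == month
    # Average all daily values for this calendar month across the decade,
    # separately for each model ...
    per_model = daily[:, sel].mean(axis=1)           # (model, lat, lon)
    # ... then take the median across the 32 models.
    decadal_monthly.append(np.median(per_model, axis=0))
decadal_monthly = np.stack(decadal_monthly)          # (12, lat, lon)
```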
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by ``stretching'' and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how that $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
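For reference, the clustering $R^2$ discussed above is commonly written as the ratio of the between-cluster to the total sum of squares (stated here in a generic form; the note itself should be consulted for its exact definition):

```latex
% R^2 for a clustering of observations x_1,...,x_n into clusters C_1,...,C_K,
% with cluster sizes n_k, cluster means \bar{x}_k and overall mean \bar{x}:
R^2 \;=\; \frac{\sum_{k=1}^{K} n_k \,\lVert \bar{x}_k - \bar{x} \rVert^2}
               {\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}
      \;=\; 1 \;-\; \frac{\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2}
                         {\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}.
```

Because any linear transformation of the data changes both sums of squares, stretching or projecting the data can inflate this ratio without improving the clustering itself, which is the behaviour the note highlights.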
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
Summary of various climate variables for all 15 subregions based on Bureau of Meteorology Australian Water Availability Project (BAWAP) climate grids, including:
Time series mean annual BAWAP rainfall from 1900 - 2012.
Long term average BAWAP rainfall and Penman Potential Evapotranspiration (PET) from Jan 1981 - Dec 2012 for each month
Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P (precipitation); (ii) Penman ETp; (iii) Tavg (average temperature); (iv) Tmax (maximum temperature); (v) Tmin (minimum temperature); (vi) VPD (Vapour Pressure Deficit); (vii) Rn (net radiation); and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables we calculated the: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.
Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009).
As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
There are 4 csv files here:
BAWAP_P_annual_BA_SYB_GLO.csv
Desc: Time series mean annual BAWAP rainfall from 1900 - 2012.
Source data: annual BILO rainfall
P_PET_monthly_BA_SYB_GLO.csv
long term average BAWAP rainfall and Penman PET from 198101 - 201212 for each month
Climatology_Trend_BA_SYB_GLO.csv
Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg; (iv) Tmax; (v) Tmin; (vi) VPD; (vii) Rn; and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables we calculated the: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend (a sketch of these calculations is given after this file listing).
Risbey_Remote_Rainfall_Drivers_Corr_Coeffs_BA_NSB_GLO.csv
Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
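As an illustration of the per-period summary statistics listed for Climatology_Trend_BA_SYB_GLO.csv, a small Python sketch on a placeholder monthly series; the trend here is a simple linear least-squares slope per year, which is an assumption and may differ from the method actually used:

```python
import numpy as np
import pandas as pd

def period_summary(monthly, months):
    """Summary statistics for one variable and one time period (e.g. annual,
    a season, or a single month): average, maximum, minimum, average +/- one
    standard deviation, the standard deviation itself, and a linear trend."""
    sub = monthly[monthly.index.month.isin(months)]
    annual = sub.groupby(sub.index.year).mean()          # one value per year
    avg, std = annual.mean(), annual.std()
    slope = np.polyfit(annual.index, annual.values, 1)[0]  # trend per year
    return {"average": avg, "maximum": annual.max(), "minimum": annual.min(),
            "avg_plus_std": avg + std, "avg_minus_std": avg - std,
            "std": std, "trend": slope}

# Placeholder monthly series for 1981-2012 (the real inputs are BAWAP grids).
idx = pd.date_range("1981-01-01", "2012-12-01", freq="MS")
rain = pd.Series(np.random.default_rng(2).gamma(2, 30, len(idx)), index=idx)
print(period_summary(rain, months=[12, 1, 2]))           # DJF season
```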
Dataset was created from various BAWAP source data, including monthly BAWAP rainfall, Tmax, Tmin, VPD, etc., and other source data including monthly Penman PET and correlation coefficient data. Data were extracted from national datasets for the GLO subregion.
Bioregional Assessment Programme (2014) GLO climate data stats summary. Bioregional Assessment Derived Dataset. Viewed 18 July 2018, http://data.bioregionalassessments.gov.au/dataset/afed85e0-7819-493d-a847-ec00a318e657.
Derived From Natural Resource Management (NRM) Regions 2010
Derived From Bioregional Assessment areas v03
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From GEODATA TOPO 250K Series 3
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Geological Provinces - Full Extent
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching *** zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than *** zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.
Storage capacity also growing
Only a small percentage of this newly created data is kept though, as just * percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of **** percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached *** zettabytes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and code associated with "The Observed Availability of Data and Code in Earth Science
and Artificial Intelligence" by Erin A. Jones, Brandon McClung, Hadi Fawad, and Amy McGovern.
Instructions: To reproduce figures, download all associated Python and CSV files and place
in a single directory.
Run BAMS_plot.py as you would run Python code on your system.
Code:
BAMS_plot.py: Python code for categorizing data availability statements based on given data
documented below and creating figures 1-3.
Code was originally developed for Python 3.11.7 and run in the Spyder
(version 5.4.3) IDE.
Libraries utilized:
numpy (version 1.26.4)
pandas (version 2.1.4)
matplotlib (version 3.8.0)
For additional documentation, please see code file.
Data:
ASDC_AIES.csv: CSV file containing relevant availability statement data for Artificial
Intelligence for the Earth Systems (AIES)
ASDC_AI_in_Geo.csv: CSV file containing relevant availability statement data for Artificial
Intelligence in Geosciences (AI in Geo.)
ASDC_AIJ.csv: CSV file containing relevant availability statement data for Artificial
Intelligence (AIJ)
ASDC_MWR.csv: CSV file containing relevant availability statement data for Monthly
Weather Review (MWR)
Data documentation:
All CSV files contain the same format of information for each journal. The CSV files above are
needed for the BAMS_plot.py code attached.
Records were analyzed based on the criteria below.
Records:
1) Title of paper
The title of the examined journal article.
2) Article DOI (or URL)
A link to the examined journal article. For AIES, AI in Geo., and MWR, the DOI is
generally given. For AIJ, the URL is given.
3) Journal name
The name of the journal where the examined article is published. Either a full
journal name (e.g., Monthly Weather Review), or the acronym used in the
associated paper (e.g., AIES) is used.
4) Year of publication
The year the article was posted online/in print.
5) Is there an ASDC?
If the article contains an availability statement in any form, "yes" is
recorded. Otherwise, "no" is recorded.
6) Justification for non-open data?
If an availability statement contains some justification for why data is not
openly available, the justification is summarized and recorded as one of the
following options: 1) Dataset too large, 2) Licensing/Proprietary, 3) Can be
obtained from other entities, 4) Sensitive information, 5) Available at later
date. If the statement indicates any data is not openly available and no
justification is provided, or if no statement is provided, "None"
is recorded. If the statement indicates openly available data or no data
produced, "N/A" is recorded.
7) All data available
If there is an availability statement and data is produced, "y" is recorded
if means to access data associated with the article are given and there is no
indication that any data is not openly available; "n" is recorded if no means
to access data are given or there is some indication that some or all data is
not openly available. If there is no availability statement or no data is
produced, the record is left blank.
8) At least some data available
If there is an availability statement and data is produced, "y" is recorded
if any means to access data associated with the article are given; "n" is
recorded if no means to access data are given. If there is no availability
statement or no data is produced, the record is left blank.
9) All code available
If there is an availability statement and data is produced, "y" is recorded
if means to access code associated with the article are given and there is no
indication that any code is not openly available; "n" is recorded if no means
to access code are given or there is some indication that some or all code is
not openly available. If there is no availability statement or no data is
produced, the record is left blank.
10) At least some code available
If there is an availability statement and data is produced, "y" is recorded
if any means to access code associated with the article are given; "n" is
recorded if no means to access code are given. If there is no availability
statement or no data is produced, the record is left blank.
11) All data available upon request
If there is an availability statement indicating data is produced and no data
is openly available, "y" is recorded if any data is available upon request to
the authors of the examined journal article (not a request to any other
entity); "n" is recorded if no data is available upon request to the authors
of the examined journal article. If there is no availability statement, any
data is openly available, or no data is produced, the record is left blank.
12) At least some data available upon request
If there is an availability statement indicating data is produced and not all
data is openly available, "y" is recorded if all data is available upon
request to the authors of the examined journal article (not a request to any
other entity); "n" is recorded if not all data is available upon request to
the authors of the examined journal article. If there is no availability
statement, all data is openly available, or no data is produced, the record
is left blank.
13) no data produced
If there is an availability statement that indicates that no data was
produced for the examined journal article, "y" is recorded. Otherwise, the
record is left blank.
14) links work
If the availability statement contains one or more links to a data or code
repository, "y" is recorded if all links work; "n" is recorded if one or more
links do not work. If there is no availability statement or the statement
does not contain any links to a data or code repository, the record is left
blank.
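As an illustration of how these records might be tallied, a short pandas sketch follows; the column names used here are assumptions for illustration only, and the authoritative names are those in the CSV headers and in BAMS_plot.py.

```python
import pandas as pd

# Hypothetical column names -- adjust these to match the actual CSV headers.
files = {"AIES": "ASDC_AIES.csv", "AI in Geo.": "ASDC_AI_in_Geo.csv",
         "AIJ": "ASDC_AIJ.csv", "MWR": "ASDC_MWR.csv"}

for journal, path in files.items():
    df = pd.read_csv(path)
    has_statement = df["Is there an ASDC?"].str.lower().eq("yes")
    some_data = df["At least some data available"].str.lower().eq("y")
    print(journal,
          f"availability statements: {has_statement.mean():.0%},",
          f"some data available: {some_data.mean():.0%}")
```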
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the models for interpretable Word Sense Disambiguation (WSD) that were employed in Panchenko et al. (2017; the paper can be accessed at https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/publications/EACL_Interpretability_FINAL_1_.pdf).
The files were computed on a 2015 dump from the English Wikipedia. Their contents:
Induced Sense Inventories: wp_stanford_sense_inventories.tar.gz This file contains 3 inventories (coarse, medium, fine)
Language Model (3-gram): wiki_text.3.arpa.gz This file contains all n-grams up to n=3 and can be loaded into an index
Weighted Dependency Features: wp_stanford_lemma_LMI_s0.0_w2_f2_wf2_wpfmax1000_wpfmin2_p1000.gz This file contains weighted word--context-feature combinations and includes their count and an LMI significance score
Distributional Thesaurus (DT) of Dependency Features: wp_stanford_lemma_BIM_LMI_s0.0_w2_f2_wf2_wpfmax1000_wpfmin2_p1000_simsortlimit200_feature expansion.gz This file contains a DT of context features. The context feature similarities can be used for context expansion
For further information, consult the paper and the companion page: http://jobimtext.org/wsd/
Panchenko A., Ruppert E., Faralli S., Ponzetto S. P., and Biemann C. (2017): Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction and Disambiguation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL'2017). Valencia, Spain. Association for Computational Linguistics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Q: What's the temperature of water at the ocean's surface? A: Colors on the map show the temperature of water right at the ocean’s surface. The darkest blue shows the coldest water: floating sea ice is usually present in these areas. Lighter shades of blue show temperatures of up to 80°F. White and orange areas show where surface temperatures are higher than 80°F, warm enough to fuel tropical cyclones or hurricanes. Q: Where do these measurements come from? A: Satellite instruments measure sea surface temperature—often abbreviated as SST—by checking how much energy comes off the ocean at different wavelengths. Computer programs merge sea surface temperatures from ships and buoys with the satellite data, and incorporate information from maps of sea ice. To produce the daily maps, programs invoke mathematical filters to combine and smooth data from all three sources. Q: What do the colors mean? A: The darkest blue areas show sea surface temperatures as low as 28°F. Sea ice, which can look like anything from a slushy mix of floating ice crystals to a solid surface of white, is usually present in these areas. Progressively lighter shades of blue show increasingly warmer temperatures, up to 80°F. White and orange areas on the map show where the surface temperature is above 80°F. Tropical storms that cross these areas can strengthen to form cyclones and hurricanes. Q: Why do these data matter? A: While heat energy is stored and mixed throughout the depth of the ocean, the temperature of water right at the sea's surface—where the ocean is in direct contact with the atmosphere—plays a significant role in weather and short-term climate. Where sea surface temperatures are high, relatively large amounts of heat energy and moisture enter the atmosphere, sometimes producing powerful, drenching storms downwind. Conversely, lower sea surface temperatures mean less evaporation. Global patterns of sea surface temperatures are an important factor for weather forecasts and climate outlooks. Q: How did you produce these snapshots? A: Data Snapshots are derivatives of existing data products: to meet the needs of a broad audience, we present the source data in a simplified visual style. NOAA's Climate Data Records Program produces the Optimum Interpolation Sea Surface Temperature files. To produce our images, we run a set of scripts that access the source files, re-project them into desired projections at various sizes, and output them with a custom color bar. Additional information Various scientific groups have produced datasets showing Sea Surface Temperature. The images in Data Snapshots represent the AVHRR-only 1/4° daily OISST dataset. Data Snapshots presents just one daily OISST image every seven days. References Optimum Interpolation Sea Surface Temperature Technical Notes [pdf] Climate Data Record (CDR) Program Climate Algorithm Theoretical Basis Document (C-ATBD) Daily 1/4° Optimum Interpolation Sea Surface Temperature (OISST) Richard W. Reynolds, Thomas M. Smith, Chunying Liu, Dudley B. Chelton, Kenneth S. Casey, and Michael G. Schlax, 2007: Daily High-Resolution-Blended Analyses for Sea Surface Temperature. J. Climate, 20, 5473–5496.
doi: http://dx.doi.org/10.1175/2007JCLI1824.1 Improvements of the Daily Optimum Interpolation Sea Surface Temperature (DOISST) Version 2.1 About Optimum Interpolation Sea Surface Temperature (OISST) v2.1 Source: https://www.climate.gov/maps-data/data-snapshots/data-source/sst-sea-surface-temperature This upload includes two additional files:* SST - Sea Surface Temperature _NOAA Climate.gov.pdf is a screenshot of the main Climate.gov site for these snapshots (https://www.climate.gov/maps-data/data-snapshots/data-source/sst-sea-surface-temperature)* Cimate_gov_ Data Snapshots.pdf is a screenshot of the data download page for the full-resolution files.
https://artefacts.ceda.ac.uk/licences/specific_licences/ecmwf-era-products.pdf
This dataset contains ERA5 surface level analysis parameter data ensemble means (see linked dataset for spreads). ERA5 is the 5th generation reanalysis project from the European Centre for Medium-Range Weather Forecasts (ECMWF) - see linked documentation for further details. The ensemble means and spreads are calculated from the ERA5 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables, which have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool linked to from this record.
Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and was thus calculated by dividing by 10 rather than by 9 (N-1). See linked datasets for ensemble member and ensemble mean data.
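A minimal NumPy illustration of the distinction (ddof=0 divides by N = 10, matching the ensemble spread definition above; ddof=1 would give the sample standard deviation):

```python
import numpy as np

members = np.random.default_rng(3).normal(size=10)  # 10 ensemble members (placeholder)

spread = np.std(members, ddof=0)   # divide by N = 10, as for the ERA5 ensemble spread
sample = np.std(members, ddof=1)   # divide by N - 1 = 9 (sample standard deviation)
```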
The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects.
An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These will be subsequently reviewed ahead of being released by ECMWF as quality assured data within 3 months. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record. However, for the period 2000-2006 the initial ERA5 release was found to suffer from stratospheric temperature biases and so new runs to address this issue were performed resulting in the ERA5.1 release (see linked datasets). Note, though, that Simmons et al. 2020 (technical memo 859) report that "ERA5.1 is very close to ERA5 in the lower and middle troposphere." but users of data from this period should read the technical memo 859 for further details.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled by students in all fields and levels of study offered by the university. In the period analysed, the university was entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher for degree, seniority, gender, and SET scores in the past six semesters; the course characteristics for time of day, day of the week, course type, course breadth, class duration, and class size; the attributes of the SET survey responses as the percentage of students providing SET feedback; and the grades of the course for the mean, standard deviation, and percentage failed. Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section. The unit of observation or the single row in the data set is identified by three parameters: teacher unique id (j), course unique id (k) and the question number in the SET questionnaire (n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}). It means that for each pair (j,k), we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question nr 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.
Two attachments:
- Word file with variables description
- Rdata file with the data set (for R language)
Appendix 1. The SET questionnaire used for this paper.
Evaluation survey of the teaching staff of [university name]
Please, complete the following evaluation form, which aims to assess the lecturer’s performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don’t agree; 1 - I strongly don’t agree. Each of the following statements is rated on this 1-5 scale:
1. I learnt a lot during the course.
2. I think that the knowledge acquired during the course is very useful.
3. The professor used activities to make the class more engaging.
4. If it was possible, I would enroll for the course conducted by this lecturer again.
5. The classes started on time.
6. The lecturer always used time efficiently.
7. The lecturer delivered the class content in an understandable and efficient way.
8. The lecturer was available when we had doubts.
9. The lecturer treated all students equally regardless of their race, background and ethnicity.
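For illustration only, a small pandas sketch of how the dependent variable SET_score_avg(j,k,n) is formed as the average of Likert-scale answers per (teacher, course, question) triplet; the column names here are hypothetical, and the released data are provided as an Rdata file rather than produced by this code:

```python
import pandas as pd

# Hypothetical raw responses: one row per student answer to one SET question.
responses = pd.DataFrame({
    "teacher_id": ["T1", "T1", "T1", "T2"],
    "course_id":  ["C10", "C10", "C10", "C11"],
    "question":   [2, 2, 2, 2],
    "answer":     [5, 4, 3, 5],          # Likert scale 1-5
})

# SET_score_avg(j, k, n): mean of all Likert answers for teacher j, course k,
# question n -- one row per (j, k, n) triplet, as in the described data set.
set_score_avg = (responses
                 .groupby(["teacher_id", "course_id", "question"])["answer"]
                 .mean()
                 .rename("SET_score_avg"))
```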
ACR3L2DM_1, the Active Cavity Radiometer Irradiance Monitor (ACRIM) III Level 2 Daily Mean Data version 1 product, consists of Level 2 total solar irradiance in the form of daily means gathered by the ACRIM III instrument on the ACRIMSAT satellite. The daily means are constructed from the shutter cycle results for each day.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Q: Where was the monthly temperature warmer or cooler than usual? A: Colors show where average monthly temperature was above or below its 1991-2020 average. Blue areas experienced cooler-than-usual temperatures while areas shown in red were warmer than usual. The darker the color, the larger the difference from the long-term average temperature. Q: Where do these measurements come from? A: Weather stations on every continent record temperatures over land, and ocean surface temperatures come from measurements made by ships and buoys. NOAA scientists merge the readings from land and ocean into a single dataset. To calculate difference-from-average temperatures—also called temperature anomalies—scientists calculate the average monthly temperature across hundreds of small regions, and then subtract each region’s 1991-2020 average for the same month. If the result is a positive number, the region was warmer than the long-term average. A negative result from the subtraction means the region was cooler than usual. To generate the source images, visualizers apply a mathematical filter to the results to produce a map that has smooth color transitions and no gaps. Q: What do the colors mean? A: Shades of red show where average monthly temperature was warmer than the 1991-2020 average for the same month. Shades of blue show where the monthly average was cooler than the long-term average. The darker the color, the larger the difference from average temperature. White and very light areas were close to their long-term average temperature. Gray areas near the North and South Poles show where no data are available. Q: Why do these data matter? A: Over time, these data give us a planet-wide picture of how climate varies over months and years and changes over decades. Each month, some areas are cooler than the long-term average and some areas are warmer. Though we don’t see an increase in temperature at every location every month, the long-term trend shows a growing portion of Earth’s surface is warmer than it was during the base period. Q: How did you produce these snapshots? A: Data Snapshots are derivatives of existing data products: to meet the needs of a broad audience, we present the source data in a simplified visual style. NOAA's Environmental Visualization Laboratory (NNVL) produces the source images for the Difference from Average Temperature – Monthly maps. To produce our images, we run a set of scripts that access the source images, re-project them into desired projections at various sizes, and output them with a custom color bar. Additional information Source images available through NOAA's Environmental Visualization Lab (NNVL) are interpolated from data originally provided by the National Center for Environmental Information (NCEI) - Weather and Climate. NNVL images are based on NOAA Merged Land Ocean Global Surface Temperature Analysis data (NOAAGlobalTemp, formerly known as MLOST). 
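The difference-from-average calculation described above amounts to subtracting a per-month, per-region 1991-2020 climatology; a minimal NumPy sketch with placeholder data (the operational product is produced by NOAA from NOAAGlobalTemp, not by this code):

```python
import numpy as np

# Placeholder monthly-mean temperatures: (year, month, lat, lon).
rng = np.random.default_rng(4)
temps = rng.normal(14, 5, size=(40, 12, 36, 72))   # e.g. years 1985-2024
base = slice(6, 36)                                 # rows for 1991-2020

# Difference from average: subtract each region's 1991-2020 mean for the same
# calendar month, leaving a positive value where it was warmer than usual.
climatology = temps[base].mean(axis=0)              # (12, lat, lon)
anomalies = temps - climatology                     # broadcast over years
```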
References NCEI Monthly Global Analysis NOAA View Temperature Anomaly Merged Land Ocean Global Surface Temperature Analysis Global Surface Temperature Anomalies Climate at a Glance - Data Information Source: https://www.climate.gov/maps-data/data-snapshots/data-source/temperature-global-monthly-difference-a...This upload includes two additional files:* Temperature - Global Monthly, Difference from Average _NOAA Climate.gov.pdf is a screenshot of the main Climate.gov site for these snapshots (https://www.climate.gov/maps-data/data-snapshots/data-source/temperature-global-monthly-difference-a...)* Cimate_gov_ Data Snapshots.pdf is a screenshot of the data download page for the full-resolution files.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This dataset comprises monthly mean data from a global, transient simulation with the Whole Atmosphere Community Climate Model eXtension (WACCM-X) from 2015 to 2070. WACCM-X is a global atmosphere model covering altitudes from the surface up to ~500 km, i.e., including the troposphere, stratosphere, mesosphere and thermosphere. WACCM-X version 2.0 (Liu et al., 2018) was used, part of the Community Earth System Model (CESM) release 2.1.0 (http://www.cesm.ucar.edu/models/cesm2) made available by the National Center for Atmospheric Research. The model was run in free-running mode with a horizontal resolution of 1.9 degrees latitude and 2.5 degrees longitude (giving 96 latitude points and 144 longitude points) and 126 vertical levels. Further description of the model and simulation setup is provided by Cnossen (2022) and references therein. A large number of variables is included on standard monthly mean output files on the model grid, while selected variables are also offered interpolated to a constant height grid or vertically integrated in height (details below). Zonal mean and global mean output files are included as well.
The data are provided in NetCDF format and file names have the following structure:
f.e210.FXHIST.f19_f19.h1a.cam.h0.[YYYY]-[MM][DFT].nc
where [YYYY] gives the year with 4 digits, [MM] gives the month (2 digits) and [DFT] specifies the data file type. The following data file types are included:
1) Monthly mean output on the full grid for the full set of variables; [DFT] =
2) Zonal mean monthly mean output for the full set of variables; [DFT] = _zm
3) Global mean monthly mean output for the full set of variables; [DFT] = _gm
4) Height-interpolated/-integrated output on the full grid for selected variables; [DFT] = _ht
A cos(latitude) weighting was used when calculating the global means.
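A minimal NumPy sketch of a cos(latitude)-weighted global mean on a placeholder grid (illustrative only; the actual _gm files were produced during post-processing of the model output):

```python
import numpy as np

lat = np.linspace(-89.05, 89.05, 96)                      # placeholder latitude grid
field = np.random.default_rng(5).normal(size=(96, 144))   # one monthly-mean field

# Global mean with cos(latitude) weighting: average over longitude first,
# then weight each latitude band by the cosine of its latitude.
weights = np.cos(np.deg2rad(lat))
global_mean = np.average(field.mean(axis=1), weights=weights)
```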
Data were interpolated to a set of constant heights (61 levels in total) using the Z3GM variable (for variables output on midpoints, with 'lev' as the vertical coordinate) or the Z3GMI variable (for variables output on interfaces, with ilev as the vertical coordinate) stored on the original output files (type 1 above). Interpolation was done separately for each longitude, latitude and time.
Mass density (DEN [g/cm3]) was calculated from the M_dens, N2_vmr, O2, and O variables on the original data files before interpolation to constant height levels.
The Joule heating power QJ [W/m3] was calculated as QJ = sigma_P * B^2 * ((u_i - u_n)^2 + (v_i - v_n)^2 + (w_i - w_n)^2), with sigma_P = Pedersen conductivity [S/m], B = geomagnetic field strength [T], u_i, v_i, and w_i = zonal, meridional, and vertical ion velocities [m/s], and u_n, v_n, and w_n = the corresponding neutral wind velocities [m/s]. QJ was integrated vertically in height (using a 2.5 km height grid spacing rather than the 61 levels on output file type 4) to give the JHH variable on the type 4 data files. The QJOULE variable also given is the Joule heating rate [K/s] at each of the 61 height levels.
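As an illustration of the expression above, a short NumPy sketch with placeholder inputs (the actual JHH variable was computed from the model fields during post-processing, not by this code):

```python
import numpy as np

def joule_heating_power(sigma_p, b, ui, vi, wi, un, vn, wn):
    """QJ [W/m3]: Pedersen conductivity times B^2 times the squared
    ion-neutral velocity difference, as in the expression above."""
    return sigma_p * b**2 * ((ui - un)**2 + (vi - vn)**2 + (wi - wn)**2)

# Vertical integration to a column-integrated heating [W/m2] on a uniform
# height grid (the dataset uses a 2.5 km spacing for this step).
dz = 2.5e3                                      # grid spacing [m]
heights_m = np.arange(0, 500e3, dz)
qj_profile = np.full(heights_m.size, 1e-9)      # placeholder QJ values [W/m3]
jhh = np.sum(qj_profile * dz)                   # column-integrated value [W/m2]
```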
All data are provided as monthly mean files with one time record per file, giving 672 files for each data file type for the period 2015-2070 (56 years).
References:
Cnossen, I. (2022), A realistic projection of climate change in the upper atmosphere into the 21st century, in preparation.
Liu, H.-L., C.G. Bardeen, B.T. Foster, et al. (2018), Development and validation of the Whole Atmosphere Community Climate Model with thermosphere and ionosphere extension (WACCM-X 2.0), Journal of Advances in Modeling Earth Systems, 10(2), 381-402, doi:10.1002/2017ms001232.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Q: How has the age of Arctic Sea Ice changed over time? A: Since the late 1900s, Arctic sea ice has thinned, and less sea ice has persisted in the Arctic over multiple melt seasons. The trend toward younger, thinner sea ice over time reflects warming temperatures in the Arctic. As older ice is thicker than younger ice, the reduced area of old ice also indicates a reduction in the total volume of ice. Q: Where do these measurements come from? A: Scientists estimate the age of sea ice by combining satellite observations of ice locations and extent with buoy data on winds and motion. Q: What do the colors mean? A: Colors show the age of sea ice floating in the Arctic Ocean. The darkest blue areas on the map show seasonal or first-year ice, which formed during the most recent winter. White areas show where ice is more than four years old. Ice thickness is strongly correlated with ice age. First year ice ranges from 4 to 12 inches (10 to 30 centimeters) thick, while multiyear ice ranges from 6 to 12 feet (2 to 4 meters) thick. This correlation means that in general, the brighter the color, the thicker the ice. Q: Why do these data matter? A: In the mid-to-late 1900s, a core of thick, old year-round sea ice covered much of the Arctic Ocean. Around that core, seasonal ice formed each winter and melted each summer. North of Alaska, a looping current called the Beaufort Gyre historically acted as a nursery for young sea ice where ice could persist and thicken. Ice growth in the gyre roughly offset the steady transport of ice out of the Arctic Ocean through the Fram Strait east of Greenland. Since the year 2000, warmer summers have caused ice to melt in the southern stretch of the Beaufort Gyre, so less multiyear ice has persisted. The result is younger, thinner sea ice than in decades past. Today, the amount of thick, old ice in the Arctic is a small fraction of what it was in the 1980s. Because young, thin ice melts more easily than old, thick ice, the trend toward thinner ice is self-reinforcing. Q: How did you produce these snapshots? A: Data Snapshots are derivatives of existing data products: to meet the needs of a broad audience, we present the source data in a simplified visual style. Additional information These Arctic Sea Ice Age maps use NSIDC Quicklook Arctic Weekly EASE-Grid Sea Ice Age, Version 1 data from 2020 to now, while maps from 2019 and earlier use NSIDC EASE-Grid Sea Ice Age, Version 4 data. Both datasets are available as PNGs (.png) and NetCDF (.nc) files. References Perovich, D., Meier, W., Tschudi, M., Farrell, S., Hendricks, S., Gerland, S., Kaleschke, L., Ricker, R., Tian-Kunze, X., Webster, M., Woods, K. (2019). Sea ice. 2019 Arctic Report Card. Source: https://www.climate.gov/maps-data/data-snapshots/data-source/arctic-sea-ice-age This upload includes two additional files:* Arctic Sea Ice Age _NOAA Climate.gov.pdf is a screenshot of the main Climate.gov site for these snapshots (https://www.climate.gov/maps-data/data-snapshots/data-source/arctic-sea-ice-age )* Cimate_gov_ Data Snapshots.pdf is a screenshot of the data download page for the full-resolution files.
The National Energy Efficiency Data-Framework (NEED) was set up to provide a better understanding of energy use and energy efficiency in domestic and non-domestic buildings in Great Britain. The data framework matches data about a property together - including energy consumption and energy efficiency measures installed - at household level.
We identified 2 processing errors in this edition of the Domestic NEED Annual report and corrected them. The changes are small and do not affect the overall findings of the report, only the domestic energy consumption estimates. The impact of energy efficiency measures analysis remains unchanged. The revisions are summarised here:
Error 2: Some properties incorrectly excluded from the Scotland multiple attributes tables
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the consumer expenditure survey (ce) with r the consumer expenditure survey (ce) is the primo data source to understand how americans spend money. participating households keep a running diary about every little purchase over the year. those diaries are then summed up into precise expenditure categories. how else are you gonna know that the average american household spent $34 (±2) on bacon, $826 (±17) on cellular phones, and $13 (±2) on digital e-readers in 2011? an integral component of the market basket calculation in the consumer price index, this survey recently became available as public-use microdata and they're slowly releasing historical files back to 1996. hooray! for a taste of what's possible with ce data, look at the quick tables listed on their main page - these tables contain approximately a bazillion different expenditure categories broken down by demographic groups. guess what? i just learned that americans living in households with $5,000 to $9,999 of annual income spent an average of $283 (±90) on pets, toys, hobbies, and playground equipment (pdf page 3). you can often get close to your statistic of interest from these web tables. but say you wanted to look at domestic pet expenditure among only households with children between 12 and 17 years old. another one of the thirteen web tables - the consumer unit composition table - shows a few different breakouts of households with kids, but none matching that exact population of interest. the bureau of labor statistics (bls) (the survey's designers) and the census bureau (the survey's administrators) have provided plenty of the major statistics and breakouts for you, but they're not psychic. if you want to comb through this data for specific expenditure categories broken out by a you-defined segment of the united states' population, then let a little r into your life. fun starts now. fair warning: only analyze the consumer expenditure survey if you are a nerd to the core. the microdata ship with two different survey types (interview and diary), each containing five or six quarterly table formats that need to be stacked, merged, and manipulated prior to a methodologically-correct analysis. the scripts in this repository contain examples to prepare 'em all, just be advised that magnificent data like this will never be no-assembly-required. the folks at bls have posted an excellent summary of what's available - read it before anything else. after that, read the getting started guide. don't skim. a few of the descriptions below refer to sas programs provided by the bureau of labor statistics. you'll find these in the C:\My Directory\CES\2011\docs directory after you run the download program.
this new github repository contains three scripts:

2010-2011 - download all microdata.R
- loop through every year and download every file hosted on the bls's ce ftp site
- import each of the comma-separated value files into r with read.csv
- depending on user-settings, save each table as an r data file (.rda) or stata-readable file (.dta)

2011 fmly intrvw - analysis examples.R
- load the r data files (.rda) necessary to create the 'fmly' table shown in the ce macros program documentation.doc file
- construct that 'fmly' table, using five quarters of interviews (q1 2011 thru q1 2012)
- initiate a replicate-weighted survey design object
- perform some lovely li'l analysis examples
- replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using unimputed variables
- replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using unimputed variables
- create an rsqlite database (to minimize ram usage) containing the five imputed variable files, after identifying which variables were imputed based on pdf page 3 of the user's guide to income imputation
- initiate a replicate-weighted, database-backed, multiply-imputed survey design object
- perform a few additional analyses that highlight the modified syntax required for multiply-imputed survey designs
- replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using imputed variables
- replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using imputed variables
- replicate the %proc_reg() and %proc_logistic() macros found in "ce macros.sas" and provide some examples of regressions and logistic regressions using both unimputed and imputed variables

replicate integrated mean and se.R
- match each step in the bls-provided sas program "integrated mean and se.sas" but with r instead of sas
- create an rsqlite database when the expenditure table gets too large for older computers to handle in ram
- export a table "2011 integrated mean and se.csv" that exactly matches the contents of the sas-produced "2011 integrated mean and se.lst" text file

click here to view these three scripts for...
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
***Starting on March 7th, 2024, the Los Angeles Police Department (LAPD) will adopt a new Records Management System for reporting crimes and arrests. This new system is being implemented to comply with the FBI's mandate to collect NIBRS-only data (NIBRS — FBI - https://www.fbi.gov/how-we-can-help-you/more-fbi-services-and-information/ucr/nibrs). During this transition, users will temporarily see only incidents reported in the retiring system. However, the LAPD is actively working on generating new NIBRS datasets to ensure a smoother and more efficient reporting system. ***
******Update 1/18/2024 - LAPD is facing issues with posting the Crime data, but we are taking immediate action to resolve the problem. We understand the importance of providing reliable and up-to-date information and are committed to delivering it.
As we work through the issues, we have temporarily reduced our updates from weekly to bi-weekly to ensure that we provide accurate information. Our team is actively working to identify and resolve these issues promptly.
We apologize for any inconvenience this may cause and appreciate your understanding. Rest assured, we are doing everything we can to fix the problem and get back to providing weekly updates as soon as possible. ******
This dataset reflects incidents of crime in the City of Los Angeles dating back to 2020. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database. Please note questions or concerns in the comments.