analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite puts everything into a snazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
- download the fixed-width file containing household, family, and person records
- import by separating this file into three tables, then merge 'em together at the person-level
- download the fixed-width file containing the person-level replicate weights
- merge the rectangular person-level file with the replicate weights, then store it in a sql database
- create a new variable - one - in the data table

2012 asec - analysis examples.R
- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- perform a boatload of analysis examples

replicate census estimates - 2011.R
- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- match the sas output shown in the png file below

2011 asec replicate weight sas output.png
- statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
- the census bureau's current population survey page
- the bureau of labor statistics' current population survey page
- the current population survey's wikipedia article

notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011.
when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, and sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
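as a flavor of the approach, here's a minimal sketch (not the repository's actual 'download all microdata' script, which splits household, family, and person records before merging) of pulling one asec fixed-width file into r with SAScii and stashing it in a sqlite database; the file names below are placeholders:

```r
# minimal sketch only -- placeholder file names for the nber sas script and asec microdata
library(SAScii)   # parse.SAScii() / read.SAScii()
library(DBI)
library(RSQLite)

sas_script <- "cpsmar2012.sas"       # placeholder: nber sas importation code
fwf_file   <- "asec2012_pubuse.dat"  # placeholder: fixed-width asec microdata

# peek at the column layout recovered from the sas INPUT block
layout <- parse.SAScii(sas_script)
head(layout)

# read the fixed-width file using that layout (this can take a while)
asec12 <- read.SAScii(fwf_file, sas_script)

# stash the person-level table in a sqlite database for later analysis
db <- dbConnect(SQLite(), "cps_asec.db")
dbWriteTable(db, "asec12", asec12, overwrite = TRUE)
dbDisconnect(db)
```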
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and, currently, the escalation of the so-called Israeli-Palestinian conflict, which has strained neighbouring countries like Jordan through the influx of Syrian refugees and increased the population's vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
“code” stores the nine code chunks described below to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk and refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially different time series and merge them later. Note that the time series are temporally consistent.
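A minimal sketch of this step, assuming the raw .hdf files sit in a folder called “MODIS_raw” and the outputs go to “MODIS_NDVI” (both placeholder names; the repository's chunk 1 is the authoritative version):

```r
library(terra)

dir.create("MODIS_NDVI", showWarnings = FALSE)
hdf_files <- list.files("MODIS_raw", pattern = "\\.hdf$", full.names = TRUE)

for (f in hdf_files) {
  s    <- sds(f)        # all subdatasets inside the HDF container
  ndvi <- s[[1]]        # in MOD13Q1 the first subdataset is the 16-day NDVI layer
  out  <- file.path("MODIS_NDVI",
                    paste0(tools::file_path_sans_ext(basename(f)), "_NDVI.tif"))
  writeRaster(ndvi, out, overwrite = TRUE)
}
```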
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in numerical order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
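A minimal sketch of the merging step, with placeholder folder names for the three tile stacks (the repository's chunk 2 is the authoritative version):

```r
library(terra)
library(gtools)

# placeholder folder names for the three tile stacks (h20v05, h21v05, h21v06)
f1 <- mixedsort(list.files("tile_h20v05", pattern = "\\.tif$", full.names = TRUE))
f2 <- mixedsort(list.files("tile_h21v05", pattern = "\\.tif$", full.names = TRUE))
f3 <- mixedsort(list.files("tile_h21v06", pattern = "\\.tif$", full.names = TRUE))

dir.create("merged", showWarnings = FALSE)

for (i in seq_along(f1)) {
  m12 <- merge(rast(f1[i]), rast(f2[i]))   # merge the first two tiles
  m   <- merge(m12, rast(f3[i]))           # then add the third tile
  writeRaster(m, file.path("merged", paste0("NDVI_final_", i, ".tif")),
              overwrite = TRUE)
}
```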
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS. The repository provides the already clipped and merged NDVI datasets.
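A minimal sketch of the cropping step, assuming the merged files were written as described above:

```r
library(terra)
library(gtools)

levant <- vect("mask/MERGED_LEVANT.shp")
files  <- mixedsort(list.files("merged", pattern = "^NDVI_final_.*\\.tif$",
                               full.names = TRUE))

for (i in seq_along(files)) {
  r <- rast(files[i])
  r <- mask(crop(r, levant), levant)   # crop to the bounding box, then mask to the outline
  writeRaster(r, paste0("NDVI_merged_clip_", i, ".tif"), overwrite = TRUE)
}
```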
4_TREND_analysis_NDVI.R
Now, we want to perform a trend analysis on the derived data. The data we load are tricky, as they contain 16-day return periods across a year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values at a high confidence level (p < 0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3. To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
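The per-pixel trend and z-score logic can be sketched as follows (illustrative only; the file name and year range are placeholders, and the repository's chunk 4 handles the seasonal image selection):

```r
library(terra)

# placeholder: a stack with one annually resolved growing-season sum per year
season_sum <- rast("MAM_annual_sums.tif")
years      <- 2001:2022

trend_fun <- function(v) {
  if (sum(!is.na(v)) < 3) return(c(NA_real_, NA_real_))
  fit <- lm(v ~ years)
  c(coef(fit)[2], summary(fit)$coefficients[2, 4])   # slope and p-value
}

tr        <- app(season_sum, trend_fun)
slope     <- tr[[1]]
pval      <- tr[[2]]
slope_sig <- ifel(pval < 0.05, slope, NA)   # keep only trends with p < 0.05

# z-scores: deviation of each year's values from the per-pixel mean, in SD units
z <- (season_sum - mean(season_sum)) / stdev(season_sum)
```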
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different rasters to characterize the built-up change in continuous values between 1975 and 2022.
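An illustrative sketch of the summation, assuming the already cropped 5-year epochs from “derived_data” are used (file and folder names as described in the folder structure above):

```r
library(terra)

files <- list.files("built_up/derived_data", pattern = "^Levant_built_up_.*\\.tif$",
                    full.names = TRUE)

built  <- rast(files)                # one layer per epoch, already cropped to the study area
change <- sum(built, na.rm = TRUE)   # summed built-up presence -> continuous change values

writeRaster(change, "built_up_change_1975_2022.tif", overwrite = TRUE)
```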
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we are using the country productivity data from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9 October 2023). The raw data come in .nc file format, and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variable (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9 October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the spatraster collection. After choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area. From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for the growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year). From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.
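An illustrative sketch of the GLDAS extraction, with a placeholder variable name (“Rainf_tavg”) that must be adjusted as described above:

```r
library(terra)

levant   <- vect("mask/MERGED_LEVANT.shp")
nc_files <- list.files("GLDAS", pattern = "\\.nc4?$", full.names = TRUE)

# inspect the variable names available in one file
names(sds(nc_files[1]))

# pull one variable from every monthly file and stack the layers;
# "Rainf_tavg" is only an example name -- adjust it to the variable you need
var_name <- "Rainf_tavg"
r <- rast(lapply(nc_files, function(f) rast(f, subds = var_name)))

# crop and mask to the study-area outline
r <- mask(crop(r, levant), levant)

# monthly, spatially averaged series for the trend analysis (frequency = 12)
ts_mean <- ts(global(r, "mean", na.rm = TRUE)$mean, start = c(1975, 1), frequency = 12)
```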
9_workflow_diagramme
This simple code can be used to plot a workflow diagram and is detached from the actual analysis.
Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration, and Funding acquisition: Michael Kempf
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes two datasets, a Document-Term Matrix and associated metadata for 17,493 New York Times articles covering protest events, both saved as single R objects.
These datasets are based on the original Dynamics of Collective Action (DoCA) dataset (Wang and Soule 2012; Earl, Soule, and McCarthy). The original DoCA dataset contains variables for protest events referenced in roughly 19,676 New York Times articles reporting on collective action events occurring in the US between 1960 and 1995. Data were collected as part of the Dynamics of Collective Action Project at Stanford University. Research assistants read every page of all daily issues of the New York Times to find descriptions of 23,624 distinct protest events. The text of the news articles was not included in the original DoCA data.
We attempted to recollect the raw text in a semi-supervised fashion by matching article titles to create the Dynamics of Collective Action Corpus. In addition to hand-checking random samples and hand-collecting some articles (specifically, in the case of false positives), we also used some automated matching processes to ensure the recollected article titles matched their respective titles in the DoCA dataset. The final number of recollected and matched articles is 17,493.
We then subset the original DoCA dataset to include only rows that match a recollected article. The "20231006_dca_metadata_subset.Rds" file contains all of the metadata variables from the original DoCA dataset (see Codebook), with the addition of "pdf_file" and "pub_title", which is the title of the recollected article (and may differ from the "title" variable in the original dataset), for a total of 106 variables and 21,126 rows (noting that a row is a distinct protest event and one article may cover more than one protest event).
Once collected, we prepared these texts using typical preprocessing procedures (and some less typical procedures, which were necessary given that these were OCRed texts). We followed these steps in this order: We removed headers and footers that were consistent across all digitized stories and any web links or HTML; added a single space before an uppercase letter when it was flush against a lowercase letter to its right (e.g., turning "JohnKennedy'' into "John Kennedy''); removed excess whitespace; converted all characters to the broadest range of Latin characters and then transliterated to ``Basic Latin'' ASCII characters; replaced curly quotes with their ASCII counterparts; replaced contractions (e.g., turned "it's'' into "it is''); removed punctuation; removed capitalization; removed numbers; fixed word kerning; applied a final extra round of whitespace removal.
We then tokenized them by following the rule that each word is a character string surrounded by a single space. At this step, each document is then a list of tokens. We count each unique token to create a document-term matrix (DTM), where each row is an article, each column is a unique token (occurring at least once in the corpus as a whole), and each cell is the number of times each token occurred in each article. Finally, we removed words (i.e., columns in the DTM) that occurred less than four times in the corpus as a whole or were only a single character in length (likely orphaned characters from the OCRing process). The final DTM has 66,552 unique words, 10,134,304 total tokens, and 17,493 documents. The "20231006_dca_dtm.Rds" is a sparse matrix class object from the Matrix R package.
In R, use the load() function to load the objects `dca_dtm` and `dca_meta`. To associate `dca_meta` with `dca_dtm`, match the "pdf_file" variable in `dca_meta` to the rownames of `dca_dtm`.
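A minimal sketch following these instructions (file names as posted in this repository):

```r
library(Matrix)   # dca_dtm is a sparse matrix from the Matrix package

# the posting says to use load(); if the objects were instead written with
# saveRDS(), readRDS() would be needed in place of load()
load("20231006_dca_dtm.Rds")               # provides `dca_dtm`
load("20231006_dca_metadata_subset.Rds")   # provides `dca_meta`

dim(dca_dtm)   # articles x unique tokens

# one DTM row per protest-event record: match "pdf_file" against the DTM rownames
idx         <- match(dca_meta$pdf_file, rownames(dca_dtm))
dtm_aligned <- dca_dtm[idx, ]
```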
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here you can find the model results of the report:
De Felice, M., Busch, S., Kanellopoulos, K., Kavvadias, K. and Hidalgo Gonzalez, I., Power system flexibility in a variable climate, EUR 30184 EN, Publications Office of the European Union, Luxembourg, 2020, ISBN 978-92-76-18183-5 (online), doi:10.2760/75312 (online), JRC120338.
This dataset contains the raw GDX files generated by the GAMS (General Algebraic Modeling System) optimiser for the Dispa-SET model. Details on the output format and the names of the variables can be found in the Dispa-SET documentation. A markdown notebook in R (and the rendered PDF) contains an example of how to read the GDX files in R.
We also include in this dataset a data frame saved in the Apache Parquet format that can be read both in R and Python.
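A minimal sketch of reading the Parquet data frame in R with the arrow package (the file name below is a placeholder):

```r
library(arrow)

results <- read_parquet("dispaset_results.parquet")   # placeholder file name
head(results)
str(results$climate_year)   # the column corrected in the 29/06/2020 update
```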
A description of the methodology and the data sources, with references, can be found in the report.
Linked resources
Input files: https://zenodo.org/record/3775569#.XqqY3JpS-fc
Source code for the figures: https://github.com/energy-modelling-toolkit/figures-JRC-report-power-system-and-climate-variability
Update
[29/06/2020] Uploaded a new version of the Parquet file with the correct data in the column climate_year.
The dataset contains the following variables (ERA5 names in parentheses):
- t2m: 2-meter temperature (2m_temperature, Celsius degrees)
- ssrd: Surface solar radiation (surface_solar_radiation_downwards, Watt per square meter)
- ssrdc: Surface solar radiation clear-sky (surface_solar_radiation_downward_clear_sky, Watt per square meter)
- ro: Runoff (runoff, millimeters)

There is also a set of derived variables:
- ws10: Wind speed at 10 meters (derived from 10m_u_component_of_wind and 10m_v_component_of_wind, meters per second)
- ws100: Wind speed at 100 meters (derived from 100m_u_component_of_wind and 100m_v_component_of_wind, meters per second)
- CS: Clear-Sky index (the ratio between the solar radiation and the clear-sky solar radiation)
- HDD/CDD: Heating/Cooling Degree Days (derived from 2-meter temperature following the EUROSTAT definition)

For each variable there are 367,440 hourly samples (from 01-01-1980 00:00:00 to 31-12-2021 23:00:00) for 34/115/309 regions (NUTS 0/1/2). The data is provided in two formats:
- NetCDF version 4 (all the variables hourly and CDD/HDD daily). NOTE: the variables are stored as int16 type using a scale_factor to minimise the size of the files.
- Comma Separated Value ("single index" format for all the variables and time frequencies, and "stacked" only for daily and monthly). All the CSV files are stored in a zipped file for each variable.

## Methodology
The time-series have been generated using the following workflow:
1. The NetCDF files are downloaded from the Copernicus Data Store from the "ERA5 hourly data on single levels from 1979 to present" dataset.
2. The data is read in R with the climate4r packages and aggregated using the function get_ts_from_shp from panas. All the variables are aggregated at the NUTS boundaries using the average, except for the runoff, which consists of the sum of all the grid points within the regional/national borders.
3. The derived variables (wind speed, CDD/HDD, clear-sky) are computed and all the CSV files are generated using R.
4. The NetCDF files are created using xarray in Python 3.8.

## Example notebooks
In the folder notebooks on the associated Github repository there are two Jupyter notebooks which show how to deal effectively with the NetCDF data in xarray and how to visualise it in several ways using matplotlib or the enlopy package. There are currently two notebooks:
- exploring-ERA-NUTS: shows how to open the NetCDF files (with Dask) and how to manipulate and visualise them.
- ERA-NUTS-explore-with-widget: explores the datasets interactively with Jupyter and ipywidgets.
The notebook exploring-ERA-NUTS is also available rendered as HTML.

## Additional files
In the folder "additional files" on the associated Github repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points with respect to each NUTS0/1/2 region.

## License
This dataset is released under the CC-BY-4.0 license.

https://creativecommons.org/publicdomain/zero/1.0/
The dataset comprises information on 1000 customers, with 84 features derived from their financial transactions and current financial standing. The primary objective is to leverage this dataset for credit risk estimation and predicting potential defaults.
Key Target Variables: - CREDIT_SCORE: Numerical target variable representing the customer's credit score (integer) - DEFAULT: Binary target variable indicating if the customer has defaulted (1) or not (0)
Description of Features: - INCOME: Total income in the last 12 months - SAVINGS: Total savings in the last 12 months - DEBT: Total existing debt - R_SAVINGS_INCOME: Ratio of savings to income - R_DEBT_INCOME: Ratio of debt to income - R_DEBT_SAVINGS: Ratio of debt to savings
Transactions are grouped into categories (GROCERIES, CLOTHING, HOUSING, EDUCATION, HEALTH, TRAVEL, ENTERTAINMENT, GAMBLING, UTILITIES, TAX, FINES). For each group: - T_[GROUP]_6: Total expenditure in that group in the last 6 months - T_[GROUP]_12: Total expenditure in that group in the last 12 months - R_[GROUP]: Ratio of T_[GROUP]_6 to T_[GROUP]_12 - R_[GROUP]_INCOME: Ratio of T_[GROUP]_12 to INCOME - R_[GROUP]_SAVINGS: Ratio of T_[GROUP]_12 to SAVINGS - R_[GROUP]_DEBT: Ratio of T_[GROUP]_12 to DEBT
Categorical Features: - CAT_GAMBLING: Gambling category (none, low, high) - CAT_DEBT: 1 if the customer has debt; 0 otherwise - CAT_CREDIT_CARD: 1 if the customer has a credit card; 0 otherwise - CAT_MORTGAGE: 1 if the customer has a mortgage; 0 otherwise - CAT_SAVINGS_ACCOUNT: 1 if the customer has a savings account; 0 otherwise - CAT_DEPENDENTS: 1 if the customer has any dependents; 0 otherwise
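As an illustration only (the file name is a placeholder), the documented columns can be used directly in a logistic regression for the binary DEFAULT target:

```r
# placeholder file name; column names follow the documentation above
credit <- read.csv("credit_customers.csv")

# simple logistic regression for the binary DEFAULT target
fit <- glm(DEFAULT ~ INCOME + SAVINGS + DEBT + R_DEBT_INCOME + CAT_GAMBLING,
           data = credit, family = binomial())
summary(fit)

p_hat <- predict(fit, type = "response")   # predicted default probabilities
```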
See XAI course based on this dataset: https://adataodyssey.com/courses/xai-with-python/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
A set of Monte Carlo simulated events, for the evaluation of top quarks' (and their child particles') momentum reconstruction, produced using the HEPData4ML package [1]. Specifically, the entries in this dataset correspond with top quark jets, and the momentum of the jets' constituent particles. This is a newer version of the "Top Quark Momentum Reconstruction Dataset" [2], but with sufficiently large changes to warrant this separate posting.
The dataset is saved in HDF5 format, as sets of arrays with keys (as detailed below). There are ~1.5M events, approximately broken down into the following sets:
Training: 700k events (files with "_train" suffix)
Validation: 200k events (files with "_valid" suffix)
Testing (small): 100k events (files with "_test" suffix)
Testing (large): 500k events (files with "_test_large" suffix)
The two separate types of testing files -- small and large -- are independent from one another, the former for conveniently running quicker testing and the latter for testing with a larger sample.
There are four versions of the dataset present, indicated by the filenames. The different versions correspond with whether or not fast detector simulation was performed (versus truth-level jets), and whether or not the W-boson mass was modified: one version of the dataset uses the nominal value of (m_W = 80.385 \text{ GeV}) as used by Pythia8 [3], whereas another uses a variable mW taking on 101 evenly-spaced values in (m_W \in [64.308, 96.462] \text{ GeV}). The dataset naming scheme is as follows:
train.h5 : jets clustered from truth-level, nominal mW
train_mW.h5: jets clustered from truth-level, variable mW
train_delphes.h5: jets clustered from Delphes outputs, nominal mW
train_delphes_mW.h5: jets clustered from Delphes outputs, variable mW
Description
13 TeV center-of-mass energy, fully hadronic top quark decays, simulated with Pythia8. ((t \rightarrow W \, b, \; W\rightarrow q \, q'))
Events are generated with leading top quark pT in [550,650] GeV. (set via Pythia8's (\hat{p}_{T,\text{ min}}) and (\hat{p}_{T,\text{ max}}) variables)
No initial- or final-state radiation (ISR/FSR), nor multi-parton interactions (MPI).
Where applicable, detector simulation is done using DELPHES [4], with the ATLAS detector card.
Clustering of particles/objects is done via FastJet [5], using the anti-kT algorithm, with (R=0.8) .
For the truth-level data, inputs to jet clustering are truth-level, final-state particles (i.e. clustering "truth jets").
For the data with detector simulation, the inputs are calorimeter towers from DELPHES.
Tower objects from DELPHES (not E-flow objects, no tracking information).
Each entry in the dataset corresponds with a single top quark jet, extracted from a (t\bar{t}) event.
All jets are matched to a parton-level top quark within (\Delta R < 0.8) . We choose the jet nearest the parton-level top quark.
Jets are required to have (|\eta| < 2), and (p_{T} > 15 \text{ GeV}).
The 200 leading (highest-pT) jet constituent four-momenta are stored in Cartesian coordinates (E,px,py,pz), sorted by decreasing pT, with zero-padding.
The jet four-momentum is stored in Cartesian coordinates (E, px, py, pz), as well as in cylindrical coordinates ((p_T,\eta,\phi,m)).
The truth (parton-level) four-momenta of the top quark, the bottom quark, the W-boson, and the quarks to which the W-boson decays are stored in Cartesian coordinates.
In addition, the momenta of the 120 leading stable daughter particles of the W-boson are stored in Cartesian coordinates.
Description of data fields & metadata
Below is a brief description of the various fields in the dataset. The dataset also contains metadata fields, stored using HDF5's "attributes". These are used for fields that are common across many events, and store information such as generator-level configurations (in principle, all the information is stored so as to be able to recreate the dataset with the HEPData4ML tool).
Note that fields whose keys have the prefix "jh_" correspond with output from the Johns Hopkins top tagger [6], as implemented in FastJet.
Also note that for the keys corresponding with four-momenta in Cartesian coordinates, there are rotated versions of these fields -- the data has been rotated so that the W-boson is at ((\theta=0, \phi=0)), and the b-quark is in the ((\theta=0, \phi < 0)) plane. This rotation is potentially useful for visualizations of the events.
Nobj: The number of constituents in the jet.
Pmu: The four-momenta of the jet constituents, in (E, px, py, pz). Sorted by decreasing pT and zero-padded to a length of 200.
Pmu_rot: Rotated version.
contained_daughter_sum_Pmu: Four-momentum sum of the stable daughter particles of the W-boson that fall within (\Delta R < 0.8) of the jet centroid.
contained_daughter_sum_Pmu_rot: Rotated version.
cross_section: Cross-section for the corresponding process, reported by Pythia8.
cross_section_uncertainty: Cross-section uncertainty for the corresponding process, reported by Pythia8.
energy_ratio_smeared: Ratio of the true energy of W-boson daughter particles contributing to this calorimeter tower, divided by the total smeared energy in this calorimeter tower.
Only relevant for the DELPHES datasets.
energy_ratio_truth: Ratio of the true energy of W-boson daughter particles contributing to this calorimeter tower, divided by the total true energy of particles contributing to this calorimeter tower.
The above definition is relevant only for the DELPHES datasets. For the truth-level datasets, this field is repurposed to store a value (0 or 1) indicating whether or not the given particle (whose momentum is in the Pmu field) is a W-boson daughter.
event_idx: Redundant -- used for event indexing during the event generation process.
is_signal: Redundant -- indicates whether an event is signal or background, but this is a fully signal dataset. Potentially useful if combining with other datasets produced with HEPData4ML.
jet_Pmu: Four-momentum of the jet, in (E, px, py, pz).
jet_Pmu_rot: Rotated version.
jet_Pmu_cyl: Four-momentum of the jet, in ((p_T,\eta,\phi,m)).
jet_bqq_contained_dR06: Boolean flag indicating whether or not the truth-level b and the two quarks from W decay are contained within (\Delta R < 0.6) of the jet centroid.
jet_bqq_contained_dR08: Boolean flag indicating whether or not the truth-level b and the two quarks from W decay are contained within (\Delta R < 0.8) of the jet centroid.
jet_bqq_dr_max: Maximum of (\big\lbrace \Delta R \left( \text{jet},b \right), \; \Delta R \left( \text{jet},q \right), \; \Delta R \left( \text{jet},q' \right) \big\rbrace).
jet_qq_contained_dR06: Boolean flag indicating whether or not the two quarks from W decay are contained within (\Delta R < 0.6) of the jet centroid.
jet_qq_contained_dR08: Boolean flag indicating whether or not the two quarks from W decay are contained within (\Delta R < 0.8) of the jet centroid.
jet_qq_dr_max: Maximum of (\big\lbrace \Delta R \left( \text{jet},q \right), \; \Delta R \left( \text{jet},q' \right) \big\rbrace).
jet_top_daughters_contained_dR08: Boolean flag indicating whether the final-state daughters of the top quark are within (\Delta R < 0.8) of the jet centroid. Specifically, the algorithm for this flag checks that the jet contains the stable daughters of both the b quark and the W boson. For the b and W each, daughter particles are allowed to be uncontained as long as (for each particle) the (p_T) of the sum of uncontained daughters is below (2.5 \text{ GeV}).
jh_W_Nobj: Number of constituents in the W-boson candidate identified by the JH tagger.
jh_W_Pmu: Four-momentum of the JH tagger W-boson candidate, in (E, px, py, pz).
jh_W_Pmu_rot: Rotated version.
jh_W_constituent_Pmu: Four-momentum of the constituents of the JH tagger W-boson candidate, in (E, px, py, pz).
jh_W_constituent_Pmu_rot: Rotated version.
jh_m: Mass of the JH W-boson candidate.
jh_m_resolution: Ratio of JH W-boson candidate mass, versus the true W-boson mass.
jh_pt: (p_T) of the JH W-boson candidate.
jh_pt_resolution: Ratio of JH W-boson candidate (p_T), versus the true W-boson (p_T).
jh_tag: Whether or not a jet was tagged by the JH tagger.
mc_weight: Monte Carlo weight for this event, reported by Pythia8.
process_code: Process code reported by Pythia8.
rotation_matrix: Rotation matrix for rotating the events' 3-momenta as to produce the rotated copies stored in the dataset.
truth_Nobj: Number of truth-level particles (saved in truth_Pmu).
truth_Pdg: PDG codes of the truth-level particles.
truth_Pmu: Truth-level particles: The top quark, bottom quark, W boson, q, q', and 120 leading, stable W-boson daughter particles, in (E, px, py, pz). A few of these are also stored in separate keys:
truth_Pmu_0: Top quark.
truth_Pmu_0_rot: Rotated version.
truth_Pmu_1: Bottom quark.
truth_Pmu_1_rot: Rotated version.
truth_Pmu_2: W-boson.
truth_Pmu_2_rot: Rotated version.
truth_Pmu_3: q from W decay.
truth_Pmu_3_rot: Rotated version.
truth_Pmu_4: q' from W decay.
truth_Pmu_4_rot: Rotated version.
truth_Pmu_rot: Rotated version of truth_Pmu.
The following fields correspond with metadata -- they provide the index of the corresponding metadata entry for each event:
command_line_arguments: The command-line arguments passed to HEPData4ML's run.py script.
config_file: The contents of the Python configuration file used for HEPData4ML. This, together with the command-line arguments, defines how the tool was run, what processes, jet clustering and post-processing was done, etc.
git_hash: Git hash for HEPData4ML.
timestamp: Timestamp for when the dataset was created
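Since the files are plain HDF5, any HDF5 reader will do; a minimal sketch in R using the Bioconductor package rhdf5, reading a few of the keys listed above, looks like this:

```r
# rhdf5 is a Bioconductor package: BiocManager::install("rhdf5")
library(rhdf5)

h5ls("train.h5")                              # list all keys and their dimensions

nobj    <- h5read("train.h5", "Nobj")         # constituent multiplicity per jet
jet_pmu <- h5read("train.h5", "jet_Pmu")      # jet four-momentum (E, px, py, pz)
pmu     <- h5read("train.h5", "Pmu")          # constituent four-momenta, zero-padded to 200

h5readAttributes("train.h5", "/")             # generator-level metadata stored as attributes
```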
The high-frequency phone survey of refugees monitors the economic and social impact of, and responses to, the COVID-19 pandemic on refugees and nationals, by calling a sample of households every four weeks. The main objective is to inform timely and adequate policy and program responses. Since the outbreak of the COVID-19 pandemic in Ethiopia, two rounds of data collection of refugees were completed between September and November 2020. The first round of the joint national and refugee HFPS was implemented between 24 September and 17 October 2020, and the second round between 20 October and 20 November 2020.
Household
Sample survey data [ssd]
The sample was drawn using a simple random sample without replacement. Expecting a high non-response rate based on experience from the HFPS-HH, we drew a stratified sample of 3,300 refugee households for the first round. More details on sampling methodology are provided in the Survey Methodology Document available for download as Related Materials.
Computer Assisted Telephone Interview [cati]
The Ethiopia COVID-19 High Frequency Phone Survey of Refugee questionnaire consists of the following sections:
A more detailed description of the questionnaire is provided in Table 1 of the Survey Methodology Document that is provided as Related Materials. The Round 1 and Round 2 questionnaires are available for download.
DATA CLEANING At the end of data collection, the raw dataset was cleaned by the Research team. This included formatting, and correcting results based on monitoring issues, enumerator feedback and survey changes. Data cleaning carried out is detailed below.
Variable naming and labeling:
• Variable names were changed to reflect the lowercase question name in the paper survey copy, and a word or two related to the question.
• Variables were labeled with longer descriptions of their contents and the full question text was stored in Notes for each variable.
• “Other, specify” variables were named similarly to their related question, with “_other” appended to the name.
• Value labels were assigned where relevant, with options shown in English for all variables, unless preloaded from the roster in Amharic.
Variable formatting:
• Variables were formatted as their object type (string, integer, decimal, time, date, or datetime).
• Multi-select variables were saved both in space-separated single-variables and as multiple binary variables showing the yes/no value of each possible response.
• Time and date variables were stored as POSIX timestamp values and formatted to show Gregorian dates.
• Location information was left in separate ID and Name variables, following the format of the incoming roster. IDs were formatted to include only the variable level digits, and not the higher-level prefixes (2-3 digits only.)
• Only consented surveys were kept in the dataset, and all personal information and internal survey variables were dropped from the clean dataset.
• Roster data is separated from the main data set and kept in long form, but can be merged on the key variable (key can also be used to merge with the raw data).
• The variables were arranged in the same order as the paper instrument, with observations arranged according to their submission time.
Backcheck data review: Results of the backcheck survey are compared against the originally captured survey results using the bcstats command in Stata. This function delivers a comparison of variables and identifies any discrepancies. Any discrepancies identified are then examined individually to determine if they are within reason.
The following data quality checks were completed:
• Daily SurveyCTO monitoring: This included outlier checks, skipped questions, a review of “Other, specify” and other text responses, and enumerator comments. Enumerator comments were used to suggest new response options or to highlight situations where existing options should be used instead. Monitoring also included a review of variable relationship logic checks and checks of the logic of answers. Finally, outliers in phone variables such as survey duration or the percentage of time audio was at a conversational level were monitored. A survey duration of close to 15 minutes and a conversation-level audio percentage of around 40% was considered normal.
• Dashboard review: This included monitoring individual enumerator performance, such as the number of calls logged, duration of calls, percentage of calls responded to and percentage of non-consents. Non-consent reason rates and attempts per household were monitored as well. Duration analysis using R was used to monitor each module's duration and estimate the time required for subsequent rounds. The dashboard was also used to track overall survey completion and preview the results of key questions.
• Daily Data Team reporting: The Field Supervisors and the Data Manager reported daily feedback on call progress, enumerator feedback on the survey, and any suggestions to improve the instrument, such as adding options to multiple choice questions or adjusting translations.
• Audio audits: Audio recordings were captured during the consent portion of the interview for all completed interviews, for the enumerators' side of the conversation only. The recordings were reviewed for any surveys flagged by enumerators as having data quality concerns and for an additional random sample of 2% of respondents. A range of lengths was selected to observe edge cases. Most consent readings took around one minute, with some longer recordings due to questions on the survey or holding for the respondent. All reviewed audio recordings were completed satisfactorily.
• Back-check survey: Field Supervisors made back-check calls to a random sample of 5% of the households that completed a survey in Round 1. Field Supervisors called these households and administered a short survey, including (i) identifying the same respondent; (ii) determining the respondent's position within the household; (iii) confirming that a member of the data collection team had completed the interview; and (iv) a few questions from the original survey.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
!!!WARNING~~~ This dataset has a large number of flaws and is unable to properly answer many questions that people generally use it to answer, such as whether national hate crimes are changing (or at least they use the data so improperly that they get the wrong answer). A large number of people using this data (academics, advocates, reporters, US Congress) do so inappropriately and get the wrong answer to their questions as a result. Indeed, many published papers using this data should be retracted. Before using this data I highly recommend that you thoroughly read my book on UCR data, particularly the chapter on hate crimes (https://ucrbook.com/hate-crimes.html), as well as the FBI's own manual on this data. The questions you could potentially answer well are relatively narrow and generally exclude any causal relationships. ~~~WARNING!!!

For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.com

Version 9 release notes:
- Adds 2021 data.

Version 8 release notes:
- Adds 2019 and 2020 data. Please note that the FBI has retired UCR data ending in 2020, so this will be the last UCR hate crime data they release.
- Changes .rda file to .rds.

Version 7 release notes:
- Changes release notes description, does not change data.

Version 6 release notes:
- Adds 2018 data.

Version 5 release notes:
- Adds data in the following formats: SPSS, SAS, and Excel.
- Changes project name to avoid confusing this data for the ones done by NACJD.
- Adds data for 1991.
- Fixes bug where bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013, causing there to be two columns and zero values for years with the wrong label.
- All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS Setup files and is read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R.

Version 4 release notes:
- Adds data for 2017.
- Adds rows that submitted a zero-report (i.e. that agency reported no hate crimes in the year). This is for all years 1992-2017.
- Made changes to categorical variables (e.g. bias motivation columns) to make categories consistent over time. Different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian') which I made consistent.
- Made the 'population' column, which is the total population in that agency.

Version 3 release notes:
- Adds data for 2016.
- Orders rows by year (descending) and ORI.

Version 2 release notes:
- Fixes bug where Philadelphia Police Department had an incorrect FIPS county code.

The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open. Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9-character Originating Identifier code), and incident number columns together. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group).
Finally, it has a number of columns indicating if the victim for each offense was a certain type of victim or not (e.g. individual victim, business victim, religious victim, etc.). The only changes I made to the data are the following: minor changes to column names to make all column names 32 characters or fewer (so the data can be saved in a Stata format), making all character values lower case, and reordering columns. I also generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
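A minimal sketch of loading the R version of the data (the file name below is a placeholder for whichever .rds file you downloaded):

```r
# placeholder file name for the .rds version of the data
hate <- readRDS("fbi_hate_crimes.rds")

nrow(hate)            # one row per hate crime incident (or zero-report) per agency-year
head(hate$unique_id)  # the year + ORI9 + incident number identifier described above
```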
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset presents the code written for the analysis and modelling for the Jellyfish Forecasting System for NESP TWQ Project 2.2.3. The Jellyfish Forecasting System (JFS) searches for robust statistical relationships between historical sting events (and observations) and local environmental conditions. These relationships are tested using data to quantify the underlying uncertainties. They then form the basis for forecasting risk levels associated with current environmental conditions.
The development of the JFS modelling and analysis is supported by the Venomous Jellyfish Database (sting events and specimen samples – November 2018) (NESP 2.2.3, CSIRO) with corresponding analysis of wind fields and tidal heights along the Queensland coastline. The code has been calibrated and tested for the study focus regions including Cairns (Beach, Island, Reef), Townsville (Beach, Island+Reef) and Whitsundays (Beach, Island+Reef).
The JFS uses the European Centre for Medium-Range Weather Forecasts (ECMWF) wind fields from the ERA-Interim, Daily product (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim). This daily product has global coverage at a spatial resolution of approximately 80 km. However, only 11 locations off the Queensland coast were extracted, covering the period 1-Jan-1985 to 31-Dec-2016. For the modelling, the data has been transformed into CSV files containing date, eastward wind (m/s) and northward wind (m/s), for each of the 11 geographical locations.
Hourly tidal height was calculated from tidal harmonics supplied by the Bureau of Meteorology (http://www.bom.gov.au/oceanography/projects/ntc/ntc.shtml) using the XTide software (http://www.flaterco.com/xtide/). Hourly tidal heights have been calculated for 7 sites along the Queensland coast (Albany Island, Cairns, Cardwell, Cooktown, Fife, Grenville, Townsville) for the period 1-Jan-1985 to 31-Dec-2017. Data has been transformed into CSV files, one for each of the 7 sites. Columns correspond to number of days since 1-Jan 1990 and tidal height (m).
Irukandji stings were then modelled using a generalised linear model (GLM). A GLM generalises ordinary linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value (McCullagh & Nelder 1989). For each region, we used a GLM with the number of Irukandji stings per day as the response variable. The GLM had a Poisson error structure and a log link function (Crawley 2005). For the Poisson GLMs, we inferred absences when stings were not recorded in the data for a day. We consider that there was reasonably consistent sampling effort in the database since 1985, but very patchy prior to this date. It should be noted that Irukandji are very patchy in time; for example, there was a single sting record in 2017 despite considerable effort trying to find stings in that year. Although the database might miss small and localised Irukandji sting events, we believe it captures larger infestation events.
We included six predictors in the models: Month, two wind variables, and three tidal variables. Month was a factor and arranged so that the summer was in the middle of the year (i.e., from June to May). The two wind variables were Speed and Direction. For each day within each region (Cairns, Townsville or Whitsundays), hourly wind-speed and direction was used. We derived cumulative wind Speed and Direction, working backwards from each day, with the current day being Day 1. We calculated cumulative winds from the current day (Day 1) to 14 days previously for every day in every Region and Area. To provide greater weighting for winds on more recent days, we used an inverse weighting for each day, where the weighting was given by 1/i for each day i. Thus, the Cumulative Speed for n days is given by:
\text{Cumulative Speed}_n = \frac{\sum_{i=1}^{n} \text{Speed}_i / i}{\sum_{i=1}^{n} 1/i}
For example, calculations for the 3-day cumulative wind speed are:
(1/1 × Wind Day 1 + 1/2 × Wind Day 2 + 1/3 × Wind Day 3) / (1/1 + 1/2 + 1/3)
Similarly, we calculated the cumulative weighted wind Direction using the formula:
\text{Cumulative Direction}_n = \frac{\sum_{i=1}^{n} \text{Direction}_i / i}{\sum_{i=1}^{n} 1/i}
We used circular statistics in the R package circular to calculate the weighted cumulative mean, because direction 0° is the same as 360°. We initially used a smoother for this term in the model, but because of its non-linearity and the lack of winds from all directions, we found that it was better to use wind Direction as a factor with four levels (NW, NE, SE and SW). In some Regions and Areas, not all wind Directions were present.
To assign each event to the tidal cycle, we used tidal data from the closest of the seven stations to calculate three tidal variables: (i) the tidal range each day (m); (ii) the tidal height (m); and (iii) whether the tide was incoming or outgoing. To estimate the three tidal variables, the time of day of the event was required. However, the Time of Day was only available for 780 observations, and the 291 missing observations were estimated assuming a random Time of Day, which will not influence the relationship but will keep these rows in the analysis. Tidal range was not significant in any models and will not be considered further.
To focus on times when Irukandji were present, months when stings never occurred in an area/region were excluded from the analysis – this is generally the winter months. For model selection, we used Akaike Information Criterion (AIC), which is an estimate of the relative quality of models given the data, to choose the most parsimonious model. We thus do not talk about significant predictors, but important ones, consistent with information theoretic approaches.
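An illustrative sketch (not the project's actual code) of the weighted cumulative wind speed defined above and a Poisson GLM of daily sting counts; `dat` and its column names are hypothetical:

```r
# weighted cumulative wind speed, working backwards from each day (Day 1 = current day)
cum_speed <- function(speed, n = 14) {
  w <- 1 / seq_len(n)                          # weights 1/1, 1/2, ..., 1/n
  sapply(seq_along(speed), function(d) {
    i <- d:max(1, d - n + 1)                   # current day backwards, up to n days
    sum(speed[i] * w[seq_along(i)]) / sum(w[seq_along(i)])
  })
}

# `dat` is a hypothetical daily data frame with columns stings, month, wind_speed,
# wind_dir4 (factor NW/NE/SE/SW), tide_height and tide_incoming
dat$cum_speed14 <- cum_speed(dat$wind_speed, n = 14)

fit <- glm(stings ~ month + cum_speed14 + wind_dir4 + tide_height + tide_incoming,
           family = poisson(link = "log"), data = dat)
AIC(fit)   # used for model selection as described above
```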
Limitations: It is important to note that while the presence of Irukandji is more likely on high risk days, the forecasting system should not be interpreted as predicting the presence of Irukandji or that stings will occur.
Format:
It is a text file with a .r extension, the default code format in R. This code runs on the csv datafile “VJD_records_EXTRACT_20180802_QLD.csv” that has latitude, longitude, date, and time of day for each Irukandji sting on the GBR. A subset of these data have been made publicly available through eAtlas, but not all data could be made publicly available because of permission issues. For more information about data permissions, please contact Dr Lisa Gershwin (lisa.gershwin@stingeradvisor.com).
Data Location:
This dataset is filed in the eAtlas enduring data repository at: data\custodian\2016-18-NESP-TWQ-2\2.2.3_Jellyfish-early-warning\data\ and https://github.com/eatlas/NESP_2.2.3_Jellyfish-early-warning
http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/
This is the HadISDH.land 4.6.0.2023f version of the Met Office Hadley Centre Integrated Surface Dataset of Humidity (HadISDH). HadISDH.land is a near-global gridded monthly mean land surface humidity climate monitoring product. It is created from in situ observations of air temperature and dew point temperature from weather stations. The observations have been quality controlled and homogenised. Uncertainty estimates for observation issues and gridbox sampling are provided (see data quality statement section below). The data are provided by the Met Office Hadley Centre and this version spans 1/1/1973 to 31/12/2023.
The data are monthly gridded (5 degree by 5 degree) fields. Products are available for temperature and six humidity variables: specific humidity (q), relative humidity (RH), dew point temperature (Td), wet bulb temperature (Tw), vapour pressure (e), dew point depression (DPD).
This version extends the previous version to the end of 2023. Users are advised to read the update document in the Docs section for full details on all changes from the previous release.
As in previous years, the annual scrape of NOAA's Integrated Surface Dataset for HadISD.3.4.0.2023f, which is the basis of HadISDH.land, has pulled through some historical changes to stations. This, and the additional year of data, results in small changes to station selection. The homogeneity adjustments differ slightly due to sensitivity to the addition and loss of stations, historical changes to stations previously included, and the additional 12 months of data.
To keep informed about updates, news and announcements follow the HadOBS team on twitter @metofficeHadOBS.
For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISDH blog: http://hadisdh.blogspot.co.uk/
References:
When using the dataset in a paper please cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference):
Willett, K. M., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Parker, D. E., Jones, P. D., and Williams Jr., C. N.: HadISDH land surface multi-variable humidity and temperature record for climate monitoring, Clim. Past, 10, 1983-2006, doi:10.5194/cp-10-1983-2014, 2014.
Dunn, R. J. H., et al. 2016: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geoscientific Instrumentation, Methods and Data Systems, 5, 473-491.
Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704-708, doi:10.1175/2011BAMS3015.1
We strongly recommend that you read these papers before making use of the data, more detail on the dataset can be found in an earlier publication:
Willett, K. M., Williams Jr., C. N., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Jones, P. D., and Parker D. E., 2013: HadISDH: An updated land surface specific humidity product for climate monitoring. Climate of the Past, 9, 657-677, doi:10.5194/cp-9-657-2013.
This is version 1.0.3.2014f of HadISD (27 April 2015), the Met Office Hadley Centre's global sub-daily dataset, extending v1.0.2.2013f to span 1/1/1973 - 31/12/2014. The quality controlled variables in this dataset are: temperature, dewpoint temperature, sea-level pressure, wind speed and direction, and cloud data (total, low, mid and high level). Past significant weather and precipitation data are also included, but have not been quality controlled, so their quality and completeness cannot be guaranteed. Quality control flags and data values which have been removed during the quality control process are provided in the qc_flags and flagged_values fields, and ancillary data files provide a station listing with IDs, names and location information.

The data are provided as one NetCDF file per station. Files in the station_data folder have the format "station_code"_HadISD_HadOBS_19730101-20141231_v1-0-3-2014f.nc. The station codes can be found under the docs tab or on the archive beside the station_data folder. The station codes file has five columns as follows: 1) station code, 2) station name, 3) station latitude, 4) station longitude, 5) station height.

To keep up to date with updates, news and announcements follow the HadOBS team on twitter @metofficeHadOBS. For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISD blog: http://hadisd.blogspot.co.uk/

References:
When using the dataset in a paper you must cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference):

Dunn, R. J. H., et al. (2012), HadISD: A quality-controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Clim. Past, 8, 1649-1679, doi:10.5194/cp-8-1649-2012.

Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704-708, doi:10.1175/2011BAMS3015.1.

For a homogeneity assessment of HadISD please see the following reference:

Dunn, R. J. H., K. M. Willett, C. P. Morice, and D. E. Parker, 2014: Pairwise homogeneity assessment of HadISD. Climate of the Past, 10, 1501-1522, doi:10.5194/cp-10-1501-2014.
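A minimal sketch of opening one station file in R with the ncdf4 package (the station code in the file name is a placeholder, and variable names should be taken from the file's own listing via print(nc)):

```r
library(ncdf4)

# placeholder station code in the file name
nc <- nc_open("010010-99999_HadISD_HadOBS_19730101-20141231_v1-0-3-2014f.nc")
print(nc)                              # lists the variables, dimensions and attributes

# variable names should match the print(nc) output;
# "temperatures" is assumed here for the quality-controlled air temperature series
temp <- ncvar_get(nc, "temperatures")
time <- ncvar_get(nc, "time")
nc_close(nc)
```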
https://spdx.org/licenses/CC0-1.0.html
These are the Supplementary Material, R scripts and numerical results accompanying Bartoszek, Fuentes Gonzalez, Mitov, Pienaar, Piwczyński, Puchałka, Spalik and Voje, "Model Selection Performance in Phylogenetic Comparative Methods under multivariate Ornstein-Uhlenbeck Models of Trait Evolution". The four data files concern two datasets. Ungulates: measurements of muzzle width, unworn lower third molar crown height, unworn lower third molar crown width and feeding style, and their phylogeny. Ferula: measurements of ratio of canals, periderm thickness, wing area, wing thickness, and fruit mass, and their phylogeny.

Methods

Ungulates
The compiled ungulate dataset involves two key components: phenotypic data (Data.csv) and a phylogenetic tree (Tree.tre), which consist of the following (full references for the citations presented below are provided in the paper linked to this repository, which also provides further details on the compiled dataset):
The phenotypic data include three continuous variables and one categorical variable. The continuous variables (MZW: muzzle width; HM3: unworn lower third molar crown height; WM3: unworn lower third molar crown width), measured in cm, come from Mendoza et al. (2002; J. Zool.). The categorical variable (FS, i.e. feeding style: B=browsers, G=grazers, M=mixed feeders) is based on Pérez–Barbería and Gordon (2001; Proc. R. Soc. B: Biol. Sci.). Taxonomic mismatches between these two sources were resolved based on Wilson and Reeder (2005; Johns Hopkins University Press). Only taxa with full entries for all these variables were included (i.e. no missing data allowed).
The phylogenetic tree is pruned from the unsmoothed mammalian timetree of Hedges et al. (2015; MBE) to only include the 104 ungulate species for which there is complete phenotypic data available. Wilson and Reeder (2005; Johns Hopkins University Press) was used again to resolve taxonomic mismatches with the phenotypic data. The branch lengths of the tree are scaled to unit height and thus informative of relative time.

Ferula
1) The phenotypic data are divided into two data sets: the first contains five continuous variables (no_ME) measured on mericarps (the dispersal unit of fruit in Apiaceae), whereas the second has the same variables together with measurement error (ME; see paper for computational details) for 75 species of Ferula and three species of Leutea. Three continuous variables were measured on anatomical cross sections (ratio_canals_ln – the proportion of oil ducts covering the space between median and lateral ribs [dimensionless], mean_gr_peri_ln_um – periderm (fruit wall) thickness [μm], thick_wings_ln_um – wing thickness [μm]); the remaining two on whole mericarps (Wings_area_ln_mm – wing area [mm2], Seed_mass_ln_mg – seed mass [mg]).
2) The phylogenetic tree was pruned from the tree obtained from the recent taxonomic revision of the genus (Panahi et al. 2018) to only include the 78 species for which the phenotypic data were generated. This tree and the associated alignment, composed of one nuclear and three plastid markers (Panahi et al. 2018), constituted an input to the mcmctree software (Yang 2007) to obtain a dated tree using a secondary calibration point for the root based on Banasiak et al.'s (2013) work. The branch lengths of the final tree (Ferula_fruits_tree.txt) were scaled to unit height and thus informative of relative time.
The R setup for the manuscript was as follows: R version 3.6.1 (2019-09-12), Platform: x86_64-pc-linux-gnu (64-bit), Running under: openSUSE Leap 42.3. The exact output can depend on the random seed; however, the scripts have the option of rerunning the analyses exactly as in the manuscript, i.e. the random seeds that were used to generate the results are saved, included and can be read in. The code is divided into several directories with scripts, random seeds and result files.
1) LikelihoodTesting: Directory contains the script test_rotation_invariance_mvSLOUCH.R, which demonstrates that mvSLOUCH's likelihood calculations are rotation invariant.
2) Carnivorans: Directory contains files connected to the Carnivorans' vignette in mvSLOUCH.
2.1) Carnivora_mvSLOUCH_objects_Full.RData: Full output of running the R code in the vignette. Included with mvSLOUCH is a bare-minimum subset of this file that allows for the creation of the vignette.
2.2) Carnivora_mvSLOUCH_objects.RData: Reduced objects from Carnivora_mvSLOUCH_objects_Full.RData that are included with mvSLOUCH's vignette.
2.3) Carnivora_mvSLOUCH_objects_remove_script.R: R script to reduce Carnivora_mvSLOUCH_objects_Full.RData to Carnivora_mvSLOUCH_objects.RData.
2.4) mvSLOUCH_Carnivorans.Rmd: The vignette itself.
2.5) refs_mvSLOUCH.bib: Bib file for the vignette.
2.6) ScaledTree.png, ScaledTree2.png, ScaledTree3.png, ScaledTree4.png: Plots of phylogenetic trees for the vignette.
3) SimulationStudy: Directory contains all the output of the simulation study presented in the manuscript and scripts that allow for replication (the random number generator seeds are also provided) or for running one's own simulation study, as well as scripts to generate graphs and model comparison summaries. This study was done using version 2.6.2 of mvSLOUCH. If one reruns it using mvSLOUCH >= 2.7, one will obtain different (corrected) values of R2 and an additional R2 version.
4) Ungulates: Directory contains files connected to the "Feeding styles and oral morphology in ungulates" analyses performed for the manuscript.
4.1) Data.csv: The phenotypic data include three continuous variables and one categorical variable. Continuous variables (MZW: muzzle width; HM3: unworn lower third molar crown height; WM3: unworn lower third molar crown width) from Mendoza et al. (2002), measured in cm. Categorical variable (FS, i.e. feeding style: B=browsers, G=grazers, M=mixed feeders) based on Pérez–Barbería and Gordon (2001). Phylogeny pruned from Hedges et al. (2015). Taxonomic mismatches among these sources were resolved based on Wilson and Reeder (2005). Hedges, S. B., J. Marin, M. Suleski, M. Paymer, and S. Kumar. 2015. Tree of life reveals clock-like speciation and diversification. Molecular Biology and Evolution 32:835-845. Mendoza, M., C. M. Janis, and P. Palmqvist. 2002. Characterizing complex craniodental patterns related to feeding behaviour in ungulates: a multivariate approach. Journal of Zoology 258:223-246. Pérez–Barbería, F. J., and I. J. Gordon. 2001. Relationships between oral morphology and feeding style in the Ungulata: a phylogenetically controlled evaluation. Proceedings of the Royal Society of London. Series B: Biological Sciences 268:1023-1032. Wilson, D. E., and D. M. Reeder. 2005. Mammal species of the world: A taxonomic and geographic reference. Johns Hopkins University Press, Baltimore, Maryland.
4.2) Tree.tre: Ungulates' phylogeny, extracted from the mammalian phylogeny of Hedges, S. B., J. Marin, M. Suleski, M. Paymer, and S. Kumar. 2015. Tree of life reveals clock-like speciation and diversification. Mol. Biol. Evol. 32:835-845.
4.3) OUB.R, OUF.R, OUG.R: R scripts for the analyses performed in the manuscript. Different files correspond to different regime setups of the feeding style variable.
4.4) OU1.txt, OUB.txt, OUF.txt, OUG.txt: Outputs of the model comparison conducted with the R scripts listed above (4.3). Different files correspond to different regime setups of the feeding style variable.
5) Ferula analyses: The models_ME directory contains input and output files from the mvSLOUCH analyses of the Ferula data with measurement error included, while the models_no_ME directory contains the analyses of the data without measurement error. In each directory one can find the following files:
- input files: Data_ME.csv (with measurement error) or Data_no_ME.csv (without measurement error) and the tree file in Newick format (Ferula_fruits_tree.txt); the trait names in the data files are abbreviated as follows: ration_canals – the proportion of oil ducts covering the space between the median and lateral ribs, mean_gr_peri – periderm thickness, wings_area – wing area, thick_wings – wing thickness and seed_mass – seed mass;
- the results for the 8 analysed models (see Fig. 2 in the main text), each in a separate directory named model1, model2 and so on;
- each model directory comprises the following files: two R scripts (for analyses with a diagonal and with an upper triangular matrix Σyy; each model was run 1000 times), two csv files containing information such as the repetition number (i), the seed for the preliminary analyses generating the starting point (seed_start_point), the seed for the main analyses (seed), and the AIC, AICc, SIC, BIC, R2 and loglik for each model run (these csv files are sorted according to AICc values), two directories containing the results of the 1000 analyses, and two files extracted from these directories showing the parameter estimates for the best models (with UpperTri and Diagonal matrix Σyy).
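As an illustration only (not taken from the repository's scripts), the ungulate files from 4.1 and 4.2 might be loaded and passed to mvSLOUCH roughly as sketched below. The assumption that species names sit in the first column of Data.csv is mine, and a regime-free BrownianMotionModel() fit is shown purely to illustrate the package interface; the manuscript's regime-based OU fits live in OUB.R/OUF.R/OUG.R (via mvSLOUCH's ouchModel()/mvslouchModel()) and may be set up differently.

library(ape)
library(mvSLOUCH)

dat  <- read.csv("Data.csv", row.names = 1)   # assumes species names are in the first column
tree <- read.tree("Tree.tre")

dat <- dat[tree$tip.label, ]                  # align phenotypic rows with the tree's tips

traits  <- as.matrix(dat[, c("MZW", "HM3", "WM3")])  # continuous traits (cm)
regimes <- as.character(dat$FS)               # feeding styles (B/G/M), used in the OU regime setups

mvtree <- phyltree_paths(tree)                # precompute the path information mvSLOUCH uses
bm     <- BrownianMotionModel(mvtree, traits) # regime-free reference fit, to show the interface
str(bm, max.level = 1)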
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
This is the HadISDH.blend 1.3.0.2021f version of the Met Office Hadley Centre Integrated Surface Dataset of Humidity (HadISDH). HadISDH.blend is a near-global gridded monthly mean surface humidity climate monitoring product. It is created from in situ observations of air temperature and dew point temperature from ships and weather stations. The observations have been quality controlled and homogenised / bias adjusted. Uncertainty estimates for observation issues and gridbox sampling are provided (see data quality statement section below). These data are provided by the Met Office Hadley Centre. This version spans 1/1/1973 to 31/12/2021.
The data are monthly gridded (5 degree by 5 degree) fields. Products are available for temperature and six humidity variables: specific humidity (q), relative humidity (RH), dew point temperature (Td), wet bulb temperature (Tw), vapour pressure (e), dew point depression (DPD).
This version extends the previous version to the end of 2021. It combines the latest versions of HadISDH.land and HadISDH.marine, and therefore their respective updates also apply here. Users are advised to read the update documents in the Docs section for full details.
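As a usage illustration, one of the gridded monthly fields described above could be read in R along the following lines. This is a minimal sketch: the filename and the variable names ("q_anoms", "latitude") are assumptions, so inspect the NetCDF headers for the real ones.

library(ncdf4)

nc <- nc_open("HadISDH.blend.q.1.3.0.2021f.nc")   # hypothetical filename
print(names(nc$var))                              # check the actual variable names
q   <- ncvar_get(nc, "q_anoms")                   # assumed dimensions: lon x lat x month
lat <- ncvar_get(nc, "latitude")
nc_close(nc)

# area-weighted (cos latitude) global mean specific humidity anomaly per month
w    <- cos(lat * pi / 180)
glob <- apply(q, 3, function(f) {
  W <- matrix(w, nrow = nrow(f), ncol = ncol(f), byrow = TRUE)  # weight each 5x5 box by latitude
  sum(f * W, na.rm = TRUE) / sum(W[!is.na(f)])
})
plot(glob, type = "l", xlab = "months since January 1973", ylab = "q anomaly")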
To keep informed about updates, news and announcements follow the HadOBS team on twitter @metofficeHadOBS.
For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISDH blog: http://hadisdh.blogspot.co.uk/
References:
When using the dataset in a paper please cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference):
Willett, K. M., Dunn, R. J. H., Kennedy, J. J. and Berry, D. I., 2020: Development of the HadISDH marine humidity climate monitoring dataset. Earth System Science Data, 12, 2853-2880, https://doi.org/10.5194/essd-12-2853-2020
Freeman, E., Woodruff, S. D., Worley, S. J., Lubker, S. J., Kent, E. C., Angel, W. E., Berry, D. I., Brohan, P., Eastman, R., Gates, L., Gloeden, W., Ji, Z., Lawrimore, J., Rayner, N. A., Rosenhagen, G. and Smith, S. R., 2017: ICOADS Release 3.0: A major update to the historical marine climate record. International Journal of Climatology, doi:10.1002/joc.4775.
Willett, K. M., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Parker, D. E., Jones, P. D., and Williams Jr., C. N.: HadISDH land surface multi-variable humidity and temperature record for climate monitoring, Clim. Past, 10, 1983-2006, doi:10.5194/cp-10-1983-2014, 2014.
Dunn, R. J. H., et al. 2016: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geoscientific Instrumentation, Methods and Data Systems, 5, 473-491.
Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704-708, doi:10.1175/2011BAMS3015.1
We strongly recommend that you read these papers before making use of the data; more detail on the dataset can be found in an earlier publication:
Willett, K. M., Williams Jr., C. N., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Jones, P. D., and Parker D. E., 2013: HadISDH: An updated land surface specific humidity product for climate monitoring. Climate of the Past, 9, 657-677, doi:10.5194/cp-9-657-2013.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
This dataset supplements the publication:
Schütz, I. & Einhäuser, W. (2018). Visual awareness in binocular rivalry modulates induced pupil fluctuations.
Raw data are available in two formats: the actual raw EDF data files as returned by the eyetracking device (.edf) for reference, and as MATLAB files into which all relevant information has been extracted (.mat) for analysis.
Variables in the pft_*.mat files include:
- raw eye position in the variable scan (t,x,y,p), where t is the timestamp of the eyetracker, x/y the position on the screen and p the pupil diameter in arbitrary units
- fixation, saccade and blink events in the variables fix, sac and blink
- raw button presses in the cell arrays buttonDn and buttonUp (6 - left button, 7 - right button), including timestamps
- stimulus presentation cycle timestamps in the variable stimperiod
- timestamps for the auditory attentional instruction in the variable attends
For analysis, data from all participants and sessions is imported into pftdata.mat, where it is stored in cell arrays of the format "samples{subject_no, condition}".
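Although the extracted files are MATLAB .mat files, they can also be inspected from R using the R.matlab package, as in the minimal sketch below. The file name "pft_01_1.mat" is a placeholder for whatever naming scheme the dataset actually uses, and the column layout of scan follows the description above.

library(R.matlab)

d <- readMat("pft_01_1.mat")    # placeholder file name; see rawdata.zip for the real files
names(d)                        # expect scan, fix, sac, blink, buttonDn, buttonUp, ...

# scan has one row per sample: t (eyetracker timestamp), x, y, p (pupil diameter, a.u.)
scan <- d$scan
plot(scan[, 1], scan[, 4], type = "l",
     xlab = "eyetracker timestamp", ylab = "pupil diameter (arbitrary units)")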
rawdata.zip
analysis.zip
Run the following functions in the listed order to reproduce figures and data in results/.
preprocessing.m:
figure1_methods.m:
figure2_example_plot.m:
figure3_averaged_response.m:
figure4_complex_plane.m:
stats_auc_decoding.m:
pft_statistics.R:
results/
- Paper figures (not yet laid out): figure1.tif, figure2.png, figure3.png, figure4.png
- stats_responses.txt: behavioral analyses results
- stats_Zvalues.txt: complex plane analysis results
- stats_AUC.txt: moment-by-moment AUC decoding results