Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data presented here were used to produce the following paper:
Archibald, Twine, Mthabini, Stevens (2021) Browsing is a strong filter for savanna tree seedlings in their first growing season. J. Ecology.
The project under which these data were collected is: Mechanisms Controlling Species Limits in a Changing World. NRF/SASSCAL Grant number 118588
For information on the data or analysis please contact Sally Archibald: sally.archibald@wits.ac.za
Description of file(s):
File 1: cleanedData_forAnalysis.csv (required to run the R code: "finalAnalysis_PostClipResponses_Feb2021_requires_cleanData_forAnalysis_.R")
The data represent monthly survival and growth data for ~740 seedlings from 10 species under various levels of clipping.
The data consist of one .csv file with the following column names:
treatment Clipping treatment (1 - 5 months clip plus control unclipped)
plot_rep One of three randomised plots per treatment
matrix_no Where in the plot the individual was placed
species_code First three letters of the genus name and first three letters of the species name; uniquely identifies the species
species Full species name
sample_period Classification of sampling period into time since clip
status Alive or Dead
standing.height Vertical height above ground (in mm)
height.mm Length of the longest branch (in mm)
total.branch.length Total length of all the branches (in mm)
stemdiam.mm Basal stem diameter (in mm)
maxSpineLength.mm Length of the longest spine
postclipStemNo Number of resprouting stems (only recorded AFTER clipping)
date.clipped Date clipped
date.measured Date measured
date.germinated Date germinated
Age.of.plant Date measured - Date germinated
newtreat Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
File 2: Herbivory_SurvivalEndofSeason_march2017.csv (required to run the R code: "FinalAnalysisResultsSurvival_requires_Herbivory_SurvivalEndofSeason_march2017.R")
The data consist of one .csv file with the following column names:
treatment Clipping treatment (1 - 5 months clip plus control unclipped)
plot_rep One of three randomised plots per treatment
matrix_no Where in the plot the individual was placed
species_code First three letters of the genus name and first three letters of the species name; uniquely identifies the species
species Full species name
sample_period Classification of sampling period into time since clip
status Alive or Dead
standing.height Vertical height above ground (in mm)
height.mm Length of the longest branch (in mm)
total.branch.length Total length of all the branches (in mm)
stemdiam.mm Basal stem diameter (in mm)
maxSpineLength.mm Length of the longest spine
postclipStemNo Number of resprouting stems (only recorded AFTER clipping)
date.clipped Date clipped
date.measured Date measured
date.germinated Date germinated
Age.of.plant Date measured - Date germinated
newtreat Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
genus Genus
MAR Mean Annual Rainfall for that species' distribution (mm)
rainclass High/medium/low
File 3: allModelParameters_byAge.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")
Consists of a .csv file with the following column headings
Age.of.plant Age in days
species_code Species
pred_SD_mm Predicted stem diameter in mm
pred_SD_up Top 75th quantile of stem diameter in mm
pred_SD_low Bottom 25th quantile of stem diameter in mm
treatdate Date when clipped
pred_surv Predicted survival probability
pred_surv_low Predicted 25th quantile survival probability
pred_surv_high Predicted 75th quantile survival probability
species_code Species code
Bite.probability Daily probability of being eaten
max_bite_diam_duiker_mm Maximum bite diameter of a duiker for this species
duiker_sd Standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm Maximum bite diameter of a kudu for this species
kudu_sd Standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm Mean bite diameter of a duiker for this species
duiker_mean_sd Standard deviation of the mean bite diameter for a duiker for this species
mean_bite_diameter_kudu_mm Mean bite diameter of a kudu for this species
kudu_mean_sd Standard deviation of the mean bite diameter for a kudu for this species
genus Genus
rainclass Low/med/high
File 4: EatProbParameters_June2020.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")
Consists of a .csv file with the following column headings
shtspec species name
species_code species code
genus genus
rainclass low/medium/high
seed mass mass of seed (g per 1000 seeds)
Surv_intercept coefficient of the model predicting survival from age of clip for this species
Surv_slope coefficient of the model predicting survival from age of clip for this species
GR_intercept coefficient of the model predicting stem diameter from seedling age for this species
GR_slope coefficient of the model predicting stem diameter from seedling age for this species
species_code species code
max_bite_diam_duiker_mm Maximum bite diameter of a duiker for this species
duiker_sd standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm Maximum bite diameter of a kudu for this species
kudu_sd standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm mean bite diameter of a duiker for this species
duiker_mean_sd standard deviation of the mean bite diameter for a duiker for this species
mean_bite_diameter_kudu_mm mean bite diameter of a kudu for this species
kudu_mean_sd standard deviation of the mean bite diameter for a kudu for this species
AgeAtEscape_duiker[t] age of plant when its stem diameter is larger than a mean duiker bite
AgeAtEscape_duiker_min[t] age of plant when its stem diameter is larger than a min duiker bite
AgeAtEscape_duiker_max[t] age of plant when its stem diameter is larger than a max duiker bite
AgeAtEscape_kudu[t] age of plant when its stem diameter is larger than a mean kudu bite
AgeAtEscape_kudu_min[t] age of plant when its stem diameter is larger than a min kudu bite
AgeAtEscape_kudu_max[t] age of plant when its stem diameter is larger than a max kudu bite
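For orientation, here is a minimal R sketch (not one of the published analysis scripts) showing how cleanedData_forAnalysis.csv could be read and summarised; the file path is an example and the survival tally is purely illustrative.

# Read the monthly seedling data and tally the share of live records per treatment.
seedlings <- read.csv("cleanedData_forAnalysis.csv", stringsAsFactors = FALSE)

# status is recorded as "Alive" or "Dead"; newtreat is the numeric treatment
# code, with 8 being the unclipped control.
seedlings$alive <- seedlings$status == "Alive"
survival_by_treatment <- aggregate(alive ~ newtreat, data = seedlings, FUN = mean)
print(survival_by_treatment)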
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in an R or Python environment. Please consult the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl

# Load the whole RFSD via the Hugging Face Datasets library
RFSD = load_dataset('irlspbru/RFSD')

# Or read a single yearly partition directly from the Hub with polars
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl

# Point a pyarrow dataset at the local copy of the RFSD
RFSD = ds.dataset("local/path/to/RFSD")

# Inspect the schema
print(RFSD.schema)

# Load the entire dataset into polars
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load only the 2019 partition
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only the firm identifier (inn) and revenue (line_2110) for 2019
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Apply human-readable column names from the supplied dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing in R requires the arrow package to be installed.
library(arrow)
library(data.table)

# Point an Arrow dataset at the local copy of the RFSD
RFSD <- open_dataset("local/path/to/RFSD")

# Inspect the schema
schema(RFSD)

# Load the entire dataset into a data.table
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load only the 2019 partition
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only the firm identifier (inn) and revenue (line_2110) for 2019
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Apply human-readable column names from the supplied dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022, and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023; Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
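For example, once the data are loaded (see the import examples above), the flagged firms can be set aside in R while keeping the source data intact; this sketch assumes RFSD_full is the data.table created earlier and that the outlier flag is stored as a logical or 0/1 value.

# Exclude manually flagged outlier firms from an analysis copy.
RFSD_clean <- RFSD_full[!(outlier %in% c(TRUE, 1))]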
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups from these data. Gazprom, for instance, had over 800 affiliated entities, and to study this corporate group in its entirety it is not enough to consider the financials of the parent company alone.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. There is therefore a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title  = {{R}ussian {F}inancial {S}tatements {D}atabase},
  author = {Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note   = {arXiv preprint arXiv:2501.05841},
  doi    = {https://doi.org/10.48550/arXiv.2501.05841},
  year   = {2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
EPIC Tropospheric Ozone Data Product

The Earth Polychromatic Imaging Camera (EPIC) on the Deep Space Climate Observatory (DSCOVR) spacecraft provides measurements of Earth-reflected radiances from the entire sunlit portion of the Earth. The measurements from four EPIC UV (ultraviolet) channels reconstruct global distributions of total ozone. The tropospheric ozone columns (TCO) are then derived by subtracting independently measured stratospheric ozone columns from the EPIC total ozone. TCO data product files report gridded synoptic maps of TCO measured over the sunlit portion of the Earth disk on a 1-2 hour basis. Sampling times for these hourly TCO data files are the same as for the EPIC L2 total ozone product. Version 1.0 of the TCO product is based on Version 3 of the EPIC L1 product and the Version 3 Total Ozone Column Product. The stratospheric columns were derived from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) ozone fields (Gelaro et al., 2017).

In contrast to the EPIC total ozone maps, which are reported at a high spatial resolution of 18 × 18 km^2 near the center of the image, the TCO maps are spatially averaged over several EPIC pixels and written on a regular spatial grid (1° latitude × 1° longitude). Kramarova et al. (2021) describe the EPIC TCO product and its evaluation against independent sonde and satellite measurements. Table 1 lists all of the variables included in the TCO product files. Ozone arrays in the product files are integrated vertical columns in Dobson Units (DU; 1 DU = 2.69 × 10^20 molecules m^-2).

Filename Convention
The TCO product files are formatted as HDF5 and represent a Level-4 (L4) product. The filenames have the following naming convention: "DSCOVR_EPIC_L4_TrO3_01_YYYYMMDDHHMMSS_03.h5", where "TrO3" means tropospheric column ozone, "01" means that this is version 01 of this product, "YYYYMMDDHHMMSS" is the UTC measurement time with "YYYY" for year (2015-present), "MM" for month (01-12), "DD" for day of the month (1-31), and "HHMMSS" for hours-minutes-seconds, and "03" signifies that v3 L1b measurements were used to derive the EPIC total ozone and consequently TCO.

Column Weighting Function Adjustment
There are two TCO gridded arrays in each hourly data file for the user to choose from; one is denoted TroposphericColumnOzone, and the other is TroposphericColumnOzoneAdjusted. The latter TCO array includes an adjustment to correct for reduced sensitivity of the EPIC UV measurements in detecting ozone in the low troposphere/boundary layer. The adjustment depended on latitude and season and was derived using simulated tropospheric ozone from the GEOS-Replay model (Strode et al., 2020) constrained by the MERRA-2 meteorology through the replay method. Our analysis (Kramarova et al., 2021) indicated that the adjusted TCO array is more accurate and precise.

Flagging Bad Data
Kramarova et al. (2021) note that the preferred EPIC total ozone measurements for scientific study are those where the L2 "AlgorithmFlag" parameter equals 1, 101, or 111. In this TCO product, we have included only L2 total ozone pixels with these algorithm flag values. The TCO product files provide a gridded version of the AlgorithmFlag parameter as a reference, but the user does not need it for data quality filtering. Another parameter in the EPIC L2 total ozone files for filtering questionable data is the "ErrorFlag". The TCO product files include a gridded version of this ErrorFlag parameter that the user should apply.
Only TCO-gridded pixels with an ErrorFlag value of zero should be used. TCO measurements at high satellite look angles and/or high solar zenith angles should also be filtered out for analysis. The TCO files include gridded versions of the satellite look angle and the solar zenith angle, denoted "SatelliteLookAngle" and "SolarZenithAngle", respectively. For scientific applications, users should filter the TCO array data and use only pixels with SatelliteLookAngle and SolarZenithAngle < 70° to avoid retrieval errors near the edge of the Earth view. In summary, filtering the TCO arrays is optional, but for scientific analysis we recommend applying the following two filters: (1) filter out all gridded pixels where ErrorFlag ≠ 0; (2) filter out all pixels where SatelliteLookAngle or SolarZenithAngle > 70°.

Summary of the Derivation of the Tropospheric Column Ozone Product
We briefly summarize the derivation of EPIC TCO, stratospheric column ozone, and tropopause pressure. An independent measure of the stratospheric column ozone is needed to derive EPIC TCO. We use MERRA-2 ozone fields (Gelaro et al., 2017) to derive stratospheric ozone columns, which are subtracted from EPIC total ozone (TOZ) to obtain TCO. The MERRA-2 data assimilation system ingests Aura OMI (Ozone Monitoring Instrument) v8.5 total ozone and MLS (Microwave Limb Sounder) v4.2 stratospheric ozone profiles to produce global synoptic maps of profile ozone from the surface to the top of the atmosphere; for our analyses, we use MERRA-2 ozone profiles reported every three hours (0, 3, 6, ..., 21 UTC) at a resolution of 0.625° longitude × 0.5° latitude. MERRA-2 ozone profiles were integrated vertically from the top of the atmosphere down to the tropopause pressure to derive maps of stratospheric column ozone. Tropopause pressure was determined from MERRA-2 re-analyses using the standard PV-θ definition (2.5 PVU and 380 K). The resulting maps of stratospheric column ozone at 3-hour intervals from MERRA-2 were then space-time collocated with EPIC footprints and subtracted from the EPIC total ozone, thus producing daily global maps of residual TCO sampled at the precise EPIC pixel times. These tropospheric ozone measurements were further binned to 1° latitude × 1° longitude resolution.

References
Gelaro, R., W. McCarty, M. J. Suárez, R. Todling, A. Molod, L. Takacs, C. A. Randles, A. Darmenov, M. G. Bosilovich, R. Reichle, K. Wargan, L. Coy, R. Cullather, C. Draper, S. Akella, V. Buchard, A. Conaty, A. M. da Silva, W. Gu, G. Kim, R. Koster, R. Lucchesi, D. Merkova, J. E. Nielsen, G. Partyka, S. Pawson, W. Putman, M. Rienecker, S. D. Schubert, M. Sienkiewicz, and B. Zhao, The Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2), J. Climate, 30, 5419-5454, https://doi.org/10.1175/JCLI-D-16-0758.1, 2017.
Kramarova, N. A., J. R. Ziemke, L.-K. Huang, J. R. Herman, K. Wargan, C. J. Seftor, G. J. Labow, and L. D. Oman, Evaluation of Version 3 total and tropospheric ozone columns from EPIC on DSCOVR for studying regional-scale ozone variations, Front. Rem. Sens., in review, 2021.

Table 1. List of parameters and data arrays in the EPIC tropospheric ozone hourly product files.
The left column lists the variable name, the second column lists the variable description and units, and the third column lists the variable data type and dimensions.

NadirLatitude | Nadir latitude in degrees | Real4 number
NadirLongitude | Nadir longitude in degrees | Real4 number
Latitude | Center latitude of grid-point in degrees | Real4 array with 180 elements
Longitude | Center longitude of grid-point in degrees | Real4 array with 360 elements
TroposphericColumnOzone | Tropospheric column ozone in Dobson Units | Real4 array with dimensions 360 × 180
TroposphericColumnOzoneAdjusted | Tropospheric column ozone with BL adjustment in Dobson Units | Real4 array with dimensions 360 × 180
StratosphericColumnOzone | Stratospheric column ozone in Dobson Units | Real4 array with dimensions 360 × 180
TotalColumnOzone | Total column ozone in Dobson Units | Real4 array with dimensions 360 × 180
Reflectivity | Reflectivity (no units) | Real4 array with dimensions 360 × 180
RadiativeCloudFraction | Radiative cloud fraction (no units) | Real4 array with dimensions 360 × 180
TropopausePressure | Tropopause pressure in units hPa | Real4 array with dimensions 360 × 180
CWF1 | Column weighting function for layer 1 (506.6-1013.3 hPa) | Real4 array with dimensions 360 × 180
ErrorFlag | Error flag for TCO data | Real4 array with dimensions 360 × 180
AlgorithmFlag | Algorithm flag for TCO data | Real4 array with dimensions 360 × 180
SatelliteLookAngle | Satellite Look Angle in degrees | Real4 array with dimensions 360 × 180
SolarZenithAngle | Solar Zenith Angle in degrees | Real4 array with dimensions 360 × 180
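To make the recommended filtering concrete, here is a minimal R sketch using the Bioconductor rhdf5 package; the file name is an example, and it assumes the Table 1 arrays sit at the root of the HDF5 file (check an actual product file for the exact paths).

# Read one hourly TCO file and apply the two recommended filters.
library(rhdf5)

f <- "DSCOVR_EPIC_L4_TrO3_01_20210601120000_03.h5"  # example file name

tco <- h5read(f, "TroposphericColumnOzoneAdjusted")  # 360 x 180 array, Dobson Units
err <- h5read(f, "ErrorFlag")
sla <- h5read(f, "SatelliteLookAngle")
sza <- h5read(f, "SolarZenithAngle")

# (1) keep only pixels with ErrorFlag == 0; (2) drop high viewing/solar angles.
bad <- (err != 0) | (sla > 70) | (sza > 70)
tco[bad] <- NA

mean(tco, na.rm = TRUE)  # crude average TCO (DU) over the retained sunlit pixels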
W0005 R/V Wecoma 29 May - 17 June 2000
SeaSoar data from the U.S. GLOBEC Northeast Pacific Program
are available from the SeaSoar Web Site
at Oregon State University
Contact Jack Barth at OSU (Phone: 541-737-1607; email barth@oce.orst.edu)
SeaSoar data are available in two formats:
\"1Hz data\" or \"gridded\".
Each of these is described below.
1Hz Data
--------
The *.dat2c files give final 1Hz SeaSoar CTD data.
Here is the first line of inshore.line1.dat2c:
44.64954 -125.25666 108.7 8.6551 33.5239 26.0164 8.6439 26.0181 155.64019
000603152152 0001 0.069 0.288 0.476 0.23
The format of the *.dat2c files is given by:
col 1: latitude (decimal degrees)
col 2: longitude (decimal degrees)
col 3: pressure (dbars)
col 4: temperature (C)
col 5: salinity (psu)
col 6: Sigma-t (kg/cubic meter)
col 7: potential temperature (C)
col 8: sigma-theta (kg/cubic meter)
col 9: time (decimal year-day of 2000)
col 10: date and time (integer year, month, day, hour, minute, second)
col 11: flag
col 12: PAR (volts)
col 13: FPK010 FL (violet filter) (volts)
col 14: FPK016 FL (green filter) (volts)
col 15: chlorophyll-a (micro g/liter)
The ones place of the flags variable indicates which of the
two sensor pairs was selected as the preferred sensor, giving
the values for T, S, and sigma-t:
0 indicates use of sensor pair 1 (T1, C1)
1 indicates use of sensor pair 2 (T2, C2)
Voltage values (columns 12 - 14) are in the range of 0-5 volts.
A value of 9.999 indicates "no value" for those columns.
Chlorophyll was calculated based on the voltage values of
the green filtered FPK016; if that FPAK was 9-filled, then the
chlorophyll value was set at 999.99; if the calibrated value
was negative (due to noise in the calibration) the chlorophyll
value was set at 0.00; otherwise the calibration equation
used was:
chl_a = 7.6727(volts) - 3.4208
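A minimal R sketch of reading one of these 1Hz files under the 15-column layout described above; the short column names are invented for the sketch, the path is an example, and it assumes each record occupies one line in the file.

# Read a 1Hz *.dat2c file; column order follows the format description above.
cols <- c("lat", "lon", "pressure", "temp", "sal", "sigma_t", "theta",
          "sigma_theta", "yearday", "datetime", "flag", "par_v",
          "fl_violet_v", "fl_green_v", "chl_a")
d <- read.table("inshore.line1.dat2c", col.names = cols,
                colClasses = c(rep("numeric", 9), "character", rep("numeric", 5)))

# 9.999 marks "no value" in the voltage columns; 999.99 marks missing chlorophyll.
for (v in c("par_v", "fl_violet_v", "fl_green_v")) d[[v]][d[[v]] == 9.999] <- NA
d$chl_a[d$chl_a == 999.99] <- NA

# Ones place of the flag: 0 = sensor pair 1 (T1, C1), 1 = sensor pair 2 (T2, C2).
d$sensor_pair <- ifelse(d$flag %% 10 == 0, 1L, 2L)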
Gridded Data
------------
The *1.25km files give the final SeaSoar CTD data gridded
at a spacing of 1.25 km in the horizontal, and 2 db in the
vertical. In general this was used for the mapping surveys
that were on the continental shelf.
The *2.5km files give the final SeaSoar CTD data gridded at
a spacing of 2.5 km in the horizontal (and 2 db in the
vertical). These were used for the deeper, offshore survey.
Here is the first line of inshore.line1.1.25km:
6.25 155.92008 44.651726 -124.13853 1.0 9 9.5228777
33.127800 25.569221 240.63866 9.5227690
0.24063867E-01 3.7872221 1.1320001 0.78988892
The format of the *km files is given by:
col 1 = distance (km)
col 2 = julian day + fractional day (noon, Jan 1 = 1.5)
col 3 = latitude (decimal degrees)
col 4 = longitude (decimal degrees)
col 5 = pressure (dbar)
col 6 = count
col 7 = temperature (degrees C)
col 8 = salinity (psu)
col 9 = density (sigma-t) (kg/cubic meter)
col 10 = specific vol anomaly (1.0E-8 cubic meter/kg)
col 11 = potential temperature (degrees C)
col 12 = dynamic height (dynamic meters)
col 13 = PAR (volts)
col 14 = FPK010 (volts) (violet filter)
col 15 = FPK016 (volts) (green filter)
\"missing data\" was set at 1.0e35
Columns 1 - 4 give the average location and time of the
values contained in the column at that location. The
column gives values for every two dbars of depth, starting at
1 db and extending down to a value at 121 db. The column
then shifts to the next location, 1.25 km further along the
line. If we are working with the 2.5 km sections, then the column
goes down to a value of 329 db, and the next column then shifts
2.5 km further along the line.
For the E-W lines, column 1 gives the distance from the coastline;
for the N-S lines, column 1 gives the distance from the southernmost point.
Column 6 (count) gives the number of samples in that 2 db bin.
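Similarly, a hedged R sketch of reading one of the gridded files; the short column names are invented for the sketch, and it assumes each record occupies one line in the file.

# Read a gridded *.1.25km file; column order follows the format description above.
gcols <- c("dist_km", "yearday", "lat", "lon", "pressure", "count", "temp",
           "sal", "sigma_t", "svan", "theta", "dyn_height", "par_v",
           "fl_violet_v", "fl_green_v")
g <- read.table("inshore.line1.1.25km", col.names = gcols)

g[g == 1.0e35] <- NA  # flagged "missing data"

# Optional: cast temperature into a pressure-by-distance section for plotting.
temp_section <- tapply(g$temp, list(g$pressure, g$dist_km), mean)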
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and results from the Imageomics Workflow. These include data files from the Fish-AIR repository (https://fishair.org/) for purposes of reproducibility and outputs from the application-specific imageomics workflow contained in the Minnow_Segmented_Traits repository (https://github.com/hdr-bgnn/Minnow_Segmented_Traits).
Fish-AIR: This is the dataset downloaded from Fish-AIR, filtering for Cyprinidae and the Great Lakes Invasive Network (GLIN) from the Illinois Natural History Survey (INHS) dataset. These files contain information about fish images, fish image quality, and path for downloading the images. The data download ARK ID is dtspz368c00q. (2023-04-05). The following files are unaltered from the Fish-AIR download. We use the following files:
extendedImageMetadata.csv: A CSV file containing information about each image file. It has the following columns: ARKID, fileNameAsDelivered, format, createDate, metadataDate, size, width, height, license, publisher, ownerInstitutionCode. Column definitions are given at https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.
imageQualityMetadata.csv: A CSV file containing information about the quality of each image. It has the following columns: ARKID, license, publisher, ownerInstitutionCode, createDate, metadataDate, specimenQuantity, containsScaleBar, containsLabel, accessionNumberValidity, containsBarcode, containsColorBar, nonSpecimenObjects, partsOverlapping, specimenAngle, specimenView, specimenCurved, partsMissing, allPartsVisible, partsFolded, brightness, uniformBackground, onFocus, colorIssue, quality, resourceCreationTechnique. Column definitions are given at https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.
multimedia.csv: A CSV file containing information about image downloads. It has the following columns: ARKID, parentARKID, accessURI, createDate, modifyDate, fileNameAsDelivered, format, scientificName, genus, family, batchARKID, batchName, license, source, ownerInstitutionCode. Column definitions are given at https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.
meta.xml: An XML file with metadata about the column indices and URIs for each file contained in the original downloaded zip file. This file is used in the fish-air.R script to extract the indices for column headers.
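A minimal R sketch of combining these Fish-AIR tables on their shared ARKID column (this is not the project's fish-air.R script; file paths are relative to the unzipped download).

# Join download info, image metadata, and image quality metadata by ARKID.
multimedia <- read.csv("multimedia.csv", stringsAsFactors = FALSE)
img_meta   <- read.csv("extendedImageMetadata.csv", stringsAsFactors = FALSE)
img_qual   <- read.csv("imageQualityMetadata.csv", stringsAsFactors = FALSE)

fish <- merge(multimedia, img_meta, by = "ARKID", suffixes = c("", ".img"))
fish <- merge(fish, img_qual, by = "ARKID", suffixes = c("", ".qual"))

# Example filter: keep only Cyprinidae records (family comes from multimedia.csv).
cyprinids <- subset(fish, family == "Cyprinidae")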
The outputs from the Minnow_Segmented_Traits workflow are:
sampling.df.seg.csv: Table with tallies of the sampling of image data per species during the data cleaning and data analysis. This is used in Table S1 in Balk et al.
presence.absence.matrix.csv: The Presence-Absence matrix from segmentation, not cleaned. This is the result of the combined outputs from the presence.json files created by the rule “create_morphological_analysis”. The cleaned version of this matrix is shown as Table S3 in Balk et al.
heatmap.avg.blob.png and heatmap.sd.blob.png: Heatmaps of average area of biggest blob per trait (heatmap.avg.blob.png) and standard deviation of area of biggest blob per trait (heatmap.sd.blob.png). These images are also in Figure S3 of Balk et al.
minnow.filtered.from.iqm.csv: Fish image data set after filtering (see methods in Balk et al. for filter categories).
burress.minnow.sp.filtered.from.iqm.csv: Fish image data set after filtering and selecting species from Burress et al. 2017.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains water column process rate measurements from water samples collected aboard the R/V Endeavor cruise EN586 in the northern Gulf of Mexico from 2016-07-25 to 2016-07-27. Samples were collected during R/V Endeavor cruise EN586 using a CTD-rosette. The objective was to determine water column process rates at ECOGIG seep and other study sites. Water samples collected with the CTD-rosette were incubated under simulated in-situ conditions after addition of 15N2 and either 13C-bicarbonate or 13C-methane tracers for 24 hours (DIC label) or 48 hours (CH4 label). Experiments were terminated by gentle pressure filtration onto a 10 µm sieve and pre-combusted GF/F filters to collect the small and large size fractions of particles. Filters were dried, then pelletized in Sn (tin) capsules for isotopic analysis. N and C isotopic abundances were measured by continuous-flow isotope ratio mass spectrometry using a Micromass Optima IRMS interfaced to a CE NA2500 elemental analyzer. Rates were calculated using a mass balance approach. The dataset also includes the date, depth, and locations (latitudes and longitudes) of the sample collection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive contains data in CSV format from Tables 2 to 6 of "An accurate new method of calculating absolute magnitudes and K-corrections applied to the Sloan filter set", which has been submitted to the "Astrophysical Journal". The 10 tables list second-order polynomial coefficients for use in determining absolute magnitudes from observed colors, two alternative colors being given for each of the Sloan u, g, r, i, z-bands, as described in the paper.
We describe an accurate new method for determining absolute magnitudes, and hence also K-corrections, which is simpler than most previous methods, being based on a quadratic function of just one suitably chosen observed color. The method relies on the extensive and accurate new set of 129 empirical galaxy template SEDs from Brown et al. (2014). A key advantage of our method is that we can reliably estimate random errors in computed absolute magnitudes due to galaxy diversity, photometric error and redshift error. We derive K-corrections for the five Sloan Digital Sky Survey filters and provide parameter tables for use by the astronomical community. Using the New York Value-Added Galaxy Catalog we compare our K-corrections with those from kcorrect. Our K-corrections produce absolute magnitudes that are generally in good agreement with kcorrect. Absolute g, r, i, z-band magnitudes differ by less than 0.02 mag, and those in the u-band by ~0.04 mag. The evolution of rest-frame colors as a function of redshift is better behaved using our method, with relatively few galaxies being assigned anomalously red colors and a tight red sequence being observed across the whole 0.0 < z < 0.5 redshift range.
CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods
eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
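To illustrate the shape of these inputs, here is a hedged R sketch that builds a mock "untidy" lab pull in the four-column layout expected by ehr_format() and a toy key-value lookup for collapsing potassium subtypes; the column names, lab strings, and DD code are invented for illustration and are not the registry's actual lookup table.

# Mock untidy input: one row per draw, several panels in a single results cell.
dt <- data.frame(
  patient_name_mrn = "DOE,JANE (0001234)",
  collection_date  = "2021-03-01",
  collection_time  = "08:15",
  lab_results      = "Potassium(POC) 4.1 mmol/L; Creatinine 0.9 mg/dL",
  stringsAsFactors = FALSE
)
# In the real pipeline this object would be passed to eLAB's ehr_format(dt).

# Toy key-value lookup collapsing lab subtypes to a single DD code.
lab_lookup <- c(
  "Potassium"           = "potassium",
  "Potassium(POC)"      = "potassium",
  "Potassium-External"  = "potassium",
  "Potassium,whole-bld" = "potassium"
)
lab_lookup[["Potassium(POC)"]]  # -> "potassium"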
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
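As a schematic of the survival analysis described above (placeholder data and column names, not the registry code), a univariable Cox model for one lab predictor in R might look like this:

# Univariable Cox proportional hazards model for a single baseline lab value.
library(survival)

set.seed(1)
os_df <- data.frame(
  os_months   = rexp(30, rate = 0.05),        # mock follow-up times
  death_event = rbinom(30, 1, 0.6),           # 1 = death, 0 = censored at last visit
  lab_value   = rnorm(30, mean = 140, sd = 5) # mock baseline lab result
)

fit <- coxph(Surv(os_months, death_event) ~ lab_value, data = os_df)
summary(fit)  # hazard ratio and exploratory (uncorrected) p-value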
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Carbon monoxide (CO) is an important atmospheric constituent affecting air quality, and methane (CH4) is the second most important greenhouse gas contributing to human-induced climate change. Detailed and continuous observations of these gases are necessary to better assess their impact on climate and atmospheric pollution. While surface and airborne measurements are able to accurately determine atmospheric abundances on local scales, global coverage can only be achieved using satellite instruments. The TROPOspheric Monitoring Instrument (TROPOMI) onboard the Sentinel-5 Precursor satellite, which was successfully launched in October 2017, is a spaceborne nadir-viewing imaging spectrometer measuring solar radiation reflected by the Earth in a push-broom configuration. It has a wide swath on the terrestrial surface and covers wavelength bands between the ultraviolet (UV) and the shortwave infrared (SWIR), combining a high spatial resolution with daily global coverage. These characteristics enable the determination of both gases with an unprecedented level of detail on a global scale, introducing new areas of application. Abundances of the atmospheric column-averaged dry air mole fractions XCO and XCH4 are simultaneously retrieved from TROPOMI's radiance measurements in the 2.3 µm spectral range of the SWIR part of the solar spectrum using the scientific retrieval algorithm Weighting Function Modified Differential Optical Absorption Spectroscopy (WFM-DOAS). This algorithm is intended to be used with the operational algorithms for mutual verification and to provide new geophysical insights. We introduce the algorithm in detail, including expected error characteristics based on synthetic data, a machine-learning-based quality filter, and a shallow learning calibration procedure applied in the post-processing of the XCH4 data. The quality of the results based on real TROPOMI data is assessed by validation with ground-based Fourier transform spectrometer (FTS) measurements providing realistic error estimates of the satellite data: the XCO data set is characterised by a random error of 5.1 ppb (5.8%) and a systematic error of 1.9 ppb (2.1%); the XCH4 data set exhibits a random error of 14.0 ppb (0.8%) and a systematic error of 4.3 ppb (0.2%). The natural XCO and XCH4 variations are well-captured by the satellite retrievals, which is demonstrated by a high correlation with the validation data (R=0.97 for XCO and R=0.91 for XCH4 based on daily averages).
Schneising, O., Buchwitz, M., Reuter, M., Bovensmann, H., Burrows, J. P., Borsdorff, T., Deutscher, N. M., Feist, D. G., Griffith, D. W. T., Hase, F., Hermans, C., Iraci, L. T., Kivi, R., Landgraf, J., Morino, I., Notholt, J., Petri, C., Pollard, D. F., Roche, S., Shiomi, K., Strong, K., Sussmann, R., Velazco, V. A., Warneke, T., and Wunch, D.: A scientific algorithm to simultaneously retrieve carbon monoxide and methane from TROPOMI onboard Sentinel-5 Precursor, Atmos. Meas. Tech., 12, 6771–6802, https://doi.org/10.5194/amt-12-6771-2019, 2019.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reference tables for TfNSW data. A complete list of agencies and how they are defined in each GTFS feed.

Using Complete GTFS + Real-Time GTFS feeds
The reference tables illustrate agencies that appear in multiple GTFS feeds. If you are using the Complete GTFS bundle in conjunction with the real-time GTFS feeds for each mode, you will need to filter out the Complete GTFS agencies and use the corresponding real-time agencies, as sketched below.

Turn Up and Go/Frequency Services
High Frequency services run to an operational timetable as per the relevant GTFS bundles; however, they will adjust to headway in response to operational requirements throughout the day. Therefore we recommend ignoring the delay information passed on in GTFS-R feeds for these routes, and only showing real-time arrival/departure times to customers.

Train Run Numbers
The list of trips in the reference tables represents services that appear in both the Sydney Trains real-time feed and the NSW Trains rural and regional real-time feed. If you are using both feeds, TfNSW recommends filtering out these services from the Sydney Trains feed and preferentially using the NSW Trains feed.
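As a rough sketch of the agency filtering described above: agency.txt and its agency_id field are standard GTFS, but the reference-table file name and its column name below are hypothetical placeholders for however the TfNSW tables are laid out.

# Drop Complete GTFS agencies that should be replaced by real-time feed agencies.
ref    <- read.csv("reference_agencies.csv", stringsAsFactors = FALSE)   # hypothetical file
agency <- read.csv("complete_gtfs/agency.txt", stringsAsFactors = FALSE)

agency_filtered <- subset(agency, !(agency_id %in% ref$complete_gtfs_agency_id))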
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version release associated with GitHub repository: https://github.com/leahcrowe-otago/FBD_measurements/tree/main
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
filter.py selects the best hit for each marker protein in each genome based on the highest scoring alignment region. SubsampleAlignmentRadomly.py randomly deletes 50% of alignment columns. phylobayes_convergence_statistics.r calculates Rhat and ESS values for molecular clock outputs.