Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General information:
The script runs with R (Version 3.1.1; 2014-07-10) and packages plyr (Version 1.8.1), XLConnect (Version 0.2-9), utilsMPIO (Version 0.0.25), sp (Version 1.0-15), rgdal (Version 0.8-16), tools (Version 3.1.1) and lattice (Version 0.20-29). Questions can be directed to: Martin Bulla (bulla.mar@gmail.com).
Data collection and how the individual variables were derived are described in:
Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): p. 20131016.
Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015.
Data are available as an Rdata file. Missing values are NA. For better readability, the subsections of the script can be collapsed.

Description of the method:
1 - Data are visualized in an interactive actogram with time of day on the x-axis and one panel for each day of data.
2 - A red rectangle indicates the active field; clicking with the mouse on the depicted light signal within that field generates a data point that is automatically (via a custom-made function) saved in the csv file. For this data extraction I recommend always clicking on the bottom line of the red rectangle, as data are always available there due to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. The data are captured only if a greenish vertical bar appears and a new line of data appears in the R console.
3 - To extract incubation bouts, the first click in the new plot has to be the start of incubation, the next click depicts the end of incubation, and a click on the same spot starts the incubation of the other sex. If the end and start of incubation are at different times, the data will still be extracted, but the sex, logger and bird_ID will be wrong; these need to be changed manually in the csv file. Similarly, the first bout for a given plot will always be assigned to the male (if no data are present in the csv file) or based on previous data. Hence, whenever data from a new plot are extracted, it is worth checking at the first mouse click whether the sex, logger and bird_ID information is correct and, if not, adjusting it manually.
4 - Once all information from one day (panel) is extracted, right-click on the plot and choose "stop". This activates the following day (panel) for extraction.
5 - To end the extraction before going through all the rectangles, press "escape".

Annotations of the data files from turnstone_2009_Barrow_nest-t401_transmitter.RData:

dfr - contains raw data on signal strength from radio tags attached to the rumps of the female and male, and information about when the birds were captured and the incubation stage of the nest:
1. who: identifies whether the recording refers to female, male, capture or start of hatching
2. datetime_: date and time of each recording
3. logger: unique identity of the radio tag
4. signal_: signal strength of the radio tag
5. sex: sex of the bird (f = female, m = male)
6. nest: unique identity of the nest
7. day: datetime_ variable truncated to year-month-day format
8. time: time of day in hours
9. datetime_utc: date and time of each recording, but in UTC time
10. cols: colors assigned to "who"

m - contains metadata for a given nest:
1. sp: identifies species (RUTU = Ruddy turnstone)
2. nest: unique identity of the nest
3. year_: year of observation
4. IDfemale: unique identity of the female
5. IDmale: unique identity of the male
6. lat: latitude coordinate of the nest
7. lon: longitude coordinate of the nest
8. hatch_start: date and time when the hatching of the eggs started
9. scinam: scientific name of the species
10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)
11. logger: type of device used to record incubation (IT - radio tag)
12. sampling: mean incubation sampling interval in seconds

s - contains metadata for the incubating parents:
1. year_: year of capture
2. species: identifies species (RUTU = Ruddy turnstone)
3. author: identifies the author who measured the bird
4. nest: unique identity of the nest
5. caught_date_time: date and time when the bird was captured
6. recapture: was the bird captured before? (0 - no, 1 - yes)
7. sex: sex of the bird (f = female, m = male)
8. bird_ID: unique identity of the bird
9. logger: unique identity of the radio tag
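The data are distributed as an Rdata file; a minimal sketch for loading it in R and inspecting the three documented objects:

```r
# Load the RData file; it provides the objects dfr, m and s described above
load("turnstone_2009_Barrow_nest-t401_transmitter.RData")
str(dfr)  # raw radio-tag signal data
str(m)    # nest metadata
str(s)    # metadata for the incubating parents
```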
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data presented here were used to produce the following paper:
Archibald, Twine, Mthabini, Stevens (2021) Browsing is a strong filter for savanna tree seedlings in their first growing season. J. Ecology.
The project under which these data were collected is: Mechanisms Controlling Species Limits in a Changing World. NRF/SASSCAL Grant number 118588
For information on the data or analysis please contact Sally Archibald: sally.archibald@wits.ac.za
Description of file(s):
File 1: cleanedData_forAnalysis.csv (required to run the R code: "finalAnalysis_PostClipResponses_Feb2021_requires_cleanData_forAnalysis_.R")
The data represent monthly survival and growth data for ~740 seedlings from 10 species under various levels of clipping.
The data consist of one .csv file with the following column names:
treatment Clipping treatment (1 - 5 months clip plus control unclipped)
plot_rep One of three randomised plots per treatment
matrix_no Where in the plot the individual was placed
species_code First three letters of the genus name and first three letters of the species name; uniquely identifies the species
species Full species name
sample_period Classification of sampling period into time since clip
status Alive or Dead
standing.height Vertical height above ground (in mm)
height.mm Length of the longest branch (in mm)
total.branch.length Total length of all the branches (in mm)
stemdiam.mm Basal stem diameter (in mm)
maxSpineLength.mm Length of the longest spine
postclipStemNo Number of resprouting stems (only recorded AFTER clipping)
date.clipped Date clipped
date.measured Date measured
date.germinated Date germinated
Age.of.plant Date measured - Date germinated
newtreat Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
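A minimal sketch for loading File 1 in R and recomputing the documented Age.of.plant from the two date columns (assuming the dates parse as year-month-day):

```r
dat <- read.csv("cleanedData_forAnalysis.csv")
dat$date.measured   <- as.Date(dat$date.measured)
dat$date.germinated <- as.Date(dat$date.germinated)
# Age.of.plant is documented above as date measured minus date germinated
age_check <- as.numeric(dat$date.measured - dat$date.germinated)
```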
File 2: Herbivory_SurvivalEndofSeason_march2017.csv (required to run the R code: "FinalAnalysisResultsSurvival_requires_Herbivory_SurvivalEndofSeason_march2017.R")
The data consist of one .csv file with the following column names:
treatment Clipping treatment (1 - 5 months clip plus control unclipped)
plot_rep One of three randomised plots per treatment
matrix_no Where in the plot the individual was placed
species_code First three letters of the genus name and first three letters of the species name; uniquely identifies the species
species Full species name
sample_period Classification of sampling period into time since clip
status Alive or Dead
standing.height Vertical height above ground (in mm)
height.mm Length of the longest branch (in mm)
total.branch.length Total length of all the branches (in mm)
stemdiam.mm Basal stem diameter (in mm)
maxSpineLength.mm Length of the longest spine
postclipStemNo Number of resprouting stems (only recorded AFTER clipping)
date.clipped Date clipped
date.measured Date measured
date.germinated Date germinated
Age.of.plant Date measured - Date germinated
newtreat Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
genus Genus
MAR Mean Annual Rainfall for that species' distribution (mm)
rainclass High/medium/low
File 3: allModelParameters_byAge.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")
Consists of a .csv file with the following column headings
Age.of.plant Age in days
species_code Species code
pred_SD_mm Predicted stem diameter in mm
pred_SD_up Top 75th quantile of stem diameter in mm
pred_SD_low Bottom 25th quantile of stem diameter in mm
treatdate Date when clipped
pred_surv Predicted survival probability
pred_surv_low Predicted 25th quantile survival probability
pred_surv_high Predicted 75th quantile survival probability
Bite.probability Daily probability of being eaten
max_bite_diam_duiker_mm Maximum bite diameter of a duiker for this species
duiker_sd Standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm Maximum bite diameter of a kudu for this species
kudu_sd Standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm Mean bite diameter of a duiker for this species
duiker_mean_sd Standard deviation of the mean bite diameter for a duiker
mean_bite_diameter_kudu_mm Mean bite diameter of a kudu for this species
kudu_mean_sd Standard deviation of the mean bite diameter for a kudu
genus Genus
rainclass Low/med/high
File 4: EatProbParameters_June2020.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")
Consists of a .csv file with the following column headings
shtspec species name
species_code species code
genus genus
rainclass low/medium/high
seed mass mass of seed (g per 1000 seeds)
Surv_intercept coefficient of the model predicting survival from age of clip for this species
Surv_slope coefficient of the model predicting survival from age of clip for this species
GR_intercept coefficient of the model predicting stem diameter from seedling age for this species
GR_slope coefficient of the model predicting stem diameter from seedling age for this species
species_code species code
max_bite_diam_duiker_mm Maximum bite diameter of a duiker for this species
duiker_sd standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm Maximum bite diameter of a kudu for this species
kudu_sd standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm mean bite diameter of a duiker for this species
duiker_mean_sd standard deviation of the mean bite diameter for a duiker for this species
mean_bite_diameter_kudu_mm mean bite diameter of a kudu for this species
kudu_mean_sd standard deviation of the mean bite diameter for a kudu for this species
AgeAtEscape_duiker[t] age of plant when its stem diameter is larger than a mean duiker bite
AgeAtEscape_duiker_min[t] age of plant when its stem diameter is larger than a min duiker bite
AgeAtEscape_duiker_max[t] age of plant when its stem diameter is larger than a max duiker bite
AgeAtEscape_kudu[t] age of plant when its stem diameter is larger than a mean kudu bite
AgeAtEscape_kudu_min[t] age of plant when its stem diameter is larger than a min kudu bite
AgeAtEscape_kudu_max[t] age of plant when its stem diameter is larger than a max kudu bite
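The AgeAtEscape columns invert the growth model: if stem diameter grows roughly linearly with age as GR_intercept + GR_slope * age (as the coefficient descriptions above suggest), the age at which the stem outgrows a given bite diameter follows directly. A hypothetical sketch (not the authors' script):

```r
# Hypothetical reconstruction under an assumed linear growth model
params <- read.csv("EatProbParameters_June2020.csv")
age_at_escape <- function(bite_diam_mm, intercept, slope) {
  (bite_diam_mm - intercept) / slope
}
# e.g. age at escape from a mean duiker bite, per species
esc_duiker <- age_at_escape(params$mean_bite_diam_duiker_mm,
                            params$GR_intercept, params$GR_slope)
```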
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
To help extend the life and reach of the bee taxonomy embedded in Dorey et al. 2023, this publication includes a versioned copy of the original data as obtained on 2023-01-05 via https://open.flinders.edu.au/ndownloader/files/43331472 , as well as a tab-separated-values file as converted using an R script `rds2tsv.R` containing:
write.table(readRDS(xzfile('/dev/stdin')), sep='\t', na='', row.names=F, quote=F)
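For reference, a hypothetical stand-alone equivalent of the conversion, assuming the RDS file is read from disk rather than from stdin:

```r
# Read the versioned RDS copy and write it out as tab-separated values
tax <- readRDS("bee-taxonomy.rds")
write.table(tax, "bee-taxonomy.tsv",
            sep = "\t", na = "", row.names = FALSE, quote = FALSE)
```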
Please cite the original work as well as DiscoverLife when using this derived and re-packaged dataset:
Dorey, J.B., Fischer, E.E., Chesshire, P.R. et al. A globally synthesised and flagged bee occurrence dataset and cleaning workflow. Sci Data 10, 747 (2023). https://doi.org/10.1038/s41597-023-02626-w
Ascher, J. S. and J. Pickering. 2022. Discover Life bee species guide and world checklist (Hymenoptera: Apoidea: Anthophila). http://www.discoverlife.org/mp/20q?guide=Apoidea_species Draft-56, 21 August, 2022
filename | content id |
bee-taxonomy.rds | hash://sha256/76d8bac0e8ba193afa3278108d1aed0e08d4de1497d27ff22e5aaee3195232b4 |
bee-taxonomy.rds | hash://md5/9cd3653a3553202eb9a3fdd684a86b6e |
bee-taxonomy.tsv | hash://sha256/b512043ddf994537ae1ed8068e44bf3a5cb9ab5e44ddf257cc82e67fe034e0e6 |
bee-taxonomy.tsv | hash://md5/ef136f270301830126deb3ced4da2383 |
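The content ids above can be checked in R with the base tools package (a minimal sketch):

```r
# Verify the md5 content ids listed in the table above
stopifnot(tools::md5sum("bee-taxonomy.rds") == "9cd3653a3553202eb9a3fdd684a86b6e")
stopifnot(tools::md5sum("bee-taxonomy.tsv") == "ef136f270301830126deb3ced4da2383")
```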
Below are the first 10 rows of the bee-taxonomy.tsv tab-delimited file output that can be downloaded below. The majority of the rows are derived from the Discover Life bee species guide and world checklist (Hymenoptera: Apoidea: Anthophila) (Ascher & Pickering 2022). Rows in the file include scientific name, taxonomic status, and higher taxonomy including subgenus.
| flags | taxonomic_status | source | accid | id | kingdom | phylum | class | order | family | subfamily | tribe | subtribe | validName | canonical | canonical_withFlags | genus | subgenus | species | infraspecies | authorship | taxon_rank | valid | notes |
| | accepted | DiscoverLife | 0 | 4 | Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Panurginae | Calliopsini | | Acamptopoeum argentinum (Friese, 1906) | Acamptopoeum argentinum | Acamptopoeum argentinum | Acamptopoeum | | argentinum | | (Friese, 1906) | Species | TRUE | |
| | synonym | DiscoverLife | 4 | 5 | Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Panurginae | Panurgini | Perditina | Perdita argentina Friese, 1906 | Perdita argentina | Perdita argentina | Perdita | | argentina | | Friese, 1906 | Species | FALSE | |
| | accepted | DiscoverLife | 0 | 6 | Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Panurginae | Calliopsini | | Acamptopoeum calchaqui Compagnucci, 2004 | Acamptopoeum calchaqui | Acamptopoeum calchaqui | Acamptopoeum | | calchaqui | | Compagnucci, 2004 | Species | TRUE | |
| | accepted | DiscoverLife | 0 | 7 | Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Panurginae | Calliopsini | | Acamptopoeum colombiense Shinn, 1965 | Acamptopoeum colombiense | Acamptopoeum colombiense | Acamptopoeum | | colombiense | | Shinn, 1965 | Species | TRUE | |
| | synonym | DiscoverLife | 7 | 8 | Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Panurginae | Calliopsini | | Acamptopoeum colombiensis_sic Shinn, 1965 | Acamptopoeum colombiensis | Acamptopoeum colombiensis_sic | Acamptopoeum | | colombiensis | | Shinn, 1965 | Species | FALSE | species _sic |
| | accepted | DiscoverLife | 0 | 9 | Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Panurginae | Calliopsini | | Acamptopoeum fernandezi Gonzalez, 2004 | Acamptopoeum fernandezi | Acamptopoeum fernandezi | Acamptopoeum | | fernandezi | | Gonzalez, 2004 | Species | TRUE | |
| | accepted | DiscoverLife | 0 | 10 | Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Panurginae | Calliopsini | | Acamptopoeum inauratum (Cockerell, 1926) | Acamptopoeum inauratum | Acamptopoeum inauratum | Acamptopoeum | | inauratum | | (Cockerell, 1926) | Species | TRUE | |
| | synonym | DiscoverLife | 10 | 11 | Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Panurginae | Panurgini | Camptopoeina | Camptopoeum (Acamptopoeum) inauratum Cockerell, 1926 | Camptopoeum (Acamptopoeum) inauratum | Camptopoeum (Acamptopoeum) inauratum | Camptopoeum | Acamptopoeum | inauratum | | Cockerell, 1926 | Species | FALSE | |
| | accepted | DiscoverLife | 0 | 12 | Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Panurginae | Calliopsini | | Acamptopoeum melanogaster Compagnucci, 2004 | Acamptopoeum melanogaster | Acamptopoeum melanogaster | Acamptopoeum | | melanogaster | | Compagnucci, 2004 | Species | TRUE | |
| | accepted | DiscoverLife | 0 | 13 | Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Panurginae | Calliopsini | | Acamptopoeum nigritarse (Vachal, 1909) | Acamptopoeum nigritarse | Acamptopoeum nigritarse | Acamptopoeum | | nigritarse | | (Vachal, 1909) | Species | TRUE | |
Please also credit the original dataset Dorey et al. 2023 when using this derived product.
The versioned workflow was captured by Preston (Elliot et al. 2020, 2023) with identifier hash://sha256/56e0b3d68f221ee79d61f6da7bfdfad927d63ab86700b856d0a42a133841779c and history
preston ls --anchor hash://sha256/56e0b3d68f221ee79d61f6da7bfdfad927d63ab86700b856d0a42a133841779c
https://preston.guoda.bio http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/ns/prov#SoftwareAgent urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc .
https://preston.guoda.bio http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/ns/prov#Agent urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc .
https://preston.guoda.bio http://purl.org/dc/terms/description "Preston is a software program that finds, archives and provides access to biodiversity datasets."@en urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc .
urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/ns/prov#Activity urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc .
urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc http://purl.org/dc/terms/description "An activity that assigns an alias to a content hash"@en urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc .
urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc http://www.w3.org/ns/prov#startedAtTime "2024-01-06T00:02:37.973Z"^^http://www.w3.org/2001/XMLSchema#dateTime urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc .
urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc http://www.w3.org/ns/prov#wasStartedBy https://preston.guoda.bio urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc .
https://doi.org/10.5281/zenodo.1410543 http://www.w3.org/ns/prov#usedBy urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc urn:uuid:96a86205-ef3f-4a0f-9c72-d8c1552c9fbc .
https://doi.org/10.5281/zenodo.1410543 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/dc/dcmitype/Software
ABSTRACT: The World Soil Information Service (WoSIS) provides quality-assessed and standardized soil profile data to support digital soil mapping and environmental applications at broad scale levels. Since the release of the 'WoSIS snapshot 2019' many new soil data were shared with us, registered in the ISRIC data repository, and subsequently standardized in accordance with the licenses specified by the data providers. The source data were contributed by a wide range of data providers; therefore, special attention was paid to the standardization of soil property definitions, soil analytical procedures and soil property values (and units of measurement). We presently consider the following soil chemical properties (organic carbon, total carbon, total carbonate equivalent, total nitrogen, phosphorus (extractable-P, total-P, and P-retention), soil pH, cation exchange capacity, and electrical conductivity) and physical properties (soil texture (sand, silt, and clay), bulk density, coarse fragments, and water retention), grouped according to analytical procedures (aggregates) that are operationally comparable. For each profile we provide the original soil classification (FAO, WRB, USDA, and version) and horizon designations as far as these have been specified in the source databases. Three measures for 'fitness-for-intended-use' are provided: positional uncertainty (for site locations), time of sampling/description, and a first approximation of the uncertainty associated with the operationally defined analytical methods. These measures should be considered during digital soil mapping and subsequent earth system modelling that use the present set of soil data.

DATA SET DESCRIPTION: The 'WoSIS 2023 snapshot' comprises data for 228k profiles from 217k geo-referenced sites that originate from 174 countries. The profiles represent over 900k soil layers (or horizons) and over 6 million records. The actual number of measurements for each property varies (greatly) between profiles and with depth, generally depending on the objectives of the initial soil sampling programmes. The data are provided in TSV (tab separated values) format and as a GeoPackage. The zip-file (446 Mb) contains the following files:
- Readme_WoSIS_202312_v2.pdf: Provides a short description of the dataset, file structure, column names, units and category values (this file is also available directly under 'online resources'). The pdf includes links to tutorials for reading the TSV files into R and Excel, respectively. See also 'HOW TO READ TSV FILES INTO R AND PYTHON' in the next section.
- wosis_202312_observations.tsv: Lists the four to six letter code for each observation, whether the observation is for a site/profile or a layer (horizon), the unit of measurement, and the number of profiles and layers, respectively, represented in the snapshot. It also provides an estimate of the inferred accuracy of the laboratory measurements.
- wosis_202312_sites.tsv: Characterizes the site locations where profiles were sampled.
- wosis_202312_profiles.tsv: Presents the unique profile ID (i.e. primary key), site_id, source of the data, country ISO code and name, positional uncertainty, latitude and longitude (WGS 1984), maximum depth of soil described and sampled, as well as information on the soil classification system and edition. Depending on the soil classification system used, the number of fields will vary.
- wosis_202312_layers.tsv: Characterises the layers (or horizons) per profile, and lists their upper and lower depths (cm).
- wosis_202312_xxxx.tsv: This type of file presents the results for each observation (e.g. "xxxx" = "bdfiod"), as defined under "code" in the file wosis_202312_observations.tsv (e.g. wosis_202312_bdfiod.tsv).
- wosis_202312.gpkg: Contains the above data files in GeoPackage format (which stores the files within an SQLite database).

HOW TO READ TSV FILES INTO R AND PYTHON:
A) To read the data in R, uncompress the ZIP file and set the working directory to the uncompressed folder, for example setwd('D:/WoSIS_2023_December/'). Then use read_tsv to read the TSV files, specifying the data type for each column (c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time):

observations = readr::read_tsv('wosis_202312_observations.tsv', col_types='cccciid')
observations ## show columns and first 10 rows
sites = readr::read_tsv('wosis_202312_sites.tsv', col_types='iddcccc')
sites
profiles = readr::read_tsv('wosis_202312_profiles.tsv', col_types='icciccddcccccciccccicccci')
profiles
layers = readr::read_tsv('wosis_202312_layers.tsv', col_types='iiciciiilcc')
layers
## Do this for each observation 'XXXX', e.g. file 'wosis_202312_orgc.tsv':
orgc = readr::read_tsv('wosis_202312_orgc.tsv', col_types='...

Visit https://dataone.org/datasets/sha256%3Aae94fefb74f928a3d482eee20abf33cf04d988555ef2beef2977eba7d5504bd7 for complete metadata about this dataset.
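For the GeoPackage distribution, a minimal sketch using the sf package; the layer name below is an assumption, so check the st_layers() output first:

```r
library(sf)
st_layers("wosis_202312.gpkg")  # enumerate the layers stored in the GeoPackage
# layer name assumed to mirror the TSV file naming
profiles <- st_read("wosis_202312.gpkg", layer = "wosis_202312_profiles")
```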
The authors have built a catalog of 219 Fanaroff and Riley class I edge-darkened radio galaxies (FR Is), called FRICAT, that is selected from a published sample and obtained by combining observations from the NVSS, FIRST, and SDSS surveys. They included in the catalog the sources with an edge-darkened radio morphology, redshift <= 0.15, and extending (at the sensitivity of the FIRST images) to a radius r larger than 30 kpc from the center of the host. The authors also selected an additional sample (sFRICAT) of 14 smaller (10 < r < 30 kpc) FR Is, limiting to z < 0.05. The hosts of the FRICAT sources are all luminous (-21 >~ M_r >~ -24), red early-type galaxies with black hole masses in the range 10^8 <~ M_BH <~ 3 x 10^9 solar masses; the spectroscopic classification based on the optical emission line ratios indicates that they are all low excitation galaxies. Sources in the FRICAT are then indistinguishable from the FR Is belonging to the Third Cambridge Catalogue of Radio Sources (3C) on the basis of their optical properties. Conversely, while the 3C-FR Is show a strong positive trend between radio and [O III] emission line luminosity, these two quantities are unrelated in the FRICAT sources; at a given line luminosity, they show radio luminosities spanning about two orders of magnitude and extending to much lower ratios between radio and line power than 3C-FR Is. The authors' main conclusion is that the 3C-FR Is represent just the tip of the iceberg of a much larger and diverse population of FR Is. This HEASARC table contains both the 219 radio galaxies in the main FRICAT sample listed in Table B.1 of the reference paper and the 14 radio galaxies in the additional sFRICAT sample listed in Table B.2 of the reference paper. To enable users to distinguish from which sample an entry has been taken, the HEASARC created a parameter galaxy_sample which is set to 'M' for galaxies from the main sample, and to 'S' for galaxies from the supplementary sFRICAT sample. Throughout the paper, the authors adopted a cosmology with H0 = 67.8 km s^-1 Mpc^-1, Omega_M = 0.308, and Omega_Lambda = 0.692 (Planck Collaboration XIII 2016). This table was created by the HEASARC in February 2017 based on electronic versions of Tables B.1 and B.2 that were obtained from the Astronomy & Astrophysics website. This is a service provided by NASA HEASARC.
We provide instructions, code and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for any researchers or practitioners to apply the topic-based segmentation model with unstructured texts (latent class regression with group variable selection) to their own datasets. First, we provide an R script to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R, and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note: due to Yelp's dataset terms of use and data-size restrictions, we provide a link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships, you will get the dendrograms of selected groups of variables shown in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running benchmark models in Table 6]

Unsupervised topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors, and conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: 'lm' default function in R.

Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, conduct prediction of the dependent variable per segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, conduct prediction of the dependent variable per segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer...
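As a concrete illustration of the unsupervised topic model benchmark described above, a minimal sketch; the document-term matrix dtm, the response vector y, and k = 10 are placeholders, not values from the paper:

```r
library(topicmodels)
# dtm: a document-term matrix of reviews (assumed prepared beforehand)
lda_fit <- LDA(dtm, k = 10)               # k chosen e.g. with the 'ldatuning' package
topic_probs <- posterior(lda_fit)$topics  # per-document topic probabilities
# use the topic probabilities as predictors in a regression on star ratings
reg_df <- data.frame(stars = y, topic_probs)
fit <- lm(stars ~ ., data = reg_df)
```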
We present R-band images covering more than 11 square degrees of sky that were obtained in preparation for the Spitzer Space Telescope First-Look Survey (FLS). The FLS was designed to characterize the mid-infrared sky at depths 2 orders of magnitude deeper than previous surveys. The extragalactic component is the first cosmological survey done with Spitzer. Source catalogs extracted from the R-band images are also presented. The R-band images were obtained using the Mosaic-1 camera on the 4m Mayall Telescope of the Kitt Peak National Observatory. Two relatively large regions of the sky were observed to modest depth: the main FLS extragalactic field (17h18m00s, +59{deg}30'00.0"[J2000]; l=88.3{deg}, b=+34.9{deg}) and the ELAIS-N1 field (16h10m01s, +54{deg}30'36.0"; l=84.2{deg}, b=+44.9{deg}). While both these fields were in early plans for the FLS, only a single deep-pointing test observation was made at the ELAIS-N1 location. The larger Legacy program SWIRE will include this region among its surveyed areas. The data products of our KPNO imaging (images and object catalogs) are made available to the community through the World Wide Web (via the Spitzer Science Center and NOAO Science Archive, http://ssc.spitzer.caltech.edu/fls/). The overall quality of the images is high. The measured positions of sources detected in the images have rms uncertainties in their absolute positions on the order of 0.35" with possible systematic offsets on the order of 0.1", depending on the reference frame of comparison. The relative astrometric accuracy is much better than 1/10 of an arcsecond. Typical delivered image quality in the images is 1.1" full width at half-maximum. The images are relatively deep, since they reach a median 5{sigma} depth limiting magnitude of R=25.5 (Vega) as measured within a 1.35 FWHM aperture, for which the signal-to-noise ratio (S/N) is maximal. Catalogs have been extracted with SExtractor, using thresholds in area and flux for which the number of false detections is below 1% at R=25. Only sources with S/N>3 have been retained in the final catalogs. Comparing the galaxy number counts from our images with those of deeper R-band surveys, we estimate that our observations are 50% complete at R=24.5. These limits in depth are sufficient to identify a substantial fraction of the infrared sources that will be detected by Spitzer. Use of the data: Use of these data must be accompanied by citation of the paper and acknowledgment: "The National Optical Astronomy Observatory (NOAO) is operated by the Association of Universities for Research in Astronomy (AURA), Inc. under cooperative agreement with the National Science Foundation."
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. It stores the datasets that were used in the studies that served as research material for the thesis, together with the datasets used in its experimental part.
Below, each dataset is specified along with the details of its references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains ratings and reviews for 1K+ Amazon products, as listed on the official Amazon website. The data was scraped in January 2023 from the official Amazon website.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
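A minimal sketch of the recommended shuffle in R, using the two documented columns:

```r
rt <- read.csv("data_rt.csv")
set.seed(42)                  # reproducible shuffle
rt <- rt[sample(nrow(rt)), ]  # rows are ordered negative-first, so shuffle them
table(rt$labels)              # expect 5,331 of each class
```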
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews with star ratings for 10 of the latest (as of mid-2019) Bluetooth earphone devices, intended for training machine-learning models for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using the ReviewStar score).
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using the overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
The English Longitudinal Study of Ageing (ELSA) is a longitudinal survey of ageing and quality of life among older people that explores the dynamic relationships between health and functioning, social networks and participation, and economic position as people plan for, move into and progress beyond retirement. The main objectives of ELSA are to:
Further information may be found on the ELSA project website (https://www.elsa-project.ac.uk/) or the NatCen Social Research ELSA web pages.
Wave 11 data has been deposited - May 2025
For the 45th edition (May 2025) ELSA Wave 11 core and pension grid data and documentation were deposited. Users should note this dataset version does not contain the survey weights. A version with the survey weights along with IFS and financial derived datasets will be deposited in due course. In the meantime, more information about the data collection or the data collected during this wave of ELSA can be found in the Wave 11 Technical Report or the User Guide.
Wave 10 Accelerometry data has been deposited - August 2025
For the 46th edition (August 2025) ELSA Wave 10 Accelerometry data and documentation, along with a new version of the Wave 10 Technical Report, have been deposited. Between June 2021 and October 2022, approximately 75% of ELSA households (including core members and partners) were randomly selected and invited to wear an Axivity AX3 tri-axial accelerometer for eight days and nights. Accelerometers were used to objectively measure movement behaviours for the first time in ELSA. Four datasets containing accelerometer data were deposited: output from the Biobank accelerometer analysis (bbaa) and 24-hour movement behaviours; step count data; and overnight sleep and sleep stage data.
Wave 10 HCAP2 End of Life data has been deposited - September 2025:
For the 47th edition (September 2025), the HCAP2 (Wave 10) End of Life interview data and questionnaire documentation were deposited. The End of Life interview completes the information collected at previous waves of ELSA by interviewing a close friend or relative of the deceased ELSA sample member after their death. Previous End of Life interviews were carried out alongside Waves 2, 3, 4, and 6 of ELSA. The fieldwork for HCAP2 (Wave 10) End of Life took place between 2022 and 2024. For more information please refer to the questionnaire documentation. The End of Life User Guide will be updated at a later date.
Health conditions research with ELSA - June 2021
The ELSA Data team have found some issues with historical data measuring health conditions. If you are intending to do any analysis looking at the following health conditions, then please read the ELSA User Guide or if you still have questions contact elsadata@natcen.ac.uk for advice on how you should approach your analysis. The affected conditions are: eye conditions (glaucoma; diabetic eye disease; macular degeneration; cataract), CVD conditions (high blood pressure; angina; heart attack; Congestive Heart Failure; heart murmur; abnormal heart rhythm; diabetes; stroke; high cholesterol; other heart trouble) and chronic health conditions (chronic lung disease; asthma; arthritis; osteoporosis; cancer; Parkinson's Disease; emotional, nervous or psychiatric problems; Alzheimer's Disease; dementia; malignant blood disorder; multiple sclerosis or motor neurone disease).
For information on obtaining data from ELSA that are not held at the UKDS, see the ELSA Genetic data access and Accessing ELSA data webpages.
Harmonized dataset:
Users of the Harmonized dataset who prefer to use the Stata version will need access to Stata MP software, as the version G3 file contains 11,779 variables (the limit for the standard Stata 'Intercooled' version is 2,047).
ELSA COVID-19 study:
A separate ad-hoc study conducted with ELSA respondents, measuring the socio-economic effects/psychological impact of the lockdown on the aged 50+ population of England, is also available under SN 8688,
English Longitudinal Study of Ageing COVID-19 Study.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement for a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Owing to this intrinsic challenge, no global, long-term, univariate gap-filled product had been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments in which satellite-like gaps were introduced into GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.
You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following script will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
echo "Downloading $year.zip..."
wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
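For example, a minimal sketch for reading one daily file with the ncdf4 package; the soil moisture variable name is an assumption, so check names(nc$var) first:

```r
library(ncdf4)
# file name follows the convention above; the date is only an example
nc <- nc_open("ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc")
print(names(nc$var))          # list the data variables contained in the file
sm  <- ncvar_get(nc, "sm")    # variable name assumed
lat <- ncvar_get(nc, "lat")
lon <- ncvar_get(nc, "lon")
nc_close(nc)
```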
Changes in v9.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:
The following records are all part of the ESA CCI Soil Moisture science data records community:
1. ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).

Resources in this dataset:
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format (PBMC7_AllCells.zip): Zipped folder containing the PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gz), gene names (features.tsv.gz), cell IDs (barcodes.tsv.gz). *The 'raw' count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are non-integer values in this matrix, but they should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata (PBMC7_AllCells_meta.csv): .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell; nFeature_RNA = the number of genes detected in a cell; Loupe = cell barcodes, corresponding to the cell IDs found in the .h5Seurat and 10X-formatted objects for all cells; prcntMito = percent mitochondrial reads in a cell; Scrublet = doublet probability score assigned to a cell; seurat_clusters = cluster ID assigned to a cell; PaperIDs = sample ID for a cell; celltypes = cell type ID assigned to a cell.
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates (PBMC7_AllCells_PCAcoord.csv): .csv file containing the first 100 PCA coordinates for all cells.
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates (PBMC7_AllCells_tSNEcoord.csv): .csv file containing t-SNE coordinates for all cells.
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates (PBMC7_AllCells_UMAPcoord.csv): .csv file containing UMAP coordinates for all cells.
- Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates (PBMC7_CD4only_tSNEcoord.csv): .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in the publication can be re-assigned using this .csv file.
- Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates (PBMC7_CD4only_UMAPcoord.csv): .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28); the UMAP coordinates used in the publication can be re-assigned in the same way.
- Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates (PBMC7_GDonly_UMAPcoord.csv): .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in the publication can be re-assigned using this .csv file.
- Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates (PBMC7_GDonly_tSNEcoord.csv): .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31); the t-SNE coordinates used in the publication can be re-assigned in the same way.
- Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information (UnfilteredGeneInfo.txt): .txt file containing the gene nomenclature information used to assign gene names in the dataset. The 'Name' column corresponds to the name assigned to a feature in the dataset.
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat (PBMC7.tar): .h5Seurat object of all cells in the PBMC dataset. The file needs to be untarred, then read into R using the function LoadH5Seurat().
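A minimal sketch in R following the read-in functions named above (file paths assumed relative to the unzipped/untarred resources):

```r
library(Seurat)      # provides Read10X()
library(SeuratDisk)  # provides LoadH5Seurat()

# 10X-format counts: unzip PBMC7_AllCells.zip first
counts <- Read10X("PBMC7_AllCells/")

# Full Seurat object: untar PBMC7.tar first
pbmc <- LoadH5Seurat("PBMC7_AllCells.h5Seurat")

# Published cell metadata and t-SNE coordinates for re-assignment
meta <- read.csv("PBMC7_AllCells_meta.csv")
tsne <- read.csv("PBMC7_AllCells_tSNEcoord.csv")
```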
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of 12-lead ECGs with annotations. The dataset contains 345 779 exams from 233 770 patients. It was obtained through stratified sampling from the CODE dataset (15% of the patients). The data was collected by the Telehealth Network of Minas Gerais in the period between 2010 and 2016.
This repository contains the files `exams.csv` and the files `exams_part{i}.zip` for i = 0, 1, 2, ... 17.
In Python, one can read these files using h5py.
```python
import h5py
import numpy as np

f = h5py.File(path_to_file, 'r')
# Get ids
traces_ids = np.array(f['id_exam'])
x = f['signal']
```
The `signal` dataset is too large to fit in memory, so don't convert it to a numpy array all at once.
It is possible to access a chunk of it using: ``x[start:end, :, :]``.
The CODE dataset was collected by the Telehealth Network of Minas Gerais (TNMG) in the period between 2010 and 2016. TNMG is a public telehealth system assisting 811 out of the 853 municipalities in the state of Minas Gerais, Brazil. The dataset is described in:
Ribeiro, Antônio H., Manoel Horta Ribeiro, Gabriela M. M. Paixão, Derick M. Oliveira, Paulo R. Gomes, Jéssica A. Canazart, Milton P. S. Ferreira, et al. “Automatic Diagnosis of the 12-Lead ECG Using a Deep Neural Network.” Nature Communications 11, no. 1 (2020): 1760. https://doi.org/10.1038/s41467-020-15432-4
The CODE 15% dataset is obtained from stratified sampling from the CODE dataset. This subset of the CODE dataset is described in, and used for assessing model performance in:
"Deep neural network estimated electrocardiographic-age as a mortality predictor"
Emilly M Lima, Antônio H Ribeiro, Gabriela MM Paixão, Manoel Horta Ribeiro, Marcelo M Pinto Filho, Paulo R Gomes, Derick M Oliveira, Ester C Sabino, Bruce B Duncan, Luana Giatti, Sandhi M Barreto, Wagner Meira Jr, Thomas B Schön, Antonio Luiz P Ribeiro. medRxiv (2021) https://www.doi.org/10.1101/2021.02.19.21251232
The companion code for reproducing the experiments in the two papers described above can be found, respectively, in:
- https://github.com/antonior92/automatic-ecg-diagnosis; and in,
- https://github.com/antonior92/ecg-age-prediction.
Note about authorship: Antônio H. Ribeiro, Emilly M. Lima and Gabriela M.M. Paixão contributed equally to this work.
This dataset is designed to accompany the paper submitted to Data Science Journal: O'Brien et al, "Earth Science Data Repositories: Implementing the CARE Principles". This dataset shows examples of activities that data repositories are likely to undertake as they implement the CARE principles. These examples were constructed as part of a discussion about the challenges faced by data repositories when acquiring, curating, and disseminating data and other information about Indigenous Peoples, communities, and lands. For clarity, individual repository activities were very specific. However, in practice, repository activities are not carried out singly, but are more likely to be performed in groups or in sequence. This dataset shows examples of how activities are likely to be combined in response to certain triggers. See related dataset O'Brien, M., R. Duerr, R. Taitingfong, A. Martinez, L. Vera, L. Jennings, R. Downs, E. Antognoli, T. ten Brink, N. Halmai, S.R. Carroll, D. David-Chavez, M. Hudson, and P. Buttigieg. 2024. Alignment between CARE Principles and Data Repository Activities. Environmental Data Initiative. https://doi.org/10.6073/pasta/23e699ad00f74a178031904129e78e93 (Accessed 2024-03-13), and the paper for more information about development of the activities and their categorization, raw data of relationships between specific activities and a discussion of the implementation of CARE Principles by data repositories.
Data in this table are organized into groups delineated by a triggering event in the
first column. For example, the first group consists of 9 rows, while the second group has 7
rows. The first row of each group contains the event that triggers the set of actions
described in the last 4 columns of the spreadsheet. Within each group, the associated rows
in each column are given in numerical not temporal order, since activities will likely vary
widely from repository to repository.
For example, the first group of rows is about what likely needs to happen if a
repository discovers that it holds Indigenous data (O6). Clearly, it will need to develop
processes to identify communities to engage (R6) as well as processes for contacting those
communities (R7) (if it doesn't already have them). It will also probably need to review and
possibly update its data management policies to ensure that they are justifiable (R2). Based
on these actions, it is likely that the repository's outreach group needs to prepare for
working with more communities (O3) including ensuring that the repository's governance
protocols are up-to-date and publicized (O5) and that the repository practices are
transparent (O4). If initial contacts go well, it is likely that the repository will need
ongoing engagement with the community or communities (S1). This may include adding
representation to the repository's advisory board (O2); clarifying data usage with the
communities (O9), facilitating relationships between data providers and communities (O1);
working with the community to identify educational opportunities (O10); and sharing data
with them (O8). It may also become necessary to liaise with whoever is maintaining the vocabularies in use at the repository (O7).
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In the first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
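For illustration, a minimal base-R sketch of this two-stage design (the enumeration-area frame and stratum labels below are hypothetical; the actual script is distributed with the dataset):

```r
# Sketch only: proportional allocation of EAs to strata, then EA selection.
set.seed(123)

ea_frame <- data.frame(
  ea_id   = 1:4000,
  stratum = sample(paste0("geo1_", 1:8, "_", c("urban", "rural")),
                   4000, replace = TRUE)
)

n_ea_total <- 8000 / 25   # 320 EAs yield 8,000 households at 25 per EA

# Stage 1: number of EAs per stratum, proportional to stratum size
sizes <- table(ea_frame$stratum)
n_ea  <- round(n_ea_total * sizes / sum(sizes))

sampled_eas <- do.call(rbind, lapply(names(sizes), function(s) {
  d <- ea_frame[ea_frame$stratum == s, ]
  d[sample(nrow(d), n_ea[[s]]), ]
}))

# Stage 2: within each sampled EA, 25 households are drawn at random from
# its household listing, e.g. sample(household_ids, 25).
```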
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected in sample surveys or population censuses, no questionnaire is available for this dataset. However, a "fake" questionnaire was created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.
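As a hedged illustration of what such a validator might look like (the consistency rule below is invented for the example, not one actually used in the generation process):

```r
# Sketch only: reject synthetic rows that violate a consistency check.
synth <- data.frame(
  relationship = c("head", "spouse", "child"),
  age          = c(42, 39, 7),
  ever_married = c("yes", "yes", "yes")  # implausible for a 7-year-old child
)

valid <- !(synth$relationship == "child" & synth$age < 12 &
           synth$ever_married == "yes")
synth[valid, ]  # rejected rows would be regenerated by the model
```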
This is a synthetic dataset; the "response rate" is 100%.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/I9E6EX
Enclosed are all the replication materials for "Consolidating Progress: The Selection of Female Ministers in Autocracies and Democracies". We use R to conduct the analyses in the paper. Four files are needed to replicate the analysis: one data file and three R scripts. Below we briefly go through each script.

The dataset is found in "df_consolidatingprogress.csv". We use only this dataset in the analysis; there are no other datasets. All variables are gathered from publicly available datasets, and we discuss each in detail in Appendix B. The script which merges these datasets is not public, but please contact the authors if you are interested in it.

The first part of the analysis is found in "1_descriptive_analysis.R". This is the script which replicates Figure 1 and Figure 2. The second part of the analysis is found in "2_analysis.R". This is the script which replicates Figure 3, Table 2, Figure 4, Figure 5, and Figure 6. The script which replicates the appendix is found in "3_appendix.R". This file creates all figures and tables found in the appendix.

In addition to these files, we have also uploaded "0_createdata", which cleans, transforms, merges, and otherwise processes the data prior to the analysis. The original datasets have also been uploaded. These are:
1_WhoGov_within_V2.0.xlsx
2_WhoGov_crosssectional_V2.0.xlsx
3_vdem_V12.xlsx
4_bmr_V4_edited.xlsx
5_bjornskovrode_V4.2.xlsx
6_polityiv.rds
7_autocraciesoftheworld.xlsx
8_pwt_v10.0.xlsx
9_qog_std_ts_jan22
10_qog_std_cs_jan22.xlsx
11_wb.xlsx
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R scripts in this fileset are those used in the PLOS ONE publication "A snapshot of translational research funded by the National Institutes of Health (NIH): A case study using behavioral and social science research awards and Clinical and Translational Science Awards funded publications." The article can be accessed here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0196545

This consists of all R scripts used for data cleaning, data manipulation, and statistical analysis in the publication. There are eleven files in total:
1. "Step1a.bBSSR.format.grants.and.publications.data.R" combines all bBSSR 2008-2014 grant award data and associated publications downloaded from NIH Reporter.
2. "Step1b.BSSR.format.grants.and.publications.data.R" combines all BSSR-only 2008-2014 grant award data and associated publications downloaded from NIH Reporter.
3. "Step2a.bBSSR.get.pubdates.transl.and.all.grants.R" queries PubMed and downloads associated bBSSR publication data.
4. "Step2b.BSSR.get.pubdates.transl.and.all.grants.R" queries PubMed and downloads associated BSSR-only publication data.
5. "Step3.summary.stats.R" performs summary statistics.
6. "Step4.time.to.first.publication.R" performs the time-to-first-publication analysis.
7. "Step5.time.to.citation.analysis.R" performs the time-to-first-citation and time-to-overall-citation analyses.
8. "Step6.combine.NIH.iCite.data.R" combines NIH iCite citation data.
9. "Step7.iCite.data.analysis.R" performs citation analysis on the combined iCite data.
10. "Step8.MeSH.descriptors.R" queries PubMed and pulls down all MeSH descriptors for all publications.
11. "Step9.CTSA.publications.R" compares the percent of translational publications among bBSSR, BSSR-only, and CTSA publications.
By City of San Francisco [source]
This dataset explores the late-night departure runways used by aircraft at San Francisco International Airport (SFO). From 1:00 a.m. to 6:00 a.m., when safety and weather conditions permit, aircraft are directed to runways 10L/R, 01L/R or 28L/R with an immediate right turn, following over-water departure procedures that route aircraft over the bay rather than over the surrounding residential communities, in order to reduce noise. The data are broken down by runway, month and year of departure, as well as the percentage of each month's total departures from each runway, allowing a comprehensive look at SFO's preferential late-night runway use.
This dataset can be used to analyze late-night aircraft departures from San Francisco International Airport in order to study the impact of runway usage on air and noise pollution in nearby residential communities. It contains the number of departures from each runway (01L/R, 10L/R, 19L/R and 28L/R) for a given year and month. By studying the percentage of total departures by runway, we can understand how heavily each runway is used during late-night hours.

To use this dataset, first become familiar with the column names, such as Year, Month, 01L/R (number of departures from runway 01L/R), and 01L/R Percent of Departures (percentage of departures from runway 01L/R). It also helps to be familiar with the terms "departure" and "late night" as used in this dataset.

Once familiar with these details, you can explore the data for insights into how specific runways are used for late-night operations at San Francisco Airport, and note any patterns or trends that emerge across multiple months or years. Additionally, by comparing percentages between runways, you can measure which runways are preferred during periods of heavier traffic, such as holidays or summer months when residents travel more often.
- To identify areas of the San Francisco Airport prone to noise pollution from aircraft and develop ways to limit it.
- To analyze the impacts of changing departure runway preferences on noise pollution levels over residential communities near the airport.
- To monitor seasonal trends in late-night aircraft departures by runway, along with identifying peak hours for each runway, in order to inform flight controllers and develop improved flight control regulations and procedures at San Francisco Airport.
If you use this dataset in your research, please credit the original authors. See the dataset description for more information.
File: late-night-preferential-runway-use-1.csv

| Column name | Description |
|:--------------------------------|:--------------------------------------------------------|
| Year | The year of the data. (Integer) |
| Month | The month of the data. (String) |
| 01L/R | The number of departures from runway 01L/R. (Integer) |
| 01L/R Percent of Departures | The percentage of departures from runway 01L/R. (Float) |
| 10L/R | The number of departures from runway 10L/R. (Integer) |
| 10L/R Percent of Departures | The percentage of departures from runway 10L/R. (Float) |
| 19L/R | The number of departures from runway 19L/R. (Integer) |
| 19L/R Percent of Departures | The percentage of departures from runway 19L/R. (Float) |
| 28L/R | The number of departures from runway 28L/R. (Integer) |
| 28L/R Percent of Departures | The percentage of departures from runway 28L/R. (Float) |
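A minimal R sketch for exploring preferential use (assumes the CSV file named above, with the columns listed in the table):

```r
# Sketch only: average share of late-night departures per runway.
sfo <- read.csv("late-night-preferential-runway-use-1.csv", check.names = FALSE)

pct_cols <- grep("Percent of Departures", names(sfo), value = TRUE)
shares   <- colMeans(sfo[, pct_cols], na.rm = TRUE)

barplot(shares, las = 2, cex.names = 0.7,
        ylab = "Mean % of late-night departures")
```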
If you use this dataset in your research, please credit the City of San Francisco.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MeteoSerbia1km is the first daily gridded meteorological dataset at a 1-km spatial resolution across Serbia for the 2000–2019 period. The dataset consists of five daily variables: maximum, minimum and mean temperature, mean sea level pressure, and total precipitation. Besides daily summaries, it contains monthly and annual summaries, as well as daily, monthly, and annual long-term means (LTM). Daily gridded data were interpolated using the Random Forest Spatial Interpolation methodology, which is based on Random Forest and uses the nearest observations and the distances to them as spatial covariates, together with environmental covariates. The complete R script and the datasets used for modelling, tuning, validation, and prediction of the daily meteorological variables are available here. If you discover a bug, artifact or inconsistency in the MeteoSerbia1km maps, or if you have a question, please use this channel.

File naming convention of the .zip files and the MeteoSerbia1km files they contain:
- Daily summaries per year: day_yyyy_proj.zip, containing var_day_yyyymmdd_proj.tif
- Monthly summaries: mon_proj.zip, containing var_mon_yyyymm_proj.tif
- Annual summaries: ann_proj.zip, containing var_ann_yyyy_proj.tif
- Daily, monthly and annual LTM: ltm_proj.zip, containing daily LTM var_ltm_day_mmdd_proj.tif, monthly LTM var_ltm_mon_mm_proj.tif, and annual LTM var_ltm_ann_proj.tif

where:
- var is the daily meteorological variable name: tmax, tmin, tmean, slp, or prcp
- proj is the dataset projection: wgs84 or utm34

Units of the dataset values:
- temperature (Tmean, Tmax and Tmin): tenths of a degree on the Celsius scale (℃)
- SLP: tenths of a mbar
- PRCP: tenths of a mm

All dataset values are stored as integers (INT32 data type) in order to reduce the size of the GeoTIFF files; i.e., temperature values should be divided by 10 to obtain degrees Celsius, and likewise for SLP and PRCP to obtain millibars and millimetres.
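Because all values are stored as scaled integers, a unit conversion is needed after reading the files. A minimal R sketch (assumes the terra package; the file names below simply follow the naming convention above and are illustrative):

```r
# Sketch only: read one daily GeoTIFF and rescale to physical units.
library(terra)

tmean_raw <- rast("tmean_day_20190701_utm34.tif")  # INT32, tenths of a degree C
tmean_c   <- tmean_raw / 10                        # degrees Celsius

prcp_raw <- rast("prcp_day_20190701_utm34.tif")    # INT32, tenths of a mm
prcp_mm  <- prcp_raw / 10                          # millimetres
```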
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AgrImOnIA dataset is a comprehensive dataset relating air quality and livestock (expressed as the density of bovines and swine bred) along with weather and other variables. The AgrImOnIA Dataset represents the first step of the AgrImOnIA project. The purpose of this dataset is to give the opportunity to assess the impact of agriculture on air quality in Lombardy through statistical techniques capable of highlighting the relationship between the livestock sector and air pollutants concentrations.
The building process of the dataset is detailed in the companion paper:
A. Fassò, J. Rodeschini, A. Fusta Moro, Q. Shaboviq, P. Maranzano, M. Cameletti, F. Finazzi, N. Golini, R. Ignaccolo, and P. Otto (2023). Agrimonia: a dataset on livestock, meteorology and air quality in the Lombardy region, Italy. SCIENTIFIC DATA, 1-19.
available here.
This dataset is a collection of estimated daily values for a range of measurements across different dimensions: air quality, meteorology, emissions, livestock animals and land use. Data relate to Lombardy and the surrounding area for 2016-2021, inclusive. The surrounding area is obtained by applying a 0.3° buffer on the Lombardy borders. Several aggregation and interpolation methods were used to estimate the measurements for all days.
The files in the record, renamed according to their version (e.g. .._v_3_0_0), are:
Agrimonia_Dataset.csv (.mat and .Rdata), which is built by joining the daily time series related to the AQ, WE, EM, LI and LA variables. To simplify access to variables in the Agrimonia dataset, each variable name starts with the dimension of the variable; i.e., the names of the variables related to the AQ dimension start with 'AQ_'. This file is also archived in formats for the MATLAB and R software.
Metadata_Agrimonia.csv which provides further information about the Agrimonia variables: e.g. sources used, original names of the variables imported, transformations applied.
Metadata_AQ_imputation_uncertainty.csv, which contains daily uncertainty estimates for the imputed AQ observations; imputation was used to mitigate missing data in the hourly time series.
Metadata_LA_CORINE_labels.csv which contains the label and the description associated with the CLC class.
Metadata_monitoring_network_registry.csv, which contains all details about the AQ monitoring stations used to build the dataset. Information about the air quality monitoring stations includes: station type, municipality code, environment type, altitude, pollutants sampled and more. Each row represents a single sensor.
Metadata_LA_SIARL_labels.csv which contains the label and the description associated with the SIARL class.
AGC_Dataset.csv(.mat and .Rdata) that includes daily data of almost all variables available in the Agrimonia Dataset (excluding AQ variables) on an equidistant grid covering the Lombardy region and its surrounding area.
The Agrimonia dataset can be reproduced using the code available at the GitHub page: https://github.com/AgrImOnIA-project/AgrImOnIA_Data
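Given the 'AQ_' prefix convention described above, a minimal R sketch for loading the R version of the dataset and selecting the air-quality variables (the name of the object stored inside the .Rdata file is not documented here, so it is retrieved generically):

```r
# Sketch only: load the .Rdata file and pull out AQ_ columns by prefix.
env <- new.env()
load("Agrimonia_Dataset.Rdata", envir = env)
agri <- get(ls(env)[1], envir = env)  # the (expected single) object in the file

aq <- agri[, grepl("^AQ_", names(agri)), drop = FALSE]  # air-quality variables
summary(aq)
```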
UPDATE 31/05/2023 - NEW RELEASE - V 3.0.0
A new version of the dataset is released: Agrimonia_Dataset_v_3_0_0.csv (.Rdata and .mat), where the variables WE_rh_min, WE_rh_mean and WE_rh_max have been recomputed due to some bugs.
In addition, two new columns have been added, LI_pigs_v2 and LI_bovine_v2, which represent the density of pigs and bovines (expressed as animals per square kilometer) within a square of roughly 10 x 10 km centered at the station location.
A new dataset is released: the Agrimonia Grid Covariates (AGC), which includes daily information for the period 2016 to 2020 for almost all variables in the Agrimonia Dataset, on an equidistant grid covering the Lombardy region and its surrounding area. The AGC does not include AQ variables, as these come from monitoring stations that are irregularly distributed over the area considered.
UPDATE 11/03/2023 - NEW RELEASE - V 2.0.2
A new version of the dataset is released: Agrimonia_Dataset_v_2_0_2.csv (.Rdata), where the variable WE_tot_precipitation has been recomputed due to some bugs.
A new version of the metadata is available: Metadata_Agrimonia_v_2_0_2.csv where the spatial resolution of the variable WE_precipitation_t is corrected.
UPDATE 24/01/2023 - NEW RELEASE - V 2.0.1
Minor bug fixed.
UPDATE 16/01/2023 - NEW RELEASE - V 2.0.0
A new version of the dataset is released, Agrimonia_Dataset_v_2_0_0.csv (.Rdata) and Metadata_monitoring_network_registry_v_2_0_0.csv. Some minor points have been addressed:
Added values for the LA_land_use variable for Switzerland stations (in Agrimonia_Dataset_v_2_0_0.csv)
Deleted incorrect values for the LA_soil_use variable for stations outside the Lombardy region during 2018 (in Agrimonia_Dataset_v_2_0_0.csv)
Fixed duplicate sensors corresponding to the same pollutant within the same station (in Metadata_monitoring_network_registry_v_2_0_0.csv)
Non-rigid 3D objects are commonly seen in our surroundings. However, previous efforts have been devoted mainly to the retrieval of rigid 3D models, so comparing non-rigid 3D shapes is still a challenging problem in content-based 3D object retrieval. We therefore organize this track to promote the development of non-rigid 3D shape retrieval. The objective of this track is to evaluate the performance of 3D shape retrieval approaches on a subset of a publicly available non-rigid 3D model database, the McGill Articulated Shape Benchmark database.

Task description: The task is to evaluate the dissimilarity between every two objects in the database and then output the dissimilarity matrix.

Data set: The McGill Articulated Shape Benchmark database consists of 255 non-rigid 3D models classified into 10 categories. The maximum number of objects in a class is 31, while the minimum is 20. 200 models were selected (or modified) to generate our test database, so that every class contains an equal number of models. The models are represented as watertight triangle meshes, and the file format is the ASCII Object File Format (*.off). The original database is publicly available at: http://www.cim.mcgill.ca/~shape/benchMark/

Evaluation methodology: We will employ the following evaluation measures: Precision-Recall curve; Average Precision (AP) and Mean Average Precision (MAP); E-Measure; Discounted Cumulative Gain; Nearest Neighbor, First-Tier (Tier1) and Second-Tier (Tier2).

Please cite the paper: Z. Lian, A. Godil, T. Fabry, T. Furuya, J. Hermans, R. Ohbuchi, C. Shu, D. Smeets, P. Suetens, D. Vandermeulen, S. Wuhrer. SHREC'10 Track: Non-rigid 3D Shape Retrieval. In: M. Daoudi, T. Schreck, M. Spagnuolo, I. Pratikakis, R. Veltkamp (eds.), Proceedings of the Eurographics/ACM SIGGRAPH Symposium on 3D Object Retrieval, 2010.
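As a hedged sketch of the required output format (the feature descriptors below are random placeholders; a real entry would compute shape descriptors from the .off meshes):

```r
# Sketch only: build and save a 200 x 200 dissimilarity matrix.
set.seed(1)
n_models <- 200
features <- matrix(rnorm(n_models * 64), nrow = n_models)  # hypothetical 64-d descriptors

dissim <- as.matrix(dist(features, method = "euclidean"))  # pairwise dissimilarities
write.table(dissim, "dissimilarity_matrix.txt",
            row.names = FALSE, col.names = FALSE)
```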