Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
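A minimal sketch of the REVS idea in R may help make this concrete (this is not the authors' implementation: the data here are simulated, and per-variable empirical support is approximated by summed Akaike weights from the MuMIn package):

library(MuMIn)
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), x4 = rnorm(100))
dat$y <- dat$x1 + 0.5 * dat$x2 + rnorm(100)
options(na.action = "na.fail")                         # required by dredge()
all_subsets <- dredge(lm(y ~ x1 + x2 + x3 + x4, data = dat))  # all-subsets regression
support <- sort(sw(all_subsets), decreasing = TRUE)    # per-variable empirical support
vars <- names(support)
# Nested series: model i contains the i most-supported variables
models <- lapply(seq_along(vars), function(i)
  lm(reformulate(vars[1:i], response = "y"), data = dat))
sapply(models, AIC)                                    # only n models to compare post-hoc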
This dataset provides geospatial location data and scripts used to analyze the relationship between MODIS-derived NDVI and solar and sensor angles in a pinyon-juniper ecosystem in Grand Canyon National Park. The data are provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and scripts allow users to replicate, test, or further explore results. The file GrcaScpnModisCellCenters.csv contains locations (latitude-longitude) of all the 250-m MODIS (MOD09GQ) cell centers associated with the Grand Canyon pinyon-juniper ecosystem that the Southern Colorado Plateau Network (SCPN) is monitoring through its land surface phenology and integrated upland monitoring programs. The file SolarSensorAngles.csv contains MODIS angle measurements for the pixel at the phenocam location plus a random 100 point subset of pixels within the GRCA-PJ ecosystem. The script files (folder: 'Code') consist of 1) a Google Earth Engine (GEE) script used to download MODIS data through the GEE javascript interface, and 2) a script used to calculate derived variables and to test relationships between solar and sensor angles and NDVI using the statistical software package 'R'. The file Fig_8_NdviSolarSensor.JPG shows NDVI dependence on solar and sensor geometry demonstrated for both a single pixel/year and for multiple pixels over time. (Left) MODIS NDVI versus solar-to-sensor angle for the Grand Canyon phenocam location in 2018, the year for which there is corresponding phenocam data. (Right) Modeled r-squared values by year for 100 randomly selected MODIS pixels in the SCPN-monitored Grand Canyon pinyon-juniper ecosystem. The model for forward-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle. The model for back-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle + sensor zenith angle. Boxplots show interquartile ranges; whiskers extend to 10th and 90th percentiles. The horizontal line marking the average median value for forward-scatter r-squared (0.835) is nearly indistinguishable from the back-scatter line (0.833). The dataset folder also includes supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (e.g., the folders Rproj.user and packrat, and the files .RData and PhenocamPR.Rproj). The empty folder GEE_DataAngles is included so that the user can save the data files from the Google Earth Engine scripts to this location, where they can then be incorporated into the R processing scripts without needing to change folder names. To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may also use the descriptive documentation, the phenopix package documentation, and the description/references provided in the associated journal article to process the data to achieve the same results using newer packages or other software programs.
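As a hedged illustration of fitting the two models named above (the column names here are assumptions, not the actual schema of SolarSensorAngles.csv):

angles <- read.csv("SolarSensorAngles.csv")
# Forward-scatter model: log(NDVI) ~ solar-to-sensor angle
fwd <- lm(log(ndvi) ~ solar_sensor_angle, data = subset(angles, scatter == "forward"))
# Back-scatter model adds sensor zenith angle
bck <- lm(log(ndvi) ~ solar_sensor_angle + sensor_zenith, data = subset(angles, scatter == "back"))
c(forward = summary(fwd)$r.squared, back = summary(bck)$r.squared)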
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Sloan Digital Sky Survey (SDSS) is a comprehensive survey of the northern sky. This dataset contains a subset of that survey: 60,247 objects classified as galaxies. It includes a CSV file with a collection of information and a set of files for each object, namely JPG image files, FITS files, and spectra data. This dataset is used to train and explore the astromlp-models collection of deep learning models for galaxy characterisation.
The dataset includes a CSV data file where each row is an object from the SDSS database, and with the following columns (note that some data may not be available for all objects):
Besides the CSV file, a set of directories is included in the dataset. In each directory you'll find files named after the objid column from the CSV file, containing the corresponding data. The following directory tree is available:
sdss-gs/
├── data.csv
├── fits
├── img
├── spectra
└── ssel
Each directory contains the files of the corresponding type for every object; a minimal sketch of locating them via the objid column follows below.
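A hedged sketch in R (the .fits and .jpg file extensions are assumptions about the per-object file naming):

sdss <- read.csv("sdss-gs/data.csv")
oid <- sdss$objid[1]
file.path("sdss-gs/fits", paste0(oid, ".fits"))   # FITS data for this object
file.path("sdss-gs/img", paste0(oid, ".jpg"))     # JPG image for this object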
Changelog
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a clean subset of the data that was created by the OpenML R Bot, which executed benchmark experiments on the binary classification tasks of the OpenML100 benchmarking suite with six R algorithms: glmnet, rpart, kknn, svm, ranger and xgboost. The hyperparameters of these algorithms were drawn randomly. In total it contains more than 2.6 million benchmark experiments and can be used by other researchers. The subset was created by taking 500000 results of each learner (except for kknn, for which only 1140 results are available). The csv file for each learner is a table with one row per benchmark experiment, containing: OpenML data ID, hyperparameter values, performance measures (AUC, accuracy, Brier score), runtime, scimark (runtime reference of the machine), and some meta features of the dataset. OpenMLRandomBotResults.RData (format for R) contains all data in separate tables for the results, the hyperparameters, the meta features, the runtime, the scimark results and reference results.
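A minimal sketch of loading the data in R (the per-learner CSV filename is an assumption):

load("OpenMLRandomBotResults.RData")   # loads the separate result/hyperparameter/meta tables
ranger <- read.csv("ranger.csv")       # one row per benchmark experiment for this learner
head(ranger)                           # OpenML data ID, hyperparameters, AUC, accuracy, Brier score, ...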
Source Code for the manuscript "Characterizing Variability and Uncertainty for Parameter Subset Selection in PBPK Models" -- This R code generates the results presented in this manuscript; the zip folder contains PBPK model files (for chloroform and DCM) and corresponding scripts to compile the models, generate human equivalent doses, and run sensitivity analysis.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A subset of Common Crawl, extracted from the Colossal Clean Crawled Corpus (C4) dataset with the additional constraint that the extracted text safely encodes to ASCII. A Unigram tokenizer with a vocabulary of 12,228 tokens is provided, along with pre-tokenized data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

## Introduction

This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:

* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt

## Dataset creation

Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should request this data from Clarivate Analytics. However, we do make a similar but open dataset based on citations from PubMed Central which can be utilized to get results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of these datasets, please make sure you cite both the dataset and the paper introducing it.

* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
  - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
  - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1-29. https://doi.org/10.1145/1552303.1552304
  - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140-158. https://doi.org/10.1002/asi.20105
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
  - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
  - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
  - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
* MapAffil for identifying article country of affiliation:
  - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
  - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
* IMPLICIT journal similarity:
  - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
* Novelty dataset for identifying article-level novelty:
  - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
  - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
  - Code: https://github.com/napsternxg/Novelty
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

Note: The dataset is based on a snapshot of PubMed (which includes MEDLINE and PubMed-not-MEDLINE records) taken in the first week of October, 2016. See NLM's data Terms and Conditions for information on obtaining PubMed/MEDLINE data. Additional data-related updates can be found at the Torvik Research Group.

## Acknowledgments

This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License

Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
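As a usage note, a hedged sketch of reading one of the tab-separated files in R, attaching the shared header (this assumes the header file is itself a single tab-separated line):

hdr <- strsplit(readLines("Training_data_2002_2005_pmc_pair_txt.header.txt", n = 1), "\t")[[1]]
first <- read.delim("Training_data_2002_2005_pmc_pair_First.txt",
                    header = FALSE, col.names = hdr, nrows = 1000)  # nrows keeps the 1.2G file manageable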
This dataset is provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and code provided allow users to replicate, test, or further explore results. The dataset includes 2 raster datasets (folder: 'Rasters'): 1) 'cntWinterPks2003_2018DR' provides a count of years with winter peaks from 2003-2018 in an 11-state area in the western United States. 2) The 'VegClassGte5_2003_2018' raster, within the zip file 'WinterPeaksVegTypes.zip', identifies the broad vegetation types for locations with common winter peaks (5 or more years out of 16). The dataset also includes Google Earth Engine and R code files used to create the datasets. Additional files/folders provided include 1) Google Earth Engine scripts used to download MODIS data through the GEE javascript interface (folder: 'Code'). 2) Scripts used to manipulate rasters and to calculate and map the occurrence of winter NDVI peaks from 2003-2018 using the statistical software package 'R'. 3) Supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study, for example the folders 'Rproj.user' and 'packrat', and the files '.RData' and 'WinterPeakExtentPR.Rproj'. 4) Empty folders ('GEE_DataAnnPeak', 'GEE_DataLoose', and 'GEE_DataStrict') that should be used to contain the output from the GEE code files as follows: 'GEE_DataAnnPeak' should contain output from the S3 and S4 scripts, 'GEE_DataLoose' should contain output from the S1 script, and 'GEE_DataStrict' should contain output from the S2 script. 5) Graphic file 'Fig_9_MapsOfExtentPortrait2.jpg' shows temporal and ecosystem distribution of winter NDVI peaks in the western continental US, 2003 to 2018, derived from the MODIS MCD43A4 product. TOP: Number of years with winter peaks in areas that meet defined thresholds for biomass (median annual peak NDVI >= 0.15) and temperature (mean December minimum daily temperature <= 0°C). BOTTOM: Predominant LANDFIRE Existing Vegetation Type physiognomy (i.e., mode of each 500-m MODIS pixel) in areas with >= 5 years of winter peaks. Present in lesser proportions but not identified on the map for legibility reasons are conifer-hardwood, exotics, riparian, and sparsely vegetated physiognomic categories as well as non-natural/non-terrestrial ecosystem categories. State abbreviations are AZ (Arizona), CA (California), CO (Colorado), ID (Idaho), MT (Montana), NV (Nevada), NM (New Mexico), OR (Oregon), WA (Washington), and WY (Wyoming). The final steps of overlaying common winter peak extent data on the Landfire data were done using ArcGIS and the publicly available Landfire dataset (see source datasets section of metadata and process steps). To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may also use the descriptive documentation within this metadata along with the workflow described in the associated journal article to process the data to achieve the same results using newer packages or other software programs.
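A minimal sketch of inspecting the winter-peak count raster in R using the terra package (the path and file format are assumptions; the 5-of-16-years threshold is from the description above):

library(terra)
cnt <- rast("Rasters/cntWinterPks2003_2018DR")   # count of winter-peak years per pixel, 2003-2018
plot(cnt >= 5)                                   # locations with common winter peaks (5 or more of 16 years)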
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The zip-file contains the data and code accompanying the paper 'Effects of nutrient enrichment on freshwater macrophyte and invertebrate abundance: A meta-analysis'. Together, these files should allow for the replication of the results.
The 'raw_data' folder contains the 'MA_database.csv' file, which contains the extracted data from all primary studies that are used in the analysis. Furthermore, this folder contains the file 'MA_database_description.txt', which gives a description of each data column in the database.
The 'derived_data' folder contains the files that are produced by the R-scripts in this study and used for data analysis. The 'MA_database_processed.csv' and 'MA_database_processed.RData' files contain the converted raw database that is suitable for analysis. The 'DB_IA_subsets.RData' file contains the 'Individual Abundance' (IA) data subsets based on taxonomic group (invertebrates/macrophytes) and inclusion criteria. The 'DB_IA_VCV_matrices.RData' file contains the variance-covariance (VCV) matrices for all IA data subsets. The 'DB_AM_subsets.RData' file contains the 'Total Abundance' (TA) and 'Mean Abundance' (MA) data subsets based on taxonomic group (invertebrates/macrophytes) and inclusion criteria.
The 'output_data' folder contains folders with the output data for each data subset (i.e. for each metric, taxonomic group and set of inclusion criteria). For each data subset, the folder contains random effects selection results ('Results1_REsel_
The 'scripts' folder contains all R-scripts that we used for this study. The 'PrepareData.R' script takes the database as input and adjusts the file so that it can be used for data analysis. The 'PrepareDataIA.R' and 'PrepareDataAM.R' scripts make subsets of the data and prepare the data for the meta-regression analysis and mixed-effects regression analysis, respectively. The regression analyses are performed in the 'SelectModelsIA.R' and 'SelectModelsAM.R' scripts, which calculate the regression model results for the IA metric and the MA/TA metrics, respectively. These scripts require the 'RandomAndFixedEffects.R' script, containing the random and fixed effects parameter combinations, as well as the 'Functions.R' script. The 'CreateMap.R' script creates a global map with the location of all studies included in the analysis (figure 1 in the paper). The 'CreateForestPlots.R' script creates plots showing the IA data distribution for both taxonomic groups (figure 2 in the paper). The 'CreateHeatMaps.R' script creates heat maps for all metrics and taxonomic groups (figure 3 in the paper, figures S11.1 and S11.2 in the appendix). The 'CalculateStatistics.R' script calculates the descriptive statistics that are reported throughout the paper, and creates the figures that describe the dataset characteristics (figures S3.1 to S3.5 in the appendix). The 'CreateFunnelPlots.R' script creates the funnel plots for both taxonomic groups (figures S6.1 and S6.2 in the appendix) and performs Egger's tests. The 'CreateControlGraphs.R' script creates graphs showing the dependency of the nutrient response on control concentrations for all metrics and taxonomic groups (figures S10.1 and S10.2 in the appendix). A sketch of the implied run order is given after the folder descriptions below.
The 'figures' folder contains all figures that are included in this study.
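Based on the script descriptions above, a hedged sketch of the implied run order in R (the exact dependencies between scripts are not stated in this summary):

source("scripts/PrepareData.R")      # convert the raw database for analysis
source("scripts/PrepareDataIA.R")    # build IA subsets (and VCV matrices)
source("scripts/PrepareDataAM.R")    # build TA/MA subsets
source("scripts/SelectModelsIA.R")   # meta-regression for the IA metric
source("scripts/SelectModelsAM.R")   # mixed-effects regression for MA/TA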
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: These are results from a network of 65 tree census plots in Panama. At each, every individual stem in a rectangular area of specified size is given a unique number and identified to species; stem diameter is then measured in one or more censuses. Data from these numerous plots and inventories were collected following the same methods as, and species identity harmonized with, the 50-ha long-term tree census at Barro Colorado Island. The precise location of every site, elevation, and estimated rainfall (for many sites) are also included. These data were gathered over many years, starting in 1994 and continuing to the present, by principal investigators R. Condit, R. Perez, S. Lao, and S. Aguilar. Funding has been provided by many organizations.
Description:
marenaRecent.full.Rdata5Jan2013.zip: A zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This and all other tables labelled 'full' have one record per individual tree found in that census. Detailed documentation of the 'full' tables is given in RoutputFull.pdf (see component 10 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). This one file, 'marenaRecent.full1.rdata', has data from the latest census at 60 different plots. These are the best data to use if only a single plot census is needed.
marena2cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for 44 plots with two censuses: 'marena2cns.full1.rdata' for the first census and 'marena2cns.full2.rdata' for the second census. These 44 plots are a subset of the 60 found in marenaRecent.full (component 1): the 44 that have been censused two or more times. These are the best data to use if two plot censuses are needed.
marena3cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for nine plots with three censuses: 'marena3cns.full1.rdata' for the first census through 'marena3cns.full3.rdata' for the third census. These nine plots are a subset of the 44 found in marena2cns.full (component 2): the nine that have been censused three or more times. These are the best data to use if three plot censuses are needed.
marena4cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for six plots with four censuses: 'marena4cns.full1.rdata' for the first census through 'marena4cns.full4.rdata' for the fourth census. These six plots are a subset of the nine found in marena3cns.full (component 3): the six that have been censused four or more times. These are the best data to use if four plot censuses are needed.
marenaRecent.stem.Rdata5Jan2013.zip: A zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This one file, 'marenaRecent.stem1.rdata', has data from the latest census at 60 different plots. The table has one record per individual stem, necessary because some individual trees have more than one stem. Detailed documentation of the 'stem' tables is given in RoutputStem.pdf (see component 11 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). These are the best data to use if only a single plot census is needed, and individual stems are desired.
marena2cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for 44 plots with two censuses: 'marena2cns.stem1.rdata' for the first census and 'marena2cns.stem2.rdata' for the second census. These 44 plots are a subset of the 60 found in marenaRecent.stem (component 5): the 44 that have been censused two or more times. These are the best data to use if two plot censuses are needed, and individual stems are desired.
marena3cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for nine plots with three censuses: 'marena3cns.stem1.rdata' for the first census through 'marena3cns.stem3.rdata' for the third census. These nine plots are a subset of the 44 found in marena2cns.stem (component 6): the nine that have been censused three or more times. These are the best data to use if three plot censuses are needed, and individual stems are desired.
marena4cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for six plots with four censuses: 'marena4cns.stem1.rdata' for the first census through 'marena4cns.stem4.rdata' for the fourth census. These six plots are a subset of the nine found in marena3cns.stem (component 7): the six that have been censused four or more times. These are the best data to use if four plot censuses are needed, and individual stems are desired.
bci.spptable.rdata: A list of the 1414 species found across all tree plots and inventories in Panama, in R format. The column 'sp' in this table is a code identifying the species in the full census tables (marena.full and marena.stem, components 1-4 and 5-8 above).
RoutputFull.pdf: Detailed documentation of the 'full' tables in Rdata format (components 1-4 above).
RoutputStem.pdf: Detailed documentation of the 'stem' tables in Rdata format (components 5-8 above).
PanamaPlot.txt: Locations of all tree plots and inventories in Panama.
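A minimal sketch of working with the R Analytical Tables (the object names inside the .rdata files are assumptions based on the file names):

load("marena2cns.full1.rdata")   # first census, one record per tree
load("marena2cns.full2.rdata")   # second census of the same 44 plots
load("bci.spptable.rdata")       # species reference table
census1 <- merge(marena2cns.full1, bci.spptable, by = "sp")   # attach species names via the 'sp' code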
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Remarks on kernels and bandwidth selection for semiparametric density product estimator method. (DOC)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS (European Journal of Media Studies), an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for webscraping. They were written in R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
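Both string-distance methods named here are available in R's stringdist package; a minimal illustration (not the authors' exact code):

library(stringdist)
# Cosine similarity on character bigrams: high for reordered but similar titles
stringsim("The Matrix", "Matrix, The", method = "cosine", q = 2)
# Optimal string alignment (osa): tolerant of typos and transpositions
stringsim("The Matrix", "The Matirx", method = "osa")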
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates functions for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check whether everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format; all information for each festival is listed in one row.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset is a shapefile which is a subset for the Hunter subregion containing geographical locations and other characteristics (see below) of streamflow gauging stations.
There are 3 files that have been extracted from the Hydstra database to aid in identifying sites in the Hunter subregion and the type of data collected from each one.
The 3 files are:
Site - lists all sites available in Hydstra from data providers. The data provider is listed in the #Station field as _xxx; for example, sites in NSW are _77 and sites in QLD are _66.
Some sites do not have locational information and will not be able to be plotted.
Period - the period table lists all the variables that are recorded at each site and the period of record.
Variable - the variable table shows variable codes and names, which can be linked to the period table (a sketch of this join appears below).
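A hedged sketch of linking the tables in R (the file and key-column names are assumptions):

period   <- read.csv("Period.csv")     # period of record per site and variable
variable <- read.csv("Variable.csv")   # variable codes and names
merge(period, variable, by = "variable_code")   # 'variable_code' is a placeholder key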
Locations are used as pour points in order to define reach areas for river system modelling.
Subset of data for the Hunter subregion that was extracted from the Bureau of Meteorology's Hydstra system; it includes all gauges where data has been received from the lead water agency of each jurisdiction. The gauges shapefile for all bioregions was intersected with the Hunter subregion boundary to identify and extract gauges within the subregion.
Bioregional Assessment Programme (2016) HUN AWRA-R calibration nodes v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/f2da394a-3d08-4cf4-8c24-bf7751ea06a1.
Derived From Gippsland Project boundary
Derived From Bioregional Assessment areas v04
Derived From Natural Resource Management (NRM) Regions 2010
Derived From Bioregional Assessment areas v03
Derived From Victoria - Seamless Geology 2014
Derived From Bioregional Assessment areas v05
Derived From National Surface Water sites Hydstra
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From GEODATA TOPO 250K Series 3
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Geological Provinces - Full Extent
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the csv into R; here is the code for how you do this:
df <- read.csv('uber.csv')                # load the cleaned Uber data
df_black <- subset(df, name == 'Black')   # keep only the 'Black' car type
write.csv(df_black, "nameofthefileyouwanttosaveas.csv", row.names = FALSE)
getwd()                                   # shows the folder the file was written to
Data from the IFLS, merged across waves; most outcomes taken from wave 5. Includes birth order, family structure, Big 5 personality, intelligence tests, and risk lotteries.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
[truncated]
This dataset was automatically described using the codebook R package (version 0.8.2).
https://dataverse.nl/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.34894/9Q0FVO
This dataset contains the processed RNA sequencing data of purified CD1c-positive conventional type 2 dendritic cells (CD1c+ cDC2s), functional enrichment analysis, manual and automatic (i.e., flowSOM) gating data of flow cytometry, and multiplex cytokine analyses as outlined in Hiddingh et al. 2022, "Transcriptome network analysis implicates CX3CR1-positive type 3 dendritic cells in non-infectious uveitis" (see preprint on BioRxiv). Data are from two cohorts (cohort I, n=36, and cohort II, n=42) of in total 51 patients with non-infectious uveitis (HLA-B27-positive acute anterior uveitis, idiopathic intermediate uveitis, and HLA-A29-positive Birdshot Uveitis (Birdshot chorioretinopathy)) and 27 sex/age-matched healthy controls without ocular inflammatory disease. All raw sequencing data are available at NCBI SRA under the accession numbers GSE195501 (FACS-sorted cohort I) and GSE194060 (MACS-sorted cohort II). This DataverseNL dataset contains additional raw data, processed data, and metadata (see the readme file) and reproducible R notebooks (R script and image) used for the analysis in the manuscript:
R scripts (markdown + R image) with step-by-step analyses:
* Figure_1.rmd (see "Figure_1.html")
* Figure_2.rmd (see "Figure_2.html")
* Figure_3.rmd (see "Figure_3.html")
* Figure_4.rmd (see "Figure_4.html")
* Figure_5.rmd (see "Figure_5.html")
Processed RNA-seq data (including WGCNA) (see folder Uveitis_mDC in files)
Experimental data:
* Manual gating data of MACS-sorted fractions, cohort I (see here)
* Manual gating data for CD1c+ cDC2 subsets in PBMCs (see here)
* Manual gating of CD14+ and CD14- CD1c+ cDC2 fractions from buffy coats (see here)
* qPCR data for CX3CR1, CCR5, CCR2, IRF8, TLR7, RUNX3 and CD36 in sorted CD14+ and CD14- CD1c+ DCs (see here)
* qPCR data (fold change compared to medium) for RUNX3 and CD36 in overnight stimulated cDC2 cultures (see here)
* Cell phenotypes identified by flowSOM (7x7 grid) using the cDC2-subset flow cytometry panel (see here)
* IL-23 ELISA concentration in supernatant of overnight LTA-stimulated cDC2 subset cultures (see here)
* Luminex Multiplex Cytokine analysis of supernatant of overnight LTA-stimulated cDC2 subset cultures (see here)
Other transcriptomic data used in the R scripts (above):
* WT untreated cDC2 versus cDC2 from Runx3-11cKO mice: GSE48590, generated by Dicken et al., PLoS One 2013
* WT untreated cDC2 versus cDC2 from Notch2-11cKO mice: GSE119242, generated by Briseño et al., Proc Natl Acad Sci U S A 2018
* Sorted CD14+CD5-CD163+ and CD14-CD5-CD163+ cDC2s from SLE and scleroderma patients: GSE136731, generated by Dutertre et al., Immunity 2019
* Single-cell RNA-seq of aqueous humor from 4 HLA-B27-positive uveitis patients and control: GSE178833, generated by Kasper et al., Elife 2021
* Inflammatory [inf-]cDC2s: GSE149619, generated by Bosteels et al., Immunity 2020
* RNA-seq data from cDC2s generated from murine bone marrow cells in co-culture with the stromal OP-9 cell line transduced with or without expression of the Notch ligand Delta-like 1: GSE110577, generated by Kirkling et al., Cell Rep 2018
The data release includes part of the bottom-trawl and gill-net survey data collected between 1952 and 1962 from the research vessel R/V Cisco. The bottom-trawl dataset includes tables for fishing operations and effort (BT_OP.csv), fish catch (BT_Catch.csv), and individual length-weight-sex-maturity (LWSM) records (BT_Fish.csv) for only a subset of species (details below). The gill-net dataset includes tables for fishing operations (GN_OP.csv), fishing effort (GN_Effort.csv), fish catch (GN_Catch.csv), and individual LWSM records (GN_Fish.csv) for only a subset of species (details below). Two reference tables, BT_Spec.csv and Species.csv, are used for bottom trawl specifications and fish species names, respectively.
This data release is part of the project "Historical habitat use by Coregonus artedi in the Upper Great Lakes and critical embayments" funded by Environmental Protection Agency's Great Lakes Restoration Initiative. Due to the scope of this project, the catch tables (GN_Catch.csv and BT_Catch.csv) only include data for alewife (Alosa pseudoharengus), rainbow smelt (Osmerus mordax), and cisco (Coregonus artedi). LWSM data tables (GN_Fish.csv and BT_Fish.csv) only include data for cisco. Additional data collected from the surveys are archived at the Great Lakes Science Center.
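A hedged sketch of joining catch records to fishing operations in R (the key column is a placeholder; see the release metadata for the actual keys):

op    <- read.csv("BT_OP.csv")      # bottom-trawl operations and effort
catch <- read.csv("BT_Catch.csv")   # catch per operation (alewife, rainbow smelt, cisco)
merge(catch, op, by = "op_id")      # 'op_id' is an assumed key column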
This dataset provides calculated camera-NDVI data for individual regions-of-interest (ROIs) for the phenocam named 'GRCA1PJ' (part of the Phenocam Network, https://phenocam.sr.unh.edu/webcam/). The GRCA1PJ phenocam is within a pinyon-juniper woodland in Grand Canyon National Park. Camera-NDVI refers to a modified version of NDVI calculated by the phenopix package (Filippa et al., 2016). The camera-calculated NDVI data are in the folder FinalOutput. File attributes within that folder are described in detail in the entity and attribute information section of this metadata. It should be possible for the user to use only the ROI definitions, image data downloaded from the phenocam network, and the phenopix R-package to reproduce the final NDVI dataset. However, the dataset also contains scripts and intermediate files that may be helpful in reproducing or extending the processing, but are not essential to reproducing the data. The complete dataset release includes 1) A workflow spreadsheet file that describes the processing steps, associated scripts, and output filenames (filename: Workflow_With_Filenames.ods). 2) R-code script files used in processing (folder: 'Code'). 3) ROI boundary files and jpg images for the ROIs presented in the linked publication (folder: "Phenocamdata/grca1pj/ROI"). 4) Ancillary files used to create the NDVI dataset; these include exposure coordinates and training files (folder: 'Phenocamdata/grca1pj/Ancillary'). 5) Files listing exposures for individual photos within the initial processing time period (folder: 'Exposures'). 6) Screening parameters for cloud and poor-light-condition screening of photos, as well as a list of photos that meet the cloud-screening standards (folder: 'Phenocamdata/grca1pj/BlueSkyScreening'). 7) Vegetation index files produced by the phenopix package, organized by ROI and month-year group (folder: "Phenocamdata/grca1pj/VI_Tables"). 8) Supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (e.g., the folders Rproj.user and packrat, and the files .RData and PhenocamPR.Rproj). 9) The graphic 'Fig_4_ROIWithLabels.jpg' shows the phenocam field of view with labelled ROIs. Outline colors correspond to juniper (red), pinyon (blue), and other species (yellow). Labels correspond to NDVI curves in 'Fig_7_PhenocamCurves.JPG' (also included in this data release). The composite area comprises the field of view beneath the approximate horizon line labelled ‘J’ (gray). This image corresponds to Figure 4 in the associated journal article. 10) The graphic 'Fig_7_PhenocamCurves.JPG' shows NDVI curves derived from phenocam images from September 2017 - December 2018 for individual regions of interest (ROIs). Letter designations correspond to ROI labels in Fig_4_ROIWithLabels.jpg (also included in this data release). Data were screened to remove cloudy photos during Aqua and Terra flyover hours. Black ellipses indicate times when the ROI target vegetation was shaded. Red ellipses indicate times when the background of the ROI was shaded. To improve visibility, the Y axis is restricted and excludes 37 extreme values out of a total of 6698 values. The exposure adjustment method used by the phenopix package produces NDVI values that have a strong linear correlation with spectroradiometer-derived NDVI but are negatively shifted, so that vegetated areas often have NDVI values below zero. This image corresponds to Figure 7 in the associated journal article.
The file types .Rdata or .rds are commonly used in this release because these are the types created by the phenopix processing package, and these files will be needed (or the user will need to recreate new versions) for further processing. The scripts enable the user to replicate processing or to extend it to different times or areas of interest; however, these scripts require as additional input phenocam imagery that the user must download. To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may also use the descriptive documentation, the phenopix package documentation, and the description/references provided in the associated journal article to process the data to achieve the same results using newer packages or other software programs. Species-specific phenological curves included in the NDVI output section of this dataset: Juniperus osteosperma, Pinus edulis, Purshia stansburiana, Artemisia tridentata, and Chamaebatiaria millefolium.
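A minimal sketch of restoring the recorded package versions with packrat (opening PhenocamPR.Rproj in RStudio may do this automatically; see the packrat documentation linked above):

install.packages("packrat")
packrat::on()        # activate the project's private packrat library
packrat::restore()   # reinstall the package versions recorded in the packrat/ folder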
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The Dutch CELEX data is derived from R.H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM), Release 2, Dutch Version 3.1, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995. Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For the Dutch data, frequencies have been disambiguated on the basis of the 42.4m Dutch Instituut voor Nederlandse Lexicologie text corpora. To make for greater compatibility with other operating systems, the databases have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files, which can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files. This database can be divided into different subsets:
· orthography: with or without diacritics, with or without word division positions, alternative spellings, number of letters/syllables;
· phonology: phonetic transcriptions with syllable boundaries or primary and secondary stress markers, consonant-vowel patterns, number of phonemes/syllables, alternative pronunciations, frequency per phonetic syllable within words;
· morphology: division into stems and affixes, flat or hierarchical representations, stems and their inflections;
· syntax: word class, subcategorisations per word class;
· frequency of the entries: disambiguated for homographic lemmata.
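A hedged sketch of querying one of the plain ASCII files in R instead of AWK or ICON (the filename and the backslash field separator are assumptions about the CELEX distribution):

# Each line is one entry; the unique identity number in the first field
# links entries across files
dpw <- read.delim("DPW.CD", sep = "\\", header = FALSE, quote = "")
head(dpw[[1]])   # identity numbers used to link information across files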
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This index was compiled by Miss Mollie Bentley from various records she has used relating to the police. These include: Almanac listings, Colonial Secretary's Office Records, Police Gazettes, various police department occurrence books and letter books, police journals, government gazettes, estimates, York police records etc.
Entry is by name of policeman. Information given varies but is usually about appointments, promotions, retirements, transfers etc.
The Western Australian Biographical Index (WABI) is a highly used resource at the State Library of Western Australia. A recent generous contribution by the Friends of Battye Library (FOBS) has enabled SLWA to have the original handwritten index cards scanned and later transcribed.
The dataset contains several csv files with data describing card number, card text and a url link to an image of the original handwritten card.
The transcription was crowd-sourced and we are aware that there are some data quality issues, including:
* Some cards are missing
* Transcripts are crowdsourced so may contain spelling errors and possibly missing information
* Some cards are crossed out; some of these are included in the collection and some are not
* Some of the cards contain relevant information on the back (usually children of the person mentioned); this info should be on the next consecutive card
* As the information is an index, collected in the 1970s from print material, it is incomplete. It is also unreferenced.
It is still a very valuable dataset as it contains a wealth of information about early settlers in Western Australia. It is of particular interest to genealogists and historians.