Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
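A minimal sketch of the REVS idea in R may help make this concrete (this is not the authors' implementation: the data here are simulated, and per-variable empirical support is approximated by summed Akaike weights from the MuMIn package):

library(MuMIn)
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), x4 = rnorm(100))
dat$y <- dat$x1 + 0.5 * dat$x2 + rnorm(100)
options(na.action = "na.fail")                         # required by dredge()
all_subsets <- dredge(lm(y ~ x1 + x2 + x3 + x4, data = dat))  # all-subsets regression
support <- sort(sw(all_subsets), decreasing = TRUE)    # per-variable empirical support
vars <- names(support)
# Nested series: model i contains the i most-supported variables
models <- lapply(seq_along(vars), function(i)
  lm(reformulate(vars[1:i], response = "y"), data = dat))
sapply(models, AIC)                                    # only n models to compare post-hoc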
This dataset provides geospatial location data and scripts used to analyze the relationship between MODIS-derived NDVI and solar and sensor angles in a pinyon-juniper ecosystem in Grand Canyon National Park. The data are provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and scripts allow users to replicate, test, or further explore results. The file GrcaScpnModisCellCenters.csv contains locations (latitude-longitude) of all the 250-m MODIS (MOD09GQ) cell centers associated with the Grand Canyon pinyon-juniper ecosystem that the Southern Colorado Plateau Network (SCPN) is monitoring through its land surface phenology and integrated upland monitoring programs. The file SolarSensorAngles.csv contains MODIS angle measurements for the pixel at the phenocam location plus a random 100 point subset of pixels within the GRCA-PJ ecosystem. The script files (folder: 'Code') consist of 1) a Google Earth Engine (GEE) script used to download MODIS data through the GEE javascript interface, and 2) a script used to calculate derived variables and to test relationships between solar and sensor angles and NDVI using the statistical software package 'R'. The file Fig_8_NdviSolarSensor.JPG shows NDVI dependence on solar and sensor geometry demonstrated for both a single pixel/year and for multiple pixels over time. (Left) MODIS NDVI versus solar-to-sensor angle for the Grand Canyon phenocam location in 2018, the year for which there is corresponding phenocam data. (Right) Modeled r-squared values by year for 100 randomly selected MODIS pixels in the SCPN-monitored Grand Canyon pinyon-juniper ecosystem. The model for forward-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle. The model for back-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle + sensor zenith angle. Boxplots show interquartile ranges; whiskers extend to 10th and 90th percentiles. The horizontal line marking the average median value for forward-scatter r-squared (0.835) is nearly indistinguishable from the back-scatter line (0.833). The dataset folder also includes supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (e.g., the folders Rproj.user and packrat, and the files .RData and PhenocamPR.Rproj). The empty folder GEE_DataAngles is included so that the user can save the data files from the Google Earth Engine scripts to this location, where they can then be incorporated into the R processing scripts without needing to change folder names. To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may also use the descriptive documentation, the phenopix package documentation, and the description/references provided in the associated journal article to process the data to achieve the same results using newer packages or other software programs.
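As a hedged illustration of fitting the two models named above (the column names here are assumptions, not the actual schema of SolarSensorAngles.csv):

angles <- read.csv("SolarSensorAngles.csv")
# Forward-scatter model: log(NDVI) ~ solar-to-sensor angle
fwd <- lm(log(ndvi) ~ solar_sensor_angle, data = subset(angles, scatter == "forward"))
# Back-scatter model adds sensor zenith angle
bck <- lm(log(ndvi) ~ solar_sensor_angle + sensor_zenith, data = subset(angles, scatter == "back"))
c(forward = summary(fwd)$r.squared, back = summary(bck)$r.squared)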
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Sloan Digital Sky Survey (SDSS) is a comprehensive survey of the northern sky. This dataset contains a subset of that survey: 60,247 objects classified as galaxies. It includes a CSV file with a collection of information and a set of files for each object, namely JPG image files, FITS files, and spectra data. This dataset is used to train and explore the astromlp-models collection of deep learning models for galaxy characterisation.
The dataset includes a CSV data file where each row is an object from the SDSS database, and with the following columns (note that some data may not be available for all objects):
Besides the CSV file, a set of directories is included in the dataset. In each directory you'll find files named after the objid column from the CSV file, containing the corresponding data. The following directory tree is available:
sdss-gs/
├── data.csv
├── fits
├── img
├── spectra
└── ssel
Each directory contains the files of the corresponding type for every object; a minimal sketch of locating them via the objid column follows below.
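A hedged sketch in R (the .fits and .jpg file extensions are assumptions about the per-object file naming):

sdss <- read.csv("sdss-gs/data.csv")
oid <- sdss$objid[1]
file.path("sdss-gs/fits", paste0(oid, ".fits"))   # FITS data for this object
file.path("sdss-gs/img", paste0(oid, ".jpg"))     # JPG image for this object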
Changelog
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a clean subset of the data that was created by the OpenML R Bot, which executed benchmark experiments on the binary classification tasks of the OpenML100 benchmarking suite with six R algorithms: glmnet, rpart, kknn, svm, ranger and xgboost. The hyperparameters of these algorithms were drawn randomly. In total it contains more than 2.6 million benchmark experiments and can be used by other researchers. The subset was created by taking 500000 results of each learner (except for kknn, for which only 1140 results are available). The csv file for each learner is a table with one row per benchmark experiment, containing: OpenML data ID, hyperparameter values, performance measures (AUC, accuracy, Brier score), runtime, scimark (runtime reference of the machine), and some meta features of the dataset. OpenMLRandomBotResults.RData (format for R) contains all data in separate tables for the results, the hyperparameters, the meta features, the runtime, the scimark results and reference results.
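A minimal sketch of loading the data in R (the per-learner CSV filename is an assumption):

load("OpenMLRandomBotResults.RData")   # loads the separate result/hyperparameter/meta tables
ranger <- read.csv("ranger.csv")       # one row per benchmark experiment for this learner
head(ranger)                           # OpenML data ID, hyperparameters, AUC, accuracy, Brier score, ...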
Source Code for the manuscript "Characterizing Variability and Uncertainty for Parameter Subset Selection in PBPK Models" -- This R code generates the results presented in this manuscript; the zip folder contains PBPK model files (for chloroform and DCM) and corresponding scripts to compile the models, generate human equivalent doses, and run sensitivity analysis.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A subset of Common Crawl, extracted from the Colossal Clean Crawled Corpus (C4) dataset with the additional constraint that the extracted text safely encodes to ASCII. A Unigram tokenizer with a vocabulary of 12,228 tokens is provided, along with pre-tokenized data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

## Introduction

This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:

* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt

## Dataset creation

Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should request this data from Clarivate Analytics. However, we do make a similar but open dataset based on citations from PubMed Central which can be utilized to get results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of these datasets, please make sure you cite both the dataset and the paper introducing it.

* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
  - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
  - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1-29. https://doi.org/10.1145/1552303.1552304
  - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140-158. https://doi.org/10.1002/asi.20105
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
  - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
  - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
  - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
* MapAffil for identifying article country of affiliation:
  - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
  - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
* IMPLICIT journal similarity:
  - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
* Novelty dataset for identifying article-level novelty:
  - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
  - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
  - Code: https://github.com/napsternxg/Novelty
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

Note: The dataset is based on a snapshot of PubMed (which includes MEDLINE and PubMed-not-MEDLINE records) taken in the first week of October, 2016. See NLM's data Terms and Conditions for information on obtaining PubMed/MEDLINE data. Additional data-related updates can be found at the Torvik Research Group.

## Acknowledgments

This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License

Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
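As a usage note, a hedged sketch of reading one of the tab-separated files in R, attaching the shared header (this assumes the header file is itself a single tab-separated line):

hdr <- strsplit(readLines("Training_data_2002_2005_pmc_pair_txt.header.txt", n = 1), "\t")[[1]]
first <- read.delim("Training_data_2002_2005_pmc_pair_First.txt",
                    header = FALSE, col.names = hdr, nrows = 1000)  # nrows keeps the 1.2G file manageable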
This dataset is provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and code provided allow users to replicate, test, or further explore results. The dataset includes 2 raster datasets (folder: 'Rasters'): 1) 'cntWinterPks2003_2018DR' provides a count of years with winter peaks from 2003-2018 in an 11-state area in the western United States. 2) The 'VegClassGte5_2003_2018' raster, within the zip file 'WinterPeaksVegTypes.zip', identifies the broad vegetation types for locations with common winter peaks (5 or more years out of 16). The dataset also includes Google Earth Engine and R code files used to create the datasets. Additional files/folders provided include 1) Google Earth Engine scripts used to download MODIS data through the GEE javascript interface (folder: 'Code'). 2) Scripts used to manipulate rasters and to calculate and map the occurrence of winter NDVI peaks from 2003-2018 using the statistical software package 'R'. 3) Supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study, for example the folders 'Rproj.user' and 'packrat', and the files '.RData' and 'WinterPeakExtentPR.Rproj'. 4) Empty folders ('GEE_DataAnnPeak', 'GEE_DataLoose', and 'GEE_DataStrict') that should be used to contain the output from the GEE code files as follows: 'GEE_DataAnnPeak' should contain output from the S3 and S4 scripts, 'GEE_DataLoose' should contain output from the S1 script, and 'GEE_DataStrict' should contain output from the S2 script. 5) Graphic file 'Fig_9_MapsOfExtentPortrait2.jpg' shows temporal and ecosystem distribution of winter NDVI peaks in the western continental US, 2003 to 2018, derived from the MODIS MCD43A4 product. TOP: Number of years with winter peaks in areas that meet defined thresholds for biomass (median annual peak NDVI >= 0.15) and temperature (mean December minimum daily temperature <= 0°C). BOTTOM: Predominant LANDFIRE Existing Vegetation Type physiognomy (i.e., mode of each 500-m MODIS pixel) in areas with >= 5 years of winter peaks. Present in lesser proportions but not identified on the map for legibility reasons are conifer-hardwood, exotics, riparian, and sparsely vegetated physiognomic categories as well as non-natural/non-terrestrial ecosystem categories. State abbreviations are AZ (Arizona), CA (California), CO (Colorado), ID (Idaho), MT (Montana), NV (Nevada), NM (New Mexico), OR (Oregon), WA (Washington), and WY (Wyoming). The final steps of overlaying common winter peak extent data on the Landfire data were done using ArcGIS and the publicly available Landfire dataset (see source datasets section of metadata and process steps). To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may also use the descriptive documentation within this metadata along with the workflow described in the associated journal article to process the data to achieve the same results using newer packages or other software programs.
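A minimal sketch of inspecting the winter-peak count raster in R using the terra package (the path and file format are assumptions; the 5-of-16-years threshold is from the description above):

library(terra)
cnt <- rast("Rasters/cntWinterPks2003_2018DR")   # count of winter-peak years per pixel, 2003-2018
plot(cnt >= 5)                                   # locations with common winter peaks (5 or more of 16 years)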
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The zip-file contains the data and code accompanying the paper 'Effects of nutrient enrichment on freshwater macrophyte and invertebrate abundance: A meta-analysis'. Together, these files should allow for the replication of the results.
The 'raw_data' folder contains the 'MA_database.csv' file, which contains the extracted data from all primary studies that are used in the analysis. Furthermore, this folder contains the file 'MA_database_description.txt', which gives a description of each data column in the database.
The 'derived_data' folder contains the files that are produced by the R-scripts in this study and used for data analysis. The 'MA_database_processed.csv' and 'MA_database_processed.RData' files contain the converted raw database that is suitable for analysis. The 'DB_IA_subsets.RData' file contains the 'Individual Abundance' (IA) data subsets based on taxonomic group (invertebrates/macrophytes) and inclusion criteria. The 'DB_IA_VCV_matrices.RData' file contains the variance-covariance (VCV) matrices for all IA data subsets. The 'DB_AM_subsets.RData' file contains the 'Total Abundance' (TA) and 'Mean Abundance' (MA) data subsets based on taxonomic group (invertebrates/macrophytes) and inclusion criteria.
The 'output_data' folder contains folders with the output data for each data subset (i.e. for each metric, taxonomic group and set of inclusion criteria). For each data subset, the folder contains random effects selection results ('Results1_REsel_
The 'scripts' folder contains all R-scripts that we used for this study. The 'PrepareData.R' script takes the database as input and adjusts the file so that it can be used for data analysis. The 'PrepareDataIA.R' and 'PrepareDataAM.R' scripts make subsets of the data and prepare the data for the meta-regression analysis and mixed-effects regression analysis, respectively. The regression analyses are performed in the 'SelectModelsIA.R' and 'SelectModelsAM.R' scripts, which calculate the regression model results for the IA metric and the MA/TA metrics, respectively. These scripts require the 'RandomAndFixedEffects.R' script, containing the random and fixed effects parameter combinations, as well as the 'Functions.R' script. The 'CreateMap.R' script creates a global map with the location of all studies included in the analysis (figure 1 in the paper). The 'CreateForestPlots.R' script creates plots showing the IA data distribution for both taxonomic groups (figure 2 in the paper). The 'CreateHeatMaps.R' script creates heat maps for all metrics and taxonomic groups (figure 3 in the paper, figures S11.1 and S11.2 in the appendix). The 'CalculateStatistics.R' script calculates the descriptive statistics that are reported throughout the paper, and creates the figures that describe the dataset characteristics (figures S3.1 to S3.5 in the appendix). The 'CreateFunnelPlots.R' script creates the funnel plots for both taxonomic groups (figures S6.1 and S6.2 in the appendix) and performs Egger's tests. The 'CreateControlGraphs.R' script creates graphs showing the dependency of the nutrient response on control concentrations for all metrics and taxonomic groups (figures S10.1 and S10.2 in the appendix). A sketch of the implied run order is given after the folder descriptions below.
The 'figures' folder contains all figures that are included in this study.
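Based on the script descriptions above, a hedged sketch of the implied run order in R (the exact dependencies between scripts are not stated in this summary):

source("scripts/PrepareData.R")      # convert the raw database for analysis
source("scripts/PrepareDataIA.R")    # build IA subsets (and VCV matrices)
source("scripts/PrepareDataAM.R")    # build TA/MA subsets
source("scripts/SelectModelsIA.R")   # meta-regression for the IA metric
source("scripts/SelectModelsAM.R")   # mixed-effects regression for MA/TA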
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: These are results from a network of 65 tree census plots in Panama. At each, every individual stem in a rectangular area of specified size is given a unique number and identified to species; stem diameter is then measured in one or more censuses. Data from these numerous plots and inventories were collected following the same methods as, and species identity harmonized with, the 50-ha long-term tree census at Barro Colorado Island. The precise location of every site, elevation, and estimated rainfall (for many sites) are also included. These data were gathered over many years, starting in 1994 and continuing to the present, by principal investigators R. Condit, R. Perez, S. Lao, and S. Aguilar. Funding has been provided by many organizations.
Description:
marenaRecent.full.Rdata5Jan2013.zip: A zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This and all other tables labelled 'full' have one record per individual tree found in that census. Detailed documentation of the 'full' tables is given in RoutputFull.pdf (see component 10 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). This one file, 'marenaRecent.full1.rdata', has data from the latest census at 60 different plots. These are the best data to use if only a single plot census is needed.
marena2cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for 44 plots with two censuses: 'marena2cns.full1.rdata' for the first census and 'marena2cns.full2.rdata' for the second census. These 44 plots are a subset of the 60 found in marenaRecent.full (component 1): the 44 that have been censused two or more times. These are the best data to use if two plot censuses are needed.
marena3cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for nine plots with three censuses: 'marena3cns.full1.rdata' for the first census through 'marena3cns.full3.rdata' for the third census. These nine plots are a subset of the 44 found in marena2cns.full (component 2): the nine that have been censused three or more times. These are the best data to use if three plot censuses are needed.
marena4cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for six plots with four censuses: 'marena4cns.full1.rdata' for the first census through 'marena4cns.full4.rdata' for the fourth census. These six plots are a subset of the nine found in marena3cns.full (component 3): the six that have been censused four or more times. These are the best data to use if four plot censuses are needed.
marenaRecent.stem.Rdata5Jan2013.zip: A zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This one file, 'marenaRecent.stem1.rdata', has data from the latest census at 60 different plots. The table has one record per individual stem, necessary because some individual trees have more than one stem. Detailed documentation of the 'stem' tables is given in RoutputStem.pdf (see component 11 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). These are the best data to use if only a single plot census is needed, and individual stems are desired.
marena2cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for 44 plots with two censuses: 'marena2cns.stem1.rdata' for the first census and 'marena2cns.stem2.rdata' for the second census. These 44 plots are a subset of the 60 found in marenaRecent.stem (component 5): the 44 that have been censused two or more times. These are the best data to use if two plot censuses are needed, and individual stems are desired.
marena3cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for nine plots with three censuses: 'marena3cns.stem1.rdata' for the first census through 'marena3cns.stem3.rdata' for the third census. These nine plots are a subset of the 44 found in marena2cns.stem (component 6): the nine that have been censused three or more times. These are the best data to use if three plot censuses are needed, and individual stems are desired.
marena4cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for six plots with four censuses: 'marena4cns.stem1.rdata' for the first census through 'marena4cns.stem4.rdata' for the fourth census. These six plots are a subset of the nine found in marena3cns.stem (component 7): the six that have been censused four or more times. These are the best data to use if four plot censuses are needed, and individual stems are desired.
bci.spptable.rdata: A list of the 1414 species found across all tree plots and inventories in Panama, in R format. The column 'sp' in this table is a code identifying the species in the full census tables (marena.full and marena.stem, components 1-4 and 5-8 above).
RoutputFull.pdf: Detailed documentation of the 'full' tables in Rdata format (components 1-4 above).
RoutputStem.pdf: Detailed documentation of the 'stem' tables in Rdata format (components 5-8 above).
PanamaPlot.txt: Locations of all tree plots and inventories in Panama.
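A minimal sketch of working with the R Analytical Tables (the object names inside the .rdata files are assumptions based on the file names):

load("marena2cns.full1.rdata")   # first census, one record per tree
load("marena2cns.full2.rdata")   # second census of the same 44 plots
load("bci.spptable.rdata")       # species reference table
census1 <- merge(marena2cns.full1, bci.spptable, by = "sp")   # attach species names via the 'sp' code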
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Remarks on kernels and bandwidth selection for semiparametric density product estimator method. (DOC)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS (European Journal of Media Studies), an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for webscraping. They were written in R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
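Both string-distance methods named here are available in R's stringdist package; a minimal illustration (not the authors' exact code):

library(stringdist)
# Cosine similarity on character bigrams: high for reordered but similar titles
stringsim("The Matrix", "Matrix, The", method = "cosine", q = 2)
# Optimal string alignment (osa): tolerant of typos and transpositions
stringsim("The Matrix", "The Matirx", method = "osa")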
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates functions for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check whether everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format; all information for each festival is listed in one row.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset is a shapefile which is a subset for the Hunter subregion containing geographical locations and other characteristics (see below) of streamflow gauging stations.
There are 3 files that have been extracted from the Hydstra database to aid in identifying sites in the Hunter subregion and the type of data collected from each one.
The 3 files are:
Site - lists all sites available in Hydstra from data providers. The data provider is listed in the #Station field as _xxx; for example, sites in NSW are _77 and sites in QLD are _66.
Some sites do not have locational information and will not be able to be plotted.
Period - the period table lists all the variables that are recorded at each site and the period of record.
Variable - the variable table shows variable codes and names, which can be linked to the period table (a sketch of this join appears below).
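A hedged sketch of linking the tables in R (the file and key-column names are assumptions):

period   <- read.csv("Period.csv")     # period of record per site and variable
variable <- read.csv("Variable.csv")   # variable codes and names
merge(period, variable, by = "variable_code")   # 'variable_code' is a placeholder key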
Locations are used as pour points in order to define reach areas for river system modelling.
Subset of data for the Hunter subregion that was extracted from the Bureau of Meteorology's Hydstra system; it includes all gauges where data has been received from the lead water agency of each jurisdiction. The gauges shapefile for all bioregions was intersected with the Hunter subregion boundary to identify and extract gauges within the subregion.
Bioregional Assessment Programme (2016) HUN AWRA-R calibration nodes v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/f2da394a-3d08-4cf4-8c24-bf7751ea06a1.
Derived From Gippsland Project boundary
Derived From Bioregional Assessment areas v04
Derived From Natural Resource Management (NRM) Regions 2010
Derived From Bioregional Assessment areas v03
Derived From Victoria - Seamless Geology 2014
Derived From Bioregional Assessment areas v05
Derived From National Surface Water sites Hydstra
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From GEODATA TOPO 250K Series 3
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Geological Provinces - Full Extent
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the csv into R; here is the code for how you do this:
df <- read.csv('uber.csv')                # load the cleaned Uber data
df_black <- subset(df, name == 'Black')   # keep only the 'Black' car type
write.csv(df_black, "nameofthefileyouwanttosaveas.csv", row.names = FALSE)
getwd()                                   # shows the folder the file was written to
Data from the IFLS, merged across waves; most outcomes taken from wave 5. Includes birth order, family structure, Big 5 personality, intelligence tests, and risk lotteries.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
[truncated]
This dataset was automatically described using the codebook R package (version 0.8.2).
https://dataverse.nl/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.34894/9Q0FVO
This dataset contains the processed RNA sequencing data of purified CD1c-positive conventional type 2 dendritic cells (CD1c+ cDC2s), functional enrichment analysis, manual and automatic (i.e., flowSOM) gating data of flow cytometry, and multiplex cytokine analyses as outlined in Hiddingh et al. 2022, "Transcriptome network analysis implicates CX3CR1-positive type 3 dendritic cells in non-infectious uveitis" (see preprint on BioRxiv). Data are from two cohorts (cohort I, n=36, and cohort II, n=42) of in total 51 patients with non-infectious uveitis (HLA-B27-positive acute anterior uveitis, idiopathic intermediate uveitis, and HLA-A29-positive Birdshot Uveitis (Birdshot chorioretinopathy)) and 27 sex/age-matched healthy controls without ocular inflammatory disease. All raw sequencing data are available at NCBI SRA under the accession numbers GSE195501 (FACS-sorted cohort I) and GSE194060 (MACS-sorted cohort II). This DataverseNL dataset contains additional raw data, processed data, and metadata (see the readme file) and reproducible R notebooks (R script and image) used for the analysis in the manuscript:
R scripts (markdown + R image) with step-by-step analyses:
* Figure_1.rmd (see "Figure_1.html")
* Figure_2.rmd (see "Figure_2.html")
* Figure_3.rmd (see "Figure_3.html")
* Figure_4.rmd (see "Figure_4.html")
* Figure_5.rmd (see "Figure_5.html")
Processed RNA-seq data (including WGCNA) (see folder Uveitis_mDC in files)
Experimental data:
* Manual gating data of MACS-sorted fractions, cohort I (see here)
* Manual gating data for CD1c+ cDC2 subsets in PBMCs (see here)
* Manual gating of CD14+ and CD14- CD1c+ cDC2 fractions from buffy coats (see here)
* qPCR data for CX3CR1, CCR5, CCR2, IRF8, TLR7, RUNX3 and CD36 in sorted CD14+ and CD14- CD1c+ DCs (see here)
* qPCR data (fold change compared to medium) for RUNX3 and CD36 in overnight stimulated cDC2 cultures (see here)
* Cell phenotypes identified by flowSOM (7x7 grid) using the cDC2-subset flow cytometry panel (see here)
* IL-23 ELISA concentration in supernatant of overnight LTA-stimulated cDC2 subset cultures (see here)
* Luminex Multiplex Cytokine analysis of supernatant of overnight LTA-stimulated cDC2 subset cultures (see here)
Other transcriptomic data used in the R scripts (above):
* WT untreated cDC2 versus cDC2 from Runx3-11cKO mice: GSE48590, generated by Dicken et al., PLoS One 2013
* WT untreated cDC2 versus cDC2 from Notch2-11cKO mice: GSE119242, generated by Briseño et al., Proc Natl Acad Sci U S A 2018
* Sorted CD14+CD5-CD163+ and CD14-CD5-CD163+ cDC2s from SLE and scleroderma patients: GSE136731, generated by Dutertre et al., Immunity 2019
* Single-cell RNA-seq of aqueous humor from 4 HLA-B27-positive uveitis patients and control: GSE178833, generated by Kasper et al., Elife 2021
* Inflammatory [inf-]cDC2s: GSE149619, generated by Bosteels et al., Immunity 2020
* RNA-seq data from cDC2s generated from murine bone marrow cells in co-culture with the stromal OP-9 cell line transduced with or without expression of the Notch ligand Delta-like 1: GSE110577, generated by Kirkling et al., Cell Rep 2018
The data release includes part of the bottom-trawl and gill-net survey data collected between 1952 and 1962 from the research vessel R/V Cisco. The bottom-trawl dataset includes tables for fishing operations and effort (BT_OP.csv), fish catch (BT_Catch.csv), and individual length-weight-sex-maturity (LWSM) records (BT_Fish.csv) for only a subset of species (details below). The gill-net dataset includes tables for fishing operations (GN_OP.csv), fishing effort (GN_Effort.csv), fish catch (GN_Catch.csv), and individual LWSM records (GN_Fish.csv) for only a subset of species (details below). Two reference tables, BT_Spec.csv and Species.csv, are used for bottom trawl specifications and fish species names, respectively.
This data release is part of the project "Historical habitat use by Coregonus artedi in the Upper Great Lakes and critical embayments" funded by Environmental Protection Agency's Great Lakes Restoration Initiative. Due to the scope of this project, the catch tables (GN_Catch.csv and BT_Catch.csv) only include data for alewife (Alosa pseudoharengus), rainbow smelt (Osmerus mordax), and cisco (Coregonus artedi). LWSM data tables (GN_Fish.csv and BT_Fish.csv) only include data for cisco. Additional data collected from the surveys are archived at the Great Lakes Science Center.
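A hedged sketch of joining catch records to fishing operations in R (the key column is a placeholder; see the release metadata for the actual keys):

op    <- read.csv("BT_OP.csv")      # bottom-trawl operations and effort
catch <- read.csv("BT_Catch.csv")   # catch per operation (alewife, rainbow smelt, cisco)
merge(catch, op, by = "op_id")      # 'op_id' is an assumed key column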
This dataset provides calculated camera-NDVI data for individual regions-of-interest (ROIs) for the phenocam named 'GRCA1PJ' (part of the Phenocam Network, https://phenocam.sr.unh.edu/webcam/). The GRCA1PJ phenocam is within a pinyon-juniper woodland in Grand Canyon National Park. Camera-NDVI refers to a modified version of NDVI calculated by the phenopix package (Filippa et al., 2016). The camera-calculated NDVI data are in the folder FinalOutput. File attributes within that folder are described in detail in the entity and attribute information section of this metadata. It should be possible for the user to use only the ROI definitions, image data downloaded from the phenocam network, and the phenopix R-package to reproduce the final NDVI dataset. However, the dataset also contains scripts and intermediate files that may be helpful in reproducing or extending the processing, but are not essential to reproducing the data. The complete dataset release includes 1) A workflow spreadsheet file that describes the processing steps, associated scripts, and output filenames (filename: Workflow_With_Filenames.ods). 2) R-code script files used in processing (folder: 'Code'). 3) ROI boundary files and jpg images for the ROIs presented in the linked publication (folder: "Phenocamdata/grca1pj/ROI"). 4) Ancillary files used to create the NDVI dataset; these include exposure coordinates and training files (folder: 'Phenocamdata/grca1pj/Ancillary'). 5) Files listing exposures for individual photos within the initial processing time period (folder: 'Exposures'). 6) Screening parameters for cloud and poor-light-condition screening of photos, as well as a list of photos that meet the cloud-screening standards (folder: 'Phenocamdata/grca1pj/BlueSkyScreening'). 7) Vegetation index files produced by the phenopix package, organized by ROI and month-year group (folder: "Phenocamdata/grca1pj/VI_Tables"). 8) Supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (e.g., the folders Rproj.user and packrat, and the files .RData and PhenocamPR.Rproj). 9) The graphic 'Fig_4_ROIWithLabels.jpg' shows the phenocam field of view with labelled ROIs. Outline colors correspond to juniper (red), pinyon (blue), and other species (yellow). Labels correspond to NDVI curves in 'Fig_7_PhenocamCurves.JPG' (also included in this data release). The composite area comprises the field of view beneath the approximate horizon line labelled ‘J’ (gray). This image corresponds to Figure 4 in the associated journal article. 10) The graphic 'Fig_7_PhenocamCurves.JPG' shows NDVI curves derived from phenocam images from September 2017 - December 2018 for individual regions of interest (ROIs). Letter designations correspond to ROI labels in Fig_4_ROIWithLabels.jpg (also included in this data release). Data were screened to remove cloudy photos during Aqua and Terra flyover hours. Black ellipses indicate times when the ROI target vegetation was shaded. Red ellipses indicate times when the background of the ROI was shaded. To improve visibility, the Y axis is restricted and excludes 37 extreme values out of a total of 6698 values. The exposure adjustment method used by the phenopix package produces NDVI values that have a strong linear correlation with spectroradiometer-derived NDVI but are negatively shifted, so that vegetated areas often have NDVI values below zero. This image corresponds to Figure 7 in the associated journal article.
The file types .Rdata or .rds are commonly used in this release because these are the types created by the phenopix processing package, and these files will be needed (or the user will need to recreate new versions) for further processing. The scripts enable the user to replicate processing or to extend it to different times or areas of interest; however, these scripts require as additional input phenocam imagery that the user must download. To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may also use the descriptive documentation, the phenopix package documentation, and the description/references provided in the associated journal article to process the data to achieve the same results using newer packages or other software programs. Species-specific phenological curves included in the NDVI output section of this dataset: Juniperus osteosperma, Pinus edulis, Purshia stansburiana, Artemisia tridentata, and Chamaebatiaria millefolium.
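A minimal sketch of restoring the recorded package versions with packrat (opening PhenocamPR.Rproj in RStudio may do this automatically; see the packrat documentation linked above):

install.packages("packrat")
packrat::on()        # activate the project's private packrat library
packrat::restore()   # reinstall the package versions recorded in the packrat/ folder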
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The Dutch CELEX data is derived from R.H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM), Release 2, Dutch Version 3.1, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995. Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For the Dutch data, frequencies have been disambiguated on the basis of the 42.4m Dutch Instituut voor Nederlandse Lexicologie text corpora. To make for greater compatibility with other operating systems, the databases have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files, which can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files. This database can be divided into different subsets:
· orthography: with or without diacritics, with or without word division positions, alternative spellings, number of letters/syllables;
· phonology: phonetic transcriptions with syllable boundaries or primary and secondary stress markers, consonant-vowel patterns, number of phonemes/syllables, alternative pronunciations, frequency per phonetic syllable within words;
· morphology: division into stems and affixes, flat or hierarchical representations, stems and their inflections;
· syntax: word class, subcategorisations per word class;
· frequency of the entries: disambiguated for homographic lemmata.
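A hedged sketch of querying one of the plain ASCII files in R instead of AWK or ICON (the filename and the backslash field separator are assumptions about the CELEX distribution):

# Each line is one entry; the unique identity number in the first field
# links entries across files
dpw <- read.delim("DPW.CD", sep = "\\", header = FALSE, quote = "")
head(dpw[[1]])   # identity numbers used to link information across files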
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This index was compiled by Miss Mollie Bentley from various records she has used relating to the police. These include: Almanac listings, Colonial Secretary's Office Records, Police Gazettes, various police department occurrence books and letter books, police journals, government gazettes, estimates, York police records etc.
Entry is by name of policeman. Information given varies but is usually about appointments, promotions, retirements, transfers etc.
The Western Australian Biographical Index (WABI) is a highly used resource at the State Library of Western Australia. A recent generous contribution by the Friends of Battye Library (FOBS) has enabled SLWA to have the original handwritten index cards scanned and later transcribed.
The dataset contains several csv files with data describing card number, card text and a url link to an image of the original handwritten card.
The transcription was crowd-sourced and we are aware that there are some data quality issues, including:
* Some cards are missing
* Transcripts are crowdsourced so may contain spelling errors and possibly missing information
* Some cards are crossed out; some of these are included in the collection and some are not
* Some of the cards contain relevant information on the back (usually children of the person mentioned); this info should be on the next consecutive card
* As the information is an index, collected in the 1970s from print material, it is incomplete. It is also unreferenced.
It is still a very valuable dataset as it contains a wealth of information about early settlers in Western Australia. It is of particular interest to genealogists and historians.