100+ datasets found
  1. Data from: Regression with Empirical Variable Selection: Description of a...

    • plos.figshare.com
    txt
    Updated Jun 8, 2023
    Cite
    Anne E. Goodenough; Adam G. Hart; Richard Stafford (2023). Regression with Empirical Variable Selection: Description of a New Method and Application to Ecological Datasets [Dataset]. http://doi.org/10.1371/journal.pone.0034338
    Explore at:
Available download formats: txt
    Dataset updated
    Jun 8, 2023
    Dataset provided by
PLOS (http://plos.org/)
    Authors
    Anne E. Goodenough; Adam G. Hart; Richard Stafford
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
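
    The nested-model sequence described above is straightforward to prototype. Below is a minimal R sketch, not the authors' implementation (leaps::regsubsets and the toy data are assumptions): it ranks predictors by how often they enter the best all-subsets models, then builds the nested model sequence and compares it by AIC.

    ```r
    library(leaps)

    set.seed(1)
    n <- 100
    dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
    dat$y <- 0.8 * dat$x1 + 0.4 * dat$x2 + rnorm(n)

    # All-subsets regression: best model of each size
    ss <- summary(regsubsets(y ~ ., data = dat, nvmax = 4))

    # Empirical support: how often each predictor enters the best subsets
    support <- colSums(ss$which[, -1, drop = FALSE])  # drop the intercept column
    ranked  <- names(sort(support, decreasing = TRUE))

    # Nested model sequence: model i contains the i most-supported predictors
    models <- lapply(seq_along(ranked), function(i)
      lm(reformulate(ranked[seq_len(i)], response = "y"), data = dat))
    sapply(models, AIC)  # compare the small set of models; pick the lowest AIC
    ```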

  2. Data from: WiBB: An integrated method for quantifying the relative...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1 more
    zip
    Updated Aug 20, 2021
    Cite
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Explore at:
Available download formats: zip
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Field Museum of Natural History
    Beijing Normal University
    Authors
    Qin Li; Xiaojun Kou
    License

CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

    A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by β* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate their performance relative to the WiBB method in ranking predictor importance under various scenarios. We also applied it to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species’ presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the β* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance, and hence in reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy addition to the statistical toolbox.
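
    For intuition, the model-weighting ingredient Wi can be computed from candidate-model AIC values as Akaike weights; a minimal R sketch with hypothetical AIC values:

    ```r
    # Akaike weights (the Wi ingredient of WiBB) from hypothetical AIC values
    aic   <- c(m1 = 210.2, m2 = 212.5, m3 = 215.1)
    delta <- aic - min(aic)                         # AIC differences
    wi    <- exp(-delta / 2) / sum(exp(-delta / 2))
    round(wi, 3)                                    # weights sum to 1
    ```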

    Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.’s approach (2014) with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of differences between the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, 0.3. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0). We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, converting the continuous response into binary data O (e.g., occurrence data with 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was assembled from occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
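
    A minimal R sketch of this simulation setup, assuming unit variances and mutually uncorrelated predictors (the authors' modified data.simulation function may differ in detail):

    ```r
    library(mvtnorm)

    r     <- c(0.6, 0.4, 0.2, 0.0)        # preset correlations of x1..x4 with y
    Sigma <- diag(5)                      # variable order: y, x1..x4
    Sigma[1, 2:5] <- Sigma[2:5, 1] <- r

    set.seed(42)
    sim <- as.data.frame(rmvnorm(1000, mean = rep(0, 5), sigma = Sigma))
    names(sim) <- c("y", paste0("x", 1:4))
    round(cor(sim)[1, -1], 2)             # check the realized correlations with y

    # GLM variant: convert the continuous response into binary occurrence data
    sim$occ <- rbinom(nrow(sim), 1, plogis(sim$y))
    ```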

  3. Reddit /r/datasets Dataset

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit /r/datasets Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/the-meta-corpus-of-datasets-the-reddit-dataset
    Explore at:
Available download formats: zip (9,619,636 bytes)
    Dataset updated
    Nov 28, 2022
    Authors
    The Devastator
    License

CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Meta-Corpus of Datasets: The Reddit Dataset

    The Complete Collection of Datasets Posted on Reddit

    By SocialGrep

    About this dataset

    This dataset is a collection of posts and comments made on Reddit's /r/datasets board, from the subreddit's inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames, to preserve users' anonymity and to prevent targeted harassment.



    How to use the dataset

    To use this dataset, you will need software that can open CSV files: a spreadsheet application such as LibreOffice Calc, a plain-text editor, or a data analysis environment such as R.

    Once you have the necessary software installed, open the Reddit Dataset folder and open the the-reddit-dataset-dataset-posts.csv file.

    In the file, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.

    You can use this information to analyze trends in datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits, as sketched below. Additionally, sentiment analysis could be performed on the titles of posts to see if there is a correlation between positive/negative sentiment and upvotes/downvotes.
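
    The average-score comparison suggested above takes a few lines of R (the file name is taken from the Columns section below; the path is an assumption):

    ```r
    # Read the posts file and summarize scores (adjust the path as needed)
    posts <- read.csv("the-reddit-dataset-dataset-posts.csv")

    mean(posts$score)                                 # average score overall
    aggregate(score ~ sentiment, data = posts, mean)  # average score by title sentiment
    ```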

    Research Ideas

    • Finding correlations between different types of datasets
    • Determining which datasets are most popular on Reddit
    • Analyzing the sentiments of posts and comments on Reddit's /r/datasets board



    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: the-reddit-dataset-dataset-comments.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | body           | The body of the post. (String)                      |
    | sentiment      | The sentiment of the post. (String)                 |
    | score          | The score of the post. (Integer)                    |

    File: the-reddit-dataset-dataset-posts.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | score          | The score of the post. (Integer)                    |
    | domain         | The domain of the post. (String)                    |
    | url            | The URL of the post. (String)                       |
    | selftext       | The self-text of the post. (String)                 |
    | title          | The title of the post. (String)                     |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and SocialGrep.

  4. Optimization and Evaluation Datasets for PiMine

    • fdr.uni-hamburg.de
    zip
    Updated Sep 11, 2023
    + more versions
    Cite
    Graef, Joel; Ehrt, Christiane; Reim, Thorben; Rarey, Matthias (2023). Optimization and Evaluation Datasets for PiMine [Dataset]. http://doi.org/10.25592/uhhfdm.13228
    Explore at:
Available download formats: zip
    Dataset updated
    Sep 11, 2023
    Dataset provided by
    ZBH Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany
    Authors
    Graef, Joel; Ehrt, Christiane; Reim, Thorben; Rarey, Matthias
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions.[1]

    The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMine set), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities.

    Data Set description:

    The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine.

    The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated.

    The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes in which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of performance when only the interfaces of apparently unrelated chains are available. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains, which can be used for alignment performance assessments.

    Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and also represents an interesting dataset to screen for interface similarities.

    References:

    [1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted)
    [2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012.
    [3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265.
    [4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055.

  5. A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution...

    • dataverse.harvard.edu
    Updated Jan 19, 2021
    Cite
    Lianfa, Li; Jiajie, Wu (2021). A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution for Mainland China [Dataset]. http://doi.org/10.7910/DVN/RNSWRH
    Explore at:
Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 19, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Lianfa, Li; Jiajie, Wu
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2015 - Dec 31, 2018
    Area covered
    China
    Description

    We share a complete aerosol optical depth (AOD) dataset with high spatial (1x1 km^2) and temporal (daily) resolution in the Beijing 1954 projection (https://epsg.io/2412) for mainland China (2015-2018). The original aerosol optical depth images are from the Multi-Angle Implementation of Atmospheric Correction Aerosol Optical Depth product (MAIAC AOD) (https://lpdaac.usgs.gov/products/mcd19a2v006/), which has a similar spatiotemporal resolution and the sinusoidal projection (https://en.wikipedia.org/wiki/Sinusoidal_projection). After projection conversion, eighteen tiles of MAIAC AOD were merged to obtain a large AOD image covering the entire area of mainland China. Due to cloud cover and high surface reflectance, each original MAIAC AOD image usually has many missing values, and the average missing percentage per image may exceed 60%. Such a high percentage of missing values severely limits the applicability of the original MAIAC AOD product. We used the method of full residual deep networks (Li et al., 2020, https://ieeexplore.ieee.org/document/9186306) to impute the daily missing MAIAC AOD, thus obtaining a complete (no missing values) high-resolution AOD data product covering mainland China. The covariates used in imputation included coordinates, elevation, MERRA2 coarse-resolution PBLH and AOD variables, cloud fraction, high-resolution meteorological variables (air pressure, air temperature, relative humidity and wind speed) and/or a time index. Ground monitoring data were used to generate the high-resolution meteorological variables to ensure the reliability of interpolation. Overall, our daily imputation models achieved an average training R^2 of 0.90 (range 0.75 to 0.97; average RMSE 0.075, range 0.026 to 0.32) and an average test R^2 of 0.90 (range 0.75 to 0.97; average RMSE 0.075, range 0.026 to 0.32). With almost no difference between training and test metrics, the high test R^2 and low test RMSE show the reliability of the AOD imputation. In an evaluation against ground AOD data from monitoring stations of the Aerosol Robotic Network (AERONET) in mainland China, our method obtained an R^2 of 0.78 and an RMSE of 0.27, further illustrating its reliability. This database contains four datasets:
    - Daily complete high-resolution AOD images for mainland China from January 1, 2015 to December 31, 2018 (1,461 images stored in 1,461 files, plus 3 summary Excel files).
    - The table “CHN_AOD_INFO.xlsx”, describing the properties of the 1,461 images, including projection, training R^2 and RMSE, testing R^2 and RMSE, and the minimum, mean, median and maximum predicted AOD.
    - The table “Model_and_Accuracy_of_Meteorological_Elements.xlsx”, describing the performance metrics of the high-resolution meteorological dataset interpolation.
    - The table “Evaluation_Using_AERONET_AOD.xlsx”, showing the AERONET evaluation results, including R^2, RMSE, and the monitoring information used in this study.

  6. Clustering of samples and variables with mixed-type data

    • plos.figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider (2023). Clustering of samples and variables with mixed-type data [Dataset]. http://doi.org/10.1371/journal.pone.0188274
    Explore at:
Available download formats: tiff
    Dataset updated
    Jun 1, 2023
    Dataset provided by
PLOS (http://plos.org/)
    Authors
    Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.
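
    As a generic illustration of the dissimilarity-based workflow (not the CluMix API), mixed-type data can be clustered from a Gower dissimilarity matrix, which is the kind of dissimilarity-matrix input the integrative heatmap builds on:

    ```r
    library(cluster)

    data(flower)                                   # small mixed-type example data set
    d_samples <- daisy(flower, metric = "gower")   # handles numeric + categorical jointly
    plot(hclust(d_samples), main = "Samples clustered on Gower dissimilarity")
    ```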

  7. Benchmark Multi-Omics Datasets for Methods Comparison

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 14, 2021
    Cite
    Odom, Gabriel; Wang, Lily (2021). Benchmark Multi-Omics Datasets for Methods Comparison [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5683001
    Explore at:
    Dataset updated
    Nov 14, 2021
    Dataset provided by
    Florida International University
    University of Miami
    Authors
    Odom, Gabriel; Wang, Lily
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pathway Multi-Omics Simulated Data

    These are synthetic variations of the TCGA COADREAD data set (original data available at http://linkedomics.org/data_download/TCGA-COADREAD/). This data set is used as a comprehensive benchmark data set to compare multi-omics tools in the manuscript "pathwayMultiomics: An R package for efficient integrative analysis of multi-omics datasets with matched or un-matched samples".

    There are 100 sets of random modifications to centred and scaled copy number, gene expression, and proteomics data, saved as compressed data files for the R programming language. They are stored as 100 sub-folders (the first 50 in "pt1" and the second 50 in "pt2"), labelled "sim001", "sim002", ..., "sim100". Each folder contains the following contents:
    (1) "indicatorMatricesXXX_ls.RDS": a list of simple triplet matrices showing which genes (in which pathways) and which samples received the synthetic treatment (where XXX is the simulation run label: 001, 002, ...).
    (2) "CNV_partitionA_deltaB.RDS": the synthetically modified copy number variation data, where A is the proportion of genes in each gene set that received the synthetic treatment (partition 1 is 20%, 2 is 40%, 3 is 60%, and 4 is 80%) and B is the signal strength in units of standard deviations.
    (3) "RNAseq_partitionA_deltaB.RDS": the synthetically modified gene expression data (same parameter legend as CNV).
    (4) "Prot_partitionA_deltaB.RDS": the synthetically modified protein expression data (same parameter legend as CNV).
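
    A minimal R sketch for loading one replicate (the partition and delta values in the file names below are hypothetical examples; check the actual names in each simXXX folder):

    ```r
    # Load one simulation replicate following the naming scheme above
    sim_dir   <- "pt1/sim001"
    indicator <- readRDS(file.path(sim_dir, "indicatorMatrices001_ls.RDS"))
    cnv       <- readRDS(file.path(sim_dir, "CNV_partition1_delta1.RDS"))
    rnaseq    <- readRDS(file.path(sim_dir, "RNAseq_partition1_delta1.RDS"))
    prot      <- readRDS(file.path(sim_dir, "Prot_partition1_delta1.RDS"))
    ```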

    Supplemental Files

    The file "cluster_pathway_collection_20201117.gmt" is the collection of gene sets used for the simulation study in Gene Matrix Transpose format. Scripts to create and analyze these data sets available at: https://github.com/TransBioInfoLab/pathwayMultiomics_manuscript_supplement

  8. Data for: A systematic review showed no performance benefit of machine...

    • data.mendeley.com
    • search.datacite.org
    Updated Mar 14, 2019
    Cite
    Ben Van Calster (2019). Data for: A systematic review showed no performance benefit of machine learning over logistic regression for clinical prediction models [Dataset]. http://doi.org/10.17632/sypyt6c2mc.1
    Explore at:
    Dataset updated
    Mar 14, 2019
    Authors
    Ben Van Calster
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The uploaded files are:

    1) An Excel file containing 6 sheets, in the following order: "Data Extraction" (summarized final data extractions from the three reviewers involved), "Comparison Data" (data related to the comparisons investigated), "Paper level data" (summaries at paper level), "Outcome Event Data" (information on the number of events for every outcome investigated within a paper), and "Tuning Classification" (data related to the manner of hyperparameter tuning of the machine learning algorithms).

    2) The R script used for the analysis. To read the data, save the "Comparison Data", "Paper level data", and "Outcome Event Data" Excel sheets as txt files. In the R script, srpap refers to the "Paper level data" sheet, srevents to the "Outcome Event Data" sheet, and srcompx to the "Comparison Data" sheet; see the loading sketch after this list.

    3) Supplementary material, including the search string, tables of data, and figures.

    4) PRISMA checklist items
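
    A minimal R sketch of the loading step described in item 2 (the txt file names are assumptions; use whatever names the sheets were saved under):

    ```r
    # Read the exported sheets (tab-delimited txt; file names are assumptions)
    srpap    <- read.delim("paper_level_data.txt")    # "Paper level data" sheet
    srevents <- read.delim("outcome_event_data.txt")  # "Outcome Event Data" sheet
    srcompx  <- read.delim("comparison_data.txt")     # "Comparison Data" sheet
    ```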

  9. Simulation Data & R scripts for: "Introducing recurrent events analyses to...

    • data.niaid.nih.gov
    • doi.org
    • +1 more
    Updated Apr 29, 2024
    Cite
    Ferry, Nicolas (2024). Simulation Data & R scripts for: "Introducing recurrent events analyses to assess species interactions based on camera trap data: a comparison with time-to-first-event approaches" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11085005
    Explore at:
    Dataset updated
    Apr 29, 2024
    Dataset provided by
    Department of National Park Monitoring and Animal Management, Bavarian Forest National Park
    Authors
    Ferry, Nicolas
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    File descriptions:

    All csv files contain results from the different models (PAMM, AARs, linear models, MRPPs) for each iteration of the simulation; one row is one iteration.
    - "results_perfect_detection.csv": results from the first simulation part with all the observations.
    - "results_imperfect_detection.csv": results from the first simulation part with observations randomly thinned to mimic imperfect detection.

    The columns of these two files are:
    - ID_run: identifier of the iteration (N: number of sites, D_AB: duration of the effect of A on B, D_BA: duration of the effect of B on A, AB: effect of A on B, BA: effect of B on A, Se: seed number of the iteration).
    - PAMM30: p-value of the PAMM run on the 30-day survey.
    - PAMM7: p-value of the PAMM run on the 7-day survey.
    - AAR1: value of the Avoidance-Attraction Ratio AB/BA.
    - AAR2: value of the Avoidance-Attraction Ratio BAB/BB.
    - Harmsen_P: p-value from the linear model with the interaction Species1*Species2 (Harmsen et al. 2009).
    - Niedballa_P: p-value from the linear model comparing AB to BA (Niedballa et al. 2021).
    - Karanth_permA: rank of the observed interval-duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species A (Karanth et al. 2017).
    - MurphyAB_permA: rank of the observed AB interval-duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
    - MurphyBA_permA: rank of the observed BA interval-duration median compared to the randomized median distribution, when permuting on species A (Murphy et al. 2021).
    - Karanth_permB: rank of the observed interval-duration median (AB and BA undifferentiated) compared to the randomized median distribution, when permuting on species B (Karanth et al. 2017).
    - MurphyAB_permB: rank of the observed AB interval-duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).
    - MurphyBA_permB: rank of the observed BA interval-duration median compared to the randomized median distribution, when permuting on species B (Murphy et al. 2021).

    "results_int_dir_perf_det.csv" refers to the results from the second simulation part, with all the observations. "results_int_dir_imperf_det.csv" refers to the results from the second simulation part, with observations randomly thinned to mimic imperfect detection. Their columns are:
    - ID_run: identifier of the iteration (same legend as above).
    - p_pamm7_AB: p-value of the PAMM run on the 7-day survey testing for the effect of A on B.
    - p_pamm7_BA: p-value of the PAMM run on the 7-day survey testing for the effect of B on A.
    - AAR1: value of the Avoidance-Attraction Ratio AB/BA.
    - AAR2_BAB: value of the Avoidance-Attraction Ratio BAB/BB.
    - AAR2_ABA: value of the Avoidance-Attraction Ratio ABA/AA.
    - Harmsen_P, Niedballa_P, Karanth_permA, MurphyAB_permA, MurphyBA_permA, Karanth_permB, MurphyAB_permB, MurphyBA_permB: as defined above.

    Script file descriptions:
    - 1_Functions: R script containing the functions: the MRPP from Karanth et al. (2017), adapted here for time efficiency; the MRPP from Murphy et al. (2021), adapted here for time efficiency; a version of the ct_to_recurrent() function from the recurrent package, adapted for parallelized processing of the simulation datasets; and the simulation() function used to simulate observations of two species with reciprocal effects on each other.
    - 2_Simulations: R script containing the parameter definitions for all iterations (for the two parts of the simulations), the simulation parallelization, and the random thinning mimicking imperfect detection.
    - 3_Approaches comparison: R script containing the fit of the different models tested on the simulated data.
    - 3_1_Real data comparison: R script containing the fit of the different models tested on the real-data example from Murphy et al. (2021).
    - 4_Graphs: R script containing the code for plotting the results from the simulation part and the appendices.
    - 5_1_Appendix - Check for similarity between codes for Karanth et al 2017 method: R script containing the code lines of Karanth et al. (2017) and Murphy et al. (2021) and the version adapted for time efficiency, with a comparison verifying that the results are similar.
    - 5_2_Appendix - Multi-response procedure permutation difference: R script containing R code to test for differences between the MRPP approaches according to the species on which permutations are done.

  10. Behavioral responses of common dolphins to naval sonar

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Oct 4, 2024
    Cite
    Brandon Southall; John Durban (2024). Behavioral responses of common dolphins to naval sonar [Dataset]. http://doi.org/10.5061/dryad.ncjsxkt40
    Explore at:
Available download formats: zip
    Dataset updated
    Oct 4, 2024
    Dataset provided by
    Southall Environmental Associates (United States)
    University of California, Santa Cruz
    Authors
    Brandon Southall; John Durban
    License

CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Despite strong interest in how noise affects marine mammals, little is known about the most abundant and commonly exposed taxa. Social delphinids occur in groups of hundreds of individuals that travel quickly, change behavior ephemerally, and are not amenable to conventional tagging methods, posing challenges in quantifying noise impacts. We integrated drone-based photogrammetry, strategically placed acoustic recorders, and broad-scale visual observations to provide complementary measurements of different aspects of behavior for short- and long-beaked common dolphins. We measured behavioral responses during controlled exposure experiments (CEEs) of military mid-frequency (3-4 kHz) active sonar (MFAS) using simulated and actual Navy sonar sources. We used latent-state Bayesian models to evaluate response probability and persistence in exposure and post-exposure phases. Changes in sub-group movement and aggregation parameters were commonly detected during different phases of MFAS CEEs but not control CEEs. Responses were more evident in short-beaked common dolphins (n=14 CEEs), and a direct relationship between response probability and received level was observed. Long-beaked common dolphins (n=20) showed less consistent responses, although contextual differences may have limited which movement responses could be detected. These are the first experimental behavioral response data for these abundant dolphins to directly inform impact assessments for military sonars.

    Methods

    We used complementary visual and acoustic sampling methods at variable spatial scales to measure different aspects of common dolphin behavior in known and controlled MFAS exposure and non-exposure contexts. Three fundamentally different data collection systems were used to sample group behavior. Broad-scale visual sampling of subgroup movement was conducted using theodolite tracking from shore-based stations. Assessments of whole-group and sub-group sizes, movement, and behavior were conducted at 2-minute intervals from shore-based and vessel platforms using high-powered binoculars and standardized sampling regimes. Aerial UAS-based photogrammetry quantified the movement of a single focal subgroup. The UAS consisted of a large (1.07 m diameter) custom-built octocopter drone launched and retrieved by hand from vessel platforms. The drone carried a vertically gimballed camera (at least 16 MP) and sensors that allowed precise spatial positioning, enabling spatially explicit photogrammetry to infer movement speed and directionality. Remote-deployed (drifting) passive acoustic monitoring (PAM) sensors were strategically placed around focal groups to examine both basic aspects of subspecies-specific common dolphin acoustic (whistling) behavior and potential group responses in whistling to MFAS on variable temporal scales (Casey et al., in press). This integration allowed us to evaluate potential changes in movement, social cohesion, and acoustic behavior, and their covariance, associated with the absence or occurrence of exposure to MFAS. The collective raw data set consists of several GB of continuous broadband acoustic data and hundreds of thousands of photogrammetry images. Three sets of quantitative response variables were analyzed from the different data streams: directional persistence and variation in speed of the focal subgroup from UAS photogrammetry; group vocal activity (whistle counts) from passive acoustic records; and the number of sub-groups within a larger group being tracked from the shore station overlook.
    We fit separate Bayesian hidden Markov models (HMMs) to each set of response data, with the HMM assumed to have two states: a baseline state and an enhanced state that was estimated in sequential 5-s blocks throughout each CEE. The number of subgroups was recorded during periodic observations every 2 minutes and assumed constant across time blocks between observations. The number of subgroups was treated as missing data 30 seconds before each change was noted, to introduce prior uncertainty about the precise timing of the change. For movement, two parameters relating to directional persistence and variation in speed were estimated by fitting a continuous-time correlated random walk model to spatially explicit photogrammetry data, in the form of location tracks for focal individuals that were sequentially tracked throughout each CEE as a proxy for subgroup movement. Movement parameters were assumed to be normally distributed. Whistle counts were treated as normally distributed but truncated to be positive, because negative counts are not possible. Subgroup counts were assumed to be Poisson distributed, as they were distinct, small values. In all cases, the response variable mean was modeled as a function of the HMM with a log link:

    log(Response_t) = λ_0 + λ_1 * Z_t

    where at each 5-s time block t the hidden state took the value Z_t = 0, identifying a state with baseline response level λ_0, or Z_t = 1, identifying an “enhanced” state, with λ_1 representing the enhancement of the quantitative value of the response variable. A flat uniform(-30, 30) prior distribution was used for λ_0 in each response model, and a uniform(0, 30) prior distribution was adopted for each λ_1 to constrain enhancements to be positive. For whistle and subgroup counts, the enhanced state indicated increased vocal activity and more subgroups. A common indicator variable was estimated for the latent state for both movement parameters, such that switching to the enhanced state described less directional persistence and more variation in velocity. Speed was derived as a function of these two parameters and used here as a proxy for their joint responses, representing directional displacement over time.
    To assess differences in the behavior states between experimental phases, the block-specific latent states were modeled as a function of phase-specific probabilities, Z_t ~ Bernoulli(p_phase(t)), to learn about the probability p_phase of being in an enhanced state during each phase. For each of the pre-exposure, exposure, and post-exposure phases, this probability was assigned a flat uniform(0, 1) prior. The model was programmed in R (version 3.6.1; The R Foundation for Statistical Computing) with the nimble package (de Valpine et al. 2020) to estimate posterior distributions of the model parameters using Markov chain Monte Carlo (MCMC) sampling. Inference was based on 100,000 MCMC samples following a burn-in of 100,000, with chain convergence determined by visual inspection of three MCMC chains and corroborated by convergence diagnostics (Brooks and Gelman, 1998). To compare behavior across phases, we compared the posterior distributions of the p_phase parameters for each response variable, specifically by monitoring the MCMC output to assess the “probability of response” as the proportion of iterations for which p_exposure was greater or less than p_pre-exposure, and the “probability of persistence” as the proportion of iterations for which p_post-exposure was greater or less than p_pre-exposure. These probabilities of response and persistence thus estimated the extent of separation (non-overlap) between the distributions of pairs of p_phase parameters: if the two distributions of interest were identical, then p = 0.5, and if the two were non-overlapping, then p = 1. Similarly, we estimated the average values of the response variables in each phase by predicting phase-specific functions of the parameters, Mean.response_phase = exp(λ_0 + λ_1 * p_phase), and derived average speed simply as the mean of the speed estimates for the 5-second blocks in each phase.
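
    The phase comparison reduces to counting posterior draws; a minimal R sketch with stand-in draws (the actual analysis uses nimble MCMC output):

    ```r
    set.seed(1)
    p_pre  <- rbeta(1e5, 2, 8)   # stand-ins for posterior draws of p_phase
    p_exp  <- rbeta(1e5, 5, 5)
    p_post <- rbeta(1e5, 3, 7)

    mean(p_exp  > p_pre)         # "probability of response"
    mean(p_post > p_pre)         # "probability of persistence"
    ```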

  11. University SET data, with faculty and courses characteristics

    • openicpsr.org
    Updated Sep 12, 2021
    + more versions
    Cite
    Under blind review in refereed journal (2021). University SET data, with faculty and courses characteristics [Dataset]. http://doi.org/10.3886/E149801V1
    Explore at:
    Dataset updated
    Sep 12, 2021
    Authors
    Under blind review in refereed journal
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university operated entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by the econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section.

    The unit of observation, or a single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k), and the question number in the SET questionnaire (n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}). It means that for each pair (j, k) we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows.

    The full list of variables or columns in the data set included in the analysis is presented in the attached files section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.

    Two attachments:
    - Word file with the variable descriptions
    - Rdata file with the data set (for the R language)

    Appendix 1. The SET questionnaire used for this paper.

    Evaluation survey of the teaching staff of [university name]. Please complete the following evaluation form, which aims to assess the lecturer’s performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don’t agree; 1 - I strongly don’t agree.

    1. I learnt a lot during the course.
    2. I think that the knowledge acquired during the course is very useful.
    3. The professor used activities to make the class more engaging.
    4. If it was possible, I would enroll for the course conducted by this lecturer again.
    5. The classes started on time.
    6. The lecturer always used time efficiently.
    7. The lecturer delivered the class content in an understandable and efficient way.
    8. The lecturer was available when we had doubts.
    9. The lecturer treated all students equally regardless of their race, background and ethnicity.

  12. Data from: Datasets for lot sizing and scheduling problems in the...

    • data.mendeley.com
    • narcis.nl
    Updated Jan 19, 2021
    Cite
    Juan Piñeros (2021). Datasets for lot sizing and scheduling problems in the fruit-based beverage production process [Dataset]. http://doi.org/10.17632/j2x3gbskfw.1
    Explore at:
    Dataset updated
    Jan 19, 2021
    Authors
    Juan Piñeros
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets presented here were partially used in “Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings” (Toscano, A., Ferreira, D., Morabito, R., Computers & Chemical Engineering) [1], in “A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning” (Toscano, A., Ferreira, D., Morabito, R., Flexible Services and Manufacturing Journal) [2], and in “A heuristic approach to optimize the production scheduling of fruit-based beverages” (Toscano et al., Gestão & Produção, 2020) [3]. In fruit-based production processes, there are two production stages: preparation tanks and production lines. This production process has some process-specific characteristics, such as temporal cleanings and synchrony between the two production stages, which make optimized production planning and scheduling even more difficult. In this sense, some papers in the literature have proposed different methods to solve this problem. To the best of our knowledge, there are no standard datasets used by researchers in the literature to verify the accuracy and performance of proposed methods or to serve as a benchmark for other researchers considering this problem. The authors have been using small data sets that do not satisfactorily represent different production scenarios. Since demand in the beverage sector is seasonal, a wide range of scenarios enables us to evaluate the effectiveness of the methods proposed in the scientific literature in solving real scenarios of the problem. The datasets presented here include data based on real data collected from five beverage companies. We present four datasets that are specifically constructed assuming a scenario of restricted capacity and balanced costs. These datasets are supplementary data for the paper submitted to Data in Brief [4].

    [1] Toscano, A., Ferreira, D., Morabito, R., Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings, Computers & Chemical Engineering. 142 (2020) 107038. Doi: 10.1016/j.compchemeng.2020.107038.
    [2] Toscano, A., Ferreira, D., Morabito, R., A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning, Flexible Services and Manufacturing Journal. 31 (2019) 142-173. Doi: 10.1007/s10696-017-9303-9.
    [3] Toscano, A., Ferreira, D., Morabito, R., Trassi, M. V. C., A heuristic approach to optimize the production scheduling of fruit-based beverages. Gestão & Produção, 27(4), e4869, 2020. https://doi.org/10.1590/0104-530X4869-20.
    [4] Piñeros, J., Toscano, A., Ferreira, D., Morabito, R., Datasets for lot sizing and scheduling problems in the fruit-based beverage production process. Data in Brief (2021).

  13. XBT and CTD pairs dataset Version 2

    • researchdata.edu.au
    • data.csiro.au
    data download
    Updated Oct 16, 2014
    + more versions
    Cite
    Susan Wijffels; Franco Reseghetti; Zanna Chase; Mark Rosenberg; Steve Rintoul; Rebecca Cowley (2014). XBT and CTD pairs dataset Version 2 [Dataset]. https://researchdata.edu.au/3377826
    Explore at:
Available download formats: data download
    Dataset updated
    Oct 16, 2014
    Dataset provided by
CSIRO (http://www.csiro.au/)
    Authors
    Susan Wijffels; Franco Reseghetti; Zanna Chase; Mark Rosenberg; Steve Rintoul; Rebecca Cowley
    License

Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1984 - Aug 30, 2013
    Area covered
    Description

    The XBT/CTD pairs dataset (Version 2) contains additional datasets and updated datasets from the Version 1 data. Version 1 data was used to update the calculation of historical XBT fall rate and temperature corrections presented in Cowley, R., Wijffels, S., Cheng, L., Boyer, T., and Kizu, S. (2013). Biases in Expendable Bathythermograph Data: A New View Based on Historical Side-by-Side Comparisons. Journal of Atmospheric and Oceanic Technology, 30, 1195-1225, doi:10.1175/JTECH-D-12-00127.1 (http://journals.ametsoc.org/doi/abs/10.1175/JTECH-D-12-00127.1). Version 2 contains 1,188 pairs from seven datasets that add to Version 1, which contains 4,115 pairs from 114 datasets. There are also 10 updated datasets included in Version 2; the updates apply to the CTD depth data in the quality-controlled version of these 10 datasets, which should be used in preference to the copies in Version 1. Each dataset contains the scientifically quality-controlled version and (where available) the originator's data. The XBT/CTD pairs are identified in the document 'XBT_CTDpairs_metadata_V2.csv'. Although the XBT data in the additional datasets were collected after 2008, many of the probes in the ss2012t01 dataset were manufactured during the mid-1980s. Lineage: Data is sourced from the CSIRO Oceans and Atmosphere Flagship, the Australian Antarctic Division, and the Italian National Agency for New Technologies, Energy and Sustainable Economic Development. Original and raw data files are included where available. Quality-controlled datasets follow the procedure of Bailey, R., Gronell, A., Phillips, H., Tanner, E., and Meyers, G. (1994). Quality control cookbook for XBT data, Version 1.1. CSIRO Marine Laboratories Reports, 221. Quality-controlled data is in the 'MQNC' format used at CSIRO Marine and Atmospheric Research, described in the document 'XBT_CTDpairs_descriptionV2.pdf'. Note that future versions of the XBT/CTD pairs database may supersede this version; please check more recent versions for updates to individual datasets.

  14. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Film University Babelsberg KONRAD WOLF
    Authors
    Loist, Skadi; Samoilova, Evgenia (Zhenya)
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open-access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
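
    A minimal sketch of that kind of first-name lookup with the genderizeR package follows; the director names are illustrative, the package queries the external genderize.io API, and the dataset's own coding procedure may differ in detail.

    ```r
    # Sketch of first-name-based gender prediction with genderizeR.
    # Names are illustrative placeholders, not records from the dataset.
    library(genderizeR)

    directors <- c("Agnes Varda", "Barry Jenkins")
    given <- findGivenNames(directors)       # look up first names via genderize.io
    genderize(directors, genderDB = given)   # attach predicted gender per director
    ```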

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes eight text files containing the scripts for web scraping. They were written in R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity matches titles with a high degree of overall similarity, while the OSA (optimal string alignment) algorithm matches titles that may contain typos or minor variations.
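
    A minimal sketch of this two-method comparison, assuming the stringdist package; the thresholds and titles are illustrative, and the R scripts shipped with the dataset remain the authoritative implementation.

    ```r
    # Sketch of two-method fuzzy title matching with stringdist.
    # Lower distances indicate more similar strings.
    library(stringdist)

    title_a <- "The Watermelon Woman"
    title_b <- "Watermelon Woman, The"

    stringdist(title_a, title_b, method = "cosine", q = 3)  # q-gram cosine distance
    stringdist(title_a, title_b, method = "osa")            # optimal string alignment

    # A candidate pair might be accepted when either distance falls below a
    # chosen threshold, after also requiring director and production year
    # (+/- one year) to agree.
    ```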

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.

    The script “r_4_scraping_functions” creates functions for scraping the data from the identified matches (based on the scripts described above and the manual checks). These functions are used for scraping the data in the next scripts.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test run on a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and for tracking the progress of all of the R scripts described above. It reports the number of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables (such as location, festival name, and festival categories), units of measurement, data sources, coding, and information on missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row.

  15. Audio Cartography

    • openneuro.org
    Updated Aug 8, 2020
    Megen Brittell (2020). Audio Cartography [Dataset]. http://doi.org/10.18112/openneuro.ds001415.v1.0.0
    Explore at:
    Dataset updated
    Aug 8, 2020
    Dataset provided by
    OpenNeurohttps://openneuro.org/
    Authors
    Megen Brittell
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Audio Cartography project investigated the influence of temporal arrangement on the interpretation of information from a simple spatial data set. I designed and implemented three auditory map types (audio types), and evaluated differences in the responses to those audio types.

    The three audio types represented simplified raster data (eight rows x eight columns). First, a "sequential" representation read values one at a time from each cell of the raster, following an English reading order, and encoded each data value as the loudness of a single fixed-duration, fixed-frequency note. Second, an augmented-sequential ("augmented") representation used the same reading order, but encoded the data value as volume, the row as frequency, and the column as the rate at which the notes played (with constant total cell duration). Third, a "concurrent" representation used the same encoding as the augmented type, but allowed the notes to overlap in time.
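
    The stimuli themselves were rendered with Pyo in Python (see "Preparation of Behavioral Data" below); purely to illustrate the augmented mapping, here is a sketch in R. The base frequency and the scaling rules are assumptions, not the project's actual parameters.

    ```r
    # Illustrative sketch of the augmented encoding for an 8 x 8 raster:
    # data value -> loudness, row -> frequency, column -> note rate.
    set.seed(1)
    r <- matrix(runif(64), nrow = 8, ncol = 8)   # stand-in raster values in [0, 1]

    base_freq <- 220                             # assumed base pitch (Hz)
    notes <- expand.grid(row = 1:8, col = 1:8)
    notes$amplitude <- r[cbind(notes$row, notes$col)]       # value -> loudness
    notes$frequency <- base_freq * 2^((notes$row - 1) / 8)  # row -> pitch
    notes$rate      <- notes$col                            # column -> notes per cell
    head(notes)
    ```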

    Participants completed a training session in a computer-lab setting, where they were introduced to the audio types and practiced making a comparison between data values at two locations within the display based on what they heard. The training sessions, including associated paperwork, lasted up to one hour. In a second study session, participants listened to the auditory maps and made decisions about the data they represented while the fMRI scanner recorded digital brain images.

    The task consisted of listening to an auditory representation of geospatial data ("map"), and then making a decision about the relative values of data at two specified locations. After listening to the map ("listen"), a graphic depicted two locations within a square (white background). Each location was marked with a small square (size: 2x2 grid cells); one square had a black solid outline and transparent black fill, the other had a red dashed outline and transparent red fill. The decision ("response") was made under one of two conditions. Under the active listening condition ("active") the map was played a second time while participants made their decision; in the memory condition ("memory"), a decision was made in relative quiet (general scanner noises and intermittent acquisition noise persisted). During the initial map listening, participants were aware of neither the locations of the response options within the map extent, nor the response conditions under which they would make their decision. Participants could respond any time after the graphic was displayed; once a response was entered, the playback stopped (active response condition only) and the presentation continued to the next trial.

    Data was collected in accordance with a protocol approved by the Institutional Review Board at the University of Oregon.

    • Additional details about the specific maps used in this study are available through University of Oregon's ScholarsBank (DOI 10.7264/3b49-tr85).

    • Details of the design process and evaluation are provided in the associated dissertation, which is available from ProQuest and University of Oregon's ScholarsBank.

    • Scripts that created the experimental stimuli and automated processing are available through University of Oregon's ScholarsBank (DOI 10.7264/3b49-tr85).

    Preparation of fMRI Data

    Conversion of the DICOM files produced by the scanner to NIfTI format was performed with MRIConvert (LCNI). Orientation to standard axes was performed and recorded in the NIfTI header (FMRIB, fslreorient2std). Excess slices in the anatomical images that represented tissue in the neck were trimmed (FMRIB, robustfov). Participant identity was protected through automated defacing of the anatomical data (FreeSurfer, mri_deface), with additional post-processing to ensure that no brain voxels were erroneously removed from the image (FMRIB, BET; brain mask dilated with three iterations of "fslmaths -dilM").
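
    A hedged reconstruction of that chain as command-line calls issued from R follows; the file names, atlas paths, and option order are assumptions, so consult the FSL and FreeSurfer documentation for exact usage.

    ```r
    # Sketch of the anatomical preprocessing chain, assuming FSL and
    # FreeSurfer are installed and on the PATH. File names are placeholders.
    run <- function(cmd, args) system2(cmd, args)

    run("fslreorient2std", c("sub-01_T1w.nii.gz", "sub-01_T1w_std.nii.gz"))
    run("robustfov", c("-i", "sub-01_T1w_std.nii.gz",
                       "-r", "sub-01_T1w_fov.nii.gz"))       # trim neck slices
    run("mri_deface", c("sub-01_T1w_fov.nii.gz",
                        "talairach_mixed_with_skull.gca", "face.gca",
                        "sub-01_T1w_defaced.nii.gz"))         # automated defacing
    run("bet", c("sub-01_T1w_defaced.nii.gz", "sub-01_brain", "-m"))
    run("fslmaths", c("sub-01_brain_mask", "-dilM", "-dilM", "-dilM",
                      "sub-01_brain_mask_dil"))               # three dilations
    ```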

    Preparation of Metadata

    The dcm2niix tool (Rorden) was used to create draft JSON sidecar files with metadata extracted from the DICOM headers. The draft sidecar files were revised to augment the JSON elements with additional tags (e.g., "Orientation" and "TaskDescription") and to make tag contents more human-friendly (e.g., "InstitutionAddress" and "DepartmentName"). The device serial number was constant throughout data collection (i.e., all data were acquired on the same scanner), and the respective metadata values were replaced with an anonymous identifier: "Scanner1".
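
    A minimal sketch of that sidecar revision step with the jsonlite package; the file name and tag values shown are placeholders, with only the tag names taken from the text above.

    ```r
    # Read a draft BIDS sidecar, add/anonymize tags, and write it back.
    library(jsonlite)

    sidecar <- fromJSON("sub-01_task-map_bold.json")   # placeholder file name
    sidecar$TaskDescription    <- "Listen to an auditory map, then compare two locations"
    sidecar$DeviceSerialNumber <- "Scanner1"           # anonymized serial number
    write_json(sidecar, "sub-01_task-map_bold.json",
               auto_unbox = TRUE, pretty = TRUE)
    ```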

    Preparation of Behavioral Data

    The stimuli consisted of eighteen auditory maps. Spatial data were generated with the rgeos, sp, and spatstat libraries in R; auditory maps were rendered with the Pyo (Belanger) library for Python and prepared for presentation in Audacity. Stimuli were presented using PsychoPy (Peirce, 2007), which produced log files from which event details were extracted. The log files included timestamped entries for stimulus timing and trigger pulses from the scanner.

    • Log files are available in "sourcedata/behavioral".
    • Extracted event details accompany BOLD images in "sub-NN/func/*events.tsv".
    • Three column explanatory variable files are in "derivatives/ev/sub-NN".

    References

    Audacity® software is copyright © 1999-2018 Audacity Team. Web site: https://audacityteam.org/. The name Audacity® is a registered trademark of Dominic Mazzoni.

    FMRIB (Functional Magnetic Resonance Imaging of the Brain). FMRIB Software Library (FSL; fslreorient2std, robustfov, BET). Oxford, v5.0.9, Available: https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/

    FreeSurfer (mri_deface). Harvard, v1.22, Available: https://surfer.nmr.mgh.harvard.edu/fswiki/AutomatedDefacingTools

    LCNI (Lewis Center for Neuroimaging). MRIConvert (mcverter), v2.1.0 build 440, Available: https://lcni.uoregon.edu/downloads/mriconvert/mriconvert-and-mcverter

    Peirce, JW. PsychoPy–psychophysics software in Python. Journal of Neuroscience Methods, 162(1–2):8 – 13, 2007. Software Available: http://www.psychopy.org/

    Python software is copyright © 2001-2015 Python Software Foundation. Web site: https://www.python.org

    Pyo software is copyright © 2009-2015 Olivier Belanger. Web site: http://ajaxsoundstudio.com/software/pyo/.

    R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available: https://www.R-project.org/.

    rgeos software is copyright © 2016 Bivand and Rundel. Web site: https://CRAN.R-project.org/package=rgeos

    Rorden, C. dcm2niix, v1.0.20171215, Available: https://github.com/rordenlab/dcm2niix

    spatstat software is copyright © 2016 Baddeley, Rubak, and Turner. Web site: https://CRAN.R-project.org/package=spatstat

    sp software is copyright © 2016 Pebesma and Bivand. Web site: https://CRAN.R-project.org/package=sp

  16. Cyclistic_Divvy_data

    • kaggle.com
    zip
    Updated Jun 11, 2023
    Rami Ghaith (2023). Cyclistic_Divvy_data [Dataset]. https://www.kaggle.com/datasets/ramighaith/cyclistic-divvy-data
    Explore at:
    zip(21440758 bytes)Available download formats
    Dataset updated
    Jun 11, 2023
    Authors
    Rami Ghaith
    Description

    The following data shows riding information for members versus casual riders at Cyclistic (a fictional company). This dataset is used as a case study for the Google Data Analytics certificate.

    The changes made to the data in Excel:

    • Removed all duplicates (none were found).
    • Added a ride_length column by subtracting started_at from ended_at using the formula "=C2-B2", then formatting the result as the Time type (37:30:55).
    • Added a day_of_week column using the formula "=WEEKDAY(B2,1)" to record the day each ride took place, where 1 = Sunday through 7 = Saturday.
    • Values displayed as ######## were left unchanged; they represent negative durations and should be treated as 0.

    Processing the data in RStudio (a sketch of these steps follows below):

    • Installed the required packages: tidyverse for data import and wrangling, lubridate for date functions, and ggplot2 for visualization.
    • Step 1: Read the csv files into R to collect the data.
    • Step 2: Checked that all files contained the same column names so they could be merged into one.
    • Step 3: Renamed columns so that they aligned, then merged the files into one combined dataset.
    • Step 4: Performed further data cleaning and analysis.
    • Step 5: Once the data was cleaned and clearly telling a story, began to visualize it.
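
    A minimal sketch of these steps, assuming the standard Divvy export column names (started_at, ended_at, member_casual); the original analysis may differ in detail.

    ```r
    # Merge the monthly csv files, derive ride_length and day_of_week,
    # and summarize by rider type. Assumes ISO datetime columns that
    # read_csv parses as POSIXct.
    library(tidyverse)
    library(lubridate)

    files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
    trips <- files %>%
      map(read_csv) %>%
      bind_rows() %>%                                   # merge into one table
      mutate(ride_length = as.numeric(difftime(ended_at, started_at,
                                               units = "mins")),
             day_of_week = wday(started_at)) %>%        # 1 = Sunday ... 7 = Saturday
      filter(ride_length >= 0)                          # drop negative durations

    trips %>%
      group_by(member_casual, day_of_week) %>%
      summarise(mean_length = mean(ride_length), rides = n(), .groups = "drop")
    ```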

  17. Reddit: /r/Art

    • kaggle.com
    zip
    Updated Dec 17, 2022
    The Devastator (2022). Reddit: /r/Art [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-online-art-trends-with-reddit-posting/discussion?sort=undefined
    Explore at:
    zip(84621 bytes)Available download formats
    Dataset updated
    Dec 17, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Universal (CC0 1.0) Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/Art

    Examining Content by Title, Score, ID, URL, Comments, Create Date, and Timestamp

    By Reddit [source]

    About this dataset

    This dataset offers an in-depth exploration of the artistic world of Reddit, with a focus on posts to the site. By examining the titles, scores, IDs, URLs, comments, creation dates, and timestamps associated with each art post, researchers can gain insight into how art enthusiasts share their work and build networks on the platform. Analyzing this data shows which topics attract the most attention from viewers and how members interact with one another in online discussions. The dataset also has potential to explore larger underlying issues that shape art communities today, from production trends to consumption patterns. Overall, it is a useful resource for analyzing the digital spaces where art is circulated and discussed, giving insight into how ideas are created and promoted through creative networks.


    How to use the dataset

    This dataset is a rich source of information on online art trends, providing a comprehensive view of Reddit posts related to art. This guide discusses how to use it to gather insights about the way art is produced and shared on the web.
    First, familiarize yourself with the columns included in the dataset. Each post contains a title, score (number of upvotes), URL, number of comments, creation date, and timestamp. Interpreted individually or compared across posts, these values provide insight into which content is most discussed or favored within the Reddit community.
    After exploring the general features of each post, move on to more specific components, such as body content (including images) and response timing, i.e. when users began interacting with content posted about a specific topic. These variables help uncover patterns in how communities engage with certain types of content over longer periods, and show which topics are trending when analyzed at shorter intervals.
    Finally, titles can be examined for common words and phrases that appear across posts discussing similar types of artwork or media. Identifying keywords shared across groups, together with the score as a measure of reception, paints a holistic picture of the kind of engagement each community rewards. A sketch of this approach follows.
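
    As a minimal sketch of the keyword idea above, using the column names from the Columns section below; the length-based stop-word filter is purely illustrative.

    ```r
    # Count common title words and their average post score in Art.csv.
    library(dplyr)
    library(tidyr)

    art <- read.csv("Art.csv", stringsAsFactors = FALSE)

    art %>%
      mutate(word = strsplit(tolower(title), "[^a-z]+")) %>%
      unnest(word) %>%
      filter(nchar(word) > 3) %>%               # crude stop-word filter
      group_by(word) %>%
      summarise(posts = n(), mean_score = mean(score)) %>%
      arrange(desc(posts)) %>%
      head(20)
    ```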

    Research Ideas

    • Analyzing topics and themes within art posts to determine what content is most popular.
    • Examining the score of art posts to determine how the responding audience engages with each piece.
    • Comparing across different subreddits to explore the ‘meta-discourse’ of topics that appear in multiple forums or platforms

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: Art.csv

    | Column name | Description                                             |
    |:------------|:--------------------------------------------------------|
    | title       | The title of the post. (String)                         |
    | score       | The number of upvotes the post has received. (Integer)  |
    | url         | The URL of the post. (String)                           |
    | comms_num   | ...                                                     |

  18. Crop classification dataset for testing domain adaptation or distributional...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv
    Updated May 13, 2022
    Dan M. Kluger; Dan M. Kluger; Sherrie Wang; Sherrie Wang; David B. Lobell; David B. Lobell (2022). Crop classification dataset for testing domain adaptation or distributional shift methods [Dataset]. http://doi.org/10.5281/zenodo.6376160
    Explore at:
    bin, csvAvailable download formats
    Dataset updated
    May 13, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Dan M. Kluger; Dan M. Kluger; Sherrie Wang; Sherrie Wang; David B. Lobell; David B. Lobell
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this upload we share processed crop type datasets from both France and Kenya. These datasets can be helpful for testing and comparing various domain adaptation methods. The datasets are processed, used, and described in this paper: https://doi.org/10.1016/j.rse.2021.112488 (arXiv version: https://arxiv.org/pdf/2109.01246.pdf).

    In summary, each point in the uploaded datasets corresponds to a particular location. The label is the crop type grown at that location in 2017. The 70 processed features are based on Sentinel-2 satellite measurements at that location in 2017. The points in the France dataset come from 11 different departments (regions) in Occitanie, France, and the points in the Kenya dataset come from 3 different regions in Western Province, Kenya. Within each dataset there are notable shifts in the distribution of the labels and in the distribution of the features between regions. Therefore, these datasets can be helpful for testing and comparing methods that are designed to address such distributional shifts.

    More details on the dataset and processing steps can be found in Kluger et al. (2021). Much of the processing was aimed at dealing with Sentinel-2 measurements that were corrupted by cloud cover. For users interested in the raw multi-spectral time series data who want to handle cloud cover on their own (rather than using the 70 processed features provided here), the raw dataset from Kenya can be found in Yeh et al. (2021), and the raw dataset from France can be made available upon request from the authors of this Zenodo upload.

    All of the data uploaded here can be found in "CropTypeDatasetProcessed.RData". We also post the dataframes and tables within that .RData file as separate .csv files for users who do not have R. The contents of each R object (or .csv file) are described in the file "Metadata.rtf".
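
    A hedged sketch of a source-to-target evaluation on these data; the object and column names below are hypothetical stand-ins for the ones documented in "Metadata.rtf", and the two departments are merely examples of Occitanie regions.

    ```r
    # Train a classifier in one region and evaluate it in another to
    # expose the regional distribution shift. Object/column names are
    # placeholders; see Metadata.rtf for the real ones.
    library(ranger)

    load("CropTypeDatasetProcessed.RData")           # loads e.g. france_df (assumed)

    france_df$crop_type <- factor(france_df$crop_type)   # classification target
    train <- subset(france_df, region == "Aude")         # hypothetical source region
    test  <- subset(france_df, region == "Gers")         # hypothetical target region

    fit  <- ranger(crop_type ~ ., data = train)
    pred <- predict(fit, data = test)$predictions
    mean(pred == test$crop_type)                     # accuracy under regional shift
    ```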

    Preferred Citation:

    -Kluger, D.M., Wang, S., Lobell, D.B., 2021. Two shifts for crop mapping: Leveraging aggregate crop statistics to improve satellite-based maps in new regions. Remote Sens. Environ. 262, 112488. https://doi.org/10.1016/j.rse.2021.112488.

    -URL to this Zenodo post https://zenodo.org/record/6376160

  19. Data repository of multi-temporal high-resolution data products of ecosystem...

    • zenodo.org
    bin, zip
    Updated May 17, 2025
    Yifang Shi; Yifang Shi; Jinhu Wang; W. Daniel Kissling; Jinhu Wang; W. Daniel Kissling (2025). Data repository of multi-temporal high-resolution data products of ecosystem structure derived from country-wide airborne laser scanning surveys of the Netherlands [Dataset]. http://doi.org/10.5281/zenodo.15261042
    Explore at:
    zip, binAvailable download formats
    Dataset updated
    May 17, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Yifang Shi; Yifang Shi; Jinhu Wang; W. Daniel Kissling; Jinhu Wang; W. Daniel Kissling
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 27, 2025
    Description

    This data repository contains a set of multi-temporal data products of ecosystem structure derived from four national ALS surveys of the Netherlands (AHN1–AHN4) (folders: 1_AHN1, 2_AHN2, 3_AHN3, and 4_AHN4). Four sets of 25 LiDAR-derived vegetation metrics representing ecosystem height, cover, and structural variability are provided at 10 m spatial resolution, offering valuable data sources for a wide range of ecological research and fields beyond. A preview of all generated LiDAR metrics is also provided (folder: 5_Maps).

    All 25 LiDAR metrics were calculated from the normalized point cloud using the Laserfarm workflow (https://laserfarm.readthedocs.io/en/latest/), which builds on the user-extendable features of the “Laserchicken” software (https://laserchicken.readthedocs.io/en/latest/#features). More details on metric calculation are provided on GitHub (Laserchicken: https://github.com/eEcoLiDAR/laserchicken and Laserfarm: https://github.com/eEcoLiDAR/Laserfarm), as well as on the “Laserchicken” documentation page (https://laserchicken.readthedocs.io/en/latest/).

    We also provide masks to minimize the influence of water surfaces, buildings and roads, powerlines, and NA values in the data products (folder: 6_Masks). To supplement the generated data products, we provide a set of raster layers containing the point/pulse density of each AHN survey, as well as the DTM and DSM raster layers for each AHN dataset (folder: 7_Auxiliary_data). To test the robustness of the LiDAR metrics, we compared metrics generated from different pulse densities across different habitat types (folder: 8_Sensitivity_analysis).

    Two use cases demonstrate the utility of the presented data products: (use case 1) monitoring forest structural change across time using multi-temporal ALS data, and (use case 2) comparing vegetation structural differences within Natura 2000 sites. The data used are also provided (folder: 9_Use_case). Note that all raster layers are provided at 10 m resolution under the local Dutch coordinate system “RD_new” (EPSG: 28992, NAP: 5709).

    To gain more insight into the pre-classification accuracy of the AHN datasets, we also conducted a preliminary assessment of the effect of terrain filtering on vegetation change detection across AHN datasets (i.e. AHN2–AHN4). The data used in this analysis are made available (folder: 10_Ground_classification).

    An overview of all the folders in the repository:

    1. AHN1

    2. AHN2

    3. AHN3

    4. AHN4

    5. Maps

    Those folders contain four sets of 25 LiDAR metrics at 10 m resolution generated from each AHN dataset. The file names and their corresponding LiDAR metrics can be found in Table 1. An additional folder (5_Maps) contains the maps (.pdf format) of all 25 metrics for each AHN dataset.

    6. Masks

    • ahn3_10m_mask_building_road_water.tif
    • ahn4_10m_mask_building_road_water.tif
    • ahn4_10m_mask_powerline.tif
    • ahn1_10m_NA_mask.tif
    • ahn2_10m_NA_mask.tif
    • ahn3_10m_NA_mask.tif
    • ahn4_10m_NA_mask.tif

    It contains two mask layers of water surfaces, buildings, and roads for the AHN3 and AHN4 data products, based on the Dutch cadastre data (TOP10NL) from 2018 (corresponding to AHN3) and 2021 (corresponding to AHN4) (https://www.kadaster.nl/zakelijk/producten/geo-informatie/topnl). In these masks, water surfaces, buildings, and roads were merged into one class with pixel value 1; all other pixels have the value 0. There is also a powerline mask generated from the AHN4 dataset at 10 m resolution, where pixels containing powerlines were assigned a value of 1 and the rest NoData. We provide these masks to minimize inaccuracies in the data products caused by human infrastructure and water surfaces. We also provide a mask for each AHN dataset marking where NA values occur, i.e. areas with no vegetation points (the “unclassified” class in the AHN datasets); pixels with NA values were assigned a value of 1 and the rest 0. A sketch of applying these masks follows.
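
    A minimal sketch of applying the masks with the terra package; the mask file names come from the list above, while the metric layer name is an assumed example following the repository's naming pattern.

    ```r
    # Blank out pixels covered by buildings/roads/water or NA areas in a
    # 10 m metric layer, then write the cleaned raster.
    library(terra)

    metric <- rast("ahn4_10m_hp95.tif")                # assumed metric file name
    m_bldg <- rast("ahn4_10m_mask_building_road_water.tif")
    m_na   <- rast("ahn4_10m_NA_mask.tif")

    metric_clean <- ifel(m_bldg == 1 | m_na == 1, NA, metric)
    writeRaster(metric_clean, "ahn4_10m_hp95_masked.tif")
    ```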

    7. Auxiliary data

    (1) Point_density

    • ahn1_10m_point_density.tif
    • ahn2_10m_point_density.tif
    • ahn3_10m_point_density.tif
    • ahn4_10m_point_density.tif

    (2) Pulse_density

    • ahn3_10m_pulse_density.tif
    • ahn4_10m_pulse_density.tif

    (3) Flighttime

    • ahn3_10m_flighttime.tif
    • ahn4_10m_flighttime.tif

    (4) DTM_DSM

    • ahn2_10m_dtm.tif
    • ahn2_10m_dsm.tif
    • ahn3_10m_dtm.tif
    • ahn3_10m_dsm.tif
    • ahn4_10m_dtm.tif
    • ahn4_10m_dsm.tif

    It contains four raster layers representing the point density of each AHN dataset, two raster layers with the pulse density of AHN3 and AHN4, two raster layers with the flight timestamp of AHN3 and AHN4, and six DTM and DSM layers for AHN2–AHN4. All raster layers are provided at 10 m resolution.

    8. Sensitivity analysis

    • Dunes
    • Marsh
    • Grassland
    • Shrubland
    • Woodland
    • Code
    • Figure

    It contains the 25 metrics generated from point clouds with the original and down-sampled pulse densities (original pulse density of the AHN4, pulse density of the AHN3, ½ of the pulse density of the AHN3, and ¼ of the pulse density of AHN3) for each habitat type (i.e. dunes, marsh, grassland, shrubland, and woodland). We also provided the code and the figures generated from this analysis.

    9. Use_case

    (1) Multi-temporal_AHN

    • Data
    • Usecase_multi-temporal_AHN.R

    It contains the input data for the use case processing (the Data folder), including the shapefile of the area (shp folder), the extracted pixel values of six selected LiDAR metrics from AHN1–AHN5 (Metrics folder), and the selected LiDAR metrics of the area (e.g. the Hp95 folder), as well as the R code for data processing (Usecase_multi-temporal_AHN.R).

    (2) Natura2000

    • Data
    • Natura2000_end2021_HABITATCLASS.csv
    • Natura2000_NL_habitat_grouped.csv
    • Usecase_Natura2000.R

    It contains a folder of input data used for the use case (the Data folder), including the shapefile (shp folder) of the Natura 2000 sites in the Netherlands (Nature2000_NL_RDnew.shp), the 100 random sample plots from each habitat type (e.g. woodland_points.shp), and the LiDAR metrics from AHN4 used to demonstrate vegetation structure within each habitat type (the AHN4_metrics folder). The table “Natura2000_end2021_HABITATCLASS.csv” is the original attribute table of the Natura 2000 sites, including the description of habitat classes (column “DESCRIPTION”), the code corresponding to the habitat class (column “HABITATCODE”), the code for the specific site (column “SITECODE”), and the percentage cover of a specific habitat class in one site (column “PERCENTAGECOVER”). The table “Natura2000_NL_habitat_grouped.csv” contains two tabs: one (“Habitatclass”) is a copy of the original attribute table of the Natura 2000 sites in the Netherlands, and the other (“Habitat_class_summary”) groups each site by its dominant habitat class (the class with the highest percentage cover). Different colors indicate different habitat types, corresponding to the colors in the first tab (“Habitatclass”).

  20. ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1

    • figshare.com
    application/gzip
    Updated Jun 29, 2023
    Massimo Andreatta; Santiago Carmona (2023). ProjecTILs murine reference atlas of tumor-infiltrating T cells, version 1 [Dataset]. http://doi.org/10.6084/m9.figshare.12478571.v2
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Massimo Andreatta; Santiago Carmona
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We have developed ProjecTILs, a computational approach to project new data sets into a reference map of T cells, enabling their direct comparison in a stable, annotated system of coordinates. Because new cells are embedded in the same space as the reference, ProjecTILs enables the classification of query cells into annotated, discrete states, but also over a continuous space of intermediate states. By comparing multiple samples over the same map, and across alternative embeddings, the method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) and identifying genetic programs significantly altered in the query compared to a control set or to the reference map. We illustrate the projection of several data sets from recent publications over two cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection.

    To construct the reference TIL atlas, we obtained single-cell gene expression matrices from the following GEO entries: GSE124691, GSE116390, GSE121478, GSE86028; and entry E-MTAB-7919 from ArrayExpress. Data from GSE124691 contained samples from tumor and from tumor-draining lymph nodes, and were therefore treated as two separate datasets. For the TIL projection examples (OVA Tet+, miR-155 KO and Regnase-KO), we obtained the gene expression counts from entries GSE122713, GSE121478 and GSE137015, respectively.

    Prior to dataset integration, single-cell data from individual studies were filtered using TILPRED-1.0 (https://github.com/carmonalab/TILPRED), which removes cells not enriched in T cell markers (e.g. Cd2, Cd3d, Cd3e, Cd3g, Cd4, Cd8a, Cd8b1) and cells enriched in non-T cell genes (e.g. Spi1, Fcer1g, Csf1r, Cd19). Dataset integration was performed using STACAS (https://github.com/carmonalab/STACAS), a batch-correction algorithm based on Seurat 3. For the TIL reference map, we specified 600 variable genes per dataset, excluding cell cycling genes, mitochondrial, ribosomal and non-coding genes, as well as genes expressed in less than 0.1% or more than 90% of the cells of a given dataset. For integration, a total of 800 variable genes were derived as the intersection of the 600 variable genes of individual datasets, prioritizing genes found in multiple datasets and, in case of draws, those derived from the largest datasets. We determined pairwise dataset anchors using STACAS with default parameters, and filtered anchors using an anchor score threshold of 0.8. Integration was performed using the IntegrateData function in Seurat 3, providing the anchor set determined by STACAS, and a custom integration tree to initiate alignment from the largest and most heterogeneous datasets.

    Next, we performed unsupervised clustering of the integrated cell embeddings using the Shared Nearest Neighbor (SNN) clustering method implemented in Seurat 3 with parameters {resolution=0.6, reduction=”umap”, k.param=20}. We then manually annotated individual clusters (merging clusters when necessary) based on several criteria: i) average expression of key marker genes in individual clusters; ii) gradients of gene expression over the UMAP representation of the reference map; iii) gene-set enrichment analysis to determine over- and under-expressed genes per cluster using MAST.

    In order to have access to predictive methods for UMAP, we recomputed PCA and UMAP embeddings independently of Seurat 3 using, respectively, the prcomp function from the base R package “stats” and the “umap” R package (https://github.com/tkonopka/umap).
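
    A hedged sketch of projecting a query dataset onto this reference atlas with the ProjecTILs package, using its documented version-1 entry points; the query object and file names here are placeholders.

    ```r
    # Project a query scRNA-seq dataset into the reference TIL atlas and
    # predict the T cell state of each query cell.
    library(ProjecTILs)

    ref   <- load.reference.map("ref_TILAtlas_mouse_v1.rds")  # this figshare object
    query <- readRDS("my_query_seurat.rds")                   # hypothetical query

    query.projected <- make.projection(query, ref = ref)      # embed into atlas space
    query.projected <- cellstate.predict(ref = ref, query = query.projected)
    plot.projection(ref, query.projected)                     # overlay on reference UMAP
    ```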
