Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is intended to accompany the paper "Designing Types for R, Empirically" (@ OOPSLA'20, link to paper). This data was obtained by running the Typetracer (aka propagatr) dynamic analysis tool (link to tool) on the test, example, and vignette code of a corpus of >400 extensively used R packages.
Specifically, this dataset contains:
function type traces for >400 R packages (raw-traces.tar.gz);
trace data processed into a more readable/usable form (processed-traces.tar.gz), which was used in obtaining results in the paper;
inferred type declarations for the >400 R packages using various strategies to merge the processed traces (see type-declarations-* directories), and finally;
contract assertion data from running the reverse dependencies of these packages and checking function usage against the declared types (contract-assertion-reverse-dependencies.tar.gz).
A preprint of the paper is also included, which summarizes our findings.
Fair warning re: data size: the raw traces, once uncompressed, take up nearly 600 GB. The already processed traces are in the tens of GB, which should be more manageable for a consumer-grade computer.
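A minimal sketch of how one might unpack and inspect the processed traces in R (the archive name comes from the list above; the internal layout is an assumption):

```r
# Minimal sketch: unpack the processed traces and list the extracted files.
# The extraction directory is arbitrary; the internal layout is an assumption.
untar("processed-traces.tar.gz", exdir = "processed-traces")
files <- list.files("processed-traces", recursive = TRUE)
head(files)
```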
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.
IPGOD is large, with millions of data points across up to 40 tables, making many of them too large to open in Microsoft Excel. Furthermore, analysis often requires information from separate tables, which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools, with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scala.
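As a rough illustration of the kind of tool-assisted merging mentioned above, the R sketch below joins two tables with data.table; the file names, column names, and join key are hypothetical placeholders, not the actual IPGOD schema:

```r
# Hypothetical sketch: join two large IPGOD tables outside of Excel using data.table.
library(data.table)

applications <- fread("ipgod_applications.csv")   # hypothetical file name
applicants   <- fread("ipgod_applicants.csv")     # hypothetical file name

# hypothetical join key; consult the IPGOD documentation for the real linking fields
merged <- merge(applications, applicants, by = "application_id", all.x = TRUE)
```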
IP Australia also provides free trials of the IP Data Platform, a cloud-based analytics platform with the capabilities to work with large intellectual property datasets, such as IPGOD, through the web browser, without installing any software.
The following pages can help you gain an understanding of intellectual property administration and processes in Australia to support your analysis of the dataset.
Due to changes in our systems, some tables have been affected.
Data quality has been improved across all tables.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open-access journal aiming to enhance data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
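A minimal R sketch of how the long and wide tables relate (the file extensions and the film ID column name are assumptions; only `fest` is named in the description above):

```r
# Minimal sketch: read both festival-program tables and relate long to wide.
long_df <- read.csv("1_film-dataset_festival-program_long.csv")   # assumed .csv extension
wide_df <- read.csv("1_film-dataset_festival-program_wide.csv")

length(unique(long_df$film_id))   # should equal nrow(wide_df), i.e. 9,348 unique films
nrow(wide_df)

# first sample festival per film, mirroring how `fest` is defined for the wide table
first_fest <- aggregate(fest ~ film_id, data = long_df, FUN = function(x) x[1])
```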
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
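For illustration, the sketch below shows the same two fuzzy matching methods using the stringdist package; the title vectors are invented placeholders, and this is not the archived script itself:

```r
# Illustrative sketch of title matching with the "cosine" and "osa" string distances.
library(stringdist)

core_titles <- c("The Example Film", "Another Title")                    # placeholder data
imdb_titles <- c("Example Film, The", "Anothr Title", "Unrelated Movie") # placeholder data

cos_scores <- stringdistmatrix(core_titles, imdb_titles, method = "cosine", q = 2)
osa_scores <- stringdistmatrix(core_titles, imdb_titles, method = "osa")

# best IMDb candidate per core title under each method (lower distance = better)
apply(cos_scores, 1, which.min)
apply(osa_scores, 1, which.min)
```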
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definitions of variables, such as location, festival name, and festival categories.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘School Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/smeilisa07/number of school teacher student class on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This is my first data analysis project. I obtained this dataset from the Open Data Jakarta website (http://data.jakarta.go.id/), so most of it is in Indonesian, but I have tried to describe it in English in the VARIABLE DESCRIPTION.txt file.
The title of this dataset is jumlah-sekolah-guru-murid-dan-ruang-kelas-menurut-jenis-sekolah-2011-2016, provided as a CSV file, so you can access it easily. The title means "the number of schools, teachers, students, and classrooms by type of school, 2011 - 2016", so the title alone gives a good idea of the contents. The dataset has 50 observations and 8 variables, covering 2011 to 2016.
In general, this dataset is about the quality of education in Jakarta: each year, enrolment at some school levels decreases and at others it increases, but not significantly.
This dataset comes from the Indonesian education authorities and is published as a CSV file by Open Data Jakarta.
Although this data is provided publicly by Open Data Jakarta, I want to keep improving my data science skills, especially in R programming, because I think R is easy to learn and keeps me curious about data science. I am still struggling with the problems below and would appreciate solutions.
Questions:
How can I clean this dataset? I have tried cleaning it, but I am still not sure. You can check the my_hypothesis.txt file, where I try cleaning and visualizing this dataset.
How can I specify a model for machine learning? What steps do you recommend I take?
How should I cluster my dataset if I want the labels to be not numbers but tingkat_sekolah (school level) for every tahun (year) and jenis_sekolah (school type)? You can check the my_hypothesis.txt file.
--- Original source retains full ownership of the source dataset ---
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
From 11 March 2025, the dataset will be updated to include one new field, Date of Deregistration (see the help file for details).

From 7 August 2018, the Company dataset is updated weekly every Tuesday. As a result, the information might not be accurate at the time you check the Company dataset. ASIC Connect updates information in real time, so please consider accessing information on that platform if you need up-to-date information.

***

ASIC is Australia's corporate, markets and financial services regulator. ASIC contributes to Australia's economic reputation and wellbeing by ensuring that Australia's financial markets are fair and transparent, supported by confident and informed investors and consumers.

Australian companies are required to keep their details up to date on ASIC's Company Register. Information contained in the register is made available to the public to search via ASIC's website.

Select data from ASIC's Company Register will be uploaded each week to www.data.gov.au. The data made available will be a snapshot of the register at a point in time. Legislation prescribes the type of information ASIC is allowed to disclose to the public.

The information included in the downloadable dataset is:

* Company Name
* Australian Company Number (ACN)
* Type
* Class
* Sub Class
* Status
* Date of Registration
* Date of Deregistration (available from 11 March 2025)
* Previous State of Registration (where applicable)
* State Registration Number (where applicable)
* Modified since last report – flag to indicate if data has been modified since last report
* Current Name Indicator
* Australian Business Number (ABN)
* Current Name
* Current Name Start Date

Additional information about companies can be found via ASIC's website. Accessing some information may attract a fee. More information about searching ASIC's registers is available on the ASIC website.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description from the SaRNet: A Dataset for Deep Learning Assisted Search and Rescue with Satellite Imagery GitHub repository. (The "Note" below was added by the Roboflow team.)
This is a single-class dataset consisting of tiles of satellite imagery labeled with potential 'targets'. Labelers were instructed to draw boxes around anything they suspected may be a paraglider wing, missing in a remote area of Nevada. Volunteers were shown examples of similar objects already in the environment for comparison. The missing wing, as it was found after 3 weeks, is shown below.
![anomaly](https://michaeltpublic.s3.amazonaws.com/images/anomaly_small.jpg)
The dataset contains the following:
| Set      | Images | Annotations |
|----------|--------|-------------|
| Train    | 1808   | 3048        |
| Validate | 490    | 747         |
| Test     | 254    | 411         |
| Total    | 2552   | 4206        |
The data is in the COCO format and is directly compatible with Faster R-CNN as implemented in Facebook's Detectron2.
Download the data here: sarnet.zip
Or follow these steps
```bash
# download the dataset
wget https://michaeltpublic.s3.amazonaws.com/sarnet.zip

# extract the files
unzip sarnet.zip
```
**Note:** with Roboflow, you can download the data here (original, raw images, with annotations): https://universe.roboflow.com/roboflow-public/sarnet-search-and-rescue/ (download v1, original_raw-images). Download the dataset in COCO JSON format, or another format of choice, and import it into Roboflow after unzipping the folder to get started on your project.
Get started with a Faster R-CNN model pretrained on SaRNet: SaRNet_Demo.ipynb
Source code for the paper is located here: SaRNet_train_test.ipynb
```bibtex
@misc{thoreau2021sarnet,
  title={SaRNet: A Dataset for Deep Learning Assisted Search and Rescue with Satellite Imagery},
  author={Michael Thoreau and Frazer Wilson},
  year={2021},
  eprint={2107.12469},
  archivePrefix={arXiv},
  primaryClass={eess.IV}
}
```
The source data was generously provided by Planet Labs, Airbus Defence and Space, and Maxar Technologies.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List
glmmeg.R: R code demonstrating how to fit a logistic regression model, with a random intercept term, to randomly generated overdispersed binomial data.
boot.glmm.R: R code for estimating P-values by applying the bootstrap to a GLMM likelihood ratio statistic.
Description
glmmeg.R is example R code which shows how to fit a logistic regression model (with or without a random effects term) and use diagnostic plots to check the fit. The code is run on some randomly generated data, which are generated in such a way that overdispersion is evident. This code could be applied directly to your own analyses if you read into R a data.frame called “dataset” which has columns labelled “success” and “failure” (for the number of binomial successes and failures) and “species” (a label for the different rows in the dataset), and where we want to test for the effect of some predictor variable called “location”. In other cases, just change the labels and formula as appropriate. boot.glmm.R extends glmmeg.R by using bootstrapping to calculate P-values in a way that provides better control of Type I error in small samples. It accepts data in the same form as that generated by glmmeg.R.
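A minimal sketch of the kind of model the scripts describe (this is not the archived code; it assumes a data.frame `dataset` with the columns named above):

```r
# Minimal sketch: logistic regression with and without a species random intercept,
# assuming a data.frame `dataset` with columns success, failure, species, location.
library(lme4)

fit_fixed <- glm(cbind(success, failure) ~ location,
                 family = binomial, data = dataset)

fit_mixed <- glmer(cbind(success, failure) ~ location + (1 | species),
                   family = binomial, data = dataset)

summary(fit_mixed)
plot(fit_mixed)   # basic diagnostic plot (fitted values vs. residuals)
```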
The GOES-R Geostationary Lightning Mapper (GLM) Gridded Data Products consist of full-disk extent gridded lightning flash data collected by the Geostationary Lightning Mapper (GLM) on board each of the Geostationary Operational Environmental Satellites R-Series (GOES-R). These satellites are part of the GOES-R series program: a four-satellite series within the National Aeronautics and Space Administration (NASA) and National Oceanic and Atmospheric Administration (NOAA) GOES program. GLM is the first operational geostationary optical lightning detector that provides total lightning data (in-cloud, cloud-to-cloud, and cloud-to-ground flashes). While it detects each of these types of lightning, the GLM is unable to distinguish between them. The GLM GOES L3 dataset files contain gridded lightning flash data over the Western Hemisphere in netCDF-4 format from December 31, 2017 to the present, as this is an ongoing dataset.
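A minimal R sketch for opening one of the gridded netCDF-4 files (the file and variable names are hypothetical; list the real ones with print()):

```r
# Minimal sketch: inspect a GLM gridded netCDF-4 file in R.
library(ncdf4)

nc <- nc_open("glm_gridded_example.nc")    # hypothetical file name
print(nc)                                  # lists the dimensions and variables actually present
# vals <- ncvar_get(nc, "some_variable")   # replace with a variable name listed by print(nc)
nc_close(nc)
```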
AirGapAgent-R 🛡️🧠 A Benchmark for Evaluating Contextual Privacy of Personal LLM Agents
Code Repository: parameterlab/leaky_thoughts
Paper: Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
Original paper that detailed the procedure to create the dataset: AirGapAgent: Protecting Privacy-Conscious Conversational Agents (Bagdasarian et al.)
🧠 What is AirGapAgent-R? AirGapAgent-R is a probing benchmark designed to test contextual privacy in personal LLM agents, reconstructed from the original (unreleased) benchmark used in the AirGapAgent paper (Bagdasarian et al.). It simulates real-world data-sharing decisions where models must reason about whether user-specific data (e.g., age, medical history) should be revealed based on a specific task context.
The procedure used to create the dataset is detailed in Appendix C of our paper (see below).
📦 Dataset Structure
Profiles: 20 synthetic user profiles
Fields per Profile: 26 personal data fields (e.g., name, phone, medication)
Scenarios: 8 task contexts (e.g., doctor appointment, travel booking)
Total Prompts: 4,160 (user profile × scenario × question)
Each example includes:
- The user profile
- The scenario context
- The domain
- The data field that the model should consider whether to share or not
- A ground-truth label (should share / should not share the specific data field)
The prompt field is empty, as the prompt depends on the specific model / reasoning type being used. All available prompts are in the prompts folder of the code repository (parameterlab/leaky_thoughts).
We also include a smaller variant used in some of our experiments (e.g., in RAnA experiments) together with the two datasets used in the swapping experiments detailed in Appendix A.3 of our paper.
🧪 Use Cases Use this dataset to evaluate:
Reasoning trace privacy leakage
Trade-offs between utility (task performance) and privacy
Prompting strategies and anonymization techniques
Susceptibility to prompt injection and reasoning-based attacks
📊 Metrics In the associated paper, we evaluate:
Utility Score: % of correct data sharing decisions
Privacy Score: % of cases with no inappropriate leakage in either answer or reasoning
📥 Clone via Hugging Face CLI

```bash
huggingface-cli download --repo-type dataset parameterlab/leaky_thoughts
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source, object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike much existing statistical software, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions for how they expect the program to behave while handling the data, which can also be stored in the simple object system.

For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, targeted to help inform and guide the work of R users and statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for using each of them in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand their results. The book also covers the different data formats and sources, and how to test for the reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples, ranging from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, it brings statistics and computer programming together for research.
Here you can find information about all models generated by SIMON. Models can be downloaded and re-used for predictions. Each dataset is stored in a separate folder which contains all the models built for that dataset. The file name format is: {modelName}.RData

Each file contains the following information:
- All training-specific model data: folds, tuning parameters, etc.
- All predictions made with the test dataset
- Confusion matrix and all calculated performance measures
- Features and their variable importance scores

Here is an example of the RData file structure:

```
List of 5
 $ model_training_fit :List of 23
  ..$ method    : chr "bagEarth"
  ..$ modelInfo :List of 15
  .. ..$ label     : chr "Bagged MARS"
  .. ..$ library   : chr "earth"
  .. ..$ type      : chr [1:2] "Regression" "Classification"
  .. ..$ parameters:'data.frame': 2 obs. of 3 variables:
  .. .. ..$ parameter: Factor w/ 2 levels "degree","nprune": 2 1
  .. .. ..$ class    : Factor w/ 1 level "numeric": 1 1
  .. .. ..$ label    : Factor w/ 2 levels "#Terms","Product Degree": 1 2
  .. ..$ grid      :function (x, y, len = NULL, search = "grid")
  .. ..$ loop      :function (grid)
  .. ..$ fit       :function (x, y, wts, param, lev, last, classProbs, ...)
  .. ..$ predict   :function (modelFit, newdata, submodels = NULL)
  .. ..$ prob      :function (modelFit, newdata, submodels = NULL)
  .. ..$ predictors:function (x, ...)
  .. ..$ varImp    :function (object, ...)
  .. ..$ levels    :function (x)
  .. ..$ tags      : chr [1:5] "Multivariate Adaptive Regression Splines" "Ensemble Model" "Implicit Feature Selection" "Bagging" ...
  .. ..$ sort      :function (x)
  .. ..$ oob       :function (x)
  ..$ modelType : chr "Classification"
  ..$ results   :'data.frame': 3 obs. of 24 variables:
  .. ..$ degree             : num [1:3] 1 1 1
  .. ..$ nprune             : num [1:3] 2 10 18
  .. ..$ logLoss            : num [1:3] 1.27 1.84 1.66
  .. ..$ AUC                : num [1:3] 0.694 0.75 0.695
  .. ..$ Accuracy           : num [1:3] 0.623 0.698 0.657
  .. ..$ Kappa              : num [1:3] 0.12 0.36 0.262
  .. ..$ F1                 : num [1:3] 0.46 0.614 0.542
  .. ..$ Sensitivity        : num [1:3] 0.217 0.589 0.517
  .. ..$ Specificity        : num [1:3] 0.895 0.765 0.743
  .. ..$ Pos_Pred_Value     : num [1:3] 0.606 0.655 0.6
  .. ..$ Neg_Pred_Value     : num [1:3] 0.636 0.76 0.715
  .. ..$ Detection_Rate     : num [1:3] 0.0864 0.238 0.2098
  .. ..$ Balanced_Accuracy  : num [1:3] 0.556 0.677 0.63
  .. ..$ logLossSD          : num [1:3] 0.188 0.693 0.562
  .. ..$ AUCSD              : num [1:3] 0.19 0.146 0.157
  .. ..$ AccuracySD         : num [1:3] 0.0922 0.1339 0.1302
  .. ..$ KappaSD            : num [1:3] 0.217 0.28 0.279
  .. ..$ F1SD               : num [1:3] 0.099 0.174 0.176
  .. ..$ SensitivitySD      : num [1:3] 0.204 0.246 0.266
  .. ..$ SpecificitySD      : num [1:3] 0.12 0.194 0.182
  .. ..$ Pos_Pred_ValueSD   : num [1:3] 0.369 0.257 0.235
  .. ..$ Neg_Pred_ValueSD   : num [1:3] 0.0711 0.1264 0.137
  .. ..$ Detection_RateSD   : num [1:3] 0.0818 0.1114 0.1167
  .. ..$ Balanced_AccuracySD: num [1:3] 0.0996 0.1358 0.1406
  ..$ pred      :'data.frame': 720 obs. of 8 variables:
  .. ..$ pred    : Factor w/ 2 levels "high","low": 2 1 2 2 2 2 1 2 2 2 ...
  .. ..$ obs     : Factor w/ 2 levels "high","low": 1 1 2 2 2 2 1 1 1 2 ...
  .. ..$ rowIndex: int [1:720] 4 26 34 39 43 47 65 4 26 34 ...
  .. ..$ high    : num [1:720] 0.415 0.822 0.39 0.276 0.135 ...
  .. ..$ low     : num [1:720] 0.585 0.178 0.61 0.724 0.865 ...
  .. ..$ degree  : num [1:720] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ nprune  : num [1:720] 18 18 18 18 18 18 18 2 2 2 ...
  .. ..$ Resample: chr [1:720] "Fold01.Rep1" "Fold01.Rep1" "Fold01.Rep1" "Fold01.Rep1" ...
  ..$ bestTune  :'data.frame': 1 obs. of 2 variables:
  .. ..$ nprune: num 10
  .. ..$ degree: num 1
  ..$ call      : language train.formula(form = factor(outcome) ~ ., data = training, method = model, trControl = trControl, preProcess = NU| truncated
  ..$ dots      : list()
  ..$ metric    : chr "Accuracy"
  ..$ control   :List of 27
  .. ..$ method           : chr "repeatedcv"
  .. ..$ number           : num 10
  .. ..$ repeats          : num 3
  .. ..$ search           : chr "grid"
  .. ..$ p                : num 0.75
  .. ..$ initialWindow    : NULL
  .. ..$ horizon          : num 1
  .. ..$ fixedWindow      : logi TRUE
  .. ..$ skip             : num 0
  .. ..$ verboseIter      : logi FALSE
  .. ..$ returnData       : logi TRUE
  .. ..$ returnResamp     : chr "final"
  .. ..$ savePredictions  : chr "all"
  .. ..$ classProbs       : logi TRUE
  .. ..$ summaryFunction  :function (data, lev = NULL, model = NULL)
  .. ..$ selectionFunction: chr "best"
  .. ..$ preProcOptions   :List of 6
  .. .. ..$ thresh   : num 0.95
  .. .. ..$ ICAcomp  : num 3
  .. .. ..$ k        : num 5
  .. .. ..$ freqCut  : num 19
  .. .. ..$ uniqueCut: num 10
  .. .. ..$ cutoff   : num 0.9
  .. ..$ sampling         : NULL
  .. ..$ index            :List of 30
  .. .. ..$ Fold01.Rep1: int [1:73] 1 2 3 5 6 7 8 9 10 11 ...
  .. .. ..$ Fold02.Rep1: int [1:72] 1 2 3 4 5 6 7 8 9 10 ...
  .. .. ..$ Fold03.Rep1: int [1:72] 1 2 3 4 5 6 7 8 9 10 ...
  .. .. ..$ Fold04.Rep1: int [1:71] 1 2 3 4 5 7 8 9 10 11 ...
  .. .. ..$ Fold05.Rep1: int [1:72] 1 2 3 4 5 6 7 8 9 10 ...
  .. .. ..$ Fold06.Rep1: int [1:72] 1 2 4 6 7 8 9 10 11 12 ...
  .. .. ..$ Fold07.Rep1: int [1:73] 1 3 4 5 6 7 8 9 10 11 ...
....
```
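A minimal sketch of re-using a downloaded model for predictions (the object and file names are assumptions based on the structure shown above):

```r
# Minimal sketch: load a SIMON {modelName}.RData file and predict with the caret model.
library(caret)

load("bagEarth.RData")                        # hypothetical file name
fit <- saved_object$model_training_fit        # `saved_object` is a placeholder for the restored list
preds <- predict(fit, newdata = new_data, type = "prob")   # new_data: your own feature data.frame
head(preds)
```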
Species distribution models (SDMs) are becoming an important tool for marine conservation and management. Yet while there is an increasing diversity and volume of marine biodiversity data for training SDMs, little practical guidance is available on how to leverage distinct data types to build robust models. We explored the effect of different data types on the fit, performance and predictive ability of SDMs by comparing models trained with four data types for a heavily exploited pelagic fish, the blue shark (Prionace glauca), in the Northwest Atlantic: two fishery-dependent (conventional mark-recapture tags, fisheries observer records) and two fishery-independent (satellite-linked electronic tags, pop-up archival tags). We found that all four data types can result in robust models, but differences among spatial predictions highlighted the need to consider ecological realism in model selection and interpretation regardless of data type. Differences among models were primarily attributed ... Please see the README document ("README.md") and the accompanying published article: Braun, C. D., M. C. Arostegui, N. Farchadi, M. Alexander, P. Afonso, A. Allyn, S. J. Bograd, S. Brodie, D. P. Crear, E. F. Culhane, T. H. Curtis, E. L. Hazen, A. Kerney, N. Lezama-Ochoa, K. E. Mills, D. Pugh, N. Queiroz, J. D. Scott, G. B. Skomal, D. W. Sims, S. R. Thorrold, H. Welch, R. Young-Morse, R. Lewison. In press. Building use-inspired species distribution models: using multiple data types to examine and improve model performance. Ecological Applications. Accepted. DOI: < article DOI will be added when it is assigned >.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes for underwater research work in applied ecology analysis of threatened endemic fishes and their natural habitat. For this, we developed codes in Rstudio® script environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), modelmetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbeuttel & Balamura, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).
It is important to run all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance similarity metric because of their adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization for the code used to run the GLM and DOMAIN models:
In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend the use of 10,000 background points when using regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered factors such as the extent of the area and the type of study species to be important for the correct selection of the number of points (pers. obs.). Then, we extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) at the presence and background points (e.g., Hijmans and Elith, 2017).
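A minimal sketch of this step (not the archived Code2 script; `pred_stack` and `presence_xy` are placeholder objects for the predictor raster stack and the presence coordinates):

```r
# Minimal sketch: generate background points and extract predictor values.
library(dismo)
library(raster)

set.seed(1)
bg_xy <- randomPoints(pred_stack, n = 10000)          # 10,000 background points (Barbet-Massin et al., 2012)
pres_vals <- raster::extract(pred_stack, presence_xy)
bg_vals   <- raster::extract(pred_stack, bg_xy)

env_data <- rbind(
  data.frame(presence = 1, pres_vals),
  data.frame(presence = 0, bg_vals)
)
```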
Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold cross-validation method was selected, with the response variable presence assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).
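A minimal sketch of the 75/25 split and the 10-fold training control described above (`env_data` continues from the previous sketch and is a placeholder name):

```r
# Minimal sketch: 75% training / 25% testing split and 10-fold cross-validation control.
library(caret)

env_data$presence <- factor(env_data$presence)

set.seed(1)
in_train  <- createDataPartition(env_data$presence, p = 0.75, list = FALSE)
train_set <- env_data[in_train, ]
test_set  <- env_data[-in_train, ]

train_ctrl <- trainControl(method = "cv", number = 10)
```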
After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), where we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and a cross iteration of 5,000 repetitions (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The obtained plots were the partial dependence plots, as a function of each predictor variable.
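A rough sketch of this step (not the archived Code3/Code4 scripts); the Gaussian distribution and 5,000 trees follow the description above, while the cv.folds value is an assumption standing in for the 4-point validation interval:

```r
# Rough sketch: GBM fit and relative contribution of predictors.
library(gbm)

gbm_fit <- gbm(
  as.numeric(as.character(presence)) ~ .,
  data = train_set,
  distribution = "gaussian",
  n.trees = 5000,
  cv.folds = 4          # assumption: loosely mirrors the 4-point validation interval above
)

summary(gbm_fit)           # relative contribution (influence) of each variable
plot(gbm_fit, i.var = 1)   # partial dependence plot for the first predictor
```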
Subsequently, the correlation of the variables is assessed by Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity between variables (Guisan & Hofer, 2003). It is recommended to use a bivariate correlation threshold of ±0.70 to discard highly correlated variables (e.g., Awan et al., 2021).
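A minimal sketch of the correlation check (not the archived Code5 script; `train_set` is the placeholder from the sketches above):

```r
# Minimal sketch: Pearson correlation matrix and the ±0.70 multicollinearity screen.
library(corrplot)

pred_only <- train_set[, setdiff(names(train_set), "presence")]
cor_mat   <- cor(pred_only, method = "pearson", use = "complete.obs")

corrplot(cor_mat, method = "number")

# flag predictor pairs beyond the |0.70| threshold recommended above
which(abs(cor_mat) > 0.70 & abs(cor_mat) < 1, arr.ind = TRUE)
```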
Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing) (Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the p-significance value of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree, to obtain linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots included the probability of occurrence and the values of continuous variables or the categories of discrete variables. The points of the presence and background training group are also included.
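A minimal sketch of the per-variable GLM step (not the archived Code7 script; `bio1` is a placeholder predictor name):

```r
# Minimal sketch: binomial GLM with linear and quadratic terms for one predictor,
# p-value screening at alpha = 0.05, and an ecological response curve.
glm_bio1 <- glm(presence ~ poly(bio1, 2), family = binomial, data = train_set)
summary(glm_bio1)          # keep the variable if its p-value is <= 0.05

new_vals <- data.frame(bio1 = seq(min(train_set$bio1), max(train_set$bio1), length.out = 100))
new_vals$prob <- predict(glm_bio1, newdata = new_vals, type = "response")
plot(new_vals$bio1, new_vals$prob, type = "l",
     xlab = "bio1", ylab = "Probability of occurrence")
```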
On the other hand, a global GLM was also run, from which the generalized model is evaluated by means of a 2 x 2 contingency matrix including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected an arbitrary boundary of 0.5 to obtain better modeling performance and avoid a high percentage of bias from type I (omission) or type II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).
Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).
| Model      | Validation set: True | Validation set: False |
|------------|----------------------|-----------------------|
| Presence   | A                    | B                     |
| Background | C                    | D                     |
We then calculated the overall accuracy and True Skill Statistic (TSS) metrics. The first is used to assess the proportion of correctly predicted cases, while the second corrects this proportion for random performance (Olden and Jackson, 2002), giving equal importance to sensitivity and specificity (Fielding and Bell, 1997; Allouche et al., 2006).
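For reference, the sketch below computes both metrics directly from the 2 x 2 contingency matrix in Table 1 (A, B, C, D as defined above), following Allouche et al. (2006):

```r
# Minimal sketch: overall accuracy and TSS from the contingency matrix of Table 1.
overall_accuracy <- function(A, B, C, D) (A + C) / (A + B + C + D)

tss <- function(A, B, C, D) {
  sensitivity <- A / (A + D)   # true presences among all actual presences
  specificity <- C / (B + C)   # true backgrounds among all actual backgrounds
  sensitivity + specificity - 1
}
```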
The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background group subdivided into 75% training and 25% test, each. We only included the presence training subset and the predictor variables stack in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
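A minimal sketch of the DOMAIN step (not the archived Code8 script; `pred_stack` and `pres_train_xy` are placeholder names for the predictor stack and the presence training coordinates):

```r
# Minimal sketch: DOMAIN (Gower's metric) model and habitat suitability surface.
library(dismo)
library(raster)

dom_fit <- domain(pred_stack, pres_train_xy)   # DOMAIN, Carpenter et al. (1993)
dom_map <- predict(pred_stack, dom_fit)        # suitability surface over the study area
plot(dom_map)
```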
Regarding the model evaluation and estimation, we selected the following estimators:
1) Partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).
2) ROC/AUC curve for model validation, where an optimal performance threshold is estimated to have an expected confidence of 75% to 99% probability (De Long et al., 1988).
The Mechanical MNIST – Distribution Shift dataset contains the results of finite element simulation of heterogeneous material subject to large deformation due to equibiaxial extension at a fixed boundary displacement of d = 7.0. The result provided in this dataset is the change in strain energy after this equibiaxial extension. The Mechanical MNIST dataset is generated by converting the MNIST bitmap images (28x28 pixels) with range 0 - 255 to 2D heterogeneous blocks of material (28x28 unit square) with varying modulus in the range 1 - s. The original bitmap images are sourced from the MNIST Digits dataset (http://www.pymvpa.org/datadb/mnist.html), which corresponds to Mechanical MNIST – MNIST, and the EMNIST Letters dataset (https://www.nist.gov/itl/products-and-services/emnist-dataset), which corresponds to Mechanical MNIST – EMNIST Letters.

The Mechanical MNIST – Distribution Shift dataset is specifically designed to demonstrate three types of data distribution shift: (1) covariate shift, (2) mechanism shift, and (3) sampling bias, for all of which the training and testing environments are drawn from different distributions. For each type of data distribution shift, we have one dataset generated from the Mechanical MNIST bitmaps and one from the Mechanical MNIST – EMNIST Letters bitmaps.

For the covariate shift dataset, the training dataset is collected from two environments (2500 samples from s = 100, and 2500 samples from s = 90), and the test data is collected from two additional environments (2000 samples from s = 75, and 2000 samples from s = 50). For the mechanism shift dataset, the training data is identical to the training data in the covariate shift dataset (i.e., 2500 samples from s = 100, and 2500 samples from s = 90), and the test datasets are from two additional environments (2000 samples from s = 25, and 2000 samples from s = 10). For the sampling bias dataset, datasets are collected such that each datapoint is selected from the broader MNIST and EMNIST input bitmap selection with a probability controlled by a parameter r. The training data is collected from two environments (9800 from r = 15, and 200 from r = -2), and the test data is collected from three different environments (2000 from r = -5, 2000 from r = -10, and 2000 from r = 1). Thus, in the end we have 6 benchmark datasets with multiple training and testing environments in each.

The enclosed document “folder_description.pdf” shows the organization of each zipped folder provided on this page. The code to reproduce these simulations is available on GitHub (https://github.com/elejeune11/Mechanical-MNIST/blob/master/generate_dataset/Equibiaxial_Extension_FEA_test_FEniCS.py).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Daily number of passed, failed and total general written road rules tests throughout Queensland.

Please note: the Test Type field in this dataset has been changed from “WRITTEN TEST GENERAL” to “DRIVER PRE LEARNER KNOWLEDGE”.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Linear models are applied widely to analyse empirical data. Modern software allows implementation of linear models with a few clicks or lines of code. While convenient, this increases the risk of ignoring essential assessment steps. Indeed, inappropriate application of linear models is an important source of inaccurate statistical inference. Despite extensive guidance and detailed demonstration of exemplary analyses, many users struggle to implement and assess their own models. To fill this gap, we present a versatile R-workflow template that facilitates (Generalized) Linear (Mixed) Model analyses. The script guides users from data exploration through model formulation, assessment and refinement to the graphical and numerical presentation of results. The workflow accommodates a variety of data types, distribution families, and dependency structures that arise from hierarchical sampling. To apply the routine, minimal coding skills are required for data preparation, naming of variables of interest, linear model formulation, and settings for summary graphs. Beyond that, default functions are provided for visual data exploration and model assessment. Focused on graphs, model assessment offers qualitative feedback and guidance on model refinement, pointing to more detailed or advanced literature where appropriate. With this workflow, we hope to contribute to research transparency, comparability, and reproducibility.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset is longitudinal in nature, comprising data from school years (2007/2008-2010/2011) following students in grade 1 to grade 4. Measures were chosen to provide a wide array of both reading and writing measures, encompassing reading and writing skills at the word, sentence, and larger passage or text levels. Participants were tested on all measures once a year, approximately one year apart. Participants were first grade students in the fall of 2007 whose parents consented to participate in the longitudinal study. Participants attended six different schools in a metropolitan school district in Tallahassee, Florida. Data was gathered by trained testers during thirty to sixty minute sessions in a quiet room designated for testing at the schools. The test battery was scored in a lab by two or more raters and discrepancies in the scoring were resolved by an additional rater.
Reading Measures

Decoding Measures. The Woodcock Reading Mastery Tests-Revised (WRMT-R; Woodcock, 1987): Word Attack subtest was used to assess accuracy for decoding non-words. The Test of Word Reading Efficiency (TOWRE; Torgesen, Wagner, & Rashotte, 1999): Phonetic Decoding Efficiency (PDE) subtest was also used to assess pseudo-word reading fluency and accuracy. Both subtests were used to form a word-level decoding latent factor. The WRMT-R Word Attack subtest consists of a list of non-words that are read out loud by the participant. The lists start off with letters and become increasingly more difficult, up to complex non-words. Testing is discontinued after six consecutive incorrect items. The median reliability is reported to be .87 for Word Attack (Woodcock, McGrew, & Mather, 2001). The TOWRE PDE requires accurately reading as many non-words as possible in 45 seconds. The TOWRE test manual reports test-retest reliability to be .90 for the PDE subtest.

Sentence Reading Measures. Two forms of the Test of Silent Reading Efficiency and Comprehension (TOSREC, forms A and D; Wagner et al., 2010) were used as measures of silent reading fluency. Students were required to read brief statements (e.g., “a cow is an animal”) and verify the truthfulness of the statement by circling yes or no. Students are given three minutes to read and answer as many sentences as possible. The mean alternate-forms reliability for the TOSREC ranges from .86 to .95.
Reading Comprehension Measures. The Woodcock-Johnson-III (WJ-III) Passage Comprehension subtest (Woodcock et al., 2001) and the Woodcock Reading Mastery Test-Revised Passage Comprehension subtest (WRMT-R; Woodcock, 1987) were used to provide two indicators of reading comprehension. For both of the passage comprehension subtests, students read brief passages to identify missing words. Testing is discontinued when the ceiling is reached (six consecutive wrong answers or the last page is reached). According to the test manuals, test-retest reliability is reported to be above .90 for WRMT-R, and the median reliability coefficient for WJ-III is reported to be .92.
Spelling Measures. The Spelling subtest from the Wide Range Achievement Test-3 (WRAT-3; Wilkinson, 1993) and the Spelling subtest from the Wechsler Individual Achievement Test-II (WIAT-II; The Psychological Corporation, 2002) were used to form a spelling factor. Both spelling subtests required students to spell words of increasing difficulty from dictation. The ceiling for the WRAT-3 Spelling subtest is misspelling ten consecutive words. If the first five words are not spelled correctly, the student is required to write his or her name and a series of letters and then continue spelling until they have missed ten consecutive items. The ceiling for the WIAT-II is misspelling six consecutive words. The reliability of the WRAT-3 Spelling subtest is reported to be .96 and the reliability of the WIAT-II Spelling subtest is reported to be .94.
Written Expression Measures. The Written Expression subtest from the Wechsler Individual Achievement Test-II (WIAT-II; The Psychological Corporation, 2002) was administered. Written Expression score is based on a composite of Word Fluency and Combining Sentences in first and second grades and a composite of Word Fluency, Combining Sentences, and Paragraph tasks in third grade. In this study the Combining Sentences task was used as an indicator of writing ability at the sentence level. For this task students are asked to combine various sentences into one meaningful sentence. According to the manual, the test-retest reliability coefficient for the Written Expression subtest is .86.
Writing Prompts. A writing composition task was also administered. Participants were asked to write a passage on a topic provided by the tester. Students were instructed to scratch out any mistakes and were not allowed to use erasers. The task was administered in groups and lasted 10 minutes. The passages for years 1 and 2 required expository writing and the passage for year 3 required narrative writing. The topics were as follows: choosing a pet for the classroom (year 1), favorite subject (year 2), a day off from school (year 3). The writing samples were transcribed into a computer database by two trained coders. In order to submit the samples to Coh-Metrix (described below) the coders also corrected the samples. Samples were corrected once for spelling and punctuation using a hard criterion (i.e., words were corrected individually for spelling errors regardless of the context, and run-on sentences were broken down into separate sentences). In addition, the samples were completely corrected using the soft criterion: corrections were made for spelling based on context (e.g., correcting there for their), punctuation, grammar, usage, and syntax (see Appendix A for examples of original and corrected transcripts). The samples that were corrected only for spelling and punctuation using the hard criterion were used for several reasons: (a) developing readers make many spelling errors which make their original samples illegible, and (b) the samples that were completely corrected do not stay true to the child’s writing ability. Accuracy of writing was not reflected in the corrected samples because of the elimination of spelling errors. However, as mentioned above, spelling ability was measured separately. Data on compositional fluency and complexity were obtained from Coh-Metrix. Compositional fluency refers to how much writing was done and complexity refers to the density of writing and length of sentences (Berninger et al., 2002; Wagner et al., 2010).
Coh-Metrix Measures. The transcribed samples were analyzed using Coh-Metrix (McNamara et al., 2005; Graesser et al., 2004). Coh-Metrix is a computer scoring system that analyzes over 50 measures of coherence, cohesion, language, and readability of texts. Appendix B contains the list of variables provided by Coh-Metrix. In the present study, the variables were broadly grouped into the following categories: a) syntactic, b) semantic, c) compositional fluency, d) frequency, e) readability and f) situation model. Syntactic measures provide information on pronouns, noun phrases, verb and noun constituents, connectives, type-token ratio, and number of words before the main verb. Connectives are words such as so and because that are used to connect clauses. Causal, logical, additive and temporal connectives indicate cohesion and logical ordering of ideas. Type-token ratio is the ratio of unique words to the number of times each word is used. Semantic measures provide information on nouns, word stems, anaphors, content word overlap, Latent Semantic Analysis (LSA), concreteness, and hypernyms. Anaphors are words (such as pronouns) used to avoid repetition (e.g., she refers to a person that was previously described in the text). LSA refers to how conceptually similar each sentence is to every other sentence in the text. Concreteness refers to the level of imaginability of a word, or the extent to which words are not abstract. Concrete words have more distinctive features and can be easily pictured in the mind. Hypernym is also a measure of concreteness and refers to the conceptual taxonomic level of a word (for example, chair has 7 hypernym levels: seat -> furniture -> furnishings -> instrumentality -> artifact -> object -> entity). Compositional fluency measures include the number of paragraphs, sentences and words, as well as their average length and the frequencies of content words. Frequency indices provide information on the frequency of content words, including several transformations of the raw frequency score. Content words are nouns, adverbs, adjectives, main verbs, and other categories with rich conceptual content. Readability indices are related to fluency and include two traditional indices used to assess difficulty of text: Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Finally, situation model indices describe what the text is about, including causality of events and actions, intentionality of performing actions, tenses of actions and spatial information. Because Coh-Metrix has not been widely used to study the development of writing in primary grade children (Puranik et al., 2010), the variables used in the present study were determined in an exploratory manner described below. Out of the 56 variables, three were used in the present study: total number of words, total number of sentences and average sentence length (or average number of words per sentence). Nelson and Van Meter (2007) report that total word productivity is a robust measure of developmental growth in writing. Therefore, indicators for a paragraph-level factor included total number of words and total number of sentences. Average words per sentence was used as an indicator for a latent sentence-level factor, along with the WIAT-II Combining Sentences task.
Following the Sunshine State Standards, students are required to take the Florida
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
When studying the impacts of climate change, there is a tendency to select climate data from a small set of arbitrary time periods or climate windows (e.g., spring temperature). However, these arbitrary windows may not encompass the strongest periods of climatic sensitivity and may lead to erroneous biological interpretations. Therefore, there is a need to consider a wider range of climate windows to better predict the impacts of future climate change. We introduce the R package climwin that provides a number of methods to test the effect of different climate windows on a chosen response variable and compare these windows to identify potential climate signals. climwin extracts the relevant data for each possible climate window and uses this data to fit a statistical model, the structure of which is chosen by the user. Models are then compared using an information criteria approach. This allows users to determine how well each window explains variation in the response variable and compare model support between windows. climwin also contains methods to detect type I and II errors, which are often a problem with this type of exploratory analysis. This article presents the statistical framework and technical details behind the climwin package and demonstrates the applicability of the method with a number of worked examples.
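As a schematic illustration of the window-comparison idea described above (this is not the climwin API; `biol` and `clim` are placeholder data frames with the columns used below):

```r
# Schematic sketch: fit a model for each candidate climate window and compare by AICc.
library(MuMIn)   # for AICc

fit_window <- function(open, close, biol, clim) {
  # mean climate over the window [open, close] days before each biological record
  x <- vapply(seq_along(biol$date), function(i) {
    d <- biol$date[i]
    mean(clim$temp[clim$date >= d - open & clim$date <= d - close])
  }, numeric(1))
  lm(biol$response ~ x)
}

windows <- expand.grid(open = seq(10, 100, 10), close = seq(0, 90, 10))
windows <- subset(windows, open > close)
windows$AICc <- apply(windows, 1, function(w)
  AICc(fit_window(w["open"], w["close"], biol, clim)))

windows[which.min(windows$AICc), ]   # best-supported climate window
```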
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Data points present in this dataset were obtained through the following steps: To assess the secretion efficiency of the constructs, 96 colonies from the selection plates were evaluated using the workflow presented in Figure Workflow. We picked transformed colonies and cultured them in 400 μL TAP medium for 7 days in deep-well plates (Corning Axygen®, No.: PDW500CS, Thermo Fisher Scientific Inc., Waltham, MA), covered with Breathe-Easy® (Sigma-Aldrich®). Cultivation was performed on a rotary shaker set to 150 rpm, under constant illumination (50 μmol photons/m2s). Then, 100 μL samples were transferred to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA) and fluorescence was measured using an Infinite® M200 PRO plate reader (Tecan, Männedorf, Switzerland), at excitation 575/9 nm and emission 608/20 nm. Supernatant samples were obtained by spinning the deep-well plates at 3000 × g for 10 min and transferring 100 μL from each well to a clear-bottom 96-well plate (Corning Costar, Tewksbury, MA, USA), followed by fluorescence measurement. To compare the constructs, R version 3.3.3 was used to perform one-way ANOVA (with Tukey's test); for testing statistical hypotheses, the significance level was set at 0.05. Graphs were generated in RStudio v1.0.136. The code is deposited herein.
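A minimal sketch of the statistical comparison described above (this is not the archived ANOVA_Turkey_Sub.R script; `fluor` is a placeholder data.frame with columns `construct` and `rfu`):

```r
# Minimal sketch: one-way ANOVA with Tukey's HSD at a 0.05 significance level.
fit <- aov(rfu ~ construct, data = fluor)
summary(fit)                       # one-way ANOVA table

TukeyHSD(fit, conf.level = 0.95)   # pairwise comparisons between constructs
```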
Info
ANOVA_Turkey_Sub.R -> code for the ANOVA analysis in R 3.3.3
barplot_R.R -> code to generate the bar plot in R 3.3.3
boxplotv2.R -> code to generate the boxplot in R 3.3.3
pRFU_+_bk.csv -> relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
sup_+_bl.csv -> supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
sup_raw.csv -> supernatant mCherry fluorescence dataset of 96 colonies for each construct.
who_+_bl2.csv -> whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
who_raw.csv -> whole culture mCherry fluorescence dataset of 96 colonies for each construct.
who_+_Chlo.csv -> whole culture chlorophyll fluorescence dataset of 96 colonies for each construct.
Anova_Output_Summary_Guide.pdf -> Explain the ANOVA files content
ANOVA_pRFU_+_bk.doc -> ANOVA of relative supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
ANOVA_sup_+_bk.doc -> ANOVA of supernatant mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
ANOVA_who_+_bk.doc -> ANOVA of whole culture mCherry fluorescence dataset of positive colonies, blanked with parental wild-type cc1690 cell of Chlamydomonas reinhardtii
ANOVA_Chlo.doc -> ANOVA of whole culture chlorophyll fluorescence of all constructs, plus average and standard deviation values.
Consider citing our work.
Molino JVD, de Carvalho JCM, Mayfield SP (2018) Comparison of secretory signal peptides for heterologous protein expression in microalgae: Expanding the secretion portfolio for Chlamydomonas reinhardtii. PLoS ONE 13(2): e0192433. https://doi.org/10.1371/journal.pone.0192433