Terms: https://paper.erudition.co.in/terms
Question Paper Solutions of chapter Subsetting of Data Analysis with R, 2nd Semester, Bachelor of Computer Application 2023-2024
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A subset of OpenWebText, an open-source recreation of OpenAI's internal WebText corpus. This subset contains ~2 million documents, mainly in English, scraped from the web. The text is highly unstructured and not necessarily clean.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A subset of Common Crawl, extracted from the Colossal Clean Crawled Corpus (C4) dataset with the additional constraint that the extracted text safely encodes to ASCII. A Unigram tokenizer with a vocabulary of 12.228k tokens is provided, along with pre-tokenized data.
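A minimal sketch in R of the ASCII-safety constraint described above (the dataset's own extraction pipeline is not published here; this only illustrates the filtering idea):

# Keep only strings whose code points all fall in the 7-bit ASCII range
texts <- c("plain ascii text", "caf\u00e9 au lait")
is_ascii <- vapply(texts, function(x) all(utf8ToInt(x) <= 127L), logical(1))
ascii_only <- texts[is_ascii]  # drops the accented example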
Source code for the manuscript "Characterizing Variability and Uncertainty for Parameter Subset Selection in PBPK Models". This R code generates the results presented in the manuscript; the zip folder contains PBPK model files (for chloroform and DCM) and corresponding scripts to compile the models, generate human equivalent doses, and run the sensitivity analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Sloan Digital Sky Survey (SDSS) is a comprehensive survey of the northern sky. This dataset contains a subset of that survey: 60,247 objects classified as galaxies. It includes a CSV file with a collection of information and a set of files for each object, namely JPG image files, FITS files, and spectra data. This dataset is used to train and explore the astromlp-models collection of deep learning models for galaxy characterisation.
The dataset includes a CSV data file where each row is an object from the SDSS database, with the following columns (note that some data may not be available for all objects):
Besides the CSV file, a set of directories is included in the dataset. In each directory you'll find files named after the objid column from the CSV file, containing the corresponding data. The following directory tree is available:
sdss-gs/
├── data.csv
├── fits
├── img
├── spectra
└── ssel
Each directory contains:
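As a hedged illustration (the .jpg extension and per-object file naming are assumptions here; consult the dataset documentation), pairing each row of data.csv with its per-object image file in R could look like:

# Read the object table and build a path to each object's JPG image
df <- read.csv("sdss-gs/data.csv")
img_paths <- file.path("sdss-gs", "img", paste0(df$objid, ".jpg"))
have_img <- file.exists(img_paths)  # some objects may be missing some files
sum(have_img)                       # number of objects with an image on disk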
Terms: https://paper.erudition.co.in/terms
Question Paper Solutions of chapter Loop Functions of Data Analysis with R, 2nd Semester, Bachelor of Computer Application 2023-2024
This dataset provides geospatial location data and scripts used to analyze the relationship between MODIS-derived NDVI and solar and sensor angles in a pinyon-juniper ecosystem in Grand Canyon National Park. The data are provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and scripts allow users to replicate, test, or further explore the results.

The file GrcaScpnModisCellCenters.csv contains locations (latitude-longitude) of all the 250-m MODIS (MOD09GQ) cell centers associated with the Grand Canyon pinyon-juniper ecosystem that the Southern Colorado Plateau Network (SCPN) is monitoring through its land surface phenology and integrated upland monitoring programs. The file SolarSensorAngles.csv contains MODIS angle measurements for the pixel at the phenocam location plus a random 100-point subset of pixels within the GRCA-PJ ecosystem. The script files (folder: 'Code') consist of 1) a Google Earth Engine (GEE) script used to download MODIS data through the GEE JavaScript interface, and 2) an R script used to calculate derived variables and to test relationships between solar and sensor angles and NDVI.

The file Fig_8_NdviSolarSensor.JPG shows NDVI dependence on solar and sensor geometry, demonstrated both for a single pixel/year and for multiple pixels over time. (Left) MODIS NDVI versus solar-to-sensor angle for the Grand Canyon phenocam location in 2018, the year for which there is corresponding phenocam data. (Right) Modeled r-squared values by year for 100 randomly selected MODIS pixels in the SCPN-monitored Grand Canyon pinyon-juniper ecosystem. The model for forward-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle. The model for back-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle + sensor zenith angle. Boxplots show interquartile ranges; whiskers extend to the 10th and 90th percentiles. The horizontal line marking the average median value for forward-scatter r-squared (0.835) is nearly indistinguishable from the back-scatter line (0.833).

The dataset folder also includes supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (e.g. the folders Rproj.user and packrat, and the files .RData and PhenocamPR.Rproj). The empty folder GEE_DataAngles is included so that the user can save the data files from the Google Earth Engine scripts to this location, where they can then be incorporated into the R-processing scripts without needing to change folder names. To use the packrat information to replicate the exact processing steps of this study, the user should refer to the packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may use the phenopix package documentation and the description/references provided in the associated journal article to achieve the same results using newer packages or other software programs.
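A hedged R sketch of the two models quoted above (column names and file layout here are hypothetical; the real variable names live in the scripts in the 'Code' folder):

# Fit the forward-scatter and back-scatter models with base R
ang <- read.csv("SolarSensorAngles.csv")  # assumed NDVI and angle columns
fwd <- lm(log(ndvi) ~ solar_sensor_angle,
          data = subset(ang, scatter == "forward"))
bck <- lm(log(ndvi) ~ solar_sensor_angle + sensor_zenith,
          data = subset(ang, scatter == "back"))
summary(fwd)$r.squared  # r-squared values like those boxplotted in Fig. 8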
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a clean subset of the data created by the OpenML R Bot, which executed benchmark experiments on the binary classification tasks of the OpenML100 benchmarking suite with six R algorithms: glmnet, rpart, kknn, svm, ranger and xgboost. The hyperparameters of these algorithms were drawn randomly. In total the data contains more than 2.6 million benchmark experiments and can be used by other researchers. The subset was created by taking 500,000 results for each learner (except kknn, for which only 1,140 results are available). The CSV file for each learner is a table with one row per benchmark experiment, containing: OpenML data ID, hyperparameter values, performance measures (AUC, accuracy, Brier score), runtime, scimark (runtime reference of the machine), and some meta-features of the dataset. OpenMLRandomBotResults.RData (format for R) contains all data in separate tables for the results, the hyperparameters, the meta-features, the runtime, the scimark results and the reference results.
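A minimal sketch of reading these files in R (the per-learner CSV file name is an assumption; the .RData name is taken from the description above):

load("OpenMLRandomBotResults.RData")   # loads the result tables into the workspace
ls()                                   # list the objects that were loaded
glmnet_runs <- read.csv("glmnet.csv")  # assumed name; one row per benchmark experiment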
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This code is not mine. The dataset provided here is a balanced subset derived from the original dataset, and I do not claim ownership over the original data.
The CODE dataset was collected by the Telehealth Network of Minas Gerais (TNMG) in the period between 2010 and 2016. TNMG is a public telehealth system assisting 811 out of the 853 municipalities in the state of Minas Gerais, Brazil.
The CODE 15% dataset was obtained by stratified sampling from the CODE dataset. This subset is described in, and used to assess model performance in:
"Deep neural network estimated electrocardiographic-age as a mortality predictor"
Emilly M Lima, Antônio H Ribeiro, Gabriela MM Paixão, Manoel Horta Ribeiro, Marcelo M Pinto Filho, Paulo R Gomes, Derick M Oliveira, Ester C Sabino, Bruce B Duncan, Luana Giatti, Sandhi M Barreto, Wagner Meira Jr, Thomas B Schön, Antonio Luiz P Ribeiro. medRxiv (2021). https://www.doi.org/10.1101/2021.02.19.21251232
This dataset is a subset of the CODE 15% dataset, obtained by random sampling from the negative class while keeping all observations of the positive class, creating a balanced dataset without the need to handle class imbalance.
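A minimal sketch of that balancing step (toy data with a hypothetical binary 'positive' column; the authors' actual sampling code is not included here):

set.seed(1)  # reproducible draw
labels <- data.frame(exam_id = 1:1000, positive = rbinom(1000, 1, 0.1))
pos <- subset(labels, positive == 1)
neg <- subset(labels, positive == 0)
neg_sample <- neg[sample(nrow(neg), nrow(pos)), ]  # undersample the negatives
balanced <- rbind(pos, neg_sample)                 # equal class counts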
The code15_hdf5 folder contains the exams and labels for the entire CODE 15% dataset. The code15_wfdb folder contains the exam records file in .dat format.
An additional file (signals_features.csv) is provided, containing handcrafted features from the ECG records (lead II) related to the P, Q, R, S, and T waves. Features such as P wave duration, PR interval, PR segment, QRS duration, ST segment, and ST slope were computed by first extracting all the points using the neurokit2 Python library and then aggregating them for each record ID using descriptive statistics. Heart rate variability features were also included along with the P, Q, R, S, and T wave features.
Link to the original dataset: https://doi.org/10.5281/zenodo.4916206
Licences: ELRA End User Licence (https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf) and ELRA VAR Licence (https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf)
The Dutch CELEX data is derived from R.H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM), Release 2, Dutch Version 3.1, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995.

Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For the Dutch data, frequencies have been disambiguated on the basis of the 42.4m-word Dutch Instituut voor Nederlandse Lexicologie text corpora.

To make for greater compatibility with other operating systems, the databases have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files, which can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files.

This database can be divided into different subsets:
· orthography: with or without diacritics, with or without word division positions, alternative spellings, number of letters/syllables;
· phonology: phonetic transcriptions with syllable boundaries or primary and secondary stress markers, consonant-vowel patterns, number of phonemes/syllables, alternative pronunciations, frequency per phonetic syllable within words;
· morphology: division into stems and affixes, flat or hierarchical representations, stems and their inflections;
· syntax: word class, subcategorisations per word class;
· frequency of the entries: disambiguated for homographic lemmata.
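As a hedged illustration of linking files via those identity numbers (the file names, separators and column names below are placeholders, not the real CELEX layout; see the CELEX user guide):

# Join two hypothetical CELEX-style ASCII tables on a shared IdNum key
orth <- read.delim("orthography.txt")  # placeholder file name
phon <- read.delim("phonology.txt")    # placeholder file name
linked <- merge(orth, phon, by = "IdNum")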
To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the CSV into R. Here is the code for how to do this:
df <- read.csv("uber.csv")               # load the cleaned Uber data
df_black <- subset(df, name == "Black")  # keep only the 'Black' car type
write.csv(df_black, "nameofthefileyouwanttosaveas.csv", row.names = FALSE)
getwd()                                  # shows the folder the file was saved in
Terms: https://paper.erudition.co.in/terms
Question Paper Solutions of chapter Getting started, Background of Data Analysis with R, 2nd Semester, Bachelor of Computer Application 2023-2024
Data from the IFLS, merged across waves; most outcomes are taken from wave 5. Includes birth order, family structure, Big 5 personality, intelligence tests, and risk lotteries.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
[truncated]
This dataset was automatically described using the codebook R package (version 0.8.2).
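A minimal sketch of reproducing the missing-value counts in such a table (toy data; the codebook package itself renders a much richer report):

df <- data.frame(birth_order = c(1, 2, NA), big5_open = c(3.2, NA, NA))
miss <- sapply(df, function(x) sum(is.na(x)))  # missing values per variable
data.frame(variable = names(miss), n_missing = as.integer(miss))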
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This index was compiled by Miss Mollie Bentley from various records she used relating to the police. These include: almanac listings, Colonial Secretary's Office records, Police Gazettes, various police department occurrence books and letter books, police journals, government gazettes, estimates, York police records, etc.

Entry is by name of policeman. The information given varies but is usually about appointments, promotions, retirements, transfers, etc.

The Western Australian Biographical Index (WABI) is a heavily used resource at the State Library of Western Australia. A recent generous contribution by the Friends of Battye Library (FOBS) has enabled SLWA to have the original handwritten index cards scanned and later transcribed.

The dataset contains several CSV files with data describing card number, card text, and a URL link to an image of the original handwritten card.

The transcription was crowd-sourced and we are aware that there are some data quality issues, including:

* Some cards are missing
* Transcripts are crowdsourced so may contain spelling errors and possibly missing information
* Some cards are crossed out; some of these are included in the collection and some are not
* Some of the cards contain relevant information on the back (usually children of the person mentioned); this info should be on the next consecutive card
* As the information is an index, collected in the 1970s from print material, it is incomplete. It is also unreferenced.

It is still a very valuable dataset as it contains a wealth of information about early settlers in Western Australia. It is of particular interest to genealogists and historians.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The zip-file contains the data and code accompanying the paper 'Effects of nutrient enrichment on freshwater macrophyte and invertebrate abundance: A meta-analysis'. Together, these files should allow for the replication of the results.
The 'raw_data' folder contains the 'MA_database.csv' file, which contains the extracted data from all primary studies that are used in the analysis. Furthermore, this folder contains the file 'MA_database_description.txt', which gives a description of each data column in the database.
The 'derived_data' folder contains the files that are produced by the R-scripts in this study and used for data analysis. The 'MA_database_processed.csv' and 'MA_database_processed.RData' files contain the converted raw database that is suitable for analysis. The 'DB_IA_subsets.RData' file contains the 'Individual Abundance' (IA) data subsets based on taxonomic group (invertebrates/macrophytes) and inclusion criteria. The 'DB_IA_VCV_matrices.RData' file contains the variance-covariance (VCV) matrices for all IA data subsets. The 'DB_AM_subsets.RData' file contains the 'Total Abundance' (TA) and 'Mean Abundance' (MA) data subsets based on taxonomic group (invertebrates/macrophytes) and inclusion criteria.
The 'output_data' folder contains subfolders with the output data for each data subset (i.e. for each metric, taxonomic group and set of inclusion criteria). For each data subset, the subfolder contains random effects selection results ('Results1_REsel_
The 'scripts' folder contains all R-scripts that we used for this study:

* 'PrepareData.R' takes the database as input and adjusts the file so that it can be used for data analysis.
* 'PrepareDataIA.R' and 'PrepareDataAM.R' make subsets of the data and prepare the data for the meta-regression analysis and mixed-effects regression analysis, respectively.
* The regression analyses are performed in 'SelectModelsIA.R' and 'SelectModelsAM.R', which calculate the regression model results for the IA metric and the MA/TA metrics, respectively. These scripts require the 'RandomAndFixedEffects.R' script, containing the random and fixed effects parameter combinations, as well as the 'Functions.R' script.
* 'CreateMap.R' creates a global map with the location of all studies included in the analysis (figure 1 in the paper).
* 'CreateForestPlots.R' creates plots showing the IA data distribution for both taxonomic groups (figure 2 in the paper).
* 'CreateHeatMaps.R' creates heat maps for all metrics and taxonomic groups (figure 3 in the paper, figures S11.1 and S11.2 in the appendix).
* 'CalculateStatistics.R' calculates the descriptive statistics reported throughout the paper, and creates the figures that describe the dataset characteristics (figures S3.1 to S3.5 in the appendix).
* 'CreateFunnelPlots.R' creates the funnel plots for both taxonomic groups (figures S6.1 and S6.2 in the appendix) and performs Egger's tests.
* 'CreateControlGraphs.R' creates graphs showing the dependency of the nutrient response on control concentrations for all metrics and taxonomic groups (figures S10.1 and S10.2 in the appendix).
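A hedged sketch of driving that workflow from R in the order the description implies (the working directory and any paths hard-coded inside the scripts are assumptions):

# Run the preparation and model selection steps in sequence
source("scripts/PrepareData.R")     # raw database -> processed database
source("scripts/PrepareDataIA.R")   # IA subsets and VCV matrices
source("scripts/PrepareDataAM.R")   # TA/MA subsets
source("scripts/SelectModelsIA.R")  # needs RandomAndFixedEffects.R and Functions.R
source("scripts/SelectModelsAM.R")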
The 'figures' folder contains all figures that are included in this study.
Terms: https://paper.erudition.co.in/terms
Question Paper Solutions of chapter Simulation of Data Analysis with R, 2nd Semester, Bachelor of Computer Application 2023-2024
Subset of MedleyDB: mix audio files and melody annotations for the 108 files in the MedleyDB multitrack dataset containing melody.
There are 3 types of melody annotations released:
melody1: "The f0 curve of the predominant melodic line drawn from a single source"
melody2: "The f0 curve of the predominant melodic line drawn from multiple sources"
melody3: "The f0 curves of all melodic lines drawn from multiple sources"
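A hedged R sketch of reading one melody annotation (the two-column time/frequency layout and the 0 Hz = unvoiced convention are assumptions here; check the MedleyDB documentation):

# Read a melody annotation and keep only voiced frames
mel <- read.csv("SomeTrack_MELODY1.csv", header = FALSE,
                col.names = c("time_sec", "f0_hz"))
voiced <- subset(mel, f0_hz > 0)  # 0 Hz marks frames with no melody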
For further details, refer to the MedleyDB website.
Further annotation and metadata files are version controlled and available in the MedleyDB GitHub repository.
If you make use of MedleyDB for academic purposes, please cite the following publication:
R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam and J. P. Bello, "MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research", in 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, Oct. 2014.
Terms: https://paper.erudition.co.in/terms
Question Paper Solutions of chapter Control structures of Data Analysis with R, 2nd Semester, Bachelor of Computer Application 2023-2024
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MZ = monozygotic, DZ = dizygotic, FA = fractional anisotropy, R = right-handed, WM = white matter.
* White matter was segmented reliably in 187 children at baseline, and in 117 children at follow-up.
** 105 children had reliable white matter segmentations and DTI measurements at both time-points.
The last column displays p-values of the differences in sample characteristics for children who participated only once versus children who participated twice.
Licence: https://eidc.ac.uk/licences/ogl/plain
This dataset contains time series observations of surface-atmosphere exchanges of sensible heat (H), latent heat (LE) and momentum (τ) measured at UKCEH eddy covariance flux observation sites during summer 2019. The dataset includes ancillary weather and soil physics observations made at each site. Eddy covariance (EC) and micrometeorological observations were collected using open-path eddy covariance systems. Flux, meteorological and soil physics observations were collected and processed using harmonised protocols across all sites. This work was supported by the Natural Environment Research Council award number NE/R016429/1 as part of the UK-SCAPE programme delivering National Capability.