Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Facebook
TwitterWhen working with data, you often spend the most amount of time cleaning your data. Learn how to write more efficient code using the tidyverse in R.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data and have considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables conveydemographics (281 variables),dietary consumption (324 variables),physiological functions (1,040 variables),occupation (61 variables),questionnaires (1444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),medications (29 variables),mortality information linked from the National Death Index (15 variables),survey weights (857 variables),environmental exposure biomarker measurements (598 variables), andchemical comments indicating which measurements are below or above the lower limit of detection (505 variables).csv Data Record: The curated NHANES datasets and the data dictionaries includes 23 .csv files and 1 excel file.The curated NHANES datasets involves 20 .csv formatted files, two for each module with one as the uncleaned version and the other as the cleaned version. The modules are labeled as the following: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments."dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES."dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.“dictionary_drug_codes.csv” contains the dictionary for descriptors on the drugs codes.“nhanes_inconsistencies_documentation.xlsx” is an excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.R Data Record: For researchers who want to conduct their analysis in the R programming language, only cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file which include an .RData file and an .R file.“w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.“m - nhanes_1988_2018.R” shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.“example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.“example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.“example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.“example_3 - run_multiple_regressions.Rmd” demonstrates how run multiple regression models with and without adjusting for the sampling design.
Facebook
TwitterOriginal dataset that is shared on Github can be found here. These are hands on practice datasets that were linked through the Coursera Guided Project Certificate Course for Handling Missing Values in R, a part of Coursera Project Network. The datasets links were shared by the original author and instructor of the course Arimoro Olayinka Imisioluwa.
Things you could do with this dataset: As a beginner in R, these datasets helped me to get a hang over making data clean and tidy and handling missing values(only numeric) using R. Good for anyone looking for a beginner to intermediate level understanding in these subjects.
Here are my notebooks as kernels using these datasets and using a few more preloaded datasets in R, as suggested by the instructor. TidY DatA Practice MissinG DatA HandlinG - NumeriC
Facebook
TwitterSubscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Rafael R. Lima
Released under CC0: Public Domain
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scripts to clean EVI2 data obtained from the VIP lab (University of Arizona) website (https://vip.arizona.edu/about.php and https://vip.arizona.edu/viplab_data_explorer.php). Data obtained in 2012.- outlier detection and removal/replacement- alignment of 2 periodsThe manuscript detailing the methods and resulting data sets has been accepted for publication in Nature Scientific Data (05/11/2019).Instructions: use the R Markdown html file for instructions!Code last manipulated and tested in R 3.4.3 ("Kite-Eating Tree")
Facebook
TwitterSurveys from Carpentries style workshops the results of which are presented in the accompanying manuscript.
Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.
The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files.
The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw.
The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey.
The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, r...
Facebook
TwitterThis dataset contains all data and code required to clean the data, fit the models, and create the figures and tables for the laboratory experiment portion of the manuscript:Kannan, N., Q. D. Read, and W. Zhang. 2024. A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health. Heliyon, in press. https://doi.org/10.1016/j.heliyon.2024.e35510.In this dataset, we archive results from several laboratory and field trials testing different adjuvants (spray additives) that are intended to reduce particle drift, increase particle size, and slow down the particles from pesticide spray nozzles. We fit statistical models to the droplet size and speed distribution data and statistically compare different metrics between the adjuvants (sodium alginate, polyacrylamide [PAM], and control without any adjuvants). The following files are included:RawDataPAMsodAlgOxfLsr.xlsx: Raw data for primary analysesOrganizedDataPaperRevision20240614.xlsx: Raw data to produce density plots presented in Figs. 8 and 9raw_data_readme.md: Markdown file with description of the raw data filesR_code_supplement.R: All R code required to reproduce primary analysesR_code_supplement2.R: R code required to produce density plots presented in Figs. 8 and 9Intermediate R output files are also included so that tables and figures can be recreated without having to rerun the data preprocessing, model fitting, and posterior estimation steps:pam_cleaned.RData: Data combined into clean R data frames for analysisvelocityscaledlogdiamfit.rds: Fitted brms model object for velocitylnormfitreduced.rds: Fitted brms model object for diameter distributionemm_con_velo_diam_draws.RData: Posterior distributions of estimated marginal means for velocityemm_con_draws.RData: Posterior distributions of estimated marginal means for diameter distributionThe following software and package versions were used:R version 4.3.1CmdStan version 2.33.1R packages:brms version 2.20.5cmdstanr version 0.5.3fitdistrplus version 1.1-11tidybayes version 3.0.4emmeans version 1.8.9
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.
The code to derive the dataset is given as follows:
library(tesseract) library(pdftools)
pdfs <- list.files("[path to your output directory containing all PDF files]")
meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv",header = T,sep = ',') #the meta file containing all metadata for the PDF files (e.g. publication date)
meta$Doc.Date <- as.character(meta$Doc.Date)
meta.clean <- meta[-which(meta$Doc.Date=="" | grepl("/0000",meta$Doc.Date)),] for(i in 1:nrow(meta.clean)){ meta.clean$Doc.Date[i] <- gsub("00","01",meta.clean$Doc.Date[i])
if(nchar(meta.clean$Doc.Date[i])<10){ meta.clean$Doc.Date[i]<-format(strptime(meta.clean$Doc.Date[i],format = "%d/%m/%y"),"%m/%d/%Y") }
}
meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date,format = "%m/%d/%Y")
meta.clean <- meta.clean[order(meta.clean$Doc.Date),]
docs <- data.frame(content=character(0),dpub=character(0),stringsAsFactors = F) for(i in 1:nrow(meta.clean)){
pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i]))) tmp_files <- c() for(k in 1:pdf_prop$pages){ tmp_files <- c(tmp_files,paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/",k)) }
img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i])), format = 'tiff', pages = NULL, dpi = 700,filenames = tmp_files)
txt <- ""
for(j in 1:length(img_file)){ extract <- ocr(img_file[j], engine = tesseract("eng")) #unlink(img_file) txt <- paste(txt,extract,collapse = " ") }
docs <- rbind(docs,data.frame(content=iconv(tolower(gsub("\s+"," ",gsub("[[:punct:]]|[ ]"," ",txt))),to="UTF-8"),dpub=format(meta.clean$Doc.Date[i],"%Y/%m/%d"),stringsAsFactors = F),stringsAsFactors = F) }
write.table(docs,"[path to your output directory]/documents.csv", row.names = F)
Facebook
TwitterThis dataset was created by SantosJGND
Facebook
TwitterShanghai Hanyang Clean Technology R Export Import Data. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data set consisting of data joined for analyzing the SBIR/STTR program. Data consists of individual awards and agency-level observations. The R and python code required for pulling, cleaning, and creating useful data sets has been included. Allard_Get and Clean Data.R This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this Rcode will be able to replicate the original study without needing the user to update any file paths. Allard SBIR STTR WebScraper.py This is the code I deployed to multiple Amazon EC2 instances to scrape data o each individual award in my data set, including the contact info and DUNS data. Allard_Analysis_APPAM SBIR project Forthcoming Allard_Spatial Analysis Forthcoming Awards_SBIR_df.Rdata This unique data set consists of 89,330 observations spanning the years 1983 - 2018 and accounting for all eleven SBIR/STTR agencies. This data set consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author. Budget_SBIR_df.Rdata 246 observations for 20 agencies across 25 years of their budget-performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the websites of awards. Solicit_SBIR-df.Rdata This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API. Primary Sources Small Business Administration. “Annual Reports Dashboard,” 2018. https://www.sbir.gov/awards/annual-reports. Small Business Administration. “SBIR Awards Data,” 2018. https://www.sbir.gov/api. Small Business Administration. “SBIR Solicit Data,” 2018. https://www.sbir.gov/api.
Facebook
TwitterFor any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
I created these files and analysis as part of working on a case study for the Google Data Analyst certificate.
Question investigated: Do annual members and casual riders use Cyclistic bikes differently? Why do we want to know?: Knowing bike usage/behavior by rider type will allow the Marketing, Analytics, and Executive team stakeholders to design, assess, and approve appropriate strategies that drive profitability.
I used the script noted below to clean the files and then added some additional steps to create the visualizations to complete my analysis. The additional steps are noted in corresponding R Markdown file for this data set.
Files: most recent 1 year of data available, Divvy_Trips_2019_Q2.csv, Divvy_Trips_2019_Q3.csv, Divvy_Trips_2019_Q4.csv, Divvy_Trips_2020_Q1.csv Source: Downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html
Data cleaning script: followed this script to clean and merge files https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy
Note: Combined data set has 3,876,042 rows, so you will likely need to run R analysis on your computer (e.g., R Console) rather than in the cloud (e.g., RStudio Cloud)
This was my first attempt to conduct an analysis in R and create the R Markdown file. As you might guess, it was an eye-opening experience, with both exciting discoveries and aggravating moments.
One thing I have not yet been able to figure out is how to add a legend to the map. I was able to get a legend to appear on a separate (empty) map, but not on the map you will see here.
I am also interested to see what others did with this analysis - what were the findings and insights you found?
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Use the project file first, then open the cleaning R file to clean the raw data. Then use the R file called OLS analysis to analyze the cleaned data, which was outputted as a .rds file.
Facebook
Twitterhttps://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown ((https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
Facebook
Twitterhttps://cdla.io/permissive-1-0/https://cdla.io/permissive-1-0/
Case study: How does a bike-share navigate speedy success?
Scenario:
As a data analyst on Cyclistic's marketing team, our focus is on enhancing annual memberships to drive the company's success. We aim to analyze the differing usage patterns between casual riders and annual members to craft a marketing strategy aimed at converting casual riders. Our recommendations, supported by data insights and professional visualizations, await Cyclistic executives' approval to proceed.
About the company
In 2016, Cyclistic launched a bike-share program in Chicago, growing to 5,824 bikes and 692 stations. Initially, their marketing aimed at broad segments with flexible pricing plans attracting both casual riders (single-ride or full-day passes) and annual members. However, recognizing that annual members are more profitable, Cyclistic is shifting focus to convert casual riders into annual members. To achieve this, they plan to analyze historical bike trip data to understand the differences and preferences between the two user groups, aiming to tailor marketing strategies that encourage casual riders to purchase annual memberships.
Project Overview:
This capstone project is a culmination of the skills and knowledge acquired through the Google Professional Data Analytics Certification. It focuses on Track 1, which is centered around Cyclistic, a fictional bike-share company modeled to reflect real-world data analytics scenarios in the transportation and service industry.
Dataset Acknowledgment:
We are grateful to Motivate Inc. for providing the dataset that serves as the foundation of this capstone project. Their contribution has enabled us to apply practical data analytics techniques to a real-world dataset, mirroring the challenges and opportunities present in the bike-sharing sector.
Objective:
The primary goal of this project is to analyze the Cyclistic dataset to uncover actionable insights that could help the company optimize its operations, improve customer satisfaction, and increase its market share. Through comprehensive data exploration, cleaning, analysis, and visualization, we aim to identify patterns and trends that inform strategic business decisions.
Methodology:
Data Collection: Utilizing the dataset provided by Motivate Inc., which includes detailed information on bike usage, customer behavior, and operational metrics. Data Cleaning and Preparation: Ensuring the dataset is accurate, complete, and ready for analysis by addressing any inconsistencies, missing values, or anomalies. Data Analysis: Applying statistical methods and data analytics techniques to extract meaningful insights from the dataset.
Visualization and Reporting:
Creating intuitive and compelling visualizations to present the findings clearly and effectively, facilitating data-driven decision-making. Findings and Recommendations:
Conclusion:
The Cyclistic Capstone Project not only demonstrates the practical application of data analytics skills in a real-world scenario but also provides valuable insights that can drive strategic improvements for Cyclistic. Through this project, showcasing the power of data analytics in transforming data into actionable knowledge, underscoring the importance of data-driven decision-making in today's competitive business landscape.
Acknowledgments:
Special thanks to Motivate Inc. for their support and for providing the dataset that made this project possible. Their contribution is immensely appreciated and has significantly enhanced the learning experience.
STRATEGIES USED
Case Study Roadmap - ASK
●What is the problem you are trying to solve? ●How can your insights drive business decisions?
Key Tasks ● Identify the business task ● Consider key stakeholders
Deliverable ● A clear statement of the business task
Case Study Roadmap - PREPARE
● Where is your data located? ● Are there any problems with the data?
Key tasks ● Download data and store it appropriately. ● Identify how it’s organized.
Deliverable ● A description of all data sources used
Case Study Roadmap - PROCESS
● What tools are you choosing and why? ● What steps have you taken to ensure that your data is clean?
Key tasks ● Choose your tools. ● Document the cleaning process.
Deliverable ● Documentation of any cleaning or manipulation of data
Case Study Roadmap - ANALYZE
● Has your data been properly formaed? ● How will these insights help answer your business questions?
Key tasks ● Perform calculations ● Formatting
Deliverable ● A summary of analysis
Case Study Roadmap - SHARE
● Were you able to answer all questions of stakeholders? ● Can Data visualization help you share findings?
Key tasks ● Present your findings ● Create effective data viz.
Deliverable ● Supporting viz and key findings
**Case Study Roadmap - A...
Facebook
Twitter(1) dataandpathway_eisner.R, dataandpathway_bordbar.R, dataandpathway_taware.R and dataandpathway_almutawa.R: functions and codes to clean the realdata sets and obtain the annotation databases, which are save as .RData files in sudfolders Eisner, Bordbar, Taware and Al-Mutawa respectively. (2) FWER_excess.R: functions to show the inflation of FWER when integrating multiple annotation databases and to generate Table 1. (3) data_info.R: code to obtain Table 2 and Table 3. (4) rejections_perdataset.R and triangulartable.R: functions to generate Table 4. The runing time of rejections_perdataset.R is 7 hours around, we thus save the corresponding results as res_eisner.RData, res_bordbar.RData, res_taware.RData and res_almutawa.RData in subfolders Eisner, Bordbar, Taware and Al-Mutawa respectively. (5) pathwaysizerank.R: code for generating Figure 4 based on res_eisner.RData from (h). (6) iterationandtime_plot.R: code for generating Figure 5 based on “Al-Mutawa” data. The code is really time-consuming, nearly 5 days, we thus save the corresponding results and plot them in the main manuscript by pgfplot.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains the scripts used for importing, trimming, cleaning, analysing, and plotting a large dataset of inclination experiments with an SOFC module. The measurement data is confidential, so it could not be published alongside the scripts. One row of dummy input data is published to illustrate the structure of the analysed data. The analysis is used for the journal paper "Experimental Evaluation of a Solid Oxide Fuel Cell System Exposed to Inclinations and Accelerations by Ship Motions".
The scripts contain:
- A script that reads the data, removes unusable data and transforms into analysable dataframes (Clean and trim.R)
- Two files to make a wide variety of plots (Plotting.R and Specificplots.R)
- A file data does a Gaussian Progress regression to estimate the degradation rate (Degradation estimation.R)
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.