19 datasets found
  1. Crime Data Analysis

    An analysis of crime in Los Angeles County from 2020-2024.

    • kaggle.com
    Updated Aug 9, 2024
    Cite
    Candace Gostinski (2024). Crime Data Analysis [Dataset]. https://www.kaggle.com/datasets/candacegostinski/crime-data-analysis
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Candace Gostinski
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    In a world of increasing crime, many organizations are interested in examining incident details to learn from and prevent future crime. Our client, based in Los Angeles County, was interested in exactly this. They asked us to examine the data to answer several questions, among them: what was the rate of increase or decrease in crime from 2020 to 2023, and which ethnicity or group of people was targeted the most?

    Our data was collected from Kaggle.com at the following link:

    https://www.kaggle.com/datasets/nathaniellybrand/los-angeles-crime-dataset-2020-present

    The data were cleaned, examined for further errors, and analyzed using RStudio. The results of this analysis are in the attached PDF entitled "crime_data_analysis_report." Please feel free to review the results as well as follow along with the dataset on your own machine.
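    For readers following along in RStudio, a minimal sketch of the year-over-year calculation is given below (the file and column names are assumptions; adjust them to your copy of the dataset):

    # Minimal sketch: yearly incident counts and percentage change (assumed file/column names)
    library(dplyr)

    crime <- read.csv("crime_data.csv")                                   # hypothetical file name
    crime$year <- as.integer(format(as.Date(crime$date_occurred, "%m/%d/%Y"), "%Y"))

    yearly <- crime |>
      count(year, name = "incidents") |>
      mutate(pct_change = 100 * (incidents - lag(incidents)) / lag(incidents))

    yearly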

  2. Healthcare Device Data Analysis with R

    • kaggle.com
    zip
    Updated Oct 7, 2021
    Cite
    stanley888cy (2021). Healthcare Device Data Analysis with R [Dataset]. https://www.kaggle.com/stanley888cy/google-project-02
    Explore at:
    Available download formats: zip (353177 bytes)
    Dataset updated
    Oct 7, 2021
    Authors
    stanley888cy
    Description

    Context

    Hi. This is my data analysis project and also a try at using R in my work. It is the capstone project for the Google Data Analytics Certificate course offered on Coursera (https://www.coursera.org/professional-certificates/google-data-analytics). It is an operational data analysis of data from a health monitoring device. For the detailed background story, please check the PDF file (Case 02.pdf) for reference.

    In this case study, I use personal health tracker data from Fitbit to evaluate how the health tracker device is used, and then determine whether there are any trends or patterns.

    My data analysis focuses on two areas: exercise activity and sleeping habits. The exercise activity part studies the relationship between activity type and calories consumed, while the sleeping habit part identifies patterns in users' sleep. In this analysis, I also try some linear regression models, so that the data can be explained in a quantitative way and predictions become easier.
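    As an illustration of the kind of regression used here, a minimal sketch is given below (the file and column names follow the public Fitbit daily-activity export and are assumptions; adjust them to the files in this dataset):

    # Minimal sketch: calories as a function of very active minutes (assumed file/column names)
    daily <- read.csv("dailyActivity_merged.csv")

    fit <- lm(Calories ~ VeryActiveMinutes, data = daily)
    summary(fit)

    plot(daily$VeryActiveMinutes, daily$Calories,
         xlab = "Very active minutes", ylab = "Calories")
    abline(fit)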

    I understand that I am new to data analysis and that my skills and code are at a beginner level, but I am working hard to learn more in both R and the data science field. If you have any ideas or feedback, please feel free to comment.

    Stanley Cheng 2021-10-07

  3. Bike Sharing Data Analysis with R

    • kaggle.com
    zip
    Updated Sep 28, 2021
    Cite
    stanley888cy (2021). Bike Sharing Data Analysis with R [Dataset]. https://www.kaggle.com/stanley888cy/google-project-01
    Explore at:
    Available download formats: zip (189322255 bytes)
    Dataset updated
    Sep 28, 2021
    Authors
    stanley888cy
    Description

    What is this? In this case study, I use a bike-share company's data to evaluate riding behavior between members and casual riders, determine whether there are any trends or patterns, and theorize about what is causing them. I am then able to develop recommendations based on those findings.

    Content: Hi. This is my first data analysis project and also my first time using R in my work. It is the capstone project for the Google Data Analytics Certificate course offered on Coursera (https://www.coursera.org/professional-certificates/google-data-analytics). It is an operational data analysis of a fictional bike-share company in Chicago. For the detailed background story, please check the PDF file (Case 01.pdf) for reference.

    In this case study, I use the bike-share company's data to evaluate riding behavior between members and casual riders through descriptive analysis, determine whether there are any trends or patterns, and theorize about what is causing them. I am then able to develop recommendations based on those findings.

    First, I give a background introduction, my business tasks and objectives, and how I obtained the data sources for the analysis. Then there is the R code I wrote in RStudio for data processing, cleaning, and generating graphs for the subsequent analysis. Next come my analyses of the bike data, with graphs and charts generated by R's ggplot2. At the end, I also provide some recommendations for the business tasks, based on the data findings.
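    As an illustration of this style of summary, a minimal sketch is shown below (the column names follow the public Divvy trip exports and are assumptions here; adjust them to the files in this dataset):

    # Minimal sketch: average ride length by rider type (assumed file/column names)
    library(dplyr)
    library(ggplot2)

    trips <- read.csv("divvy_trips.csv")                                  # hypothetical file name
    trips$ride_minutes <- as.numeric(difftime(as.POSIXct(trips$ended_at),
                                              as.POSIXct(trips$started_at),
                                              units = "mins"))

    trips |>
      group_by(member_casual) |>
      summarise(avg_minutes = mean(ride_minutes, na.rm = TRUE)) |>
      ggplot(aes(x = member_casual, y = avg_minutes)) +
      geom_col() +
      labs(x = "Rider type", y = "Average ride length (minutes)")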

    I understand that I am new to data analysis and that my skills and code are at a beginner level, but I am working hard to learn more in both R and the data science field. If you have any ideas or feedback, please feel free to comment.

    Stanley Cheng 2021-09-30

  4. Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill (2023). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill
    Description

    This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data, you probably do not want to download all of the files, and depending on your computational resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

    The data files are created in a four-stage process. The first stage uses the program "wikiq" to parse MediaWiki XML dumps and create TSV files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these TSVs into a dataset of edits from all the wikis; this file is expensive to generate and, at 1.5 GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these TSV files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and LaTeX typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003.

    These instructions work backwards: from building the manuscript using knitr, to loading the datasets and running the analysis, to building the intermediate datasets.

    Building the manuscript using knitr. This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways; on Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar, which has everything you need to typeset the manuscript. Unpack the tar archive (on a Unix system, tar xf code.tar) and navigate to code/paper_source. Install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a Unix system you should then be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise, try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

    Loading intermediate datasets. The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a Unix system using the command 7z x intermediate_data.7z; the files are 95 MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer the .tab files.

    Running the analysis. Fitting the models may not work on machines with less than 32 GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful for creating stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a Unix system, tar xf code.tar && 7z x intermediate_data.7z). Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a Unix system you can simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

    Generating datasets: building the intermediate files. The intermediate files are generated from all.edits.RDS; this process requires about 20 GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a Unix system, tar xf code.tar && 7z x userroles_data.7z). Install the R dependencies: in R, run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R.

    Building all.edits.RDS. The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
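    For example, once intermediate_data.7z has been extracted into the working directory, a minimal R session to inspect one of the intermediate datasets might look like this (the object name is illustrative; newcomers.RDS is the file named above):

    # Minimal sketch: load and inspect one intermediate dataset
    newcomer.ds <- readRDS("newcomers.RDS")
    str(newcomer.ds)        # structure of the analytical variables
    summary(newcomer.ds)
    head(newcomer.ds)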

  5. Data from: Designing data science workshops for data-intensive environmental science research

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Dec 8, 2020
    Cite
    Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 8, 2020
    Dataset provided by
    Montana State University
    California State Polytechnic University
    Authors
    Allison Theobold; Stacey Hancock; Sara Mannheimer
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum,'' the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.

    Methods: Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

    Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

    The surveys administered for the fall 2018 and spring 2019 academic year are included as the pre_workshop_survey and post_workshop_assessment PDF files.
    • The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose name includes survey contain raw data from the pre-workshop surveys, and the data files whose name includes assessment contain raw data from the post-workshop assessment survey.
    • The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively.
    • The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean.
    • The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
    
  6. DataSheet1_ALASCA: An R package for longitudinal and cross-sectional analysis of multivariate data by ASCA-based methods.pdf

    • frontiersin.figshare.com
    pdf
    Updated Jun 10, 2023
    Cite
    Anders Hagen Jarmund; Torfinn Støve Madssen; Guro F. Giskeødegård (2023). DataSheet1_ALASCA: An R package for longitudinal and cross-sectional analysis of multivariate data by ASCA-based methods.pdf [Dataset]. http://doi.org/10.3389/fmolb.2022.962431.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Frontiers
    Authors
    Anders Hagen Jarmund; Torfinn Støve Madssen; Guro F. Giskeødegård
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The increasing availability of multivariate data within biomedical research calls for appropriate statistical methods that can describe and model complex relationships between variables. The extended ANOVA simultaneous component analysis (ASCA+) framework combines general linear models and principal component analysis (PCA) to decompose and visualize the separate effects of experimental factors. It has recently been demonstrated how linear mixed models can be included in the framework to analyze data from longitudinal experimental designs with repeated measurements (RM-ASCA+). The ALASCA package for R makes the ASCA+ framework accessible for general use and includes multiple methods for validation and visualization. The package is especially useful for longitudinal data and the ability to easily adjust for covariates is an important strength. This paper demonstrates how the ALASCA package can be applied to gain insights into multivariate data from interventional as well as observational designs. Publicly available data sets from four studies are used to demonstrate the methods available (proteomics, metabolomics, and transcriptomics).

  7. Inflation- Unemployment Data & Analysis Codes (R)

    • data.mendeley.com
    Updated Sep 11, 2018
    Cite
    Hazar Altinbas (2018). Inflation- Unemployment Data & Analysis Codes (R) [Dataset]. http://doi.org/10.17632/v9679528f7.1
    Explore at:
    Dataset updated
    Sep 11, 2018
    Authors
    Hazar Altinbas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data is used for examination of inflation- unemployment relationship for 18 countries after 1991. Inflation data is obtained from World Bank database (https://data.worldbank.org/indicator/FP.CPI.TOTL.ZG) and unemployment data is obtained from International Labor Organization (http://www.ilo.org/wesodata/).

    The analysis period differs across countries because of structural breaks determined by the single change-point detection algorithm included in the changepoint package of Killick & Eckley (2014). Granger causality is tested with the Toda & Yamamoto (1995) procedure. Integration levels are determined with three stationarity tests. VAR models are fitted with the vars package (Pfaff, Stigler & Pfaff, 2018) without trend and constant terms. The cointegration test is conducted with the urca package (Pfaff, Zivot, Stigler & Pfaff, 2016).

    All data files are .csv files. The analyst needs to change the country index (variable name: j) in order to see individual results. The findings can be seen in the article.
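    A minimal sketch of this workflow in R is given below (the file names, data layout, and lag order are assumptions for illustration; the packages are the ones cited in the references that follow):

    # Minimal sketch: single change-point detection, then a VAR without trend/constant
    library(changepoint)
    library(vars)

    j <- 1                                              # country index, as in the provided scripts
    inflation    <- read.csv("inflation.csv")           # hypothetical file names
    unemployment <- read.csv("unemployment.csv")

    cp    <- cpt.mean(inflation[, j], method = "AMOC")  # at most one change point
    start <- cpts(cp) + 1                               # analyze the post-break sample (assumes a break is found)

    y <- cbind(inflation[start:nrow(inflation), j],
               unemployment[start:nrow(unemployment), j])
    colnames(y) <- c("inflation", "unemployment")

    var_fit <- VAR(y, p = 2, type = "none")             # no trend and no constant terms
    summary(var_fit)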

    Killick, R., & Eckley, I. (2014). changepoint: An R package for changepoint analysis. Journal of statistical software, 58(3), 1-19.

    Pfaff, B., Stigler, M., & Pfaff, M. B. (2018). Package 'vars'. [Online] https://cran.r-project.org/web/packages/vars/vars.pdf

    Pfaff, B., Zivot, E., Stigler, M., & Pfaff, M. B. (2016). Package ‘urca’. Unit root and cointegration tests for time series data. R package version, 1-2.

    Toda, H. Y., & Yamamoto, T. (1995). Statistical inference in vector autoregressions with possibly integrated processes. Journal of econometrics, 66(1-2), 225-250.

  8. Assessing the impact of hints in learning formal specification: Research artifact

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 29, 2024
    Cite
    Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara (2024). Assessing the impact of hints in learning formal specification: Research artifact [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10450608
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    INESC TEC
    Centro de Computação Gráfica
    Authors
    Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study investigating the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention and in terms of the students' emotional response. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires used to assess emotional response and user experience), as well as the collected data and the data analysis scripts used for the discussion in the paper.

    Dataset

    The artifact contains the resources described below.

    Experiment resources

    The resources needed for replicating the experiment, namely in directory experiment:

    alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.

    alloy_sheet_en.pdf: a version of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment, translated into English.

    docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.

    api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.

    Experiment data

    The task database used in our application of the experiment, namely in directory data/experiment:

    Model.json, Instance.json, and Link.json: JSON files used to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.

    identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.

    Collected data

    Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the shape of JSON and CSV files with a header row, namely in directory data/results:

    data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).

    data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclose, and other, respectively), and average academic grade (GRADE, from 0 to 20; NA denotes a preference not to disclose).

    data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);

    detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.

    data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).

    participants.txt: the list of participant identifiers that have registered for the experiment.
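    As an illustration of how these collected files fit together, a minimal R sketch is shown below (it uses only the columns described above; the exact file paths are assumptions):

    # Minimal sketch: mean usability score per treatment group in each session
    emo  <- read.csv("data/results/data_emo.csv")      # includes ID and HINT (N, L, E or D)
    umux <- read.csv("data/results/data_umux.csv")     # includes ID, UMUX1 and UMUX2

    merged <- merge(emo[, c("ID", "HINT")], umux, by = "ID")
    aggregate(cbind(UMUX1, UMUX2) ~ HINT, data = merged, FUN = mean, na.rm = TRUE)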

    Analysis scripts

    The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:

    analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.

    requirements.r: An R script to install the required libraries for the analysis script.

    normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.

    normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.

    Dockerfile: Docker script to automate the analysis script from the collected data.

    Setup

    To replicate the experiment and the analysis of the results, only Docker is required.

    If you wish to manually replicate the experiment and collect your own data, you'll need to install:

    A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.

    If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:

    Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.

    R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.

    Usage

    Experiment replication

    This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.

    To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch, wait for the "Started your app" message to show.

    cd experiment
    docker-compose up

    This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.

    In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5 mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, one for each hint group:

    Group N (no hints): http://localhost:3000/0CAN

    Group L (error locations): http://localhost:3000/CA0L

    Group E (counter-example): http://localhost:3000/350E

    Group D (error description): http://localhost:3000/27AD

    In the 2nd session, as in the 1st session, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5 mins). The permalink is constructed by prepending the participant's identifier with P-, so participant 0CAN would access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.

    Before the 1st session the participants should answer the socio-demographic questionnaire, which should ask for the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.

    Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment to each of the 14 depicted emotions, expressed on a 5-point Likert scale.

    After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user's unique identifier and answers to the standard 4 questions on a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric (a score ranging from 0 to 100) from the answers, please see the original paper:

    Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.

    Analysis of other applications of the experiment

    This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.

    The analysis script expects data in 4 CSV files,

  9. Data Science Interview 👩‍💻Questions Collection📂

    • kaggle.com
    zip
    Updated Nov 1, 2021
    Cite
    Syed Jafer (2021). Data Science Interview 👩‍💻Questions Collection📂 [Dataset]. https://www.kaggle.com/syedjaferk/datascience-interview-questions-collection
    Explore at:
    Available download formats: zip (16464364 bytes)
    Dataset updated
    Nov 1, 2021
    Authors
    Syed Jafer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    Collection of Interview Preparation Materials from different resources.

    Content

    Collection of PDFs and Kaggle discussions helpful for data science interview questions.

    Acknowledgements

    Acknowledgements are cited under each resource.

    Inspiration

    To have a good source of materials for interview preparation.

  10. Supplementary Data II: R/R-Studio software code for waste survey data analysis of Jos Plateau state, Nigeria

    • data.mendeley.com
    Updated Nov 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fwangmun Wamyil (2024). Supplementary Data II: R/R-Studio software code for waste survey data analysis of Jos Plateau state, Nigeria [Dataset]. http://doi.org/10.17632/j8c4s7mdx4.1
    Explore at:
    Dataset updated
    Nov 7, 2024
    Authors
    Fwangmun Wamyil
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Jos, Nigeria, Plateau
    Description

    R/RStudio software code for waste survey data analysis of Jos, Plateau state, Nigeria. This details the software code used in RStudio (RStudio 2023.06.2+561 "Mountain Hydrangea") and R (version 4.3.1) for analyzing the study on a waste characterization survey/waste audit in Jos, Plateau state, Nigeria. The code is provided in .R, .docx, and .pdf formats; it is accessible directly using RStudio as well as in the .docx and .pdf versions. It includes the code for generating the plots used in the paper publication. Note that you require the dataset used in the study, which is provided in the steps to reproduce.

  11. Source files For Bike Share Case Study

    • kaggle.com
    zip
    Updated Aug 4, 2022
    Cite
    MG (2022). Source files For Bike Share Case Study [Dataset]. https://www.kaggle.com/datasets/magdas0/source-files-for-bike-share-case-study/versions/7
    Explore at:
    Available download formats: zip (4497856 bytes)
    Dataset updated
    Aug 4, 2022
    Authors
    MG
    Description

    Bike Share Case Study

    This case study has been prepared as partial fulfillment of the Capstone project, the final course in the Google Data Analytics certificate offered by Google on the Coursera platform.

    I created a dataset that contains source files I wrote to perform this analysis:

    • Files presenting and documenting the analysis
      • 2022-08-04-bike-share-pres.pdf - the final presentation of the results including diagrams, conclusions and recommendations, and
      • 2022-08-04-bike-share-report.pdf - document describing all stages of the project
    • scripts - R, bash, and SQL scripts I created and used for this project
    • spreadsheets - spreadsheets I created and used for this project

    The original data regarding the bike-sharing program is publicly available. The link is provided in the presentation and in the report.

  12. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  13. There are five documents included. The first PDF document contains the data...

    • figshare.com
    pdf
    Updated Jun 24, 2025
    Cite
    Hengzhi Hu (2025). There are five documents included. The first PDF document contains the data analysis scripts and the analysis statement from the educational authority that regulates the research. Each page in this document is stamped for verification purposes. The second TXT document contains the data analysis scripts obtained from R program. The script is exactly the same as the one included in the first document. The third PDF document contains the official statement on the data access and confidentiality, with official stamp for verification. Due to ethical considerations and the regulation of local educational policies, the raw dataset for the study cannot be shared but can be accessed on-site with the educational authority. The last two documents are the speaking tasks used in the study, including a text presentation of task instructions and a test instruction recording. [Dataset]. http://doi.org/10.6084/m9.figshare.29390120.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 24, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Hengzhi Hu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data analysis transcript is specifically for the study titled "What Shapes Automated Ratings in Computer-Based English Speaking Tests? Perspectives from Analytic Complexity, Accuracy, Fluency, and Pronunciation Indices". This study adopts a quantitative, correlational research design to investigate the extent to which various linguistic features—namely Complexity, Accuracy, Fluency, and Pronunciation (CAFP)—predict Automated Ratings (ARs) in China’s Computer-Based English Speaking Test (CBEST) administered during Zhongkao. The aim is to uncover how these linguistic indices influence machine-generated scores and to evaluate the validity and fairness of automated assessment systems in high-stakes educational contexts.The CBEST format used in this study includes three task types: Reading-Aloud, Communicative Question & Answer, and Response to a Topic. These tasks are scored using an integrated system developed by iFlytek, which combines automatic speech recognition (ASR), deep learning models, and benchmarked manual expert evaluation. The assessment model has been officially recognized and is widely adopted in Chinese provinces for junior secondary school students.

  14. Table_3_Hotspot and Frontier Analysis of Exercise Training Therapy for Heart...

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Yan Wang; Yuhong Jia; Molin Li; Sirui Jiao; Henan Zhao (2023). Table_3_Hotspot and Frontier Analysis of Exercise Training Therapy for Heart Failure Complicated With Depression Based on Web of Science Database and Big Data Analysis.pdf [Dataset]. http://doi.org/10.3389/fcvm.2021.665993.s003
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Yan Wang; Yuhong Jia; Molin Li; Sirui Jiao; Henan Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Exercise training has been extensively studied in heart failure (HF) and psychological disorders, which have been shown to worsen each other. However, our understanding of how exercise simultaneously protects the heart and brain of HF patients is still in its infancy. The purpose of this study was to take advantage of big data techniques to explore hotspots and frontiers of the mechanisms that protect the heart and brain simultaneously through exercise training.

    Methods: We studied the scientific publications on related research between January 1, 2003 and December 31, 2020 from the WoS Core Collection. Research hotspots were assessed through the open-source software CiteSpace, Pajek, and VOSviewer. Big data analysis and visualization were carried out using R, Cytoscape, and Origin.

    Results: From 2003 to 2020, the study of HF, depression, and exercise simultaneously was the lowest of all research sequences (two-way ANOVAs, p < 0.0001). Its linear regression coefficient r was 0.7641. The hotspot analysis of related keyword-driven research showed that inflammation and stress (including oxidative stress) were the common mechanisms. Through further analyses, we noted that inflammation, stress, oxidative stress, apoptosis, reactive oxygen species, cell death, and the mechanisms related to mitochondrial biogenesis/homeostasis could be regarded as the primary mechanism targets for studying the simultaneous intervention of exercise on the heart and brain of HF patients with depression.

    Conclusions: Our findings demonstrate the potential mechanism targets by which exercise interferes with both the heart and brain for HF patients with depression. We hope that they can boost the attention of other researchers and clinicians, and open up new avenues for designing more novel potential drugs to block the heart-brain axis vicious circle.

  15. Data_Sheet_10_A mathematical and exploratory data analysis of malaria...

    • frontiersin.figshare.com
    pdf
    Updated Jun 20, 2023
    Cite
    Michael O. Adeniyi; Oluwaseun R. Aderele; Olajumoke Y. Oludoun; Matthew I. Ekum; Maba B. Matadi; Segun I. Oke; Daniel Ntiamoah (2023). Data_Sheet_10_A mathematical and exploratory data analysis of malaria disease transmission through blood transfusion.PDF [Dataset]. http://doi.org/10.3389/fams.2023.1105543.s002
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Michael O. Adeniyi; Oluwaseun R. Aderele; Olajumoke Y. Oludoun; Matthew I. Ekum; Maba B. Matadi; Segun I. Oke; Daniel Ntiamoah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Malaria is a mosquito-borne disease spread by an infected vector (an infected female Anopheles mosquito) or through transfusion of plasmodium-infected blood to susceptible individuals. The disease burden has resulted in high global mortality, particularly among children under the age of five. Many intervention responses have been implemented to control malaria disease transmission, including blood screening, Long-Lasting Insecticide Bed Nets (LLIN), treatment with anti-malaria drugs, spraying chemicals/pesticides on mosquito breeding sites, and indoor residual spray, among others. As a result, an SIR (Susceptible-Infected-Recovered) model was developed to study the impact of various malaria control and mitigation strategies. The associated basic reproduction number and stability theory are used to investigate the stability of the model equilibrium points. By constructing an appropriate Lyapunov function, the global stability of the malaria-free equilibrium is investigated. By determining the direction of bifurcation, the implicit function theorem is used to investigate the stability of the model endemic equilibrium. The model is fitted to malaria data from Benue State, Nigeria, using R and MATLAB, and estimates of parameters were made. Following that, an optimal control model is developed and analyzed using Pontryagin's Maximum Principle. The malaria-free equilibrium point is locally and globally stable if the basic reproduction number (R0) and the blood transfusion reproduction number (Rα) are both less than or equal to unity. The study of the sensitive parameters of the model revealed that the transmission rate of malaria from mosquito to human (βmh), the transmission rate from human to mosquito (βhm), the blood transfusion reproduction number (Rα), and the recruitment rate of mosquitoes (bm) are all sensitive parameters capable of increasing the basic reproduction number (R0), thereby increasing the risk of spreading malaria. The result of the optimal control shows that five possible controls are effective in reducing the transmission of malaria. The study recommends the combination of five controls, followed by combinations of four and three controls, as effective in mitigating malaria transmission. The results of the optimal simulation also revealed that for communities or areas where resources are scarce, the combination of Long-Lasting Insecticide-Treated Bednets (u2), Treatment (u3), and Indoor insecticide spray (u5) is recommended. Numerical simulations are performed to validate the model's analytical results.

  16. Data_Sheet_2_SplinectomeR Enables Group Comparisons in Longitudinal...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Robin R. Shields-Cutler; Gabe A. Al-Ghalith; Moran Yassour; Dan Knights (2023). Data_Sheet_2_SplinectomeR Enables Group Comparisons in Longitudinal Microbiome Studies.PDF [Dataset]. http://doi.org/10.3389/fmicb.2018.00785.s002
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Robin R. Shields-Cutler; Gabe A. Al-Ghalith; Moran Yassour; Dan Knights
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Longitudinal, prospective studies often rely on multi-omics approaches, wherein various specimens are analyzed for genomic, metabolomic, and/or transcriptomic profiles. In practice, longitudinal studies in humans and other animals routinely suffer from subject dropout, irregular sampling, and biological variation that may not be normally distributed. As a result, testing hypotheses about observations over time can be statistically challenging without performing transformations and dramatic simplifications to the dataset, causing a loss of longitudinal power in the process. Here, we introduce splinectomeR, an R package that uses smoothing splines to summarize data for straightforward hypothesis testing in longitudinal studies. The package is open-source, and can be used interactively within R or run from the command line as a standalone tool. We present a novel in-depth analysis of a published large-scale microbiome study as an example of its utility in straightforward testing of key hypotheses. We expect that splinectomeR will be a useful tool for hypothesis testing in longitudinal microbiome studies.

  17. Supplement 1. An R-script and data file for the analysis conducted in the...

    • wiley.figshare.com
    html
    Updated Jun 1, 2023
    Cite
    Melvin M Varughese; Etienne A. D Pienaar (2023). Supplement 1. An R-script and data file for the analysis conducted in the main paper. [Dataset]. http://doi.org/10.6084/m9.figshare.3563832.v1
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Melvin M Varughese; Etienne A. D Pienaar
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    File List
    • Data.txt (md5: 98f606cc561c67f3c31fffc54687162b)
    • CT_Method_(User_Version).r (md5: 4b9125ece816c25f69503000707f1031)

    Description
    Data.txt – A data set containing species abundances for Pseudo-nitzschia australis and Prorocentrum micans along with surface water temperatures, collected at the Scripps Institution of Oceanography station for the time period 1930 to 1937. No column headers are given in the file for coding purposes; however the columns, in order, correspond to: the day number on which observations were made, the log-abundance for Pseudo-nitzschia australis, the log-abundance for Prorocentrum micans, and finally the surface temperature.
    CT_Method_(User_Version).r – R source code of approximately 475 lines (comments included) for implementation of the methodology introduced in the paper. The code is set to run the analysis presented in the paper. The R code should run so long as Data.txt is kept in the current R working directory. Comments in the script indicate, where necessary, what the applicable lines of code do and how they pertain to the main paper. MCMC output is automatically analyzed and figures are saved in .pdf format in the working directory. It is advised that an open-source editor such as Tinn-R be used in order to aid readability of the code.
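    A minimal R sketch for loading Data.txt as described above is given below (the column names are assigned here for convenience; the file itself has no header row):

    # Minimal sketch: read the four columns described above and plot surface temperature
    obs <- read.table("Data.txt", header = FALSE,
                      col.names = c("day", "log_abund_P_australis",
                                    "log_abund_P_micans", "surface_temp"))
    head(obs)
    plot(obs$day, obs$surface_temp, type = "l",
         xlab = "Day of observation", ylab = "Surface water temperature")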

  18. DataSheet_1_AgTC and AgETL: open-source tools to enhance data collection and management for plant science research.pdf

    • frontiersin.figshare.com
    pdf
    Updated Feb 21, 2024
    Cite
    Luis Vargas-Rojas; To-Chia Ting; Katherine M. Rainey; Matthew Reynolds; Diane R. Wang (2024). DataSheet_1_AgTC and AgETL: open-source tools to enhance data collection and management for plant science research.pdf [Dataset]. http://doi.org/10.3389/fpls.2024.1265073.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 21, 2024
    Dataset provided by
    Frontiers
    Authors
    Luis Vargas-Rojas; To-Chia Ting; Katherine M. Rainey; Matthew Reynolds; Diane R. Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancements in phenotyping technology have enabled plant science researchers to gather large volumes of information from their experiments, especially those that evaluate multiple genotypes. To fully leverage these complex and often heterogeneous data sets (i.e. those that differ in format and structure), scientists must invest considerable time in data processing, and data management has emerged as a considerable barrier for downstream application. Here, we propose a pipeline to enhance data collection, processing, and management from plant science studies, comprising two newly developed open-source programs. The first, called AgTC, is a series of programming functions that generates comma-separated values file templates to collect data in a standard format using either a lab-based computer or a mobile device. The second series of functions, AgETL, executes steps for an Extract-Transform-Load (ETL) data integration process where data are extracted from heterogeneously formatted files, transformed to meet standard criteria, and loaded into a database. There, data are stored and can be accessed for data analysis-related processes, including dynamic data visualization through web-based tools. Both AgTC and AgETL are flexible for application across plant science experiments without programming knowledge on the part of the domain scientist, and their functions are executed on Jupyter Notebook, a browser-based interactive development environment. Additionally, all parameters are easily customized from central configuration files written in the human-readable YAML format. Using three experiments from research laboratories in university and non-government organization (NGO) settings as test cases, we demonstrate the utility of AgTC and AgETL to streamline critical steps from data collection to analysis in the plant sciences.

  19. RQ4

    • figshare.com
    pdf
    Updated Jan 8, 2019
    Cite
    Yunior Pacheco (2019). RQ4 [Dataset]. http://doi.org/10.6084/m9.figshare.7562237.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jan 8, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yunior Pacheco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each CSV file contains information about the number of classes in a framework according to the number of extension points defined in them. These files can be used as input to the plots.R script. The PDF files show the resulting graphics corresponding to the data in the CSV files.
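    As an illustration, a minimal R sketch for working with one of these CSV files is shown below (the file and column names are assumptions; plots.R in this dataset is the script that actually produced the included PDF figures):

    # Minimal sketch: classes per number of extension points for one framework (assumed names)
    rq4 <- read.csv("framework_extension_points.csv")      # hypothetical file name
    head(rq4)

    barplot(rq4$classes, names.arg = rq4$extension_points,
            xlab = "Extension points defined in a class", ylab = "Number of classes")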

