License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General information
The script runs with R (Version 3.1.1; 2014-07-10) and packages plyr (Version 1.8.1), XLConnect (Version 0.2-9), utilsMPIO (Version 0.0.25), sp (Version 1.0-15), rgdal (Version 0.8-16), tools (Version 3.1.1) and lattice (Version 0.20-29).
Questions can be directed to: Martin Bulla (bulla.mar@gmail.com)
Data collection and the derivation of the individual variables are described in:
Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): p. 20131016.
Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015.
Data are available as an Rdata file. Missing values are NA. For better readability the subsections of the script can be collapsed.

Description of the method
1 - Data are visualized in an interactive actogram with time of day on the x-axis and one panel for each day of data.
2 - A red rectangle indicates the active field; clicking with the mouse on the depicted light signal in that field generates a data point that is automatically saved to the csv file (via a custom-made function). For this data extraction I recommend always clicking on the bottom line of the red rectangle, as data are always available there thanks to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. A data point is captured only if a greenish vertical bar appears and a new line of data appears in the R console.
3 - To extract incubation bouts, the first click in a new plot has to be the start of incubation, the next click marks the end of that incubation, and the following click on the same spot marks the start of incubation for the other sex. If the end of one bout and the start of the next are at different times, the data will still be extracted, but the sex, logger and bird_ID will be wrong; these need to be changed manually in the csv file. Similarly, the first bout for a given plot is always assigned to the male (if no data are present in the csv file) or based on previous data. Hence, whenever data from a new plot are extracted, it is worth checking at the first mouse click whether the sex, logger and bird_ID information is correct and, if not, adjusting it manually.
4 - When all information from one day (panel) has been extracted, right-click on the plot and choose "stop". This activates the following day (panel) for extraction.
5 - To end extraction before going through all the rectangles, press "escape".

Annotations of data files from turnstone_2009_Barrow_nest-t401_transmitter.RData

dfr - raw data on signal strength from the radio tags attached to the rumps of the female and male, plus information on when the birds were captured and the incubation stage of the nest
1. who: identifies whether the recording refers to the female, the male, a capture, or the start of hatching
2. datetime_: date and time of each recording
3. logger: unique identity of the radio tag
4. signal_: signal strength of the radio tag
5. sex: sex of the bird (f = female, m = male)
6. nest: unique identity of the nest
7. day: datetime_ variable truncated to year-month-day format
8. time: time of day in hours
9. datetime_utc: date and time of each recording, in UTC
10. cols: colors assigned to "who"

m - metadata for a given nest
1. sp: identifies the species (RUTU = Ruddy Turnstone)
2. nest: unique identity of the nest
3. year_: year of observation
4. IDfemale: unique identity of the female
5. IDmale: unique identity of the male
6. lat: latitude coordinate of the nest
7. lon: longitude coordinate of the nest
8. hatch_start: date and time when hatching of the eggs started
9. scinam: scientific name of the species
10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)
11. logger: type of device used to record incubation (IT = radio tag)
12. sampling: mean incubation sampling interval in seconds

s - metadata for the incubating parents
1. year_: year of capture
2. species: identifies the species (RUTU = Ruddy Turnstone)
3. author: identifies the author who measured the bird
4. nest: unique identity of the nest
5. caught_date_time: date and time when the bird was captured
6. recapture: was the bird captured before? (0 = no, 1 = yes)
7. sex: sex of the bird (f = female, m = male)
8. bird_ID: unique identity of the bird
9. logger: unique identity of the radio tag
License: Community Data License Agreement Permissive 1.0, https://cdla.io/permissive-1-0/
Case study: How does a bike-share navigate speedy success?
Scenario:
As data analysts on Cyclistic's marketing team, our focus is on growing annual memberships to drive the company's success. We aim to analyze the differing usage patterns of casual riders and annual members in order to craft a marketing strategy for converting casual riders into members. Our recommendations, supported by data insights and professional visualizations, await approval from Cyclistic's executives before we proceed.
About the company
In 2016, Cyclistic launched a bike-share program in Chicago, growing to 5,824 bikes and 692 stations. Initially, their marketing aimed at broad segments with flexible pricing plans attracting both casual riders (single-ride or full-day passes) and annual members. However, recognizing that annual members are more profitable, Cyclistic is shifting focus to convert casual riders into annual members. To achieve this, they plan to analyze historical bike trip data to understand the differences and preferences between the two user groups, aiming to tailor marketing strategies that encourage casual riders to purchase annual memberships.
Project Overview:
This capstone project is a culmination of the skills and knowledge acquired through the Google Professional Data Analytics Certification. It focuses on Track 1, which is centered around Cyclistic, a fictional bike-share company modeled to reflect real-world data analytics scenarios in the transportation and service industry.
Dataset Acknowledgment:
We are grateful to Motivate Inc. for providing the dataset that serves as the foundation of this capstone project. Their contribution has enabled us to apply practical data analytics techniques to a real-world dataset, mirroring the challenges and opportunities present in the bike-sharing sector.
Objective:
The primary goal of this project is to analyze the Cyclistic dataset to uncover actionable insights that could help the company optimize its operations, improve customer satisfaction, and increase its market share. Through comprehensive data exploration, cleaning, analysis, and visualization, we aim to identify patterns and trends that inform strategic business decisions.
Methodology:
Data Collection: Utilizing the dataset provided by Motivate Inc., which includes detailed information on bike usage, customer behavior, and operational metrics.
Data Cleaning and Preparation: Ensuring the dataset is accurate, complete, and ready for analysis by addressing any inconsistencies, missing values, or anomalies.
Data Analysis: Applying statistical methods and data analytics techniques to extract meaningful insights from the dataset.
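To make the analysis step concrete, a minimal R sketch is shown below. It assumes trip CSVs in the public Divvy-style layout with started_at, ended_at and member_casual columns; the file name and column names are assumptions for illustration, not part of the case-study brief.

# Minimal sketch of the analysis step (file and column names are assumptions)
library(dplyr)
library(lubridate)

trips <- read.csv("202301-divvy-tripdata.csv")   # hypothetical monthly trip export

trips %>%
  mutate(
    ride_length_min = as.numeric(difftime(ymd_hms(ended_at), ymd_hms(started_at), units = "mins")),
    day_of_week     = wday(ymd_hms(started_at), label = TRUE)
  ) %>%
  filter(ride_length_min > 0) %>%                # drop zero or negative durations
  group_by(member_casual, day_of_week) %>%
  summarise(rides = n(), mean_ride_min = mean(ride_length_min), .groups = "drop")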
Visualization and Reporting:
Creating intuitive and compelling visualizations to present the findings clearly and effectively, facilitating data-driven decision-making.
Findings and Recommendations:
Conclusion:
The Cyclistic Capstone Project not only demonstrates the practical application of data analytics skills in a real-world scenario but also provides valuable insights that can drive strategic improvements for Cyclistic. Through this project, we showcase the power of data analytics in transforming data into actionable knowledge, underscoring the importance of data-driven decision-making in today's competitive business landscape.
Acknowledgments:
Special thanks to Motivate Inc. for their support and for providing the dataset that made this project possible. Their contribution is immensely appreciated and has significantly enhanced the learning experience.
STRATEGIES USED
Case Study Roadmap - ASK
● What is the problem you are trying to solve? ● How can your insights drive business decisions?
Key Tasks ● Identify the business task ● Consider key stakeholders
Deliverable ● A clear statement of the business task
Case Study Roadmap - PREPARE
● Where is your data located? ● Are there any problems with the data?
Key tasks ● Download data and store it appropriately. ● Identify how it’s organized.
Deliverable ● A description of all data sources used
Case Study Roadmap - PROCESS
● What tools are you choosing and why? ● What steps have you taken to ensure that your data is clean?
Key tasks ● Choose your tools. ● Document the cleaning process.
Deliverable ● Documentation of any cleaning or manipulation of data
Case Study Roadmap - ANALYZE
● Has your data been properly formatted? ● How will these insights help answer your business questions?
Key tasks ● Perform calculations ● Format your data
Deliverable ● A summary of analysis
Case Study Roadmap - SHARE
● Were you able to answer all questions of stakeholders? ● Can Data visualization help you share findings?
Key tasks ● Present your findings ● Create effective data viz.
Deliverable ● Supporting viz and key findings
Case Study Roadmap - A...
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential for analyzing massive amounts of information and making data-driven decisions.
32 cheat sheets: These cover techniques and tricks for visualization from A to Z, Python and R visualization cheat sheets, types of charts and their significance, storytelling with data, and more.
32 charts: The corpus also includes information on a wide range of data visualization charts, along with their Python code, d3.js code, and presentations explaining each chart clearly.
Some recommended books on data visualization that every data scientist should read:
If you find any books, cheat sheets, or charts missing, or would like to suggest new documents, please let me know in the discussion section!
A kind request to Kaggle users: please create notebooks on different visualization charts, choosing a dataset of your own interest, as many beginners and experts alike could find them useful!
Try creating interactive EDA using animation combined with data visualization charts, to show how to tackle a dataset and extract insights from it.
Feel free to use the discussion platform of this dataset to ask questions about the data visualization corpus and data visualization techniques.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Distance measures are widely used for examining genetic structure in datasets that comprise many individuals scored for a very large number of attributes. Genotype datasets composed of single nucleotide polymorphisms (SNPs) typically contain bi-allelic scores for tens of thousands, if not hundreds of thousands, of loci. We examine the application of distance measures to SNP genotypes and sequence tag presence-absences (SilicoDArT) and use real and simulated data to illustrate pitfalls in the application of genetic distances and their visualization. The datasets used to illustrate points in the associated review are provided here together with the R script used to analyse the data. Data are either simulated internally to this script or are SNP data generated as part of other studies, included as compressed binary files readily accessible by reading into R with the base function readRDS(). Refer to the analysis script for examples.

Methods

A dataset was constructed from a SNP matrix generated for the freshwater turtles in the genus Emydura, a recent radiation of Chelidae in Australasia. The dataset (SNP_starting_data.Rdata) includes selected populations that vary in level of divergence, encompassing variation within species and between closely related species. Sampling localities with evidence of admixture between species were removed. Monomorphic loci were removed, and the data were filtered on call rate (>95%), repeatability (>99.5%) and read depth (5x < read depth < 50x). Where there was more than one SNP per sequence tag, only one was retained at random. The resultant dataset had 18,196 SNP loci scored for 381 individuals from 7 sampling localities or populations: Emydura victoriae [Ord River, NT, n=15], E. tanybaraga [Holroyd River, Qld, n=10], E. subglobosa worrelli [Daly River, NT, n=25], E. subglobosa subglobosa [Fly River, PNG, n=55], E. macquarii macquarii [Murray Darling Basin north, NSW/Qld, n=152], E. macquarii krefftii [Fitzroy River, Qld, n=39] and E. macquarii emmotti [Cooper Creek, Qld, n=85]. The missing data rate was 1.7%, subsequently imputed by nearest neighbour to yield a fully populated data matrix. The data are a subset of those published by Georges et al. (2018, Molecular Ecology 27:5195-5213), for illustrative purposes only. A companion SilicoDArT dataset (silicodart_starting_data.Rdata) is also included. The above manipulations were performed in the R package dartR. Principal Components Analysis was undertaken using the glPca function of the R adegenet package (as implemented in dartR). Principal Coordinates Analysis was undertaken using the pcoa function in the R package ape, as implemented in dartR.

To exemplify the effect of missing values on SNP visualisation using PCA, we simulated ten populations that reproduced over 200 non-overlapping generations. Simulated populations were placed in a linear series with low dispersal between adjacent populations (one disperser every ten generations). Each population had 100 individuals, of which 50 were sampled at random. Genotypes were generated for 1,000 neutral loci on one chromosome. We then randomly selected 50% of genotypes and set them as missing data. Principal Components Analysis was undertaken using the glPca function of the R adegenet package. The R script to implement this is provided (Supplementary_script_for_ms.R).
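As a pointer for getting started (a minimal sketch only, not a substitute for the supplied Supplementary_script_for_ms.R): loading one of the provided files with readRDS() and running a basic PCA might look as follows, assuming the stored object is a genlight object as adegenet and dartR expect.

# Minimal sketch: load a provided object and run a PCA (assumes a genlight object)
library(adegenet)                          # provides glPca() for genlight objects

gl <- readRDS("SNP_starting_data.Rdata")   # file name taken from the description above
pca <- glPca(gl, nf = 4)                   # retain the first four axes
scatter(pca)                               # quick look at the ordination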
The data for the Australian Blue Mountains skink Eulamprus leuraensis were generated for 372 individuals collected from 17 swamps isolated to varying degrees in the Blue Mountains region of New South Wales. Tail snips were collected and stored in 95% ethanol. The tissue samples were digested with proteinase K overnight and DNA was extracted using a NucleoMag 96 Tissue Kit (Macherey-Nagel, Düren, Germany) coupled with NucleoMag SEP (Ref. 744900) to allow automated separation of high-quality DNA on a Freedom Evo robotic liquid handler (TECAN, Männedorf, Switzerland). SNP data were generated by the commercial service of Diversity Arrays Technology Pty Ltd (Canberra, Australia) using published protocols. A total of 13,496 loci were scored, which reduced to 7,935 after filtering out secondary SNPs on the same sequence tag, filtering on reproducibility (threshold 0.99) and call rate (threshold 0.95), and removal of monomorphic loci. The resultant data (Eulamprus_filtered.Rdata) are used to demonstrate the impact of a substantial inversion on the outcomes of a PCA.

To test the effect of having closely related individuals (parents and offspring) on the PCoA pattern, we ran a simulation using dartR in which we picked two individuals to become the parents of 2-8 offspring. We ran a PCoA for each of the simulated cases. The R code used is included in the R script uploaded here. Refer to the companion manuscript for links to the literature associated with the above techniques.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The FragPipe computational proteomics platform is gaining widespread popularity among the proteomics research community because of its fast processing speed and user-friendly graphical interface. Although FragPipe produces well-formatted output tables that are ready for analysis, there is still a need for an easy-to-use and user-friendly downstream statistical analysis and visualization tool. FragPipe-Analyst addresses this need by providing an R shiny web server to assist FragPipe users in conducting downstream analyses of the resulting quantitative proteomics data. It supports major quantification workflows, including label-free quantification, tandem mass tags, and data-independent acquisition. FragPipe-Analyst offers a range of useful functionalities, such as various missing value imputation options, data quality control, unsupervised clustering, differential expression (DE) analysis using Limma, and gene ontology and pathway enrichment analysis using Enrichr. To support advanced analysis and customized visualizations, we also developed FragPipeAnalystR, an R package encompassing all FragPipe-Analyst functionalities that is extended to support site-specific analysis of post-translational modifications (PTMs). FragPipe-Analyst and FragPipeAnalystR are both open-source and freely available.
[Note 2023-08-14: Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086]

This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re

The .zip archive cropping-systems-1.0.zip contains the following data and code files.

Data
stem_soil_CFU_by_plant.csv: Soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with .
yield_CFU_by_plot.csv: Yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level.

Code
cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code.
equations.Rmd: RMarkdown notebook with formatted equations.
formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript.

The R project file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing the raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.
A comprehensive Quality Assurance (QA) and Quality Control (QC) statistical framework consists of three major phases: Phase 1, preliminary exploration of the raw datasets, including time formatting and combining datasets of different lengths and different time intervals; Phase 2, QA of the datasets, including detecting and flagging duplicates, outliers, and extreme values; and Phase 3, development of a time series of the desired frequency, imputation of missing values, visualization, and a final statistical summary. The time series data collected at the Billy Barr meteorological station (East River Watershed, Colorado) were analyzed. The developed statistical framework is suitable for both real-time and post-data-collection QA/QC analysis of meteorological datasets.

The files in this data package include one Excel file converted to CSV format (Billy_Barr_raw_qaqc.csv) that contains the raw meteorological data, i.e., the input data for the QA/QC analysis. The second CSV file (Billy_Barr_1hr.csv) contains the QA/QC-processed and flagged meteorological data, i.e., the output data from the QA/QC analysis. The last file (QAQC_Billy_Barr_2021-03-22.R) is a script written in R that implements the QA/QC and flagging process. The purpose of the CSV data files included in this package is to provide the input and output files used by the R script.
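The authors' actual flagging logic is in QAQC_Billy_Barr_2021-03-22.R; purely as an illustration of Phases 2 and 3 (flagging extreme values, then building an hourly series with interpolated gaps), a generic R sketch might look like the following. The column names and thresholds here are placeholders, not the station's real schema.

# Generic QA/QC sketch (column names and thresholds are placeholders)
raw <- read.csv("Billy_Barr_raw_qaqc.csv")               # raw input file named above
raw$datetime <- as.POSIXct(raw$timestamp, tz = "UTC")    # 'timestamp' column is hypothetical

# Phase 2: flag duplicates and extreme air-temperature values
raw <- raw[!duplicated(raw$datetime), ]
raw$flag_temp <- raw$air_temp < -40 | raw$air_temp > 40  # 'air_temp' and limits are placeholders

# Phase 3: hourly series with linear interpolation across flagged/missing values
hours <- seq(min(raw$datetime), max(raw$datetime), by = "hour")
hourly_temp <- approx(x = as.numeric(raw$datetime),
                      y = ifelse(raw$flag_temp, NA, raw$air_temp),
                      xout = as.numeric(hours), rule = 2)$y
summary(hourly_temp)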
Blood Transfusion Service Center Data Set: data taken from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. This is a classification problem.
Data Set Information: To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center passes its blood transfusion service bus to one university in Hsin-Chu City to gather donated blood about every three months. To build an RFMTC model, we selected 748 donors at random from the donor database. Each of these 748 donor records includes R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated in c.c.), T (Time: months since first donation), and a binary variable representing whether the donor gave blood in March 2007 (1 = donated blood; 0 = did not donate blood).
Attribute Information: Given are the variable name, variable type, measurement unit and a brief description. The "Blood Transfusion Service Center" data set is a classification problem. The order of this listing corresponds to the order of the numerals along the rows of the database: R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated in c.c.), T (Time: months since first donation), and a binary variable representing whether the donor gave blood in March 2007 (1 = donated blood; 0 = did not donate blood).
Table 1 shows the descriptive statistics of the data. We selected 500 records at random as the training set and the remaining 248 as the testing set.
Table 1. Descriptive statistics of the data
| Variable | Data Type | Measurement | Description | min | max | mean | std |
| Recency | quantitative | Months | Input | 0.03 | 74.4 | 9.74 | 8.07 |
| Frequency | quantitative | Times | Input | 1 | 50 | 5.51 | 5.84 |
| Monetary | quantitative | c.c. blood | Input | 250 | 12500 | 1378.68 | 1459.83 |
| Time | quantitative | Months | Input | 2.27 | 98.3 | 34.42 | 24.32 |
| Whether he/she donated blood in March 2007 | binary | 1=yes, 0=no | Output | 0 | 1 | 1 (24%) | 0 (76%) |
| Data Set Characteristics | Multivariate |
| Number of Instances | 748 |
| Area | Business |
| Attribute Characteristics | Real |
| Number of Attributes | 5 |
| Associated Tasks | Classification |
| Missing Values? | N/A |
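Since the task is framed as a classification problem, a minimal R baseline could be a logistic regression on the predictors above. The sketch below assumes the data have been saved as transfusion.csv and renames the columns for convenience; both the file name and the short column names are assumptions.

# Minimal classification baseline (file name and column names are assumptions)
donors <- read.csv("transfusion.csv")
names(donors) <- c("Recency", "Frequency", "Monetary", "Time", "Donated")

set.seed(1)
train_idx <- sample(nrow(donors), 500)        # 500 training rows, 248 test rows, as described above
train <- donors[train_idx, ]
test  <- donors[-train_idx, ]

# Monetary is a constant multiple of Frequency (250 c.c. per donation), so it adds no information
fit  <- glm(Donated ~ Recency + Frequency + Time, data = train, family = binomial)
pred <- predict(fit, newdata = test, type = "response")
mean((pred > 0.5) == test$Donated)            # simple hold-out accuracy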
Citation Request: Reuse of this database is unlimited with retention of the copyright notice for Prof. I-Cheng Yeh and the following published paper: Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, "Knowledge discovery on RFM model using Bernoulli sequence," Expert Systems with Applications, 2008 (doi:10.1016/j.eswa.2008.07.018).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source data and code supporting the manuscript "Beyond NUE: A focus on true nitrogen gains in cereals" based on maize, sorghum, and barley datasets.
README (README.txt). References and sources for the reviewed dataset.
Dataset S1 (rev_dataset.xlsx). Review data on NUE-related traits for maize, sorghum, and barley cultivars from different decades of commercial release, collected from the published literature. Missing values or missing study information are represented by “n/a” in the data. The traits and variables included in the data are described in a separate sheet within the XLSX file.
Dataset S2 (analysisR.pdf). R code for data processing, analysis, and visualization.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains wireless link quality estimation data for the FlockLab testbed [1,2]. The rationale and description of this dataset are given in the following abstract (the pdf is included in this repository; see below).
Dataset: Wireless Link Quality Estimation on FlockLab – and Beyond. Romain Jacob, Reto Da Forno, Roman Trüb, Andreas Biri, Lothar Thiele. DATA '19: Proceedings of the 2nd Workshop on Data Acquisition To Analysis, 2019.
Data collection scenario
The data collection scenario is simple. Each FlockLab node is assigned one dedicated time slot. In this slot, the node sends 100 packets, called strobes. All strobes have the same payload size and use a given radio frequency channel and transmit power. All other nodes listen for the strobes and log packet reception events (i.e., success or failure).
The test scenario is run every two hours on two different platforms: the TelosB [3] and DPP-cc430 [4]. We used all nodes available at test time (between 27 and 29 nodes).
Final dataset status
3 months of data with about 12 tests per day per platform
5 months of data with about 4 tests per day per platform
Data collection firmware
We are happy to share the link quality data we collected for the FlockLab testbed, but we also want to make it easier for others to collect similar datasets for other wireless networks. To this end, we include in this repository the data collection firmware we designed. Data collection scheduling and control are done entirely in software, in order to make the firmware usable in a large variety of wireless networks. We implemented our data collection software using Baloo [5], a flexible network stack design framework based on Synchronous Transmission. Baloo efficiently handles network time synchronization and offers a flexible interface to schedule communication rounds. The firmware source code is available in the Baloo repository [6].
A set of experiment parameters can be patched directly in the firmware, which lets the user tune the data collection without having to recompile the source code. This improves usability and facilitates automation. An example patching script is included in this repository. Currently, the following parameters can be patched:
rf_channel,
payload,
host_id, and
rand_seed
Currently supported platforms
TelosB [3]
DPP-cc430 [4]
Repository versions
v1.4.1 Updated visualizations in the notebook
v1.4.0 Addition of data from November 2019 to March 2020. Data collection is discontinued (the new FlockLab testbed is being set up).
v1.3.1 Update abstract and notebook
v1.3.0 Addition of October 2019 data. The frequency of tests has been reduced to 4 per day, executing at (approximately) 1:00, 7:00, 13:00, and 19:00. From October 28 onward, time shifted by one hour (2:00, 8:00, 14:00, 20:00).
v1.2.0 Addition of September 2019 data. Many missing tests on the 12, 13, 19, and 20 of September (due to construction works in the building).
v1.1.4 Update of the abstract to have hyperlinks to the plots. Corrected typos.
v1.1.0 Add the data collected in August 2019. Data collection was disturbed at the beginning of the month and resumed normally on 13 August; data from the preceding days are incomplete.
v1.0.0 Initial version. Contains the data collected in July 2019, from the 10th to the 30th of July. No data were collected on the 31st of July (technical issue).
List of files
yyyy-mm_raw_platform.zip Archive containing all FlockLab test result files (one .zip file per month and per platform).
yyyy-mm_preprocessed_all.zip Archive containing preprocessed csv files, one per month and per platform.
firmware.zip Archive containing the firmware for all supported platforms.
firmware_patch.sh Example bash script illustrating the firmware patching.
parse_flocklab_results.ipynb [open in nbviewer] Jupyter notebook used to create the preprocessed data files. Also includes some examples of data visualization.
parse_flocklab_results.html HTML rendering of the notebook (static).
plots.zip Archive containing high resolution visualization of the dataset, generated by the parse_flocklab_results notebook, and presented in the abstract.
abstract.pdf A 3 page abstract presenting the dataset.
CRediT.pdf The list of contributions from the authors.
References
[1] R. Lim, F. Ferrari, M. Zimmerling, C. Walser, P. Sommer, and J. Beutel, “FlockLab: A Testbed for Distributed, Synchronized Tracing and Profiling of Wireless Embedded Systems,” in Proceedings of the 12th International Conference on Information Processing in Sensor Networks, New York, NY, USA, 2013, pp. 153–166.
[2] “FlockLab,” GitLab. [Online]. Available: https://gitlab.ethz.ch/tec/public/flocklab/wikis/home. [Accessed: 24-Jul-2019].
[3] Advanticsys, “MTM-CM5000-MSP 802.15.4 TelosB mote Module.” [Online]. Available: https://www.advanticsys.com/shop/mtmcm5000msp-p-14.html. [Accessed: 21-Sep-2018].
[4] Texas Instruments, “CC430F6137 16-Bit Ultra-Low-Power MCU.” [Online]. Available: http://www.ti.com/product/CC430F6137. [Accessed: 21-Sep-2018].
[5] R. Jacob, J. Bächli, R. Da Forno, and L. Thiele, “Synchronous Transmissions Made Easy: Design Your Network Stack with Baloo,” in Proceedings of the 2019 International Conference on Embedded Wireless Systems and Networks, 2019.
[6] “Baloo,” Dec-2018. [Online]. Available: http://www.romainjacob.net/research/baloo/.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
📘 Description
The Student Academic Performance Dataset contains detailed academic and lifestyle information of 250 students, created to analyze how various factors — such as study hours, sleep, attendance, stress, and social media usage — influence their overall academic outcomes and GPA.
This dataset is synthetic but realistic, carefully generated to reflect believable academic patterns and relationships. It’s perfect for learning data analysis, statistics, and visualization using Excel, Python, or R.
The data includes 12 attributes, primarily numerical, ensuring that it’s suitable for a wide range of analytical tasks — from basic descriptive statistics (mean, median, SD) to correlation and regression analysis.
📊 Key Features
🧮 250 rows and 12 columns
💡 Mostly numerical — great for Excel-based statistical functions
🔍 No missing values — ready for direct use
📈 Balanced and realistic — ideal for clear visualizations and trend analysis
🎯 Suitable for:
Descriptive statistics
Correlation & regression
Data visualization projects
Dashboard creation (Excel, Tableau, Power BI)
💡 Possible Insights to Explore
How do study hours impact GPA?
Is there a relationship between stress levels and performance?
Does social media usage reduce study efficiency?
Do students with higher attendance achieve better grades?
⚙️ Data Generation Details
Each record represents a unique student.
GPA is calculated using a weighted formula based on midterm and final scores.
Relationships are designed to be realistic — for example:
Higher study hours → higher scores and GPA
Higher stress → slightly lower sleep hours
Excessive social media time → reduced academic performance
⚠️ Disclaimer
This dataset is synthetically generated using statistical modeling techniques and does not contain any real student data. It is intended purely for educational, analytical, and research purposes.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data & Code from Farrell et al., "Predicting missing links in global host-parasite networks".

Scripts
Within the scripts folder are scripts to process the raw data and model results:
1. Download, clean, and merge the host-parasite interaction databases with the mammal supertree (process_raw_data.R)
2. Re-create the raw data plots from the manuscript (raw_data_plots.R)
3. Plot posterior interaction matrices and scaled trees, and pull out the top predicted links (model_summaries.R)
4. Re-create the diagnostic plots from the manuscript (diagnostic_plots.R)
5. Functions for data manipulation and visualization that are sourced by the other scripts (network_analysis.R)
6. Investigate bias propagation via node degree product (bias_investigation.R)
7. Generate risk maps (risk_maps.R)

Data
- raw_data: includes the data necessary to amalgamate the host-parasite interaction databases (via the script process_raw_data.R).
- clean_data: includes the full host-parasite interaction list 'hp_list' in both .csv and .rds formats, the binary interaction matrices for the full dataset and for subsets by parasite type (virus, bacteria, fungi, etc.), and the model diagnostics ('model_diagnostics.csv') used in diagnostic_plots.R.
- model_results: contains one .rds file per model, holding the output interaction matrix from each simulation ('P'), the table of model diagnostics ('TB'), and the phylogeny scaling parameter ('Eta'), where applicable. Note that to save space the full cross-fold fit posteriors are omitted (these total ~4.5 GB); please contact MF if these are required.
- literature_results: contains a .csv version of the results of the literature search outlined in the Supplementary Information.
- plots_tables: contains .csv files with the top 100 'missing' links for each model, and a .csv with the top 1000 links from the full model run on the full dataset.
Project Documentation: Predicting the S&P 500 Price

Problem Statement: The goal of this project is to develop a machine learning model that can predict the future price of the S&P 500 index based on historical data and relevant features. By accurately predicting price movements, we aim to assist investors and financial professionals in making informed decisions and managing their portfolios effectively.

Dataset Description: The dataset used for this project contains historical data of the S&P 500 index, along with several other features such as dividends, earnings, the consumer price index (CPI), interest rates, and more. The dataset spans a certain time period and includes daily values of these variables.

Steps Taken:
1. Data Preparation and Exploration: Loaded the dataset and performed initial exploration. Checked for missing values and handled them if any. Explored the statistical summary and distributions of the variables. Conducted correlation analysis to identify potential features for prediction.
2. Data Visualization and Analysis: Plotted time series graphs to visualize the S&P 500 index and other variables over time. Examined the trends, seasonality, and residual behavior of the time series using decomposition techniques. Analyzed the relationships between the S&P 500 index and other features using scatter plots and correlation matrices.
3. Feature Engineering and Selection: Selected relevant features based on correlation analysis and domain knowledge. Explored feature importance using tree-based models and selected informative features. Prepared the final feature set for model training.
4. Model Training and Evaluation: Split the dataset into training and testing sets. Selected a regression model (Linear Regression) for price prediction. Trained the model using the training set. Evaluated the model's performance using mean squared error (MSE) and R-squared (R^2) metrics on both training and testing sets.
5. Prediction and Interpretation: Obtained predictions for future S&P 500 prices using the trained model. Interpreted the predicted prices in the context of current market conditions and the percentage change from the current price.

Limitations and Future Improvements:
- The predictive performance of the model is based on the available features and historical data, and it may not capture all the complexities and factors influencing the S&P 500 index.
- The model's accuracy and reliability are subject to the quality and representativeness of the training data.
- The model assumes that the historical patterns and relationships observed in the data will continue in the future, which may not always hold true.
- Future improvements could include incorporating additional relevant features, exploring different regression algorithms, and considering more sophisticated techniques such as time series forecasting models.
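The modelling step described above (linear regression evaluated with MSE and R-squared) can be sketched in R roughly as follows; the file name and column names are placeholders, not the project's actual schema.

# Illustrative sketch of the training and evaluation step (file and column names are placeholders)
sp <- read.csv("sp500_monthly.csv")           # assumed columns: Price, Dividend, Earnings, CPI, Rate

set.seed(123)
n <- nrow(sp)
train_idx <- sample(n, floor(0.8 * n))
train <- sp[train_idx, ]
test  <- sp[-train_idx, ]

fit  <- lm(Price ~ Dividend + Earnings + CPI + Rate, data = train)
pred <- predict(fit, newdata = test)

mse <- mean((test$Price - pred)^2)
r2  <- 1 - sum((test$Price - pred)^2) / sum((test$Price - mean(test$Price))^2)
c(MSE = mse, R2 = r2)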
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.
This dataset was generated using R.
# Set seed for reproducibility
set.seed(42)
# Define number of observations (students)
n <- 5000
# Generate study hours (independent variable)
# Uniform distribution between 0 and 12 hours
study_hours <- runif(n, min = 0, max = 12)
# Create relationship between study hours and grade
# Base grade: 40 points
# Each study hour adds an average of 5 points
# Add normal noise (standard deviation = 10)
theoretical_grade <- 40 + 5 * study_hours
# Add normal noise to make it realistic
noise <- rnorm(n, mean = 0, sd = 10)
# Calculate final grade
grade <- theoretical_grade + noise
# Limit grades between 0 and 100
grade <- pmin(pmax(grade, 0), 100)
# Create the dataframe
dataset <- data.frame(
student_id = 1:n,
study_hours = round(study_hours, 2),
grade = round(grade, 2)
)
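As a quick sanity check on the generated data, a simple linear fit should recover an intercept near 40 and a slope near 5 (slightly attenuated by the 0-100 clipping); the CSV file name below is just a suggestion.

# Optional: save the data and verify the simulated relationship
write.csv(dataset, "study_hours_grades.csv", row.names = FALSE)   # suggested file name
fit <- lm(grade ~ study_hours, data = dataset)
coef(fit)   # intercept should be near 40, slope near 5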
Market basket analysis with the Apriori algorithm
The retailer wants to target customers with suggestions for item sets that they are most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all transactions that took place over a period of time. The retailer will use the results to grow its business and to make item-set suggestions to customers, so that we can increase customer engagement, improve the customer experience, and identify customer behaviour. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most useful when you want to find associations between different objects in a set, or frequent patterns in a transaction database. It can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.8; lift = confidence / P(mouse mat) = 0.8/0.09 = 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Number of Attributes: 7
First, we need to load the required libraries.
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
Next we clean our data frame and remove missing values.
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
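For the rule-mining step itself, a minimal sketch with the arules package might look like the following; the baskets here are invented toy data, not the retailer's actual invoices.

# Minimal Apriori sketch on toy transactions (not the retailer's actual data)
library(arules)

baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("milk", "butter"),
  c("bread", "butter")
)
trans <- as(baskets, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.25, conf = 0.6, minlen = 2))
inspect(sort(rules, by = "lift"))             # strongest associations first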
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
📦 Dataset Description This dataset supports the Online Retail Customer Segmentation Project, which analyzes one year of transaction records from a UK-based online gift store.
The goal is to identify customer segments using RFM (Recency, Frequency, Monetary) modeling and KMeans clustering, and to explore customer value and behavior through visualization dashboards.
📁 Included Files:
retail_cleaned.csv: Cleaned transaction-level data (negative quantities and missing IDs removed)
retail_segmented.csv: Main analysis table with RFM-based Segment labels merged in
customer_summary copy.csv: Customer-level summary: total orders, total spent, first/last purchase dates
monthly_sales copy.csv: Aggregated monthly sales data for time trend analysis
Online Retail Analysis.pdf: Full project report (data process + dashboard screenshots + insights)

🔧 Preprocessing Summary: Removed records with missing CustomerID, negative Quantity, or invalid UnitPrice
Created TotalPrice = Quantity × UnitPrice
Generated customer metrics in SQL and calculated RFM values in R
Performed KMeans clustering to create customer segments (Segment 1–4)
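The RFM and clustering steps can be sketched in R roughly as below, using the cleaned transaction table; CustomerID and TotalPrice are named above, while the InvoiceDate column name is an assumption.

# Rough sketch of the RFM + KMeans steps (InvoiceDate column name is an assumption)
library(dplyr)

retail <- read.csv("retail_cleaned.csv")
retail$InvoiceDate <- as.Date(retail$InvoiceDate)
snapshot <- max(retail$InvoiceDate) + 1                   # day after the last recorded purchase

rfm <- retail %>%
  group_by(CustomerID) %>%
  summarise(
    Recency   = as.numeric(snapshot - max(InvoiceDate)),  # days since last purchase
    Frequency = n(),                                      # number of transaction rows
    Monetary  = sum(TotalPrice)                           # total spend
  )

set.seed(42)
km <- kmeans(scale(rfm[, c("Recency", "Frequency", "Monetary")]), centers = 4)
rfm$Segment <- km$cluster
table(rfm$Segment)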
📊 Applications: Customer segmentation for loyalty/retention campaigns
Sales trend and seasonal pattern analysis
High-value customer targeting
Geographical revenue mapping
This dataset simulates Jet2 airline passenger bookings and is designed for segmentation, clustering, and behavioral analysis.
The Jet2 Synthetic Booking dataset provides a realistic simulation of passenger booking behavior for Jet2, a UK-based leisure airline. It is ideal for data science projects involving customer segmentation, predictive modeling, and operational insights.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Dataset contains healthcare statistics and categorical information about patients who have been diagnosed with AIDS. This dataset was initially published in 1996.
https://classic.clinicaltrials.gov/ct2/show/NCT00000625
https://archive.ics.uci.edu/dataset/890/aids+clinical+trials+group+study+175
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By data.world's Admin [source]
This dataset offers a unique insight into the coverage of social insurance programs for the wealthiest quintile of populations around the world. It reveals how many individuals in each country receive support from old-age contributory pensions, disability benefits, and social security and health insurance benefits such as occupational injury benefits, paid sick leave, maternity leave, and more. This data provides an invaluable resource for understanding the health and well-being of the most financially privileged in society, a group that often has greater impact on decision making than others. With up-to-date figures as of 2019-05-11, this dataset is invaluable for uncovering where work remains to be done to improve healthcare provision in each country across the world.
For more datasets, click here.
Understand the context: Before you begin analyzing this dataset, it is important to understand the information that it provides. Take some time to read the description of what is included in the dataset, including a clear understanding of the definitions and scope of coverage provided with each data point.
Examine the data: Once you have a general understanding of this dataset's contents, take some time to explore its contents in more depth. What specific questions does this dataset help answer? What kind of insights does it provide? Are there any missing pieces?
Clean & Prepare Data: After you've preliminarily examined its content, start preparing your data for further analysis and visualization. Clean up any formatting issues or irregularities in your data set by correcting typos and eliminating unnecessary rows or columns before working with your chosen programming language (I prefer R for data manipulation tasks). Additionally, consider performing necessary transformations such as sorting or averaging values if appropriate for the findings you wish to draw from your analysis.
Visualize Results: Once you've cleaned and prepared your data, use visualizations such as charts, graphs or tables to reveal patterns that support specific conclusions about how insurance coverage under social programs varies among different groups within society's quintiles (based on age groups etc.). This type of visualization allows those who aren't familiar with programming to process complex information more quickly and accurately than when it is displayed only in tabular form!
Final Analysis & Export Results: Finally, export your visuals into presentation-ready formats (e.g., PDFs) which can be shared with colleagues! Additionally, use these results as part of a narrative conclusion report providing an accurate assessment and meaningful interpretation of how social insurance programs vary between different members of society's quintiles (i.e., richest vs poorest), along with potential policy implications relevant for implementing effective strategies that improve access accordingly. A short R example of the cleaning-and-visualizing workflow is sketched below.
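As an example of the cleaning-and-visualizing workflow sketched in the steps above, a short R snippet might look like the following; the column names (country, coverage_pct) are placeholders, since the file's actual headers are not listed here.

# Illustrative only; column names are placeholders, not the file's actual headers
library(ggplot2)

cov <- read.csv("coverage-of-social-insurance-programs-in-richest-quintile-of-population-1.csv")
cov_clean <- subset(cov, !is.na(coverage_pct))            # drop rows with missing coverage

ggplot(cov_clean, aes(x = reorder(country, coverage_pct), y = coverage_pct)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Coverage in richest quintile (%)")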
- Analyzing the effectiveness of social insurance programs by comparing the coverage levels across different geographic areas or socio-economic groups;
- Estimating the economic impact of social insurance programs on local and national economies by tracking spending levels and revenues generated;
- Identifying potential problems with access to social insurance benefits, such as racial or gender disparities in benefit coverage
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: coverage-of-social-insurance-programs-in-richest-quintile-of-population-1.csv
If you use this dataset in your research, please credit data.world's Admin.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Project Description: Analysis of Restaurant Preferences and Ordering Trends on Zomato In this project, we explore and analyze various aspects of customer behavior and restaurant performance using Zomato's data. Our goal is to derive actionable insights that can help enhance customer experience and optimize restaurant offerings.
Objectives:

Restaurant Popularity Analysis
Identify Popular Restaurant Types: Determine which types of restaurants receive the most votes from customers. This will help us understand which categories are most favored and could guide marketing strategies.

Vote Distribution by Restaurant Type
Quantify Votes for Each Type: Calculate the total number of votes each type of restaurant has received. This will provide a clear picture of customer preferences across different restaurant categories.

Rating Trends
Analyze Rating Distribution: Examine the ratings that the majority of restaurants have received. This will help identify the overall satisfaction level of customers and the general quality of dining experiences.

Couple Spending Patterns
Average Spending Analysis: Analyze the average spending per order for couples who frequently order online. This insight will assist in understanding spending behaviors and potential revenue generation from this demographic.

Mode of Ordering Performance
Evaluate Ratings by Ordering Mode: Compare the ratings received by online versus offline orders to determine which mode is preferred and delivers higher customer satisfaction.

Offline Ordering Trends
Identify High-Order Restaurant Types: Find out which types of restaurants receive more offline orders. This information can be used to tailor promotions and offers for specific restaurant categories, enhancing customer engagement.

Methodology:

Data Collection: Utilize Zomato’s API or available datasets to gather comprehensive data on restaurant types, votes, ratings, and ordering modes.

Data Cleaning and Preparation: Clean the dataset to handle missing values, standardize categories, and ensure data accuracy.

Data Analysis: Employ statistical and data visualization tools to aggregate votes, analyze ratings, and explore spending patterns. Use tools like Python (Pandas, Matplotlib, Seaborn), R, or Excel for data processing and visualization.

Insights and Recommendations: Generate insights based on the analysis and provide actionable recommendations for restaurant marketing strategies and customer engagement. This project aims to provide a detailed understanding of customer preferences and behaviors, enabling Zomato to make data-driven decisions to improve user experience and offer targeted promotions.