License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General information
The script runs with R (Version 3.1.1; 2014-07-10) and packages plyr (Version 1.8.1), XLConnect (Version 0.2-9), utilsMPIO (Version 0.0.25), sp (Version 1.0-15), rgdal (Version 0.8-16), tools (Version 3.1.1) and lattice (Version 0.20-29).
Questions can be directed to: Martin Bulla (bulla.mar@gmail.com)
Data collection and the derivation of the individual variables are described in:
Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): p. 20131016.
Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015.
Data are available as an Rdata file. Missing values are NA. For better readability the subsections of the script can be collapsed.

Description of the method
1 - Data are visualized in an interactive actogram with time of day on the x-axis and one panel for each day of data.
2 - A red rectangle indicates the active field; clicking with the mouse on the depicted light signal in that field generates a data point that is automatically saved to the csv file (via a custom-made function). For this data extraction I recommend always clicking on the bottom line of the red rectangle, as data are always available there thanks to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. A data point is captured only if a greenish vertical bar appears and a new line of data appears in the R console.
3 - To extract incubation bouts, the first click in a new plot has to be the start of incubation, the next click marks the end of that incubation, and the following click on the same spot marks the start of incubation for the other sex. If the end of one bout and the start of the next are at different times, the data will still be extracted, but the sex, logger and bird_ID will be wrong; these need to be changed manually in the csv file. Similarly, the first bout for a given plot is always assigned to the male (if no data are present in the csv file) or based on previous data. Hence, whenever data from a new plot are extracted, it is worth checking at the first mouse click whether the sex, logger and bird_ID information is correct and, if not, adjusting it manually.
4 - When all information from one day (panel) has been extracted, right-click on the plot and choose "stop". This activates the following day (panel) for extraction.
5 - To end extraction before going through all the rectangles, press "escape".

Annotations of data files from turnstone_2009_Barrow_nest-t401_transmitter.RData

dfr - raw data on signal strength from the radio tags attached to the rumps of the female and male, plus information on when the birds were captured and the incubation stage of the nest
1. who: identifies whether the recording refers to the female, the male, a capture, or the start of hatching
2. datetime_: date and time of each recording
3. logger: unique identity of the radio tag
4. signal_: signal strength of the radio tag
5. sex: sex of the bird (f = female, m = male)
6. nest: unique identity of the nest
7. day: datetime_ variable truncated to year-month-day format
8. time: time of day in hours
9. datetime_utc: date and time of each recording, in UTC
10. cols: colors assigned to "who"

m - metadata for a given nest
1. sp: identifies the species (RUTU = Ruddy Turnstone)
2. nest: unique identity of the nest
3. year_: year of observation
4. IDfemale: unique identity of the female
5. IDmale: unique identity of the male
6. lat: latitude coordinate of the nest
7. lon: longitude coordinate of the nest
8. hatch_start: date and time when hatching of the eggs started
9. scinam: scientific name of the species
10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)
11. logger: type of device used to record incubation (IT = radio tag)
12. sampling: mean incubation sampling interval in seconds

s - metadata for the incubating parents
1. year_: year of capture
2. species: identifies the species (RUTU = Ruddy Turnstone)
3. author: identifies the author who measured the bird
4. nest: unique identity of the nest
5. caught_date_time: date and time when the bird was captured
6. recapture: was the bird captured before? (0 = no, 1 = yes)
7. sex: sex of the bird (f = female, m = male)
8. bird_ID: unique identity of the bird
9. logger: unique identity of the radio tag
License: Community Data License Agreement Permissive 1.0, https://cdla.io/permissive-1-0/
Case study: How does a bike-share navigate speedy success?
Scenario:
As data analysts on Cyclistic's marketing team, our focus is on growing annual memberships to drive the company's success. We aim to analyze the differing usage patterns of casual riders and annual members in order to craft a marketing strategy for converting casual riders into members. Our recommendations, supported by data insights and professional visualizations, await approval from Cyclistic's executives before we proceed.
About the company
In 2016, Cyclistic launched a bike-share program in Chicago, growing to 5,824 bikes and 692 stations. Initially, their marketing aimed at broad segments with flexible pricing plans attracting both casual riders (single-ride or full-day passes) and annual members. However, recognizing that annual members are more profitable, Cyclistic is shifting focus to convert casual riders into annual members. To achieve this, they plan to analyze historical bike trip data to understand the differences and preferences between the two user groups, aiming to tailor marketing strategies that encourage casual riders to purchase annual memberships.
Project Overview:
This capstone project is a culmination of the skills and knowledge acquired through the Google Professional Data Analytics Certification. It focuses on Track 1, which is centered around Cyclistic, a fictional bike-share company modeled to reflect real-world data analytics scenarios in the transportation and service industry.
Dataset Acknowledgment:
We are grateful to Motivate Inc. for providing the dataset that serves as the foundation of this capstone project. Their contribution has enabled us to apply practical data analytics techniques to a real-world dataset, mirroring the challenges and opportunities present in the bike-sharing sector.
Objective:
The primary goal of this project is to analyze the Cyclistic dataset to uncover actionable insights that could help the company optimize its operations, improve customer satisfaction, and increase its market share. Through comprehensive data exploration, cleaning, analysis, and visualization, we aim to identify patterns and trends that inform strategic business decisions.
Methodology:
Data Collection: Utilizing the dataset provided by Motivate Inc., which includes detailed information on bike usage, customer behavior, and operational metrics.
Data Cleaning and Preparation: Ensuring the dataset is accurate, complete, and ready for analysis by addressing any inconsistencies, missing values, or anomalies.
Data Analysis: Applying statistical methods and data analytics techniques to extract meaningful insights from the dataset.
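To make the analysis step concrete, a minimal R sketch is shown below. It assumes trip CSVs in the public Divvy-style layout with started_at, ended_at and member_casual columns; the file name and column names are assumptions for illustration, not part of the case-study brief.

# Minimal sketch of the analysis step (file and column names are assumptions)
library(dplyr)
library(lubridate)

trips <- read.csv("202301-divvy-tripdata.csv")   # hypothetical monthly trip export

trips %>%
  mutate(
    ride_length_min = as.numeric(difftime(ymd_hms(ended_at), ymd_hms(started_at), units = "mins")),
    day_of_week     = wday(ymd_hms(started_at), label = TRUE)
  ) %>%
  filter(ride_length_min > 0) %>%                # drop zero or negative durations
  group_by(member_casual, day_of_week) %>%
  summarise(rides = n(), mean_ride_min = mean(ride_length_min), .groups = "drop")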
Visualization and Reporting:
Creating intuitive and compelling visualizations to present the findings clearly and effectively, facilitating data-driven decision-making.
Findings and Recommendations:
Conclusion:
The Cyclistic Capstone Project not only demonstrates the practical application of data analytics skills in a real-world scenario but also provides valuable insights that can drive strategic improvements for Cyclistic. Through this project, we showcase the power of data analytics in transforming data into actionable knowledge, underscoring the importance of data-driven decision-making in today's competitive business landscape.
Acknowledgments:
Special thanks to Motivate Inc. for their support and for providing the dataset that made this project possible. Their contribution is immensely appreciated and has significantly enhanced the learning experience.
STRATEGIES USED
Case Study Roadmap - ASK
● What is the problem you are trying to solve? ● How can your insights drive business decisions?
Key Tasks ● Identify the business task ● Consider key stakeholders
Deliverable ● A clear statement of the business task
Case Study Roadmap - PREPARE
● Where is your data located? ● Are there any problems with the data?
Key tasks ● Download data and store it appropriately. ● Identify how it’s organized.
Deliverable ● A description of all data sources used
Case Study Roadmap - PROCESS
● What tools are you choosing and why? ● What steps have you taken to ensure that your data is clean?
Key tasks ● Choose your tools. ● Document the cleaning process.
Deliverable ● Documentation of any cleaning or manipulation of data
Case Study Roadmap - ANALYZE
● Has your data been properly formatted? ● How will these insights help answer your business questions?
Key tasks ● Perform calculations ● Format your data
Deliverable ● A summary of analysis
Case Study Roadmap - SHARE
● Were you able to answer all questions of stakeholders? ● Can Data visualization help you share findings?
Key tasks ● Present your findings ● Create effective data viz.
Deliverable ● Supporting viz and key findings
Case Study Roadmap - A...
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential for analyzing massive amounts of information and making data-driven decisions.
32 cheat sheets: These cover techniques and tricks for visualization from A to Z, Python and R visualization cheat sheets, types of charts and their significance, storytelling with data, and more.
32 charts: The corpus also includes information on a wide range of data visualization charts, along with their Python code, d3.js code, and presentations explaining each chart clearly.
Some recommended books on data visualization that every data scientist should read:
If you find any books, cheat sheets, or charts missing, or would like to suggest new documents, please let me know in the discussion section!
A kind request to Kaggle users: please create notebooks on different visualization charts, choosing a dataset of your own interest, as many beginners and experts alike could find them useful!
Try creating interactive EDA using animation combined with data visualization charts, to show how to tackle a dataset and extract insights from it.
Feel free to use the discussion platform of this dataset to ask questions about the data visualization corpus and data visualization techniques.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Distance measures are widely used for examining genetic structure in datasets that comprise many individuals scored for a very large number of attributes. Genotype datasets composed of single nucleotide polymorphisms (SNPs) typically contain bi-allelic scores for tens of thousands, if not hundreds of thousands, of loci. We examine the application of distance measures to SNP genotypes and sequence tag presence-absences (SilicoDArT) and use real and simulated data to illustrate pitfalls in the application of genetic distances and their visualization. The datasets used to illustrate points in the associated review are provided here together with the R script used to analyse the data. Data are either simulated internally to this script or are SNP data generated as part of other studies, included as compressed binary files readily accessible by reading into R with the base function readRDS(). Refer to the analysis script for examples.

Methods

A dataset was constructed from a SNP matrix generated for the freshwater turtles in the genus Emydura, a recent radiation of Chelidae in Australasia. The dataset (SNP_starting_data.Rdata) includes selected populations that vary in level of divergence, encompassing variation within species and between closely related species. Sampling localities with evidence of admixture between species were removed. Monomorphic loci were removed, and the data were filtered on call rate (>95%), repeatability (>99.5%) and read depth (5x < read depth < 50x). Where there was more than one SNP per sequence tag, only one was retained at random. The resultant dataset had 18,196 SNP loci scored for 381 individuals from 7 sampling localities or populations: Emydura victoriae [Ord River, NT, n=15], E. tanybaraga [Holroyd River, Qld, n=10], E. subglobosa worrelli [Daly River, NT, n=25], E. subglobosa subglobosa [Fly River, PNG, n=55], E. macquarii macquarii [Murray Darling Basin north, NSW/Qld, n=152], E. macquarii krefftii [Fitzroy River, Qld, n=39] and E. macquarii emmotti [Cooper Creek, Qld, n=85]. The missing data rate was 1.7%, subsequently imputed by nearest neighbour to yield a fully populated data matrix. The data are a subset of those published by Georges et al. (2018, Molecular Ecology 27:5195-5213), for illustrative purposes only. A companion SilicoDArT dataset (silicodart_starting_data.Rdata) is also included. The above manipulations were performed in the R package dartR. Principal Components Analysis was undertaken using the glPca function of the R adegenet package (as implemented in dartR). Principal Coordinates Analysis was undertaken using the pcoa function in the R package ape, as implemented in dartR.

To exemplify the effect of missing values on SNP visualisation using PCA, we simulated ten populations that reproduced over 200 non-overlapping generations. Simulated populations were placed in a linear series with low dispersal between adjacent populations (one disperser every ten generations). Each population had 100 individuals, of which 50 were sampled at random. Genotypes were generated for 1,000 neutral loci on one chromosome. We then randomly selected 50% of genotypes and set them as missing data. Principal Components Analysis was undertaken using the glPca function of the R adegenet package. The R script to implement this is provided (Supplementary_script_for_ms.R).
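As a pointer for getting started (a minimal sketch only, not a substitute for the supplied Supplementary_script_for_ms.R): loading one of the provided files with readRDS() and running a basic PCA might look as follows, assuming the stored object is a genlight object as adegenet and dartR expect.

# Minimal sketch: load a provided object and run a PCA (assumes a genlight object)
library(adegenet)                          # provides glPca() for genlight objects

gl <- readRDS("SNP_starting_data.Rdata")   # file name taken from the description above
pca <- glPca(gl, nf = 4)                   # retain the first four axes
scatter(pca)                               # quick look at the ordination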
The data for the Australian Blue Mountains skink Eulamprus leuraensis were generated for 372 individuals collected from 17 swamps isolated to varying degrees in the Blue Mountains region of New South Wales. Tail snips were collected and stored in 95% ethanol. The tissue samples were digested with proteinase K overnight and DNA was extracted using a NucleoMag 96 Tissue Kit (Macherey-Nagel, Düren, Germany) coupled with NucleoMag SEP (Ref. 744900) to allow automated separation of high-quality DNA on a Freedom Evo robotic liquid handler (TECAN, Männedorf, Switzerland). SNP data were generated by the commercial service of Diversity Arrays Technology Pty Ltd (Canberra, Australia) using published protocols. A total of 13,496 loci were scored, which reduced to 7,935 after filtering out secondary SNPs on the same sequence tag, filtering on reproducibility (threshold 0.99) and call rate (threshold 0.95), and removal of monomorphic loci. The resultant data (Eulamprus_filtered.Rdata) are used to demonstrate the impact of a substantial inversion on the outcomes of a PCA.

To test the effect of having closely related individuals (parents and offspring) on the PCoA pattern, we ran a simulation using dartR in which we picked two individuals to become the parents of 2-8 offspring. We ran a PCoA for each of the simulated cases. The R code used is included in the R script uploaded here. Refer to the companion manuscript for links to the literature associated with the above techniques.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The FragPipe computational proteomics platform is gaining widespread popularity among the proteomics research community because of its fast processing speed and user-friendly graphical interface. Although FragPipe produces well-formatted output tables that are ready for analysis, there is still a need for an easy-to-use and user-friendly downstream statistical analysis and visualization tool. FragPipe-Analyst addresses this need by providing an R shiny web server to assist FragPipe users in conducting downstream analyses of the resulting quantitative proteomics data. It supports major quantification workflows, including label-free quantification, tandem mass tags, and data-independent acquisition. FragPipe-Analyst offers a range of useful functionalities, such as various missing value imputation options, data quality control, unsupervised clustering, differential expression (DE) analysis using Limma, and gene ontology and pathway enrichment analysis using Enrichr. To support advanced analysis and customized visualizations, we also developed FragPipeAnalystR, an R package encompassing all FragPipe-Analyst functionalities that is extended to support site-specific analysis of post-translational modifications (PTMs). FragPipe-Analyst and FragPipeAnalystR are both open-source and freely available.
[Note 2023-08-14: Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086]

This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re

The .zip archive cropping-systems-1.0.zip contains the following data and code files.

Data
stem_soil_CFU_by_plant.csv: Soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with .
yield_CFU_by_plot.csv: Yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level.

Code
cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code.
equations.Rmd: RMarkdown notebook with formatted equations.
formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript.

The R project file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing the raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.
A comprehensive Quality Assurance (QA) and Quality Control (QC) statistical framework consists of three major phases: Phase 1, preliminary exploration of the raw datasets, including time formatting and combining datasets of different lengths and different time intervals; Phase 2, QA of the datasets, including detecting and flagging duplicates, outliers, and extreme values; and Phase 3, development of a time series of the desired frequency, imputation of missing values, visualization, and a final statistical summary. The time series data collected at the Billy Barr meteorological station (East River Watershed, Colorado) were analyzed. The developed statistical framework is suitable for both real-time and post-data-collection QA/QC analysis of meteorological datasets.

The files in this data package include one Excel file converted to CSV format (Billy_Barr_raw_qaqc.csv) that contains the raw meteorological data, i.e., the input data for the QA/QC analysis. The second CSV file (Billy_Barr_1hr.csv) contains the QA/QC-processed and flagged meteorological data, i.e., the output data from the QA/QC analysis. The last file (QAQC_Billy_Barr_2021-03-22.R) is a script written in R that implements the QA/QC and flagging process. The purpose of the CSV data files included in this package is to provide the input and output files used by the R script.
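The authors' actual flagging logic is in QAQC_Billy_Barr_2021-03-22.R; purely as an illustration of Phases 2 and 3 (flagging extreme values, then building an hourly series with interpolated gaps), a generic R sketch might look like the following. The column names and thresholds here are placeholders, not the station's real schema.

# Generic QA/QC sketch (column names and thresholds are placeholders)
raw <- read.csv("Billy_Barr_raw_qaqc.csv")               # raw input file named above
raw$datetime <- as.POSIXct(raw$timestamp, tz = "UTC")    # 'timestamp' column is hypothetical

# Phase 2: flag duplicates and extreme air-temperature values
raw <- raw[!duplicated(raw$datetime), ]
raw$flag_temp <- raw$air_temp < -40 | raw$air_temp > 40  # 'air_temp' and limits are placeholders

# Phase 3: hourly series with linear interpolation across flagged/missing values
hours <- seq(min(raw$datetime), max(raw$datetime), by = "hour")
hourly_temp <- approx(x = as.numeric(raw$datetime),
                      y = ifelse(raw$flag_temp, NA, raw$air_temp),
                      xout = as.numeric(hours), rule = 2)$y
summary(hourly_temp)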
Blood Transfusion Service Center Data Set: data taken from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. This is a classification problem.
Data Set Information: To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center passes its blood transfusion service bus to one university in Hsin-Chu City to gather donated blood about every three months. To build an RFMTC model, we selected 748 donors at random from the donor database. Each of these 748 donor records includes R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated in c.c.), T (Time: months since first donation), and a binary variable representing whether the donor gave blood in March 2007 (1 = donated blood; 0 = did not donate blood).
Attribute Information: Given are the variable name, variable type, measurement unit and a brief description. The "Blood Transfusion Service Center" data set is a classification problem. The order of this listing corresponds to the order of the numerals along the rows of the database: R (Recency: months since last donation), F (Frequency: total number of donations), M (Monetary: total blood donated in c.c.), T (Time: months since first donation), and a binary variable representing whether the donor gave blood in March 2007 (1 = donated blood; 0 = did not donate blood).
Table 1 shows the descriptive statistics of the data. We selected 500 records at random as the training set and the remaining 248 as the testing set.
Table 1. Descriptive statistics of the data
| Variable | Data Type | Measurement | Description | min | max | mean | std |
| Recency | quantitative | Months | Input | 0.03 | 74.4 | 9.74 | 8.07 |
| Frequency | quantitative | Times | Input | 1 | 50 | 5.51 | 5.84 |
| Monetary | quantitative | c.c. blood | Input | 250 | 12500 | 1378.68 | 1459.83 |
| Time | quantitative | Months | Input | 2.27 | 98.3 | 34.42 | 24.32 |
| Whether he/she donated blood in March 2007 | binary | 1=yes, 0=no | Output | 0 | 1 | 1 (24%) | 0 (76%) |
| Data Set Characteristics | Multivariate |
| Number of Instances | 748 |
| Area | Business |
| Attribute Characteristics | Real |
| Number of Attributes | 5 |
| Associated Tasks | Classification |
| Missing Values? | N/A |
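Since the task is framed as a classification problem, a minimal R baseline could be a logistic regression on the predictors above. The sketch below assumes the data have been saved as transfusion.csv and renames the columns for convenience; both the file name and the short column names are assumptions.

# Minimal classification baseline (file name and column names are assumptions)
donors <- read.csv("transfusion.csv")
names(donors) <- c("Recency", "Frequency", "Monetary", "Time", "Donated")

set.seed(1)
train_idx <- sample(nrow(donors), 500)        # 500 training rows, 248 test rows, as described above
train <- donors[train_idx, ]
test  <- donors[-train_idx, ]

# Monetary is a constant multiple of Frequency (250 c.c. per donation), so it adds no information
fit  <- glm(Donated ~ Recency + Frequency + Time, data = train, family = binomial)
pred <- predict(fit, newdata = test, type = "response")
mean((pred > 0.5) == test$Donated)            # simple hold-out accuracy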
Citation Request: Reuse of this database is unlimited with retention of the copyright notice for Prof. I-Cheng Yeh and the following published paper: Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, "Knowledge discovery on RFM model using Bernoulli sequence," Expert Systems with Applications, 2008 (doi:10.1016/j.eswa.2008.07.018).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source data and code supporting the manuscript "Beyond NUE: A focus on true nitrogen gains in cereals" based on maize, sorghum, and barley datasets.
README (README.txt). References and sources for the reviewed dataset.
Dataset S1 (rev_dataset.xlsx). Review data on NUE-related traits for maize, sorghum, and barley cultivars from different decades of commercial release, collected from the published literature. Missing values or missing study information are represented by “n/a” in the data. The traits and variables included in the data are described in a separate sheet within the XLSX file.
Dataset S2 (analysisR.pdf). R code for data processing, analysis, and visualization.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains wireless link quality estimation data for the FlockLab testbed [1,2]. The rationale and description of this dataset are given in the following abstract (the pdf is included in this repository; see below).
Dataset: Wireless Link Quality Estimation on FlockLab – and Beyond. Romain Jacob, Reto Da Forno, Roman Trüb, Andreas Biri, Lothar Thiele. DATA '19: Proceedings of the 2nd Workshop on Data Acquisition To Analysis, 2019.
Data collection scenario
The data collection scenario is simple. Each FlockLab node is assigned one dedicated time slot. In this slot, the node sends 100 packets, called strobes. All strobes have the same payload size and use a given radio frequency channel and transmit power. All other nodes listen for the strobes and log packet reception events (i.e., success or failure).
The test scenario is run every two hours on two different platforms: the TelosB [3] and DPP-cc430 [4]. We used all nodes available at test time (between 27 and 29 nodes).
Final dataset status
3 months of data with about 12 tests per day per platform
5 months of data with about 4 tests per day per platform
Data collection firmware
We are happy to share the link quality data we collected for the FlockLab testbed, but we also want to make it easier for others to collect similar datasets for other wireless networks. To this end, we include in this repository the data collection firmware we designed. Data collection scheduling and control are done entirely in software, in order to make the firmware usable in a large variety of wireless networks. We implemented our data collection software using Baloo [5], a flexible network stack design framework based on Synchronous Transmission. Baloo efficiently handles network time synchronization and offers a flexible interface to schedule communication rounds. The firmware source code is available in the Baloo repository [6].
A set of experiment parameters can be patched directly in the firmware, which lets the user tune the data collection without having to recompile the source code. This improves usability and facilitates automation. An example patching script is included in this repository. Currently, the following parameters can be patched:
rf_channel,
payload,
host_id, and
rand_seed
Currently supported platforms
TelosB [3]
DPP-cc430 [4]
Repository versions
v1.4.1 Updated visualizations in the notebook
v1.4.0 Addition of data from November 2019 to March 2020. Data collection is discontinued (the new FlockLab testbed is being set up).
v1.3.1 Update abstract and notebook
v1.3.0 Addition of October 2019 data. The frequency of tests has been reduced to 4 per day, executing at (approximately) 1:00, 7:00, 13:00, and 19:00. From October 28 onward, time shifted by one hour (2:00, 8:00, 14:00, 20:00).
v1.2.0 Addition of September 2019 data. Many missing tests on the 12, 13, 19, and 20 of September (due to construction works in the building).
v1.1.4 Update of the abstract to have hyperlinks to the plots. Corrected typos.
v1.1.0 Add the data collected in August 2019. Data collection was disturbed at the beginning of the month and resumed normally on 13 August; data from the preceding days are incomplete.
v1.0.0 Initial version. Contains the data collected in July 2019, from the 10th to the 30th of July. No data were collected on the 31st of July (technical issue).
List of files
yyyy-mm_raw_platform.zip Archive containing all FlockLab test result files (one .zip file per month and per platform).
yyyy-mm_preprocessed_all.zip Archive containing preprocessed csv files, one per month and per platform.
firmware.zip Archive containing the firmware for all supported platforms.
firmware_patch.sh Example bash script illustrating the firmware patching.
parse_flocklab_results.ipynb [open in nbviewer] Jupyter notebook used to create the preprocessed data files. Also includes some examples of data visualization.
parse_flocklab_results.html HTML rendering of the notebook (static).
plots.zip Archive containing high resolution visualization of the dataset, generated by the parse_flocklab_results notebook, and presented in the abstract.
abstract.pdf A 3 page abstract presenting the dataset.
CRediT.pdf The list of contributions from the authors.
References
[1] R. Lim, F. Ferrari, M. Zimmerling, C. Walser, P. Sommer, and J. Beutel, “FlockLab: A Testbed for Distributed, Synchronized Tracing and Profiling of Wireless Embedded Systems,” in Proceedings of the 12th International Conference on Information Processing in Sensor Networks, New York, NY, USA, 2013, pp. 153–166.
[2] “FlockLab,” GitLab. [Online]. Available: https://gitlab.ethz.ch/tec/public/flocklab/wikis/home. [Accessed: 24-Jul-2019].
[3] Advanticsys, “MTM-CM5000-MSP 802.15.4 TelosB mote Module.” [Online]. Available: https://www.advanticsys.com/shop/mtmcm5000msp-p-14.html. [Accessed: 21-Sep-2018].
[4] Texas Instruments, “CC430F6137 16-Bit Ultra-Low-Power MCU.” [Online]. Available: http://www.ti.com/product/CC430F6137. [Accessed: 21-Sep-2018].
[5] R. Jacob, J. Bächli, R. Da Forno, and L. Thiele, “Synchronous Transmissions Made Easy: Design Your Network Stack with Baloo,” in Proceedings of the 2019 International Conference on Embedded Wireless Systems and Networks, 2019.
[6] “Baloo,” Dec-2018. [Online]. Available: http://www.romainjacob.net/research/baloo/.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
📘 Description
The Student Academic Performance Dataset contains detailed academic and lifestyle information of 250 students, created to analyze how various factors — such as study hours, sleep, attendance, stress, and social media usage — influence their overall academic outcomes and GPA.
This dataset is synthetic but realistic, carefully generated to reflect believable academic patterns and relationships. It’s perfect for learning data analysis, statistics, and visualization using Excel, Python, or R.
The data includes 12 attributes, primarily numerical, ensuring that it’s suitable for a wide range of analytical tasks — from basic descriptive statistics (mean, median, SD) to correlation and regression analysis.
📊 Key Features
🧮 250 rows and 12 columns
💡 Mostly numerical — great for Excel-based statistical functions
🔍 No missing values — ready for direct use
📈 Balanced and realistic — ideal for clear visualizations and trend analysis
🎯 Suitable for:
Descriptive statistics
Correlation & regression
Data visualization projects
Dashboard creation (Excel, Tableau, Power BI)
💡 Possible Insights to Explore
How do study hours impact GPA?
Is there a relationship between stress levels and performance?
Does social media usage reduce study efficiency?
Do students with higher attendance achieve better grades?
⚙️ Data Generation Details
Each record represents a unique student.
GPA is calculated using a weighted formula based on midterm and final scores.
Relationships are designed to be realistic — for example:
Higher study hours → higher scores and GPA
Higher stress → slightly lower sleep hours
Excessive social media time → reduced academic performance
⚠️ Disclaimer
This dataset is synthetically generated using statistical modeling techniques and does not contain any real student data. It is intended purely for educational, analytical, and research purposes.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data & Code from Farrell et al., "Predicting missing links in global host-parasite networks".

Scripts
Within the scripts folder are scripts to process the raw data and model results:
1. Download, clean, and merge the host-parasite interaction databases with the mammal supertree (process_raw_data.R)
2. Re-create the raw data plots from the manuscript (raw_data_plots.R)
3. Plot posterior interaction matrices and scaled trees, and pull out the top predicted links (model_summaries.R)
4. Re-create the diagnostic plots from the manuscript (diagnostic_plots.R)
5. Functions for data manipulation and visualization that are sourced by the other scripts (network_analysis.R)
6. Investigate bias propagation via node degree product (bias_investigation.R)
7. Generate risk maps (risk_maps.R)

Data
- raw_data: includes the data necessary to amalgamate the host-parasite interaction databases (via the script process_raw_data.R).
- clean_data: includes the full host-parasite interaction list 'hp_list' in both .csv and .rds formats, the binary interaction matrices for the full dataset and for subsets by parasite type (virus, bacteria, fungi, etc.), and the model diagnostics ('model_diagnostics.csv') used in diagnostic_plots.R.
- model_results: contains one .rds file per model, holding the output interaction matrix from each simulation ('P'), the table of model diagnostics ('TB'), and the phylogeny scaling parameter ('Eta'), where applicable. Note that to save space the full cross-fold fit posteriors are omitted (these total ~4.5 GB); please contact MF if these are required.
- literature_results: contains a .csv version of the results of the literature search outlined in the Supplementary Information.
- plots_tables: contains .csv files with the top 100 'missing' links for each model, and a .csv with the top 1000 links from the full model run on the full dataset.
Project Documentation: Predicting the S&P 500 Price

Problem Statement: The goal of this project is to develop a machine learning model that can predict the future price of the S&P 500 index based on historical data and relevant features. By accurately predicting price movements, we aim to assist investors and financial professionals in making informed decisions and managing their portfolios effectively.

Dataset Description: The dataset used for this project contains historical data of the S&P 500 index, along with several other features such as dividends, earnings, the consumer price index (CPI), interest rates, and more. The dataset spans a certain time period and includes daily values of these variables.

Steps Taken:
1. Data Preparation and Exploration: Loaded the dataset and performed initial exploration. Checked for missing values and handled them if any. Explored the statistical summary and distributions of the variables. Conducted correlation analysis to identify potential features for prediction.
2. Data Visualization and Analysis: Plotted time series graphs to visualize the S&P 500 index and other variables over time. Examined the trends, seasonality, and residual behavior of the time series using decomposition techniques. Analyzed the relationships between the S&P 500 index and other features using scatter plots and correlation matrices.
3. Feature Engineering and Selection: Selected relevant features based on correlation analysis and domain knowledge. Explored feature importance using tree-based models and selected informative features. Prepared the final feature set for model training.
4. Model Training and Evaluation: Split the dataset into training and testing sets. Selected a regression model (Linear Regression) for price prediction. Trained the model using the training set. Evaluated the model's performance using mean squared error (MSE) and R-squared (R^2) metrics on both training and testing sets.
5. Prediction and Interpretation: Obtained predictions for future S&P 500 prices using the trained model. Interpreted the predicted prices in the context of current market conditions and the percentage change from the current price.

Limitations and Future Improvements:
- The predictive performance of the model is based on the available features and historical data, and it may not capture all the complexities and factors influencing the S&P 500 index.
- The model's accuracy and reliability are subject to the quality and representativeness of the training data.
- The model assumes that the historical patterns and relationships observed in the data will continue in the future, which may not always hold true.
- Future improvements could include incorporating additional relevant features, exploring different regression algorithms, and considering more sophisticated techniques such as time series forecasting models.
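The modelling step described above (linear regression evaluated with MSE and R-squared) can be sketched in R roughly as follows; the file name and column names are placeholders, not the project's actual schema.

# Illustrative sketch of the training and evaluation step (file and column names are placeholders)
sp <- read.csv("sp500_monthly.csv")           # assumed columns: Price, Dividend, Earnings, CPI, Rate

set.seed(123)
n <- nrow(sp)
train_idx <- sample(n, floor(0.8 * n))
train <- sp[train_idx, ]
test  <- sp[-train_idx, ]

fit  <- lm(Price ~ Dividend + Earnings + CPI + Rate, data = train)
pred <- predict(fit, newdata = test)

mse <- mean((test$Price - pred)^2)
r2  <- 1 - sum((test$Price - pred)^2) / sum((test$Price - mean(test$Price))^2)
c(MSE = mse, R2 = r2)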
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.
This dataset was generated using R.
# Set seed for reproducibility
set.seed(42)
# Define number of observations (students)
n <- 5000
# Generate study hours (independent variable)
# Uniform distribution between 0 and 12 hours
study_hours <- runif(n, min = 0, max = 12)
# Create relationship between study hours and grade
# Base grade: 40 points
# Each study hour adds an average of 5 points
# Add normal noise (standard deviation = 10)
theoretical_grade <- 40 + 5 * study_hours
# Add normal noise to make it realistic
noise <- rnorm(n, mean = 0, sd = 10)
# Calculate final grade
grade <- theoretical_grade + noise
# Limit grades between 0 and 100
grade <- pmin(pmax(grade, 0), 100)
# Create the dataframe
dataset <- data.frame(
student_id = 1:n,
study_hours = round(study_hours, 2),
grade = round(grade, 2)
)
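As a quick sanity check on the generated data, a simple linear fit should recover an intercept near 40 and a slope near 5 (slightly attenuated by the 0-100 clipping); the CSV file name below is just a suggestion.

# Optional: save the data and verify the simulated relationship
write.csv(dataset, "study_hours_grades.csv", row.names = FALSE)   # suggested file name
fit <- lm(grade ~ study_hours, data = dataset)
coef(fit)   # intercept should be near 40, slope near 5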
Market basket analysis with the Apriori algorithm
The retailer wants to target customers with suggestions for item sets that they are most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all transactions that took place over a period of time. The retailer will use the results to grow its business and to make item-set suggestions to customers, so that we can increase customer engagement, improve the customer experience, and identify customer behaviour. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most useful when you want to find associations between different objects in a set, or frequent patterns in a transaction database. It can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.8; lift = confidence / P(mouse mat) = 0.8/0.09 = 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Number of Attributes: 7
First, we need to load the required libraries.
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
Next we clean our data frame and remove missing values.
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
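For the rule-mining step itself, a minimal sketch with the arules package might look like the following; the baskets here are invented toy data, not the retailer's actual invoices.

# Minimal Apriori sketch on toy transactions (not the retailer's actual data)
library(arules)

baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("milk", "butter"),
  c("bread", "butter")
)
trans <- as(baskets, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.25, conf = 0.6, minlen = 2))
inspect(sort(rules, by = "lift"))             # strongest associations first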
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
📦 Dataset Description This dataset supports the Online Retail Customer Segmentation Project, which analyzes one year of transaction records from a UK-based online gift store.
The goal is to identify customer segments using RFM (Recency, Frequency, Monetary) modeling and KMeans clustering, and to explore customer value and behavior through visualization dashboards.
📁 Included Files:
retail_cleaned.csv: Cleaned transaction-level data (negative quantities and missing IDs removed)
retail_segmented.csv: Main analysis table with RFM-based Segment labels merged in
customer_summary copy.csv: Customer-level summary: total orders, total spent, first/last purchase dates
monthly_sales copy.csv: Aggregated monthly sales data for time trend analysis
Online Retail Analysis.pdf: Full project report (data process + dashboard screenshots + insights)

🔧 Preprocessing Summary: Removed records with missing CustomerID, negative Quantity, or invalid UnitPrice
Created TotalPrice = Quantity × UnitPrice
Generated customer metrics in SQL and calculated RFM values in R
Performed KMeans clustering to create customer segments (Segment 1–4)
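The RFM and clustering steps can be sketched in R roughly as below, using the cleaned transaction table; CustomerID and TotalPrice are named above, while the InvoiceDate column name is an assumption.

# Rough sketch of the RFM + KMeans steps (InvoiceDate column name is an assumption)
library(dplyr)

retail <- read.csv("retail_cleaned.csv")
retail$InvoiceDate <- as.Date(retail$InvoiceDate)
snapshot <- max(retail$InvoiceDate) + 1                   # day after the last recorded purchase

rfm <- retail %>%
  group_by(CustomerID) %>%
  summarise(
    Recency   = as.numeric(snapshot - max(InvoiceDate)),  # days since last purchase
    Frequency = n(),                                      # number of transaction rows
    Monetary  = sum(TotalPrice)                           # total spend
  )

set.seed(42)
km <- kmeans(scale(rfm[, c("Recency", "Frequency", "Monetary")]), centers = 4)
rfm$Segment <- km$cluster
table(rfm$Segment)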
📊 Applications: Customer segmentation for loyalty/retention campaigns
Sales trend and seasonal pattern analysis
High-value customer targeting
Geographical revenue mapping
This dataset simulates Jet2 airline passenger bookings and is designed for segmentation, clustering, and behavioral analysis.
The Jet2 Synthetic Booking dataset provides a realistic simulation of passenger booking behavior for Jet2, a UK-based leisure airline. It is ideal for data science projects involving customer segmentation, predictive modeling, and operational insights.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Dataset contains healthcare statistics and categorical information about patients who have been diagnosed with AIDS. This dataset was initially published in 1996.
https://classic.clinicaltrials.gov/ct2/show/NCT00000625
https://archive.ics.uci.edu/dataset/890/aids+clinical+trials+group+study+175
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By data.world's Admin [source]
This dataset offers a unique insight into the coverage of social insurance programs for the wealthiest quintile of populations around the world. It reveals how many individuals in each country receive support from old-age contributory pensions, disability benefits, and social security and health insurance benefits such as occupational injury benefits, paid sick leave, maternity leave, and more. This data provides an invaluable resource for understanding the health and well-being of the most financially privileged in society, a group that often has greater impact on decision making than others. With up-to-date figures as of 2019-05-11, this dataset is invaluable for uncovering where work remains to be done to improve healthcare provision in each country across the world.
For more datasets, click here.
Understand the context: Before you begin analyzing this dataset, it is important to understand the information that it provides. Take some time to read the description of what is included in the dataset, including a clear understanding of the definitions and scope of coverage provided with each data point.
Examine the data: Once you have a general understanding of this dataset's contents, take some time to explore its contents in more depth. What specific questions does this dataset help answer? What kind of insights does it provide? Are there any missing pieces?
Clean & Prepare Data: After you've preliminarily examined its content, start preparing your data for further analysis and visualization. Clean up any formatting issues or irregularities in your data set by correcting typos and eliminating unnecessary rows or columns before working with your chosen programming language (I prefer R for data manipulation tasks). Additionally, consider performing necessary transformations such as sorting or averaging values if appropriate for the findings you wish to draw from your analysis.
Visualize Results: Once you've cleaned and prepared your data, use visualizations such as charts, graphs or tables to reveal patterns that support specific conclusions about how insurance coverage under social programs varies among different groups within society's quintiles (based on age groups etc.). This type of visualization allows those who aren't familiar with programming to process complex information more quickly and accurately than when it is displayed only in tabular form!
Final Analysis & Export Results: Finally, export your visuals into presentation-ready formats (e.g., PDFs) which can be shared with colleagues! Additionally, use these results as part of a narrative conclusion report providing an accurate assessment and meaningful interpretation of how social insurance programs vary between different members of society's quintiles (i.e., richest vs poorest), along with potential policy implications relevant for implementing effective strategies that improve access accordingly. A short R example of the cleaning-and-visualizing workflow is sketched below.
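As an example of the cleaning-and-visualizing workflow sketched in the steps above, a short R snippet might look like the following; the column names (country, coverage_pct) are placeholders, since the file's actual headers are not listed here.

# Illustrative only; column names are placeholders, not the file's actual headers
library(ggplot2)

cov <- read.csv("coverage-of-social-insurance-programs-in-richest-quintile-of-population-1.csv")
cov_clean <- subset(cov, !is.na(coverage_pct))            # drop rows with missing coverage

ggplot(cov_clean, aes(x = reorder(country, coverage_pct), y = coverage_pct)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Coverage in richest quintile (%)")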
- Analyzing the effectiveness of social insurance programs by comparing the coverage levels across different geographic areas or socio-economic groups;
- Estimating the economic impact of social insurance programs on local and national economies by tracking spending levels and revenues generated;
- Identifying potential problems with access to social insurance benefits, such as racial or gender disparities in benefit coverage
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: coverage-of-social-insurance-programs-in-richest-quintile-of-population-1.csv
If you use this dataset in your research, please credit data.world's Admin.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Project Description: Analysis of Restaurant Preferences and Ordering Trends on Zomato In this project, we explore and analyze various aspects of customer behavior and restaurant performance using Zomato's data. Our goal is to derive actionable insights that can help enhance customer experience and optimize restaurant offerings.
Objectives:

Restaurant Popularity Analysis
Identify Popular Restaurant Types: Determine which types of restaurants receive the most votes from customers. This will help us understand which categories are most favored and could guide marketing strategies.

Vote Distribution by Restaurant Type
Quantify Votes for Each Type: Calculate the total number of votes each type of restaurant has received. This will provide a clear picture of customer preferences across different restaurant categories.

Rating Trends
Analyze Rating Distribution: Examine the ratings that the majority of restaurants have received. This will help identify the overall satisfaction level of customers and the general quality of dining experiences.

Couple Spending Patterns
Average Spending Analysis: Analyze the average spending per order for couples who frequently order online. This insight will assist in understanding spending behaviors and potential revenue generation from this demographic.

Mode of Ordering Performance
Evaluate Ratings by Ordering Mode: Compare the ratings received by online versus offline orders to determine which mode is preferred and delivers higher customer satisfaction.

Offline Ordering Trends
Identify High-Order Restaurant Types: Find out which types of restaurants receive more offline orders. This information can be used to tailor promotions and offers for specific restaurant categories, enhancing customer engagement.

Methodology:

Data Collection: Utilize Zomato’s API or available datasets to gather comprehensive data on restaurant types, votes, ratings, and ordering modes.

Data Cleaning and Preparation: Clean the dataset to handle missing values, standardize categories, and ensure data accuracy.

Data Analysis: Employ statistical and data visualization tools to aggregate votes, analyze ratings, and explore spending patterns. Use tools like Python (Pandas, Matplotlib, Seaborn), R, or Excel for data processing and visualization.

Insights and Recommendations: Generate insights based on the analysis and provide actionable recommendations for restaurant marketing strategies and customer engagement. This project aims to provide a detailed understanding of customer preferences and behaviors, enabling Zomato to make data-driven decisions to improve user experience and offer targeted promotions.