21 datasets found
  1. Example of how to manually extract incubation bouts from interactive plots...

    • figshare.com
    txt
    Updated Jan 22, 2016
    Cite
    Martin Bulla (2016). Example of how to manually extract incubation bouts from interactive plots of raw data - R-CODE and DATA [Dataset]. http://doi.org/10.6084/m9.figshare.2066784.v1
    Explore at:
    txt
    Dataset updated
    Jan 22, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Martin Bulla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General information: The script runs with R (version 3.1.1; 2014-07-10) and the packages plyr (1.8.1), XLConnect (0.2-9), utilsMPIO (0.0.25), sp (1.0-15), rgdal (0.8-16), tools (3.1.1) and lattice (0.20-29). Questions can be directed to Martin Bulla (bulla.mar@gmail.com). Data are available as an Rdata file; missing values are NA. For better readability, the subsections of the script can be collapsed.

    Data collection and the derivation of the individual variables are described in:
    • Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): 20131016.
    • Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015.

    Description of the method:

    1 - Data are visualized in an interactive actogram with time of day on the x-axis and one panel for each day of data.
    2 - A red rectangle indicates the active field. Clicking on the depicted light signal within that field generates a data point that is automatically saved to the csv file (via a custom-made function). For this extraction I recommend always clicking on the bottom line of the red rectangle, where data are always available thanks to a dummy variable ("lin") that creates continuous data along the bottom of the active panel. A click is captured only if a greenish vertical bar appears and a new line of data appears in the R console.
    3 - To extract incubation bouts, the first click in a new plot must mark the start of incubation, the next click marks the end of that bout, and a click on the same spot marks the start of incubation by the other sex. If the end and start of incubation are at different times, the data are still extracted, but the sex, logger and bird_ID will be wrong; these need to be corrected manually in the csv file. Similarly, the first bout in a given plot is always assigned to the male (if no data are present in the csv file) or is based on the previous data. Hence, whenever data from a new plot are extracted, it is worth checking at the first mouse click whether the sex, logger and bird_ID information is correct, and adjusting it manually if not.
    4 - Once all information from one day (panel) is extracted, right-click on the plot and choose "stop". This activates the following day (panel) for extraction.
    5 - To end the extraction before going through all the rectangles, press "escape".

    Annotations of the data files from turnstone_2009_Barrow_nest-t401_transmitter.RData:

    dfr: raw data on signal strength from the radio tags attached to the rumps of the female and male, plus information on when the birds were captured and the incubation stage of the nest.
    1. who: whether the recording refers to female, male, capture or start of hatching
    2. datetime_: date and time of each recording
    3. logger: unique identity of the radio tag
    4. signal_: signal strength of the radio tag
    5. sex: sex of the bird (f = female, m = male)
    6. nest: unique identity of the nest
    7. day: datetime_ truncated to year-month-day format
    8. time: time of day in hours
    9. datetime_utc: date and time of each recording, in UTC
    10. cols: colors assigned to "who"

    m: metadata for a given nest.
    1. sp: species (RUTU = Ruddy Turnstone)
    2. nest: unique identity of the nest
    3. year_: year of observation
    4. IDfemale: unique identity of the female
    5. IDmale: unique identity of the male
    6. lat: latitude of the nest
    7. lon: longitude of the nest
    8. hatch_start: date and time when hatching of the eggs started
    9. scinam: scientific name of the species
    10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)
    11. logger: type of device used to record incubation (IT = radio tag)
    12. sampling: mean incubation sampling interval in seconds

    s: metadata for the incubating parents.
    1. year_: year of capture
    2. species: species (RUTU = Ruddy Turnstone)
    3. author: the author who measured the bird
    4. nest: unique identity of the nest
    5. caught_date_time: date and time when the bird was captured
    6. recapture: was the bird captured before? (0 = no, 1 = yes)
    7. sex: sex of the bird (f = female, m = male)
    8. bird_ID: unique identity of the bird
    9. logger: unique identity of the radio tag
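
    As a minimal, hypothetical sketch of the click-capture idea in base R (this is not the authors' utilsMPIO implementation; the function and file names below are made up for illustration):

    extract_clicks <- function(day_data, out_csv = "bouts.csv") {
     # Plot one day of the signal trace, as in one actogram panel.
     plot(day_data$time, day_data$signal_, type = "l",
        xlab = "Time of day [h]", ylab = "Signal")
     # locator() records left-clicks until right-click/Esc, like the interactive step.
     clicks <- locator(type = "p", col = "green")
     if (is.null(clicks) || length(clicks$x) == 0) return(invisible(NULL))
     # Append the clicked points to the csv, writing the header only once.
     write.table(data.frame(time_clicked = clicks$x, signal = clicks$y),
           out_csv, sep = ",", row.names = FALSE,
           col.names = !file.exists(out_csv), append = file.exists(out_csv))
    }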

  2. Data Insight: Google Analytics Capstone Project

    • kaggle.com
    zip
    Updated Mar 2, 2024
    Cite
    sinderpreet (2024). Data Insight: Google Analytics Capstone Project [Dataset]. https://www.kaggle.com/datasets/sinderpreet/datainsight-google-analytics-capstone-project
    Explore at:
    zip (215409585 bytes)
    Dataset updated
    Mar 2, 2024
    Authors
    sinderpreet
    License

    CDLA Permissive 1.0: https://cdla.io/permissive-1-0/

    Description

    Case study: How does a bike-share navigate speedy success?

    Scenario:

    As data analysts on Cyclistic's marketing team, our focus is on enhancing annual memberships to drive the company's success. We aim to analyze the differing usage patterns of casual riders and annual members in order to craft a marketing strategy for converting casual riders into members. Our recommendations, supported by data insights and professional visualizations, await the approval of Cyclistic's executives.

    About the company

    In 2016, Cyclistic launched a bike-share program in Chicago, growing to 5,824 bikes and 692 stations. Initially, their marketing aimed at broad segments with flexible pricing plans attracting both casual riders (single-ride or full-day passes) and annual members. However, recognizing that annual members are more profitable, Cyclistic is shifting focus to convert casual riders into annual members. To achieve this, they plan to analyze historical bike trip data to understand the differences and preferences between the two user groups, aiming to tailor marketing strategies that encourage casual riders to purchase annual memberships.

    Project Overview:

    This capstone project is a culmination of the skills and knowledge acquired through the Google Professional Data Analytics Certification. It focuses on Track 1, which is centered around Cyclistic, a fictional bike-share company modeled to reflect real-world data analytics scenarios in the transportation and service industry.

    Dataset Acknowledgment:

    We are grateful to Motivate Inc. for providing the dataset that serves as the foundation of this capstone project. Their contribution has enabled us to apply practical data analytics techniques to a real-world dataset, mirroring the challenges and opportunities present in the bike-sharing sector.

    Objective:

    The primary goal of this project is to analyze the Cyclistic dataset to uncover actionable insights that could help the company optimize its operations, improve customer satisfaction, and increase its market share. Through comprehensive data exploration, cleaning, analysis, and visualization, we aim to identify patterns and trends that inform strategic business decisions.

    Methodology:

    ● Data Collection: Utilizing the dataset provided by Motivate Inc., which includes detailed information on bike usage, customer behavior, and operational metrics.
    ● Data Cleaning and Preparation: Ensuring the dataset is accurate, complete, and ready for analysis by addressing any inconsistencies, missing values, or anomalies.
    ● Data Analysis: Applying statistical methods and data analytics techniques to extract meaningful insights from the dataset (see the sketch below).
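
    The sketch below illustrates this pipeline in R. It is a hypothetical example: the column names (started_at, ended_at, member_casual) follow the public Divvy-style trip files, and the file name is made up.

    library(dplyr)
    library(lubridate)

    trips <- read.csv("divvy_trips.csv")  # assumed file name

    trips %>%
     mutate(started = ymd_hms(started_at),
         ride_min = as.numeric(difftime(ymd_hms(ended_at), started, units = "mins")),
         weekday = wday(started, label = TRUE)) %>%
     filter(ride_min > 0) %>%              # drop zero/negative-length rides
     group_by(member_casual, weekday) %>%      # casual vs member, by weekday
     summarise(rides = n(), mean_ride_min = mean(ride_min), .groups = "drop")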

    Visualization and Reporting:

    Creating intuitive and compelling visualizations to present the findings clearly and effectively, facilitating data-driven decision-making.

    Findings and Recommendations:

    Conclusion:

    The Cyclistic Capstone Project not only demonstrates the practical application of data analytics skills in a real-world scenario but also provides valuable insights that can drive strategic improvements for Cyclistic. The project showcases the power of data analytics in transforming data into actionable knowledge, underscoring the importance of data-driven decision-making in today's competitive business landscape.

    Acknowledgments:

    Special thanks to Motivate Inc. for their support and for providing the dataset that made this project possible. Their contribution is immensely appreciated and has significantly enhanced the learning experience.

    STRATEGIES USED

    Case Study Roadmap - ASK

    ● What is the problem you are trying to solve? ● How can your insights drive business decisions?

    Key Tasks ● Identify the business task ● Consider key stakeholders

    Deliverable ● A clear statement of the business task

    Case Study Roadmap - PREPARE

    ● Where is your data located? ● Are there any problems with the data?

    Key tasks ● Download data and store it appropriately. ● Identify how it’s organized.

    Deliverable ● A description of all data sources used

    Case Study Roadmap - PROCESS

    ● What tools are you choosing and why? ● What steps have you taken to ensure that your data is clean?

    Key tasks ● Choose your tools. ● Document the cleaning process.

    Deliverable ● Documentation of any cleaning or manipulation of data

    Case Study Roadmap - ANALYZE

    ● Has your data been properly formatted? ● How will these insights help answer your business questions?

    Key tasks ● Perform calculations ● Formatting

    Deliverable ● A summary of analysis

    Case Study Roadmap - SHARE

    ● Were you able to answer all questions of stakeholders? ● Can Data visualization help you share findings?

    Key tasks ● Present your findings ● Create effective data viz.

    Deliverable ● Supporting viz and key findings

    Case Study Roadmap - A...

  3. Data Visualization Cheat sheets and Resources

    • kaggle.com
    zip
    Updated May 31, 2022
    Cite
    Kash (2022). Data Visualization Cheat sheets and Resources [Dataset]. https://www.kaggle.com/kaushiksuresh147/data-visualization-cheat-cheats-and-resources
    Explore at:
    zip (133638507 bytes)
    Dataset updated
    May 31, 2022
    Authors
    Kash
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Data Visualization Corpus


    Data Visualization

    Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

    In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions

    The Data Visualization Corpus

    The Data Visualization corpus consists of:

    • 32 cheat sheets: an A-Z of the techniques and tricks that can be used for visualization, Python and R visualization cheat sheets, types of charts and their significance, storytelling with data, etc.

    • 32 charts: the corpus also contains information on a significant number of data visualization charts, along with their Python code, d3.js code, and presentations relating to the respective charts, each explained in a clear manner!

    • Some recommended books on data visualization that every data scientist should read:

      1. Beautiful Visualization by Julie Steele and Noah Iliinsky
      2. Information Dashboard Design by Stephen Few
      3. Knowledge is beautiful by David McCandless (Short abstract)
      4. The Functional Art: An Introduction to Information Graphics and Visualization by Alberto Cairo
      5. The Visual Display of Quantitative Information by Edward R. Tufte
      6. storytelling with data: a data visualization guide for business professionals by Cole Nussbaumer Knaflic
      7. Research paper - Cheat Sheets for Data Visualization Techniques by Zezhong Wang, Lovisa Sundin, Dave Murray-Rust, Benjamin Bach

    Suggestions:

    If you find any books, cheat sheets, or charts missing, or would like to suggest new documents, please let me know in the discussion section!

    Resources:

    Request to kaggle users:

    • A kind request to Kaggle users: create notebooks on different visualization charts, choosing a dataset of your own interest, as many beginners and experts could find them useful!

    • Create interactive EDA notebooks that combine animation with a variety of visualization charts, to give an idea of how to tackle data and extract insights from it.

    Suggestion and queries:

    Feel free to use the discussion platform of this data set to ask questions or any queries related to the data visualization corpus and data visualization techniques

    Kindly upvote the dataset if you find it useful or if you wish to appreciate the effort taken to gather this corpus! Thank you and have a great day!

  4. Data from: Distances and their visualization in studies of spatial-temporal...

    • data.niaid.nih.gov
    • dataone.org
    • +1more
    zip
    Updated Jan 15, 2024
    Cite
    Arthur Georges (2024). Distances and their visualization in studies of spatial-temporal genetic variation using single nucleotide polymorphisms (SNPs) [Dataset]. http://doi.org/10.5061/dryad.4b8gthtkn
    Explore at:
    zip
    Dataset updated
    Jan 15, 2024
    Dataset provided by
    University of Canberra
    Authors
    Arthur Georges
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Distance measures are widely used for examining genetic structure in datasets that comprise many individuals scored for a very large number of attributes. Genotype datasets composed of single nucleotide polymorphisms (SNPs) typically contain bi-allelic scores for tens of thousands if not hundreds of thousands of loci. We examine the application of distance measures to SNP genotypes and sequence tag presence-absences (SilicoDArT) and use real and simulated data to illustrate pitfalls in the application of genetic distances and their visualization. The datasets used to illustrate points in the associated review are provided here together with the R script used to analyse the data. Data are either simulated internally to this script or are SNP data generated as part of other studies, included as compressed binary files readily accessible by reading them into R with the base function readRDS(). Refer to the analysis script for examples.

    Methods: A dataset was constructed from a SNP matrix generated for the freshwater turtles in the genus Emydura, a recent radiation of Chelidae in Australasia. The dataset (SNP_starting_data.Rdata) includes selected populations that vary in level of divergence, encompassing variation within species and between closely related species. Sampling localities with evidence of admixture between species were removed. Monomorphic loci were removed, and the data were filtered on call rate (>95%), repeatability (>99.5%) and read depth (5x < read depth < 50x). Where there was more than one SNP per sequence tag, only one was retained at random. The resultant dataset had 18,196 SNP loci scored for 381 individuals from 7 sampling localities or populations: Emydura victoriae [Ord River, NT, n=15], E. tanybaraga [Holroyd River, Qld, n=10], E. subglobosa worrelli [Daly River, NT, n=25], E. subglobosa subglobosa [Fly River, PNG, n=55], E. macquarii macquarii [Murray Darling Basin north, NSW/Qld, n=152], E. macquarii krefftii [Fitzroy River, Qld, n=39] and E. macquarii emmotti [Cooper Creek, Qld, n=85]. The missing data rate was 1.7%, subsequently imputed by nearest neighbour to yield a fully populated data matrix. The data are a subset of those published by Georges et al. (2018, Molecular Ecology 27:5195-5213), for illustrative purposes only. A companion SilicoDArT dataset (silicodart_starting_data.Rdata) is also included. The above manipulations were performed in the R package dartR. Principal Components Analysis was undertaken using the glPca function of the R adegenet package (as implemented in dartR). Principal Coordinates Analysis was undertaken using the pcoa function of the R package ape, as implemented in dartR. To exemplify the effect of missing values on SNP visualisation using PCA, we simulated ten populations that reproduced over 200 non-overlapping generations. The simulated populations were placed in a linear series with low dispersal between adjacent populations (one disperser every ten generations). Each population had 100 individuals, of which 50 were sampled at random. Genotypes were generated for 1,000 neutral loci on one chromosome. We then randomly selected 50% of genotypes and set them as missing data. Principal Components Analysis was undertaken using the glPca function of the R adegenet package. The R script to implement this is provided (Supplementary_script_for_ms.R).
    The data for the Australian Blue Mountains skink Eulamprus leuraensis were generated for 372 individuals collected from 17 swamps isolated to varying degrees in the Blue Mountains region of New South Wales. Tail snips were collected and stored in 95% ethanol. The tissue samples were digested with proteinase K overnight and DNA was extracted using a NucleoMag 96 Tissue Kit (Macherey-Nagel, Düren, Germany) coupled with NucleoMag SEP (Ref. 744900) to allow automated separation of high-quality DNA on a Freedom Evo robotic liquid handler (TECAN, Männedorf, Switzerland). SNP data were generated by the commercial service of Diversity Arrays Technology Pty Ltd (Canberra, Australia) using published protocols. A total of 13,496 loci were scored, which reduced to 7,935 after filtering out secondary SNPs on the same sequence tag, filtering on reproducibility (threshold 0.99) and call rate (threshold 0.95), and removing monomorphic loci. The resultant data (Eulamprus_filtered.Rdata) are used to demonstrate the impact of a substantial inversion on the outcomes of a PCA. To test the effect of having closely related individuals (parents and offspring) on the PCoA pattern, we ran a simulation using dartR in which we picked two individuals to become the parents of 2-8 offspring, and ran a PCoA for each of the simulated cases. The R code used is included in the R script uploaded here. Refer to the companion manuscript for links to the literature associated with the above techniques.
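
    A minimal sketch of loading one of the provided objects and reproducing the PCA step, assuming (as the description states) that the .Rdata files were written with saveRDS(); note that the adegenet function is spelled glPca in current releases:

    library(dartR)  # loads adegenet as a dependency

    gl <- readRDS("SNP_starting_data.Rdata")  # genlight object: 18,196 loci x 381 individuals
    pca <- adegenet::glPca(gl, nf = 4)     # retain the first four axes
    scatter(pca)                # quick look at the sample configuration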

  5. Data from: Analysis and Visualization of Quantitative Proteomics Data Using...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Sep 10, 2024
    Cite
    Yi Hsiao; Haijian Zhang; Ginny Xiaohe Li; Yamei Deng; Fengchao Yu; Hossein Valipour Kahrood; Joel R. Steele; Ralf B. Schittenhelm; Alexey I. Nesvizhskii (2024). Analysis and Visualization of Quantitative Proteomics Data Using FragPipe-Analyst [Dataset]. http://doi.org/10.1021/acs.jproteome.4c00294.s003
    Explore at:
    xlsx
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    ACS Publications
    Authors
    Yi Hsiao; Haijian Zhang; Ginny Xiaohe Li; Yamei Deng; Fengchao Yu; Hossein Valipour Kahrood; Joel R. Steele; Ralf B. Schittenhelm; Alexey I. Nesvizhskii
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The FragPipe computational proteomics platform is gaining widespread popularity among the proteomics research community because of its fast processing speed and user-friendly graphical interface. Although FragPipe produces well-formatted output tables that are ready for analysis, there is still a need for an easy-to-use and user-friendly downstream statistical analysis and visualization tool. FragPipe-Analyst addresses this need by providing an R shiny web server to assist FragPipe users in conducting downstream analyses of the resulting quantitative proteomics data. It supports major quantification workflows, including label-free quantification, tandem mass tags, and data-independent acquisition. FragPipe-Analyst offers a range of useful functionalities, such as various missing value imputation options, data quality control, unsupervised clustering, differential expression (DE) analysis using Limma, and gene ontology and pathway enrichment analysis using Enrichr. To support advanced analysis and customized visualizations, we also developed FragPipeAnalystR, an R package encompassing all FragPipe-Analyst functionalities that is extended to support site-specific analysis of post-translational modifications (PTMs). FragPipe-Analyst and FragPipeAnalystR are both open-source and freely available.
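
    As a generic illustration of the Limma-based DE step mentioned above (this is plain limma on simulated data, not the FragPipe-Analyst or FragPipeAnalystR API):

    library(limma)

    # Simulated log2 intensity matrix: 100 proteins x 6 samples (3 control, 3 treated).
    set.seed(1)
    log2_int <- matrix(rnorm(600), nrow = 100,
              dimnames = list(paste0("prot", 1:100), paste0("s", 1:6)))
    group <- factor(rep(c("ctrl", "treat"), each = 3))
    design <- model.matrix(~ group)

    fit <- eBayes(lmFit(log2_int, design))  # moderated t-statistics
    topTable(fit, coef = 2, number = 5)   # top candidate DE proteins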

  6. Data from: Data and code from: Cover crop and crop rotation effects on...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Data and code from: Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield in no-till system - V2 [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-cover-crop-and-crop-rotation-effects-on-tissue-and-soil-population-dyna-831b9
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    [Note 2023-08-14: Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086]

    This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re

    The .zip archive cropping-systems-1.0.zip contains the following data and code files.

    Data
    • stem_soil_CFU_by_plant.csv: Soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with .
    • yield_CFU_by_plot.csv: Yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level.

    Code
    • cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code
    • equations.Rmd: RMarkdown notebook with formatted equations
    • formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript

    The R project file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing the raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.
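
    A short sketch for reading the plant-level file in R, assuming the missing-value marker is the period character the description mentions:

    cfu <- read.csv("stem_soil_CFU_by_plant.csv", na.strings = ".")
    str(cfu)  # expect year, plot ID, replicate, row, plant ID, treatments, SoilCFUg, StemCFUg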

  7. Quality Assurance and Quality Control (QA/QC) of Meteorological Time Series...

    • osti.gov
    • dataone.org
    • +1more
    Updated Dec 31, 2020
    Cite
    Environmental System Science Data Infrastructure for a Virtual Ecosystem (2020). Quality Assurance and Quality Control (QA/QC) of Meteorological Time Series Data for Billy Barr, East River, Colorado USA [Dataset]. http://doi.org/10.15485/1823516
    Explore at:
    Dataset updated
    Dec 31, 2020
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    Environmental System Science Data Infrastructure for a Virtual Ecosystem
    Area covered
    Colorado, United States, East River
    Description

    A comprehensive Quality Assurance (QA) and Quality Control (QC) statistical framework consists of three major phases: Phase 1, preliminary exploration of the raw datasets, including time formatting and combining datasets of different lengths and different time intervals; Phase 2, QA of the datasets, including detecting and flagging duplicates, outliers, and extreme values; and Phase 3, development of a time series of the desired frequency, imputation of missing values, visualization, and a final statistical summary. The time series data collected at the Billy Barr meteorological station (East River Watershed, Colorado) were analyzed. The developed statistical framework is suitable for both real-time and post-data-collection QA/QC analysis of meteorological datasets.

    This data package includes one Excel file converted to CSV format (Billy_Barr_raw_qaqc.csv) that contains the raw meteorological data, i.e., the input to the QA/QC analysis. The second CSV file (Billy_Barr_1hr.csv) contains the QA/QC-processed and flagged meteorological data, i.e., the output of the QA/QC analysis. The last file (QAQC_Billy_Barr_2021-03-22.R) is an R script that implements the QA/QC and flagging process. The purpose of the CSV data files included in this package is to provide the input and output files used by the R script.
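
    A hypothetical sketch of a Phase 2-style flagging step in R (not the authors' QAQC_Billy_Barr_2021-03-22.R script; the column name airT and its bounds are made up for illustration):

    met <- read.csv("Billy_Barr_raw_qaqc.csv")

    met$flag_duplicate <- duplicated(met)            # flag exact duplicate records
    met$flag_airT_range <- met$airT < -40 | met$airT > 40  # flag out-of-range values
    summary(met[c("flag_duplicate", "flag_airT_range")])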

  8. Blood Transfusion Dataset

    • kaggle.com
    zip
    Updated Sep 30, 2022
    Cite
    Aman Chauhan (2022). Blood Transfusion Dataset [Dataset]. https://www.kaggle.com/datasets/whenamancodes/blood-transfusion-dataset
    Explore at:
    zip (2684 bytes)
    Dataset updated
    Sep 30, 2022
    Authors
    Aman Chauhan
    Description

    Blood Transfusion Service Center Data Set: data taken from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. This is a classification problem.

    Data Set Information: To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The center passes its blood transfusion service bus to one university in Hsin-Chu City to gather donated blood about every three months. To build an RFMTC model, we selected 748 donors at random from the donor database. Each of these 748 donor records includes R (Recency - months since last donation), F (Frequency - total number of donations), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).

    Attribute Information: Given are the variable name, variable type, measurement unit, and a brief description. The "Blood Transfusion Service Center" task is a classification problem. The order of this listing corresponds to the order of numerals along the rows of the database: R (Recency - months since last donation), F (Frequency - total number of donations), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).

    Table 1 shows the descriptive statistics of the data. We selected 500 data at random as the training set, and the rest 248 as the testing set.

    Table 1. Descriptive statistics of the data

    Variable                     Data Type     Unit         Role    min    max    mean     std
    Recency                      quantitative  Months       Input   0.03   74.4   9.74     8.07
    Frequency                    quantitative  Times        Input   1      50     5.51     5.84
    Monetary                     quantitative  c.c. blood   Input   250    12500  1378.68  1459.83
    Time                         quantitative  Months       Input   2.27   98.3   34.42    24.32
    Donated blood in March 2007  binary        1=yes, 0=no  Output  0      1      1 (24%), 0 (76%)

    Data Set Characteristics: Multivariate
    Number of Instances: 748
    Area: Business
    Attribute Characteristics: Real
    Number of Attributes: 5
    Associated Tasks: Classification
    Missing Values? N/A
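
    A baseline sketch of the 500/248 split and a simple classifier in R; the file name and the column names (R, F, M, T, donated) are assumptions for illustration:

    transfusion <- read.csv("transfusion.csv")  # assumed file name
    set.seed(123)
    idx <- sample(nrow(transfusion), 500)    # 500 training rows, 248 test rows
    train <- transfusion[idx, ]
    test <- transfusion[-idx, ]

    model <- glm(donated ~ R + F + M + T, data = train, family = binomial)
    pred <- as.integer(predict(model, test, type = "response") > 0.5)
    mean(pred == test$donated)         # test-set accuracy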


    Citation Request: NOTE: Reuse of this database is unlimited with retention of the copyright notice for Prof. I-Cheng Yeh and the following published paper: Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, "Knowledge discovery on RFM model using Bernoulli sequence," Expert Systems with Applications, 2008 (doi:10.1016/j.eswa.2008.07.018).

  9. Data and codes for "Beyond NUE: A focus on true nitrogen gains in cereals"

    • data.mendeley.com
    • researchdata.edu.au
    Updated Nov 27, 2024
    Cite
    Javier Fernandez (2024). Data and codes for "Beyond NUE: A focus on true nitrogen gains in cereals" [Dataset]. http://doi.org/10.17632/s6kdt7pc2j.1
    Explore at:
    Dataset updated
    Nov 27, 2024
    Authors
    Javier Fernandez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source data and code supporting the manuscript "Beyond NUE: A focus on true nitrogen gains in cereals" based on maize, sorghum, and barley datasets.

    README (README.txt). References, sources for the reviewed dataset.

    Dataset S1 (rev_dataset.xlsx). Review data on NUE related traits for maize, sorghum, and barley cultivars from different decades of commercial release collected from published literature. Missing values or information of studies are represented by “n/a” in the data. References of traits and variables included in the data are described in a separate sheet within the XLSX file.

    Dataset S2 (analysisR.pdf). R code for data processing, analysis, and visualization.

  10. Wireless Link Quality Estimation on FlockLab - and Beyond

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 22, 2024
    Cite
    Romain Jacob; Reto Da Forno; Roman Trüb; Andreas Biri; Lothar Thiele (2024). Wireless Link Quality Estimation on FlockLab - and Beyond [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3354717
    Explore at:
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    ETH Zurich
    Authors
    Romain Jacob; Reto Da Forno; Roman Trüb; Andreas Biri; Lothar Thiele
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains wireless link quality estimation data for the FlockLab testbed [1,2]. The rationale and a description of this dataset are given in the following abstract (the pdf is included in this repository; see below).

    Dataset: Wireless Link Quality Estimation on FlockLab – and Beyond. Romain Jacob, Reto Da Forno, Roman Trüb, Andreas Biri, Lothar Thiele. DATA '19: Proceedings of the 2nd Workshop on Data Acquisition To Analysis, 2019.

    Data collection scenario

    The data collection scenario is simple. Each FlockLab node is assigned one dedicated time slot. In this slot, a node sends 100 packets, called strobes. All strobes have the same payload size and use a given radio frequency channel and transmit power. All other nodes listen for the strobes and log packet reception events (i.e., success or failure).

    The test scenario is run every two hours on two different platforms: the TelosB [3] and the DPP-cc430 [4]. We used all nodes available at test time (between 27 and 29).
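
    From such logs, per-link quality is typically summarized as a packet reception ratio (PRR). A hypothetical R sketch, assuming per-strobe rows with columns tx_node, rx_node and received (0/1) in one of the preprocessed csv files:

    strobes <- read.csv("2019-08_preprocessed_telosb.csv")  # assumed file name

    # PRR = fraction of the 100 strobes received, per directed link.
    prr <- aggregate(received ~ tx_node + rx_node, data = strobes, FUN = mean)
    head(prr[order(-prr$received), ])  # strongest links first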

    Final dataset status

    3 months of data with about 12 tests per day per platform

    5 months of data with about 4 tests per day per platform

    Data collection firmware

    We are happy to share the link quality data we collected on the FlockLab testbed, but we also wanted to make it easier for others to collect similar datasets for other wireless networks. To this end, we include in this repository the data collection firmware we designed. The data collection scheduling and control are done entirely in software, in order to make the firmware usable in a large variety of wireless networks. We implemented our data collection software using Baloo [5], a flexible network stack design framework based on Synchronous Transmission. Baloo efficiently handles network time synchronization and offers a flexible interface to schedule communication rounds. The firmware source code is available in the Baloo repository [6].

    A set of experiment parameters can be patched directly into the firmware, which lets the user tune the data collection without having to recompile the source code. This improves usability and facilitates automation. An example patching script is included in this repository. Currently, the following parameters can be patched:

    rf_channel,

    payload,

    host_id, and

    rand_seed

    Currently supported platforms

    TelosB [3]

    DPP-cc430 [4]

    Repository versions

    v1.4.1 Updated visualizations in the notebook

    v1.4.0 Addition of data from November 2019 to March 2020. Data collection is discontinued (the new FlockLab testbed is being set up).

    v1.3.1 Update abstract and notebook

    v1.3.0 Addition of October 2019 data. The frequency of tests has been reduced to 4 per day, executing at (approximately) 1:00, 7:00, 13:00, and 19:00. From October 28 onward, time shifted by one hour (2:00, 8:00, 14:00, 20:00).

    v1.2.0 Addition of September 2019 data. Many missing tests on the 12th, 13th, 19th, and 20th of September (due to construction works in the building).

    v1.1.4 Update of the abstract to have hyperlinks to the plots. Corrected typos.

    v1.1.0 Initial version. Adds the data collected in August 2019. Data collection was disturbed at the beginning of the month and resumed normally on August 13; data from the previous days are incomplete.

    v1.0.0 Initial version. Contains the data collected in July 2019, from the 10th to the 30th. No data were collected on the 31st of July (technical issue).

    List of files

    yyyy-mm_raw_platform.zip Archive containing all FlockLab test result files (one .zip file per month and per platform).

    yyyy-mm_preprocessed_all.zip Archive containing preprocessed csv files, one per month and per platform.

    firmware.zip Archive containing the firmware for all supported platforms.

    firmware_patch.sh Example bash script illustrating the firmware patching.

    parse_flocklab_results.ipynb [open in nbviewer] Jupyter notebook used to create the pre-processed data files. Also includes some examples of data visualization.

    parse_flocklab_results.html HTML rendering of the notebook (static).

    plots.zip Archive containing high-resolution visualizations of the dataset, generated by the parse_flocklab_results notebook and presented in the abstract.

    abstract.pdf A 3 page abstract presenting the dataset.

    CRediT.pdf The list of contributions from the authors.

    References

    [1] R. Lim, F. Ferrari, M. Zimmerling, C. Walser, P. Sommer, and J. Beutel, “FlockLab: A Testbed for Distributed, Synchronized Tracing and Profiling of Wireless Embedded Systems,” in Proceedings of the 12th International Conference on Information Processing in Sensor Networks, New York, NY, USA, 2013, pp. 153–166.

    [2] “FlockLab,” GitLab. [Online]. Available: https://gitlab.ethz.ch/tec/public/flocklab/wikis/home. [Accessed: 24-Jul-2019].

    [3] Advanticsys, “MTM-CM5000-MSP 802.15.4 TelosB mote Module.” [Online]. Available: https://www.advanticsys.com/shop/mtmcm5000msp-p-14.html. [Accessed: 21-Sep-2018].

    [4] Texas Instruments, “CC430F6137 16-Bit Ultra-Low-Power MCU.” [Online]. Available: http://www.ti.com/product/CC430F6137. [Accessed: 21-Sep-2018].

    [5] R. Jacob, J. Bächli, R. Da Forno, and L. Thiele, “Synchronous Transmissions Made Easy: Design Your Network Stack with Baloo,” in Proceedings of the 2019 International Conference on Embedded Wireless Systems and Networks, 2019.

    [6] “Baloo,” Dec-2018. [Online]. Available: http://www.romainjacob.net/research/baloo/.

  11. Data from: Student Academic Performance Dataset

    • kaggle.com
    Updated Oct 6, 2025
    Cite
    Hackathon data (2025). Student Academic Performance Dataset [Dataset]. https://www.kaggle.com/datasets/aryancodes12fyds/student-academic-performance-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hackathon data
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📘 Description

    The Student Academic Performance Dataset contains detailed academic and lifestyle information of 250 students, created to analyze how various factors — such as study hours, sleep, attendance, stress, and social media usage — influence their overall academic outcomes and GPA.

    This dataset is synthetic but realistic, carefully generated to reflect believable academic patterns and relationships. It’s perfect for learning data analysis, statistics, and visualization using Excel, Python, or R.

    The data includes 12 attributes, primarily numerical, ensuring that it’s suitable for a wide range of analytical tasks — from basic descriptive statistics (mean, median, SD) to correlation and regression analysis.

    📊 Key Features

    🧮 250 rows and 12 columns

    💡 Mostly numerical — great for Excel-based statistical functions

    🔍 No missing values — ready for direct use

    📈 Balanced and realistic — ideal for clear visualizations and trend analysis

    🎯 Suitable for:

    Descriptive statistics

    Correlation & regression

    Data visualization projects

    Dashboard creation (Excel, Tableau, Power BI)

    💡 Possible Insights to Explore

    How do study hours impact GPA?

    Is there a relationship between stress levels and performance?

    Does social media usage reduce study efficiency?

    Do students with higher attendance achieve better grades?

    ⚙️ Data Generation Details

    Each record represents a unique student.

    GPA is calculated using a weighted formula based on midterm and final scores.

    Relationships are designed to be realistic — for example:

    Higher study hours → higher scores and GPA

    Higher stress → slightly lower sleep hours

    Excessive social media time → reduced academic performance
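
    As a toy illustration of the weighted-GPA idea (the weights and the 0-4 mapping below are assumptions for illustration, not the dataset's actual generator):

    gpa_from_scores <- function(midterm, final, w_mid = 0.4, w_final = 0.6) {
     score <- w_mid * midterm + w_final * final  # weighted 0-100 score
     round(score / 25, 2)            # map onto a 0-4 GPA scale
    }
    gpa_from_scores(midterm = 72, final = 85)   # -> 3.19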

    ⚠️ Disclaimer

    This dataset is synthetically generated using statistical modeling techniques and does not contain any real student data. It is intended purely for educational, analytical, and research purposes.

  12. Data & Code: "Predicting missing links in global host-parasite networks"

    • figshare.com
    zip
    Updated Dec 16, 2021
    Cite
    Maxwell Farrell (2021). Data & Code: "Predicting missing links in global host-parasite networks" [Dataset]. http://doi.org/10.6084/m9.figshare.8969882.v2
    Explore at:
    zip
    Dataset updated
    Dec 16, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Maxwell Farrell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data & Code from Farrell et al., "Predicting missing links in global host-parasite networks".

    Scripts: within the scripts folder are scripts to process the raw data and model results:
    1. Download, clean, and merge host-parasite interaction databases with the mammal supertree (process_raw_data.R)
    2. Re-create raw data plots from the manuscript (raw_data_plots.R)
    3. Plot posterior interaction matrices and scaled trees, and pull out top predicted links (model_summaries.R)
    4. Re-create diagnostic plots from the manuscript (diagnostic_plots.R)
    5. Functions for data manipulation and visualization that are sourced by the other scripts (network_analysis.R)
    6. Investigate bias propagation via node degree product (bias_investigation.R)
    7. Generate risk maps (risk_maps.R)

    Data:
    • raw_data: includes the data necessary to amalgamate the host-parasite interaction databases (via the script process_raw_data.R).
    • clean_data: includes the full host-parasite interaction list 'hp_list' in both .csv and .rds formats, the binary interaction matrices for the full dataset and for subsets by parasite type (virus, bacteria, fungi, etc.), and the model diagnostics ('model_diagnostics.csv') used in diagnostic_plots.R.
    • model_results: contains one .rds file per model, with the output interaction matrix from each simulation ('P'), the table of model diagnostics ('TB'), and the phylogeny scaling parameter ('Eta'), if applicable. Note that to save space the full cross-fold fit posteriors are omitted (these total ~4.5 GB); please contact MF if these are required.
    • literature_results: contains a .csv version of the results of the literature search outlined in the Supplementary Information.
    • plots_tables: contains .csv files with the top 100 'missing' links for each model, and a .csv with the top 1000 links from the full model run on the full dataset.
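
    A minimal sketch for loading the cleaned interaction list in R (the exact file names inside clean_data are assumed from the description above):

    hp_list <- readRDS("clean_data/hp_list.rds")  # or: read.csv("clean_data/hp_list.csv")
    head(hp_list)             # host-parasite interaction records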

  13. SP500_data

    • kaggle.com
    zip
    Updated May 28, 2023
    Cite
    Franco Dicosola (2023). SP500_data [Dataset]. https://www.kaggle.com/datasets/francod/s-and-p-500-data
    Explore at:
    zip (39005 bytes)
    Dataset updated
    May 28, 2023
    Authors
    Franco Dicosola
    Description

    Project Documentation: Predicting S&P 500 Price

    Problem Statement: The goal of this project is to develop a machine learning model that can predict the future price of the S&P 500 index based on historical data and relevant features. By accurately predicting price movements, we aim to assist investors and financial professionals in making informed decisions and managing their portfolios effectively.

    Dataset Description: The dataset contains historical data of the S&P 500 index, along with several other features such as dividends, earnings, the consumer price index (CPI), and interest rates. The dataset spans a certain time period and includes daily values of these variables.

    Steps Taken:
    1. Data Preparation and Exploration: Loaded the dataset and performed initial exploration; checked for and handled missing values; explored the statistical summary and distributions of the variables; conducted correlation analysis to identify potential features for prediction.
    2. Data Visualization and Analysis: Plotted time series graphs of the S&P 500 index and other variables over time; examined trend, seasonality, and residual behavior using decomposition techniques; analyzed relationships between the S&P 500 index and the other features using scatter plots and correlation matrices.
    3. Feature Engineering and Selection: Selected relevant features based on correlation analysis and domain knowledge; explored feature importance using tree-based models; prepared the final feature set for model training.
    4. Model Training and Evaluation: Split the dataset into training and testing sets; selected a regression model (Linear Regression) for price prediction; trained the model on the training set; evaluated performance using mean squared error (MSE) and R-squared (R^2) on both sets.
    5. Prediction and Interpretation: Obtained predictions for future S&P 500 prices using the trained model and interpreted them in the context of current market conditions and the percentage change from the current price.

    Limitations and Future Improvements: The predictive performance is based on the available features and historical data, and may not capture all the complexities and factors influencing the S&P 500 index. The model's accuracy and reliability are subject to the quality and representativeness of the training data, and the model assumes that the historical patterns and relationships observed in the data will continue in the future, which may not always hold true. Future improvements could include incorporating additional relevant features, exploring different regression algorithms, and considering more sophisticated techniques such as time series forecasting models.
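
    An illustrative R sketch of the train/test split and linear-regression evaluation described above; the file name and column names are assumptions:

    sp <- read.csv("sp500.csv")  # assumed file name
    set.seed(1)
    idx <- sample(nrow(sp), floor(0.8 * nrow(sp)))
    train <- sp[idx, ]
    test <- sp[-idx, ]

    fit <- lm(price ~ dividends + earnings + cpi + rates, data = train)  # assumed columns
    pred <- predict(fit, test)
    mse <- mean((test$price - pred)^2)
    r2 <- 1 - sum((test$price - pred)^2) / sum((test$price - mean(test$price))^2)
    c(MSE = mse, R2 = r2)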

  14. Study Hours vs Grades Dataset

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Andrey Silva (2025). Study Hours vs Grades Dataset [Dataset]. https://www.kaggle.com/datasets/andreylss/study-hours-vs-grades-dataset
    Explore at:
    zip (33964 bytes)
    Dataset updated
    Oct 12, 2025
    Authors
    Andrey Silva
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.

    Dataset Features

    • student_id: Unique identifier for each student (1-5000)
    • study_hours: Hours spent studying (0-12 hours, continuous)
    • grade: Final exam score (0-100 points, continuous)

    Potential Use Cases

    • Linear regression modeling and practice
    • Data visualization exercises
    • Statistical analysis tutorials
    • Machine learning for beginners
    • Educational research simulations

    Data Quality

    • No missing values
    • Normally distributed residuals
    • Realistic educational scenario
    • Ready for immediate analysis

    Data Generation Code

    This dataset was generated using R.

    R Code

    # Set seed for reproducibility
    set.seed(42)
    
    # Define number of observations (students)
    n <- 5000
    
    # Generate study hours (independent variable)
    # Uniform distribution between 0 and 12 hours
    study_hours <- runif(n, min = 0, max = 12)
    
    # Create relationship between study hours and grade
    # Base grade: 40 points
    # Each study hour adds an average of 5 points
    # Add normal noise (standard deviation = 10)
    theoretical_grade <- 40 + 5 * study_hours
    
    # Add normal noise to make it realistic
    noise <- rnorm(n, mean = 0, sd = 10)
    
    # Calculate final grade
    grade <- theoretical_grade + noise
    
    # Limit grades between 0 and 100
    grade <- pmin(pmax(grade, 0), 100)
    
    # Create the dataframe
    dataset <- data.frame(
     student_id = 1:n,
     study_hours = round(study_hours, 2),
     grade = round(grade, 2)
    )
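    
    A quick sanity check that the generating parameters are recoverable from the simulated data:
    
    # Refit the linear relationship: intercept should be near 40, slope near 5.
    fit <- lm(grade ~ study_hours, data = dataset)
    coef(fit)
    summary(fit)$r.squared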
    
  15. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    zip (23875170 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all transactions that occurred over a period of time. The retailer will use the results to grow its business and to offer customers itemset suggestions, so that we can increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem with Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association Rules are most useful when you are planning to discover associations between different objects in a set, i.e., to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(mouse) = 0.08/0.10 = 0.8; lift = confidence / P(mat) = 0.8/0.09 = 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
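
    A toy version of this calculation with the arules package mentioned below (random baskets, so the exact numbers will differ from the worked example):

    library(arules)

    # 100 random baskets of 1-3 distinct items.
    set.seed(7)
    baskets <- replicate(100, sample(c("mouse", "mat", "keyboard"), sample(1:3, 1)),
               simplify = FALSE)
    trans <- as(baskets, "transactions")

    rules <- apriori(trans, parameter = list(supp = 0.05, conf = 0.5))
    inspect(head(sort(rules, by = "lift")))  # support, confidence and lift per rule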

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: . xlsx
    • Number of Row: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.


    Libraries in R

    First, we need to load the required libraries; each is briefly described below.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.


    Data Pre-processing

    Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.


    Next, we clean the data frame by removing missing values.


    To apply Association Rule mining, we need to convert the dataframe into transaction data, so that all items bought together in one invoice will be in ...

  16. Online Retail Customer Segmentation Project

    • kaggle.com
    zip
    Updated Apr 17, 2025
    Cite
    lxnmo bill (2025). Online Retail Customer Segmentation Project [Dataset]. https://www.kaggle.com/datasets/lxnmobill/online-retail-customer-segmentation-project/versions/1
    Explore at:
    zip (16886255 bytes)
    Dataset updated
    Apr 17, 2025
    Authors
    lxnmo bill
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📦 Dataset Description

    This dataset supports the Online Retail Customer Segmentation Project, which analyzes one year of transaction records from a UK-based online gift store.

    The goal is to identify customer segments using RFM (Recency, Frequency, Monetary) modeling and KMeans clustering, and to explore customer value and behavior through visualization dashboards.

    📁 Included Files:

    • retail_cleaned.csv - Cleaned transaction-level data (negative quantity, missing IDs removed)
    • retail_segmented.csv - Main analysis table with RFM-based Segment labels merged in
    • customer_summary copy.csv - Customer-level summary: total orders, total spent, first/last purchase dates
    • monthly_sales copy.csv - Aggregated monthly sales data for time trend analysis
    • Online Retail Analysis.pdf - Full project report (data process + dashboard screenshots + insights)

    🔧 Preprocessing Summary:

    • Removed records with missing CustomerID, negative Quantity, or invalid UnitPrice
    • Created TotalPrice = Quantity × UnitPrice
    • Generated customer metrics in SQL and calculated RFM values in R
    • Performed KMeans clustering to create customer segments (Segment 1–4)
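    A hedged sketch of the RFM and k-means steps described above; the column names (CustomerID, InvoiceDate, InvoiceNo, TotalPrice) are assumptions based on the file descriptions:

      library(dplyr)

      retail <- read.csv("retail_cleaned.csv")
      retail$InvoiceDate <- as.Date(retail$InvoiceDate)

      snapshot <- max(retail$InvoiceDate)  # reference date for recency
      rfm <- retail %>%
        group_by(CustomerID) %>%
        summarise(
          Recency   = as.numeric(snapshot - max(InvoiceDate)),  # days since last order
          Frequency = n_distinct(InvoiceNo),
          Monetary  = sum(TotalPrice)
        )

      set.seed(42)  # reproducible cluster assignment
      km <- kmeans(scale(rfm[, c("Recency", "Frequency", "Monetary")]), centers = 4)
      rfm$Segment <- km$cluster  # Segments 1-4, as in retail_segmented.csv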

    📊 Applications:

    • Customer segmentation for loyalty/retention campaigns
    • Sales trend and seasonal pattern analysis
    • High-value customer targeting
    • Geographical revenue mapping

  17. Jet2 Synthetic Booking

    • kaggle.com
    zip
    Updated Nov 6, 2025
    Cite
    Adythio Niramoyo (2025). Jet2 Synthetic Booking [Dataset]. https://www.kaggle.com/datasets/adythioniramoyo/jet2-synthetic-booking
    Explore at:
    zip(10065345 bytes)Available download formats
    Dataset updated
    Nov 6, 2025
    Authors
    Adythio Niramoyo
    Description

    This dataset simulates Jet2 airline passenger bookings and is designed for segmentation, clustering, and behavioral analysis.

    📊 Dataset Description: Jet2 Synthetic Booking

    The Jet2 Synthetic Booking dataset provides a realistic simulation of passenger booking behavior for Jet2, a UK-based leisure airline. It is ideal for data science projects involving customer segmentation, predictive modeling, and operational insights.

    🧾 Key Features

    • Passenger-level booking records with anonymized identifiers
    • Temporal booking patterns: Includes booking dates, departure dates, and lead times
    • Flight details: Routes, departure airports, destination airports
    • Fare and pricing data: Ticket prices, taxes, and total spend
    • Passenger segmentation: Useful for clustering into groups like Early Birds, Mid-Range, and Late Volatility
    • Synthetic generation: Modeled to reflect realistic Jet2 booking trends without using proprietary or personal data

    🎯 Use Cases

    • K-Means clustering to identify booking behavior segments (see the sketch after this list)
    • Time series analysis of booking lead times and seasonal demand
    • Revenue optimization based on fare classes and booking windows
    • Marketing strategy development by understanding customer booking habits
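    As an illustration of the first use case above, a k-means sketch under assumed column names (lead_time_days, total_spend) and an assumed filename might look like this:

      bookings <- read.csv("jet2_synthetic_booking.csv")  # filename assumed

      set.seed(1)
      feats <- scale(bookings[, c("lead_time_days", "total_spend")])
      km <- kmeans(feats, centers = 3)  # e.g. Early Birds / Mid-Range / Late Volatility
      bookings$segment <- factor(km$cluster)
      table(bookings$segment)  # segment sizes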

    📁 Format & Accessibility

    • Available as a CSV file on Kaggle
    • Cleaned and structured for immediate use in Python, R, or BI tools
    • No missing values or privacy concerns due to synthetic generation
  18. AIDS Virus Infection Prediction 💉

    • kaggle.com
    zip
    Updated Apr 28, 2024
    + more versions
    Cite
    Aadarsh velu (2024). AIDS Virus Infection Prediction 💉 [Dataset]. https://www.kaggle.com/datasets/aadarshvelu/AIDS-virus-infection-prediction
    Explore at:
    zip(1710961 bytes)Available download formats
    Dataset updated
    Apr 28, 2024
    Authors
    Aadarsh velu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context:

    The dataset contains healthcare statistics and categorical information about patients who have been diagnosed with AIDS. It was initially published in 1996.

    Attribute Information:

    • time: time to failure or censoring
    • trt: treatment indicator (0 = ZDV only; 1 = ZDV + ddI, 2 = ZDV + Zal, 3 = ddI only)
    • age: age (yrs) at baseline
    • wtkg: weight (kg) at baseline
    • hemo: hemophilia (0=no, 1=yes)
    • homo: homosexual activity (0=no, 1=yes)
    • drugs: history of IV drug use (0=no, 1=yes)
    • karnof: Karnofsky score (on a scale of 0-100)
    • oprior: Non-ZDV antiretroviral therapy pre-175 (0=no, 1=yes)
    • z30: ZDV in the 30 days prior to 175 (0=no, 1=yes)
    • preanti: days pre-175 anti-retroviral therapy
    • race: race (0=White, 1=non-white)
    • gender: gender (0=F, 1=M)
    • str2: antiretroviral history (0=naive, 1=experienced)
    • strat: antiretroviral history stratification (1 = 'Antiretroviral Naive', 2 = '> 1 but <= 52 weeks of prior antiretroviral therapy', 3 = '> 52 weeks')
    • symptom: symptomatic indicator (0=asymp, 1=symp)
    • treat: treatment indicator (0=ZDV only, 1=others)
    • offtrt: indicator of off-trt before 96+/-5 weeks (0=no,1=yes)
    • cd40: CD4 at baseline
    • cd420: CD4 at 20+/-5 weeks
    • cd80: CD8 at baseline
    • cd820: CD8 at 20+/-5 weeks
    • infected: is infected with AIDS (0=No, 1=Yes)

    Additional Variable Information:

    • Personal information (age, weight, race, gender, sexual activity)
    • Medical history (hemophilia, history of IV drugs)
    • Treatment history (ZDV/non-ZDV treatment history)
    • Lab results (CD4/CD8 counts)
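    As a quick way to put this data dictionary to work, a hedged baseline sketch (the CSV filename is an assumption) could fit a simple logistic regression for the infected label:

      aids <- read.csv("aids_clinical_trials.csv")  # filename assumed

      fit <- glm(infected ~ age + wtkg + karnof + cd40 + cd80 + treat,
                 data = aids, family = binomial)
      summary(fit)  # baseline effect estimates for a few of the attributes above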

    Citation:

    https://classic.clinicaltrials.gov/ct2/show/NCT00000625

    Acknowledgment:

    Creators:

    1. S. Hammer
    2. D. Katzenstein
    3. M. Hughes
    4. H. Gundacker
    5. R. Schooley
    6. R. Haubrich
    7. W. K.
    8. M. Lederman
    9. J. Phair
    10. M. Niu
    11. M. Hirsch
    12. T. Merigan

    Donor:

    https://archive.ics.uci.edu/dataset/890/aids+clinical+trials+group+study+175

  19. Social Insurance Programs in Richest Quintile

    • kaggle.com
    Updated Jan 7, 2023
    Cite
    The Devastator (2023). Social Insurance Programs in Richest Quintile [Dataset]. https://www.kaggle.com/datasets/thedevastator/coverage-of-social-insurance-programs-in-richest
    Explore at:
    Croissant - a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 7, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Coverage of Social Insurance Programs in Richest Quintile

    Percent of Population Eligible

    By data.world's Admin [source]

    About this dataset

    This dataset offers a unique insight into the coverage of social insurance programs for the wealthiest quintile of populations around the world. It reveals how many individuals in each country receive support from old-age contributory pensions, disability benefits, and social security and health insurance benefits such as occupational injury benefits, paid sick leave, and maternity leave. The data is an invaluable resource for understanding the health and well-being of the most financially privileged in society, a group that often has greater influence on decision making than others. With figures current as of 2019-05-11, the dataset helps uncover where work remains to be done to improve healthcare provision in each country across the world.


    How to use the dataset

    • Understand the context: Before you begin analyzing this dataset, it is important to understand the information that it provides. Take some time to read the description of what is included in the dataset, including a clear understanding of the definitions and scope of coverage provided with each data point.

    • Examine the data: Once you have a general understanding of this dataset's contents, take some time to explore its contents in more depth. What specific questions does this dataset help answer? What kind of insights does it provide? Are there any missing pieces?

    • Clean & Prepare Data: After you've preliminarily examined its content, start preparing your data for further analysis and visualization. Clean up any formatting issues or irregularities present in your data set by correcting typos and eliminating unnecessary rows or columns before working with your chosen programming language (I prefer R for data manipulation tasks). Additionally, consider performing necessary transformations such as sorting or averaging values if appropriate for the findings you wish to draw from your analysis.

    • Visualize Results: Once you've cleaned and prepared your data, use visualizations such as charts, graphs, or tables to reveal patterns that support specific conclusions about how insurance coverage under social programs varies among different groups within society's quintiles. This type of visualization lets those who aren't familiar with programming process complex information more quickly and accurately than tabular numbers alone.

    • Final Analysis & Export Results: Finally, export your visuals into presentation-ready formats (e.g., PDFs) that can be shared with colleagues. Use these results in a narrative conclusion report providing an accurate assessment and meaningful interpretation of how social insurance programs vary between different members within society's quintiles (i.e., richest vs. poorest), along with potential policy implications for implementing strategies that improve access accordingly.
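    A hedged sketch of steps 3 and 4, reading the file listed in the Columns section below; the column names country_name and coverage are assumptions, so inspect names(df) first:

      df <- read.csv("coverage-of-social-insurance-programs-in-richest-quintile-of-population-1.csv")
      df <- df[complete.cases(df), ]  # step 3: drop incomplete rows

      library(ggplot2)                # step 4: one possible visualization
      ggplot(df, aes(x = reorder(country_name, coverage), y = coverage)) +
        geom_col() +
        coord_flip() +
        labs(x = NULL, y = "Coverage in richest quintile (%)")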

    Research Ideas

    • Analyzing the effectiveness of social insurance programs by comparing the coverage levels across different geographic areas or socio-economic groups;
    • Estimating the economic impact of social insurance programs on local and national economies by tracking spending levels and revenues generated;
    • Identifying potential problems with access to social insurance benefits, such as racial or gender disparities in benefit coverage

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and data.world's Admin.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: coverage-of-social-insurance-programs-in-richest-quintile-of-population-1.csv


  20. Zomato Project.

    • kaggle.com
    zip
    Updated Aug 24, 2024
    Cite
    Umair Hayat (2024). Zomato Project. [Dataset]. https://www.kaggle.com/datasets/umairhayat/zomato-project/code
    Explore at:
    zip(2543 bytes)Available download formats
    Dataset updated
    Aug 24, 2024
    Authors
    Umair Hayat
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Project Description: Analysis of Restaurant Preferences and Ordering Trends on Zomato

    In this project, we explore and analyze various aspects of customer behavior and restaurant performance using Zomato's data. Our goal is to derive actionable insights that can help enhance customer experience and optimize restaurant offerings.

    Objectives:

    • Restaurant Popularity Analysis - Identify Popular Restaurant Types: Determine which types of restaurants receive the most votes from customers. This will help us understand which categories are most favored and could guide marketing strategies.
    • Vote Distribution by Restaurant Type - Quantify Votes for Each Type: Calculate the total number of votes each type of restaurant has received. This will provide a clear picture of customer preferences across different restaurant categories.
    • Rating Trends - Analyze Rating Distribution: Examine the ratings that the majority of restaurants have received. This will help identify the overall satisfaction level of customers and the general quality of dining experiences.
    • Couple Spending Patterns - Average Spending Analysis: Analyze the average spending per order for couples who frequently order online. This insight will assist in understanding spending behaviors and potential revenue generation from this demographic.
    • Mode of Ordering Performance - Evaluate Ratings by Ordering Mode: Compare the ratings received by online versus offline orders to determine which mode is preferred and delivers higher customer satisfaction.
    • Offline Ordering Trends - Identify High-Order Restaurant Types: Find out which types of restaurants receive more offline orders. This information can be used to tailor promotions and offers for specific restaurant categories, enhancing customer engagement.

    Methodology:

    • Data Collection: Utilize Zomato's API or available datasets to gather comprehensive data on restaurant types, votes, ratings, and ordering modes.
    • Data Cleaning and Preparation: Clean the dataset to handle missing values, standardize categories, and ensure data accuracy.
    • Data Analysis: Employ statistical and data visualization tools to aggregate votes, analyze ratings, and explore spending patterns. Use tools like Python (Pandas, Matplotlib, Seaborn), R, or Excel for data processing and visualization.
    • Insights and Recommendations: Generate insights based on the analysis and provide actionable recommendations for restaurant marketing strategies and customer engagement.

    This project aims to provide a detailed understanding of customer preferences and behaviors, enabling Zomato to make data-driven decisions to improve user experience and offer targeted promotions.
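    As an illustration of the vote-aggregation objective, a dplyr sketch under assumed column names (rest_type, votes, rating) and an assumed filename might look like this:

      library(dplyr)

      zomato <- read.csv("zomato.csv")  # filename assumed

      zomato %>%
        group_by(rest_type) %>%
        summarise(total_votes = sum(votes, na.rm = TRUE),
                  avg_rating  = mean(rating, na.rm = TRUE)) %>%
        arrange(desc(total_votes))  # most-voted restaurant types first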
