71 datasets found
  1. climwin: An R Toolbox for Climate Window Analysis

    • plos.figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Liam D. Bailey; Martijn van de Pol (2023). climwin: An R Toolbox for Climate Window Analysis [Dataset]. http://doi.org/10.1371/journal.pone.0167980
    Available download formats: txt
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Liam D. Bailey; Martijn van de Pol
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    When studying the impacts of climate change, there is a tendency to select climate data from a small set of arbitrary time periods or climate windows (e.g., spring temperature). However, these arbitrary windows may not encompass the strongest periods of climatic sensitivity and may lead to erroneous biological interpretations. Therefore, there is a need to consider a wider range of climate windows to better predict the impacts of future climate change. We introduce the R package climwin that provides a number of methods to test the effect of different climate windows on a chosen response variable and compare these windows to identify potential climate signals. climwin extracts the relevant data for each possible climate window and uses this data to fit a statistical model, the structure of which is chosen by the user. Models are then compared using an information criteria approach. This allows users to determine how well each window explains variation in the response variable and compare model support between windows. climwin also contains methods to detect type I and II errors, which are often a problem with this type of exploratory analysis. This article presents the statistical framework and technical details behind the climwin package and demonstrates the applicability of the method with a number of worked examples.
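
    The window search described above can be sketched roughly as follows. This is not climwin's interface (climwin is an R package): the function names, the simple linear model, and the Gaussian AIC formula are illustrative assumptions, chosen only to show the idea of fitting one model per candidate window and ranking the windows by an information criterion.

```python
import math
from statistics import mean


def fit_aic(x, y):
    """Fit y = a + b*x by ordinary least squares and return a Gaussian AIC."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    k = 3  # intercept, slope, residual variance
    return n * math.log(rss / n) + 2 * k


def best_window(climate, response, max_lag):
    """Try every window (start..end days before the observation).

    climate[i] holds the climate values preceding observation i, with
    index 0 being the day closest to the observation (an assumed layout).
    Returns (aic, start, end) for the window with the lowest AIC.
    """
    results = []
    for start in range(max_lag):
        for end in range(start, max_lag):
            # Aggregate the climate variable over the candidate window.
            x = [mean(series[start:end + 1]) for series in climate]
            results.append((fit_aic(x, response), start, end))
    return min(results)


# Toy data: the response tracks the value one day before the observation.
climate = [[3, 1, 2], [1, 2, 7], [4, 3, 1], [1, 4, 8], [5, 5, 2]]
response = [1.1, 1.9, 3.05, 4.0, 4.95]
aic, start, end = best_window(climate, response, max_lag=3)
print(f"best window: days {start} to {end} before the observation")
```

    With real data the per-window model would be whatever structure the user chooses, and the comparison would normally be made against a baseline model with no climate term.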

  2. Physical Properties of Lakes: Exploratory Data Analysis

    • search.dataone.org
    • hydroshare.org
    Updated Apr 15, 2022
    + more versions
    Cite
    Gabriela Garcia; Kateri Salk (2022). Physical Properties of Lakes: Exploratory Data Analysis [Dataset]. https://search.dataone.org/view/sha256%3A82a3bd46ad259724cad21b7a344728253ea4e6d929f6134e946c379585f903f6
    Dataset updated
    Apr 15, 2022
    Dataset provided by
    Hydroshare
    Authors
    Gabriela Garcia; Kateri Salk
    Time period covered
    May 27, 1984 - Aug 17, 2016
    Area covered
    Description

    Exploratory Data Analysis for the Physical Properties of Lakes

    This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the first part of a two-part exercise focusing on the physical properties of lakes.

    Introduction

    Lakes are dynamic, nonuniform bodies of water in which the physical, biological, and chemical properties interact. Lakes also contain the majority of Earth's fresh water supply. This lesson introduces exploratory data analysis using R statistical software in the context of the physical properties of lakes.

    Learning Objectives

    After successfully completing this exercise, you will be able to:

    1. Apply exploratory data analytics skills to applied questions about physical properties of lakes
    2. Communicate findings with peers through oral, visual, and written modes
  3. ftmsRanalysis: An R package for exploratory data analysis and interactive...

    • plos.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Lisa M. Bramer; Amanda M. White; Kelly G. Stratton; Allison M. Thompson; Daniel Claborne; Kirsten Hofmockel; Lee Ann McCue (2023). ftmsRanalysis: An R package for exploratory data analysis and interactive visualization of FT-MS data [Dataset]. http://doi.org/10.1371/journal.pcbi.1007654
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Lisa M. Bramer; Amanda M. White; Kelly G. Stratton; Allison M. Thompson; Daniel Claborne; Kirsten Hofmockel; Lee Ann McCue
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The high resolution and mass accuracy of Fourier transform mass spectrometry (FT-MS) have made it an increasingly popular technique for discerning the composition of soil, plant and aquatic samples containing complex mixtures of proteins, carbohydrates, lipids, lignins, hydrocarbons, phytochemicals and other compounds. Thus, there is a growing demand for informatics tools to analyze FT-MS data that will aid investigators seeking to understand the availability of carbon compounds to biotic and abiotic oxidation and to compare fundamental chemical properties of complex samples across groups. We present ftmsRanalysis, an R package which provides an extensive collection of data formatting and processing, filtering, visualization, and sample and group comparison functionalities. The package provides a suite of plotting methods and enables expedient, flexible and interactive visualization of complex datasets through functions which link to a powerful and interactive visualization user interface, Trelliscope. An example analysis using FT-MS data from a soil microbiology study demonstrates the core functionality of the package and highlights the capabilities for producing interactive visualizations.

  4. Data and R scripts for 'Reliability of geochemical analyses: Deja vu all...

    • data.mendeley.com
    Updated Mar 12, 2019
    Cite
    Ola Anfin Eggen (2019). Data and R scripts for 'Reliability of geochemical analyses: Deja vu all over again' [Dataset]. http://doi.org/10.17632/pvw557y82p.1
    Dataset updated
    Mar 12, 2019
    Authors
    Ola Anfin Eggen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The zipped file contains the following:

    • data (as csv, in the 'data' folder)
    • R scripts (as Rmd, in the rro folder)
    • figures (as pdf, in the 'figs' folder)
    • presentation (as html, in the root folder)

  5. Exploratory Data Analysis (EDA) Tools Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Cite
    Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54369
    Available download formats: doc, ppt, pdf
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across industries. The rising need for data-driven decision-making, coupled with the expanding adoption of cloud-based analytics solutions, is fueling market expansion. While precise figures for market size and CAGR are not provided, a reasonable estimate, based on the prevalent growth in the broader analytics market and the crucial role of EDA in the data science workflow, would place the 2025 market size at approximately $3 billion, with a projected Compound Annual Growth Rate (CAGR) of 15% through 2033. This growth is segmented across various applications, with large enterprises leading adoption due to their higher investment capacity and complex data needs. However, SMEs are witnessing rapid growth in EDA tool adoption, driven by the increasing availability of user-friendly and cost-effective solutions. Further segmentation by tool type reveals a strong preference for graphical EDA tools, which offer intuitive visualizations that facilitate better data understanding and communication of findings. Geographic regions such as North America and Europe currently hold a significant market share, but the Asia-Pacific region shows promising potential for future growth owing to increasing digitalization and data generation. Key restraints to market growth include the need for specialized skills to use these tools effectively and the potential for data bias if data are not handled appropriately.

    The competitive landscape is dynamic, with both established players like IBM and emerging companies specializing in niche areas vying for market share. Established players benefit from brand recognition and comprehensive enterprise solutions, while specialized vendors provide innovative features and agile development cycles. Open-source options such as KNIME, the R package Rattle, and the Python library Pandas Profiling offer cost-effective alternatives, particularly attracting academic institutions and smaller businesses. The ongoing development of advanced analytics functionalities, such as automated machine learning integration within EDA platforms, will be a significant driver of future market growth. Further, the integration of EDA tools within broader data science platforms is streamlining the overall analytical workflow, contributing to increased adoption and reduced complexity. The market's evolution hinges on enhanced user experience, more robust automation features, and seamless integration with other data management and analytics tools.
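
    As a quick arithmetic check on the figures quoted above, a constant 15% CAGR applied to a 2025 base of about $3 billion can be compounded forward to 2033. The function name is illustrative, and the result is only what the quoted estimate implies, not a figure stated in the report.

```python
def project_market_size(base_size, cagr, base_year, target_year):
    """Compound a base-year value forward at a constant annual growth rate."""
    years = target_year - base_year
    return base_size * (1 + cagr) ** years


# Inputs taken from the description: about $3 billion in 2025, 15% CAGR to 2033.
size_2033 = project_market_size(3.0, 0.15, 2025, 2033)
print(f"Implied 2033 market size: ${size_2033:.1f} billion")
```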

  6. HadISD: Global sub-daily, surface meteorological station data, 1931-2021,...

    • catalogue.ceda.ac.uk
    • data-search.nerc.ac.uk
    Updated Jan 26, 2022
    + more versions
    Cite
    NERC EDS Centre for Environmental Data Analysis (2022). HadISD: Global sub-daily, surface meteorological station data, 1931-2021, v3.2.0.2021f [Dataset]. https://catalogue.ceda.ac.uk/uuid/5fb94c8e37f64c95b671278b0e55cdd4
    Dataset updated
    Jan 26, 2022
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    License

    http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/

    Time period covered
    Jan 1, 1931 - Dec 31, 2021
    Area covered
    Earth
    Variables measured
    time, altitude, latitude, longitude, wind_speed, air_temperature, wind_speed_of_gust, cloud_area_fraction, cloud_base_altitude, wind_from_direction, and 6 more
    Description

    This is version v3.2.0.2021f of Met Office Hadley Centre's Integrated Surface Database, HadISD. These data are global sub-daily surface meteorological data.

    The quality controlled variables in this dataset are: temperature, dewpoint temperature, sea-level pressure, wind speed and direction, cloud data (total, low, mid and high level). Past significant weather and precipitation data are also included, but have not been quality controlled, so their quality and completeness cannot be guaranteed. Quality control flags and data values which have been removed during the quality control process are provided in the qc_flags and flagged_values fields, and ancillary data files show the station listing with IDs, names and location information.

    The data are provided as one NetCDF file per station. Files in the station_data folder have the format "station_code"_HadISD_HadOBS_19310101-20220101_v3.2.1.2021f.nc. The station codes can be found under the docs tab. The station codes file has five columns as follows: 1) station code, 2) station name, 3) station latitude, 4) station longitude, 5) station height.
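
    A minimal sketch of how the file-name pattern and the five station columns quoted above might be represented in code. The helper names and the station values are hypothetical placeholders, and the sketch assumes the five fields have already been split out, since the delimiter of the station codes file is not stated here.

```python
from typing import NamedTuple


class Station(NamedTuple):
    """One row of the station codes file (five columns, per the description)."""
    code: str
    name: str
    latitude: float
    longitude: float
    height: float


def parse_station(fields):
    """Build a Station from five already-split text fields."""
    code, name, lat, lon, height = fields
    return Station(code, name, float(lat), float(lon), float(height))


def station_filename(station_code,
                     period="19310101-20220101",
                     version="v3.2.1.2021f"):
    """Assemble a per-station NetCDF file name following the quoted pattern."""
    return f"{station_code}_HadISD_HadOBS_{period}_{version}.nc"


# Hypothetical station row, for illustration only.
station = parse_station(["000000-00000", "EXAMPLE", "51.5", "-0.1", "25.0"])
print(station_filename(station.code))
```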

    To keep informed about updates, news and announcements follow the HadOBS team on twitter @metofficeHadOBS.

    For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISD blog: http://hadisd.blogspot.co.uk/

    References: When using the dataset in a paper you must cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference):

    Dunn, R. J. H., (2019), HadISD version 3: monthly updates, Hadley Centre Technical Note.

    Dunn, R. J. H., Willett, K. M., Parker, D. E., and Mitchell, L.: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geosci. Instrum. Method. Data Syst., 5, 473-491, doi:10.5194/gi-5-473-2016, 2016.

    Dunn, R. J. H., et al. (2012), HadISD: A Quality Controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Clim. Past, 8, 1649-1679, 2012, doi:10.5194/cp-8-1649-2012

    Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704–708, doi:10.1175/2011BAMS3015.1

    For a homogeneity assessment of HadISD, please see the following reference:

    Dunn, R. J. H., K. M. Willett, C. P. Morice, and D. E. Parker. "Pairwise homogeneity assessment of HadISD." Climate of the Past 10, no. 4 (2014): 1501-1522. doi:10.5194/cp-10-1501-2014, 2014.

  7. Reddit AskScience Flair Analysis Dataset

    • kaggle.com
    Updated Feb 15, 2025
    Cite
    Sumit Mishra (2025). Reddit AskScience Flair Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/sumitm004/reddit-raskscience-flair-dataset
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sumit Mishra
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Context

    Reddit is a massive platform for news, content, and discussions, hosting millions of active users daily. Among its vast number of subreddits, we focus on the r/AskScience community, where users engage in science-related discussions and questions.

    Content

    This dataset is derived from the r/AskScience subreddit, collected between January 1, 2016, and May 20, 2022. It includes 612,668 datapoints across 22 columns, featuring diverse information such as the content of the questions, submission descriptions, associated flairs, NSFW/SFW status, year of submission, and more. The data was extracted using Python and Pushshift's API, followed by some cleaning with NumPy and pandas. Detailed column descriptions are available for clarity.

    Mendeley Data

    Ideas for Usage

    • Flair Prediction: Train models to predict post flairs (e.g., 'Science', 'Ask', 'Discussion') to automate content categorization for platforms like Reddit.
    • NSFW Classification: Classify posts as SFW or NSFW based on textual content, enabling content moderation tools for online forums.
    • Text Mining / NLP Tasks: Apply NLP techniques like Sentiment Analysis, Topic Modeling, and Text Classification to explore the content and themes of science-related discussions.
    • Community Engagement Analysis: Investigate which post types or flairs generate more engagement (e.g., upvotes or comments), offering insights into user interaction.
    • Trend Detection in Science Topics: Identify emerging science topics and analyze shifts in interest areas, which can help predict future trends in scientific discussions.
  8. t-test i Case study analize podataka (t-test and data analysis case study)

    • explore.openaire.eu
    Updated May 24, 2021
    Cite
    Nadica Miljković (2021). t-test i Case study analize podataka [Dataset]. http://doi.org/10.5281/zenodo.11312604
    Dataset updated
    May 24, 2021
    Authors
    Nadica Miljković
    Description

    Lecture for the course Biomedical Signal Processing Techniques (Tehnike obrade biomedicinskih signala) in the master's academic studies program at the School of Electrical Engineering, University of Belgrade.

  9. HadISD: Global sub-daily, surface meteorological station data, 1931-2017,...

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Jul 24, 2021
    + more versions
    Cite
    (2021). HadISD: Global sub-daily, surface meteorological station data, 1931-2017, v2.0.2.2017p [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=dewpoint
    Dataset updated
    Jul 24, 2021
    Description

    This is version 2.0.2.2017p of Met Office Hadley Centre's Integrated Surface Database, HadISD. These data are global sub-daily surface meteorological data that extend HadISD v2.0.1.2016f to include 2017 and so span 1931-2017. These data include an update to the stations selected and contain 8103 stations. These are the preliminary data for this version; a finalised version will be released in a few months with any station updates.

    The quality controlled variables in this dataset are: temperature, dewpoint temperature, sea-level pressure, wind speed and direction, cloud data (total, low, mid and high level). Past significant weather and precipitation data are also included, but have not been quality controlled, so their quality and completeness cannot be guaranteed. Quality control flags and data values which have been removed during the quality control process are provided in the qc_flags and flagged_values fields, and ancillary data files show the station listing with IDs, names and location information.

    The data are provided as one NetCDF file per station. Files in the station_data folder have the format "station_code"_HadISD_HadOBS_19310101-20171231_v2-0-2-2017p.nc. The station codes can be found under the docs tab or on the archive beside the station_data folder. The station codes file has five columns as follows: 1) station code, 2) station name, 3) station latitude, 4) station longitude, 5) station height.

    To keep up to date with updates, news and announcements, follow the HadOBS team on Twitter @metofficeHadOBS.

    For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISD blog: http://hadisd.blogspot.co.uk/

    References: When using the dataset in a paper you must cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference):

    Dunn, R. J. H., Willett, K. M., Parker, D. E., and Mitchell, L.: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geosci. Instrum. Method. Data Syst., 5, 473-491, doi:10.5194/gi-5-473-2016, 2016.

    Dunn, R. J. H., et al. (2012), HadISD: A Quality Controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Clim. Past, 8, 1649-1679, 2012, doi:10.5194/cp-8-1649-2012

    Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704–708, doi:10.1175/2011BAMS3015.1

    For a homogeneity assessment of HadISD, please see the following reference:

    Dunn, R. J. H., K. M. Willett, C. P. Morice, and D. E. Parker. "Pairwise homogeneity assessment of HadISD." Climate of the Past 10, no. 4 (2014): 1501-1522. doi:10.5194/cp-10-1501-2014, 2014.

  10. Evaluation of financial statements

    • kaggle.com
    Updated Jan 25, 2022
    Cite
    Arian Ríos (2022). Evaluation of financial statements [Dataset]. https://www.kaggle.com/arianrios/evaluacin-de-estados-financieros/metadata
    Dataset updated
    Jan 25, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arian Ríos
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📄 Data Context

    Instead of a data set, this is an exercise on financial analysis made with R vectors. The intention is to demonstrate the practicality of this tool when interpreting vector calculations, matrices or data frames. This is due to its character of being a high-level language, which allows algorithms to be expressed in a way that is appropriate to human cognitive capacity, instead of the capacity with which machines execute them.

  11. Comparison of predictive power.

    • figshare.com
    xls
    Updated Oct 12, 2023
    + more versions
    Cite
    Alina Bazarova; Marko Raseta (2023). Comparison of predictive power. [Dataset]. http://doi.org/10.1371/journal.pone.0292597.t003
    Available download formats: xls
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Alina Bazarova; Marko Raseta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present an R-package for predictive modelling, CARRoT (Cross-validation, Accuracy, Regression, Rule of Ten). CARRoT is a tool for initial exploratory analysis of the data, which performs an exhaustive search for a regression model yielding the best predictive power, with heuristic ‘rules of thumb’ and expert knowledge as regularization parameters. It uses multiple hold-outs in order to internally validate the model. The package can take into account multiple factors such as collinearity of the predictors, event per variable rules (EPVs) and R-squared statistics during the model selection. In addition, other constraints, such as forcing specific terms and restricting the complexity of the predictive models, can be used. The package can also take pairwise and three-way interactions between variables into account. These candidate models are then ranked by predictive power, which is assessed via multiple hold-out procedures and can be parallelised in order to reduce the computational time. Models which exhibit the highest average predictive power over all hold-outs are returned. This is quantified as absolute and relative error in the case of continuous outcomes, and as accuracy and AUROC values in the case of categorical outcomes. In this paper we briefly present the statistical framework of the package and discuss the complexity of the underlying algorithm. Moreover, using CARRoT and a number of datasets available in R, we provide a comparison of different model selection techniques: based on EPVs alone, on EPVs and R-squared statistics, on lasso regression, on including only statistically significant predictors, and on the stepwise forward selection technique.
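
    To make the "rule of ten" and the exhaustive search concrete, here is a small sketch of how an events-per-variable limit bounds the set of candidate models. It is not CARRoT's interface: the function names and predictor names are hypothetical, and the sketch leaves out the hold-out ranking, collinearity checks, and R-squared constraints the package is described as applying.

```python
from itertools import combinations


def max_predictors(n_events, events_per_variable=10):
    """Largest model size allowed by an events-per-variable rule."""
    return n_events // events_per_variable


def candidate_models(predictors, n_events, events_per_variable=10):
    """Yield every predictor subset whose size respects the EPV limit."""
    limit = min(len(predictors), max_predictors(n_events, events_per_variable))
    for size in range(1, limit + 1):
        yield from combinations(predictors, size)


# With 25 events and the rule of ten, at most 2 predictors are allowed.
models = list(candidate_models(["age", "sex", "bmi"], n_events=25))
for model in models:
    print(model)
```

    In a fuller version each candidate would be fitted on repeated training splits and scored on the matching hold-out sets, with the average score used for ranking.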

  12. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits, and here we'll use the r/AskScience subreddit.

    The dataset is extracted from the r/AskScience subreddit on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 datapoints and 25 columns. The dataset contains information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, with a little cleaning done using NumPy and pandas (see the descriptions of individual columns below).

    The dataset contains the following columns and descriptions:

    • author - Redditor name.
    • author_fullname - Redditor full name.
    • contest_mode - Contest mode [implement obscured scores and randomized sorting].
    • created_utc - Time the submission was created, represented in Unix time.
    • domain - Domain of submission.
    • edited - If the post is edited or not.
    • full_link - Link of the post on the subreddit.
    • id - ID of the submission.
    • is_self - Whether or not the submission is a self post (text-only).
    • link_flair_css_class - CSS class used to identify the flair.
    • link_flair_text - Flair on the post, or the link flair's text content.
    • locked - Whether or not the submission has been locked.
    • num_comments - The number of comments on the submission.
    • over_18 - Whether or not the submission has been marked as NSFW.
    • permalink - A permalink for the submission.
    • retrieved_on - Time ingested.
    • score - The number of upvotes for the submission.
    • description - Description of the submission.
    • spoiler - Whether or not the submission has been marked as a spoiler.
    • stickied - Whether or not the submission is stickied.
    • thumbnail - Thumbnail of the submission.
    • question - Question asked in the submission.
    • url - The URL the submission links to, or the permalink if a self post.
    • year - Year of the submission.
    • banned - Banned by the moderator or not.

    This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
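
    As one small example of the year-over-year exploration mentioned above, the created_utc column (Unix time) can be converted to calendar years and counted. The function name and the sample timestamps are illustrative, and the sketch assumes the column has already been loaded as a list of integers.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_year(created_utc_values):
    """Count submissions per calendar year from Unix timestamps (UTC)."""
    years = (
        datetime.fromtimestamp(ts, tz=timezone.utc).year
        for ts in created_utc_values
    )
    return Counter(years)


# Illustrative timestamps: start of 2016, end of 2016, start of 2017 (UTC).
sample = [1451606400, 1483228799, 1483228800]
counts = posts_per_year(sample)
for year in sorted(counts):
    print(year, counts[year])
```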

  13. Plotly Dashboard Healthcare

    • kaggle.com
    Updated Jan 4, 2022
    Cite
    A SURESH (2022). Plotly Dashboard Healthcare [Dataset]. https://www.kaggle.com/datasets/sureshmecad/plotly-dashboard-healthcare/discussion
    Dataset updated
    Jan 4, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    A SURESH
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data Visualization

    Content

    a. Scatter plot

      i. The web app should allow the user to select genes from datasets and plot 2D scatter plots between 2 variables (expression/copy_number/chronos) for any pair of genes.

      ii. The user should be able to filter and color data points using metadata information available in the file “metadata.csv”.

      iii. The visualization could be interactive. It would be great if the user could hover over the data points on the plot and get the relevant information (hint: visit https://plotly.com/r/, https://plotly.com/python).

      iv. Here is a quick reference for you. The scatter plot is between the chronos score for the TTBK2 gene and expression for the MORC2 gene, with coloring defined by the Gender/Sex column from the metadata file.

    b. Boxplot/violin plot

      i. The user should be able to select a gene and a variable (expression / chronos / copy_number) and generate a boxplot to display its distribution across multiple categories as defined by a user-selected variable (a column from the metadata file).

      ii. Here is an example for your reference, where a violin plot of the CHRONOS score for gene CCL22 is plotted and grouped by ‘Lineage’.


    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  14. Data from: Exploratory analysis of sleep deprivation effects on gene...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Mar 20, 2025
    Cite
    Lily Bai; Ramanuj Sarkar; Faith Lee; Joseph Wu; Marquis Vawter (2025). Exploratory analysis of sleep deprivation effects on gene expression and regional brain metabolism [Dataset]. http://doi.org/10.5061/dryad.pzgmsbcz0
    Available download formats: zip
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    University of California, Irvine
    Johns Hopkins University
    Hackensack University Medical Center
    Authors
    Lily Bai; Ramanuj Sarkar; Faith Lee; Joseph Wu; Marquis Vawter
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Sleep deprivation affects cognitive performance and immune function, yet its mechanisms and biomarkers remain unclear. This study explored the relationships among gene expression, brain metabolism, sleep deprivation, and sex differences. Methods Fluorodeoxyglucose-18 positron emission tomography (18F-FDG PET) measured brain metabolism in regions of interest (ROIs), and RNA analysis of blood samples assessed gene expression pre- and post-sleep deprivation. Mixed model regression and principal component analysis (PCA) identified significant genes and regional metabolic changes. Results There were 23 and 28 differentially expressed probesets for the main effects of sex and sleep deprivation, respectively, and 55 probesets for their interaction (FDR-corrected p<0.05). Functional analysis revealed enrichment in nucleoplasm- and UBL conjugation-related genes. Genes showing significant sex effects mapped to chromosomal regions Y and 19 (Benjamini-Hochberg (BH) FDR p<0.05), with 11 genes (4%) and 29 genes (10.5%) involved, respectively. Differential gene expression highlighted sex-based differences in innate and adaptive immunity. For brain metabolism, sleep deprivation resulted in significant decreases in the left insula, medial prefrontal cortex (BA32), somatosensory cortex (BA1/2), and motor premotor cortex (BA6) and increases in the right inferior longitudinal fasciculus, primary visual cortex (BA17), amygdala, cerebellum, and bilateral pons. Hemispheric asymmetry in brain metabolism was observed, with BA6 decreases correlating with increased UBL conjugation gene expression. Conclusion Sleep deprivation broadly impacts brain metabolism, gene expression, and immune function, revealing cellular stress responses and hemispheric vulnerability. These findings enhance understanding of the molecular and functional effects of sleep deprivation. 
    Methods

    Sleep Deprivation

    Eight healthy subjects, 4 male and 4 female, were recruited from the University of California Irvine, after IRB approval. On day 1, subjects were initially assigned a 24-hour period of normal activity (e.g., walking, talking, studying, watching TV, playing games, using the computer). These subjects were tested on the Psychomotor Vigilance Test (PVT) and asked to rate their subjective level of sleepiness on the Stanford Sleepiness Scale (SSS) at baseline. Higher scores on the PVT indicate a longer, more delayed response time, while higher scores on the SSS indicate greater degrees of sleepiness. The SSS scale is shown in Table 1. Each subject’s PVT performance and subjective sleepiness ratings (SSS) were recorded both before and after sleep deprivation (Table 2). There was no significant difference in age between male and female subjects (Table 3), all of whom had no prior psychiatric history. Blood samples were collected on the baseline day at 1 p.m., pre-sleep deprivation (pre-SD). Sleep deprivation activities and blood sample acquisition times are recorded in Table 4. At the end of day 1 (11 p.m.), subjects were moved to an outpatient research facility for the sleep deprivation protocol. They were requested not to nap or sleep during the sleep deprivation period, and were additionally tasked with filling out forms and answering questions about their mood every two to four hours. Staff members monitored the subjects during the sleep deprivation period. Subjects were allowed to walk, talk, study, watch TV, play games or cards, read, and use the computer, but were not allowed caffeinated foods or beverages. A second blood sample was collected 18 hours after starting sleep deprivation activities (SD Day 2, 1 p.m.), after which subjects completed the protocol and were driven home by cab.

    Gene Data Processing

    Blood samples (3 ml) were drawn from each subject into Tempus tubes (ABI, ThermoFisher, Carlsbad, CA) 24 hours apart. The blood samples collected at baseline and 18 hours after starting sleep deprivation activities were processed with Affymetrix HG-U133 Plus 2.0 gene expression microarray chips according to the manufacturer’s instructions (Affymetrix, ThermoFisher, Carlsbad, CA). Data processing was done using R 4.2 and Bioconductor 3.16 [32]. The Affymetrix HG-U133 Plus 2.0 microarray ‘cel’ files were read using the affy routine with the hgu133plus2.db package. Quantile normalization was used to standardize probeset data [33]. A linear model was fitted to the expression data for each probeset using ‘lmFit’ from the limma package to eliminate weakly expressed probesets, and the top 40,000 probesets were found using the topTable function. Type III mixed ANOVA was implemented using the lmerTest library in R, with the main effects being sex, sleep deprivation, and sleep deprivation-sex interaction. Age and RNA integrity number (RIN) were used as covariates. The top 300 probesets for each main effect from mixed ANOVA and PCA were analyzed for enrichment using the Database for Annotation, Visualization and Integrated Discovery (DAVID) [34; 35]. Principal component analysis was conducted using the pca function with normalized and scaled expression data.

    F18-FDG PET Scan Processing

    Pre-SD and post-SD F18-FDG PET scans were obtained from each subject. Each F18-FDG PET scan was normalized in MATLAB (Mathworks, Sherborn, Massachusetts, USA) using Statistical Parametric Mapping (SPM) 5 software (Functional Imaging Laboratory, Wellcome Department of Cognitive Neurology, University College London, London, UK) to spatially transform the images to a template conforming to the space derived from standard brains from the Montreal Neurological Institute, and to convert them to the space of the stereotactic atlas of Talairach and Tournoux. The images were then smoothed with a Gaussian low-pass filter of 8 mm to minimize noise and improve spatial alignment.

    Region of interest (ROI) analysis was done by extracting metabolic values from the ROIs using VINCI (“Volume Imaging in Neurological Research, Co-Registration and ROI included”) software. Supplementary Figure 1 shows ROI segmentation of FDG-PET scans labeled with brain regions and Brodmann areas (BA). A type III mixed two-way ANOVA was implemented using the lmerTest library in R. The model considered sex as a between-subjects factor and condition (pre-sleep deprivation vs. post-sleep deprivation) as a within-subjects factor. Principal component analysis was performed using the pca() function in the Bioconductor environment [32] in R. Prior to extracting principal components, all probesets were scaled in R by subtracting the mean value and dividing by the standard deviation for that variable.
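The scaling step described before PCA (center each variable on its mean, then divide by its standard deviation) can be illustrated as follows. This is a sketch in plain Python rather than the R code used in the study; the function name `zscore` and the example values are hypothetical, and the use of the sample standard deviation (n - 1 denominator, as R's sd() computes) is an assumption, since the description does not say which estimator was used.

```python
from statistics import mean, stdev


def zscore(values):
    """Center on the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)  # sample SD (n - 1 denominator); an assumption here
    return [(v - mu) / sd for v in values]


# Hypothetical expression values for one probeset across four samples.
expression = [2.0, 4.0, 6.0, 8.0]
scaled = zscore(expression)
```

After scaling, every variable has mean 0 and standard deviation 1, so no single probeset dominates the principal components merely because of its measurement scale.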

  15. Smartwatch Health Data (Uncleaned)

    • kaggle.com
    Updated Feb 14, 2025
    Cite
    Mohammed Arfath R (2025). Smartwatch Health Data (Uncleaned) [Dataset]. https://www.kaggle.com/datasets/mohammedarfathr/smartwatch-health-data-uncleaned/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    Mohammed Arfath R
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset simulates health-related outputs from a smartwatch, mimicking real-world issues in data collection, making it perfect for applying data preprocessing techniques such as handling missing values, outliers, duplicates, and inconsistencies.

    Dataset Overview:

    • Total Rows: 10,000
    • Total Columns: 7
    • Use Case: Health monitoring using smartwatch sensor data
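The preprocessing tasks named in the description (duplicates, missing values, outliers) might be approached along these lines. This is only a sketch in plain Python: the column names `user_id` and `heart_rate`, the example rows, and the plausible range of 30 to 220 are all hypothetical, since the description does not list the dataset's actual columns.

```python
from statistics import median

# Hypothetical records; the real column names are not given in the description.
rows = [
    {"user_id": 1, "heart_rate": 72.0},
    {"user_id": 1, "heart_rate": 72.0},   # exact duplicate
    {"user_id": 2, "heart_rate": None},   # missing value
    {"user_id": 3, "heart_rate": 68.0},
    {"user_id": 4, "heart_rate": 300.0},  # implausible outlier
    {"user_id": 5, "heart_rate": 75.0},
]


def drop_duplicates(records):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def fill_missing(records, column):
    """Replace missing values in a column with the column median."""
    fill = median(r[column] for r in records if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]


def drop_out_of_range(records, column, low, high):
    """Drop records whose value falls outside a plausible range."""
    return [r for r in records if low <= r[column] <= high]


cleaned = drop_duplicates(rows)
cleaned = fill_missing(cleaned, "heart_rate")
cleaned = drop_out_of_range(cleaned, "heart_rate", 30.0, 220.0)
```

The median is used for imputation because it is less sensitive than the mean to the very outliers that are removed in the last step; other orderings and imputation rules are equally defensible.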

  16. Data and Code for the paper "GUI Testing of Android Applications:...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 25, 2023
    Cite
    Sergio Di Martino (2023). Data and Code for the paper "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7260111
    Explore at:
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Luigi Libero Lucio Starace
    Anna Rita Fasolino
    Sergio Di Martino
    Porfirio Tramontana
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This package contains data and code to replicate the findings presented in our paper titled "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies".

    Abstract

    Graphical User Interface (GUI) testing plays a pivotal role in ensuring the quality and functionality of mobile apps. In this context, Exploratory Testing (ET), a distinctive methodology in which individual testers pursue a creative and experience-based approach to test design, is often used as an alternative or in addition to traditional scripted testing. Managing the exploratory testing process is a challenging task that can easily result either in wasteful spending or in inadequate software quality, due to the relative unpredictability of exploratory testing activities, which depend on the skills and abilities of individual testers. A number of works have investigated the diversity of testers’ performance when using ET strategies, often in a crowdtesting setting. These works, however, investigated ET effectiveness in detecting bugs, and not in scenarios in which the goal is also to generate a re-executable test suite. Moreover, less work has been conducted on evaluating the impact of adopting different exploratory testing strategies. As a first step towards filling this gap in the literature, in this work we conduct an empirical evaluation involving four open-source Android apps and twenty master's students, who we believe can be representative of practitioners partaking in exploratory testing activities. The students were asked to generate test suites for the apps using a Capture and Replay tool and different exploratory testing strategies. We then compare the effectiveness, in terms of aggregate code coverage, that different-sized groups of students using different exploratory testing strategies may achieve. Results provide project managers interested in using exploratory approaches to test simple Android apps with deeper insights into code coverage dynamics, on the basis of which they can make more informed decisions.

    Contents and Instructions

    This package contains:

    • apps-under-test.zip: A zip archive containing the source code of the four Android applications we considered in our study, namely MunchLife, TippyTipper, Trolly, and SimplyDo.

    • apps-under-test-instrumented.zip: A zip archive containing the instrumented source code of the four Android applications we used to compute branch coverage.

    • students-test-suites.zip: A zip archive containing the test suites developed by the students using Uninformed Exploratory Testing (referred to as "Black Box" in the subdirectories) and Informed Exploratory Testing (referred to as "White Box" in the subdirectories). This also includes coverage reports.

    • compute-coverage-unions.zip: A zip archive containing Python scripts we developed to compute the aggregate LOC coverage of all possible subsets of students. The scripts have been tested on MS Windows. To compute the LOC coverage achieved by any possible subsets of testers using IET and UET strategies, run the analysisAndReport.py script. To compute the LOC coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the analysisAndReport_UET_IET_combinations_emma.py script.

    • branch-coverage-computation.zip: A zip archive containing Python scripts we developed to compute the aggregate branch coverage of all considered subsets of students. The scripts have been tested on MS Windows. To compute the branch coverage achieved by any possible subsets of testers using UET and I+UET strategies, run the branch_coverage_analysis.py script. To compute the code coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the mixed_branch_coverage_analysis.py script.

    • data-analysis-scripts.zip: A zip archive containing R scripts to merge and manipulate coverage data, to carry out statistical analysis and draw plots. All data concerning RQ1 and RQ2 is available as a ready-to-use R data frame in the ./data/all_coverage_data.rds file. All data concerning RQ3 is available in the ./data/all_mixed_coverage_data.rds file.
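The idea of aggregate coverage over all possible subsets of testers can be sketched as a union of per-tester coverage sets. The snippet below is not taken from the archived scripts; the tester names, covered line IDs, and `TOTAL_LINES` are hypothetical values chosen only to show the computation.

```python
from itertools import combinations

# Hypothetical per-tester coverage: the set of line IDs each tester's suite hit.
covered = {
    "tester_a": {1, 2, 3},
    "tester_b": {3, 4},
    "tester_c": {5},
}
TOTAL_LINES = 10  # hypothetical size of the app under test


def aggregate_coverage(group):
    """Fraction of lines covered by at least one tester in the group."""
    union = set().union(*(covered[t] for t in group))
    return len(union) / TOTAL_LINES


def mean_coverage_by_group_size():
    """Average aggregate coverage over all tester subsets of each size."""
    result = {}
    for k in range(1, len(covered) + 1):
        groups = list(combinations(sorted(covered), k))
        result[k] = sum(aggregate_coverage(g) for g in groups) / len(groups)
    return result


by_size = mean_coverage_by_group_size()
```

Because the number of subsets grows exponentially with the number of testers, enumerating every subset is feasible for a study of twenty students but would need sampling for much larger crowds.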

  17. Supplementary code and data for the paper `From stage to page: language...

    • data.niaid.nih.gov
    Updated Mar 22, 2023
    Cite
    Šeļa, Artjoms (2023). Supplementary code and data for the paper `From stage to page: language independent bootstrap measures of distinctiveness in fictional speech` [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7383686
    Explore at:
    Dataset updated
    Mar 22, 2023
    Dataset provided by
    Nagy, Benjamin
    Šeļa, Artjoms
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The repository provides the full data and the processing and analysis pipeline for the paper 'From stage to page: language independent bootstrap measures of distinctiveness in fictional speech'.

    Rendered notebooks are also available through GitHub:

    1) Preparation, energy distance and exploration (main)

    2) Keyword curves & formal modeling

    • 00_dracor_get_data.R uses the dedicated DraCor API to get the texts spoken by characters

    • 01_distinctiveness_energy.ipynb does the heavy lifting of data wrangling, cleaning and preprocessing, implements energy distance bootstrapping, and does exploratory analysis

    • 02_logodds_curves.R calculates keyword curves for characters

    • 03_analysis_and_models.R explores keyword curves and fits Bayesian models
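For readers unfamiliar with energy distance, the quantity compares two samples through their average pairwise distances: twice the mean cross-sample distance minus the mean within-sample distance of each sample. The sketch below is a simplified one-dimensional version in plain Python; it is not the notebook's implementation, it omits the bootstrapping step, and the function names and example values are hypothetical.

```python
from itertools import product


def mean_abs_diff(xs, ys):
    """Average absolute difference over all pairs (x, y)."""
    return sum(abs(x - y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Squared energy distance between two 1-D samples (V-statistic form)."""
    return 2 * mean_abs_diff(xs, ys) - mean_abs_diff(xs, xs) - mean_abs_diff(ys, ys)


# Hypothetical one-dimensional feature values for two characters.
same = energy_distance([0.0, 1.0], [0.0, 1.0])
shifted = energy_distance([0.0, 1.0], [2.0, 3.0])
```

Identical samples give zero, and the value grows as the two samples move apart, which is what makes the statistic usable as a measure of how distinctive one speaker's text is from another's.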

  18. Data from: Taxonomic revision of Stigmatomma Roger (Hymenoptera: Formicidae)...

    • zenodo.org
    • gbif.org
    • +2 more
    Updated Jul 19, 2024
    Cite
    Flavia A. Esteves; Brian L. Fisher (2024). Data from: Taxonomic revision of Stigmatomma Roger (Hymenoptera: Formicidae) in the Malagasy region [Dataset]. http://doi.org/10.5061/dryad.m7340
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Flavia A. Esteves; Brian L. Fisher
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In this study we present the first taxonomic revision of the ant genus Stigmatomma in the Malagasy biogeographic region, redescribe the previously known S. besucheti Baroni-Urbani, and describe seven species new to science (S. bolabola sp. n., S. irayhady sp. n., S. janovitsika sp. n., S. liebe sp. n., S. roahady sp. n., S. sakalava sp. n., and S. tsyhady sp. n.). The revision is based on the worker caste, but we provide brief descriptions of gynes and males for some species. Species descriptions, diagnosis, character discussion, identification key, and glossary are illustrated with 360 high-quality montage and SEM images. The distribution of Stigmatomma species in Madagascar is mapped and discussed within the context of the island's biomes and ecoregions. We also discuss how some morphometric variables describe the differences among the species in the bioregion. Open science is supported by providing access to R scripts, raw measurement data, and all specimen data used. All specimens used in this study were given unique identifiers, and holotypes were imaged. Specimens and images are made accessible on AntWeb.org.

  19. Data from: Age estimation of captive Asian elephants (Elephas maximus) based...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Nov 29, 2023
    Cite
    Kana Arai; Huiyuan Qi; Miho Inoue-Murayama (2023). Age estimation of captive Asian elephants (Elephas maximus) based on DNA methylation: An exploratory analysis using methylation-sensitive high-resolution melting (MS-HRM) [Dataset]. http://doi.org/10.5061/dryad.qjq2bvqnb
    Explore at:
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Kana Arai; Huiyuan Qi; Miho Inoue-Murayama
    Time period covered
    Jan 1, 2023
    Description

    Age is an important parameter for bettering the understanding of biodemographic trends (development, survival, reproduction and environmental effects) critical for conservation. However, current age estimation methods are challenging to apply to many species, and no standardised technique has been adopted yet. This study examined the potential use of methylation-sensitive high-resolution melting (MS-HRM), a labour-, time-, and cost-effective method to estimate chronological age from DNA methylation in Asian elephants (Elephas maximus). The objective of this study was to investigate the accuracy and validation of MS-HRM use for age determination in long-lived species, such as Asian elephants. The average lifespan of Asian elephants is between 50 and 70 years, but some have been known to survive for more than 80 years. DNA was extracted from 53 blood samples of captive Asian elephants across 11 zoos in Japan, with known ages ranging from a few months to 65 years. Methylation rates of two candidate...

    Estimation of captive Asian elephants (Elephas maximus) age based on DNA methylation: An exploratory analysis using methylation-sensitive high-resolution melting (MS-HRM)

    Description of the data and file structure

    1. The raw methylation data of RALYL and TET2 used in this analysis are in csv files.

      This is the dataset that we used to develop the age estimation model. Rows sharing a 'subject ID' belong to the same individual, whereas 'sample' identifies the sample collections taken over time. In addition, cells containing 'n/a' in the 'sample' column are samples that were collected recently and had no defined ID number at the time (please refer to the Supplementary Information of the manuscript for more details on sampling dates). 'sex' represents the sex of the individual, where F: female and M: male. 'age' represents the chronological age of the individual at the time of sampling. 'ralyl_methylationrate_ave' and 'tet2_methylatio...

  20. Data from: Spatially explicit environmental variables at 25m resolution for...

    • data.4tu.nl
    zip
    Updated May 20, 2024
    Cite
    Anatol Helfenstein; Vera L. Mulder; Mirjam J.D. Hack-ten Broeke; Maarten van Doorn; Kees Teuling; Dennis J.J. Walvoort; Gerard B.M. Heuvelink (2024). Spatially explicit environmental variables at 25m resolution for spatial modelling in the Netherlands [Dataset]. http://doi.org/10.4121/6af610ed-9006-4ac5-b399-4795c2ac01ec.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    May 20, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Anatol Helfenstein; Vera L. Mulder; Mirjam J.D. Hack-ten Broeke; Maarten van Doorn; Kees Teuling; Dennis J.J. Walvoort; Gerard B.M. Heuvelink
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Netherlands
    Dataset funded by
    Wageningen Environmental Research, Wageningen University & Research, Dutch Ministry of Agriculture, Nature and Food Quality
    Description

    This dataset contains 206 spatially explicit environmental variables, also termed covariates, at 25m resolution that cover the entire Netherlands (national scale). The raster data comprise covariates related to the soil-forming factors (climate, organism/land use/land cover, relief/topography, parent material/geology) and were prepared for digital soil mapping. However, since the covariates cover a wide range of environmental variables, they can potentially be used for spatial modelling in the Netherlands outside the field of soil science as well. All covariates can also be obtained from their original sources, but the potential strength and practicality of this dataset lies in the broad range of readily available, collected, prepared and harmonized raster data.

    The metadata of all the covariates in this dataset can be found in the "00_covariates_metadata.csv" file, including information about the names, category, value types, specific value types, type of geospatial data, file type, whether it is static or dynamic, temporal coverage, date/version, resolution (all 25m), origin, source, access/license, description, processing steps and comments. The dataset includes 3 different types of files:

    • GeoTIFF (.tif): the covariates as raster data at 25m resolution in the EPSG:28992 (Amersfoort / RD New) spatial projection
    • Text (.txt): README files for each covariate with additional metadata information (filename ending in "_readme.txt")
    • Tabular data (.csv): Classification and re-classification table for categorical covariates (filename ending in "_reclassify.csv")

    Note that the reclassification tables contain potential ways to reclassify the data provided, but can be altered by the user. Reclassification may be useful for categorical covariates with a large number of classes/categories. Note that covariates with CC BY-ND 4.0 licenses, covariates that are not open data or for which the license was unknown are not shared in this dataset.
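Reclassifying a categorical covariate amounts to mapping each original class code to a merged class through a lookup table. The sketch below illustrates the idea in plain Python on a tiny grid; the class codes, the table contents, and the `NODATA` marker are hypothetical and do not reflect the contents of the "_reclassify.csv" files.

```python
# Hypothetical reclassification table: original class code -> merged class code.
reclass_table = {11: 1, 12: 1, 21: 2, 22: 2, 31: 3}
NODATA = -1  # hypothetical marker for codes absent from the table

# Hypothetical 2 x 3 block of a categorical raster.
raster = [
    [11, 12, 21],
    [22, 31, 99],
]


def reclassify(grid, table, nodata=NODATA):
    """Map every cell through the table; unknown codes become nodata."""
    return [[table.get(cell, nodata) for cell in row] for row in grid]


reclassified = reclassify(raster, reclass_table)
```

Editing the lookup table is all that is needed to try a different grouping, which matches the note that the provided tables can be altered by the user.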

    More information about these covariates can be found in the associated scientific paper "BIS-4D: Mapping soil properties and their uncertainties at 25m resolution in the Netherlands" (Helfenstein et al., 2024, under review). Different ways of pre-processing and preparing the covariates for subsequent modelling can be found in R scripts 20-25 in the associated code repository on GitLab. This includes assembling and preparing covariates using GDAL ("20_cov_prep_gdal.R"), computing digital elevation model (DEM) derivatives using SAGA GIS ("21_cov_dem_deriv_saga.R"), deriving spectral indices from RGBNIR bands of Sentinel 2 images ("22_cov_sensing_deriv.R"), preparing categorical covariates using GDAL ("23_cov_cat_recl_gdal.R"), deriving dynamic covariates ("24_cov_dyn_prep_gdal.R") and exploratory analysis of the covariates ("25_cov_expl_analysis_clorpt.Rmd", "25_cov_expl_analysis_cont_cat.Rmd").
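As an illustration of what deriving a spectral index from red and near-infrared bands involves, the snippet below computes the widely used normalized difference vegetation index (NDVI). Whether the associated R script computes NDVI or other indices is not stated in the description, so this choice is an assumption, and the reflectance values are hypothetical.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    if total == 0:
        return None  # undefined when both reflectances are zero
    return (nir - red) / total


# Hypothetical (NIR, red) reflectance pairs for three pixels.
pixels = [(0.5, 0.1), (0.2, 0.2), (0.0, 0.0)]
values = [ndvi(nir, red) for nir, red in pixels]
```

Values near 1 indicate dense green vegetation, values near 0 indicate bare or built-up surfaces, and pixels with no signal in either band are left undefined rather than forced to a number.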

Liam D. Bailey; Martijn van de Pol (2023). climwin: An R Toolbox for Climate Window Analysis [Dataset]. http://doi.org/10.1371/journal.pone.0167980

climwin: An R Toolbox for Climate Window Analysis

Explore at:
102 scholarly articles cite this dataset (View in Google Scholar)
