50 datasets found

R-script to Analyse Data
uvaauas.figshare.com
txt
Updated Apr 4, 2022
Cite
T. Blanke (2022). R-script to Analyse Data [Dataset]. http://doi.org/10.21942/uva.14346842.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.21942/uva.14346842.v1
Dataset updated
Apr 4, 2022
Dataset provided by
University of Amsterdam / Amsterdam University of Applied Sciences
Authors
T. Blanke
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Exploratory data analysis and visualisation of datasets
Additional file 1 of Simple but powerful interactive data analysis in R with...
springernature.figshare.com
zip
Updated Nov 26, 2024
Cite
Svetlana Ovchinnikova; Simon Anders (2024). Additional file 1 of Simple but powerful interactive data analysis in R with R/LinkedCharts [Dataset]. http://doi.org/10.6084/m9.figshare.26677037.v2
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.26677037.v2
Dataset updated
Nov 26, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Svetlana Ovchinnikova; Simon Anders
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 1. Zip file containing the interactive supplement.
Exploratory Data Analysis (EDA) Tools Report
marketreportanalytics.com
doc, pdf, ppt
Updated Apr 2, 2025
Cite
Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54369
Explore at:
Available download formats: doc, ppt, pdf
Dataset updated
Apr 2, 2025
Dataset authored and provided by
Market Report Analytics
License
https://www.marketreportanalytics.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across industries. The rising need for data-driven decision-making, coupled with the expanding adoption of cloud-based analytics solutions, is fueling market expansion. While precise figures for market size and CAGR are not provided, a reasonable estimation, based on the prevalent growth in the broader analytics market and the crucial role of EDA in the data science workflow, would place the 2025 market size at approximately $3 billion, with a projected Compound Annual Growth Rate (CAGR) of 15% through 2033. This growth is segmented across various applications, with large enterprises leading the adoption due to their higher investment capacity and complex data needs. However, SMEs are witnessing rapid growth in EDA tool adoption, driven by the increasing availability of user-friendly and cost-effective solutions. Further segmentation by tool type reveals a strong preference for graphical EDA tools, which offer intuitive visualizations facilitating better data understanding and communication of findings. Geographic regions, such as North America and Europe, currently hold a significant market share, but the Asia-Pacific region shows promising potential for future growth owing to increasing digitalization and data generation. Key restraints to market growth include the need for specialized skills to effectively utilize these tools and the potential for data bias if not handled appropriately. The competitive landscape is dynamic, with both established players like IBM and emerging companies specializing in niche areas vying for market share. Established players benefit from brand recognition and comprehensive enterprise solutions, while specialized vendors provide innovative features and agile development cycles. 
Open-source options like KNIME, the R package Rattle, and the Python package Pandas Profiling offer cost-effective alternatives, particularly attracting academic institutions and smaller businesses. The ongoing development of advanced analytics functionalities, such as automated machine learning integration within EDA platforms, will be a significant driver of future market growth. Further, the integration of EDA tools within broader data science platforms is streamlining the overall analytical workflow, contributing to increased adoption and reduced complexity. The market's evolution hinges on enhanced user experience, more robust automation features, and seamless integration with other data management and analytics tools.
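As a rough check of what the estimate above implies, the sketch below compounds the approximate $3 billion 2025 figure at 15% per year through 2033. The function name `project_market_size` is illustrative, and the inputs are the report's own approximations rather than measured values.

```python
def project_market_size(base_size, cagr, base_year, target_year):
    """Compound base_size forward at a constant annual growth rate."""
    years = target_year - base_year
    return base_size * (1.0 + cagr) ** years


# Inputs taken from the description above: ~$3 billion in 2025, 15% CAGR.
for year in range(2025, 2034):
    size = project_market_size(3.0, 0.15, 2025, year)
    print(f"{year}: ~${size:.2f} billion")
```

Under these assumptions the 2033 figure comes out near $9.18 billion, roughly three times the 2025 estimate.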
Data and R scripts for 'Reliability of geochemical analyses: Deja vu all...
data.mendeley.com
Updated Mar 12, 2019
Cite
Ola Anfin Eggen (2019). Data and R scripts for 'Reliability of geochemical analyses: Deja vu all over again' [Dataset]. http://doi.org/10.17632/pvw557y82p.1
Unique identifier
https://doi.org/10.17632/pvw557y82p.1
Dataset updated
Mar 12, 2019
Authors
Ola Anfin Eggen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The zipped file contains the following:
- data (as csv, in the 'data' folder),
- R scripts (as Rmd, in the rro folder),
- figures (as pdf, in the 'figs' folder), and
- presentation (as html, in the root folder).
Physical Properties of Lakes: Exploratory Data Analysis
search.dataone.org
hydroshare.org
Updated Apr 15, 2022
+ more versions
Cite
Gabriela Garcia; Kateri Salk (2022). Physical Properties of Lakes: Exploratory Data Analysis [Dataset]. https://search.dataone.org/view/sha256%3A82a3bd46ad259724cad21b7a344728253ea4e6d929f6134e946c379585f903f6
Dataset updated
Apr 15, 2022
Dataset provided by
Hydroshare
Authors
Gabriela Garcia; Kateri Salk
Time period covered
May 27, 1984 - Aug 17, 2016
Description
Exploratory Data Analysis for the Physical Properties of Lakes

This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the first part of a two-part exercise focusing on the physical properties of lakes.

Introduction

Lakes are dynamic, nonuniform bodies of water in which the physical, biological, and chemical properties interact. Lakes also contain the majority of Earth's fresh water supply. This lesson introduces exploratory data analysis using R statistical software in the context of the physical properties of lakes.

Learning Objectives

After successfully completing this exercise, you will be able to:

Apply exploratory data analytics skills to applied questions about physical properties of lakes

Communicate findings with peers through oral, visual, and written modes
Data from: Superheat: An R Package for Creating Beautiful and Extendable...
tandf.figshare.com
bin
Updated Mar 4, 2024
Cite
Rebecca L. Barter; Bin Yu (2024). Superheat: An R Package for Creating Beautiful and Extendable Heatmaps for Visualizing Complex Data [Dataset]. http://doi.org/10.6084/m9.figshare.6287693.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.6287693.v1
Dataset updated
Mar 4, 2024
Dataset provided by
Taylor & Francis
Authors
Rebecca L. Barter; Bin Yu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The technological advancements of the modern era have enabled the collection of huge amounts of data in science and beyond. Extracting useful information from such massive datasets is an ongoing challenge as traditional data visualization tools typically do not scale well in high-dimensional settings. An existing visualization technique that is particularly well suited to visualizing large datasets is the heatmap. Although heatmaps are extremely popular in fields such as bioinformatics, they remain a severely underutilized visualization tool in modern data analysis. This article introduces superheat, a new R package that provides an extremely flexible and customizable platform for visualizing complex datasets. Superheat produces attractive and extendable heatmaps to which the user can add a response variable as a scatterplot, model results as boxplots, correlation information as barplots, and more. The goal of this article is two-fold: (1) to demonstrate the potential of the heatmap as a core visualization method for a range of data types, and (2) to highlight the customizability and ease of implementation of the superheat R package for creating beautiful and extendable heatmaps. The capabilities and fundamental applicability of the superheat package will be explored via three reproducible case studies, each based on publicly available data sources.
Data_Sheet_5_A mathematical and exploratory data analysis of malaria disease...
frontiersin.figshare.com
pdf
Updated Jun 11, 2023
+ more versions
Cite
Michael O. Adeniyi; Oluwaseun R. Aderele; Olajumoke Y. Oludoun; Matthew I. Ekum; Maba B. Matadi; Segun I. Oke; Daniel Ntiamoah (2023). Data_Sheet_5_A mathematical and exploratory data analysis of malaria disease transmission through blood transfusion.PDF [Dataset]. http://doi.org/10.3389/fams.2023.1105543.s013
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fams.2023.1105543.s013
Dataset updated
Jun 11, 2023
Dataset provided by
Frontiers
Authors
Michael O. Adeniyi; Oluwaseun R. Aderele; Olajumoke Y. Oludoun; Matthew I. Ekum; Maba B. Matadi; Segun I. Oke; Daniel Ntiamoah
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Malaria is a mosquito-borne disease spread by an infected vector (infected female Anopheles mosquito) or through transfusion of plasmodium-infected blood to susceptible individuals. The disease burden has resulted in high global mortality, particularly among children under the age of five. Many intervention responses have been implemented to control malaria disease transmission, including blood screening, Long-Lasting Insecticide Bed Nets (LLIN), treatment with an anti-malaria drug, spraying chemicals/pesticides on mosquito breeding sites, and indoor residual spray, among others. As a result, the SIR (Susceptible-Infected-Recovered) model was developed to study the impact of various malaria control and mitigation strategies. The associated basic reproduction number and stability theory are used to investigate the stability of the model equilibrium points. By constructing an appropriate Lyapunov function, the global stability of the malaria-free equilibrium is investigated. By determining the direction of bifurcation, the implicit function theorem is used to investigate the stability of the model endemic equilibrium. The model is fitted to malaria data from Benue State, Nigeria, using R and MATLAB. Estimates of parameters were made. Following that, an optimal control model is developed and analyzed using Pontryagin's Maximum Principle. The malaria-free equilibrium point is locally and globally stable if the basic reproduction number (R0) and the blood transfusion reproduction number (Rα) are both less than or equal to unity. The study of the sensitive parameters of the model revealed that the transmission rate of malaria from mosquito-to-human (βmh), transmission rate from humans-to-mosquito (βhm), blood transfusion reproduction number (Rα) and recruitment rate of mosquitoes (bm) are all sensitive parameters capable of increasing the basic reproduction number (R0), thereby increasing the risk of spreading malaria.
The result of the optimal control shows that five possible controls are effective in reducing the transmission of malaria. The study recommends the combination of all five controls, followed by combinations of four and three controls, as effective in mitigating malaria transmission. The result of the optimal simulation also revealed that for communities or areas where resources are scarce, the combination of Long Lasting Insecticide Treated Bednets (u2), Treatment (u3), and Indoor insecticide spray (u5) is recommended. Numerical simulations are performed to validate the model's analytical results.
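For readers unfamiliar with compartmental models, the following is a minimal sketch of a generic textbook SIR system integrated with a forward Euler step. It is not the authors' model, which adds mosquito, blood-transfusion, and control terms; the function name `simulate_sir` and the values of `beta`, `gamma`, and the initial populations are illustrative assumptions.

```python
def simulate_sir(s0, i0, r0, beta, gamma, dt, steps):
    """Integrate the textbook SIR equations with forward Euler.

    dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I.
    Returns a list of (S, I, R) tuples, one per time step.
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical parameters: for this generic model the basic reproduction
# number is beta / gamma = 0.8, below one, so the infection should die out.
trajectory = simulate_sir(990, 10, 0, beta=0.08, gamma=0.1, dt=0.1, steps=2000)
final_s, final_i, final_r = trajectory[-1]
print(f"final infected: {final_i:.4f}")
```

With a reproduction number below one the infected compartment shrinks at every step, which mirrors the stability condition stated in the abstract for the disease-free equilibrium.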
Explore data formats and ingestion methods
kaggle.com
Updated Feb 12, 2021
Cite
Gabriel Preda (2021). Explore data formats and ingestion methods [Dataset]. https://www.kaggle.com/datasets/gpreda/iris-dataset/discussion?sort=undefined
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 12, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Gabriel Preda
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Why this Dataset

This dataset provides the Iris dataset in several data formats (see more details in the next sections).

You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that read all these formats:

Test Data Formats in Python

Test Data Formats in R

Iris Dataset

Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.

Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris

Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/

The file downloaded is iris.data and is formatted as a comma delimited file.

This small data collection was created to help you test your skills with ingesting various data formats.

Content

This file was processed to convert the data into the following formats:
* csv - comma separated values format
* tsv - tab separated values format
* parquet - parquet format
* feather - feather format
* parquet.gzip - compressed parquet format
* h5 - hdf5 format
* pickle - Python binary object file - pickle format
* xlsx - Excel format
* npy - Numpy (Python library) binary format
* npz - Numpy (Python library) binary compressed format
* rds - Rds (R specific data format) binary format
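Three of the formats listed above (csv, tsv, and pickle) can be exercised with the Python standard library alone. The sketch below writes a two-row sample and reads it back; the column names and the file names `iris.csv`, `iris.tsv`, and `iris.pickle` are assumptions for illustration and may not match the files in the dataset. The remaining formats need third-party libraries and are not shown.

```python
import csv
import pickle
import tempfile
from pathlib import Path

# Column names are assumed; the original iris.data file has no header row.
HEADER = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
# A two-row sample for illustration, not the full dataset.
ROWS = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["7.0", "3.2", "4.7", "1.4", "Iris-versicolor"],
]


def write_delimited(path, delimiter):
    """Write the sample with the given field delimiter."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        writer.writerow(HEADER)
        writer.writerows(ROWS)


def read_delimited(path, delimiter):
    """Read a delimited file into a list of dicts keyed by header name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def round_trip(directory):
    """Write the sample as csv, tsv, and pickle, then read each back."""
    directory = Path(directory)
    write_delimited(directory / "iris.csv", ",")
    write_delimited(directory / "iris.tsv", "\t")
    with open(directory / "iris.pickle", "wb") as handle:
        pickle.dump(ROWS, handle)
    with open(directory / "iris.pickle", "rb") as handle:
        unpickled = pickle.load(handle)
    return (
        read_delimited(directory / "iris.csv", ","),
        read_delimited(directory / "iris.tsv", "\t"),
        unpickled,
    )


with tempfile.TemporaryDirectory() as tmp:
    from_csv, from_tsv, from_pickle = round_trip(tmp)
    print(from_csv[0]["species"], len(from_tsv), from_pickle == ROWS)
```

Note that the csv and tsv readers return every field as a string, so numeric columns still need an explicit conversion after ingestion.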

Acknowledgements

I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.

Inspiration

Use these data formats to test your skills in ingesting data in various formats.
t-test i Case study analize podataka
explore.openaire.eu
Updated May 24, 2021
Cite
Nadica Miljković (2021). t-test i Case study analize podataka [Dataset]. http://doi.org/10.5281/zenodo.4784060
Unique identifier
https://doi.org/10.5281/zenodo.4784060
Dataset updated
May 24, 2021
Authors
Nadica Miljković
Description
Lecture for the course Biomedical Signal Processing Techniques (Tehnike obrade biomedicinskih signala) in the master's academic studies at the School of Electrical Engineering, University of Belgrade.
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits, and here we'll use the r/AskScience subreddit.

The dataset is extracted from the subreddit /r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 data points and 25 columns. The dataset contains information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, with light cleaning done using NumPy and pandas (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions:
author - Redditor Name
author_fullname - Redditor Full name
contest_mode - Contest mode [implement obscured scores and randomized sorting].
created_utc - Time the submission was created, represented in Unix Time.
domain - Domain of submission.
edited - If the post is edited or not.
full_link - Link of the post on the subreddit.
id - ID of the submission.
is_self - Whether or not the submission is a self post (text-only).
link_flair_css_class - CSS Class used to identify the flair.
link_flair_text - Flair on the post or The link flair’s text content.
locked - Whether or not the submission has been locked.
num_comments - The number of comments on the submission.
over_18 - Whether or not the submission has been marked as NSFW.
permalink - A permalink for the submission.
retrieved_on - time ingested.
score - The number of upvotes for the submission.
description - Description of the Submission.
spoiler - Whether or not the submission has been marked as a spoiler.
stickied - Whether or not the submission is stickied.
thumbnail - Thumbnail of Submission.
question - Question Asked in the Submission.
url - The URL the submission links to, or the permalink if a self post.
year - Year of the Submission.
banned - Banned by the moderator or not.
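Since created_utc is stored as Unix time, a first exploratory step is often to convert it to a calendar year and count submissions per year or per flair. The sketch below does this with the standard library; the three records are hypothetical and use only a subset of the columns described above, and the helper name `submission_year` is illustrative.

```python
from collections import Counter
from datetime import datetime, timezone


def submission_year(created_utc):
    """Convert a Unix timestamp in seconds to a calendar year in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).year


# Hypothetical records; the flair values are made up for illustration.
records = [
    {"created_utc": 1451606400, "link_flair_text": "Physics", "over_18": False},
    {"created_utc": 1483228800, "link_flair_text": "Biology", "over_18": False},
    {"created_utc": 1483315200, "link_flair_text": "Physics", "over_18": False},
]

posts_per_year = Counter(submission_year(r["created_utc"]) for r in records)
posts_per_flair = Counter(r["link_flair_text"] for r in records)
print(posts_per_year, posts_per_flair)
```

Fixing the time zone to UTC keeps the year boundaries independent of the machine running the analysis.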

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
ftmsRanalysis: An R package for exploratory data analysis and interactive...
plos.figshare.com
xlsx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lisa M. Bramer; Amanda M. White; Kelly G. Stratton; Allison M. Thompson; Daniel Claborne; Kirsten Hofmockel; Lee Ann McCue (2023). ftmsRanalysis: An R package for exploratory data analysis and interactive visualization of FT-MS data [Dataset]. http://doi.org/10.1371/journal.pcbi.1007654
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pcbi.1007654
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS Computational Biology
Authors
Lisa M. Bramer; Amanda M. White; Kelly G. Stratton; Allison M. Thompson; Daniel Claborne; Kirsten Hofmockel; Lee Ann McCue
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The high-resolution and mass accuracy of Fourier transform mass spectrometry (FT-MS) has made it an increasingly popular technique for discerning the composition of soil, plant and aquatic samples containing complex mixtures of proteins, carbohydrates, lipids, lignins, hydrocarbons, phytochemicals and other compounds. Thus, there is a growing demand for informatics tools to analyze FT-MS data that will aid investigators seeking to understand the availability of carbon compounds to biotic and abiotic oxidation and to compare fundamental chemical properties of complex samples across groups. We present ftmsRanalysis, an R package which provides an extensive collection of data formatting and processing, filtering, visualization, and sample and group comparison functionalities. The package provides a suite of plotting methods and enables expedient, flexible and interactive visualization of complex datasets through functions which link to a powerful and interactive visualization user interface, Trelliscope. Example analysis using FT-MS data from a soil microbiology study demonstrates the core functionality of the package and highlights the capabilities for producing interactive visualizations.
Data Visualization Cheat sheets and Resources
kaggle.com
zip
Updated Feb 20, 2021
Cite
Kash (2021). Data Visualization Cheat sheets and Resources [Dataset]. https://www.kaggle.com/kaushiksuresh147/data-visualization-cheat-cheats-and-resources
Explore at:
Available download formats: zip (133638507 bytes)
Dataset updated
Feb 20, 2021
Authors
Kash
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The Data Visualization Corpus


Data Visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.

The Data Visualization Corpus

The Data Visualization corpus consists of:

32 cheat sheets: These cover visualization techniques and tricks from A to Z, Python and R visualization cheat sheets, types of charts and their significance, storytelling with data, etc.

32 charts: The corpus also includes a significant amount of information on data visualization charts, along with their Python code, d3.js code, and presentations that explain the respective charts clearly.

Some recommended books on data visualization that every data scientist should read:

Beautiful Visualization by Julie Steele and Noah Iliinsky

Information Dashboard Design by Stephen Few

Knowledge is beautiful by David McCandless (Short abstract)

The Functional Art: An Introduction to Information Graphics and Visualization by Alberto Cairo

The Visual Display of Quantitative Information by Edward R. Tufte

storytelling with data: a data visualization guide for business professionals by Cole Nussbaumer Knaflic

Research paper - Cheat Sheets for Data Visualization Techniques by Zezhong Wang, Lovisa Sundin, Dave Murray-Rust, Benjamin Bach

Suggestions:

If you find any books, cheat sheets, or charts missing, or would like to suggest new documents, please let me know in the discussion section!

Resources:

Charts: I personally recommend data viz catalogue, it's easy to understand with their explanation!

Python codes: Plotly for python and Python graph gallery

R codes for charts: Plotly for R

d3 codes: Visualization codes using d3

Request to kaggle users:

A kind request to kaggle users to create notebooks on different visualization charts as per their interest by choosing a dataset of their own as many beginners and other experts could find it useful!

To create interactive EDA using animation with a combination of data visualization charts to give an idea about how to tackle data and extract the insights from the data

Suggestion and queries:

Feel free to use the discussion platform of this data set to ask questions or any queries related to the data visualization corpus and data visualization techniques

Kindly upvote the dataset if you find it useful or if you wish to appreciate the effort taken to gather this corpus! Thank you and have a great day!
HadISD: Global sub-daily, surface meteorological station data, 1931-2022,...
data-search.nerc.ac.uk
catalogue.ceda.ac.uk
Updated Jul 24, 2021
+ more versions
Cite
(2021). HadISD: Global sub-daily, surface meteorological station data, 1931-2022, v3.3.0.2022f [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=dewpoint
Dataset updated
Jul 24, 2021
Description
This is version v3.3.0.2022f of Met Office Hadley Centre's Integrated Surface Database, HadISD. These data are global sub-daily surface meteorological data. The quality controlled variables in this dataset are: temperature, dewpoint temperature, sea-level pressure, wind speed and direction, cloud data (total, low, mid and high level). Past significant weather and precipitation data are also included, but have not been quality controlled, so their quality and completeness cannot be guaranteed. Quality control flags and data values which have been removed during the quality control process are provided in the qc_flags and flagged_values fields, and ancillary data files show the station listing with IDs, names and location information. The data are provided as one NetCDF file per station. Files in the station_data folder have the format "station_code"_HadISD_HadOBS_19310101-20230101_v3.3.1.2022f.nc. The station codes can be found under the docs tab. The station codes file has five columns as follows: 1) station code, 2) station name, 3) station latitude, 4) station longitude, 5) station height. To keep informed about updates, news and announcements follow the HadOBS team on twitter @metofficeHadOBS. For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISD blog: http://hadisd.blogspot.co.uk/

References: When using the dataset in a paper you must cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference):

Dunn, R. J. H., (2019), HadISD version 3: monthly updates, Hadley Centre Technical Note.

Dunn, R. J. H., Willett, K. M., Parker, D. E., and Mitchell, L.: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geosci. Instrum. Method. Data Syst., 5, 473-491, doi:10.5194/gi-5-473-2016, 2016.

Dunn, R. J. H., et al. (2012), HadISD: A Quality Controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Clim. Past, 8, 1649-1679, 2012, doi:10.5194/cp-8-1649-2012

Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704–708, doi:10.1175/2011BAMS3015.1

For a homogeneity assessment of HadISD please see the following reference:

Dunn, R. J. H., K. M. Willett, C. P. Morice, and D. E. Parker. "Pairwise homogeneity assessment of HadISD." Climate of the Past 10, no. 4 (2014): 1501-1522. doi:10.5194/cp-10-1501-2014, 2014.
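The file-naming pattern and the five-column station codes file described above can be sketched as follows. The comma delimiter is purely an assumption, since the description does not state how the columns are separated, and the station record used here is a placeholder rather than a real station. The names `Station`, `parse_station_line`, and `station_filename` are illustrative.

```python
from typing import NamedTuple


class Station(NamedTuple):
    code: str
    name: str
    latitude: float
    longitude: float
    height: float


def parse_station_line(line, delimiter=","):
    """Parse one line of the five-column station codes file.

    The delimiter is an assumption; the description does not state it.
    """
    code, name, lat, lon, height = [part.strip() for part in line.split(delimiter)]
    return Station(code, name, float(lat), float(lon), float(height))


def station_filename(station_code):
    """Build the NetCDF file name using the pattern quoted in the description."""
    return f"{station_code}_HadISD_HadOBS_19310101-20230101_v3.3.1.2022f.nc"


# Placeholder values, not a real station record.
station = parse_station_line("000000-00000, EXAMPLE STATION, 51.5, -0.1, 25.0")
print(station_filename(station.code))
```

Keeping the parsed record in a named tuple makes the column order from the description explicit at the point of use.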
Wetlands Ecological Integrity Depth To Water Data - Florissant Fossil Beds...
catalog.data.gov
Updated Jun 5, 2024
+ more versions
Cite
National Park Service (2024). Wetlands Ecological Integrity Depth To Water Data - Florissant Fossil Beds National Monument 2009-2019 [Dataset]. https://catalog.data.gov/dataset/wetlands-ecological-integrity-depth-to-water-data-florissant-fossil-beds-national-mon-2009
Dataset updated
Jun 5, 2024
Dataset provided by
National Park Service (http://www.nps.gov/)
Area covered
Florissant
Description
Wetlands Ecological Integrity Depth to Water Logger data from 2009-2019 at Florissant Fossil Beds National Monument. This includes Raw dataset (primarily hourly), daily summaries, weekly summaries, and monthly summaries. Included in the data package are exploratory data analysis figures at the daily, weekly and monthly time steps. Lastly included is the R code used to extract the depth to water logger data from the National Park Service Aquarius data system, and to create the exploratory data analysis figures.
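The step from primarily hourly raw readings to daily summaries can be sketched as below. The data package ships its own R code for this; the Python version here is only an illustration, the choice of the mean as the summary statistic is an assumption, and the readings are made-up values.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_means(readings):
    """Group (timestamp, depth) pairs by calendar day and average each group."""
    by_day = defaultdict(list)
    for timestamp, depth in readings:
        by_day[timestamp.date()].append(depth)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Hypothetical hourly depth-to-water readings; values and units are made up.
readings = [
    (datetime(2015, 6, 1, 0), 10.0),
    (datetime(2015, 6, 1, 1), 12.0),
    (datetime(2015, 6, 2, 0), 20.0),
]
summary = daily_means(readings)
for day, value in summary.items():
    print(day.isoformat(), value)
```

Weekly or monthly summaries follow the same pattern with a different grouping key, for example the ISO week or the (year, month) pair of each timestamp.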
VAERS Data as of 19th March 2021
kaggle.com
Updated Mar 29, 2021
Cite
Gayathri Nagarajan (2021). VAERS Data as of 19th March 2021 [Dataset]. https://www.kaggle.com/gayathrirprog/vaers-data-as-of-19th-march-2021/activity
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 29, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Gayathri Nagarajan
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

Every story has a question that triggered it. Mine was: What vaccinations are being administered in the USA? What incidents are people reporting after the vaccine doses?

I look in awe at the way folks do EDA on Kaggle, and I have a long way to go. But I want to start small, and I have already started my journey. The folks who do wonderful EDA are my source of inspiration, and I learn by working through their notebooks in R. I am an R fan for now.

Content

The data here was downloaded on 29th March from CDC Wonder site which helps take reports on VAERS.

Acknowledgements

My google search on VAERS and Kaggle search for VAERS got me a wonderful notebook and dataset. Thanks to folks like Ayush Garg and jmreuter for helping folks like me learn more.

Inspiration

What vaccinations are being administered in the USA? What incidents are people reporting after the vaccine doses? Which vaccine has the most side effects across all age groups? Which vaccine has the most side effects in each state?
Data from: The Often-Overlooked Power of Summary Statistics in Exploratory...
acs.figshare.com
xlsx
Updated Jun 8, 2023
Cite
Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford (2023). The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE) [Dataset]. http://doi.org/10.1021/acs.jcim.1c00244.s002
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jcim.1c00244.s002
Dataset updated
Jun 8, 2023
Dataset provided by
ACS Publications
Authors
Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
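Several of the simpler summary statistics named in the abstract (mean, STD, 1-norm, range, SSQ) reduce a whole spectrum to a single number each, as the sketch below illustrates. PRE and X4 are left out because the abstract does not define them. The function name `summary_statistics`, the sample-standard-deviation convention, and the input values are assumptions, not taken from the paper.

```python
import math


def summary_statistics(spectrum):
    """Return several of the summary statistics named above for one spectrum."""
    n = len(spectrum)
    avg = sum(spectrum) / n
    # Sample standard deviation (n - 1 denominator); the paper's convention
    # is not stated in the abstract, so this choice is an assumption.
    std = math.sqrt(sum((x - avg) ** 2 for x in spectrum) / (n - 1))
    return {
        "mean": avg,
        "std": std,
        "one_norm": sum(abs(x) for x in spectrum),
        "range": max(spectrum) - min(spectrum),
        "ssq": sum(x * x for x in spectrum),
    }


# Hypothetical intensity values, not data from the paper.
stats = summary_statistics([1.0, 2.0, 3.0, 4.0])
print(stats)
```

Applying such a function to every spectrum in a data set yields one value per spectrum and statistic, which can then be plotted or clustered without choosing a number of factors.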
Wetlands Ecological Integrity Depth To Water Data - Great Sand Dunes...
catalog.data.gov
Updated Jun 5, 2024
+ more versions
Cite
National Park Service (2024). Wetlands Ecological Integrity Depth To Water Data - Great Sand Dunes National Park 2009-2019 [Dataset]. https://catalog.data.gov/dataset/wetlands-ecological-integrity-depth-to-water-data-great-sand-dunes-national-park-2009-2019
Dataset updated
Jun 5, 2024
Dataset provided by
National Park Service (http://www.nps.gov/)
Description
Wetlands Ecological Integrity Depth to Water Logger data from 2009-2019 at Great Sand Dunes National Park. This includes Raw dataset (primarily hourly), daily summaries, weekly summaries, and monthly summaries. Included in the data package are exploratory data analysis figures at the daily, weekly and monthly time steps. Lastly included is the R code used to extract the depth to water logger data from the National Park Service Aquarius data system, and to create the exploratory data analysis figures.
Wetlands Ecological Integrity Depth To Water Data - Rocky Mountain National...
catalog.data.gov
Updated Jun 5, 2024
+ more versions
Cite
National Park Service (2024). Wetlands Ecological Integrity Depth To Water Data - Rocky Mountain National Park 2007-2019 [Dataset]. https://catalog.data.gov/dataset/wetlands-ecological-integrity-depth-to-water-data-rocky-mountain-national-park-2007-2019
Dataset updated
Jun 5, 2024
Dataset provided by
National Park Service (http://www.nps.gov/)
Area covered
Rocky Mountains
Description
Wetlands Ecological Integrity Depth to Water Logger data from 2007-2019 at Rocky Mountain National Park. This includes Raw dataset (primarily hourly), daily summaries, weekly summaries, and monthly summaries. Included in the data package are exploratory data analysis figures at the daily, weekly and monthly time steps. Lastly included is the R code used to extract the depth to water logger data from the National Park Service Aquarius data system, and to create the exploratory data analysis figures.
Appalachian Basin Play Fairway Analysis: Thermal Quality Analysis in...
data.wu.ac.at
zip
Updated Mar 6, 2018
+ more versions
Cite
HarvestMaster (2018). Appalachian Basin Play Fairway Analysis: Thermal Quality Analysis in Low-Temperature Geothermal Play Fairway Analysis (GPFA-AB) ThermalQualityAnalysisThermalResourceInterpolationResultsDataFilesImages.zip [Dataset]. https://data.wu.ac.at/schema/geothermaldata_org/MjQ3ZDg1ZmEtMGJkZi00ZGQ5LTlhMjAtZDg1ZTBlOTZmOWMx
Explore at:
Available download formats: zip
Dataset updated
Mar 6, 2018
Dataset provided by
HarvestMaster
Description
This collection of files is part of a larger dataset uploaded in support of Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin (GPFA-AB, DOE Project DE-EE0006726). Phase 1 of the GPFA-AB project identified potential Geothermal Play Fairways within the Appalachian basin of Pennsylvania, West Virginia and New York. This was accomplished through analysis of 4 key criteria: thermal quality, natural reservoir productivity, risk of seismicity, and heat utilization. Each of these analyses represents a distinct project task, with the fifth task encompassing the combination of the 4 risk factors. Supporting data for all five tasks has been uploaded into the Geothermal Data Repository node of the National Geothermal Data System (NGDS).

This submission comprises the data for Thermal Quality Analysis (project task 1) and includes all of the necessary shapefiles, rasters, datasets, code, and references to code repositories that were used to create the thermal resource and risk factor maps as part of the GPFA-AB project. The identified Geothermal Play Fairways are also provided with the larger dataset. Figures (.png) are provided as examples of the shapefiles and rasters. The regional standardized 1 square km grid used in the project is also provided as points (cell centers), polygons, and as a raster. Two ArcGIS toolboxes are available: 1) RegionalGridModels.tbx for creating resource and risk factor maps on the standardized grid, and 2) ThermalRiskFactorModels.tbx for use in making the thermal resource maps and cross sections. These toolboxes contain item description documentation for each model within the toolbox, and for the toolbox itself. This submission also contains three R scripts: 1) AddNewSeisFields.R to add seismic risk data to attribute tables of seismic risk, 2) StratifiedKrigingInterpolation.R for the interpolations used in the thermal resource analysis, and 3) LeaveOneOutCrossValidation.R for the cross validations used in the thermal interpolations.

Some file descriptions make reference to various 'memos'. These are contained within the final report submitted October 16, 2015.

Each zipped file in the submission contains an 'about' document describing the full Thermal Quality Analysis content available, along with key sources, authors, citation, use guidelines, and assumptions, with the specific file(s) contained within the .zip file highlighted.

UPDATE: A newer version of the Thermal Quality Analysis has been added here: https://gdr.openei.org/submissions/879 (also linked below). A newer version of the Combined Risk Factor Analysis has been added here: https://gdr.openei.org/submissions/880 (also linked below).

This is one of sixteen associated .zip files relating to thermal resource interpolation results within the Thermal Quality Analysis task of the Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin. This file contains 6 images (.png) including predicted and associated error for surface heat flow, depth to 80 degrees C, depth to 100 degrees C, temperature at 1.5 km, temperature at 2.5 km and temperature at 3.5 km.

The sixteen files contain the results of the thermal resource interpolation as binary grid (raster) files, images (.png) of the rasters, and toolbox of ArcGIS Models used. Note that raster files ending in “pred” are the predicted mean for that resource, and files ending in “err” are the standard error of the predicted mean for that resource. Leave one out cross validation results are provided for each thermal resource.

Several models were built in order to process the well database with outliers removed. ArcGIS toolbox ThermalRiskFactorModels contains the ArcGIS processing tools used. First, the WellClipsToWormSections model was used to clip the wells to the worm sections (interpolation regions). Then, the 1 square km gridded regions (see series of 14 Worm Based Interpolation Boundaries .zip files) along with the wells in those regions were loaded into R using the rgdal package. Then, a stratified kriging algorithm implemented in the R gstat package was used to create rasters of the predicted mean and the standard error of the predicted mean. The code used to make these rasters is called StratifiedKrigingInterpolation.R. Details about the interpolation, and exploratory data analysis on the well data, are provided in 9_GPFA-AB_InterpolationThermalFieldEstimation.pdf (Smith, 2015), contained within the final report.

The output rasters from R are brought into ArcGIS for further spatial processing. First, the BufferedRasterToClippedRaster tool is used to clip the interpolations back to the Worm Sections. Then, the Mosaic tool in ArcGIS is used to merge all predicted mean rasters into a single raster, and all error rasters into a single raster for each thermal resource.

A leave one out cross validation was performed on each of the thermal resources. The code used to implement the cross validation is provided in the R script LeaveOneOutCrossValidation.R. The results of the cross validation are given for each thermal resource.
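The general idea of leave-one-out cross validation can be sketched as follows: each observation is withheld in turn, predicted from the remaining observations, and the residuals are summarized. This is not the content of LeaveOneOutCrossValidation.R. The sketch swaps in simple inverse-distance weighting as a stand-in for the stratified kriging used in the project, and the function names, coordinates, and values are all hypothetical.

```python
import math

def idw_predict(x, y, points, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from (px, py, value) tuples.

    Stand-in interpolator only; the project itself used stratified kriging.
    """
    num = 0.0
    den = 0.0
    for px, py, value in points:
        dist = math.hypot(x - px, y - py)
        if dist == 0.0:
            return value  # exact hit on a known point
        weight = 1.0 / dist ** power
        num += weight * value
        den += weight
    return num / den

def leave_one_out_errors(points, power=2.0):
    """Predict each point from all the others and return predicted - observed."""
    errors = []
    for i, (x, y, observed) in enumerate(points):
        others = points[:i] + points[i + 1:]
        predicted = idw_predict(x, y, others, power)
        errors.append(predicted - observed)
    return errors

# Hypothetical well locations (km) and measured values.
wells = [(0.0, 0.0, 50.0), (1.0, 0.0, 52.0), (0.0, 1.0, 54.0), (1.0, 1.0, 56.0)]
errors = leave_one_out_errors(wells)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
```

The root-mean-square of the residuals gives a single figure for comparing interpolation settings; the same loop works with any interpolator that accepts a set of known points.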

Other tools provided in this toolbox are useful for creating cross sections of the thermal resource. ExtractThermalPropertiesToCrossSection model extracts the predicted mean and the standard error of predicted mean to the attribute table of a line of cross section. The AddExtraInfoToCrossSection model is then used to add any other desired information, such as state and county boundaries, to the cross section attribute table. These two functions can be combined as a single function, as provided by the CrossSectionExtraction model.
Bank marketing campaigns dataset | Opening Deposit
kaggle.com
Updated Jan 12, 2020
Cite
VolodymyrGavrysh (2020). Bank marketing campaigns dataset | Opening Deposit [Dataset]. https://www.kaggle.com/volodymyrgavrysh/bank-marketing-campaigns-dataset/kernels
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 12, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
VolodymyrGavrysh
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Bank marketing campaigns dataset analysis: Opening a Term Deposit

This dataset describes the results of marketing campaigns run by a Portuguese bank. The campaigns were based mostly on direct phone calls offering bank clients a term deposit. If, after all marketing efforts, the client agreed to place a deposit, the target variable is marked 'yes'; otherwise it is marked 'no'.

Source of the data: https://archive.ics.uci.edu/ml/datasets/bank+marketing

Citation Request:

This dataset is publicly available for research. The details are described in S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Title: Bank Marketing (with social/economic context)

Sources Created by: Sérgio Moro (ISCTE-IUL), Paulo Cortez (Univ. Minho) and Paulo Rita (ISCTE-IUL) @ 2014

Past Usage:

The full dataset (bank-additional-full.csv) was described and analyzed in:

S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems (2014), doi:10.1016/j.dss.2014.03.001.

Relevant Information:

This dataset is based on "Bank Marketing" UCI dataset (please check the description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing). The data is enriched by the addition of five new social and economic features/attributes (nationwide indicators from a ~10M population country), published by the Banco de Portugal and publicly available at: https://www.bportugal.pt/estatisticasweb. This dataset is almost identical to the one used in Moro et al., 2014. Using the rminer package and R tool (http://cran.r-project.org/web/packages/rminer/), we found that the addition of the five new social and economic attributes (made available here) led to substantial improvement in the prediction of success, even when the duration of the call is not included. Note: the file can be read in R using: d=read.table("bank-additional-full.csv",header=TRUE,sep=";")
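For readers not working in R, a rough Python counterpart to the `read.table` call is sketched below using only the standard library. The helper name `read_semicolon_csv` and the in-memory sample rows are hypothetical; the only detail taken from the description is that the file has a header row and uses `;` as the separator.

```python
import csv
import io

def read_semicolon_csv(handle):
    """Read a semicolon-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=";"))

# Tiny in-memory sample with hypothetical rows, standing in for the real file.
sample = io.StringIO('"age";"job";"y"\n56;"housemaid";"no"\n41;"technician";"yes"\n')
rows = read_semicolon_csv(sample)

# With the real file one would instead do something like:
# with open("bank-additional-full.csv", newline="") as fh:
#     rows = read_semicolon_csv(fh)
```

All values come back as strings, so numeric columns such as `age` need an explicit conversion before analysis.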

The binary classification goal is to predict if the client will subscribe to a bank term deposit (variable y).

Number of Instances: 41188 for bank-additional-full.csv

Number of Attributes: 20 + output attribute.

Attribute information:

For more information, read [Moro et al., 2014].

Input variables:

bank client data:

1 - age (numeric)

2 - job: type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")

3 - marital: marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)

4 - education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")

5 - default: has credit in default? (categorical: "no","yes","unknown")

6 - housing: has housing loan? (categorical: "no","yes","unknown")

7 - loan: has personal loan? (categorical: "no","yes","unknown")

related with the last contact of the current campaign:

8 - contact: contact communication type (categorical: "cellular","telephone")

9 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

10 - day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")

social and economic context attributes:

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target): * 21 - y - h...
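Two of the attribute notes above translate directly into preprocessing steps: `duration` should be dropped for a realistic predictive model, and `pdays` uses 999 as a sentinel for "not previously contacted". A minimal sketch of both steps follows. The function name `prepare_row` and the sample records are hypothetical, and representing the sentinel as `None` is an assumption rather than something the description prescribes.

```python
def prepare_row(row):
    """Return a cleaned copy of one record for realistic modelling."""
    cleaned = dict(row)
    # Drop call duration: it is only known after the call, when y is known too.
    cleaned.pop("duration", None)
    # 999 is the documented sentinel for "not previously contacted".
    pdays = int(cleaned["pdays"])
    cleaned["pdays"] = None if pdays == 999 else pdays
    return cleaned

# Hypothetical records using the attribute names from the description.
records = [
    {"age": "56", "duration": "261", "pdays": "999", "y": "no"},
    {"age": "37", "duration": "226", "pdays": "6", "y": "yes"},
]
prepared = [prepare_row(r) for r in records]
```

Keeping the sentinel as a large number would make "never contacted" look like a very long gap, which can distort summary statistics and distance-based models.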
