Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Mustafa Almitamy
Released under Apache 2.0
This dataset was created by chiragksharma
This dataset was created by srinidhi yerabati

This dataset was created by Bro Brother Crony420

This dataset was created by Sumbal Wahid
Released under Other (specified in description)
This dataset was created by İlyas Abbasov

Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Akash Roy
Released under Apache 2.0
This dataset was created by ARITRA BRAHMA

This dataset was created by Hadeer Khaled Nabil

This dataset was created by Alan Diego

https://creativecommons.org/publicdomain/zero/1.0/
Data Visualization
a. Scatter plot
i. The webapp should allow the user to select genes from the datasets and plot 2D scatter plots between two variables (expression/copy_number/chronos) for any pair of genes.
ii. The user should be able to filter and color data points using the metadata available in the file “metadata.csv”.
iii. The visualization could be interactive: it would be great if the user could hover over the data points on the plot and get the relevant information (hint: visit https://plotly.com/r/, https://plotly.com/python).
iv. Here is a quick reference for you. The scatter plot is between the chronos score for the TTBK2 gene and expression for the MORC2 gene, with coloring defined by the Gender/Sex column from the metadata file.
b. Boxplot/violin plot
i. The user should be able to select a gene and a variable (expression/chronos/copy_number) and generate a boxplot to display its distribution across multiple categories, as defined by a user-selected variable (a column from the metadata file).
ii. Here is an example for your reference, where a violin plot of the CHRONOS score for gene CCL22 is plotted and grouped by ‘Lineage’.
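The data step behind both plot types can be sketched in Python roughly as follows. This is a minimal sketch, not the task's reference solution: it assumes each of the expression, copy_number and chronos tables is a pandas DataFrame with one row per sample and one column per gene, and that the metadata table shares the same sample index. The toy values, the sample IDs and the helper names `build_plot_frame`, `scatter_figure` and `violin_figure` are hypothetical. The plotting helpers use Plotly Express, which the hint in the task points to.

```python
import pandas as pd


def build_plot_frame(columns, metadata, meta_column):
    """Align named gene columns with one metadata column on the shared sample index."""
    frame = pd.concat({**columns, meta_column: metadata[meta_column]}, axis=1, join="inner")
    # Drop samples that lack a value for any selected gene.
    return frame.dropna(subset=list(columns))


def scatter_figure(frame, x, y, color):
    """Interactive scatter plot; hovering over a point shows its sample ID."""
    import plotly.express as px  # imported lazily so the data step runs without Plotly

    data = frame.rename_axis("sample").reset_index()
    return px.scatter(data, x=x, y=y, color=color, hover_name="sample")


def violin_figure(frame, value, group):
    """Violin plot of one gene variable, grouped by a metadata column."""
    import plotly.express as px

    return px.violin(frame, x=group, y=value, box=True, points="all")


# Toy stand-ins for the real tables (values and sample IDs are made up).
chronos = pd.DataFrame({"TTBK2": [-0.1, -0.5, 0.2]}, index=["S1", "S2", "S3"])
expression = pd.DataFrame({"MORC2": [3.2, 4.1, 2.8]}, index=["S1", "S2", "S4"])
metadata = pd.DataFrame(
    {"Sex": ["Female", "Male", "Male", "Female"]}, index=["S1", "S2", "S3", "S4"]
)

frame = build_plot_frame(
    {"TTBK2 chronos": chronos["TTBK2"], "MORC2 expression": expression["MORC2"]},
    metadata,
    "Sex",
)
# Only samples present in all three tables remain (S1 and S2 here).
print(frame)
```

Filtering by metadata is then an ordinary row selection, for example `frame[frame["Sex"] == "Female"]`, and `scatter_figure(frame, "TTBK2 chronos", "MORC2 expression", "Sex")` would return a figure that a web framework can render.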
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This synthetic dataset is designed specifically for practicing data visualization and exploratory data analysis (EDA) using popular Python libraries like Seaborn, Matplotlib, and Pandas.
Unlike most public datasets, this one includes a diverse mix of column types:
• 📅 Date columns (for time series and trend plots)
• 🔢 Numerical columns (for histograms, boxplots, scatter plots)
• 🏷️ Categorical columns (for bar charts, group analysis)
Whether you are a beginner learning how to visualize data or an intermediate user testing new charting techniques, this dataset offers a versatile playground.
Feel free to:
• Create EDA notebooks
• Practice plotting techniques
• Experiment with filtering, grouping, and aggregations
🛠️ No missing values, no data cleaning needed: just download and start exploring!
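A minimal pandas sketch of the filtering, grouping and aggregation steps mentioned above. The description does not list the real column names, so `order_date`, `region` and `sales` are hypothetical stand-ins for one date, one categorical and one numerical column, and the values are made up.

```python
import pandas as pd

# Toy frame with one date, one categorical and one numerical column.
# The column names are hypothetical; substitute the dataset's real ones.
df = pd.DataFrame(
    {
        "order_date": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]
        ),
        "region": ["North", "South", "North", "South"],
        "sales": [100.0, 150.0, 120.0, 90.0],
    }
)

# Filtering: keep rows above a threshold.
large_orders = df[df["sales"] > 95]

# Grouping and aggregation: summary statistics per category.
by_region = df.groupby("region")["sales"].agg(["mean", "sum"])

# Trend: total per calendar month, ready for a line plot.
monthly = df.groupby(df["order_date"].dt.to_period("M"))["sales"].sum()

print(by_region)
print(monthly)
```

The aggregated frames can be passed straight to Matplotlib or Seaborn, for example `by_region["sum"].plot(kind="bar")`.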
Hope you find this helpful. Looking forward to hearing from you all.
**Dataset Overview**
The Titanic dataset is a widely used benchmark dataset for machine learning and data science tasks. It contains information about passengers who boarded the RMS Titanic in 1912, including their age, sex, social class, and whether they survived the sinking of the ship. The dataset is divided into two main parts:
train.csv: This file contains information about 891 passengers and is used to train machine learning models. It includes the following features:
• PassengerId: A unique identifier for each passenger
• Survived: Whether the passenger survived (1) or not (0)
• Pclass: The passenger's social class (1 = Upper, 2 = Middle, 3 = Lower)
• Name: The passenger's name
• Sex: The passenger's sex (male or female)
• Age: The passenger's age
• SibSp: The number of siblings or spouses aboard the ship
• Parch: The number of parents or children aboard the ship
• Ticket: The passenger's ticket number
• Fare: The passenger's fare
• Cabin: The passenger's cabin number
• Embarked: The port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
test.csv: This file contains information about 418 passengers who are held out from training. It includes the same features as train.csv, but does not include the Survived label. The goal of a machine learning model is to predict whether or not each passenger in test.csv survived.
**Data Preparation**
Before using the Titanic dataset for machine learning tasks, it is important to perform some data preparation steps. These steps may include:
• Handling missing values: Some of the features in the dataset have missing values. These values can be imputed or removed, depending on the specific task.
• Encoding categorical variables: Some of the features in the dataset are categorical variables, such as Pclass, Sex, and Embarked. These variables need to be encoded numerically before they can be used by machine learning algorithms.
• Scaling numerical variables: Some of the features in the dataset are numerical variables, such as Age and Fare. These variables may need to be scaled to ensure that they are on the same scale.
**Data Visualization**
Data visualization can be a useful tool for exploring the Titanic dataset and gaining insights into the data. Some common data visualization techniques that can be used with the Titanic dataset include:
• Histograms: Histograms can be used to visualize the distribution of numerical variables, such as Age and Fare.
• Scatter plots: Scatter plots can be used to visualize the relationship between two numerical variables.
• Box plots: Box plots can be used to visualize the distribution of a numerical variable across different categories, such as Pclass and Sex.
**Machine Learning Tasks**
The Titanic dataset can be used for a variety of machine learning tasks, including:
• Classification: The most common task is to use the train.csv file to train a machine learning model to predict whether or not each passenger in the test.csv file survived.
• Regression: The dataset can also be used to train a machine learning model to predict the fare of a passenger based on their other features.
• Anomaly detection: The dataset can also be used to identify anomalies, such as passengers who are outliers in terms of their age, social class, or other features.
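The data preparation steps described earlier (handling missing values, encoding categorical variables, scaling numerical variables) can be sketched with pandas as follows. The specific choices here are assumptions for illustration rather than the dataset's prescribed method: median imputation for Age, most-frequent imputation for Embarked, one-hot encoding, and standardization. The helper name `prepare_titanic` and the toy rows are hypothetical.

```python
import pandas as pd


def prepare_titanic(df):
    """Impute, encode and scale a Titanic-style frame (returns a new frame)."""
    out = df.copy()
    # Handling missing values: median for Age, most frequent port for Embarked.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Encoding categorical variables: binary code for Sex, one-hot for Embarked.
    out["Sex"] = out["Sex"].str.lower().map({"male": 0, "female": 1})
    out = pd.get_dummies(out, columns=["Embarked"], dtype=int)
    # Scaling numerical variables: standardize to zero mean and unit variance.
    for col in ["Age", "Fare"]:
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out


# Toy rows in the train.csv layout (values are made up).
toy = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 2],
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, None, 30.0],
        "Fare": [7.25, 71.28, 7.92, 13.0],
        "Embarked": ["S", "C", None, "S"],
    }
)
prepared = prepare_titanic(toy)
print(prepared)
```

When a model is evaluated on test.csv, the imputation values and scaling statistics would normally be computed on train.csv only and then reused, so that no information leaks from the test rows.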
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is based on GEO series GSE5583 (source: OmicsDI).
The experiment compares gene expression profiles between wild‑type mouse embryonic stem cells (ES cells) and ES cells in which Histone deacetylase 1 (HDAC1) has been knocked out.
The organism used is mouse (Mus musculus).
Microarray technology was employed to measure transcript abundance across the genome, aiming to identify putative HDAC1 target genes.
The dataset includes processed expression data (after normalization and log2 transformation), allowing for downstream exploratory data analysis (EDA) and differential gene expression (DGE) analysis.
As part of EDA, sample‑wise distribution plots (e.g. boxplots) are provided to assess normalization across all arrays and to evaluate the consistency and quality of the processed data.
Researchers can use this dataset to perform differential expression analysis between HDAC1 knockout vs wild‑type ES cells, investigate epigenetic regulation, or explore downstream effects of histone deacetylation loss.
Additionally, the dataset can serve as a reference example for microarray data preprocessing, normalization, transformation (e.g. log2), and exploratory visualization workflows.
The dataset is publicly available and sourced from GEO, a public repository, which supports transparency and reproducibility of the experiment.
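The log2 transformation and the sample-wise boxplot check can be sketched as follows. This is an illustrative sketch on made-up intensities, assuming a matrix with probes as rows and arrays as columns; the sample and probe names are hypothetical and not taken from GSE5583. For the processed data described above, which is already log2-transformed, the transformation step would be skipped.

```python
import numpy as np
import pandas as pd

# Toy raw intensities: rows are probes, columns are arrays.
# Sample and probe names are hypothetical, not taken from GSE5583.
raw = pd.DataFrame(
    {"WT_1": [8.0, 16.0, 32.0, 64.0], "KO_1": [16.0, 32.0, 64.0, 128.0]},
    index=["probe_a", "probe_b", "probe_c", "probe_d"],
)

# log2 transformation compresses the dynamic range of the intensities.
log_expr = np.log2(raw)

# Per-array quartiles are the numbers a sample-wise boxplot draws.
summary = log_expr.quantile([0.25, 0.5, 0.75])

# Well-normalized arrays should have similar medians; a large spread
# suggests that further normalization is needed.
median_spread = summary.loc[0.5].max() - summary.loc[0.5].min()

print(summary)
print("median spread:", median_spread)
```

Calling `log_expr.boxplot()` would draw the same comparison as a figure, one box per array.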
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
• Automated parametric analysis workflow built using RStudio.
• Demonstrates core statistical analysis methods on numerical datasets.
• Includes step-by-step R scripts for performing t-tests, ANOVA, and summary statistics.
• Provides visual outputs such as boxplots and distribution plots for better interpretation.
• Designed for students, researchers, and data analysts learning statistical automation in R.
• Useful for understanding reproducible research workflows in data analysis.
• Dataset helps in teaching how to automate statistical pipelines using R programming.
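The dataset's own scripts are written in R and are not reproduced here. As a language-neutral illustration of the statistics named in the list, the following Python sketch computes a two-sample t statistic and a one-way ANOVA F statistic by hand using only the standard library. It uses Welch's variant of the t-test, which is an assumption on my part, and the function names and toy values are hypothetical. P-values are omitted because they require a t or F distribution function.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


def anova_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    values = [x for g in groups for x in g]
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy measurements for two groups (made-up values).
control = [1.0, 2.0, 3.0]
treated = [2.0, 3.0, 4.0]

t_stat = welch_t(control, treated)
f_stat = anova_f(control, treated)
print(t_stat, f_stat)
```

With two groups of equal size and equal variance, as in this toy example, the F statistic equals the square of the t statistic, which is a quick sanity check on both functions.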
The purpose of this project was to gain additional practice in learning and demonstrating R data analysis skills. The data set was found on Kaggle and shows sales information from the years 2010 to 2012. Weekly sales fall into two categories, holiday and non-holiday, coded as 1 and 0 respectively in the Holiday_Flag column.
The main question for this exercise was whether any factors affected weekly sales for the stores. The candidate factors were temperature, fuel prices, and unemployment rates.
# Install the packages once; tidyverse already bundles dplyr, ggplot2 and readr
install.packages("tidyverse")
install.packages("tsibble")
# Load the libraries used below
library(tidyverse)
library(lubridate)
library(tsibble)
Walmart <- read.csv("C:/Users/matth/OneDrive/Desktop/Case Study/Walmart.csv")
Check the column names, dimensions, and structure to verify that the file loaded as expected.
colnames(Walmart)
dim(Walmart)
str(Walmart)
head(Walmart)
which(is.na(Walmart$Date))
sum(is.na(Walmart))
There is NA data in the set.
Walmart$Store<-as.factor(Walmart$Store)
Walmart$Holiday_Flag<-as.factor(Walmart$Holiday_Flag)
Walmart$week<-yearweek(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y"))) # make sure to install "tsibble"
Walmart$year<-format(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y")),"%Y")
Walmart_Holiday <- filter(Walmart, Holiday_Flag == 1)
Walmart_Non_Holiday <- filter(Walmart, Holiday_Flag == 0)
ggplot(Walmart, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Weekly Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
The boxplot shows that Store 14 had the maximum weekly sales while Store 33 had the minimum.
Let's verify the results with slice_max and slice_min:
Walmart %>% slice_max(Weekly_Sales)
Walmart %>% slice_min(Weekly_Sales)
The reading of the boxplot was correct. Let's check the mean of the Weekly_Sales column:
mean(Walmart$Weekly_Sales)
The mean of the Weekly_Sales column for the Walmart dataset was 1046965.
ggplot(Walmart_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Holiday Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
Based on the boxplot, Store 4 had the highest weekly sales during a holiday week, while Stores 33 and 5 had some of the lowest holiday sales. Let's verify again with slice_max and slice_min:
Walmart_Holiday %>% slice_max(Weekly_Sales)
Walmart_Holiday %>% slice_min(Weekly_Sales)
The results match the boxplot. Let's find the mean:
mean(Walmart_Holiday$Weekly_Sales)
The mean was 1122888.
ggplot(Walmart_Non_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Non-Holiday Sales Across 45 Stores', x='Weekly sales', y='Store')+theme_bw()
This matched the results for the full Walmart dataset, which includes both holiday and non-holiday weeks: Store 14 had the maximum sales and Store 33 had the minimum. Let's verify the results and find the mean:
Walmart_Non_Holiday %>% slice_max(Weekly_Sales)
Walmart_Non_Holiday %>% slice_min(Weekly_Sales)
mean(Walmart_Non_Holiday$Weekly_Sales)
The results matched, and the mean of weekly sales was 1041256.
ggplot(data = Walmart) + geom_point(mapping = aes(x=year, y=Weekly_Sales))
According to the plot, 2010 had the highest sales. Let's use a boxplot to see more.
ggplot(Walmart, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Weekly Sales for Years 2010 - 2012',
x='Year', y='Weekly Sales')
2010 saw higher sales numbers and a higher median.
Let's start with holiday weekly sales:
ggplot(Walmart_Holiday, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Holiday Weekly Sales for Years ...
The differential diagnosis of "erythemato-squamous" diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with minimal differences. The disorders in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Usually, a biopsy is necessary for the diagnosis, but unfortunately, these diseases share many histopathological features as well.
Patients were first evaluated clinically with 12 features. Afterward, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features were determined by an analysis of the samples under a microscope.
In the dataset constructed for this domain, the family history feature has the value 1 if any of these diseases has been observed in the family, and 0 otherwise. The age feature simply represents the age of the patient.
Every other feature, clinical and histopathological, was given a degree in the range of 0 to 3. Here, 0 indicates that the feature was not present, 3 indicates the largest amount possible, and 1 and 2 indicate the relative intermediate values.
Distribution of each attribute: Explore the distribution of each attribute (column) in the dataset. You can use histograms or boxplots to visualize the distribution of each attribute and look for any patterns or outliers.
Correlation analysis: Use correlation matrices to explore the relationship between the different attributes in the dataset. This can help identify which attributes are most closely related to each other and may be useful in predicting the class labels.
Missing values analysis: Investigate the missing values in the Age attribute, which are represented with '?' in the dataset. Determine the proportion of missing values and evaluate whether imputation is needed.
Class distribution: Explore the distribution of the class labels in the dataset. You can use bar plots to visualize the number of instances for each class, and determine whether the dataset is balanced or imbalanced.
Feature engineering: Consider creating new features that may be useful in predicting the class labels. For example, you could create a feature that combines the presence of specific clinical attributes or histopathological attributes.
Outlier detection: Explore the presence of any outliers in the dataset. Outliers can skew the distribution of the data and impact the performance of machine learning models. You can use boxplots or scatterplots to visualize the distribution of each attribute and identify any potential outliers.
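Several of the steps above (missing values analysis, class distribution, correlation analysis, outlier detection) can be sketched with pandas as follows. This is a minimal sketch on a made-up four-row table: the column names `erythema`, `scaling`, `family_history`, `age` and `class` are hypothetical stand-ins, the median imputation is one option among several, and the 1.5 × IQR rule is a common convention rather than something the dataset prescribes.

```python
import io

import pandas as pd

# Toy stand-in for the real file; column names and values are hypothetical.
raw = io.StringIO(
    "erythema,scaling,family_history,age,class\n"
    "2,2,0,55,1\n"
    "3,3,1,?,1\n"
    "2,1,0,26,2\n"
    "1,2,0,40,3\n"
)

# Missing values analysis: read '?' as NaN and measure its share in age.
df = pd.read_csv(raw, na_values="?")
missing_share = df["age"].isna().mean()
df["age"] = df["age"].fillna(df["age"].median())

# Class distribution: number of instances per class label.
class_counts = df["class"].value_counts().sort_index()

# Correlation analysis: pairwise correlations between the attributes.
corr = df.drop(columns="class").corr()


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


print(missing_share)
print(class_counts)
print(iqr_outliers(df["age"]))
```

A bar plot of `class_counts` shows at a glance whether the classes are balanced, and `corr` can be passed to a heatmap function for the correlation matrix.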