Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
Data Visualization
a. Scatter plot
i. The webapp should allow the user to select genes from the datasets and plot a 2D scatter plot between any two variables (expression/copy_number/chronos) for any pair of genes.
ii. The user should be able to filter and color data points using the metadata available in the file “metadata.csv”.
iii. The visualization could be interactive: ideally the user can hover over data points on the plot and see the relevant information (hint: visit https://plotly.com/r/, https://plotly.com/python; see the sketch after this list).
iv. As a quick reference, picture a scatter plot of the chronos score for the TTBK2 gene against the expression of the MORC2 gene, with coloring defined by the Gender/Sex column from the metadata file.
b. Boxplot/violin plot
i. The user should be able to select a gene and a variable (expression/chronos/copy_number) and generate a boxplot showing its distribution across multiple categories defined by a user-selected variable (a column from the metadata file).
ii. As an example, consider a violin plot of the chronos score for the CCL22 gene, grouped by ‘Lineage’.
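As a rough illustration of the hover-enabled scatter view, here is a minimal plotly sketch in R. The column names (TTBK2_chronos, MORC2_expression, Sex, cell_line) and the merged input file are hypothetical stand-ins, since the brief does not fix a schema:

library(plotly)

# Hypothetical merged table: one row per cell line, with metadata.csv columns
# joined to the per-gene scores selected by the user.
df <- read.csv("merged_scores_with_metadata.csv")

plot_ly(df,
        x = ~TTBK2_chronos,        # chronos score for gene TTBK2 (assumed column)
        y = ~MORC2_expression,     # expression for gene MORC2 (assumed column)
        color = ~Sex,              # coloring from the Gender/Sex metadata column
        text = ~cell_line,         # extra information shown on hover
        type = "scatter", mode = "markers")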
By Ben Jones [source]
This Kaggle dataset offers a detailed look at the 2018-2019 NFL season. It provides comprehensive data such as player number, position, height, weight, age, years of experience, college attended, and current team. These attributes support a wide range of research within the NFL community, from uncovering the demographics of individual teams to discovering correlations between players' salaries and performance. Whether you are searching for predictions about future seasons or building complex analyses, this dataset gives you a view of the 2018-2019 season like never before! Explore why each team is special, who shone individually that year, and what strategies could have been employed more efficiently with this captivating collection of 2018-2019 NFL Players Stats & Salaries!
- Get familiar with the characteristics of each column in the data set: Rk, Player, Pos, Tm, Cap Hit, Player #, HT, WT, Age, Exp, and College. Understanding these columns is key for further analysis, since each attribute offers unique insights into NFL players' salaries and performance during this season. For example, HT (height) and WT (weight) are useful if you want to study correlations between player body types and their salaries or game performance. Pos (position) is another critical factor, as it determines how much a team pays its players for specific roles on the field, such as quarterback or running back.
- Use visualizations: statistical data points are easier to understand when placed into graphical forms such as scatter plots or bar charts. Graphical representations help reveal correlations in datasets and let you draw conclusions quickly by comparing datasets side by side or juxtaposing attributes to explore trends across teams and players. You could also represent all 32 teams graphically according to their Cap Hits so that viewers can spot outlier values quickly without scanning a table full of numbers.
- Employ analytical techniques such as regular expression matching (RegEx) if needed. RegEx detects patterns within text fields, which makes it useful for extracting insights from long strings such as college or team names. This can lead toward deeper exploration of why certain franchises may have higher-salaried players than others.
- Finally, don't forget the mathematical tools at your disposal: basic statistics such as proportions, ratios, averages, and medians often reveal new facets of a dataset and uncover interesting connections between entities, such as how height compares across colleges.
We hope these tips help you unlock the hidden gems in this dataset.
- Analyzing the impact of position on salaries: This dataset can be used to compare salaries across different positions and analyze the correlations between players’ performance, experience, and salaries.
- Predicting future NFL MVP candidates: By analyzing popular statistical categories such as passing yards, touchdowns, interceptions and rushing yards for individual players over several seasons, researchers could use this data to predict future NFL MVPs each season.
- Exploring team demographics: By looking into individual teams' player statistics such as age, height and weight distribution, researchers can analyze and compare demographic trends across the league or within a single team during any given season.
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors - You are free to: - Share - copy and redistribute the material in any medium or format for any purpose, even co...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column, ‘Replicate’, indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column, ‘Condition’, indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column, ‘Value’, contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from Step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
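The script itself is only shown on the PowerPoint slide and is not reproduced here; the sketch below is an approximation of such a ggplot2 script, assuming the Replicate/Condition/Value format from Step 1, not the authors' exact code.

library(ggplot2)                                        # see Note 1

data <- read.csv(file.choose())                         # dialog box to pick the .csv from Step 1
data$Replicate <- factor(data$Replicate)                # treat replicates as categories
data$Condition <- factor(data$Condition)                # treat conditions as categories
graph <- ggplot(data, aes(x = Condition, y = Value))    # one column of dots per condition
# Final plotting command; Note 2 swaps this for a log-scale version.
graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
  geom_jitter(aes(col = Replicate)) + theme_bw()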
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT The desertification process causes soil degradation and a reduction in vegetation. The absence of visualisation techniques and the broad spatial and temporal dimension of the data hamper the identification of desertification and rapid decision-making by multidisciplinary teams. The 2D Scatter Plot is a two-dimensional visual analysis of reflectances in the red (630-690 nm) and near-infrared (760-900 nm) bands used to visualise the spectral response of the vegetation. The hypothesis of this study is that visualising the reflectances of the vegetation by means of a 2D scatter plot allows desertification to be inferred. The aim of this study was to identify desertified areas and characterise the spatial and temporal dynamics of the vegetation and soil during dry (DP) and rainy (RP) periods between 2000 and 2008, using a 2D scatter plot. The 2D scatter plot generated by the Envi® 4.8 software and the reflectances in bands 3 and 4 of the TM5 sensor were used within communities in the Irauçuba hub (Ceará, Brazil). The concentration densities of the near-infrared reflectances of the vegetation pixels were observed. Each community presented pixel concentrations with reflectances of less than 0.4 (40%) during each of the periods under evaluation, indicating little vegetation development, with further degradation caused by deforestation, the use of fire and overgrazing. The 2D scatter plot was able to show vegetation with low reflectance in the near infrared during both dry and rainy periods between 2000 and 2008, thereby allowing the occurrence of desertification to be inferred.
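For intuition, here is a minimal R sketch of the red-vs-NIR scatter described above; it assumes the two bands have already been extracted into numeric vectors red and nir of per-pixel reflectances on a 0-1 scale (both names are placeholders):

plot(red, nir, pch = ".",
     xlab = "Red reflectance (630-690 nm)",
     ylab = "NIR reflectance (760-900 nm)")
abline(h = 0.4, lty = 2)   # NIR reflectance below 0.4 indicated poor vegetation development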
The project is to conduct a principal components analysis of the flea beetle data (fleabeetledata.xlsx; Lubischew, A., On the use of discriminant functions in taxonomy, Biometrics 18 (1962), 455-477). The data has two groups. You will conduct three principal component analyses: one for each individual group and one for the entire data set ignoring groups. You will use S for the PCA.
(a) Carry out an initial investigation. Do not remove outliers or transform the data. Indicate if you had to process the data file in any way. Explain any conclusions drawn from the evidence and back up your conclusions. Hint: pay attention to potential differences between the groups.
(b) For the Haltica oleracea group,
i. Display the relevant sample covariance matrix S.
ii. List the eigenvalues and describe the percent contributions to the variance.
iii. Determine the number of principal components to retain and justify your answer by considering at least three methods.
iv. Give the eigenvectors for the principal components you retain.
v. Considering the coefficients of the principal components, describe dependencies of the principal components on the variables.
vi. Using at least the first two principal components, display scatter plots of pairs of principal components. Make observations about the plots.
(c) For the Haltica carduorum group, repeat parts i-vi as in (b).
(d) For the entire data set (ignoring groups), repeat parts i-vi as in (b).
(e) Compare the results for the three principal component analyses. Do you have any conclusions? A sketch of the covariance-based PCA in R follows below.
Key for flea beetle data:
x1 = distance of transverse groove from posterior border of prothorax
x2 = length of elytra
x3 = length of second antennal joint
x4 = length of third antennal joint
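A minimal R sketch of the covariance-based PCA, assuming the spreadsheet's measurement columns are named x1-x4 (the actual column names in fleabeetledata.xlsx may differ):

library(readxl)                                    # to read the .xlsx file
X <- as.matrix(read_excel("fleabeetledata.xlsx")[, c("x1", "x2", "x3", "x4")])
S <- cov(X)                                        # relevant sample covariance matrix S
eig <- eigen(S)                                    # eigenvalues and eigenvectors of S
pct <- 100 * eig$values / sum(eig$values)          # percent contributions to the variance
scores <- sweep(X, 2, colMeans(X)) %*% eig$vectors     # principal component scores
plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2")   # first two PCs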
Spatial transitions in color can aid any visual perception task, and their neural representation – an “integration of color and form” – is thought to begin at primary visual cortex (V1). Color and form integration is untested in mouse V1, yet studies show that the ventral retina provides the necessary substrate from green-sensitive rods and UV-sensitive cones. Here, we used two-photon imaging in V1 to measure spatial frequency (SF) tuning along four axes of rod and cone contrast space, including luminance and color. We first reveal that V1 has similar responsiveness to luminance and color, yet average SF tuning is significantly shifted lowpass for color. Next, guided by linear models, we used SF tuning along all four color axes to estimate the proportion of neurons that fall into classic models of color opponency – “single-”, “double-”, and “non-opponent”. Few neurons (~6%) fit the criteria for double-opponency, which are uniquely tuned for chromatic borders. Most of the population can be..., This data comes from two-photon imaging in mouse primary visual cortex. There is also Matlab code to run the simulations in figures 1, 6, 7, and 8. See the uploaded README files for details. Below is the top of README_for_dataset.doc. It describes the uploaded data set used in Rhim and Nauhaus: “Joint representations of color and form in mouse visual cortex described by random pooling from rods and cones”. It is a MATLAB .mat file, where each structure pertains to a given figure. In addition to the source data for the figures, it also has the following additions:
The same data set, but prior to culling the population according to the dashed box in the Figure 2 scatter plot. See variables appended with “…_all”.
Region-of-interest ID associated with each neuron.
Below is all the information in README_for_simulations.doc. To run the simulations for Figures 1, 6, 7, and 8, execute the cells in the high-level scripts of the following: Figure_1.m, Figure_6_7.m, Figure_8.m. Make sure all the other .m files are in your path.
Hi Folks,
Let's understand the importance of Data Visualization.
Below, we have four different data sets, each consisting of paired x and y values.
[Image: the four data sets]
Next, let's calculate some descriptive statistics, such as the mean, standard deviation, and correlation of each variable.
[Image: descriptive statistics of the four data sets]
Examining these numbers, the four data sets have nearly identical simple descriptive statistics.
However, when we plot the datasets on scatter plots, we can see that the four datasets look very different.
[Image: scatter plots of the four data sets]
Data 1 has a clear linear relationship; Data 2 has a curved, non-linear relationship; Data 3 has a tight linear relationship with one outlier; and Data 4 has nearly constant x values, with one extreme point dominating the apparent relationship.
Such datasets are known as Anscombe's Quartet.
Anscombe's quartet is a classic example of the importance of data visualization.
Anscombe's quartet is a set of four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphically represented. Each dataset consists of eleven (x,y) points.
[Image: Anscombe's quartet]
Anscombe's quartet illustrates the importance of plotting data before we analyze it. Descriptive statistics can be misleading, and they can't tell us everything we need to know about a dataset. Plotting the data on charts can help us to understand the shape of the distribution and to identify any outliers.
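You can reproduce all of this directly in R, since Anscombe's quartet ships with base R as the anscombe data frame:

data(anscombe)

# Nearly identical summary statistics across the four sets...
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  cat(sprintf("Data %d: mean(y) = %.2f, sd(y) = %.2f, cor(x, y) = %.3f\n",
              i, mean(y), sd(y), cor(x, y)))
}

# ...but four very different shapes once plotted.
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       main = paste("Data", i), xlab = "x", ylab = "y")
}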
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository is composed of 2 compressed files, with the contents described next.
--- code.tar.gz --- The source code that implements the pipeline, as well as code and scripts needed to retrieve time series, create the plots or run the experiments. More specifically:
+ prepare.py and main.py ⇨
The Python programs that implement the pipeline, both the auxiliary and the main pipeline
stages, respectively.
+ 'anomaly' and 'config' folders ⇨
Scripts and Python files containing the configuration and some basic functions that are
used to retrieve the information needed to process the data, like the actual resource
time series from OpenTSDB, or the job metadata from Slurm.
+ 'functions' folder ⇨
Several folders with the Python programs that implement all the stages of the pipeline,
either for the Machine Learning processing (e.g., extractors, aggregators, models), or
the technical aspect of the pipeline (e.g., pipelines, transformer).
+ plotDF.py ⇨
A Python program used to create the different plots presented, from the resource time
series to the evaluation plots.
+ several bash scripts ⇨
Used to run the experiments using a specific configuration, whether regarding which
transformers are chosen and how they are parametrized, or more technical aspects
involving how the pipeline is executed.
--- data.tar.gz --- The actual data and results, organized as follows:
+ jobs ⇨
All the jobs' resource time series plots for all the experiments, with a folder used
for each experiment. Inside each folder all the jobs are separated according to their
id, containing the plots for the different system resources (e.g., User CPU, Cached memory).
+ plots ⇨
All the predictions' plots for all the experiments in separated folders, mainly used for
evaluation purposes (e.g., scatter plot, heatmaps, Andrews curves, dendrograms). These
plots are available for all the predictors resulting from the pipeline execution. In
addition, for each predictor it is also possible to visualize the resource time series
grouped by clusters. Finally, the projections as generated by the dimension reduction
models, and the outliers detected, are also available for each experiment.
+ datasets ⇨
The datasets used for the experiments, which include the lists of job IDs to be processed
(CSV files), the results of each stage of the pipeline (e.g., features, predictions),
and the output text files generated by several pipeline stages. Among these latter
files it is worth noting the evaluation ones, which include all the prediction scores.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper proposes a dynamic analytical processing (DAP) visualization tool based on the Bubble-Wall Plot. It can be handily used to develop visual warning systems for visualizing the dynamic analytical processes of hazard data. Comparative analysis and case study methods are used in this research. Based on a literature review of Q1 publications since 2017, 23 types of data visualization approaches/tools are identified, including seven anomaly data visualization tools. This study presents three significant findings by comparing existing data visualization approaches. The primary finding is that no single visualization tool can fully satisfy industry requirements; this motivates academics to develop new DAP visualization tools. The second finding is that there are different views of Line Charts and various perspectives on Scatter Plots. The third is that different researchers may perceive an existing data visualization tool differently, with arguments between Scatter Plots and Line Charts and diverse opinions about Parallel Coordinate Plots and Scatter Plots. Users' awareness rises when they choose data visualization tools that satisfy their requirements. A comparative analysis based on five categories (Style, Value, Change, Correlation, and Others) with 26 subcategories of metric features shows that this new tool can effectively address the limitations of existing visualization tools, as it has three remarkable characteristics: it is the simplest cartographic tool, produces the most straightforward visual result, and is the most intuitive tool. Furthermore, this paper illustrates how the Bubble-Wall Plot can be effectively applied to develop a warning system for presenting dynamic analytical processes of hazard data in a coal mine. Lastly, this paper provides two recommendations, one implication, six research limitations, and eleven further study topics.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Input data, code, output data, summary plots, and run logs for an investigation of the population genetics of the Drosophila melanogaster Sussex LHM sample. For output graphs, see popgen_plots.png. Input data are in 'plink binary' format. The code is a unix/linux shell script containing commands for Plink to perform population genetic tests. The two R scripts contain i. a short command for making a subpopulation file, and ii. commands for plotting the output data. Both are initiated in the shell script. Platform and version information are available in the log files. Other information is available in the shell and R scripts. Broad observations are that the allele frequency distribution is normal except for a few humps around MAF 0.2-0.3 in the autosomes. Linkage disequilibrium, on average, levels out after ~200 bp, but there can still be some at distances of 300 kb. The population appears to be divided into four genetically distinct groups (on the IBD-PCA scatter plot), with Fst analysis indicating that this is caused by genetic variation around the centromeres. This is possibly caused by historic admixture and low centromeric recombination.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We use the same color-coding convention as in Figure 12. We plot the values of two modified statistical complexities, which we call M-Gleason 3 and M-Gleason 5. Instead of using the equiprobable distribution as our reference probability distribution (for the computation of the Jensen-Shannon divergence of the gene expression profile to this distribution), as required for the MPR-Statistical Complexity calculation, we used a different one. For M-Gleason 3, the reference probability distribution is obtained by averaging the probability distributions of all samples labelled as Gleason 3 (M-Gleason 5 is calculated analogously). This is analogous to our approach in melanoma (Figure 5), in which we used normal and metastatic samples as reference sets for a modified statistical complexity. We observe that, even in this case, 02_003E and 03_063 continue to appear as outliers. We have also observed that deleting these two samples did not significantly alter the identification of biomarkers.
Use the Chart Viewer template to display bar charts, line charts, pie charts, histograms, and scatterplots to complement a map. Include multiple charts to view with a map or side by side with other charts for comparison. Up to three charts can be viewed side by side or stacked, but you can access and view all the charts that are authored in the map.
Examples: Present a bar chart representing average property value by county for a given area. Compare charts based on multiple population statistics in your dataset. Display an interactive scatterplot based on two values in your dataset along with an essential set of map exploration tools.
Data requirements: The Chart Viewer template requires a map with at least one chart configured.
Key app capabilities: Multiple layout options - choose Stack to display charts stacked with the map, or choose Side by side to display charts side by side with the map. Manage chart - reorder, rename, or turn charts on and off in the app. Multiselect chart - compare two charts in the panel at the same time. Bookmarks - allow users to zoom and pan to a collection of preset extents that are saved in the map. Home, Zoom controls, Legend, Layer List, Search.
Supportability: This web app is designed responsively to be used in browsers on desktops, mobile phones, and tablets. We are committed to ongoing efforts towards making our apps as accessible as possible. Please feel free to leave a comment on how we can improve the accessibility of our apps for those who use assistive technologies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains regionalization factors for electricity generation and demand time series in Germany for the years 2019-2022. The factors can be used to distribute national generation and demand time series available from SMARD or ENTSO-E to the federal state level. The methods underlying the regionalization factors are described in [1], with a focus on the year 2021; an extended version of the dataset covering the years 2019-2022 is also included for comprehensive analysis. Moreover, the dataset comprises the corresponding regionalized generation and demand time series at the federal state level of Germany. These time series have been generated using the provided distribution factors for the years 2019-2022 and the corresponding generation and demand time series from SMARD [2]. Additionally, the regionalization methodology for the distributed generation and demand data for the year 2021 has been supplemented with validation data, as described in [1]. This data has been cross-checked against the available SMARD Transmission System Operator (TSO) data. A description of the preprocessing required to obtain the TSO data comparison is provided in a separate .txt file. A PDF document has been prepared which includes scatter plots comparing actual and allocated generation per production type or demand data for TSOs on an hourly basis for the year 2021.
"static_regionalization_factors.2021[csv, xlsx]"
Each column corresponds to one factor per federal state and per production type or demand. Regionalization factors are based on share of generation capacity in each state (generation) or population and GDP (demand).
"dynamic_regionalization_factors_2021.[csv, xlsx]"
"dynamic_regionalization_factors_all.[csv, xlsx]"
Each column corresponds to one factor per federal state and per production type or demand. Each row corresponds to a specific hour of the years 2019 through 2022. Regionalization factors are based on a combination of per unit generation data and share of generation capacity in each state, simulated renewable generation data based on spatio-temporal weather data and distribution of wind and solar generation capacities, and a regionalized load dataset for 2015 [3].
"time_series_federal_states_all.[csv, xlsx]"
Each column corresponds to the allocated electricity generation or demand per federal state per production type or demand, in units of MWh. Each row corresponds to a specific hour of the years 2019 through 2022. The regionalized generation and demand time series have been created using the dynamic regionalization factors provided in the dataset, in conjunction with the national electricity generation and demand data of Germany as provided by SMARD [2].
"TSO_actual.[csv, xlsx]"
"TSO_allocated.[csv, xlsx]"
Each column corresponds to the spatially aggregated electricity generation per type or demand per TSO in units of GWh. Each row corresponds to one hour of the year 2021. The TSOs in Germany do not hold direct responsibility for individual federal states, but rather for specific regions. In order to assess the validity of the regionalization methodology employed, it was necessary to generate data at the NUTS3 level and subsequently aggregate it to correspond with the relevant TSOs. The data is pre-processed at NUTS3 level and then undergoes the same methodology as outlined in [1]. The preprocessing steps required to map the installed capacity to the TSO level are explained in the accompanying .txt file. The allocated generation and demand data are aggregated to correspond to the TSO level using a shapefile of mapped regions in Germany that correspond to the TSOs [4]. The actual TSO data is generation and demand as published by SMARD [2]. The accompanying PDF presents scatter plots that showcase the actual vs allocated hourly generation types or demand per TSO, expanding on the information provided in the article.
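As a rough illustration (not part of the dataset's own tooling), applying a dynamic factor amounts to an hourly multiplication of the national series by a state's factor column. A minimal R sketch, in which every file and column name is a hypothetical placeholder:

fac <- read.csv("dynamic_regionalization_factors_all.csv")   # hourly factors per state and type
nat <- read.csv("national_series_smard.csv")                 # hypothetical national series in MWh
# State-level hourly wind generation = national value x the state's factor
state_wind <- nat$wind_onshore * fac$BY_wind_onshore         # hypothetical column names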
[1] M. Sundblad, T. Fürmann, A. Weidlich and M. Schäfer, "Load and generation time series for German federal states: Static vs. dynamic regionalization factors," 2023 Open Source Modelling and Simulation of Energy Systems (OSMSES), Aachen, Germany, 2023, pp. 1-6, doi: 10.1109/OSMSES58477.2023.10089686.
[2] Bundesnetzagentur | SMARD.de
[3] Matthias Kühnbach, Anke Bekk, and Anke Weidlich (2021). Prepared for regional self-supply? On the regional fit of electricity demand and supply in Germany. Energy Strategy Reviews, 34:100609.
[4] Frysztacki, Martha Maria. (2023). Mapping of districts to control zones of German Transmission System Operators (TSOs) (v0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7530196
a) Description: A synthetic dataset consisting of 20,000 power and wind speed values. The goal of this dataset is to objectively quantify power curve modelling techniques for wind turbines.
b) Size: 580.0 kB
c) Platform: Any OS or programming language can read a txt file.
d) Environment: As this is a txt file, any modern OS will do. The txt file consists of comma-separated values, so all modern programming languages can be used to read this file.
e) Major Component Description: There are 20,001 rows in the txt file. The first row contains the column headers. The other 20,000 lines contain the corresponding values of the columns. There are two columns: the first is the power and the second the wind speed.
f) Detailed Set-up Instructions: This depends on the platform and programming language. Since this is a plain txt file with delimiter-separated values, a broad range of options is possible and can be looked up.
g) Detailed Run Instructions: /
h) Output Description: When plotting the wind speed values vs the power values using a scatter plot (e.g., MATLAB or Python matplotlib), a power curve can be seen.
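A minimal R sketch of h), assuming the file is named power_curve.txt (a hypothetical name) and is comma-separated with a header row as described in d) and e); adjust the separator if your copy is tab-delimited:

d <- read.csv("power_curve.txt")    # column 1 = power, column 2 = wind speed
plot(d[[2]], d[[1]], pch = ".",
     xlab = "Wind speed", ylab = "Power",
     main = "Empirical power curve")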
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract Short-period variability in plankton communities is poorly documented, especially for variations occurring in specific groups in the assemblage, because traditional analysis is laborious and time-consuming. Moreover, it does not allow the high sampling frequency required for decision making. To overcome this limitation, we tested the submersible CytoSub flow cytometer. This device was anchored approximately 10 metres from the low tide line at a depth of 1.5 metres for 12 hours to monitor the plankton at a site in the biological reserve of Barra da Tijuca beach, Rio de Janeiro. Data analysis was performed with two-dimensional scatter plots, individual pulse shapes and micro-image acquisition. High-frequency monitoring results for two interesting groups are shown. The abundance and carbon biomass of ciliates were relatively stable, whereas those of dinoflagellates were highly variable throughout the day. A linear regression of biovolume measures between classical microscopy and in situ flow cytometry demonstrated a high degree of fit. Despite the success of the trial and the promising results obtained, the large volume of images generated by the method also creates a need to develop pattern recognition models for automatic classification of in situ cytometric images.
The project is to conduct a principal components analysis of the Mali farm data (malifarmdata.xlsx; R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, Pearson, New Jersey, 2019). You will use S for the PCA.
(a) Store the data in matrix X.
(b) Carry out an initial investigation. Indicate if you had to process the data file in any way. Do not transform the data. Explain any conclusions drawn from the evidence and back up your conclusions. Hint: pay attention to detection of outliers.
i. The data in rows 25, 34, 52, 57, 62, 69, 72 are outliers. Provide at least two indicators for each of these data that justify this claim.
ii. Explain any other conclusions drawn from the initial investigation.
(c) Create a data matrix by removing the outliers (see the R sketch after this description).
(d) Carry out a principal component analysis on X (with the outliers).
i. Give the relevant sample covariance matrix S.
ii. List the eigenvalues and describe the percent contributions to the variance.
iii. Determine the number of principal components to retain and justify your answer by considering at least three methods.
iv. Give the eigenvectors for the principal components you retain.
v. Considering the coefficients of the principal components, describe dependencies of the principal components on the variables.
vi. Using at least the first two principal components, display appropriate scatter plots of pairs of principal components. Make observations about the plots.
(e) Carry out a principal component analysis on the outlier-free matrix from (c), repeating parts i-vi as in (d).
(f) Compare the results for the two analyses. How much effect did the outliers have on the principal component analysis? Which result do you like more and why?
(g) Include your code.
Key for Mali farm data:
Family = number of people in the household
DistRD = distance in kilometers to the nearest passable road
Cotton = hectares of cotton planted in 2000
Maize = hectares of maize planted in 2000
Sorg = hectares of sorghum planted in 2000
Millet = hectares of millet planted in 2000
Bull = total number of bullocks
Cattle = total number of cattle
Goat = total number of goats
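A minimal R sketch of the outlier handling, assuming the data are already stored in matrix X per part (a); the row indices are those flagged in part (b):

out <- c(25, 34, 52, 57, 62, 69, 72)   # rows identified as outliers in (b)
X_clean <- X[-out, ]                   # outlier-free data matrix for part (e)
S_clean <- cov(X_clean)                # its sample covariance matrix S
eigen(S_clean)                         # eigenvalues and eigenvectors for the PCA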
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
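The Iris warm-up described above can be previewed in a few lines of R, since the iris data frame ships with base R:

data(iris)
summary(iris)                               # summary statistics for each column
cor(iris$Sepal.Length, iris$Petal.Length)   # correlation between two measurements
hist(iris$Sepal.Length, xlab = "Sepal length (cm)", main = "")
plot(iris$Petal.Length, iris$Petal.Width,   # scatter plot, colored by species
     col = iris$Species, xlab = "Petal length (cm)", ylab = "Petal width (cm)")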
This data set contains 20 hyperspectral surface reflectance images collected by the EO-1 satellite Hyperion sensor at 30-m resolution and covering the entire Amazon Basin for 2002-2005. All images were converted to GeoTiff format for distribution. The respective ENVI *.hdr files are included as companion files and contain image projection and band information. The selected images were processed using ENVI software as described in Chambers et al. (2009). Bands with uncalibrated wavelengths and those with low spectral response were removed, leaving a spectral subset of generally 196 bands (some images have fewer). A cloud mask was developed using 2-d scatter plots of variable reflectance bands to highlight clouds as regions of interest (ROIs), allowing clouds and cloud edges to be masked. A de-streaking algorithm was then applied to the image to reduce variance in balance between the vertical columns. Apparent surface reflectance was calculated for this balanced image using the atmospheric correction algorithm ACORN in 1.5pb mode (AIG-LLC, Boulder, CO). The images (18 of the 20) were georeferenced using the corresponding Advanced Land Imager (ALI) satellite images.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is supplementary material to our recent paper, which is under review now and will be provided once it is published.
An active composite (AC) plate considered here consists of two materials: an active material and a passive material. The two materials have a strain mismatch upon actuation, thus inducing the shape transformation of AC plates. The shape depends on the material distribution. The voxel-level material distribution can be easily 3D printed, allowing the printed objects to change their shapes, which is the so-called 4D printing.
There are four types of material distributions defined on 15x15x2 voxels: fully random, island, hierarchical, and spinodal. Two additional types are the hierarchical 3x3 and 5x5, used for testing.
All data are generated through Abaqus/Standard, using the incompressible neo-Hookean material model, and the shapes are extracted and stored through MATLAB. These shapes correspond to a strain mismatch of 0.05.
Trained networks:
In addition to the datasets, the trained networks for forward predictions are also provided. These networks are trained on the so-called original dataset (containing the fully random and island datasets). Specifically, the network for (x,y) is in
mat_netD2s_f5convBCxyzEnd_xy_12epoDrop.mat
and the network for (z) is in
mat_netD2s_f5convBCxyzEnd_z_12epoDrop.mat.
These two files also contain the corresponding network-predicted coordinates on the validation dataset. These coordinate data, specifically
f5_XYZPredxy, f5_XYZValidxy for (x,y), and
f5_XYZPredz, f5_XYZValidz for (z),
are the source data used to generate scatter plots of the predicted versus true coordinate values, i.e., Fig. 3g to 3h of our paper.
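To regenerate that scatter from the .mat file in R, a sketch using the R.matlab package might look like the following; it assumes the listed variables are stored as plain numeric arrays, and note that readMat converts underscores in MATLAB variable names to dots:

library(R.matlab)
m <- readMat("mat_netD2s_f5convBCxyzEnd_xy_12epoDrop.mat")
plot(as.vector(m$f5.XYZValidxy), as.vector(m$f5.XYZPredxy), pch = ".",
     xlab = "True (x,y)", ylab = "Predicted (x,y)")
abline(0, 1)   # perfect-prediction reference line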
Uploads in Version 3:
mat_GCN3z_Acht21x64_SC4.mat contains the best-performing graph convolutional network (GCN) and the related source data for Supplementary Fig. 5 of our paper.
mat_netD2s_newData1_xy_12epoDrop.mat and mat_netD2s_newData1_z_12epoDrop.mat store the ResNets trained using the new dataset (containing the fully random, island, hierarchical, and spinodal datasets). They also store the related source data for Supplementary Figs. 7 and 8 of our paper.
The dataset for the article "The current utilization status of wearable devices in clinical research". Analyses were performed using JMP Pro 16.10. The file extension "jrp" denotes a file of the statistical analysis software JMP, which contains both the analysis code and the data set. In case JMP is not available, a "csv" file with the data set and the analysis code as a JMP script in "rtf" format are also provided. The JMP report files used to create the figures (Figure 2: 2D scatterplot of relationships between annual enrollment and trial number over time), the csv dataset extracted from them, and the JMP scripts are:
2d plot.jrp
2d plot(JMP script).rtf
dataset_2d.csv