44 datasets found
  1. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    pptx (available download formats)
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column, ‘Replicate’, indicates the biological replicates; in the example, the month and year in which each replicate was performed are indicated. The second column, ‘Condition’, indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column, ‘Value’, contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide, paste it into the R console, and execute it. In the dialog box, select the input .csv file from Step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Adjust the window to your liking and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
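
    For readers working outside R, here is a rough Python analogue of the protocol's plot (a sketch only, not the authors' script: it assumes pandas, seaborn and matplotlib, the Replicate/Condition/Value columns from Step 1, and a hypothetical file name data.csv):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("data.csv")  # the .csv file formatted in Step 1

    # Superimpose boxplots and jittered dots, coloring dots by replicate.
    ax = sns.boxplot(data=df, x="Condition", y="Value", color="white")
    sns.stripplot(data=df, x="Condition", y="Value", hue="Replicate", ax=ax)
    # ax.set_yscale("log")  # equivalent of Note 2's scale_y_log10()
    plt.savefig("graph.pdf")  # Step 3: export as .pdf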

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.

    https://cran.r-project.org/

    http://ggplot2.org/

  2. Ultimate_Analysis

    • data.mendeley.com
    Updated Jan 28, 2022
    + more versions
    Cite
    Akara Kijkarncharoensin (2022). Ultimate_Analysis [Dataset]. http://doi.org/10.17632/t8x96g88p3.2
    Explore at:
    Dataset updated
    Jan 28, 2022
    Authors
    Akara Kijkarncharoensin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This database studies performance inconsistency in biomass HHV models based on ultimate analysis. The research null hypothesis is that the rank of a biomass HHV model is consistent across datasets. Fifteen biomass models are trained and tested on four datasets; within each dataset, the rank invariability of these 15 models indicates performance consistency.

    The database includes the datasets and source codes used to analyze the performance consistency of the biomass HHV models. The datasets are stored in tabular form in an Excel workbook. The source codes implement the biomass HHV machine learning models through MATLAB object-oriented programming (OOP). These machine learning models consist of eight regression models, four supervised learning models, and three neural networks.

    An Excel workbook, "BiomassDataSetUltimate.xlsx," collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV data points from 20 pieces of literature. The column names of the worksheet indicate the elements of the ultimate analysis on a % dry basis. The HHV column refers to the higher heating value in MJ/kg. The following worksheet, "Full Residuals," backs up the residuals of model testing based on the 20-fold cross-validations. The article (Kijkarncharoensin & Innet, 2021) verifies the performance consistency through these residuals. The other worksheets present the literature datasets used to train and test the model performance.

    A file named "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The folder hierarchy in this file mirrors the class structure of the machine learning models. These classes extend the features of MATLAB's Statistics and Machine Learning Toolbox to support, e.g., k-fold cross-validation. The MATLAB script named "runStudyUltimate.m" is the article's main program for analyzing the performance consistency of the biomass HHV models through the ultimate analysis. The script loads the datasets from the Excel workbook and automatically fits the biomass models through the OOP classes.

    The first section of the MATLAB script generates the most accurate model by optimizing the model's hyperparameters. The first run takes a few hours to train the machine learning models via trial and error. The trained models can be saved in a MATLAB .mat file and loaded back into the MATLAB workspace. The remaining script, separated by script section breaks, performs the residual analysis to inspect the performance consistency. Furthermore, a figure of the biomass data in a 3D scatter plot and box plots of the prediction residuals are produced. Finally, the interpretations of these results are examined in the author's article.
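
    To illustrate the rank-consistency idea outside MATLAB, a schematic Python sketch (not the repository's code; the RMSE values are made up) ranks the 15 models on each of the 4 datasets and compares the rankings pairwise with Kendall's tau:

    import numpy as np
    from scipy.stats import kendalltau

    # rmse[i, j] = test RMSE of model i on dataset j (15 models x 4 datasets, synthetic values)
    rng = np.random.default_rng(0)
    rmse = rng.uniform(0.5, 2.0, size=(15, 4))

    # Per-dataset rank of each model (0 = most accurate).
    ranks = rmse.argsort(axis=0).argsort(axis=0)

    # Tau close to 1 for every pair of datasets would indicate consistent model ranks.
    for a in range(4):
        for b in range(a + 1, 4):
            tau, _ = kendalltau(ranks[:, a], ranks[:, b])
            print(f"datasets {a} vs {b}: tau = {tau:.2f}")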

    Reference : Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.

  3. Data from: Worldwide benchmark of modelled solar irradiance data annex

    • zenodo.org
    • portaldelainvestigacion.uma.es
    • +1more
    bin, zip
    Updated Apr 26, 2023
    Cite
    Anne Forstinger; Stefan Wilbert; Adam R. Jensen; Birk Kraas; Carlos Fernández-Peruchena; Christian Gueymard; Dario Ronzio; Dazhi Yang; Elena Collino; Jesús Polo Martinez; Jose A. Ruiz-Arias; Natalie Hanrieder; Philippe Blanc; Yves-Marie Saint-Drenan (2023). Worldwide benchmark of modelled solar irradiance data annex [Dataset]. http://doi.org/10.5281/zenodo.7867003
    Explore at:
    bin, zip (available download formats)
    Dataset updated
    Apr 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anne Forstinger; Stefan Wilbert; Adam R. Jensen; Birk Kraas; Carlos Fernández-Peruchena; Christian Gueymard; Dario Ronzio; Dazhi Yang; Elena Collino; Jesús Polo Martinez; Jose A. Ruiz-Arias; Natalie Hanrieder; Philippe Blanc; Yves-Marie Saint-Drenan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data annex contains the supplementary data to the IEA PVPS Task 16 report "Worldwide benchmark of modeled solar irradiance data" from 2023. The dataset includes visualizations and tables of the results as well as information concerning the reference stations.

    The dataset contains the following types of files (a filename-parsing sketch follows the list):

    • StationList.xlsx: list of all stations, including their coordinates, climate zone, station code, continent, altitude AMSL, data source, number of available test data sets, station type (Tier-1 or Tier-2), and available calibration record.
    • Result tables in folder “ResultTables”: Folders “climate_zones” and “continents” contain the tables described in Section 5.3. The filenames are “Component_metric_in_subgroup.html” with “component” DNI or GHI, “metric” describing the metric (see Table 3), and “subgroup” describing the continent or climate zone.
    • World maps: The folder “Resultmaps” contains world maps of the metrics described in Section 5.2. Either four or three metrics, depending on the map, are included in each pdf. A legend describing the meaning of the point size is also included.
    • Scatter plots of test vs. reference irradiance: The folder “Scatterplots” contains two folders, “DNI” and “GHI”, for the two investigated components. Three subfolders are also contained in these two folders:
      • The subfolders “plotsPerSiteYear” contain plots named “scatOverviewCOMPONENT_SITEYYYY.png”, where “COMPONENT” is either DNI or GHI, SITE is the three-letter site abbreviation, and YYYY is the evaluated year. The png plots include the scatterplots for all test data sets evaluated for the case specified by the filename.
      • The subfolders “plotsPerTestdataProvider” contain plots named “scatOverviewTESTDATASET_COMPONENTYYYY.png”, where “TESTDATASET” describes the test data set, “COMPONENT” is either DNI or GHI, and YYYY is the evaluated year. The png plots include the scatterplots for all sites evaluated for the case specified by the filename.
      • The subfolders “plotsPerTestdataProviderSamePosPerStat” contain the same scatterplots as “plotsPerTestdataProvider”, but using a slightly different visualization method. Here, the position of each scatterplot for a given site within the plot is always the same. Although this yields many empty subplots and small scatterplots, it can be helpful to rapidly browse through the plots if only one or a few stations are of interest.
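
    As an example of working with this naming scheme, the sketch below groups the per-site-year scatterplots by site (Python; the local paths are hypothetical):

    import re
    from collections import defaultdict
    from pathlib import Path

    # Matches "scatOverviewCOMPONENT_SITEYYYY.png", e.g. scatOverviewGHI_ABC2019.png.
    pattern = re.compile(r"scatOverview(DNI|GHI)_([A-Za-z]{3})(\d{4})\.png")
    by_site = defaultdict(list)

    for f in Path("Scatterplots/GHI/plotsPerSiteYear").glob("*.png"):
        m = pattern.match(f.name)
        if m:
            component, site, year = m.groups()
            by_site[site].append((int(year), f.name))

    for site, plots in sorted(by_site.items()):
        print(site, sorted(plots))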
  4. Data from: Detecting desertification in different years and rainfall regimes...

    • scielo.figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Thiago Costa dos Santos; Adunias dos Santos Teixeira; Fabrício da Silva Terra; Luis Clenio Jário Moreira; Raul Shiso Toma (2023). Detecting desertification in different years and rainfall regimes by 2D Scatter Plot [Dataset]. http://doi.org/10.6084/m9.figshare.19904126.v1
    Explore at:
    tiff (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    SciELO journals
    Authors
    Thiago Costa dos Santos; Adunias dos Santos Teixeira; Fabrício da Silva Terra; Luis Clenio Jário Moreira; Raul Shiso Toma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The desertification process causes soil degradation and a reduction in vegetation. The absence of visualisation techniques and the broad spatial and temporal dimensions of the data hamper the identification of desertification and rapid decision-making by multidisciplinary teams. The 2D Scatter Plot is a two-dimensional visual analysis of reflectances in the red (630 - 690 nm) and near-infrared (760 - 900 nm) bands used to visualise the spectral response of the vegetation. The hypothesis of this study is that visualising the reflectances of the vegetation by means of a 2D scatter plot will allow desertification to be inferred. The aim of this study was to identify desertified areas and characterise the spatial and temporal dynamics of the vegetation and soil during dry (DP) and rainy (RP) periods between 2000 and 2008, using a 2D scatter plot. The 2D scatter plot generated by the Envi® 4.8 software and the reflectances in bands 3 and 4 of the TM5 sensor were used within communities in the Irauçuba hub (Ceará, Brazil). The concentration densities of the near-infrared reflectances of the vegetation pixels were observed. Each community presented pixel concentrations with reflectances of less than 0.4 (40%) during each of the periods under evaluation, indicating little vegetation development, with further degradation caused by deforestation, the use of fire and overgrazing. The 2D scatter plot was able to show vegetation with low reflectance in the near infrared during both dry and rainy periods between 2000 and 2008, thereby inferring the occurrence of desertification.
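
    The same kind of red/NIR scatter can be reproduced outside Envi; a minimal Python sketch (synthetic reflectances standing in for TM5 bands 3 and 4):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    red = rng.uniform(0.0, 0.6, 10_000)  # band 3 reflectance (630-690 nm)
    nir = rng.uniform(0.0, 0.8, 10_000)  # band 4 reflectance (760-900 nm)

    plt.scatter(red, nir, s=1, alpha=0.2)
    plt.axhline(0.4, linestyle="--", color="red")  # NIR < 0.4: little vegetation development
    plt.xlabel("Red reflectance")
    plt.ylabel("NIR reflectance")
    plt.show()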

  5. Large-Scale Preference Dataset

    • kaggle.com
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Large-Scale Preference Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/large-scale-preference-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Large-Scale Preference Dataset

    Training Powerful Reward & Critic Models with Aligned Language Models

    By Huggingface Hub [source]

    About this dataset

    UltraFeedback is a large-scale, fine-grained, and diverse preference dataset built to train powerful reward and critic models with aligned language models. Its prompts are drawn from distinct sources such as UltraChat, ShareGPT, Evol-Instruct, and TruthfulQA, for a total of 256k samples. The correct and incorrect answers attached to each prompt are included in the same data file, making the dataset straightforward to explore for a wide array of AI-driven projects.


    How to use the dataset

    The first step is to understand the content of the dataset, including source, models, correct answers and incorrect answers. Knowing which language models (LMs) were used to generate completions can help you better interpret the data in this dataset.

    Once you are familiar with the column titles and their meanings, it’s time to begin exploring! To maximize your insight into this data set, use a variety of visualization techniques, such as scatter plots or bar charts, to view sample distributions across different LMs or answer types. Analyzing trends between incorrect and correct answers through data manipulation techniques such as merging sets can also provide valuable insights into preferences across different prompts and sources.

    Finally, you may want to try running LR or other machine learning models on this dataset in order to create simple models for predicting preferences given inputs from real-world scenarios, for tasks that require a nuanced understanding of instructions provided by one’s peers or superiors.

    The possibilities for further exploration of this dataset are endless - now let’s get started!
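
    As a concrete starting point, a first pass over train.csv with pandas (a sketch; column names as documented in the Columns section below) might look like:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("train.csv")
    print(df[["source", "models"]].describe())

    # Sample distribution across prompt sources, as a bar chart.
    df["source"].value_counts().plot(kind="bar")
    plt.ylabel("samples")
    plt.tight_layout()
    plt.show()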

    Research Ideas

    • Training sentence completion models on the dataset to generate responses with high accuracy and diversity.
    • Creating natural language understanding (NLU) tasks such as question-answering and sentiment analysis using the aligned dataset as training/testing sets.
    • Developing supervised learning algorithms that use techniques like reward optimization, with potential applications in building machine translation systems from scratch or in upstream text-generation tasks like summarization and dialog generation.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication (No Copyright). You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:----------------------|:---------------------------------------------------------------|
    | source | The source of the data. (String) |
    | instruction | The instruction given to the language models. (String) |
    | models | The language models used to generate the completions. (String) |
    | correct_answers | The correct answers to the instruction. (String) |
    | incorrect_answers | The incorrect answers to the instruction. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  6. NBA Rookies Performance Statistics and Minutes

    • kaggle.com
    Updated Jan 15, 2023
    Cite
    The Devastator (2023). NBA Rookies Performance Statistics and Minutes [Dataset]. https://www.kaggle.com/datasets/thedevastator/nba-rookies-performance-statistics-and-minutes-p/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 15, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    Description

    NBA Rookies Performance Statistics and Minutes Played: 1980-2016

    Tracking Basketball Prodigies' Growth and Achievements

    By Gabe Salzer [source]

    About this dataset

    This dataset contains essential performance statistics for NBA rookies from 1980-2016. Here you can find minutes per game, points scored, field goals made and attempted, three-pointers made and attempted, free throws made and attempted (with the respective percentages for each), offensive rebounds, defensive rebounds, assists, steals, blocks, turnovers, efficiency rating and Hall of Fame induction year. It is organized in descending order by minutes played per game as well as draft year. This Kaggle dataset is an excellent resource for basketball analysts seeking a better understanding of how rookies have evolved over the years, from their stats to their induction into the Hall of Fame. With its great detail on individual players' performance data, this dataset allows you to compare performances across different eras in NBA history along with overall trends in rookie statistics. Compare rookies drafted far apart or those that played together: whatever your goal may be!


    How to use the dataset

    This dataset is perfect for providing insight into the performance of NBA rookies over an extended period of time. The data covers rookie stats from 1980 to 2016 and includes statistics such as points scored, field goals made, free throw percentage, offensive rebounds, defensive rebounds and assists. It also provides the name of each rookie along with the year they were drafted and their Hall of Fame class.

    This data set is useful for researching how rookies’ stats have changed over time in order to compare different eras or identify trends in player performance. It can also be used to evaluate players by comparing their stats against those of other players or previous years’ stats.

    In order to use this dataset effectively, a few tips are helpful:

    • Consider using Field Goal Percentage (FG%), Three Point Percentage (3P%) and Free Throw Percentage (FT%) to measure a player’s efficiency beyond just points scored or field goals made/attempted (FGM/FGA).

    • Look out for anomalies such as low efficiency ratings despite high minutes played: this could indicate either that a player has not had enough playing time for their statistics to reach what would be their per-game average with more minutes, or that they simply did not play well over that short period with limited opportunities.

    • Try different visualizations with the data, such as histograms, line graphs and scatter plots; each may offer different insights into varied aspects of the data set, like comparisons between individual years versus aggregate trends over multiple years.

      Lastly, it is important to keep in mind whether you're dealing with cumulative totals over multiple seasons versus individual season averages or per-game numbers when attempting analysis on these sets! A minimal starting sketch follows.
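
    A sketch of the anomaly check from the second tip above (the minutes and efficiency column names, MIN and EFF, are assumptions about this CSV, not documented fields):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("NBA Rookies by Year_Hall of Fame Class.csv")

    plt.scatter(df["MIN"], df["EFF"], s=8, alpha=0.4)  # minutes per game vs efficiency rating
    plt.xlabel("Minutes per game")
    plt.ylabel("Efficiency rating")
    plt.show()

    # Heavy minutes but low efficiency: candidates for a closer look.
    suspect = df[(df["MIN"] > df["MIN"].quantile(0.75)) & (df["EFF"] < df["EFF"].quantile(0.25))]
    print(suspect[["Name", "MIN", "EFF"]].head())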

    Research Ideas

    • Evaluating the performance of historical NBA rookies over time and how this can help inform future draft picks in the NBA.
    • Analysing the relative importance of certain performance stats, such as three-point percentage, to overall success and Hall of Fame induction from 1980-2016.
    • Comparing rookie seasons across different years to identify common trends in terms of statistical contributions and development over time

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: Dataset copyright by authors.

    You are free to:
    • Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt: remix, transform, and build upon the material for any purpose, even commercially.

    You must:
    • Give appropriate credit: provide a link to the license, and indicate if changes were made.
    • ShareAlike: distribute your contributions under the same license as the original.
    • Keep intact: all notices that refer to this license, including copyright notices.

    Columns

    File: NBA Rookies by Year_Hall of Fame Class.csv

    | Column name | Description |
    |:-----------------------|:------------------------------------------------------------------|
    | Name | The name of... |

  7. Additional file 1 of ChromoMap: an R package for interactive visualization...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    html
    Updated May 31, 2023
    Cite
    Lakshay Anand; Carlos M. Rodriguez Lopez (2023). Additional file 1 of ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes [Dataset]. http://doi.org/10.6084/m9.figshare.18230845.v1
    Explore at:
    html (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lakshay Anand; Carlos M. Rodriguez Lopez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. Example of chromoMap interactive plot constructed using various features of chromoMap including polyploidy (used as multi-track), feature-associated data visualization (scatter and bar plots), chromosome heatmaps, data filters (color-coded scatter and bars). Differential gene expression in a cohort of patients positive for COVID19 and healthy individuals (NCBI Gene Expression Omnibus id: GSE162835) [12]. Each set of five tracks labeled with the same chromosome ID (e.g. 1-22, X & Y) contains the following information: From top to bottom: (1) number of differentially expressed genes (DEGs) (FDR < 0.05) (bars over the chromosome depictions) per genomic window (green boxes within the chromosome). Windows containing ≥ 5 DEGs are shown in yellow. (2) DEGs (FDR < 0.05) between healthy individuals and patients positive for COVID19 visualized as a scatterplot above the chromosome depiction (genes with logFC ≥ 2 or logFC ≤ −2 are highlighted in orange). Dots above the grey dashed line represent upregulated genes in COVID19 positive patients. Heatmap within chromosome depictions indicates the average LogFC value per window. (3–4) Normalized expression of differentially expressed genes (scatterplot) and of each genomic window containing DEG (green scale heatmap) in (3) patients with severe/critical outcomes and (4) asymptomatic/mild outcome patients. (5) logFC of DEGs between healthy individuals and patients positive for COVID19 visualized as scatter plot color-coded based on the metabolic pathway each DEG belongs to.

  8. US Regional Sales Data

    • kaggle.com
    Updated Aug 14, 2023
    Cite
    Abu Talha (2023). US Regional Sales Data [Dataset]. https://www.kaggle.com/datasets/talhabu/us-regional-sales-data/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 14, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abu Talha
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides comprehensive insights into US regional sales data across different sales channels, including In-Store, Online, Distributor, and Wholesale. With a total of 17,992 rows and 15 columns, this dataset encompasses a wide range of information, from order and product details to sales performance metrics. It offers a comprehensive overview of sales transactions and customer interactions, enabling deep analysis of sales patterns, trends, and potential opportunities.

    Columns in the dataset:

    • OrderNumber: A unique identifier for each order.
    • Sales Channel: The channel through which the sale was made (In-Store, Online, Distributor, Wholesale).
    • WarehouseCode: Code representing the warehouse involved in the order.
    • ProcuredDate: Date when the products were procured.
    • OrderDate: Date when the order was placed.
    • ShipDate: Date when the order was shipped.
    • DeliveryDate: Date when the order was delivered.
    • SalesTeamID: Identifier for the sales team involved.
    • CustomerID: Identifier for the customer.
    • StoreID: Identifier for the store.
    • ProductID: Identifier for the product.
    • Order Quantity: Quantity of products ordered.
    • Discount Applied: Applied discount for the order.
    • Unit Cost: Cost of a single unit of the product.
    • Unit Price: Price at which the product was sold.

    This dataset serves as a valuable resource for analysing sales trends, identifying popular products, assessing the performance of different sales channels, and optimising pricing strategies for different regions.

    Visualization Ideas:

    • Time Series Analysis: Plot sales trends over time to identify seasonal patterns and changes in demand.
    • Sales Channel Comparison: Compare sales performance across different channels using bar charts or line graphs.
    • Product Analysis: Visualise the distribution of sales across different products using pie charts or bar plots.
    • Discount Analysis: Analyse the impact of discounts on sales using scatter plots or line graphs.
    • Regional Performance: Create maps to visualise sales performance across different regions.

    Data Modelling and Machine Learning Ideas (Price Prediction):

    • Linear Regression: Build a linear regression model to predict the unit price based on features such as order quantity, discount applied, and unit cost.
    • Random Forest Regression: Use a random forest regression model to predict the price, taking into account multiple features and their interactions.
    • Neural Networks: Train a neural network to predict unit price using deep learning techniques, which can capture complex relationships in the data.
    • Feature Importance Analysis: Identify the most influential features affecting price prediction using techniques like feature importance scores from tree-based models.
    • Time Series Forecasting: Develop a time series forecasting model to predict future prices based on historical sales data.

    These visualisation and modelling ideas can help you gain valuable insights from the sales data and create predictive models to optimise pricing strategies and improve sales performance; a baseline sketch for the linear-regression idea follows.
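
    A hedged baseline for the linear-regression idea (sklearn; the filename is hypothetical, and the columns are assumed to be numeric as documented):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("us_regional_sales.csv")

    X = df[["Order Quantity", "Discount Applied", "Unit Cost"]]
    y = df["Unit Price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("held-out R^2:", r2_score(y_test, model.predict(X_test)))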

  9. Annual Time Series of Air Temperature, Precipitation, and Urban Area Extent...

    • b2find.eudat.eu
    Updated Mar 20, 2024
    + more versions
    Cite
    (2024). Annual Time Series of Air Temperature, Precipitation, and Urban Area Extent in Modena, Italy [Dataset]. https://b2find.eudat.eu/dataset/46e3fcf8-0259-5400-ad52-45a02ed2d903
    Explore at:
    Dataset updated
    Mar 20, 2024
    Area covered
    Italy, Modena
    Description

    An uninterrupted data set of 139 annual values of local mean air temperature T, cumulative precipitation depth P, urban area extent A, global mean surface air temperature G, and global CO2 concentration C for the 1881-2019 period is shared with the scientific community, together with the Matlab 2021a code na.m, which performs a nonlinear analysis of the data contained in the file ts.dat. The code loads file ts.dat and generates the PDF files of this dataset. File README.txt contains the description of this dataset and its files.

    The shared data can be found in the ASCII text file ts.dat (as well as in dataset doi:10.1594/PANGAEA.938739, which has been created from that file). The first column, with header year, contains the year. The second column, with header T (°C), contains the local mean air temperature T in Celsius degrees observed in Modena. The third column, with header P (mm), contains the cumulative precipitation depth P in millimeters in Modena. The fourth column, with header A (km2), contains the urban area extent A in square kilometers of Modena. The fifth column, with header G (°C), contains the global mean surface air temperature G in Celsius degrees obtained by adding the GISTEMP temperature change to the average temperature observed in Modena in the 1951-1980 base period (https://data.giss.nasa.gov/gistemp/). The sixth column, with header C (ppm), contains the global CO2 concentration C in parts per million, estimated from ice cores from 1881 to 1958 (https://cdiac.ess-dive.lbl.gov/trends/co2/lawdome-data.html) and observed at the Mauna Loa Observatory (latitude 19.5362°N, longitude 155.5763°W, elevation 3397.00 m asl), Hawaii, from 1959 to 2019 (https://gml.noaa.gov/ccgg/trends/data.html).

    The Matlab 2021a code na.m loads the file ts.dat and generates the following PDF files:
    • lg.pdf: comparison between local temperature in Modena and global temperatures obtained from the NASA GISTEMP temperature change projected to Modena.
    • dm.pdf: scatter plot matrix of T, P, A, G, and C.
    • vm.pdf: scatter plot matrix for the first differences of T, P, A, G, and C.
    • pm.pdf: generalized additive model predictions of T, P, A, G, and C, denoted as T', P', A', G', and C', obtained from single predictors T, P, A, G, and C.
    • gam.pdf: generalized additive model predictions of T and G, denoted as T' and G', respectively, obtained from multiple predictors based on T, P, A, G, and C.

    The nonlinear analysis performed using the data set contained in the ASCII text file ts.dat and the Matlab 2021a code na.m is described in Orlandini et al. 2021 and available from the authors sharing the present data set upon request.
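
    A Python equivalent of the data-loading and dm.pdf steps of na.m might look like this (a sketch only: the delimiter of ts.dat is sniffed rather than assumed, and the column names are shortened to year, T, P, A, G, C):

    import pandas as pd
    import matplotlib.pyplot as plt

    ts = pd.read_csv("ts.dat", sep=None, engine="python")  # sniff the delimiter
    ts.columns = ["year", "T", "P", "A", "G", "C"]         # per the column description above

    # Scatter plot matrix of T, P, A, G, and C, analogous to dm.pdf.
    pd.plotting.scatter_matrix(ts[["T", "P", "A", "G", "C"]], figsize=(8, 8))
    plt.savefig("dm_python.pdf")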

  10. Shedding new light on the integrity of gold nanoparticle-fluorophore...

    • research-data.cardiff.ac.uk
    zip
    Updated Sep 18, 2024
    Cite
    Panagiota Giannakopoulou; Joseph Williams; Paul Moody; Edward Sayers; JP Magnusson; Iestyn Pope; Lukas Payne; C Alexander; Arwyn Jones; Wolfgang Langbein; Peter Watson; Paola Borri (2024). Shedding new light on the integrity of gold nanoparticle-fluorophore conjugates for cell biology with four-wave-mixing microscopy - dataset [Dataset]. http://doi.org/10.17035/d.2019.0081702601
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 18, 2024
    Dataset provided by
    Cardiff University
    Authors
    Panagiota Giannakopoulou; Joseph Williams; Paul Moody; Edward Sayers; JP Magnusson; Iestyn Pope; Lukas Payne; C Alexander; Arwyn Jones; Wolfgang Langbein; Peter Watson; Paola Borri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a cross-disciplinary work at the physics/life science interface which addresses an important question in the use of gold nanoparticles (AuNPs) conjugated to fluorescent molecules for cell biology, namely whether the fluorophore is a faithful reporter of the nanoparticle location. AuNPs are among the most widely investigated systems in nano-medicine research for applications in intracellular imaging and sensing, drug delivery and photothermal therapy, owing to their small sizes, biocompatibility, ease of surface functionalisation and bio-conjugation. In this context, a particularly interesting system is that of a AuNP-fluorophore conjugate, whereby a fluorescently labelled biomolecule (e.g. a protein ligand, nucleotide, peptide, antibody) is attached onto the AuNP surface, and its uptake and intracellular fate is followed in situ in real time by fluorescence microscopy. AuNPs are historically well known to biologists as markers for electron microscopy due to their high electron density; hence these conjugates are specifically useful probes for correlative light electron microscopy. However, an important question that has remained elusive to answer is whether the fluorescence readout is actually a reliable reporter of the AuNP location. This is because it is challenging with current optical techniques to directly visualise a single small AuNP against the endogenous scattering, absorption and phase contrast in a highly heterogeneous three-dimensional cellular environment.

    These data demonstrate the application of a novel optical microscopy technique developed in our lab (four-wave mixing (FWM) interferometry) to directly image single small AuNPs background-free inside cells with high 3D spatial resolution. The data show four different AuNP-fluorophore conjugates imaged inside two different cell types. By correlative fluorescence-FWM microscopy, the data show that, in most cases, fluorescence emission originated from unbound fluorophores rather than from fluorophores attached to nanoparticles. Fluorescence detection was also severely limited by photobleaching, quenching and autofluorescence background.

    The datasets consist of images and numerical data. Images fall into two groups: experimental and calculated datasets. Experimental images are optical microscopy datasets obtained using: 1) differential interference contrast (DIC) microscopy; 2) FWM microscopy; 3) confocal fluorescence microscopy; 4) extinction microscopy; 5) wide-field epifluorescence microscopy. Calculated datasets are images of the cross-correlation coefficient as a function of relative translation coordinates, calculated from the experimental images.

    Numerical data consist of:
    1) One-dimensional cut profiles along images.
    2) Plots of representative values of extinction cross-sections.
    3) A scatter plot from a two-channel/colour fluorescence image, showing the intensity of one colour channel in a given pixel as the x-coordinate and the fluorescence intensity of the second channel at the same pixel as the y-coordinate.
    4) A scatter plot showing the fluorescence flux (in units of detected photoelectrons/s) versus the extinction cross-section of nanoparticles.
    All numerical data are provided as Origin plots from which the original datasets can be retrieved.

    Research results based upon these data are published at https://doi.org/10.1039/C9NR08512B

  11. Representations of color and form in mouse visual cortex

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Nov 29, 2023
    Cite
    Issac Rhim; Ian Nauhaus (2023). Representations of color and form in mouse visual cortex [Dataset]. http://doi.org/10.5061/dryad.t1g1jwt3r
    Explore at:
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Issac Rhim; Ian Nauhaus
    Time period covered
    Jan 1, 2021
    Description

    Spatial transitions in color can aid any visual perception task, and its neural representation – an “integration of color and form” – is thought to begin at primary visual cortex (V1). Color and form integration is untested in mouse V1, yet studies show that the ventral retina provides the necessary substrate from green-sensitive rods and UV-sensitive cones. Here, we used two-photon imaging in V1 to measure spatial frequency (SF) tuning along four axes of rod and cone contrast space, including luminance and color. We first reveal that V1 has similar responsiveness to luminance and color, yet average SF tuning is significantly shifted lowpass for color. Next, guided by linear models, we used SF tuning along all four color axes to estimate the proportion of neurons that fall into classic models of color opponency – “single-”, “double-”, and “non-opponent”. Few neurons (~6%) fit the criteria for double-opponency, which are uniquely tuned for chromatic borders. Most of the population can be...

    This data comes from two-photon imaging in mouse primary visual cortex. There is also Matlab code to run the simulations in figures 1, 6, 7, and 8.

    See uploaded README files for details. Below is the top of README_for_dataset.doc, which describes the uploaded data set used in Rhim and Nauhaus: “Joint representations of color and form in mouse visual cortex described by random pooling from rods and cones”. It is a MATLAB .mat file, where each structure pertains to a given figure. In addition to the source data for the figures, it also has the following additions:

    The same data set, but prior to culling the population according to the dashed box in the Figure 2 scatter plot. See variables appended with “..._all”.

    Region-of-interest ID associated with each neuron.

    Below is all the information in README_for_simulations.doc. To run the simulations for Figures 1, 6, 7, and 8, execute the cells in the high-level scripts of the following: Figure_1.m, Figure_6_7.m, Figure_8.m. Make sure all the other .m files are in your path.

  12. Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation...

    • b2find.eudat.eu
    Updated Aug 17, 2025
    + more versions
    Cite
    (2025). Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2) [Dataset]. https://b2find.eudat.eu/dataset/3524622d-2099-554c-826a-f2155c3f4bb4
    Explore at:
    Dataset updated
    Aug 17, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2), conducted for the paper: What do anomaly scores actually mean? Key characteristics of algorithms' dynamics beyond accuracy, by F. Iglesias, H. O. Marques, A. Zimek, T. Zseby.

    Context and methodology

    Anomaly detection is intrinsic to a large number of data analysis applications today. Most of the algorithms used assign an outlierness score to each instance prior to establishing anomalies in a binary form. The experiments in this repository study how different algorithms generate different dynamics in the outlierness scores and react in very different ways to possible model perturbations that affect data. The study elaborated in the referred paper presents new indices and coefficients to assess the dynamics and explores the responses of the algorithms as a function of variations in these indices, revealing key aspects of the interdependence between algorithms, data geometries and the ability to discriminate anomalies. Therefore, this repository reproduces the conducted experiments, which study eight algorithms (ABOD, HBOS, iForest, K-NN, LOF, OCSVM, SDO and GLOSH) submitted to seven perturbations (related to cardinality, dimensionality, outlier proportion, inlier-outlier density ratio, density layers, clusters and local outliers), and collects behavioural profiles with eleven measurements (Adjusted Average Precision, ROC-AUC, Perini's Confidence [1], Perini's Stability [2], S-curves, Discriminant Power, Robust Coefficients of Variation for Inliers and Outliers, Coherence, Bias and Robustness) under two types of normalization: linear and Gaussian, the latter aiming to standardize the outlierness scores issued by different algorithms [3].

    This repository is framed within research on the following domains: algorithm evaluation, outlier detection, anomaly detection, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    References

    [1] Perini, L., Vercruyssen, V., Davis, J.: Quantifying the confidence of anomaly detectors in their example-wise predictions. In: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer Verlag (2020).
    [2] Perini, L., Galvin, C., Vercruyssen, V.: A Ranking Stability Measure for Quantifying the Robustness of Anomaly Detection Methods. In: 2nd Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning @ ECML/PKDD (2020).
    [3] Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pp. 13-24 (2011).

    Technical details

    Experiments are tested in Python 3.9.6. The provided scripts generate all synthetic data and results, which are kept in the repo for the sake of comparability and replicability ("outputs.zip" file). The file and folder structure is as follows:
    • "compare_scores_group.py": Python script to extract the new dynamic indices proposed in the paper.
    • "generate_data.py": Python script to generate the datasets used for evaluation.
    • "latex_table.py": Python script to show results in a latex-table format.
    • "merge_indices.py": Python script to merge accuracy and dynamic indices in the same table-structured summary.
    • "metric_corr.py": Python script to calculate correlation estimations between indices.
    • "outdet.py": Python script that runs outlier detection with different algorithms on diverse datasets.
    • "perini_tests.py": Python script to run Perini's confidence and stability on all datasets and algorithms' performances.
    • "scatterplots.py": Python script that generates scatter plots comparing accuracy and dynamic performances.
    • "README.md": explanations and step-by-step instructions for replication.
    • "requirements.txt": references to required Python libraries and versions.
    • "outputs.zip": all result tables, plots and synthetic data generated with the scripts.
    • [data/real_data]: CSV versions of the Wilt, Shuttle, Waveform and Cardiotocography datasets (inherited and adapted from the LMU repository).

    License

    The CC-BY license applies to all data generated with the "generate_data.py" script. All distributed code is under the GNU GPL license. For the "ExCeeD.py" and "stability.py" scripts, please consult and refer to the original sources provided above.

  13. Data from: Data-driven Multivariate Power Curve Modeling of Offshore Wind...

    • data.mendeley.com
    • narcis.nl
    Updated Jul 25, 2016
    Cite
    Olivier Janssens (2016). Data-driven Multivariate Power Curve Modeling of Offshore Wind Turbines [Dataset]. http://doi.org/10.17632/gst3cdfnn5.1
    Explore at:
    Dataset updated
    Jul 25, 2016
    Authors
    Olivier Janssens
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    a) Description: A synthetic dataset consisting of 20,000 power and wind speed values. The goal of this dataset is to objectively quantify power curve modelling techniques for wind turbines.

    b) Size: 580.0 kB

    c) Platform: Any OS or programming language can read a txt file.

    d) Environment: As this is a txt file, any modern OS will do. The txt file consists of comma-separated values, so all modern programming languages can be used to read this file.

    e) Major Component Description: There are 20,001 rows in the txt file. The first row contains the column headers. The other 20,000 lines contain the corresponding column values. There are two columns: the first is the power and the second the wind speed.

    f) Detailed Set-up Instructions: This depends on the platform and programming language. Since this is a txt file with tab-separated values, a broad range of options are possible and can be looked up.

    g) Detailed Run Instructions: /

    h) Output Description: When plotting the wind speed values vs the power values using a scatter plot (e.g. Matlab or Python matplotlib), a power curve can be seen; see the sketch below.
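
    For instance, the power curve can be plotted in a few lines of Python (a sketch; the filename is hypothetical, and the delimiter is sniffed because the description mentions both comma- and tab-separated values):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("power_curve.txt", sep=None, engine="python")  # columns: power, wind speed
    plt.scatter(df.iloc[:, 1], df.iloc[:, 0], s=2, alpha=0.3)
    plt.xlabel("Wind speed")
    plt.ylabel("Power")
    plt.show()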

  14. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/3
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    hakuna matata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction Title: Cucumber Disease Detection

    Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

    2. Problem Statement Problem Definition: The research uses image analysis methods to address the issue of automating the identification of diseases, including Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early illness identification.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

    3. Data Collection and Preprocessing Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and damaged specimens.

    Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

    4. Exploratory Data Analysis (EDA) The dataset was examined using visuals like scatter plots and histograms, and the data were inspected for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.

    5. Methodology Machine Learning Algorithms:

    Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:

    The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
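
    As a sketch of such a split (the folder layout and label names are assumptions, not the project's actual pipeline):

    from pathlib import Path
    from sklearn.model_selection import train_test_split

    paths = sorted(Path("cucumber_images").glob("*/*.jpg"))  # one subfolder per class
    labels = [p.parent.name for p in paths]                  # e.g. "healthy", "downy_mildew"

    train_paths, test_paths, train_labels, test_labels = train_test_split(
        paths, labels, test_size=0.2, stratify=labels, random_state=42)
    print(len(train_paths), "training images,", len(test_paths), "test images")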

    6. Model Development The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods like dropout and L2 regularization were used.

    7. Model Training During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.

    8. Model Evaluation Evaluation Metrics:

    Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:

    The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

    9. Results and Discussion Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges faced throughout the project along with the methods used to solve them.

    10. Conclusion Recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.

    11. References Libraries: Pillow, Roboflow, YOLO, sklearn, matplotlib. Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1

    12. Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111

  15. Data Set for: Step-by-Step Calculation and Spreadsheet Tools for Predicting...

    • catalog.data.gov
    Updated Nov 12, 2020
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Data Set for: Step-by-Step Calculation and Spreadsheet Tools for Predicting Stressor Levels that Extirpate Genera and Species [Dataset]. https://catalog.data.gov/dataset/data-set-for-step-by-step-calculation-and-spreadsheet-tools-for-predicting-stressor-levels
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The data include measured data from Ecoregions 69 and 70 in West Virginia. Paired biological and chemical grab samples are included. These data were used to estimate specific conductivity (SC) extirpation concentrations (XC95) for benthic invertebrate genera. Also included are cumulative frequency distribution plots, scatter plots fitted with generalized additive models, and biogeographical maps of observations of each genus. The metadata and full data set are available in Supplemental Appendices S4 and S5, respectively. The output of 176 XC95 values from Ecoregions 69 and 70 is provided in Supplemental Appendix S6. Supplemental Appendix S7 depicts the probability of observing a genus for discrete ranges of SC. Supplemental Appendix S8 depicts the proportion of occurrence of a genus for discrete ranges of SC. Supplemental Appendix S9 shows the biogeographic distributions of the genera included in the data set. We also discuss limitations of this method to help avoid misinterpretations and inferential errors. A data dictionary is provided in Cond_DataFileColumnMetada-20161221. This dataset is associated with the following publication: Cormier, S., L. Zheng, E. Leppo, and A. Hamilton. Step-by-Step Calculation and Spreadsheet Tools for Predicting Stressor Levels that Extirpate Genera and Species. Integrated Environmental Assessment and Management. Allen Press, Inc., Lawrence, KS, USA, 14(2): 174-180, (2018).
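
    Conceptually, an XC95 is the stressor level below which 95% of a genus's observations fall. A toy Python sketch of that idea (a simplification of the paper's weighted-distribution method; the conductivity values are made up):

    import numpy as np

    # Specific conductivity (uS/cm) of samples where a hypothetical genus was observed.
    sc_where_present = np.array([120, 150, 200, 240, 300, 410, 520, 700, 950, 1300])

    xc95 = np.percentile(sc_where_present, 95)
    print(f"XC95 ~ {xc95:.0f} uS/cm")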

  16. Exploring the SDSS data set. I. EMP & CV stars

    • b2find.eudat.eu
    Updated Feb 10, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2017). Exploring the SDSS data set. I. EMP & CV stars [Dataset]. https://b2find.eudat.eu/dataset/2a30d2b3-32f1-59f8-b3ab-ad01de5a7de9
    Explore at:
    Dataset updated
    Feb 10, 2017
    Description

    We present the results of a search for extremely metal-poor (EMP), carbon-enhanced metal-poor (CEMP), and cataclysmic variable (CV) stars using a new exploration tool based on linked scatter plots (LSPs). Our approach is especially designed to work with very large spectrum data sets such as the SDSS, LAMOST, RAVE, and Gaia data sets, and it can be applied to stellar, galaxy, and quasar spectra. As a demonstration, we conduct our search using the SDSS DR10 data set. We first created a 3326-dimensional phase space containing nearly 2 billion measures of the strengths of over 1600 spectral features in 569738 SDSS stars. These measures capture essentially all the stellar atomic and molecular species visible at the resolution of SDSS spectra. We show how LSPs can be used to quickly isolate and examine interesting portions of this phase space. To illustrate, we use LSPs coupled with cuts in selected portions of phase space to extract EMP stars, CEMP stars, and CV stars. We present identifications for 59 previously unrecognized candidate EMP stars and 11 previously unrecognized candidate CEMP stars. We also call attention to 2 candidate He II emission CV stars found by the LSP approach that have not yet been discussed in the literature.

  17. Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Cite
    Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño; Joel Castaño
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

    This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.
    The framework considers the trade-offs between three key objectives:
    1. Performance (maximizing throughput)
    2. Energy Efficiency (minimizing estimated energy per unit)
    3. Cost (minimizing estimated hardware cost)
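
    A minimal sketch of the trade-off handling is given below: configurations are filtered to the Pareto front, i.e. those not dominated on all three objectives at once. The Config fields and example numbers are invented for illustration; the notebook's actual recommendation code may differ.

    from dataclasses import dataclass

    @dataclass
    class Config:
        name: str
        throughput: float   # higher is better
        energy: float       # lower is better
        cost: float         # lower is better

    def dominates(a: Config, b: Config) -> bool:
        # a dominates b if it is no worse on every objective and better on at least one
        no_worse = (a.throughput >= b.throughput, a.energy <= b.energy, a.cost <= b.cost)
        better = (a.throughput > b.throughput, a.energy < b.energy, a.cost < b.cost)
        return all(no_worse) and any(better)

    def pareto_front(configs: list[Config]) -> list[Config]:
        return [c for c in configs
                if not any(dominates(o, c) for o in configs if o is not c)]

    configs = [
        Config("A100x4", 12000, 1.8, 80000),
        Config("H100x2", 15000, 1.5, 90000),
        Config("L4x8", 6000, 0.9, 30000),
        Config("T4x8", 3000, 1.0, 35000),   # dominated by L4x8 on all three objectives
    ]
    print([c.name for c in pareto_front(configs)])   # A100x4, H100x2, L4x8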

    Repository Structure

    This repository is organized as follows:
    • Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.
    • Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.
    • Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
    • Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
    • Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling; the output of the Dataset_Extension.ipynb notebook.
    • eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.
    • requirements.txt: A list of all Python libraries and versions required to run the code in this repository.
    • eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
    • optimization_models_final/: A directory where the trained final model files (.joblib) are saved after running the optimization notebook.
    • pareto_validation_plot_fold_0.png: The validation plot comparing the true and predicted Pareto fronts, as presented in the thesis.
    • shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

    Requirements and Installation

    To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.
    1. Clone the repository:
       git clone <repository-url>
       cd <repository-directory>
    2. Create and activate a virtual environment (optional but recommended):
       python -m venv venv
       source venv/bin/activate    # On Windows, use venv\Scripts\activate
    3. Install the required packages. All dependencies are listed in the requirements.txt file; install them using pip:
       pip install -r requirements.txt

    Step-by-Step Reproduction Workflow

    The notebooks are designed to be run in a logical sequence.

    Step 1: Data Enrichment (Optional)

    The final enriched dataset (Inference_data_Extended.csv) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the Dataset_Extension.ipynb notebook. It will take Inference_data.csv as input and generate the extended version.

    Step 2: Exploratory Data Analysis (Optional)

    All plots from the EDA are pre-generated and available in the eda_plots/ directory. To regenerate them, run the Data_Analysis.ipynb notebook. This will overwrite the existing plots and the eda_log.txt file.

    Step 3: Main Model Training, Validation, and Recommendation

    This is the core of the thesis. Running the Optimization_Model.ipynb notebook executes the entire pipeline:
    1. It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.
    2. It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.
    3. It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.
    4. It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
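
    The group-aware validation in step 1 can be sketched with scikit-learn's GroupKFold, which keeps all rows sharing a group key in the same fold. The feature, target, and grouping column names below are assumptions for illustration; the notebook defines the actual ones.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import GroupKFold

    df = pd.read_csv("Inference_data_Extended.csv")
    X = df[["num_accelerators", "tdp_watts", "cost_usd"]]   # assumed feature columns
    y = df["throughput"]                                    # assumed target column
    groups = df["system_id"]                                # assumed grouping column

    # Rows from the same system never appear in both train and test,
    # so each fold measures performance on unseen hardware systems.
    for fold, (tr, te) in enumerate(GroupKFold(n_splits=5).split(X, y, groups)):
        model = RandomForestRegressor(random_state=0).fit(X.iloc[tr], y.iloc[tr])
        print(f"fold {fold}: R^2 = {r2_score(y.iloc[te], model.predict(X.iloc[te])):.3f}")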
  18. Data from: Quantitative imaging of lipids in live mouse oocytes and early embryos using CARS microscopy

    • research-data.cardiff.ac.uk
    zip
    Updated Sep 18, 2024
    Cite
    J Bradley; Iestyn Pope; Francesco Masia; Wolfgang Langbein; Karl Swann; Paola Borri (2024). Quantitative imaging of lipids in live mouse oocytes and early embryos using CARS microscopy [Dataset]. http://doi.org/10.17035/d.2016.0008223993
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 18, 2024
    Dataset provided by
    Cardiff University
    Authors
    J Bradley; Iestyn Pope; Francesco Masia; Wolfgang Langbein; Karl Swann; Paola Borri
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mammalian oocytes contain lipid droplets (LDs) that are a store of fatty acids, whose metabolism plays a significant role in pre-implantation development. Fluorescent staining has previously been used to image lipid droplets in mammalian oocytes and embryos, but this method is not quantitative and is often incompatible with live cell imaging and subsequent development. These data show the application of chemically specific, label-free coherent anti-Stokes Raman scattering (CARS) microscopy to mouse oocytes and pre-implantation embryos. The data show that CARS imaging can quantify the size, number and spatial distribution of lipid droplets in living mouse oocytes and embryos up to the blastocyst stage; notably, it can be used in a way that does not compromise oocyte maturation or embryo development. The data also correlate CARS with two-photon fluorescence microscopy simultaneously acquired using fluorescent lipid probes on fixed samples, and demonstrate only a partial degree of correlation, depending on the lipid probe, clearly exemplifying the limitations of lipid labelling. In addition, the data show that differences in the chemical composition of lipid droplets in living oocytes matured in media supplemented with different amounts of saturated and unsaturated fatty acids can be detected using CARS hyperspectral imaging. These data demonstrate that CARS microscopy provides a novel non-invasive method of quantifying lipid content, type and spatial distribution with sub-micron resolution in living mammalian oocytes and embryos.
    The data set consists of optical microscopy images and numerical data. Microscope images show oocytes and early embryos (as cross-sections in two dimensions or as maximum intensity projections) obtained using Differential Interference Contrast (DIC) microscopy, CARS microscopy, and fluorescence microscopy. Lipid droplets of oocytes and early embryos are specifically visualised in the CARS microscopy images. The numerical data consist of the following groups:
    1) Histogram of the occurrence of the aggregate size (number of lipid droplets per aggregate) in a representative egg. The data set is an ASCII file with X and Y columns, where X is the aggregate size and Y the occurrence.
    2) Scatter plot of the square root of the sum of the squared aggregate sizes against the total number of lipid droplets, in ensembles of eggs and embryos. The data set is an ASCII file with X and Y columns, where X is the square root of the sum of the squared aggregate sizes and Y the total number of lipid droplets.
    3) Vibrational Raman-like spectra obtained from CARS hyperspectral images of lipid droplets in representative eggs and embryos. The data set is an ASCII file with X and Y columns, where X is the wavenumber and Y the CARS susceptibility (imaginary part).
    4) Histogram of the occurrence of the LD effective diameter in a representative egg. The data set is an ASCII file with X and Y columns, where X is the LD diameter and Y the occurrence.
    5) Scatter plot of the diameter of LDs against the total number of LDs, in ensembles of eggs and embryos. The data set is an ASCII file with X and Y columns, where X is the diameter of LDs and Y the total number of lipid droplets.
    Results derived from these data are published at http://dx.doi.org/10.1242/dev.129908
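
    Since the numerical files are plain two-column ASCII, they can be loaded in a couple of lines; the file name below is hypothetical, and the axis labels follow group 4 above.

    import numpy as np
    import matplotlib.pyplot as plt

    # Load one of the two-column ASCII files (X = LD effective diameter, Y = occurrence).
    x, y = np.loadtxt("ld_diameter_histogram.txt", unpack=True)   # hypothetical file name
    plt.bar(x, y, width=0.9 * np.min(np.diff(x)))
    plt.xlabel("LD effective diameter")
    plt.ylabel("Occurrence")
    plt.show()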

  19. Three-dimensional scatter plots showing the probability of elimination at vet gates 2, 3, 4 and 5, according to the corresponding logistic regressions with a fixed HR of 64 and the AS and CRT measured at the previous vet gate (n-1)

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Aug 31, 2015
    Cite
    Younes, Mohamed; Barrey, Eric; Cottin, François; Robert, Céline (2015). Three-dimensional scatter plots showing the probability of elimination at vet gates 2, 3, 4 and 5, according to the corresponding logistic regressions with a fixed HR of 64 and the AS and CRT measured at the previous vet gate (n-1). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001868318
    Explore at:
    Dataset updated
    Aug 31, 2015
    Authors
    Younes, Mohamed; Barrey, Eric; Cottin, François; Robert, Céline
    Description

    Red corresponds to a probability of elimination of 60-80%, whereas brown (the darkest areas) corresponds to a probability of 80-100%. The white line marks a probability of elimination of 70%, the threshold chosen to compute the probability of elimination in an independent validation data set.
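
    The plotted surfaces follow directly from the logistic model: with HR fixed at 64, the probability of elimination over the (AS, CRT) plane is p = 1 / (1 + exp(-(b0 + bHR*HR + bAS*AS + bCRT*CRT))). The sketch below reproduces such a surface with hypothetical coefficients and axis ranges; the study's fitted coefficients at each vet gate would replace them.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical logistic-regression coefficients (the real ones come from the study):
    b0, b_hr, b_as, b_crt = -20.0, 0.15, 0.4, 2.0
    HR = 64.0                                    # heart rate fixed at 64, as in the figures

    AS, CRT = np.meshgrid(np.linspace(8, 24, 200),   # AS range, assumed
                          np.linspace(1, 5, 200))    # CRT range, assumed
    p = 1.0 / (1.0 + np.exp(-(b0 + b_hr * HR + b_as * AS + b_crt * CRT)))

    plt.contourf(AS, CRT, p, levels=20, cmap="Reds")
    plt.colorbar(label="P(elimination)")
    plt.contour(AS, CRT, p, levels=[0.7], colors="white")   # the 70% threshold line
    plt.xlabel("AS at previous vet gate")
    plt.ylabel("CRT at previous vet gate")
    plt.show()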

  20. UK National Databank of Moored Current Meter Data (1967-)

    • bodc.ac.uk
    • data-search.nerc.ac.uk
    nc
    Updated Jan 30, 2017
    Cite
    British Oceanographic Data Centre (2017). UK National Databank of Moored Current Meter Data (1967-) [Dataset]. https://www.bodc.ac.uk/resources/inventories/edmed/report/157/
    Explore at:
    ncAvailable download formats
    Dataset updated
    Jan 30, 2017
    Dataset authored and provided by
    British Oceanographic Data Centre (http://www.bodc.ac.uk/)
    License

    https://vocab.nerc.ac.uk/collection/L08/current/LI/

    Time period covered
    1967 - Present
    Area covered
    Norwegian Sea, Inner Seas off the West Coast of Scotland, North Sea, Indian Ocean, English Channel, Mediterranean Sea, South Atlantic Ocean, Irish Sea, North Atlantic Ocean
    Description

    The data set comprises more than 7000 time series of ocean currents from moored instruments. The records contain horizontal current speed and direction, often with concurrent temperature data; they may also contain vertical velocity, pressure and conductivity data. The majority of the data originate from the continental shelf seas around the British Isles (for example, the North Sea, Irish Sea and Celtic Sea) and the North Atlantic. Measurements are also available for the South Atlantic, Indian, Arctic and Southern Oceans and the Mediterranean Sea. Data collection commenced in 1967 and is currently ongoing. Sampling intervals normally vary between 5 and 60 minutes. Current meter deployments are typically of 2-8 weeks' duration in shelf areas but up to 6-12 months in the open ocean. About 25 per cent of the data come from water depths greater than 200 m. The data are processed and stored by the British Oceanographic Data Centre (BODC) and a computerised inventory is available online. Data are quality controlled prior to loading into the databank: data cycles are visually inspected using a screening software package; data from current meters on the same mooring or on adjacent moorings can be overplotted, and the data can also be displayed as time series or scatter plots. Series header information accompanying the data is checked, and documentation is compiled detailing the data collection and processing methods.
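
    A common way to screen such records, consistent with the scatter-plot displays mentioned above, is to resolve speed and direction into east/north velocity components and plot one against the other. The sketch below assumes a hypothetical CSV export with speed_cm_s and direction_deg columns, and the oceanographic convention that direction is where the current flows towards.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("mooring_series.csv")       # hypothetical export of one series
    theta = np.deg2rad(df["direction_deg"])      # direction the current flows towards
    u = df["speed_cm_s"] * np.sin(theta)         # eastward component
    v = df["speed_cm_s"] * np.cos(theta)         # northward component

    plt.scatter(u, v, s=2, alpha=0.3)
    plt.axhline(0, color="gray")
    plt.axvline(0, color="gray")
    plt.xlabel("u (cm/s, east)")
    plt.ylabel("v (cm/s, north)")
    plt.gca().set_aspect("equal")
    plt.show()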
