44 datasets found
  1. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    pptx (available download formats)
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column, ‘Replicate’, indicates the biological replicates; in the example, the month and year in which each replicate was performed are indicated. The second column, ‘Condition’, indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column, ‘Value’, contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide, paste it into the R console, and execute it. In the dialog box, select the input .csv file from Step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Adjust the window to your liking and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
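
    For readers working outside R, here is a rough Python analogue of the protocol's plot (a sketch only, not the authors' script: it assumes pandas, seaborn and matplotlib, the Replicate/Condition/Value columns from Step 1, and a hypothetical file name data.csv):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("data.csv")  # the .csv file formatted in Step 1

    # Superimpose boxplots and jittered dots, coloring dots by replicate.
    ax = sns.boxplot(data=df, x="Condition", y="Value", color="white")
    sns.stripplot(data=df, x="Condition", y="Value", hue="Replicate", ax=ax)
    # ax.set_yscale("log")  # equivalent of Note 2's scale_y_log10()
    plt.savefig("graph.pdf")  # Step 3: export as .pdf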

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.

    https://cran.r-project.org/

    http://ggplot2.org/

  2. Ultimate_Analysis

    • data.mendeley.com
    Updated Jan 28, 2022
    + more versions
    Cite
    Akara Kijkarncharoensin (2022). Ultimate_Analysis [Dataset]. http://doi.org/10.17632/t8x96g88p3.2
    Explore at:
    Dataset updated
    Jan 28, 2022
    Authors
    Akara Kijkarncharoensin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This database studies performance inconsistency in biomass HHV models based on ultimate analysis. The research null hypothesis is that the rank of a biomass HHV model is consistent across datasets. Fifteen biomass models are trained and tested on four datasets; within each dataset, the rank invariability of these 15 models indicates performance consistency.

    The database includes the datasets and source codes used to analyze the performance consistency of the biomass HHV models. The datasets are stored in tabular form in an Excel workbook. The source codes implement the biomass HHV machine learning models through MATLAB object-oriented programming (OOP). These machine learning models consist of eight regression models, four supervised learning models, and three neural networks.

    An Excel workbook, "BiomassDataSetUltimate.xlsx," collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV data points from 20 pieces of literature. The column names of the worksheet indicate the elements of the ultimate analysis on a % dry basis. The HHV column refers to the higher heating value in MJ/kg. The following worksheet, "Full Residuals," backs up the residuals of model testing based on the 20-fold cross-validations. The article (Kijkarncharoensin & Innet, 2021) verifies the performance consistency through these residuals. The other worksheets present the literature datasets used to train and test the model performance.

    A file named "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The folder hierarchy in this file mirrors the class structure of the machine learning models. These classes extend the features of MATLAB's Statistics and Machine Learning Toolbox to support, e.g., k-fold cross-validation. The MATLAB script named "runStudyUltimate.m" is the article's main program for analyzing the performance consistency of the biomass HHV models through the ultimate analysis. The script loads the datasets from the Excel workbook and automatically fits the biomass models through the OOP classes.

    The first section of the MATLAB script generates the most accurate model by optimizing the model's hyperparameters. The first run takes a few hours to train the machine learning models via trial and error. The trained models can be saved in a MATLAB .mat file and loaded back into the MATLAB workspace. The remaining script, separated by script section breaks, performs the residual analysis to inspect the performance consistency. Furthermore, a figure of the biomass data in a 3D scatter plot and box plots of the prediction residuals are produced. Finally, the interpretations of these results are examined in the author's article.
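
    To illustrate the rank-consistency idea outside MATLAB, a schematic Python sketch (not the repository's code; the RMSE values are made up) ranks the 15 models on each of the 4 datasets and compares the rankings pairwise with Kendall's tau:

    import numpy as np
    from scipy.stats import kendalltau

    # rmse[i, j] = test RMSE of model i on dataset j (15 models x 4 datasets, synthetic values)
    rng = np.random.default_rng(0)
    rmse = rng.uniform(0.5, 2.0, size=(15, 4))

    # Per-dataset rank of each model (0 = most accurate).
    ranks = rmse.argsort(axis=0).argsort(axis=0)

    # Tau close to 1 for every pair of datasets would indicate consistent model ranks.
    for a in range(4):
        for b in range(a + 1, 4):
            tau, _ = kendalltau(ranks[:, a], ranks[:, b])
            print(f"datasets {a} vs {b}: tau = {tau:.2f}")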

    Reference : Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.

  3. Data from: Worldwide benchmark of modelled solar irradiance data annex

    • zenodo.org
    • portaldelainvestigacion.uma.es
    • +1more
    bin, zip
    Updated Apr 26, 2023
    Cite
    Anne Forstinger; Stefan Wilbert; Adam R. Jensen; Birk Kraas; Carlos Fernández-Peruchena; Christian Gueymard; Dario Ronzio; Dazhi Yang; Elena Collino; Jesús Polo Martinez; Jose A. Ruiz-Arias; Natalie Hanrieder; Philippe Blanc; Yves-Marie Saint-Drenan (2023). Worldwide benchmark of modelled solar irradiance data annex [Dataset]. http://doi.org/10.5281/zenodo.7867003
    Explore at:
    bin, zip (available download formats)
    Dataset updated
    Apr 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anne Forstinger; Stefan Wilbert; Adam R. Jensen; Birk Kraas; Carlos Fernández-Peruchena; Christian Gueymard; Dario Ronzio; Dazhi Yang; Elena Collino; Jesús Polo Martinez; Jose A. Ruiz-Arias; Natalie Hanrieder; Philippe Blanc; Yves-Marie Saint-Drenan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data annex contains the supplementary data to the IEA PVPS Task 16 report "Worldwide benchmark of modeled solar irradiance data" from 2023. The dataset includes visualizations and tables of the results as well as information concerning the reference stations.

    The dataset contains the following types of files (a filename-parsing sketch follows the list):

    • StationList.xlsx: list of all stations, including their coordinates, climate zone, station code, continent, altitude AMSL, data source, number of available test data sets, station type (Tier-1 or Tier-2), and available calibration record.
    • Result tables in folder “ResultTables”: Folders “climate_zones” and “continents” contain the tables described in Section 5.3. The filenames are “Component_metric_in_subgroup.html” with “component” DNI or GHI, “metric” describing the metric (see Table 3), and “subgroup” describing the continent or climate zone.
    • World maps: The folder “Resultmaps” contains world maps of the metrics described in Section 5.2. Either four or three metrics, depending on the map, are included in each pdf. A legend describing the meaning of the point size is also included.
    • Scatter plots of test vs. reference irradiance: The folder “Scatterplots” contains two folders, “DNI” and “GHI”, for the two investigated components. Three subfolders are also contained in these two folders:
      • The subfolders “plotsPerSiteYear” contain plots named “scatOverviewCOMPONENT_SITEYYYY.png”, where “COMPONENT” is either DNI or GHI, SITE is the three-letter site abbreviation, and YYYY is the evaluated year. The png plots include the scatterplots for all test data sets evaluated for the case specified by the filename.
      • The subfolders “plotsPerTestdataProvider” contain plots named “scatOverviewTESTDATASET_COMPONENTYYYY.png”, where “TESTDATASET” describes the test data set, “COMPONENT” is either DNI or GHI, and YYYY is the evaluated year. The png plots include the scatterplots for all sites evaluated for the case specified by the filename.
      • The subfolders “plotsPerTestdataProviderSamePosPerStat” contain the same scatterplots as “plotsPerTestdataProvider”, but using a slightly different visualization method. Here, the position of each scatterplot for a given site within the plot is always the same. Although this yields many empty subplots and small scatterplots, it can be helpful to rapidly browse through the plots if only one or a few stations are of interest.
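
    As an example of working with this naming scheme, the sketch below groups the per-site-year scatterplots by site (Python; the local paths are hypothetical):

    import re
    from collections import defaultdict
    from pathlib import Path

    # Matches "scatOverviewCOMPONENT_SITEYYYY.png", e.g. scatOverviewGHI_ABC2019.png.
    pattern = re.compile(r"scatOverview(DNI|GHI)_([A-Za-z]{3})(\d{4})\.png")
    by_site = defaultdict(list)

    for f in Path("Scatterplots/GHI/plotsPerSiteYear").glob("*.png"):
        m = pattern.match(f.name)
        if m:
            component, site, year = m.groups()
            by_site[site].append((int(year), f.name))

    for site, plots in sorted(by_site.items()):
        print(site, sorted(plots))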
  4. Data from: Detecting desertification in different years and rainfall regimes...

    • scielo.figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Thiago Costa dos Santos; Adunias dos Santos Teixeira; Fabrício da Silva Terra; Luis Clenio Jário Moreira; Raul Shiso Toma (2023). Detecting desertification in different years and rainfall regimes by 2D Scatter Plot [Dataset]. http://doi.org/10.6084/m9.figshare.19904126.v1
    Explore at:
    tiff (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    SciELO journals
    Authors
    Thiago Costa dos Santos; Adunias dos Santos Teixeira; Fabrício da Silva Terra; Luis Clenio Jário Moreira; Raul Shiso Toma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT The desertification process causes soil degradation and a reduction in vegetation. The absence of visualisation techniques and the broad spatial and temporal dimensions of the data hamper the identification of desertification and rapid decision-making by multidisciplinary teams. The 2D Scatter Plot is a two-dimensional visual analysis of reflectances in the red (630 - 690 nm) and near-infrared (760 - 900 nm) bands used to visualise the spectral response of the vegetation. The hypothesis of this study is that visualising the reflectances of the vegetation by means of a 2D scatter plot will allow desertification to be inferred. The aim of this study was to identify desertified areas and characterise the spatial and temporal dynamics of the vegetation and soil during dry (DP) and rainy (RP) periods between 2000 and 2008, using a 2D scatter plot. The 2D scatter plot generated by the Envi® 4.8 software and the reflectances in bands 3 and 4 of the TM5 sensor were used within communities in the Irauçuba hub (Ceará, Brazil). The concentration densities of the near-infrared reflectances of the vegetation pixels were observed. Each community presented pixel concentrations with reflectances of less than 0.4 (40%) during each of the periods under evaluation, indicating little vegetation development, with further degradation caused by deforestation, the use of fire and overgrazing. The 2D scatter plot was able to show vegetation with low reflectance in the near infrared during both dry and rainy periods between 2000 and 2008, thereby inferring the occurrence of desertification.
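
    The same kind of red/NIR scatter can be reproduced outside Envi; a minimal Python sketch (synthetic reflectances standing in for TM5 bands 3 and 4):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    red = rng.uniform(0.0, 0.6, 10_000)  # band 3 reflectance (630-690 nm)
    nir = rng.uniform(0.0, 0.8, 10_000)  # band 4 reflectance (760-900 nm)

    plt.scatter(red, nir, s=1, alpha=0.2)
    plt.axhline(0.4, linestyle="--", color="red")  # NIR < 0.4: little vegetation development
    plt.xlabel("Red reflectance")
    plt.ylabel("NIR reflectance")
    plt.show()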

  5. Large-Scale Preference Dataset

    • kaggle.com
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Large-Scale Preference Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/large-scale-preference-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Large-Scale Preference Dataset

    Training Powerful Reward & Critic Models with Aligned Language Models

    By Huggingface Hub [source]

    About this dataset

    UltraFeedback is a large-scale, fine-grained, and diverse preference dataset built to train powerful reward and critic models with aligned language models. Its prompts are drawn from distinct sources such as UltraChat, ShareGPT, Evol-Instruct, and TruthfulQA, for a total of 256k samples. The correct and incorrect answers attached to each prompt are included in the same data file, making the dataset straightforward to explore for a wide array of AI-driven projects.


    How to use the dataset

    The first step is to understand the content of the dataset, including source, models, correct answers and incorrect answers. Knowing which language models (LMs) were used to generate completions can help you better interpret the data in this dataset.

    Once you are familiar with the column titles and their meanings, it’s time to begin exploring! To maximize your insight into this data set, use a variety of visualization techniques, such as scatter plots or bar charts, to view sample distributions across different LMs or answer types. Analyzing trends between incorrect and correct answers through data manipulation techniques such as merging sets can also provide valuable insights into preferences across different prompts and sources.

    Finally, you may want to try running LR or other machine learning models on this dataset in order to create simple models for predicting preferences given inputs from real-world scenarios, for tasks that require a nuanced understanding of instructions provided by one’s peers or superiors.

    The possibilities for further exploration of this dataset are endless - now let’s get started!
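
    As a concrete starting point, a first pass over train.csv with pandas (a sketch; column names as documented in the Columns section below) might look like:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("train.csv")
    print(df[["source", "models"]].describe())

    # Sample distribution across prompt sources, as a bar chart.
    df["source"].value_counts().plot(kind="bar")
    plt.ylabel("samples")
    plt.tight_layout()
    plt.show()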

    Research Ideas

    • Training sentence completion models on the dataset to generate responses with high accuracy and diversity.
    • Creating natural language understanding (NLU) tasks such as question-answering and sentiment analysis using the aligned dataset as training/testing sets.
    • Developing supervised learning algorithms that use techniques like reward optimization, with potential applications in building machine translation systems from scratch or in upstream text-generation tasks like summarization and dialog generation.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication (No Copyright). You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:----------------------|:---------------------------------------------------------------|
    | source | The source of the data. (String) |
    | instruction | The instruction given to the language models. (String) |
    | models | The language models used to generate the completions. (String) |
    | correct_answers | The correct answers to the instruction. (String) |
    | incorrect_answers | The incorrect answers to the instruction. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  6. NBA Rookies Performance Statistics and Minutes

    • kaggle.com
    Updated Jan 15, 2023
    Cite
    The Devastator (2023). NBA Rookies Performance Statistics and Minutes [Dataset]. https://www.kaggle.com/datasets/thedevastator/nba-rookies-performance-statistics-and-minutes-p/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 15, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    Description

    NBA Rookies Performance Statistics and Minutes Played: 1980-2016

    Tracking Basketball Prodigies' Growth and Achievements

    By Gabe Salzer [source]

    About this dataset

    This dataset contains essential performance statistics for NBA rookies from 1980-2016. Here you can find minutes per game, points scored, field goals made and attempted, three-pointers made and attempted, free throws made and attempted (with the respective percentages for each), offensive rebounds, defensive rebounds, assists, steals, blocks, turnovers, efficiency rating and Hall of Fame induction year. It is organized in descending order by minutes played per game as well as draft year. This Kaggle dataset is an excellent resource for basketball analysts seeking a better understanding of how rookies have evolved over the years, from their stats to their induction into the Hall of Fame. With its great detail on individual players' performance data, this dataset allows you to compare performances across different eras in NBA history along with overall trends in rookie statistics. Compare rookies drafted far apart or those that played together: whatever your goal may be!


    How to use the dataset

    This dataset is perfect for providing insight into the performance of NBA rookies over an extended period of time. The data covers rookie stats from 1980 to 2016 and includes statistics such as points scored, field goals made, free throw percentage, offensive rebounds, defensive rebounds and assists. It also provides the name of each rookie along with the year they were drafted and their Hall of Fame class.

    This data set is useful for researching how rookies’ stats have changed over time in order to compare different eras or identify trends in player performance. It can also be used to evaluate players by comparing their stats against those of other players or previous years’ stats.

    In order to use this dataset effectively, a few tips are helpful:

    • Consider using Field Goal Percentage (FG%), Three Point Percentage (3P%) and Free Throw Percentage (FT%) to measure a player’s efficiency beyond just points scored or field goals made/attempted (FGM/FGA).

    • Look out for anomalies such as low efficiency ratings despite high minutes played: this could indicate either that a player has not had enough playing time for their statistics to reach what would be their per-game average with more minutes, or that they simply did not play well over that short period with limited opportunities.

    • Try different visualizations with the data, such as histograms, line graphs and scatter plots; each may offer different insights into varied aspects of the data set, like comparisons between individual years versus aggregate trends over multiple years.

      Lastly, it is important to keep in mind whether you're dealing with cumulative totals over multiple seasons versus individual season averages or per-game numbers when attempting analysis on these sets! A minimal starting sketch follows.
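
    A sketch of the anomaly check from the second tip above (the minutes and efficiency column names, MIN and EFF, are assumptions about this CSV, not documented fields):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("NBA Rookies by Year_Hall of Fame Class.csv")

    plt.scatter(df["MIN"], df["EFF"], s=8, alpha=0.4)  # minutes per game vs efficiency rating
    plt.xlabel("Minutes per game")
    plt.ylabel("Efficiency rating")
    plt.show()

    # Heavy minutes but low efficiency: candidates for a closer look.
    suspect = df[(df["MIN"] > df["MIN"].quantile(0.75)) & (df["EFF"] < df["EFF"].quantile(0.25))]
    print(suspect[["Name", "MIN", "EFF"]].head())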

    Research Ideas

    • Evaluating the performance of historical NBA rookies over time and how this can help inform future draft picks in the NBA.
    • Analysing the relative importance of certain performance stats, such as three-point percentage, to overall success and Hall of Fame induction from 1980-2016.
    • Comparing rookie seasons across different years to identify common trends in terms of statistical contributions and development over time

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: Dataset copyright by authors.

    You are free to:
    • Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt: remix, transform, and build upon the material for any purpose, even commercially.

    You must:
    • Give appropriate credit: provide a link to the license, and indicate if changes were made.
    • ShareAlike: distribute your contributions under the same license as the original.
    • Keep intact: all notices that refer to this license, including copyright notices.

    Columns

    File: NBA Rookies by Year_Hall of Fame Class.csv

    | Column name | Description |
    |:-----------------------|:------------------------------------------------------------------|
    | Name | The name of... |

  7. Additional file 1 of ChromoMap: an R package for interactive visualization...

    • springernature.figshare.com
    • datasetcatalog.nlm.nih.gov
    html
    Updated May 31, 2023
    Cite
    Lakshay Anand; Carlos M. Rodriguez Lopez (2023). Additional file 1 of ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes [Dataset]. http://doi.org/10.6084/m9.figshare.18230845.v1
    Explore at:
    html (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lakshay Anand; Carlos M. Rodriguez Lopez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. Example of chromoMap interactive plot constructed using various features of chromoMap including polyploidy (used as multi-track), feature-associated data visualization (scatter and bar plots), chromosome heatmaps, data filters (color-coded scatter and bars). Differential gene expression in a cohort of patients positive for COVID19 and healthy individuals (NCBI Gene Expression Omnibus id: GSE162835) [12]. Each set of five tracks labeled with the same chromosome ID (e.g. 1-22, X & Y) contains the following information: From top to bottom: (1) number of differentially expressed genes (DEGs) (FDR < 0.05) (bars over the chromosome depictions) per genomic window (green boxes within the chromosome). Windows containing ≥ 5 DEGs are shown in yellow. (2) DEGs (FDR < 0.05) between healthy individuals and patients positive for COVID19 visualized as a scatterplot above the chromosome depiction (genes with logFC ≥ 2 or logFC ≤ −2 are highlighted in orange). Dots above the grey dashed line represent upregulated genes in COVID19 positive patients. Heatmap within chromosome depictions indicates the average LogFC value per window. (3–4) Normalized expression of differentially expressed genes (scatterplot) and of each genomic window containing DEG (green scale heatmap) in (3) patients with severe/critical outcomes and (4) asymptomatic/mild outcome patients. (5) logFC of DEGs between healthy individuals and patients positive for COVID19 visualized as scatter plot color-coded based on the metabolic pathway each DEG belongs to.

  8. US Regional Sales Data

    • kaggle.com
    Updated Aug 14, 2023
    Cite
    Abu Talha (2023). US Regional Sales Data [Dataset]. https://www.kaggle.com/datasets/talhabu/us-regional-sales-data/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 14, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abu Talha
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides comprehensive insights into US regional sales data across different sales channels, including In-Store, Online, Distributor, and Wholesale. With a total of 17,992 rows and 15 columns, this dataset encompasses a wide range of information, from order and product details to sales performance metrics. It offers a comprehensive overview of sales transactions and customer interactions, enabling deep analysis of sales patterns, trends, and potential opportunities.

    Columns in the dataset:

    • OrderNumber: A unique identifier for each order.
    • Sales Channel: The channel through which the sale was made (In-Store, Online, Distributor, Wholesale).
    • WarehouseCode: Code representing the warehouse involved in the order.
    • ProcuredDate: Date when the products were procured.
    • OrderDate: Date when the order was placed.
    • ShipDate: Date when the order was shipped.
    • DeliveryDate: Date when the order was delivered.
    • SalesTeamID: Identifier for the sales team involved.
    • CustomerID: Identifier for the customer.
    • StoreID: Identifier for the store.
    • ProductID: Identifier for the product.
    • Order Quantity: Quantity of products ordered.
    • Discount Applied: Applied discount for the order.
    • Unit Cost: Cost of a single unit of the product.
    • Unit Price: Price at which the product was sold.

    This dataset serves as a valuable resource for analysing sales trends, identifying popular products, assessing the performance of different sales channels, and optimising pricing strategies for different regions.

    Visualization Ideas:

    • Time Series Analysis: Plot sales trends over time to identify seasonal patterns and changes in demand.
    • Sales Channel Comparison: Compare sales performance across different channels using bar charts or line graphs.
    • Product Analysis: Visualise the distribution of sales across different products using pie charts or bar plots.
    • Discount Analysis: Analyse the impact of discounts on sales using scatter plots or line graphs.
    • Regional Performance: Create maps to visualise sales performance across different regions.

    Data Modelling and Machine Learning Ideas (Price Prediction):

    • Linear Regression: Build a linear regression model to predict the unit price based on features such as order quantity, discount applied, and unit cost.
    • Random Forest Regression: Use a random forest regression model to predict the price, taking into account multiple features and their interactions.
    • Neural Networks: Train a neural network to predict unit price using deep learning techniques, which can capture complex relationships in the data.
    • Feature Importance Analysis: Identify the most influential features affecting price prediction using techniques like feature importance scores from tree-based models.
    • Time Series Forecasting: Develop a time series forecasting model to predict future prices based on historical sales data.

    These visualisation and modelling ideas can help you gain valuable insights from the sales data and create predictive models to optimise pricing strategies and improve sales performance; a baseline sketch for the linear-regression idea follows.
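
    A hedged baseline for the linear-regression idea (sklearn; the filename is hypothetical, and the columns are assumed to be numeric as documented):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("us_regional_sales.csv")

    X = df[["Order Quantity", "Discount Applied", "Unit Cost"]]
    y = df["Unit Price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("held-out R^2:", r2_score(y_test, model.predict(X_test)))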

  9. Annual Time Series of Air Temperature, Precipitation, and Urban Area Extent...

    • b2find.eudat.eu
    Updated Mar 20, 2024
    + more versions
    Cite
    (2024). Annual Time Series of Air Temperature, Precipitation, and Urban Area Extent in Modena, Italy [Dataset]. https://b2find.eudat.eu/dataset/46e3fcf8-0259-5400-ad52-45a02ed2d903
    Explore at:
    Dataset updated
    Mar 20, 2024
    Area covered
    Italy, Modena
    Description

    An uninterrupted data set of 139 annual values of local mean air temperature T, cumulative precipitation depth P, urban area extent A, global mean surface air temperature G, and global CO2 concentration C for the 1881-2019 period is shared with the scientific community, together with the Matlab 2021a code na.m, which performs a nonlinear analysis of the data contained in the file ts.dat. The code loads file ts.dat and generates the PDF files of this dataset. File README.txt contains the description of this dataset and its files.

    The shared data can be found in the ASCII text file ts.dat (as well as in dataset doi:10.1594/PANGAEA.938739, which has been created from that file). The first column, with header year, contains the year. The second column, with header T (°C), contains the local mean air temperature T in Celsius degrees observed in Modena. The third column, with header P (mm), contains the cumulative precipitation depth P in millimeters in Modena. The fourth column, with header A (km2), contains the urban area extent A in square kilometers of Modena. The fifth column, with header G (°C), contains the global mean surface air temperature G in Celsius degrees obtained by adding the GISTEMP temperature change to the average temperature observed in Modena in the 1951-1980 base period (https://data.giss.nasa.gov/gistemp/). The sixth column, with header C (ppm), contains the global CO2 concentration C in parts per million, estimated from ice cores from 1881 to 1958 (https://cdiac.ess-dive.lbl.gov/trends/co2/lawdome-data.html) and observed at the Mauna Loa Observatory (latitude 19.5362°N, longitude 155.5763°W, elevation 3397.00 m asl), Hawaii, from 1959 to 2019 (https://gml.noaa.gov/ccgg/trends/data.html).

    The Matlab 2021a code na.m loads the file ts.dat and generates the following PDF files:
    • lg.pdf: comparison between local temperature in Modena and global temperatures obtained from the NASA GISTEMP temperature change projected to Modena.
    • dm.pdf: scatter plot matrix of T, P, A, G, and C.
    • vm.pdf: scatter plot matrix for the first differences of T, P, A, G, and C.
    • pm.pdf: generalized additive model predictions of T, P, A, G, and C, denoted as T', P', A', G', and C', obtained from single predictors T, P, A, G, and C.
    • gam.pdf: generalized additive model predictions of T and G, denoted as T' and G', respectively, obtained from multiple predictors based on T, P, A, G, and C.

    The nonlinear analysis performed using the data set contained in the ASCII text file ts.dat and the Matlab 2021a code na.m is described in Orlandini et al. 2021 and available from the authors sharing the present data set upon request.
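
    A Python equivalent of the data-loading and dm.pdf steps of na.m might look like this (a sketch only: the delimiter of ts.dat is sniffed rather than assumed, and the column names are shortened to year, T, P, A, G, C):

    import pandas as pd
    import matplotlib.pyplot as plt

    ts = pd.read_csv("ts.dat", sep=None, engine="python")  # sniff the delimiter
    ts.columns = ["year", "T", "P", "A", "G", "C"]         # per the column description above

    # Scatter plot matrix of T, P, A, G, and C, analogous to dm.pdf.
    pd.plotting.scatter_matrix(ts[["T", "P", "A", "G", "C"]], figsize=(8, 8))
    plt.savefig("dm_python.pdf")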

  10. Shedding new light on the integrity of gold nanoparticle-fluorophore...

    • research-data.cardiff.ac.uk
    zip
    Updated Sep 18, 2024
    Cite
    Panagiota Giannakopoulou; Joseph Williams; Paul Moody; Edward Sayers; JP Magnusson; Iestyn Pope; Lukas Payne; C Alexander; Arwyn Jones; Wolfgang Langbein; Peter Watson; Paola Borri (2024). Shedding new light on the integrity of gold nanoparticle-fluorophore conjugates for cell biology with four-wave-mixing microscopy - dataset [Dataset]. http://doi.org/10.17035/d.2019.0081702601
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 18, 2024
    Dataset provided by
    Cardiff University
    Authors
    Panagiota Giannakopoulou; Joseph Williams; Paul Moody; Edward Sayers; JP Magnusson; Iestyn Pope; Lukas Payne; C Alexander; Arwyn Jones; Wolfgang Langbein; Peter Watson; Paola Borri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a cross-disciplinary work at the physics/life science interface which addresses an important question in the use of gold nanoparticles (AuNPs) conjugated to fluorescent molecules for cell biology, namely whether the fluorophore is a faithful reporter of the nanoparticle location. AuNPs are among the most widely investigated systems in nano-medicine research for applications in intracellular imaging and sensing, drug delivery and photothermal therapy, owing to their small sizes, biocompatibility, ease of surface functionalisation and bio-conjugation. In this context, a particularly interesting system is that of a AuNP-fluorophore conjugate, whereby a fluorescently labelled biomolecule (e.g. a protein ligand, nucleotide, peptide, antibody) is attached onto the AuNP surface, and its uptake and intracellular fate is followed in situ in real time by fluorescence microscopy. AuNPs are historically well known to biologists as markers for electron microscopy due to their high electron density; hence these conjugates are specifically useful probes for correlative light electron microscopy. However, an important question that has remained elusive to answer is whether the fluorescence readout is actually a reliable reporter of the AuNP location. This is because it is challenging with current optical techniques to directly visualise a single small AuNP against the endogenous scattering, absorption and phase contrast in a highly heterogeneous three-dimensional cellular environment.

    These data demonstrate the application of a novel optical microscopy technique developed in our lab (four-wave mixing (FWM) interferometry) to directly image single small AuNPs background-free inside cells with high 3D spatial resolution. The data show four different AuNP-fluorophore conjugates imaged inside two different cell types. By correlative fluorescence-FWM microscopy, the data show that, in most cases, fluorescence emission originated from unbound fluorophores rather than from fluorophores attached to nanoparticles. Fluorescence detection was also severely limited by photobleaching, quenching and autofluorescence background.

    The datasets consist of images and numerical data. Images fall into two groups: experimental and calculated datasets. Experimental images are optical microscopy datasets obtained using: 1) differential interference contrast (DIC) microscopy; 2) FWM microscopy; 3) confocal fluorescence microscopy; 4) extinction microscopy; 5) wide-field epifluorescence microscopy. Calculated datasets are images of the cross-correlation coefficient as a function of relative translation coordinates, calculated from the experimental images.

    Numerical data consist of:
    1) One-dimensional cut profiles along images.
    2) Plots of representative values of extinction cross-sections.
    3) A scatter plot from a two-channel/colour fluorescence image, showing the intensity of one colour channel in a given pixel as the x-coordinate and the fluorescence intensity of the second channel at the same pixel as the y-coordinate.
    4) A scatter plot showing the fluorescence flux (in units of detected photoelectrons/s) versus the extinction cross-section of nanoparticles.
    All numerical data are provided as Origin plots from which the original datasets can be retrieved.

    Research results based upon these data are published at https://doi.org/10.1039/C9NR08512B

  11. Representations of color and form in mouse visual cortex

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Nov 29, 2023
    Cite
    Issac Rhim; Ian Nauhaus (2023). Representations of color and form in mouse visual cortex [Dataset]. http://doi.org/10.5061/dryad.t1g1jwt3r
    Explore at:
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    Issac Rhim; Ian Nauhaus
    Time period covered
    Jan 1, 2021
    Description

    Spatial transitions in color can aid any visual perception task, and its neural representation – an “integration of color and form” – is thought to begin at primary visual cortex (V1). Color and form integration is untested in mouse V1, yet studies show that the ventral retina provides the necessary substrate from green-sensitive rods and UV-sensitive cones. Here, we used two-photon imaging in V1 to measure spatial frequency (SF) tuning along four axes of rod and cone contrast space, including luminance and color. We first reveal that V1 has similar responsiveness to luminance and color, yet average SF tuning is significantly shifted lowpass for color. Next, guided by linear models, we used SF tuning along all four color axes to estimate the proportion of neurons that fall into classic models of color opponency – “single-”, “double-”, and “non-opponent”. Few neurons (~6%) fit the criteria for double-opponency, which are uniquely tuned for chromatic borders. Most of the population can be...

    This data comes from two-photon imaging in mouse primary visual cortex. There is also Matlab code to run the simulations in figures 1, 6, 7, and 8.

    See uploaded README files for details. Below is the top of README_for_dataset.doc, which describes the uploaded data set used in Rhim and Nauhaus: “Joint representations of color and form in mouse visual cortex described by random pooling from rods and cones”. It is a MATLAB .mat file, where each structure pertains to a given figure. In addition to the source data for the figures, it also has the following additions:

    The same data set, but prior to culling the population according to the dashed box in the Figure 2 scatter plot. See variables appended with “..._all”.

    Region-of-interest ID associated with each neuron.

    Below is all the information in README_for_simulations.doc. To run the simulations for Figures 1, 6, 7, and 8, execute the cells in the high-level scripts of the following: Figure_1.m, Figure_6_7.m, Figure_8.m. Make sure all the other .m files are in your path.

  12. Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation...

    • b2find.eudat.eu
    Updated Aug 17, 2025
    + more versions
    Cite
    (2025). Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2) [Dataset]. https://b2find.eudat.eu/dataset/3524622d-2099-554c-826a-f2155c3f4bb4
    Explore at:
    Dataset updated
    Aug 17, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2), conducted for the paper: What do anomaly scores actually mean? Key characteristics of algorithms' dynamics beyond accuracy, by F. Iglesias, H. O. Marques, A. Zimek, T. Zseby.

    Context and methodology

    Anomaly detection is intrinsic to a large number of data analysis applications today. Most of the algorithms used assign an outlierness score to each instance prior to establishing anomalies in a binary form. The experiments in this repository study how different algorithms generate different dynamics in the outlierness scores and react in very different ways to possible model perturbations that affect data. The study elaborated in the referred paper presents new indices and coefficients to assess the dynamics and explores the responses of the algorithms as a function of variations in these indices, revealing key aspects of the interdependence between algorithms, data geometries and the ability to discriminate anomalies. Therefore, this repository reproduces the conducted experiments, which study eight algorithms (ABOD, HBOS, iForest, K-NN, LOF, OCSVM, SDO and GLOSH) submitted to seven perturbations (related to cardinality, dimensionality, outlier proportion, inlier-outlier density ratio, density layers, clusters and local outliers), and collects behavioural profiles with eleven measurements (Adjusted Average Precision, ROC-AUC, Perini's Confidence [1], Perini's Stability [2], S-curves, Discriminant Power, Robust Coefficients of Variation for Inliers and Outliers, Coherence, Bias and Robustness) under two types of normalization: linear and Gaussian, the latter aiming to standardize the outlierness scores issued by different algorithms [3].

    This repository is framed within research on the following domains: algorithm evaluation, outlier detection, anomaly detection, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    References

    [1] Perini, L., Vercruyssen, V., Davis, J.: Quantifying the confidence of anomaly detectors in their example-wise predictions. In: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer Verlag (2020).
    [2] Perini, L., Galvin, C., Vercruyssen, V.: A Ranking Stability Measure for Quantifying the Robustness of Anomaly Detection Methods. In: 2nd Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning @ ECML/PKDD (2020).
    [3] Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pp. 13-24 (2011).

    Technical details

    Experiments are tested in Python 3.9.6. The provided scripts generate all synthetic data and results, which are kept in the repo for the sake of comparability and replicability ("outputs.zip" file). The file and folder structure is as follows:
    • "compare_scores_group.py": Python script to extract the new dynamic indices proposed in the paper.
    • "generate_data.py": Python script to generate the datasets used for evaluation.
    • "latex_table.py": Python script to show results in a latex-table format.
    • "merge_indices.py": Python script to merge accuracy and dynamic indices in the same table-structured summary.
    • "metric_corr.py": Python script to calculate correlation estimations between indices.
    • "outdet.py": Python script that runs outlier detection with different algorithms on diverse datasets.
    • "perini_tests.py": Python script to run Perini's confidence and stability on all datasets and algorithms' performances.
    • "scatterplots.py": Python script that generates scatter plots comparing accuracy and dynamic performances.
    • "README.md": explanations and step-by-step instructions for replication.
    • "requirements.txt": references to required Python libraries and versions.
    • "outputs.zip": all result tables, plots and synthetic data generated with the scripts.
    • [data/real_data]: CSV versions of the Wilt, Shuttle, Waveform and Cardiotocography datasets (inherited and adapted from the LMU repository).

    License

    The CC-BY license applies to all data generated with the "generate_data.py" script. All distributed code is under the GNU GPL license. For the "ExCeeD.py" and "stability.py" scripts, please consult and refer to the original sources provided above.

  13. Data from: Data-driven Multivariate Power Curve Modeling of Offshore Wind...

    • data.mendeley.com
    • narcis.nl
    Updated Jul 25, 2016
    Cite
    Olivier Janssens (2016). Data-driven Multivariate Power Curve Modeling of Offshore Wind Turbines [Dataset]. http://doi.org/10.17632/gst3cdfnn5.1
    Explore at:
    Dataset updated
    Jul 25, 2016
    Authors
    Olivier Janssens
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    a) Description: A synthetic dataset consisting of 20,000 power and wind speed values. The goal of this dataset is to objectively quantify power curve modelling techniques for wind turbines.

    b) Size: 580.0 kB

    c) Platform: Any OS or programming language can read a txt file.

    d) Environment: As this is a txt file, any modern OS will do. The txt file consists of comma-separated values, so all modern programming languages can be used to read this file.

    e) Major Component Description: There are 20,001 rows in the txt file. The first row contains the column headers. The other 20,000 lines contain the corresponding column values. There are two columns: the first is the power and the second the wind speed.

    f) Detailed Set-up Instructions: This depends on the platform and programming language. Since this is a txt file with tab-separated values, a broad range of options are possible and can be looked up.

    g) Detailed Run Instructions: /

    h) Output Description: When plotting the wind speed values vs the power values using a scatter plot (e.g. Matlab or Python matplotlib), a power curve can be seen; see the sketch below.
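
    For instance, the power curve can be plotted in a few lines of Python (a sketch; the filename is hypothetical, and the delimiter is sniffed because the description mentions both comma- and tab-separated values):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("power_curve.txt", sep=None, engine="python")  # columns: power, wind speed
    plt.scatter(df.iloc[:, 1], df.iloc[:, 0], s=2, alpha=0.3)
    plt.xlabel("Wind speed")
    plt.ylabel("Power")
    plt.show()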

  14. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/3
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    hakuna matata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction Title: Cucumber Disease Detection

    Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

    2. Problem Statement Problem Definition: The research uses image analysis methods to address the issue of automating the identification of diseases, including Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early illness identification.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

    3. Data Collection and Preprocessing Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and damaged specimens.

    Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

    4. Exploratory Data Analysis (EDA) The dataset was examined using visuals like scatter plots and histograms, and the data were inspected for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.

    5. Methodology Machine Learning Algorithms:

    Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:

    The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
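
    As a sketch of such a split (the folder layout and label names are assumptions, not the project's actual pipeline):

    from pathlib import Path
    from sklearn.model_selection import train_test_split

    paths = sorted(Path("cucumber_images").glob("*/*.jpg"))  # one subfolder per class
    labels = [p.parent.name for p in paths]                  # e.g. "healthy", "downy_mildew"

    train_paths, test_paths, train_labels, test_labels = train_test_split(
        paths, labels, test_size=0.2, stratify=labels, random_state=42)
    print(len(train_paths), "training images,", len(test_paths), "test images")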

    6. Model Development The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods like dropout and L2 regularization were used.

    7. Model Training During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.

    8. Model Evaluation Evaluation Metrics:

    Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:

    The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

    9. Results and Discussion Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges faced throughout the project along with the methods used to solve them.

    10. Conclusion Recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.

    11. References Libraries: Pillow, Roboflow, YOLO, sklearn, matplotlib. Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1

    12. Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111

  15. Data Set for: Step-by-Step Calculation and Spreadsheet Tools for Predicting...

    • catalog.data.gov
    Updated Nov 12, 2020
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Data Set for: Step-by-Step Calculation and Spreadsheet Tools for Predicting Stressor Levels that Extirpate Genera and Species [Dataset]. https://catalog.data.gov/dataset/data-set-for-step-by-step-calculation-and-spreadsheet-tools-for-predicting-stressor-levels
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The data include measured data from Ecoregions 69 and 70 in West Virginia. Paired biological and chemical grab samples are included. These data were used to estimate specific conductivity (SC) extirpation concentrations (XC95) for benthic invertebrate genera. Also included are cumulative frequency distribution plots, scatter plots fitted with generalized additive models, and biogeographical maps of observations of each genus. The metadata and full data set are available in Supplemental Appendices S4 and S5, respectively. The output of 176 XC95 values from Ecoregions 69 and 70 is provided in Supplemental Appendix S6. Supplemental Appendix S7 depicts the probability of observing a genus for discrete ranges of SC. Supplemental Appendix S8 depicts the proportion of occurrence of a genus for discrete ranges of SC. Supplemental Appendix S9 shows the biogeographic distributions of the genera included in the data set. We also discuss limitations of this method to help avoid misinterpretations and inferential errors. A data dictionary is provided in Cond_DataFileColumnMetada-20161221. This dataset is associated with the following publication: Cormier, S., L. Zheng, E. Leppo, and A. Hamilton. Step-by-Step Calculation and Spreadsheet Tools for Predicting Stressor Levels that Extirpate Genera and Species. Integrated Environmental Assessment and Management. Allen Press, Inc., Lawrence, KS, USA, 14(2): 174-180, (2018).
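
    Conceptually, an XC95 is the stressor level below which 95% of a genus's observations fall. A toy Python sketch of that idea (a simplification of the paper's weighted-distribution method; the conductivity values are made up):

    import numpy as np

    # Specific conductivity (uS/cm) of samples where a hypothetical genus was observed.
    sc_where_present = np.array([120, 150, 200, 240, 300, 410, 520, 700, 950, 1300])

    xc95 = np.percentile(sc_where_present, 95)
    print(f"XC95 ~ {xc95:.0f} uS/cm")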

  16. Exploring the SDSS data set. I. EMP & CV stars

    • b2find.eudat.eu
    Updated Feb 10, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2017). Exploring the SDSS data set. I. EMP & CV stars [Dataset]. https://b2find.eudat.eu/dataset/2a30d2b3-32f1-59f8-b3ab-ad01de5a7de9
    Explore at:
    Dataset updated
    Feb 10, 2017
    Description

    We present the results of a search for extremely metal-poor (EMP), carbon-enhanced metal-poor (CEMP), and cataclysmic variable (CV) stars using a new exploration tool based on linked scatter plots (LSPs). Our approach is especially designed to work with very large spectrum data sets such as the SDSS, LAMOST, RAVE, and Gaia data sets, and it can be applied to stellar, galaxy, and quasar spectra. As a demonstration, we conduct our search using the SDSS DR10 data set. We first created a 3326-dimensional phase space containing nearly 2 billion measures of the strengths of over 1600 spectral features in 569738 SDSS stars. These measures capture essentially all the stellar atomic and molecular species visible at the resolution of SDSS spectra. We show how LSPs can be used to quickly isolate and examine interesting portions of this phase space. To illustrate, we use LSPs coupled with cuts in selected portions of phase space to extract EMP stars, CEMP stars, and CV stars. We present identifications for 59 previously unrecognized candidate EMP stars and 11 previously unrecognized candidate CEMP stars. We also call attention to 2 candidate He II emission CV stars found by the LSP approach that have not yet been discussed in the literature.

  17. Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Cite
    Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño; Joel Castaño
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

    This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.
    The framework considers the trade-offs between three key objectives:
    1. Performance (maximizing throughput)
    2. Energy Efficiency (minimizing estimated energy per unit)
    3. Cost (minimizing estimated hardware cost)
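
    A minimal sketch of the trade-off handling is given below: configurations are filtered to the Pareto front, i.e. those not dominated on all three objectives at once. The Config fields and example numbers are invented for illustration; the notebook's actual recommendation code may differ.

    from dataclasses import dataclass

    @dataclass
    class Config:
        name: str
        throughput: float   # higher is better
        energy: float       # lower is better
        cost: float         # lower is better

    def dominates(a: Config, b: Config) -> bool:
        # a dominates b if it is no worse on every objective and better on at least one
        no_worse = (a.throughput >= b.throughput, a.energy <= b.energy, a.cost <= b.cost)
        better = (a.throughput > b.throughput, a.energy < b.energy, a.cost < b.cost)
        return all(no_worse) and any(better)

    def pareto_front(configs: list[Config]) -> list[Config]:
        return [c for c in configs
                if not any(dominates(o, c) for o in configs if o is not c)]

    configs = [
        Config("A100x4", 12000, 1.8, 80000),
        Config("H100x2", 15000, 1.5, 90000),
        Config("L4x8", 6000, 0.9, 30000),
        Config("T4x8", 3000, 1.0, 35000),   # dominated by L4x8 on all three objectives
    ]
    print([c.name for c in pareto_front(configs)])   # A100x4, H100x2, L4x8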

    Repository Structure

    This repository is organized as follows:
    • Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.
    • Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.
    • Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
    • Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
    • Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling; the output of the Dataset_Extension.ipynb notebook.
    • eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.
    • requirements.txt: A list of all Python libraries and versions required to run the code in this repository.
    • eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
    • optimization_models_final/: A directory where the trained final model files (.joblib) are saved after running the optimization notebook.
    • pareto_validation_plot_fold_0.png: The validation plot comparing the true and predicted Pareto fronts, as presented in the thesis.
    • shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

    Requirements and Installation

    To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.
    1. Clone the repository:
       git clone <repository-url>
       cd <repository-directory>
    2. Create and activate a virtual environment (optional but recommended):
       python -m venv venv
       source venv/bin/activate    # On Windows, use venv\Scripts\activate
    3. Install the required packages. All dependencies are listed in the requirements.txt file; install them using pip:
       pip install -r requirements.txt

    Step-by-Step Reproduction Workflow

    The notebooks are designed to be run in a logical sequence.

    Step 1: Data Enrichment (Optional)

    The final enriched dataset (Inference_data_Extended.csv) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the Dataset_Extension.ipynb notebook. It will take Inference_data.csv as input and generate the extended version.

    Step 2: Exploratory Data Analysis (Optional)

    All plots from the EDA are pre-generated and available in the eda_plots/ directory. To regenerate them, run the Data_Analysis.ipynb notebook. This will overwrite the existing plots and the eda_log.txt file.

    Step 3: Main Model Training, Validation, and Recommendation

    This is the core of the thesis. Running the Optimization_Model.ipynb notebook executes the entire pipeline:
    1. It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.
    2. It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.
    3. It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.
    4. It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
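
    The group-aware validation in step 1 can be sketched with scikit-learn's GroupKFold, which keeps all rows sharing a group key in the same fold. The feature, target, and grouping column names below are assumptions for illustration; the notebook defines the actual ones.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import GroupKFold

    df = pd.read_csv("Inference_data_Extended.csv")
    X = df[["num_accelerators", "tdp_watts", "cost_usd"]]   # assumed feature columns
    y = df["throughput"]                                    # assumed target column
    groups = df["system_id"]                                # assumed grouping column

    # Rows from the same system never appear in both train and test,
    # so each fold measures performance on unseen hardware systems.
    for fold, (tr, te) in enumerate(GroupKFold(n_splits=5).split(X, y, groups)):
        model = RandomForestRegressor(random_state=0).fit(X.iloc[tr], y.iloc[tr])
        print(f"fold {fold}: R^2 = {r2_score(y.iloc[te], model.predict(X.iloc[te])):.3f}")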
  18. Data from: Quantitative imaging of lipids in live mouse oocytes and early embryos using CARS microscopy

    • research-data.cardiff.ac.uk
    zip
    Updated Sep 18, 2024
    Cite
    J Bradley; Iestyn Pope; Francesco Masia; Wolfgang Langbein; Karl Swann; Paola Borri (2024). Quantitative imaging of lipids in live mouse oocytes and early embryos using CARS microscopy [Dataset]. http://doi.org/10.17035/d.2016.0008223993
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 18, 2024
    Dataset provided by
    Cardiff University
    Authors
    J Bradley; Iestyn Pope; Francesco Masia; Wolfgang Langbein; Karl Swann; Paola Borri
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mammalian oocytes contain lipid droplets (LDs) that are a store of fatty acids, whose metabolism plays a significant role in pre-implantation development. Fluorescent staining has previously been used to image lipid droplets in mammalian oocytes and embryos, but this method is not quantitative and is often incompatible with live cell imaging and subsequent development. These data show the application of chemically specific, label-free coherent anti-Stokes Raman scattering (CARS) microscopy to mouse oocytes and pre-implantation embryos. The data show that CARS imaging can quantify the size, number and spatial distribution of lipid droplets in living mouse oocytes and embryos up to the blastocyst stage; notably, it can be used in a way that does not compromise oocyte maturation or embryo development. The data also correlate CARS with two-photon fluorescence microscopy simultaneously acquired using fluorescent lipid probes on fixed samples, and demonstrate only a partial degree of correlation, depending on the lipid probe, clearly exemplifying the limitations of lipid labelling. In addition, the data show that differences in the chemical composition of lipid droplets in living oocytes matured in media supplemented with different amounts of saturated and unsaturated fatty acids can be detected using CARS hyperspectral imaging. These data demonstrate that CARS microscopy provides a novel non-invasive method of quantifying lipid content, type and spatial distribution with sub-micron resolution in living mammalian oocytes and embryos.
    The data set consists of optical microscopy images and numerical data. Microscope images show oocytes and early embryos (as cross-sections in two dimensions or as maximum intensity projections) obtained using Differential Interference Contrast (DIC) microscopy, CARS microscopy, and fluorescence microscopy. Lipid droplets of oocytes and early embryos are specifically visualised in the CARS microscopy images. The numerical data consist of the following groups:
    1) Histogram of the occurrence of the aggregate size (number of lipid droplets per aggregate) in a representative egg. The data set is an ASCII file with X and Y columns, where X is the aggregate size and Y the occurrence.
    2) Scatter plot of the square root of the sum of the squared aggregate sizes against the total number of lipid droplets, in ensembles of eggs and embryos. The data set is an ASCII file with X and Y columns, where X is the square root of the sum of the squared aggregate sizes and Y the total number of lipid droplets.
    3) Vibrational Raman-like spectra obtained from CARS hyperspectral images of lipid droplets in representative eggs and embryos. The data set is an ASCII file with X and Y columns, where X is the wavenumber and Y the CARS susceptibility (imaginary part).
    4) Histogram of the occurrence of the LD effective diameter in a representative egg. The data set is an ASCII file with X and Y columns, where X is the LD diameter and Y the occurrence.
    5) Scatter plot of the diameter of LDs against the total number of LDs, in ensembles of eggs and embryos. The data set is an ASCII file with X and Y columns, where X is the diameter of LDs and Y the total number of lipid droplets.
    Results derived from these data are published at http://dx.doi.org/10.1242/dev.129908
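
    Since the numerical files are plain two-column ASCII, they can be loaded in a couple of lines; the file name below is hypothetical, and the axis labels follow group 4 above.

    import numpy as np
    import matplotlib.pyplot as plt

    # Load one of the two-column ASCII files (X = LD effective diameter, Y = occurrence).
    x, y = np.loadtxt("ld_diameter_histogram.txt", unpack=True)   # hypothetical file name
    plt.bar(x, y, width=0.9 * np.min(np.diff(x)))
    plt.xlabel("LD effective diameter")
    plt.ylabel("Occurrence")
    plt.show()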

  19. Three-dimensional scatter plots showing the probability of elimination at vet gates 2, 3, 4 and 5, according to the corresponding logistic regressions with a fixed HR of 64 and the AS and CRT measured at the previous vet gate (n-1)

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Aug 31, 2015
    Cite
    Younes, Mohamed; Barrey, Eric; Cottin, François; Robert, Céline (2015). Three-dimensional scatter plots showing the probability of elimination at vet gates 2, 3, 4 and 5, according to the corresponding logistic regressions with a fixed HR of 64 and the AS and CRT measured at the previous vet gate (n-1). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001868318
    Explore at:
    Dataset updated
    Aug 31, 2015
    Authors
    Younes, Mohamed; Barrey, Eric; Cottin, François; Robert, Céline
    Description

    Red corresponds to a probability of elimination of 60-80%, whereas brown (the darkest areas) corresponds to a probability of 80-100%. The white line marks a probability of elimination of 70%, the threshold chosen to compute the probability of elimination in an independent validation data set.
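
    The plotted surfaces follow directly from the logistic model: with HR fixed at 64, the probability of elimination over the (AS, CRT) plane is p = 1 / (1 + exp(-(b0 + bHR*HR + bAS*AS + bCRT*CRT))). The sketch below reproduces such a surface with hypothetical coefficients and axis ranges; the study's fitted coefficients at each vet gate would replace them.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical logistic-regression coefficients (the real ones come from the study):
    b0, b_hr, b_as, b_crt = -20.0, 0.15, 0.4, 2.0
    HR = 64.0                                    # heart rate fixed at 64, as in the figures

    AS, CRT = np.meshgrid(np.linspace(8, 24, 200),   # AS range, assumed
                          np.linspace(1, 5, 200))    # CRT range, assumed
    p = 1.0 / (1.0 + np.exp(-(b0 + b_hr * HR + b_as * AS + b_crt * CRT)))

    plt.contourf(AS, CRT, p, levels=20, cmap="Reds")
    plt.colorbar(label="P(elimination)")
    plt.contour(AS, CRT, p, levels=[0.7], colors="white")   # the 70% threshold line
    plt.xlabel("AS at previous vet gate")
    plt.ylabel("CRT at previous vet gate")
    plt.show()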

  20. UK National Databank of Moored Current Meter Data (1967-)

    • bodc.ac.uk
    • data-search.nerc.ac.uk
    nc
    Updated Jan 30, 2017
    Cite
    British Oceanographic Data Centre (2017). UK National Databank of Moored Current Meter Data (1967-) [Dataset]. https://www.bodc.ac.uk/resources/inventories/edmed/report/157/
    Explore at:
    ncAvailable download formats
    Dataset updated
    Jan 30, 2017
    Dataset authored and provided by
    British Oceanographic Data Centre (http://www.bodc.ac.uk/)
    License

    https://vocab.nerc.ac.uk/collection/L08/current/LI/

    Time period covered
    1967 - Present
    Area covered
    Norwegian Sea, Inner Seas off the West Coast of Scotland, North Sea, Indian Ocean, English Channel, Mediterranean Sea, South Atlantic Ocean, Irish Sea, North Atlantic Ocean
    Description

    The data set comprises more than 7000 time series of ocean currents from moored instruments. The records contain horizontal current speed and direction, often with concurrent temperature data; they may also contain vertical velocity, pressure and conductivity data. The majority of the data originate from the continental shelf seas around the British Isles (for example, the North Sea, Irish Sea and Celtic Sea) and the North Atlantic. Measurements are also available for the South Atlantic, Indian, Arctic and Southern Oceans and the Mediterranean Sea. Data collection commenced in 1967 and is currently ongoing. Sampling intervals normally vary between 5 and 60 minutes. Current meter deployments are typically of 2-8 weeks' duration in shelf areas but up to 6-12 months in the open ocean. About 25 per cent of the data come from water depths greater than 200 m. The data are processed and stored by the British Oceanographic Data Centre (BODC) and a computerised inventory is available online. Data are quality controlled prior to loading into the databank: data cycles are visually inspected using a screening software package; data from current meters on the same mooring or on adjacent moorings can be overplotted, and the data can also be displayed as time series or scatter plots. Series header information accompanying the data is checked, and documentation is compiled detailing the data collection and processing methods.
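
    A common way to screen such records, consistent with the scatter-plot displays mentioned above, is to resolve speed and direction into east/north velocity components and plot one against the other. The sketch below assumes a hypothetical CSV export with speed_cm_s and direction_deg columns, and the oceanographic convention that direction is where the current flows towards.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("mooring_series.csv")       # hypothetical export of one series
    theta = np.deg2rad(df["direction_deg"])      # direction the current flows towards
    u = df["speed_cm_s"] * np.sin(theta)         # eastward component
    v = df["speed_cm_s"] * np.cos(theta)         # northward component

    plt.scatter(u, v, s=2, alpha=0.3)
    plt.axhline(0, color="gray")
    plt.axvline(0, color="gray")
    plt.xlabel("u (cm/s, east)")
    plt.ylabel("v (cm/s, north)")
    plt.gca().set_aspect("equal")
    plt.show()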
