46 datasets found
  1. Ultimate_Analysis

    • data.mendeley.com
    Updated Jan 28, 2022
    Cite
    Akara Kijkarncharoensin (2022). Ultimate_Analysis [Dataset]. http://doi.org/10.17632/t8x96g88p3.2
    Explore at:
    Dataset updated
    Jan 28, 2022
    Authors
    Akara Kijkarncharoensin
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This database studies the performance inconsistency of biomass HHV models based on ultimate analysis. The research null hypothesis is that the rank of a biomass HHV model is consistent across datasets. Fifteen biomass models are trained and tested on four datasets; in each dataset, the rank invariability of these 15 models indicates performance consistency.

    The database includes the datasets and source code needed to analyze this performance consistency. The datasets are stored as tables in an Excel workbook. The source code implements the biomass HHV machine learning models in MATLAB's object-oriented programming (OOP) style. The models comprise eight regressions, four supervised learning methods, and three neural networks.

    An Excel workbook, "BiomassDataSetUltimate.xlsx," collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV data points from 20 publications. The column names indicate the elements of the ultimate analysis on a % dry basis, and the HHV column gives the higher heating value in MJ/kg. The next worksheet, "Full Residuals," backs up the model-testing residuals from the 20-fold cross-validations; the article (Kijkarncharoensin & Innet, 2021) verifies performance consistency through these residuals. The remaining worksheets present the literature datasets used to train and test model performance in previous studies.

    A file named "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The folders in this file mirror the class structure of the machine learning models. These classes extend MATLAB's Statistics and Machine Learning Toolbox to support features such as k-fold cross-validation. The MATLAB script "runStudyUltimate.m" is the article's main program for analyzing the performance consistency of the biomass HHV models through ultimate analysis. The script loads the datasets from the Excel workbook and automatically fits the biomass models through the OOP classes.

    The first section of the MATLAB script generates the most accurate model by optimizing each model's hyperparameters. The first run takes a few hours to train the machine learning models through this trial-and-error process. The trained models can be saved to a MATLAB .mat file and loaded back into the MATLAB workspace. The remainder of the script, separated by a section break, performs the residual analysis that inspects performance consistency. It also produces a 3D scatter plot of the biomass data and box plots of the prediction residuals. The interpretation of these results is examined in the author's article.
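    The published source code is MATLAB OOP; purely as an illustrative analogue, the hedged Python sketch below (toy data and stand-in regressors, not the article's models) shows the rank-consistency idea: pool 20-fold cross-validation residuals for several models on one dataset and rank the models by RMSE. Repeating this per dataset and comparing the rankings is the consistency check described above.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(30, 60, (200, 4))                            # toy stand-in for C/H/N/O (% dry basis)
    y = 0.34 * X[:, 0] + 1.4 * X[:, 1] + rng.normal(0, 1, 200)   # toy HHV-like target (MJ/kg)

    models = {"linear": LinearRegression(),
              "knn": KNeighborsRegressor(),
              "forest": RandomForestRegressor(random_state=0)}
    rmse = {}
    for name, model in models.items():
        residuals = []                                           # pooled 20-fold CV residuals
        for tr, te in KFold(n_splits=20, shuffle=True, random_state=0).split(X):
            residuals.append(y[te] - model.fit(X[tr], y[tr]).predict(X[te]))
        rmse[name] = np.sqrt(np.mean(np.concatenate(residuals) ** 2))

    print(sorted(rmse, key=rmse.get))                            # model ranking on this dataset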

    Reference : Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.

  2. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column, ‘Replicate’, indicates the biological replicates; in the example, the month and year during which the replicate was performed are indicated. The second column, ‘Condition’, indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column, ‘Value’, contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates

    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  3. Data from: ISOCountryCodes

    • kaggle.com
    Updated Mar 23, 2020
    Cite
    VinayBhargav (2020). ISOCountryCodes [Dataset]. https://www.kaggle.com/datasets/vsesham/isocountrycodes
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 23, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    VinayBhargav
    Description

    Context

    This dataset contains standard country names, ISO alpha2 codes, ISO alpha3 codes and UN designated codes for countries.

    Content

    This dataset was prepared by scraping information from the https://www.nationsonline.org/oneworld/country_code_list.htm page.

    Acknowledgements

    Acknowledgements to https://www.nationsonline.org/oneworld/country_code_list.htm

    Inspiration

    If standard country names or ISO alpha codes are required for plotting geographical maps, you could use this dataset, for example with plotly's scatter_geo (https://plot.ly/python/scatter-plots-on-maps/); a minimal sketch follows.
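    A minimal sketch of that use, assuming hypothetical file and column names (the actual CSV headers may differ):

    import pandas as pd
    import plotly.express as px

    df = pd.read_csv("isocountrycodes.csv")   # hypothetical file name
    fig = px.scatter_geo(
        df,
        locations="alpha3",                   # assumed column holding ISO alpha-3 codes
        hover_name="country",                 # assumed column holding country names
    )
    fig.show()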

  4. M_DATA2

    • figshare.com
    csv
    Updated May 8, 2025
    Cite
    Wei Wang (2025). M_DATA2 [Dataset]. http://doi.org/10.6084/m9.figshare.28956344.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    May 8, 2025
    Dataset provided by
    figshare
    Authors
    Wei Wang
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This paper generated a challenging two-dimensional imbalanced dataset, M_DATA2, based on the normal distribution; its scatter plot is shown in Figure 1. The dataset has a sample size of 1000 and 2 feature variables, and the ratio between the numbers of samples in the two classes is 1:4. The minority-class samples are divided into four parts by the majority-class samples, with a certain mixed region, which makes it difficult for a Gaussian naive Bayes (GNB) classifier to classify effectively. A comparable dataset is sketched below.
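    The authors' exact generator is not given here, but a comparable imbalanced 2-D Gaussian dataset can be sketched as follows (assumed split: 800 majority vs. 200 minority samples in four overlapping pockets), together with a GNB baseline:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    X_maj = rng.normal(0.0, 1.5, size=(800, 2))                  # majority class
    centers = np.array([[2, 2], [-2, 2], [2, -2], [-2, -2]])     # four minority pockets
    X_min = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers])
    X = np.vstack([X_maj, X_min])
    y = np.array([0] * 800 + [1] * 200)                          # 1:4 class ratio

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = GaussianNB().fit(X_tr, y_tr)
    print("minority-class F1:", f1_score(y_te, clf.predict(X_te)))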

  5. Data_Sheet_5_“R” U ready?: a case study using R to analyze changes in gene...

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_5_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s005
    Explore at:
    Available download formats: docx
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R, which includes sepal and petal lengths of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
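    The case study itself is written for R; for comparison only, this hedged Python sketch performs the same style of Iris exploration (summary statistics, correlations, a scatter plot) using seaborn's bundled copy of the dataset:

    import seaborn as sns
    import matplotlib.pyplot as plt

    iris = sns.load_dataset("iris")          # sepal/petal measurements for three species
    print(iris.describe())                   # summary statistics
    print(iris.corr(numeric_only=True))      # pairwise correlations of the measurements
    sns.scatterplot(data=iris, x="sepal_length", y="petal_length", hue="species")
    plt.show()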

  6. Submarine Cable Features Dataset

    • kaggle.com
    Updated Dec 18, 2023
    Cite
    The Devastator (2023). Submarine Cable Features Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/submarine-cable-features-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    Submarine Cable Features Dataset

    Submarine Cable Features: Scale, Description, and Effective Dates

    By Homeland Infrastructure Foundation [source]

    About this dataset

    The Submarine Cables dataset provides a comprehensive collection of features related to submarine cables. It includes information such as the scale band, description, and effective dates of these cables. These data are specifically designed to support coastal planning at both regional and national scales.

    The dataset is derived from 2010 NOAA Electronic Navigational Charts (ENCs), along with 2009 NOAA Raster Navigational Charts (RNCs) which were updated in 2013 using the most recent RNCs as a reference point. The source material's scale varied significantly, resulting in discontinuities between multiple sources that were resolved with minimal spatial adjustments.

    Polyline features representing submarine cables were extracted from the original sources while excluding 'cable areas' noted within the data. The S-57 data model was modified for improved readability and performance purposes.

    Overall, this dataset provides valuable information regarding the occurrence and characteristics of submarine cables in and around U.S. navigable waters. It serves as an essential resource for coastal planning efforts at various geographic scales

    How to use the dataset

    Here's a guide on how to effectively utilize this dataset:

    1. Familiarize Yourself with the Columns

    The dataset contains multiple columns that provide important information:

    • scaleBand: This categorical column indicates the scale band of each submarine cable.
    • description: The text column provides a description of each submarine cable.
    • effectiveDate: Indicates the effective date of the information about each submarine cable.

    Understanding these columns will help you navigate and interpret the data effectively.

    2. Explore Scale Bands

    Start by analyzing the distribution of different scale bands in the dataset. The scale band categorizes submarine cables based on their size or capacity. Identifying patterns or trends within specific scale bands can provide valuable insights into how submarine cables are deployed.

    For example, you could analyze which scale bands are most commonly used in certain regions or countries, helping coastal planners understand infrastructure needs and potential connectivity gaps.

    3. Analyze Cable Descriptions

    The description column provides detailed information about each submarine cable's characteristics, purpose, or intended use. By examining these descriptions, you can uncover specific attributes related to each cable.

    This information can be crucial when evaluating potential impacts on marine ecosystems, identifying areas prone to damage or interference with other maritime activities, or understanding connectivity options for coastal regions.

    4. Consider Effective Dates

    Effective dates record when the information about a particular cable was collected or last updated.

    By considering effective dates over time, you can:
    • Monitor changes in infrastructure deployment strategies.
    • Identify areas where new cables have been installed.
    • Track outdated infrastructure that may need replacement or upgrades.

    5. Combine with Other Datasets

    To gain a comprehensive understanding and unlock deeper insights, consider integrating this dataset with other relevant datasets. For example:
    • Population density data can help identify areas in high need of improved connectivity.
    • Coastal environmental data can help assess potential ecological impacts of submarine cables.

    By merging datasets, you can explore relationships, draw correlations, and make more informed decisions based on the available information.

    6. Visualize the Data

    Create meaningful visualizations to better understand and communicate insights from the dataset. Utilize scatter plots, bar charts, heatmaps, or GIS maps. A quick pandas sketch follows.
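    As a starting point, a short exploratory sketch using the three documented columns (the file name is assumed):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("submarine_cables.csv", parse_dates=["effectiveDate"])  # assumed file name
    print(df["scaleBand"].value_counts())                            # distribution of scale bands
    print(df["effectiveDate"].dt.year.value_counts().sort_index())   # records per effective year
    df["scaleBand"].value_counts().plot(kind="bar", title="Cables by scale band")
    plt.show()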

    Research Ideas

    • Coastal Planning: The dataset can be used for coastal planning at both regional and national scales. By analyzing the submarine cable features, planners can assess the impact of these cables on coastal infrastructure development and design plans accordingly.
    • Communication Network Analysis: The dataset can be utilized to analyze the connectivity and coverage of submarine cable networks. This information is valuable for telecommunications companies and network providers to understand gaps in communication infras...
  7. visualizing_environmental_data

    • kaggle.com
    Updated Aug 2, 2024
    Cite
    Raghav Khandelwal (2024). visualizing_environmental_data [Dataset]. https://www.kaggle.com/datasets/raghavkhandelwal65/visualizing-environmental-data/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Raghav Khandelwal
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Title: Beijing PM2.5 Data (2010-2014)

    Description:

    This dataset contains air quality measurements from Beijing, China, spanning from January 1, 2010, to December 31, 2014. The dataset includes daily PM2.5 levels along with various meteorological data recorded at the US Embassy in Beijing.

    Dataset Columns:

    No: Row number
    year: Year of the data record
    month: Month of the data record
    day: Day of the data record
    hour: Hour of the data record
    season: Season of the data record (1 = spring, 2 = summer, 3 = fall, 4 = winter)
    PM2.5: PM2.5 concentration (µg/m³)
    DEWP: Dew point (°C)
    TEMP: Temperature (°C)
    PRES: Pressure (hPa)
    cbwd: Combined wind direction
    Iws: Cumulated wind speed (m/s)
    Is: Cumulated hours of snow
    Ir: Cumulated hours of rain

    Usage: This dataset is useful for analyzing trends in air quality over time, understanding the impact of meteorological conditions on air quality, and visualizing spatial and temporal variations in PM2.5 levels.

    Source: The data was obtained from the UCI Machine Learning Repository.

    Key Insights:

    Trends in daily PM2.5 levels over the years.
    Seasonal variations in air quality.
    Correlation between PM2.5 levels and meteorological conditions such as temperature and wind speed.

    Example Analysis:

    Line plots to show daily average PM2.5 levels over a year.
    Heatmaps to visualize PM2.5 levels across different locations and months.
    Scatter plots to show the correlation between PM2.5 levels and weather conditions (e.g., temperature, humidity).

    Libraries Used:

    pandas, matplotlib, seaborn
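    A minimal sketch of such an analysis, assuming the column names listed above and a hypothetical file name:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("beijing_pm25.csv")                      # hypothetical file name
    df["date"] = pd.to_datetime(df[["year", "month", "day"]])
    daily = df.groupby("date")["PM2.5"].mean()                # hourly readings -> daily means

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    daily.plot(ax=ax1, title="Daily mean PM2.5 (µg/m³)")
    ax2.scatter(df["TEMP"], df["PM2.5"], s=2, alpha=0.3)      # PM2.5 vs. temperature
    ax2.set(xlabel="Temperature (°C)", ylabel="PM2.5 (µg/m³)")
    plt.show()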

  8. Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation...

    • b2find.eudat.eu
    Updated Jul 31, 2025
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Dataset updated
    Jul 31, 2025
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2), conducted for the paper "What do anomaly scores actually mean? Key characteristics of algorithms' dynamics beyond accuracy" by F. Iglesias, H. O. Marques, A. Zimek, T. Zseby.

    Context and methodology

    Anomaly detection is intrinsic to a large number of data analysis applications today. Most of the algorithms used assign an outlierness score to each instance prior to establishing anomalies in a binary form. The experiments in this repository study how different algorithms generate different dynamics in the outlierness scores and react in very different ways to possible model perturbations that affect data. The study elaborated in the referred paper presents new indices and coefficients to assess the dynamics, and explores the responses of the algorithms as a function of variations in these indices, revealing key aspects of the interdependence between algorithms, data geometries and the ability to discriminate anomalies. Therefore, this repository reproduces the conducted experiments, which study eight algorithms (ABOD, HBOS, iForest, K-NN, LOF, OCSVM, SDO and GLOSH) submitted to seven perturbations (related to cardinality, dimensionality, outlier proportion, inlier-outlier density ratio, density layers, clusters and local outliers), and collects behavioural profiles with eleven measurements (Adjusted Average Precision, ROC-AUC, Perini's Confidence [1], Perini's Stability [2], S-curves, Discriminant Power, Robust Coefficients of Variation for Inliers and Outliers, Coherence, Bias and Robustness) under two types of normalization: linear and Gaussian, the latter aiming to standardize the outlierness scores issued by different algorithms [3]. This repository is framed within research on the following domains: algorithm evaluation, outlier detection, anomaly detection, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    References

    [1] Perini, L., Vercruyssen, V., Davis, J.: Quantifying the confidence of anomaly detectors in their example-wise predictions. In: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer Verlag (2020).
    [2] Perini, L., Galvin, C., Vercruyssen, V.: A Ranking Stability Measure for Quantifying the Robustness of Anomaly Detection Methods. In: 2nd Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning @ ECML/PKDD (2020).
    [3] Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pp. 13-24 (2011).

    Technical details

    Experiments were tested with Python 3.9.6. The provided scripts generate all synthetic data and results; we keep them in the repo for the sake of comparability and replicability ("outputs.zip" file). The file and folder structure is as follows:

    • "compare_scores_group.py": extracts the new dynamic indices proposed in the paper.
    • "generate_data.py": generates the datasets used for evaluation.
    • "latex_table.py": shows results in a LaTeX table format.
    • "merge_indices.py": merges accuracy and dynamic indices into the same table-structured summary.
    • "metric_corr.py": calculates correlation estimates between indices.
    • "outdet.py": runs outlier detection with different algorithms on diverse datasets.
    • "perini_tests.py": runs Perini's confidence and stability on all datasets and algorithms' performances.
    • "scatterplots.py": generates scatter plots comparing accuracy and dynamic performances.
    • "README.md": explanations and step-by-step instructions for replication.
    • "requirements.txt": required Python libraries and versions.
    • "outputs.zip": all result tables, plots and synthetic data generated with the scripts.
    • [data/real_data]: CSV versions of the Wilt, Shuttle, Waveform and Cardiotocography datasets (inherited and adapted from the LMU repository).

    License

    The CC-BY license applies to all data generated with the "generate_data.py" script. All distributed code is under the GNU GPL license. For the "ExCeeD.py" and "stability.py" scripts, please consult and refer to the original sources provided above.

  9. OsterlundJBC_Figure 6 & SFig6

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Osterlund, Elizabeth (2023). OsterlundJBC_Figure 6 & SFig6 [Dataset]. http://doi.org/10.5683/SP3/CXCMAG
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Osterlund, Elizabeth
    Description

    Figure 6 data includes: Panel A, data from multiple biological replicates in BMK-DKO cells expressing either mCerulean3-BCL-XL, mCerulean3-BCL-XL-ActA or mCerulean3-BCL-XL-Cb5. The '.pzfx' file can be opened in GraphPad Prism; alternatively, see "Fig6A_GetAverageForHeatmaps" for the raw data used to generate the heatmaps. The same data are displayed in a scatter plot (included as well) but were not included in the paper. SFigure 6 data: images and example data from a single replicate that were used to generate Figure S6. This figure demonstrates the 3-channel colocalization method. The median or mode colocalization values shown in Figures 6, 7, and SFig8A were determined as shown in SFig6.

  10. Transcription profiling of human lymphoblastoid cell response to ultraviolet...

    • omicsdi.org
    xml
    Cite
    Wan-Jen Hong, Transcription profiling of human lymphoblastoid cell response to ultraviolet and ionising radiation [Dataset]. https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-1977
    Explore at:
    Available download formats: xml
    Authors
    Wan-Jen Hong
    Variables measured
    Transcriptomics,Multiomics
    Description

    Peripheral blood lymphocytes from a total of 15 healthy individuals without history of cancer (NoCa) were immortalized with Epstein-Barr virus. Cells were exposed to mock treatment (Mock), ultra-violet radiation (UV), or ionizing radiation (IR). For UV radiation treatment, cells were exposed to 10 J/m^2 and harvested for RNA 24 hours later. For IR treatment, cells were exposed to 5 Gy of IR and harvested for RNA 4 hours later. For example, NoCa1-Mock refers to cells from healthy patient 1 exposed to mock treatment. The published manuscript (NAR 32:4786, 2004) can be found at http://nar.oupjournals.org/cgi/content/abstract/32/16/4786. Data were analyzed with Affymetrix MAS version 4.0. Normalization -- A reference data set was generated by averaging the expression of each gene over all data sets. The data for each hybridization were compared with the reference data set in a cube root scatter plot. A linear least-squares fit to the cube root scatter plot was then used to normalize each hybridization.
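    A hedged numpy sketch of that normalization as described (the exact MAS 4.0 details are not given here, so treat this as one plausible reading): average each gene across hybridizations to form the reference, fit a linear least-squares line in cube-root space, and rescale each hybridization by the fit.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.uniform(1, 1000, size=(5000, 12))      # toy genes x hybridizations matrix
    reference = data.mean(axis=1)                     # reference profile: per-gene average

    cr_ref = np.cbrt(reference)
    for j in range(data.shape[1]):
        cr = np.cbrt(data[:, j])
        slope, intercept = np.polyfit(cr_ref, cr, 1)  # linear least-squares fit in cube-root space
        data[:, j] = ((cr - intercept) / slope) ** 3  # map back to the reference scale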

  11. Hydrographic data in the Faroe-Shetland Channel collected during the Slope...

    • data-search.nerc.ac.uk
    • bodc.ac.uk
    Updated Sep 15, 2005
    Cite
    (2005). Hydrographic data in the Faroe-Shetland Channel collected during the Slope Mixing Experiment (September 2005). [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=Moored%20instrument%20depth
    Explore at:
    Dataset updated
    Sep 15, 2005
    Description

    This dataset comprises hydrographic data from conductivity and temperature sensors deployed at fixed intervals on moorings within the water column or close to the sea bed on benthic frames. The measurements were collected at five sites within the Faroe-Shetland Channel during the FS Poseidon cruise PO328 between 07 and 23 September 2005. The data have been processed, quality controlled and made available by the British Oceanographic Data Centre (BODC). The data were collected as part of the Slope Mixing Experiment, a Proudman Oceanographic Laboratory (POL) core Natural Environment Research Council (NERC) funded project, which aimed to estimate slope mixing and its effects on waters in the overturning circulation. Detailed in situ measurements of mixing in the water column were to be combined with fine resolution 3-D and process models. The experiment was led by POL, in collaboration with the School of Ocean Sciences, University of Wales, Bangor; the Scottish Association for Marine Science (SAMS); the University of Highlands and Islands and the Institute of Marine Studies (IMS) at the University of Plymouth. The Slope Mixing Experiment dataset also includes conductivity-temperature-depth (CTD) profiles, moored Acoustic Doppler Current Profilers (ADCP), vessel mounted ADCP sensors as well as 3-D and process models. These data are not available from BODC.

  12. Replication Data for: A Rydberg atom based system for benchmarking mmWave...

    • search.dataone.org
    Updated Sep 24, 2024
    Cite
    Borówka, Sebastian; Krokosz, Wiktor; Mazelanik, Mateusz; Wasilewski, Wojciech; Parniak, Michał (2024). Replication Data for: A Rydberg atom based system for benchmarking mmWave automotive radar chips [Dataset]. http://doi.org/10.7910/DVN/OYUNJ1
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Borówka, Sebastian; Krokosz, Wiktor; Mazelanik, Mateusz; Wasilewski, Wojciech; Parniak, Michał
    Description

    Simulation Data

    The waveplate.hdf5 file stores the results of the FDTD simulation that are visualized in Fig. 3 b)-d). The simulation was performed using the Tidy3D Python library and also utilizes its methods for data visualization. The following snippet can be used to visualize the data:

    import tidy3d as td
    import matplotlib.pyplot as plt

    sim_data: td.SimulationData = td.SimulationData.from_file("waveplate.hdf5")
    fig, axs = plt.subplots(1, 2, tight_layout=True, figsize=(12, 5))
    for fn, ax in zip(("Ex", "Ey"), axs):
        sim_data.plot_field("field_xz", field_name=fn, val="abs^2", ax=ax).set_aspect(1 / 10)
        ax.set_xlabel("x [$\mu$m]")
        ax.set_ylabel("z [$\mu$m]")
    fig.show()

    Measurement Data

    Signal data used for plotting Fig. 4-6. The data is stored in NetCDF, a self-describing format that is easy to manipulate using the Xarray Python library, specifically by calling xarray.open_dataset(). Three datasets are provided and structured as follows:

    The electric_fields.nc dataset contains data displayed in Fig. 4. It has 3 data variables, corresponding to the signals themselves, as well as estimated Rabi frequencies and electric fields. The freq dimension is the x-axis and contains coordinates for the probe field detuning in MHz. The n dimension labels different configurations of the applied electric field, with the 0th one having no EHF field.

    The detune.nc dataset contains data displayed in Fig. 6. It has 2 data variables, corresponding to the signals themselves, as well as estimated peak separations, multiplied by the coupling factor. The freq dimension is the same, while the detune dimension labels different EHF field detunings, from -100 to 100 MHz with a step of 10.

    The waveplates.nc dataset contains data displayed in Fig. 5. It contains estimated Rabi frequencies calculated for different waveplate positions. The angles are stored in radians. There is a quarter- and a half-waveplate to choose from.

    Usage examples

    Opening the datasets (numpy is imported here as well, since the Fig. 6 inset uses np.sqrt):

    import numpy as np
    import matplotlib.pyplot as plt
    import xarray as xr

    electric_fields_ds = xr.open_dataset("data/electric_fields.nc")
    detuned_ds = xr.open_dataset("data/detune.nc")
    waveplates_ds = xr.open_dataset("data/waveplates.nc")
    sigmas_da = xr.open_dataarray("data/sigmas.nc")
    peak_heights_da = xr.open_dataarray("data/peak_heights.nc")

    Plotting the Fig. 4 signals and printing parameters:

    fig, ax = plt.subplots()
    electric_fields_ds["signals"].plot.line(x="freq", hue="n", ax=ax)
    print(f"Rabi frequencies [Hz]: {electric_fields_ds['rabi_freqs'].values}")
    print(f"Electric fields [V/m]: {electric_fields_ds['electric_fields'].values}")
    fig.show()

    Plotting the Fig. 5 data:

    (waveplates_ds["rabi_freqs"] ** 2).plot.scatter(x="angle", col="waveplate")

    Plotting the Fig. 6 signals for chosen detunings:

    fig, ax = plt.subplots()
    detuned_ds["signals"].sel(detune=[-100, -70, -40, 40, 70, 100]).plot.line(x="freq", hue="detune", ax=ax)
    fig.show()

    Plotting the Fig. 6 inset plot:

    fig, ax = plt.subplots()
    detuned_ds["separations"].plot.scatter(x="detune", ax=ax)
    ax.plot(
        detuned_ds.detune,
        np.sqrt(detuned_ds.detune**2 + detuned_ds["separations"].sel(detune=0) ** 2),
    )
    fig.show()

    Plotting the Fig. 7 calculated peak widths:

    sigmas_da.plot.scatter()

    Plotting the Fig. 8 calculated detuned smaller peak heights:

    peak_heights_da.plot.scatter()

  13. Chart Viewer

    • city-of-lawrenceville-arcgis-hub-lville.hub.arcgis.com
    Updated Sep 22, 2021
    Cite
    esri_en (2021). Chart Viewer [Dataset]. https://city-of-lawrenceville-arcgis-hub-lville.hub.arcgis.com/items/be4582b38d764de0a970b986c824acde
    Explore at:
    Dataset updated
    Sep 22, 2021
    Dataset authored and provided by
    esri_en
    Description

    Use the Chart Viewer template to display bar charts, line charts, pie charts, histograms, and scatterplots to complement a map. Include multiple charts to view with a map or side by side with other charts for comparison. Up to three charts can be viewed side by side or stacked, but you can access and view all the charts that are authored in the map.

    Examples:
    • Present a bar chart representing average property value by county for a given area.
    • Compare charts based on multiple population statistics in your dataset.
    • Display an interactive scatterplot based on two values in your dataset along with an essential set of map exploration tools.

    Data requirements: The Chart Viewer template requires a map with at least one chart configured.

    Key app capabilities:
    • Multiple layout options: choose Stack to display charts stacked with the map, or choose Side by side to display charts side by side with the map.
    • Manage charts: reorder, rename, or turn charts on and off in the app.
    • Multiselect charts: compare two charts in the panel at the same time.
    • Bookmarks: allow users to zoom and pan to a collection of preset extents that are saved in the map.
    • Home, Zoom controls, Legend, Layer List, Search.

    Supportability: This web app is designed responsively to be used in browsers on desktops, mobile phones, and tablets. We are committed to ongoing efforts towards making our apps as accessible as possible. Please feel free to leave a comment on how we can improve the accessibility of our apps for those who use assistive technologies.

  14. Public files of religious and spiritual texts

    • kaggle.com
    Updated Apr 9, 2018
    Cite
    Dan Pfeiffer (2018). Public files of religious and spiritual texts [Dataset]. https://www.kaggle.com/metron/public-files-of-religious-and-spiritual-texts/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 9, 2018
    Dataset provided by
    Kaggle
    Authors
    Dan Pfeiffer
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Metron is interested in taking a data science approach to gleaning deeper insights into matters of spirituality, religion and extranormal experience. Data scientists at all levels of experience are encouraged to participate in this analysis.

    This data set contains several public files of religious and spiritual texts. Also included is a “wildcard” file on the subject of machine super intelligence. This file is licensed under a Creative Commons Attribution-Share Alike 2.5 Switzerland License. More information can be found at: https://creativecommons.org/licenses/by-sa/2.5/ch/deed.en

    Metron is interested in various text analysis techniques that can further an understanding of concepts and help develop a body of knowledge on the topic of comparative religion and spirituality. Other interesting observations that fuel further lines of inquiry and questions are also highly desirable.

    Suggested Analysis Types:

    Scatter plot analysis - Metron would like to see a scatter plot of frequently recurring words in the texts. If possible, can this be taken to the level of deriving conceptual correlations? For example, we may be able to state that “unconditional love” was the primary concept conveyed by the dataset, followed by “communion”, etc.

    Word cloud analysis - create a scatter plot in which horizontal position indicates the most popular words and vertical position indicates some other (to be defined) measure of popularity.

    Topic modeling - identify common topics found in the set of documents. Drill down to most common words per topic.

    Word & document frequencies (tf-idf) - word frequency measurement and comparison across all documents. Charts of the highest tf-idf words in each text within the corpus.

    Word dendrogram cluster - graphical representation of hierarchical word clusters.

    Pairwise correlation - Words most correlated with other words in chart format.

    Sentiment analysis - the most common words in texts associated with sentiments. Example: sentiment = creation. Associated words: God, genesis, Ein Sof, Allah, primordial, etc.

    Word Network using tf-idf as a metric to find characteristic words for each description field rather than using counts of words.

    Note: the above techniques can be found in “Data Science from Scratch - First Principles with Python” and “Text Mining with R - A Tidy Approach” both books from O’Reilly publishing.
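    As an illustration of the tf-idf suggestion above, a minimal scikit-learn sketch (the directory layout and file names are hypothetical):

    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = {p.stem: p.read_text(errors="ignore") for p in Path("texts").glob("*.txt")}
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf = vec.fit_transform(texts.values())

    terms = vec.get_feature_names_out()
    for name, row in zip(texts, tfidf):
        top = row.toarray().ravel().argsort()[-5:][::-1]   # five highest tf-idf terms
        print(name, [terms[i] for i in top])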

    The files are:

    • St. Augustine, City of God
    • 108 Upanishads
    • 7 Tablets of Creation, vols. 1 and 2
    • Advaita Vedanta
    • Aryan Sun Myths
    • Autobiography of a Yogi
    • “The Book of Illumination”, attributed to Rabbi Nehunia ben haKana
    • Bhagavad Gita
    • Bible (KJV)
    • Collected Fruits of Occult Teaching, by A.P. Sinnett
    • Epistle to the Son of the Wolf
    • Hidden Nature: The Startling Insights of Viktor Schauberger
    • Hildegard of Bingen: Selected Writings
    • History of Zoroastrianism, by Maneckji Nusservanji
    • The Philosophy of the Kaivalya Upanishad
    • Kitab-i-Iqan (Book of Certitude)
    • Knowledge of the Higher Worlds, by Rudolf Steiner
    • Kularnava Tantra
    • The Life of Buddha
    • Machine Super Intelligence
    • The Planet Mars and its Inhabitants, by Eros Urides (A Martian)
    • The Nature of the Gods, by M. Tullius Cicero (“nature-gods”)
    • Occult Theocrasy, by Lady Queenborough
    • Urantia Book
    • The Book of the People: POPUL VUH
    • The Dhammapada
    • Vedic Hymns
    • Vedic Hymns, Part II
    • Secret Instructions of the Society of Jesus
    • The Chaldean Account of Genesis
    • The Kitab-i-Aqdas
    • The Path of Light
    • The Buddha's Way of Virtue
    • The Yoga Sutras of Patanjali: The Book of the Spiritual Man, by Patañjali
    • The Vedanta-Sutras with the Commentary by Ramanuja
    • The Kybalion
    • Buddhism, in Its Connexion with Brahmanism and Hinduism, and in Its Contrast

  15. Data from: CG4928 knockdown affects the expression of membrane-bound,...

    • researchdata.se
    Updated Jun 25, 2024
    Cite
    Mikaela M. Ceder (2024). CG4928 knockdown affects the expression of membrane-bound, biosynthesis and signaling genes : RNA sequencing data from CG4928 knockdown flies and controls [Dataset]. http://doi.org/10.57804/8r49-vw64
    Explore at:
    Available download formats: (5161339)
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Uppsala University
    Authors
    Mikaela M. Ceder
    Description

    Malpighian tubules from 3-day-old females (da-Gal4 > CG4928 RNAi, da-Gal4 > w1118, and w1118 > CG4928 RNAi) were dissected, pooled, and frozen at −80°C until analysis. A total of three replicates per genotype were collected, each replicate containing 60 Malpighian tubules. Analysis was performed using inductively coupled plasma sector field mass spectrometry (ICP-SFMS) to measure a total of 69 elements (ions) in each sample. A control sample was used to measure background in the collecting media. The data were normalized within groups using the geometric mean before means (±SD) were calculated, and outliers were removed. GraphPad Prism, version 5, was used to perform one-way ANOVA with unpaired t-tests and Bonferroni's multiple-comparison correction (adjusted p-values: ∗p < 0.0492, ∗∗p < 0.0099, ∗∗∗p < 0.0001). The observed data points for boron, cadmium, cobalt, and potassium, the only elements found to be altered, are plotted in a scatter plot, where the line displays the mean value.

    The dataset was originally published in DiVA and moved to SND in 2024.

  16. Data from: Testing Scratch Programs Automatically

    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Anonymous; Anonymous (2020). Testing Scratch Programs Automatically [Dataset]. http://doi.org/10.5281/zenodo.2567778
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    Description

    # Replication Package

    This is the replication package for our work on
    "Testing Scratch Programs Automatically".

    The package contains our raw results and scripts for generating
    the plots of the paper from the raw data.

    ## Abstract

    Block-based programming environments like Scratch foster engagement
    with computer programming and are used by millions of young learners.
    Scratch allows learners to quickly create entertaining programs and
    games, while eliminating syntactical program errors that could
    interfere with progress.

    However, functional programming errors may still lead to incorrect
    programs, and learners and their teachers need to identify and
    understand these errors. This is currently an entirely manual process.

    In this paper, we introduce a formal testing framework that describes
    the problem of Scratch testing in detail. We instantiate this formal
    framework with the Whisker tool, which provides automated and
    property-based testing functionality for Scratch programs.

    Empirical evaluation on real student and teacher programs
    demonstrates that Whisker can successfully test Scratch programs,
    and automatically achieves an average of 95.25% code coverage.

    Although well-known testing problems such as test flakiness also
    exist in the scenario of Scratch testing, we show that automated and
    property-based testing can accurately reproduce and replace the
    manually and laboriously produced grading efforts of a teacher, and
    opens up new possibilities to support learners of programming in
    their struggles.

    ## Contents

    The replication package is structured into two main directories:

    * 'data/':
    raw data and scripts that have been used for collecting the data

    * 'scripts/':
    scripts for generating the plots that are presented in the paper

    ### RAW data

    * 'data/teacher-data/'
    data from the scratch workshop: sample solution and scores for student solutions

    * 'data/code-club-stats/'
    block counts and input methods of the used Code Club projects

    * 'data/coverage/'
    code for measuring the coverage of automated input generation

    * 'data/coverage-results/'
    coverage measurements on the Code Club projects

    * 'data/test/'
    test suites for the projects of the Scratch workshop

    * 'data/test-results/'
    test results from the test suites in 'data/test/'

    * 'data/time/'
    Scratch programs for time measurement (10x the sample solution from 'data/teacher-data/')

    * 'data/time-results/'
    time measurements on the projects in 'data/time/'

    ## Reproducing the Plots

    ### Prerequisites

    We describe the process based on:

    * the R statistics package in version 3.5
    * a Unix environment (Linux or macOS)

    Following R packages are required:

    * ggplot2
    * dplyr
    * viridis

    The packages can be installed with the R command "install.packages".

    ### Generating the Plots

    Coverage (Figure 10)

    ./scripts/coverage.R

    The result is a set of "coverage-*.pdf" files

    Inconsistency (Figure 9)

    ./scripts/consistency.R

    The result is a set of "consistency-*.pdf" files

    Scatter Plots (Figure 8, Figure 11)

    ./scripts/scatter.R

    The result is a set of "scatter-*.pdf" files

  17. Occurrence dataset for the subspecies of the American badger (Taxidea taxus...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Dec 7, 2024
    Cite
    J. Palacio-Núñez; J. M. Martínez-Calderas; D. W. Rössel-Ramírez; J. F. Martínez-Montoya; F. Clemente-Sánchez; G. Olmos-Oropeza (2024). Occurrence dataset for the subspecies of the American badger (Taxidea taxus berlandieri) in the north-central region of Mexico [Dataset]. http://doi.org/10.5281/zenodo.7901045
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 7, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    J. Palacio-Núñez; J. M. Martínez-Calderas; D. W. Rössel-Ramírez; J. F. Martínez-Montoya; F. Clemente-Sánchez; G. Olmos-Oropeza
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Mexico, United States
    Description

    The subspecies of American badger (Taxidea taxus berlandieri Baird, 1858), also called tlalcoyote (Figure 1), is distributed in north-central Mexico. However, its occurrence records are scarce, and the few that exist are uncertain due to incorrect georeferencing or identification of the taxonomic unit. In view of this, we designed a spatial sampling scheme covering parts of the states of Coahuila de Zaragoza, Durango, Nuevo León, San Luis Potosí and Zacatecas. In this north-central portion of Mexico, we generated a grid of squares measuring 5 × 5 km over the entire study area using QGIS® 3.10 software. Subsequently, we excluded squares with urban settlements, agricultural land, or water bodies in more than 30% of their extent; we also discarded squares located at an altitude over 2,250 meters above sea level. To perform this filtering, we used both the land use and vegetation chart of INEGI [Instituto Nacional de Estadística, Geografía e Informática] (2018) and the Digital Elevation Model (DEM) downloaded from the USGS [United States Geological Survey] page (2019). As a result, we obtained 3,471 squares separated by at least 5 km. Then, through simple random sampling, 177 (≈5%) squares were selected, where we generated centroids to be used as sampling sites.

    During fieldwork between 2009 and 2015, we traced a 10 × 100 m transect at each of these 177 sites, where we searched for T. t. berlandieri signs (i.e., burrows and scratching posts). Their burrows and scratching posts are easily observed and quantified, and there is no chance of mistaking them for the burrows of other species (Long 1973; Merlin 1999). We also recorded possible sightings, as in other studies (e.g., Merlin 1999; Elbroch 2003). As a result, we found signs of occurrence at only 33 sites.

    Figure 1. Individual tlalcoyote (Taxidea taxus berlandieri). Photo obtained from Naturalista (2023), uploaded by David Molina©. All rights reserved (CC BY-NC-ND).

    To increase the number of records, we included occurrence data from GBIF [Global Biodiversity Information Facility portal] (2022). We downloaded only records that included coordinates and whose basis of record was "preserved specimen", because such records are correctly identified as specimens from biological collections (Maldonado et al. 2015). In addition, we selected records for Mexico only. Subsequently, we filtered the downloaded database, discarding records that were incorrectly georeferenced, had atypical or duplicate coordinates, or had low geospatial accuracy (e.g., fewer than three decimals of precision); a sketch of this kind of filtering follows.
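    For illustration only (this is not the authors' code; the GBIF column names and file name are assumed), the coordinate-precision filtering described above could look like:

    import pandas as pd

    def decimal_places(x: float) -> int:
        s = str(x)
        return len(s.split(".")[1]) if "." in s else 0

    df = pd.read_csv("gbif_taxidea.csv")                     # hypothetical GBIF export
    df = df.drop_duplicates(subset=["decimalLatitude", "decimalLongitude"])
    precise = (df["decimalLatitude"].apply(decimal_places) >= 3) & \
              (df["decimalLongitude"].apply(decimal_places) >= 3)
    df = df[precise]                                         # keep well-georeferenced records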

    We loaded the remaining data into the QGIS® software and performed spatial filtering, excluding data that fell outside the study area, were located in unlikely areas (e.g., human settlements, bodies of water, agricultural areas), or lay within 5 km of the records obtained in the field. This left a total of 10 records from the GBIF portal. Finally, we loaded the raster layers of elevation (Elev; INEGI 2007), the normalized difference vegetation index (NDVI; USGS 2019) and the slope of the terrain into the software to extract pixel values at the GBIF records and those obtained in the field. With this, we generated a new global dataset, to which we applied environmental filtering to find environmental outliers. We plotted the normality distribution of the data for each variable and the dispersion of the data among the variables. In this filtering, we retained all records. Figure 2 shows the normality distribution of the records as a function of Elev. Figure 3 shows the dispersion of the data between Elev and NDVI.

    Figure 2. Normality distribution of T. t. berlandieri occurrence records as a function of the elevation variable (Elev).

    Figure 3. Scatter plot of T. t. berlandieri occurrence records as a function of elevation (Elev) and normalized difference vegetation index (NDVI).

    For the north-central region of Mexico, we present the global database (i.e., Tatabe_joint.csv), as well as the database that contains only the field evidence records (i.e., Tatabe_first_order.csv) and another one with the filtered GBIF records (i.e., Tatabe_GBIF.csv).

  18. File S1 - A 16-Gene Signature Distinguishes Anaplastic Astrocytoma from...

    • plos.figshare.com
    • figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Soumya Alige Mahabala Rao; Sujaya Srinivasan; Irene Rosita Pia Patric; Alangar Sathyaranjandas Hegde; Bangalore Ashwathnarayanara Chandramouli; Arivazhagan Arimappamagan; Vani Santosh; Paturu Kondaiah; Manchanahalli R. Sathyanarayana Rao; Kumaravel Somasundaram (2023). File S1 - A 16-Gene Signature Distinguishes Anaplastic Astrocytoma from Glioblastoma [Dataset]. http://doi.org/10.1371/journal.pone.0085200.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Soumya Alige Mahabala Rao; Sujaya Srinivasan; Irene Rosita Pia Patric; Alangar Sathyaranjandas Hegde; Bangalore Ashwathnarayanara Chandramouli; Arivazhagan Arimappamagan; Vani Santosh; Paturu Kondaiah; Manchanahalli R. Sathyanarayana Rao; Kumaravel Somasundaram
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Supplementary Text: Supplementary Methods and Supplementary Results.

    Supplementary Tables:
    Table S1. Primers used for RT-qPCR.
    Table S2. List of genes selected for expression analysis by PCR array.
    Table S3. Number of AA and GBM patient samples in the training set, the test set and three independent cohorts of patient samples (TCGA, GSE1993 and GSE4422).
    Table S4. Expression of 16 genes in AA (n = 20) and GBM (n = 54) samples of the test set.
    Table S5. Expression of 16 genes in grade III glioma (n = 27) and GBM (n = 152) samples of the TCGA dataset.
    Table S6. Expression of 16 genes in AA (n = 19) and GBM (n = 39) samples of the GSE1993 dataset.
    Table S7. Expression of 16 genes in AA (n = 5) and GBM (n = 71) samples of the GSE4422 dataset.

    Supplementary Figures:
    Figure S1. Heat map of one-way hierarchical clustering of the 16 PAM-identified genes in AA (n = 20) and GBM (n = 54) patient samples of the test set. A dual-color code was used, with red and green indicating up- and down-regulation, respectively.
    Figure S2. Heat map of one-way hierarchical clustering of the 16 PAM-identified genes in grade III glioma (n = 27) and GBM (n = 152) patient samples of the TCGA dataset, color-coded as in Figure S1.
    Figure S3. A. Heat map of one-way hierarchical clustering of the 16 PAM-identified genes in AA (n = 19) and GBM (n = 39) patient samples of the GSE1993 dataset, color-coded as in Figure S1. B. PCA of the AA and GBM samples of the GSE1993 dataset using the expression values of the 16 PAM-identified genes; each sample is plotted on the first two principal components and colored as indicated. C. Detailed 10-fold cross-validation probabilities for the samples of the GSE1993 dataset based on the expression values of the 16 genes. For each sample, its probability of being AA (orange) or GBM (blue) is shown, and the PAM program assigned the sample to whichever grade had the higher probability; the original histological grade is shown at the top.
    Figure S4. A. Heat map of one-way hierarchical clustering of the 16 PAM-identified genes in AA (n = 5) and GBM (n = 71) patient samples of the GSE4422 dataset, color-coded as in Figure S1. B. PCA of the AA and GBM samples of the GSE4422 dataset using the expression values of the 16 PAM-identified genes, plotted as in Figure S3B. C. Detailed 10-fold cross-validation probabilities for the samples of the GSE4422 dataset, shown as in Figure S3C.
    Figure S5. A. Detailed 10-fold cross-validation probabilities for the samples of the GSE4271 dataset based on the expression values of the 16 genes, shown as in Figure S3C. B. Mean age at diagnosis (with standard deviation) for authentic AAs (n = 12), authentic GBMs (n = 68), discordant AAs (n = 10) and discordant GBMs (n = 8) of the GSE4271 dataset. C. Kaplan-Meier survival analysis of the samples of the GSE4271 dataset.
    Figure S6. PAM analysis of the Petalidis gene signature in the TCGA dataset. A. Classification error for the Petalidis gene set in the TCGA dataset; the threshold value of 0.0 corresponded to all 54 genes, which classified AA (n = 27) and GBM (n = 604) samples with a classification error of 0.000. B. Detailed 10-fold cross-validation probabilities for the samples of the TCGA dataset based on the Petalidis gene set; for each sample, its probability of being AA (green) or GBM (red) is shown, the PAM program assigned the sample to whichever grade had the higher probability, and the original histological grade is shown at the top.
    Figure S7. PAM analysis of the Phillips gene signature in our dataset. A. Classification error for the Phillips gene set in our dataset; the threshold value of 0.0 corresponded to all 5 genes, which classified AA (n = 50) and GBM (n = 132) samples with a classification error of 0.159. B. Detailed 10-fold cross-validation probabilities for the samples of our dataset based on the Phillips gene set, shown as in Figure S3C.
    Figure S8. PAM analysis of the Phillips gene signature in the Phillips dataset. A. Classification error for the Phillips gene set in the Phillips dataset; the threshold value of 0.0 corresponded to all 8 genes, which classified AA (n = 24) and GBM (n = 76) samples with a classification error of 0.169. B. Detailed 10-fold cross-validation probabilities for the samples of the Phillips dataset based on the Phillips gene set, shown as in Figure S3C.
    Figure S9. PAM analysis of the Phillips gene signature in the GSE4422 dataset. A. Classification error for the Phillips gene set in the GSE4422 dataset; the threshold value of 0.0 corresponded to all 8 genes, which classified AA (n = 5) and GBM (n = 76) samples with a classification error of 0.065. B. Detailed 10-fold cross-validation probabilities for the samples of the GSE4422 dataset based on the Phillips gene set, shown as in Figure S3C.
    Figure S10. PAM analysis of the Phillips gene signature in the TCGA dataset. A. Classification error for the Phillips gene set in the TCGA dataset; the threshold value of 0.0 corresponded to all 8 genes, which classified AA (n = 27) and GBM (n = 604) samples with a classification error of 0.008. B. Detailed 10-fold cross-validation probabilities for the samples of the TCGA dataset based on the Phillips gene set, shown as in Figure S3C.
    Figure S11. Network obtained by using the 16 genes of the classification signature as input to the BisoGenet plugin in Cytoscape. The generated network had 252 nodes (genes) and 1498 edges (interactions between genes/proteins) and consisted of the seed proteins together with their immediate interacting neighbors. Nodes corresponding to the input genes are drawn larger than the rest of the interacting partners; the color code is as indicated in the scale. (PDF)
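
    Since PAM (prediction analysis of microarrays) is a nearest-shrunken-centroid classifier, the style of analysis in the panels above can be approximated with standard tools. The sketch below is illustrative only, not the authors' code; the expression matrix and grade labels are random placeholders standing in for a real cohort's 16-gene expression values.

        # Illustrative sketch, not the authors' code.
        import numpy as np
        import matplotlib.pyplot as plt
        from sklearn.decomposition import PCA
        from sklearn.neighbors import NearestCentroid
        from sklearn.model_selection import cross_val_predict

        rng = np.random.default_rng(0)
        expr = rng.normal(size=(58, 16))               # placeholder: 58 samples x 16 genes
        labels = np.array(["AA"] * 19 + ["GBM"] * 39)  # placeholder histological grades

        # PCA scatter plot on the first two principal components (cf. Figures S3B/S4B).
        pcs = PCA(n_components=2).fit_transform(expr)
        for grade, color in [("AA", "orange"), ("GBM", "blue")]:
            sel = labels == grade
            plt.scatter(pcs[sel, 0], pcs[sel, 1], c=color, label=grade)
        plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend(); plt.show()

        # 10-fold cross-validated class assignments (cf. panels C above). A PAM
        # threshold of 0.0 keeps all genes, i.e. no centroid shrinkage, which is
        # the NearestCentroid default.
        predicted = cross_val_predict(NearestCentroid(), expr, labels, cv=10)
        print("10-fold CV classification error:", np.mean(predicted != labels))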

  19. m

    Calculations dataset of diatomic systems based on van der Waals density functional method

    • data.mendeley.com
    Updated Feb 12, 2021
    Cite
    Kiyou Shibata (2021). Calculations dataset of diatomic systems based on van der Waals density functional method [Dataset]. http://doi.org/10.17632/yz5rrmvrgd.1
    Explore at:
    Dataset updated
    Feb 12, 2021
    Authors
    Kiyou Shibata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides results obtained by first-principles calculations on diatomic systems and isolated systems based on SCAN+rVV10. All diatomic systems containing atomic species from H (Z=1) to Ra (Z=88) are considered. Calculations for the isolated systems are included alongside the diatomic systems so that binding energies can be evaluated.
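
    (For a diatomic system AB, the binding energy is presumably evaluated in the usual way from the two kinds of runs, E_b(AB) = E(AB) - E(A) - E(B), where E(A) and E(B) are the total energies of the isolated atoms; the description does not state the convention, so the sign and form here are an assumption.)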

    ===========================

    raw_vasp_output_files [zip files (diatomic_db_raw.zip, isolated_db_raw.zip)] These zip files contain raw output files (OUTCAR and vasprun.xml) of VASP calculations.

    ===========================

    parsed_dataset [Python pickle files (diatomic_df.pickle, isolated_df.pickle) and csv files (diatomic_df.csv, isolated_df.csv)] These files contain tables of typical physical values obtained from the VASP calculations. The Python pickle files require a Python environment with pandas and pymatgen. Files "*_df.pickle" and "*_df_protocol3.pickle" contain the same data but were saved with pickle protocol 5 and protocol 3, respectively.
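
    As a minimal loading sketch (assuming only the file names listed above; the column names are not documented in this description, so inspect the frames first):

        import pandas as pd

        # The pickle files need pandas (and pymatgen for any pymatgen objects they
        # hold); the protocol-5 pickles need Python >= 3.8, while the
        # *_df_protocol3.pickle variants also load on older Python 3 interpreters.
        diatomic = pd.read_pickle("diatomic_df.pickle")  # or "diatomic_df_protocol3.pickle"
        isolated = pd.read_csv("isolated_df.csv")        # the CSVs avoid the pickle dependencies

        print(diatomic.columns.tolist())  # inspect which physical values are tabulated
        print(diatomic.head())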

    ===========================

    codes [diatomic_parser.zip] Simple Python scripts for parsing the raw VASP output files and for plotting heatmaps and a scatter plot.
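
    The bundled scripts themselves are not reproduced in this listing, but a hedged sketch of the kind of parsing they perform, using pymatgen (which the parsed pickles already require), might look as follows; the archive paths are assumptions, not the actual zip layout:

        from pymatgen.io.vasp.outputs import Vasprun

        def total_energy(path: str) -> float:
            """Final total energy in eV from a vasprun.xml file."""
            return float(Vasprun(path, parse_potcar_file=False).final_energy)

        # Hypothetical paths; combine diatomic and isolated runs into E_b(AB).
        e_ab = total_energy("diatomic/H-He/vasprun.xml")
        e_a = total_energy("isolated/H/vasprun.xml")
        e_b = total_energy("isolated/He/vasprun.xml")
        print("binding energy (eV):", e_ab - e_a - e_b)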

  20. o

    Unexplained Death in Infancy by deprivation and ethnicity

    • ora.ox.ac.uk
    jpeg, plain
    Updated Jan 1, 2018
    Cite
    Kroll, ME (2018). Unexplained Death in Infancy by deprivation and ethnicity [Dataset]. http://doi.org/10.5287/bodleian:XmE4XBaoZ
    Explore at:
    Available download formats: jpeg (96961), plain (841)
    Dataset updated
    Jan 1, 2018
    Dataset provided by
    University of Oxford
    Authors
    Kroll, ME
    License

    https://ora.ox.ac.uk/terms_of_use

    Time period covered
    2006 - 2012
    Area covered
    England and Wales
    Description

    JPEG file: supplementary graph derived from the same large dataset as the analysis reported in the cited journal article. TXT file: data for this graph, and reference for the journal article. This graph relates to a journal article that can be viewed at: http://dx.doi.org/10.1136/jech-2018-21045 (see Related Items). We report a nearly five-fold disparity in risk of Unexplained Death in Infancy (UDI) across ethnic groups in England and Wales, and demonstrate that this disparity is not explained by deprivation. Formal adjustment for deprivation (IMD quintiles) does not even slightly reduce the ethnic variation (see Table 2 of the cited paper). A simple scatter plot of ethnic groups illustrates the lack of a relationship between deprivation and risk, with a virtually horizontal overall trend line (as shown in this Dataset). For example, Black Caribbean babies have nearly triple the UDI risk of Black African babies, but similar levels of deprivation. The Indian, Pakistani and Bangladeshi ethnic groups each have around half the risk of White British babies; the White British and Indian groups have similar (relatively low) levels of deprivation, and the Pakistani and Bangladeshi groups are the most deprived in England and Wales. In the cited paper we discuss various potential mediators of the ethnic differences, including sleep practices, breastfeeding and tobacco use, based on the ethnic-specific prevalence of these factors in prior survey data. We suggest that careful comparison of ethnic patterns of exposure and outcome might lead to a better understanding of the aetiology of these very distressing deaths.
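
    As an illustration of the plot construction described above (the numbers below are hypothetical placeholders, not the values in the TXT file):

        import numpy as np
        import matplotlib.pyplot as plt

        # Hypothetical group-level values: mean deprivation score and UDI risk.
        deprivation = np.array([2.0, 2.2, 4.5, 4.6, 3.1, 3.4])
        udi_risk = np.array([0.45, 0.20, 0.22, 0.24, 0.60, 0.18])

        # A straight-line fit; a near-horizontal slope corresponds to the
        # "virtually horizontal overall trend line" described above.
        slope, intercept = np.polyfit(deprivation, udi_risk, 1)
        xs = np.linspace(deprivation.min(), deprivation.max(), 50)

        plt.scatter(deprivation, udi_risk)
        plt.plot(xs, slope * xs + intercept)
        plt.xlabel("Mean deprivation score")
        plt.ylabel("UDI risk per 1,000 live births")
        plt.show()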
