100+ datasets found
  1. DataSheet1_Exploratory data analysis (EDA) machine learning approaches for...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
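    The EDA workflow described in the abstract (standardize the spectra, reduce dimensionality with PCA and UMAP, cluster in the reduced space, and compare clusters to experimental conditions) can be prototyped in a few lines. The sketch below is illustrative only; the file "spectra.csv" and the "seawater_type" column are hypothetical placeholders, not the authors' actual data.

    ```python
    # Illustrative EDA sketch (not the authors' pipeline): standardize spectra,
    # reduce dimensionality with PCA and UMAP, cluster in the reduced space, and
    # compare clusters with an experimental label. File and column names are hypothetical.
    import pandas as pd
    import umap  # umap-learn
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    df = pd.read_csv("spectra.csv")              # rows = spectra, columns = m/z intensities + labels
    X = StandardScaler().fit_transform(df.drop(columns=["seawater_type"]))

    X_pca = PCA(n_components=2).fit_transform(X)
    X_umap = umap.UMAP(n_neighbors=15, random_state=0).fit_transform(X)

    for name, emb in [("PCA", X_pca), ("UMAP", X_umap)]:
        clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)
        ari = adjusted_rand_score(df["seawater_type"], clusters)   # clusters vs. seawater composition
        print(f"{name}: adjusted Rand index = {ari:.2f}")
    ```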

  2. Data from: The Often-Overlooked Power of Summary Statistics in Exploratory...

    • acs.figshare.com
    xlsx
    Updated Jun 8, 2023
    Cite
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford (2023). The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE) [Dataset]. http://doi.org/10.1021/acs.jcim.1c00244.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    ACS Publications
    Authors
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
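    As a rough illustration of summary-statistic EDA of the kind compared in this paper, the sketch below computes per-spectrum statistics with NumPy and pandas. The entropy-based column is only an approximation of PRE (Shannon entropy of the unit-normalized spectrum), which is an assumption about its exact definition, and the input file is hypothetical.

    ```python
    # Sketch of per-spectrum summary statistics for EDA (not the paper's code).
    # "pre" approximates pattern recognition entropy as the Shannon entropy of each
    # spectrum normalized to unit sum; the other statistics are standard.
    import numpy as np
    import pandas as pd

    spectra = np.loadtxt("spectra.txt")          # hypothetical: rows = spectra, cols = channels

    def shannon_entropy(row):
        p = np.abs(row) / np.abs(row).sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    stats = pd.DataFrame({
        "mean":  spectra.mean(axis=1),
        "std":   spectra.std(axis=1),
        "norm1": np.abs(spectra).sum(axis=1),            # 1-norm
        "range": spectra.max(axis=1) - spectra.min(axis=1),
        "ssq":   (spectra ** 2).sum(axis=1),             # sum of squares
        "pre":   np.apply_along_axis(shannon_entropy, 1, spectra),
    })
    print(stats.describe())
    ```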

  3. Mental Health Support Feature Analysis

    • kaggle.com
    zip
    Updated Jan 24, 2023
    Cite
    The Devastator (2023). Mental Health Support Feature Analysis [Dataset]. https://www.kaggle.com/datasets/thedevastator/mental-health-support-feature-analysis
    Explore at:
    Available download formats: zip (961023031 bytes)
    Dataset updated
    Jan 24, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Mental Health Support Feature Analysis

    Correlating Text Features and Mental Health Indicators

    By [source]

    About this dataset

    This dataset is an invaluable source of information for exploring the psychological and linguistic features of mental health support discussions conducted on Reddit in 2019. The data consists of text from posts extracted from a variety of subreddits, as well as over 256 features that may provide insight into the psychological and linguistic characteristics within these conversations.

    The included indicators measure readability (Automated Readability Index, Coleman-Liau Index, Flesch Reading Ease, Gunning Fog Index, Lix, and Wiener Sachtextformel scores), TF-IDF values for key topics such as abuse, alcohol use, anxiety, depression symptoms, and family matters, and basic text statistics such as words and syllables per sentence, total characters per post, total phrases or sentences per submission, and counts of long, monosyllabic, and polysyllabic words.

    Sentiment analysis is another measurement made available within this dataset: negativity, neutrality, and positivity scores can be compared across posts discussing topics such as economic stressors, isolation experiences, substance use frequency, or gun control. The dataset also captures punctuation tendencies and the use of first-person (I), second-person (you), and third-person (him/her/they) pronouns, along with scores for achievement language, adverb use, affective processes, anxieties mentioned in discussions of religious topics, and sadness expressed in exchanges between people seeking relationship advice.

    By providing this wealth of text-derived measures from a broad range of online mental health conversations, the dataset can support further research aimed at better profiling populations emotionally affected as reflected in their digital footprints.


    How to use the dataset

    Using this dataset, you will be able to analyze various psychological and linguistic features in mental health support online discussions, in order to identify high-risk behaviors. The data consists of text from posts, as well as over 256 different indicators of psychological and linguistic features.

    To get started, you will need to set up your own local environment with the necessary packages for running the dataset. You can find this information on the Kaggle page for this dataset. Once you have all that set up, you'll be able to dive into exploring the available data!

    The first step is to look at each column header and understand what each feature measures. The dataset contains readability features such as the Automated Readability Index (ARI), Coleman-Liau Index (CLI), Flesch Reading Ease, Gunning Fog Index (GFI), Lix, and Wiener Sachtextformel; sentiment scores such as sentiment negative (SENT_NEG) and sentiment compound (SENT_COMPOUND); and textual features including TF-IDF values for words related to topics such as abuse, alcohol, anxiety, depression, family, fear, medication, problems, stress, and suicide.

    These features, collected from Reddit mental health support discussions between 2019 and 2020 on topics such as abuse, substance use, economic issues, and social isolation, can help identify high-risk behaviors among people discussing their problems online. Studying patterns and trends beyond the raw text content could give monitoring or criminal justice authorities a toolkit for flagging high-risk behavior or discussions of illegal activities such as drug dealing or weapons exchange.
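    As a starting point, the feature table can be explored with pandas along the lines below. The file name and column names (e.g. "ARI", "SENT_NEG", the "TFIDF_" prefix) are assumptions based on the description above and have not been verified against the actual dataset.

    ```python
    # Hedged exploration sketch; file and column names are assumed placeholders.
    import pandas as pd

    df = pd.read_csv("mental_health_features.csv")
    print(df.shape)                                  # posts x ~256 features
    print(df.dtypes.value_counts())

    # Correlations between readability indices and sentiment scores
    cols = ["ARI", "CLI", "GFI", "SENT_NEG", "SENT_COMPOUND"]
    print(df[cols].corr().round(2))

    # Rank TF-IDF topic features by their correlation with negative sentiment
    tfidf_cols = [c for c in df.columns if c.startswith("TFIDF_")]
    print(df[tfidf_cols].corrwith(df["SENT_NEG"]).sort_values(ascending=False).head(10))
    ```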

  4. Total capital and labor losses of floods in Indonesia

    • data.mendeley.com
    • narcis.nl
    Updated Jul 16, 2021
    Cite
    ambiyah abdullah (2021). Total capital and labor losses of floods in Indonesia [Dataset]. http://doi.org/10.17632/4nfv9ghgxp.1
    Explore at:
    Dataset updated
    Jul 16, 2021
    Authors
    ambiyah abdullah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Indonesia
    Description

    This dataset consists of estimated direct capital losses (at both the national level and the level of three provinces) and labor losses due to floods in Indonesia. The direct capital losses are calculated in three steps. First, the direct flood losses of the three provinces (based on the damage and loss reports published by BAPPENAS and BNPB) are mapped onto the sector classification of the 2010 Indonesian IO table, and the total direct flood losses of the three provinces are computed. Second, total direct flood losses for Indonesia are estimated by multiplying the total from the first step by the share of the three provinces (Jakarta, West Papua, and North Sulawesi) in Indonesia's total 2010 GDP at 2010 constant prices. Third, direct capital losses are estimated by dividing total direct flood losses by the total capital of Indonesia, taken from the 2010 Indonesian IO table. The same steps are applied to estimate direct capital losses at the province level. Labor losses are calculated as the total number of people affected by floods divided by the total population of Indonesia in 2010.
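    The three steps reduce to simple ratio arithmetic; a worked sketch with purely illustrative placeholder numbers (not values from the BAPPENAS/BNPB reports or the 2010 IO table) follows.

    ```python
    # Worked sketch of the three steps described above, with illustrative numbers only.
    province_losses = {"Jakarta": 5.0, "West Papua": 0.8, "North Sulawesi": 0.6}  # hypothetical, trillion IDR
    total_province_losses = sum(province_losses.values())                         # step 1: provincial total

    gdp_share_three_provinces = 0.18             # hypothetical share of the 2010 national GDP
    national_losses = total_province_losses * gdp_share_three_provinces           # step 2, as described

    total_capital_2010 = 900.0                   # hypothetical total capital from the 2010 IO table
    capital_loss_rate = national_losses / total_capital_2010                      # step 3

    affected_people, population_2010 = 1_200_000, 237_600_000                     # hypothetical counts
    labor_loss_rate = affected_people / population_2010

    print(f"capital loss rate: {capital_loss_rate:.3%}, labor loss rate: {labor_loss_rate:.3%}")
    ```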

  5. Analysis and Interpretation of Imaging Mass Spectrometry Data by Clustering...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Feb 18, 2016
    Cite
    Becker, Michael; Chernyavsky, Ilya; Nikolenko, Sergey; von Eggeling, Ferdinand; Alexandrov, Theodore (2016). Analysis and Interpretation of Imaging Mass Spectrometry Data by Clustering Mass-to-Charge Images According to Their Spatial Similarity [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001727913
    Explore at:
    Dataset updated
    Feb 18, 2016
    Authors
    Becker, Michael; Chernyavsky, Ilya; Nikolenko, Sergey; von Eggeling, Ferdinand; Alexandrov, Theodore
    Description

    Imaging mass spectrometry (imaging MS) has emerged in the past decade as a label-free, spatially resolved, and multipurpose bioanalytical technique for direct analysis of biological samples from animal tissue, plant tissue, biofilms, and polymer films. Imaging MS has been successfully incorporated into many biomedical pipelines where it is usually applied in the so-called untargeted mode, capturing spatial localization of a multitude of ions from a wide mass range. An imaging MS data set usually comprises thousands of spectra and tens to hundreds of thousands of mass-to-charge (m/z) images and can be as large as several gigabytes. Unsupervised analysis of an imaging MS data set aims at finding hidden structures in the data with no a priori information used and is often exploited as the first step of imaging MS data analysis. We propose a novel, easy-to-use and easy-to-implement approach to answer one of the key questions of unsupervised analysis of imaging MS data: what do all m/z images look like? The key idea of the approach is to cluster all m/z images according to their spatial similarity so that each cluster contains spatially similar m/z images. We propose a visualization of both spatial and spectral information obtained using clustering that provides an easy way to understand what all m/z images look like. We evaluated the proposed approach on matrix-assisted laser desorption ionization imaging MS data sets of a rat brain coronal section and human larynx carcinoma and discussed several scenarios of data analysis.
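    The core idea, clustering m/z images by spatial similarity, can be sketched as follows. This is not the paper's implementation, just a minimal version using correlation distance and average-linkage hierarchical clustering on a synthetic datacube.

    ```python
    # Sketch: treat each m/z image as a vector of pixel intensities and cluster
    # images whose spatial patterns are similar (correlation distance, average linkage).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    n_mz, ny, nx = 500, 64, 64                       # hypothetical datacube dimensions
    cube = np.random.rand(n_mz, ny, nx)              # synthetic stand-in for real ion images

    vectors = cube.reshape(n_mz, -1)                 # one row per m/z image
    dist = pdist(vectors, metric="correlation")      # 1 - Pearson correlation between images
    tree = linkage(dist, method="average")
    labels = fcluster(tree, t=10, criterion="maxclust")   # e.g. 10 clusters of similar images

    for k in np.unique(labels):
        print(f"cluster {k}: {np.sum(labels == k)} m/z images")
    ```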

  6. Methodology data of "A qualitative and quantitative citation analysis toward...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 8, 2022
    Cite
    Ivan Heibi; Ivan Heibi; Silvio Peroni; Silvio Peroni (2022). Methodology data of "A qualitative and quantitative citation analysis toward retracted articles: a case of study" [Dataset]. http://doi.org/10.5281/zenodo.4323221
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Heibi; Ivan Heibi; Silvio Peroni; Silvio Peroni
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This document contains the datasets and visualizations generated after the application of the methodology defined in our work: "A qualitative and quantitative citation analysis toward retracted articles: a case of study". The methodology defines a citation analysis of the Wakefield et al. [1] retracted article from a quantitative and qualitative point of view. The data contained in this repository are based on the first two steps of the methodology. The first step of the methodology (i.e. “Data gathering”) builds an annotated dataset of the citing entities; this step is also discussed in detail in [2]. The second step (i.e. "Topic Modelling") runs a topic modeling analysis on the textual features contained in the dataset generated by the first step.

    Note: the data are all contained inside the "method_data.zip" file. You need to unzip the file to get access to all the files and directories listed below.

    Data gathering

    The data generated by this step are stored in "data/":

    1. "cits_features.csv": a dataset containing all the entities (rows in the CSV) which have cited the Wakefield et al. retracted article, and a set of features characterizing each citing entity (columns in the CSV). The features included are: DOI ("doi"), year of publication ("year"), the title ("title"), the venue identifier ("source_id"), the title of the venue ("source_title"), yes/no value in case the entity is retracted as well ("retracted"), the subject area ("area"), the subject category ("category"), the sections of the in-text citations ("intext_citation.section"), the value of the reference pointer ("intext_citation.pointer"), the in-text citation function ("intext_citation.intent"), the in-text citation perceived sentiment ("intext_citation.sentiment"), and a yes/no value to denote whether the in-text citation context mentions the retraction of the cited entity ("intext_citation.section.ret_mention").
      Note: this dataset is licensed under a Creative Commons public domain dedication (CC0).
    2. "cits_text.csv": this dataset stores the abstract ("abstract") and the in-text citations context ("intext_citation.context") for each citing entity identified using the DOI value ("doi").
      Note: the data keep their original license (the one provided by their publisher). This dataset is provided in order to favor the reproducibility of the results obtained in our work.

    Topic modeling
    We ran a topic modeling analysis on the textual features gathered (i.e., abstracts and citation contexts). The results are stored inside the "topic_modeling/" directory. The topic modeling was done using MITAO, a tool for mashing up automatic text analysis tools and creating a completely customizable visual workflow [3]. The topic modeling results for each textual feature are separated into two different folders: "abstracts/" for the abstracts and "intext_cit/" for the in-text citation contexts. Both directories contain the following directories/files (a bare-bones gensim sketch of the dictionary/corpus/LDA/coherence steps follows the list):

    1. "mitao_workflows/": the workflows of MITAO. These are JSON files that could be reloaded in MITAO to reproduce the results following the same workflows.

    2. "corpus_and_dictionary/": it contains the dictionary and the vectorized corpus given as inputs for the LDA topic modeling.

    3. "coherence/coherence.csv": the coherence score of several topic models trained on a number of topics from 1 - 40.

    4. "datasets_and_views/": the datasets and visualizations generated using MITAO.

    References

    1. Wakefield, A., Murch, S., Anthony, A., Linnell, J., Casson, D., Malik, M., Berelowitz, M., Dhillon, A., Thomson, M., Harvey, P., Valentine, A., Davies, S., & Walker-Smith, J. (1998). RETRACTED: Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351(9103), 637–641. https://doi.org/10.1016/S0140-6736(97)11096-0
    2. Heibi, I., & Peroni, S. (2020). A methodology for gathering and annotating the raw-data/characteristics of the documents citing a retracted article v1 (protocols.io.bdc4i2yw) [Data set]. In protocols.io. ZappyLab, Inc. https://doi.org/10.17504/protocols.io.bdc4i2yw

    3. Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling [JD]. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3

  7. Data from: Quantifying and Predicting the Impact of the Madden-Julian...

    • catalog.northslopescience.org
    Updated Feb 23, 2016
    Cite
    (2016). Quantifying and Predicting the Impact of the Madden-Julian Oscillation on the State of the Arctic [Dataset]. https://catalog.northslopescience.org/dataset/2034
    Explore at:
    Dataset updated
    Feb 23, 2016
    Area covered
    Arctic
    Description

    Many authors have documented variability in the state of the Arctic on a range of time scales, including the intraseasonal. Recent studies have noted the dependence of Arctic surface temperature and circulation on deep convection associated with the leading mode of tropical intraseasonal atmospheric variability, the Madden-Julian Oscillation (MJO). However, very few studies have connected this relationship to the key Arctic parameters of ice and snow extent. The character of Arctic sea ice and snow cover extent is the result of a complex interrelationship between meteorological and oceanographic factors. These parameters have been found to vary on a number of time and spatial scales, including the intraseasonal, and Arctic sea ice and snow extent have been found to be strongly related to both thermodynamic and dynamic forcing on that time scale. Furthermore, several recent studies have found that the leading mode of tropical atmospheric variability on the intraseasonal scale, the MJO, modifies these mid- and high-latitude thermodynamic and dynamic forcing mechanisms. However, the MJO's impact on sea ice and snow cover remains largely unstudied. Therefore, by mapping the dependence of Arctic sea ice and snow cover extent on the phase of the MJO, explaining the observed variability via known relationships to atmospheric state variables, and transitioning the results to a statistical prediction model, this research effort fills important gaps in knowledge and prediction of the Arctic system.

    Intellectual merit: The PIs have previously investigated the character of Northern Hemisphere snow cover extent and the intraseasonal variability of other components of the atmosphere, including precipitation, tropospheric pressure and circulation patterns, fronts, and tropical cyclones. Preliminary results for the Arctic show statistically significant modulation of summer sea ice, both volume and extent, by phase of the MJO. The scientific objectives of this research are motivated by these findings, and the methods of this grant extend analysis techniques from prior studies to explore a new topic: the impact of the MJO on the state of the Arctic system.

    The broader impacts of this activity center on three areas. First, the project will integrate undergraduate oceanography majors as participants in all components of the effort: data analysis, explanation of observed patterns, and presentation of results. Through weekly group and individual meetings, these students will learn both the techniques of cutting-edge scientific inquiry and the meteorology governing intraseasonal variability of the Arctic. Participation by students from underrepresented groups will be particularly solicited. Second, project results will be disseminated broadly to the scientific community. The PIs and students will publish results in peer-reviewed journals and give presentations at scientific conferences. At the end of each academic year, students will present their theses to peers and faculty, highlighting both methodology and results. Composite maps, a description of the methodology, and results from the statistical prediction model will be hosted on a project web site. Third, the research represents the first steps in a long-term goal of the PIs to understand and predict the Arctic on the intraseasonal time scale. The results of this study will be used to build one of the first prediction schemes for Arctic sea ice and snow cover extent on an intraseasonal time scale.

    This RUI research combines knowledge from the polar meteorology and climate research communities to investigate impacts on the Arctic system from the leading mode of tropical atmospheric intraseasonal variability, the Madden-Julian Oscillation (MJO). The specific research tasks are to quantify variability by phase of the MJO in Arctic sea ice and snow extent and then explain the observed variability by examining composites of surface, upper-air, precipitation, and reanalysis data. Results from these tasks will fill critical gaps in understanding and predicting Arctic intraseasonal variability. The activity integrates undergraduate students at each stage of the research, from data analysis to publishing of results, and represents the first step toward the long-term goal of the PIs to develop a statistical-dynamical prediction model for Arctic ice and snow. No fieldwork is conducted.

  8. Smart Device Data

    • kaggle.com
    zip
    Updated Dec 31, 2021
    Cite
    J Colbert (2021). Smart Device Data [Dataset]. https://www.kaggle.com/datasets/jcolbert/smart-device-data
    Explore at:
    Available download formats: zip (6831419 bytes)
    Dataset updated
    Dec 31, 2021
    Authors
    J Colbert
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This data was collected to complete the Google data analytics certification capstone project.

    Content

    This data was collected from approximately 33 FitBit users. It includes users' activity levels, calories burned, sleep data, and more. Each dataset contains user IDs and a timestamp.

    Inspiration

    This is the first step in my data analytics journey!

  9. Replication Data for Exploring an extinct society through the lens of...

    • dataone.org
    Updated Dec 16, 2023
    Cite
    Wieczorek, Oliver; Malzahn, Melanie (2023). Replication Data for Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus [Dataset]. http://doi.org/10.7910/DVN/UF8DHK
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wieczorek, Oliver; Malzahn, Melanie
    Description

    The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". This study aimed at utilizing the CEToM corpus (https://cetom.univie.ac.at/) of Tocharian texts to analyze the life-world of the elites of an extinct society situated in modern eastern China. To acquire the raw data needed for steps 1 and 2, please contact Melanie Malzahn (melanie.malzahn@univie.ac.at). We conducted a mixed-methods study consisting of close reading, content analysis, and multiple correspondence analysis (MCA). The Excel file "fragments_architecture_combined.xlsx" allows for replication of the MCA and corresponds to the third step of the workflow outlined below.

    Data preparation and merging were done in Python (version 3.9.10) with the packages pandas (1.5.3), os (3.12.0), re (3.12.0), numpy (1.24.3), gensim (4.3.1), BeautifulSoup4 (4.12.2), pyasn1 (0.4.8), and langdetect (1.0.9). The multiple correspondence analyses were conducted in R (version 4.3.2) with the packages FactoMineR (2.9), factoextra (1.0.7), readxl (1.4.3), tidyverse (2.0.0), ggplot2 (3.4.4), and psych (2.3.9).

    After requesting the necessary files, open and execute the scripts in the order outlined below to replicate the analysis (a hedged Python sketch of the first step appears after this description).

    Preparatory step: create a folder for the Python and R scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script generates the following folders:
    • "tarim-brahmi_database": contains the Tocharian dictionaries and Tocharian text fragments.
    • "dictionaries": Tocharian A and Tocharian B vocabularies, including linguistic features such as translations, meanings, and part-of-speech tags. A full overview of the words is provided at https://cetom.univie.ac.at/?words.
    • "fragments": Tocharian text fragments as XML files.
    • "word_corpus_data": will contain Excel files of the corpus data after the first step.
    • "Architectural_terms": data on the architectural terms used in the dataset (e.g. dwelling, house).
    • "regional_data": data on the findspots (Tocharian and modern Chinese equivalents, e.g. Duldur-Akhur and Kucha).
    • "mca_ready_data": the folder in which the Excel file with the merged data will be saved. Note that the prepared file "fragments_architecture_combined.xlsx" can be saved into this directory; this allows you to skip steps 1 and 2 and reproduce the MCA of the content analysis based on the third step of the workflow (R script 3_conduct_MCA.R).

    First step, run 1_read_xml-files.py: loops over the XML files in the dictionaries folder and identifies word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. It then loops over the XML text files and extracts a text id number, language (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, the material on which the text is written, medium, findspot, the source text in Tocharian, and the translation where available. After successful feature extraction, the resulting pandas dataframe is exported to the word_corpus_data folder.
    Second step, run 2_merge_excel_files.py: merges all Excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was originally based on close reading.
    Third step, run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA, then produces the descriptive values before conducting the MCA, identifying typical texts per dimension, and exporting the png files uploaded to this repository.
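    A hedged sketch of the first step is shown below: loop over fragment XML files, extract a few metadata fields with BeautifulSoup, and export a pandas DataFrame. The tag names used here are placeholders, not the actual schema read by 1_read_xml-files.py.

    ```python
    # Hedged sketch of step 1: parse fragment XML files and build a metadata table.
    # Tag names ("language", "title", "findspot") are illustrative assumptions.
    import os
    import pandas as pd
    from bs4 import BeautifulSoup

    rows = []
    fragments_dir = "tarim-brahmi_database/fragments"
    for fname in os.listdir(fragments_dir):
        if not fname.endswith(".xml"):
            continue
        with open(os.path.join(fragments_dir, fname), encoding="utf-8") as fh:
            soup = BeautifulSoup(fh, "xml")
        rows.append({
            "text_id": fname.removesuffix(".xml"),
            "language": soup.find("language").get_text(strip=True) if soup.find("language") else None,
            "title": soup.find("title").get_text(strip=True) if soup.find("title") else None,
            "findspot": soup.find("findspot").get_text(strip=True) if soup.find("findspot") else None,
        })

    pd.DataFrame(rows).to_excel("word_corpus_data/fragments_metadata.xlsx", index=False)
    ```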

  10. First Stage Organic Infant Formula Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jan 29, 2025
    + more versions
    Cite
    Data Insights Market (2025). First Stage Organic Infant Formula Report [Dataset]. https://www.datainsightsmarket.com/reports/first-stage-organic-infant-formula-1273890
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Jan 29, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The First Stage Organic Infant Formula market was valued at USD XXX million in 2024 and is projected to reach USD XXX million by 2033, at an expected CAGR of XX% over the forecast period.

  11. Data from: Data and code from: Topographic wetness index as a proxy for soil...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Topographic wetness index as a proxy for soil moisture in a hillslope catena: flow algorithms and map generalization [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-topographic-wetness-index-as-a-proxy-for-soil-moisture-in-a-hillslope-c-e5e42
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains all data and code necessary to reproduce the analysis presented in the manuscript: Winzeler, H.E., Owens, P.R., Read, Q.D., Libohova, Z., Ashworth, A., Sauer, T. 2022. Topographic wetness index as a proxy for soil moisture in a hillslope catena: flow algorithms and map generalization. Land 11:2018. DOI: 10.3390/land11112018.

    There are several steps to this analysis; the relevant scripts for each are listed below. The first step is to use the raw digital elevation model (DEM) to produce different versions of the topographic wetness index (TWI) for the study region (Calculating TWI). Then, these TWI output files are processed, along with soil moisture (volumetric water content, VWC) time series data from a number of sensors located within the study region, to create analysis-ready data objects (Processing TWI and VWC). Next, models are fit relating TWI to soil moisture (Model fitting) and results are plotted (Visualizing main results). A number of additional analyses were also done (Additional analyses).

    Input data

    The DEM of the study region is archived in this dataset as SourceDem.zip. This contains the DEM of the study region (DEM1.sgrd) and associated auxiliary files, all called DEM1.* with different extensions. In addition, the DEM is provided as a .tif file called USGS_one_meter_x39y400_AR_R6_WashingtonCO_2015.tif. The remaining data and code files are archived in the repository created with a GitHub release on 2022-10-11, twi-moisture-0.1.zip. The data are found in a subfolder called data:
    • 2017_LoggerData_HEW.csv through 2021_HEW.csv: soil moisture (VWC) logger data for each year 2017-2021 (5 files total).
    • 2882174.csv: weather data from a nearby station.
    • DryPeriods2017-2021.csv: starting and ending days for dry periods 2017-2021.
    • LoggerLocations.csv: geographic locations and metadata for each VWC logger.
    • Logger_Locations_TWI_2017-2021.xlsx: 546 topographic wetness indexes calculated at each VWC logger location. Note: this is intermediate input created in the first step of the pipeline.

    Code pipeline

    To reproduce the analysis in the manuscript, run these scripts in the following order. The scripts are all found in the root directory of the repository. See the manuscript for more details on the methods.

    Calculating TWI
    • TerrainAnalysis.R: taking the DEM file as input, calculates 546 different topographic wetness indexes using a variety of different algorithms. Each algorithm is run multiple times with different input parameters, as described in more detail in the manuscript. After performing this step, it is necessary to use the SAGA-GIS GUI to extract the TWI values for each of the sensor locations. The output generated in this way is included in this repository as Logger_Locations_TWI_2017-2021.xlsx, so it is not necessary to rerun this step of the analysis, but the code is provided for completeness. (A minimal sketch of the basic TWI formula appears after this description.)

    Processing TWI and VWC
    • read_process_data.R: takes raw TWI and moisture data files and processes them into analysis-ready format, saving the results as CSV.
    • qc_avg_moisture.R: does additional quality control on the moisture data and averages it across different time periods.

    Model fitting

    Models were fit regressing soil moisture (average VWC for a certain time period) against a TWI index, with and without soil depth as a covariate. In each case, for both the model without depth and the model with depth, prediction performance was calculated with and without spatially-blocked cross-validation. Where cross-validation wasn't used, we simply used the predictions from the model fit to all the data.
    • fit_combos.R: models were fit to each combination of soil moisture averaged over 57 months (all months from April 2017 to December 2021) and 546 TWI indexes. In addition, models were fit to soil moisture averaged over years and to the grand mean across the full study period.
    • fit_dryperiods.R: models were fit to soil moisture averaged over previously identified dry periods within the study period (each 1 or 2 weeks in length), again for each of the 546 indexes.
    • fit_summer.R: models were fit to the soil moisture average for the months of June-September for each of the five years, again for each of the 546 indexes.

    Visualizing main results

    Preliminary visualization of results was done in a series of RMarkdown notebooks. All the notebooks follow the same general format, plotting model performance (observed-predicted correlation) across different combinations of time period and characteristics of the TWI indexes being compared. The indexes are grouped by SWI versus TWI, DEM filter used, flow algorithm, and any other parameters that varied. The notebooks show the model performance metrics with and without the soil depth covariate, and with and without spatially-blocked cross-validation. Crossing those two factors, there are four values of model performance for each combination of time period and TWI index presented.
    • performance_plots_bymonth.Rmd: using the results from the models fit to each month of data separately, prediction performance was averaged by month across the five years of data to show within-year trends.
    • performance_plots_byyear.Rmd: using the results from the models fit to each month of data separately, prediction performance was averaged by year to show trends across multiple years.
    • performance_plots_dry_periods.Rmd: prediction performance for the models fit to the previously identified dry periods.
    • performance_plots_summer.Rmd: prediction performance for the models fit to the June-September moisture averages.

    Additional analyses

    Some additional analyses were done that may not be published in the final manuscript but are included here for completeness.
    • 2019dryperiod.Rmd: analysis, done separately for each day, of a specific dry period in 2019.
    • alldryperiodsbyday.Rmd: analysis, done separately for each day, of the same dry periods discussed above.
    • best_indices.R: after fitting models, this script was used to quickly identify some of the best-performing indexes for closer scrutiny.
    • wateryearfigs.R: exploratory figures showing the median and quantile interval of VWC for sensors in low and high TWI locations for each water year.

    Resources in this dataset
    • Resource Title: Digital elevation model of study region. File Name: SourceDem.zip. Resource Description: .zip archive containing digital elevation model files for the study region. See the dataset description for more details.
    • Resource Title: twi-moisture-0.1: archived git repository containing all other necessary data and code. File Name: twi-moisture-0.1.zip. Resource Description: .zip archive containing all data and code other than the digital elevation model, which is archived as a separate file. This file was generated by a GitHub release made on 2022-10-11 of the git repository hosted at https://github.com/qdread/twi-moisture (private repository). See the dataset description and the README file contained within this archive for more details.
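    For orientation, the quantity produced in the Calculating TWI step is, in its basic form, TWI = ln(a / tan(beta)), where a is the specific catchment area and beta the local slope. The sketch below shows only this core formula with hypothetical input grids, not the 546 SAGA-GIS variants used in the study.

    ```python
    # Basic topographic wetness index, TWI = ln(a / tan(beta)); illustrative only.
    import numpy as np

    cellsize = 1.0                                     # 1 m DEM
    flow_acc = np.loadtxt("flow_accumulation.txt")     # hypothetical: upslope cells per pixel
    slope_rad = np.loadtxt("slope_radians.txt")        # hypothetical: slope in radians

    specific_area = (flow_acc * cellsize**2) / cellsize                 # a, in m^2 per m contour length
    twi = np.log(specific_area / np.tan(np.clip(slope_rad, 1e-6, None)))  # clip avoids division by zero on flats
    print(np.nanmin(twi), np.nanmax(twi))
    ```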

  12. ABoVE: Light-Curve Modelling of Gridded GPP Using MODIS MAIAC and Flux Tower...

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). ABoVE: Light-Curve Modelling of Gridded GPP Using MODIS MAIAC and Flux Tower Data - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/above-light-curve-modelling-of-gridded-gpp-using-modis-maiac-and-flux-tower-data-76b14
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This dataset contains gridded estimations of daily ecosystem Gross Primary Production (GPP) in grams of carbon per day at a 1 km2 spatial resolution over Alaska and Canada from 2000-01-01 to 2018-01-01. Daily estimates of GPP were derived from a light-curve model that was fitted and validated over a network of ABoVE domain Ameriflux flux towers then upscaled using MODIS Multi-Angle Implementation of Atmospheric Correction (MAIAC) data to span the extended ABoVE domain. In general, the methods involved three steps; the first step involved collecting and processing mainly carbon-flux site-level data, the second step involved the analysis and correction of site-level MAIAC data, and the final step developed a framework to produce large-scale estimates of GPP. The light-curve parameter model was generated by upscaling from flux tower sub-daily temporal resolution by deconvolving the GPP variable into 3 components: the absorbed photosynthetically active radiation (aPAR), the maximum GPP or maximum photosynthetic capacity (GPPmax), and the photosynthetic limitation or amount of light needed to reach maximum capacity (PPFDmax). GPPmax and PPFDmax were related to satellite reflectance measurements sampled at the daily scale. GPP over the extended ABoVE domain was estimated at a daily resolution from the light-curve parameter model using MODIS MAIAC daily reflectance as input. This framework allows large-scale estimates of phenology and evaluation of ecosystem sensitivity to climate change.
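    A light-curve fit of the general kind described above can be illustrated with SciPy. The saturating-exponential form relating GPP to aPAR via GPPmax and PPFDmax is an assumed stand-in, not necessarily the exact functional form used for this dataset, and the observations below are synthetic.

    ```python
    # Hedged illustration of fitting a light-curve model to tower-style data.
    import numpy as np
    from scipy.optimize import curve_fit

    def light_curve(apar, gpp_max, ppfd_max):
        # GPP rises with absorbed PAR and saturates at gpp_max; ppfd_max controls
        # how much light is needed to approach maximum photosynthetic capacity.
        return gpp_max * (1.0 - np.exp(-apar / ppfd_max))

    apar = np.linspace(0, 2000, 200)                                   # synthetic absorbed PAR
    gpp_obs = light_curve(apar, 12.0, 600.0) + np.random.normal(0, 0.5, apar.size)

    (gpp_max_hat, ppfd_max_hat), _ = curve_fit(light_curve, apar, gpp_obs, p0=[10.0, 500.0])
    print(f"GPPmax ~ {gpp_max_hat:.1f}, PPFDmax ~ {ppfd_max_hat:.0f}")
    ```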

  13. Replication Data for: Computer-Assisted Keyword and Document Set Discovery...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Dec 11, 2018
    Cite
    Gary King; Patrick Lam; Margaret E. Roberts (2018). Replication Data for: Computer-Assisted Keyword and Document Set Discovery from Unstructured Text [Dataset]. http://doi.org/10.7910/DVN/FMJDCD
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 11, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Gary King; Patrick Lam; Margaret E. Roberts
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/FMJDCD

    Description

    The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend on this choice, researchers usually pick keywords in ad hoc ways that are far from optimal and usually biased. Most seem to think that keyword selection is easy, since they do Google searches every day, but we demonstrate that humans perform exceedingly poorly at this basic task. We offer a better approach, one that also can help with following conversations where participants rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; industry and intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated or human-only) statistical approach that suggests keywords from available text without needing structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and summarizing results with easy-to-understand Boolean search strings. We illustrate how the technique works with analyses of English texts about the Boston Marathon Bombings, Chinese social media posts designed to evade censorship, and others.
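    A much-simplified sketch of the computer-assisted idea (not the authors' algorithm) is shown below: fit a classifier separating a small on-topic reference set from an unlabeled search set, then rank terms by their learned weights as candidate keywords. The documents are toy examples.

    ```python
    # Simplified keyword-discovery sketch; reference/search documents are toy stand-ins.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    reference_docs = ["boston marathon bombing suspect arrested",
                      "investigators examine marathon bombing scene"]
    search_docs = ["city marathon route announced for spring",
                   "stock market closes higher today",
                   "coverage of the bombing investigation continues"]

    docs = reference_docs + search_docs
    y = [1] * len(reference_docs) + [0] * len(search_docs)

    vec = TfidfVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(docs)
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    terms = vec.get_feature_names_out()
    top = clf.coef_[0].argsort()[::-1][:10]        # highest-weight terms suggest keywords
    print("candidate keywords:", [terms[i] for i in top])
    ```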

  14. Replication Data for: Measuring Wikipedia Article Quality in One Dimension...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Mar 11, 2024
    Cite
    Nathan TeBlunthuis (2024). Replication Data for: Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression [Dataset]. http://doi.org/10.7910/DVN/U5V0G1
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Nathan TeBlunthuis
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset provides code, data, and instructions for replicating the analysis of Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression, published in OpenSym 2021 (link to come). The paper introduces a method for transforming scores from the ORES quality models into a single-dimensional measure of quality that is amenable to statistical analysis and well calibrated to a dataset. The purpose is to improve the validity of research into article quality through more precise measurement. The code and data for replicating the paper are found in this dataverse repository. If you wish to use the method on a new dataset, you should obtain the actively maintained version of the code from this git repository. If you attempt to replicate part of this repository, please let me know via an email to nathante@uw.edu.

    Replicating the Analysis from the OpenSym Paper

    This project analyzes a sample of articles with quality labels from the English Wikipedia XML dumps from March 2020. Copies of the dumps are not provided in this dataset; they can be obtained via https://dumps.wikimedia.org/. Everything else you need to replicate the project (other than a sufficiently powerful computer) should be available here. The project is organized into stages, and the prerequisite data files are provided at each stage so you do not need to rerun the entire pipeline from the beginning, which is not easily done without a high-performance computer. If you start replicating at an intermediate stage, this should overwrite the inputs to the downstream stages, which should make it easier to verify a partial replication. To help manage the size of the dataverse, all code files are included in code.tar.gz. Extracting this with tar xzvf code.tar.gz is the first step.

    Getting Set Up

    You need a version of R >= 4.0 and a version of Python >= 3.7.8. You also need a bash shell, tar, gzip, and make installed, as they should be on any Unix system. To install brms you need a working C++ compiler; if you run into trouble, see the instructions for installing Rstan. The datasets were built on CentOS 7, except for the ORES scoring, which was done on Ubuntu 18.04.5, and the building, which was done on Debian 9. The RemembR and pyRembr projects provide simple tools for saving intermediate variables for building papers with LaTeX. First, extract the articlequality.tar.gz, RemembR.tar.gz and pyRembr.tar.gz archives. Then install the following:

    Python packages: running the following steps in a new Python virtual environment is strongly recommended. Run pip3 install -r requirements.txt to install the Python dependencies, then navigate into the pyRembr directory and run python3 setup.py install.

    R packages: run Rscript install_requirements.R to install the necessary R libraries. If you run into trouble installing brms, see the instructions for installing Rstan.

    Drawing a Sample of Labeled Articles

    I provide steps and intermediate data files for replicating the sampling of labeled articles. The steps in this section are quite computationally intensive; those only interested in replicating the models and analyses should skip this section.

    Extracting Metadata from Wikipedia Dumps

    Metadata from the Wikipedia dumps is required for calibrating models to the revision and article levels of analysis. You can use the wikiq Python script from the mediawiki dump tools git repository to extract metadata from the XML dumps as TSV files. The version of wikiq that was used is provided here. Running wikiq on a full dump of English Wikipedia in a reasonable amount of time requires considerable computing resources. For this project, wikiq was run on Hyak, a high-performance computer at the University of Washington. The code for doing so is highly specific to Hyak; for transparency, and in case it helps others using similar academic computers, this code is included in WikiqRunning.tar.gz. A copy of the wikiq output is included in this dataset in the multi-part archive enwiki202003-wikiq.tar.gz. To extract this archive, download all the parts and then run cat enwiki202003-wikiq.tar.gz* > enwiki202003-wikiq.tar.gz && tar xvzf enwiki202003-wikiq.tar.gz.

    Obtaining Quality Labels for Articles

    We obtain up-to-date labels for each article using the articlequality python package included in articlequality.tar.gz. The XML dumps are also the input to this step, and while it does not require a great deal of memory, a powerful computer (we used 28 cores) is helpful so that it completes in a reasonable amount of time. extract_quality_labels.sh runs the command to extract the labels from the XML dumps. The resulting files have the format data/enwiki-20200301-pages-meta-history*.xml-p*.7z_article_labelings.json and are included in this dataset in the archive enwiki202003-article_labelings-json.tar.gz.

    Taking a Sample of Quality Labels

    I used Apache Spark to merge the metadata from wikiq with the quality labels and to draw a sample of articles where each quality class is equally represented. To...
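    For readers who only want the flavor of the core statistical step, the sketch below fits a generic ordinal regression with statsmodels and collapses the fitted class probabilities into a one-dimensional quality score. It is not the paper's ORES-based calibration procedure, and all data in it are simulated.

    ```python
    # Generic ordinal-regression sketch with simulated data; not the paper's method.
    import numpy as np
    import pandas as pd
    from statsmodels.miscmodels.ordinal_model import OrderedModel

    classes = ["Stub", "Start", "C", "B", "GA", "FA"]            # English Wikipedia quality scale
    rng = np.random.default_rng(0)
    n = 500
    X = pd.DataFrame({"predicted_quality": rng.normal(size=n)})  # hypothetical predictor
    latent = 1.5 * X["predicted_quality"] + rng.normal(size=n)
    y = pd.qcut(latent, q=6, labels=classes)                     # ordered categorical outcome

    res = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
    probs = np.asarray(res.predict(X))                           # one column per quality class
    quality_1d = probs @ np.arange(len(classes))                 # expected class index as a 1-D score
    print(quality_1d[:5])
    ```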

  15. National Household Forest Survey 2018-2019 - Liberia

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Jul 12, 2021
    Cite
    Liberia Institute of Statistics and Geo-Information Services (2021). National Household Forest Survey 2018-2019 - Liberia [Dataset]. https://microdata.worldbank.org/index.php/catalog/3787
    Explore at:
    Dataset updated
    Jul 12, 2021
    Dataset authored and provided by
    Liberia Institute of Statistics and Geo-Information Services (http://www.lisgis.gov.lr/)
    Time period covered
    2018 - 2019
    Area covered
    Liberia
    Description

    Geographic coverage

    The NHFS is focused on forest proximate households. Therefore, the sample is limited to enumeration areas which fall within 2.5km of the nearest forest, as defined using Metria and Geoville (2019) land cover data. The final sample includes enumeration areas from all 15 of Liberia's counties, but excludes urban areas of Montserrado.

    Analysis unit

    Household; Community

    Universe

    All EAs within 2.5 kilometers of forests except for the EAs from the urban part of the Montserrado county.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Given the focus of the NHFS on the population living in close proximity to forests, a first step was to clearly define forest for the purposes of the survey. Building on the national definition of forest used in Liberia, and modifying it in order to minimize the impact of small urban forests and facilitate survey operations, the NHFS employed the following definition:

    Forest = area with at least 30 percent tree canopy cover, with trees higher than 5 meters and at least 50 hectares in size

    The forest cover was determined using high-resolution forest cover data produced in 2019 based on satellite information on forest cover in Liberia for 2015. All EAs within 2.5 kilometers of forests identified with this definition were deemed eligible for inclusion in the NHFS. EAs from the Montserrado county (part of Greater Monrovia) were excluded from the sample universe due to the high rate of urbanization. However, rural parts of Montserrado county were included in the sample universe.

    Based on the forest definition defined above, the distance from each EA in the country (except urban Montserrado) to the nearest forest was computed. That distance was subsequently used to assign each EA to one of the following strata: S1 (less than 2km from forest); S2 (two to 7 km from forest); S3 (7 to 15 km from forest).

    Following strata classification, a total of 250 EAs were selected through a Probability Proportional to Size (PPS) sampling approach within each stratum, with the following purposeful allocation across strata: 90 EAs in S1; 90 EAs in S2; 70 EAs in S3. The measure of size for each EA was based on the total number of households listed in the 2008 PHC.

    Following the selection of the 250 sample EAs, a listing of households was conducted in each sample EA to provide the sampling frame for the second stage selection of households. Random sampling was used to select 12 households from the household listing for each sample EA.
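    The two-stage selection described above can be illustrated with a small NumPy sketch: an approximate PPS draw without replacement followed by simple random sampling of 12 households per selected EA. All numbers are placeholders, not survey data.

    ```python
    # Illustrative two-stage selection: approximate PPS of EAs, then SRS of households.
    import numpy as np

    rng = np.random.default_rng(1)
    ea_sizes = rng.integers(50, 400, size=900)             # hypothetical listed households per EA
    probs = ea_sizes / ea_sizes.sum()
    sampled_eas = rng.choice(len(ea_sizes), size=90, replace=False, p=probs)  # e.g. 90 EAs in one stratum

    households_per_ea = {int(ea): rng.choice(ea_sizes[ea], size=12, replace=False)
                         for ea in sampled_eas}            # second stage: 12 households per EA
    print(len(households_per_ea), "EAs,", 12 * len(households_per_ea), "households")
    ```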

    The original sample design provided a total household sample size of 3,000 (250 EAs with 12 households sampled per EA). Data from 14 households are missing or unusable, representing roughly 0.5 percent of the sample and resulting in a final sample of 2,986 households. Similarly, data from 5 of the community questionnaires were missing or unusable, resulting in a total sample of 245 community questionnaires. The final sample of 2,986 households is distributed across counties.

    Upon post-data collection analysis, it was discovered that the initial variable that was used to stratify EAs by distance to forest was incorrectly computed. Despite thorough attempts to understand the nature and source of the error, it was determined that a mechanical error must have occurred during the process of the distance calculations. This error rendered the stratification incorrect. Therefore, the stratification by distance to forest has been abandoned and the sample weighted to reflect only geographic clusters, not distance to forest. This was determined to be the most appropriate way forward following consultation with sampling experts.

    The resulting sample, therefore, is weighted to reflect all EAs in Liberia (with the exception of urban Montserrado) that fall within 2.5 km of the nearest forest, which was the upper bound of the distances for the selected EAs.

    Sampling deviation

    Please refer to the Basic Information Document found in the External Resources section.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    The NHFS survey consisted of: 1. A HH questionnaire, administered to 12 selected HHs in each enumeration area, and 2. A community questionnaire, administered to a group of members from the EA.

    Each questionnaire was administered using computer-assisted personal interviewing (CAPI) with CSPro software.

    Cleaning operations

    The data cleaning process was done in several stages over the course of fieldwork and through preliminary analysis. The first stage of data cleaning was conducted by the field-based teams during the interview itself utilizing error messages generated by the CSPro application when a response did not fit the rules for a particular question. For questions that flagged an error, the enumerators were expected to record a comment within the questionnaire to explain to their supervisor the reason for the error and confirming that they double checked the response with the respondent.

    The second stage occurred during the review of the questionnaire by the supervisors. Prior to sharing data with LISGIS HQ, the supervisor was to review the completed interviews. Depending on the outcome, the supervisor could either approve or reject the case. If rejected, the case went back to the respective enumerator and a re-visit to the household could be necessary. Additional errors were compiled into error reports by the World Bank and LISGIS HQ, which were regularly sent to the teams and then corrected based on re-visits to the household.

    The last stage, after the completion of data collection, involved a comprehensive review of the final raw data following the first- and second-stage cleaning. Every variable was examined individually for (1) consistency with other sections and variables, (2) out-of-range responses, and (3) outliers. Special care was taken to avoid making strong assumptions when resolving potential errors, so some minor errors remain in the data where the diagnosis and/or solution were unclear to the data cleaning team.
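    As an illustration only, here is a pandas sketch of the three screens described above (cross-variable consistency, out-of-range responses, and outlier flags). The column names and thresholds are hypothetical, and issues are flagged rather than auto-corrected, in line with the cautious approach described.

```python
import pandas as pd

def screen(df: pd.DataFrame) -> pd.DataFrame:
    flags = pd.DataFrame(index=df.index)
    # (1) consistency with other sections and variables
    flags["children_exceed_members"] = df["n_children"] > df["household_size"]
    # (2) out-of-range responses
    flags["head_age_out_of_range"] = ~df["head_age"].between(12, 110)
    # (3) outliers, flagged for review rather than automatically corrected
    z = (df["annual_income"] - df["annual_income"].mean()) / df["annual_income"].std()
    flags["income_outlier"] = z.abs() > 4
    return flags

df = pd.DataFrame({
    "household_size": [5, 4, 7],
    "n_children": [2, 6, 3],
    "head_age": [45, 8, 60],
    "annual_income": [900, 1100, 250_000],
})
print(screen(df))
```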

    The first and second stages of the cleaning activities were led by LISGIS, with the World Bank providing technical assistance. The third stage was performed exclusively by the World Bank team.

  16. Denmark Data Center Market Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Oct 30, 2025
    Cite
    Data Insights Market (2025). Denmark Data Center Market Report [Dataset]. https://www.datainsightsmarket.com/reports/denmark-data-center-market-11506
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    Oct 30, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Denmark
    Variables measured
    Market Size
    Description

    The market offers a range of data center services, including colocation, managed hosting, cloud computing, and disaster recovery. The colocation segment is the largest, accounting for over 60% of market revenue. Recent developments include: February 2023: GlobalConnect became the first colocation operator in Europe to offer its clients an immersion cooling technique that can reduce data center cooling power consumption by up to 90%. The next-generation cooling technology was deployed in GlobalConnect's data center in Copenhagen and will be rolled out to all remaining data centers based on customer demand. June 2021: Sentia Denmark's data center in Glostrup was acquired by European data center provider Penta Infra. The acquisition of Sentia Danmark's data centers is Penta Infra's first step toward entering the Nordic market; Penta Infra currently manages a number of data centers in the Netherlands and Germany. February 2021: The digital infrastructure provider STACK Infrastructure ("STACK") purchased a 110,000 m² plot of land, secured enough renewable energy to support the development, and obtained onsite water, planning, and building permissions to erect five data centers for a significant new campus site in Denmark. The campus masterplan provides for five 6 MW IT load data centers and an office building. Key drivers for this market are: high mobile penetration, low tariffs, and a mature regulatory authority; successful privatization and liberalization initiatives. Potential restraints include: difficulties in customization according to business needs. Notable trends are: other key industry trends covered in the report.

  17. Janneke van der Steen - PhD Project data for study 1

    • dataverse.nl
    docx, odt
    Updated Dec 20, 2024
    Cite
    Janneke van der Steen; Janneke van der Steen (2024). Janneke van der Steen - PhD Project data for study 1 [Dataset]. http://doi.org/10.34894/ODZZXL
    Explore at:
    Available download formats: docx(46260), docx(47477), docx(49305), odt(27410), docx(47369), docx(70044), docx(43384)
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    DataverseNL
    Authors
    Janneke van der Steen; Janneke van der Steen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Title: Supporting Teachers in Improving Formative Decision-Making: Design Principles for Formative Assessment Plans

    Abstract: Formative assessment is considered one of the most effective interventions to support teacher decision-making and improve education and student learning. However, formative assessment does not always meet these expectations. To be effective, formative assessment activities should be consciously and coherently planned, aligned with other aspects of the curriculum and with the decisions teachers wish to make based on these activities. While there is sufficient support for teachers to design formative assessment activities, no guidelines exist to help them tie these different activities together in an effective way. To support teachers in designing formative assessment plans that inform formative decision-making, this study focused on the creation of a set of design principles. These design principles for formative assessment plans were formulated based on expert interviews and subsequently evaluated by future users. The result is a set of eight design principles that can be used and validated in educational practice.

    Description of the data included:

    1. Four lists of clustered critical aspects for a formative assessment plan mentioned by the participants during the expert interviews. In total, twenty participants were experts from both research and practice, all involved with formative assessment. Table 1 shows the combination of experts in each expert group.

    Table 1. Participants, expert interviews (Group - Participants)
    Group 1 - Three educational researchers, one teacher-educator, and one school policy maker
    Group 2 - Two educational researchers, three teachers from one school for secondary education
    Group 3 - Nine teachers from two schools for secondary education, one teacher-educator

    The purpose of these expert interviews was to reach agreement among the participants of each group about what they thought were the critical aspects a formative assessment plan should have to be effective. These critical aspects were clustered, first within groups and then across groups, and used at a later stage as a starting point to formulate design principles. To promote consensus, the interviews were organized as group decision rooms in which the discussion was supported by a digital group support system (Fjermestad and Hiltz, 2000; Pyrko et al., 2019). This support system lets participants answer questions individually and digitally through a device, followed by a group discussion about how to cluster all given answers. By clustering their own answers during the interview, participants were directly involved in the first phase of data analysis. All subsequent steps taken in the expert interviews are presented in Table 2.

    Table 2. Activities and questions, expert interviews (Activity - Question)
    1. Participants answer individually - "Can you name three critical aspects of a formative assessment plan?"
    2. Group discussion - "How can we cluster the given aspects? What name should the different clusters have?"
    3. Participants answer individually - "Which two critical aspects a formative assessment plan should have are still missing in the composed list?"
    4. Group discussion - "Can we add these extra aspects to existing clusters or do we need to create new ones?"
    5. Participants answer individually - "If you still think that there are critical aspects missing in the composed list, can you please add them now?"
    6. Group discussion - "Can we add these extra aspects to existing clusters or do we need to create new ones?"
    7. Participants answer individually - "How would you arrange all clustered critical aspects that are the result of this expert interview, in order of importance?"

    2. Anonymized transcripts of four group interviews with future users from four schools for secondary education, set up to evaluate the draft version of the design principles for formative assessment plans. Each group consisted of five to eight teachers from the same school; in two cases, a school leader also joined the interview (see Table 3).

    Table 3. Participants, group interviews (Group - Participants)
    School 1 - Five teachers
    School 2 - Seven teachers and two school leaders
    School 3 - Five teachers and two school leaders
    School 4 - Six teachers and one school policy maker

    The teachers and school leaders were questioned about recommendations regarding the transparency, usability, completeness, and suitability of the design principles for school practice. The participants had received the design principles in advance. First, they were asked to write down all recommendations they could think of to improve the design principles. Second, they were asked to decide which facet of the design principles would improve if the recommendation were followed; the facets they could choose from were transparency, usability, completeness, or suitability. Subsequently, they were asked...

  18. Data applied to automatic method to transform routine otolith images for a...

    • seanoe.org
    image/*
    Updated 2022
    Cite
    Nicolas Andrialovanirina; Alizee Hache; Kelig Mahe; Sébastien Couette; Emilie Poisson Caillault (2022). Data applied to automatic method to transform routine otolith images for a standardized otolith database using R [Dataset]. http://doi.org/10.17882/91023
    Explore at:
    Available download formats: image/*
    Dataset updated
    2022
    Dataset provided by
    SEANOE
    Authors
    Nicolas Andrialovanirina; Alizee Hache; Kelig Mahe; Sébastien Couette; Emilie Poisson Caillault
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fisheries management is generally based on age structure models. Thus, fish ageing data are collected by experts who analyze and interpret calcified structures (scales, vertebrae, fin rays, otoliths, etc.) through a visual process. The otolith, in the inner ear of the fish, is the most commonly used calcified structure because it is metabolically inert and historically one of the first proxies developed. It contains information throughout the whole life of the fish and provides age structure data for stock assessments of all commercial species. The traditional human reading method for determining age is very time-consuming. Automated image analysis can be a low-cost alternative method; however, the first step is the transformation of routinely taken otolith images into standardized images within a database so that machine learning techniques can be applied to the ageing data. Otolith shape, resulting from the synthesis of genetic heritage and environmental effects, is a useful tool to identify stock units, so a database of standardized images could also serve this aim. Using the routinely measured otolith data of plaice (Pleuronectes platessa; Linnaeus, 1758) and striped red mullet (Mullus surmuletus; Linnaeus, 1758) in the eastern English Channel and north-east Arctic cod (Gadus morhua; Linnaeus, 1758), a greyscale image matrix was generated from the raw images in different formats. Contour detection was then applied to identify broken otoliths, the orientation of each otolith, and the number of otoliths per image. To finalize this standardization process, all images were resized and binarized. Several mathematical morphology tools were developed from these new images to align and orient the images, placing the otoliths in the same layout for each image. For this study, we used three databases from two different laboratories covering three species (cod, plaice, and striped red mullet). The method was validated for these three species and could be applied to other species for age determination and stock identification.
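    The published workflow is implemented in R; purely as an illustration of the steps listed above (greyscale conversion, binarization, contour detection, alignment, and resizing), a rough OpenCV sketch might look like the following. The function choices and thresholds are assumptions, not the authors' code.

```python
import cv2
import numpy as np

def standardize_otolith(path: str, out_size: int = 512) -> np.ndarray:
    grey = cv2.imread(path, cv2.IMREAD_GRAYSCALE)        # greyscale matrix from the raw image
    if grey is None:
        raise FileNotFoundError(path)
    # Binarize with Otsu's threshold (assumes a reasonably uniform background).
    _, binary = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Contour detection: count otoliths and spot broken or multi-otolith images.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = [c for c in contours if cv2.contourArea(c) > 500]   # drop specks
    if len(contours) != 1:
        raise ValueError(f"{path}: expected 1 otolith, found {len(contours)} contours")
    # Align along the major axis so every otolith ends up in the same layout.
    (cx, cy), (w, h), angle = cv2.minAreaRect(contours[0])
    if w < h:
        angle += 90.0
    rot = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
    aligned = cv2.warpAffine(binary, rot, binary.shape[::-1])
    # Resize to a common standard size for the database.
    return cv2.resize(aligned, (out_size, out_size), interpolation=cv2.INTER_NEAREST)
```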

  19. Analysis of Pre-Retrofit Building and Utility Data - Southeast United States

    • catalog.data.gov
    Updated Jul 22, 2025
    + more versions
    Cite
    Ibacos Innovation (2025). Analysis of Pre-Retrofit Building and Utility Data - Southeast United States [Dataset]. https://catalog.data.gov/dataset/analysis-of-pre-retrofit-building-and-utility-data-southeast-united-states
    Explore at:
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    Ibacos Innovation
    Area covered
    United States, Southeastern United States
    Description

    This project delves into the workflow and results of regression models on monthly and daily utility data (meter readings of electricity consumption), outlining a process for screening and gathering useful results from inverse models. Energy modeling predictions created in Building Energy Optimization software (BEopt) Version 2.0.0.3 (BEopt 2013) are used to infer causes of differences among similar homes. This simple data analysis is useful for the purposes of targeting audits and maximizing the accuracy of energy savings predictions with minimal costs. The data for this project are from two adjacent military housing communities of 1,166 houses in the southeastern United States. One community was built in the 1970s, and the other was built in the mid-2000s. Both communities are all electric; the houses in the older community were retrofitted with ground source heat pumps in the early 1990s, and the newer community was built to an early version of ENERGY STAR with air source heat pumps. The houses in the older community will receive phased retrofits (approximately 10 per month) in the coming years. All houses have had daily electricity metering readings since early 2011. This project explores a dataset at a simple level and describes applications of a utility data normalization. There are far more sophisticated ways to analyze a dataset of dynamic, high resolution data; however, this report focuses on simple processes to create big-picture overviews of building portfolios as an initial step in a community-scale analysis. TO4 9.1.2: Comm. Scale Military Housing Upgrades
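    As a hedged sketch of the kind of inverse model such a screening workflow might use (not the report's actual code), the following fits monthly electricity use against heating and cooling degree days and reports a common goodness-of-fit statistic; all numbers are invented.

```python
import numpy as np

# Twelve hypothetical months of metered electricity use and degree days.
kwh = np.array([1450, 1380, 1200, 980, 900, 1150, 1420, 1500, 1300, 1000, 1050, 1350], float)
hdd = np.array([620, 540, 380, 150, 20, 0, 0, 0, 10, 120, 300, 550], float)
cdd = np.array([0, 0, 10, 60, 180, 320, 410, 430, 280, 90, 10, 0], float)

# Ordinary least squares: baseload + heating slope * HDD + cooling slope * CDD.
X = np.column_stack([np.ones_like(kwh), hdd, cdd])
coef, *_ = np.linalg.lstsq(X, kwh, rcond=None)
baseload, heat_slope, cool_slope = coef

# CV(RMSE) is a common statistic for screening how well the inverse model fits.
pred = X @ coef
cvrmse = np.sqrt(np.mean((kwh - pred) ** 2)) / kwh.mean()
print(f"baseload={baseload:.0f} kWh/mo, heating={heat_slope:.2f}, "
      f"cooling={cool_slope:.2f}, CV(RMSE)={cvrmse:.1%}")
```

    Comparing the fitted baseload and slope coefficients across similar homes is one simple way to flag candidates for an audit.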

  20. International Satellite Cloud Climatology Project (ISCCP) Stage D1 3-Hourly...

    • data.nasa.gov
    • access.earthdata.nasa.gov
    • +3more
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). International Satellite Cloud Climatology Project (ISCCP) Stage D1 3-Hourly Cloud Products - Revised Algorithm in Hierarchical Data Format [Dataset]. https://data.nasa.gov/dataset/international-satellite-cloud-climatology-project-isccp-stage-d1-3-hourly-cloud-products-r-a4190
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    ISCCP_D1_1 is the International Satellite Cloud Climatology Project (ISCCP) Stage D1 3-Hourly Cloud Products - Revised Algorithm data set in Hierarchical Data Format. This data set contains 3-hourly, 280 km equal-area grid data from various polar and geostationary satellites. The Gridded Cloud Product contents are spatial averages of DX quantities and statistical summaries, including properties of cloud types. Satellites are merged into a global grid, and atmosphere and surface properties from TOVS are appended. Data collection for this data set is complete. ISCCP, the first project of the World Climate Research Program (WCRP), was established in 1982 (WMO-35 1982, Schiffer and Rossow 1983) to: produce a global, reduced-resolution, calibrated and normalized radiance data set containing basic information on the properties of the atmosphere from which cloud parameters can be derived; stimulate and coordinate basic research on techniques for inferring the physical properties of clouds from the condensed radiance data set and apply the resulting algorithms to derive and validate a global cloud climatology for improving the parameterization of clouds in climate models; and promote research using ISCCP data that contributes to improved understanding of the Earth's radiation budget and hydrological cycle. Starting in 1983, an international group of institutions collected and analyzed satellite radiance measurements from up to five geostationary and two polar-orbiting satellites to infer the global distribution of cloud properties and their diurnal, seasonal, and interannual variations. The primary focus of the first phase of the project (1983-1995) was the elucidation of the role of clouds in the radiation budget (top of the atmosphere and surface). In the second phase of the project (1995 onward), the analysis also concerns improving understanding of clouds in the global hydrological cycle. The ISCCP analysis combined satellite-measured radiances (Stage B3 data; Schiffer and Rossow 1985; Rossow et al. 1987) with the TOVS atmospheric temperature-humidity and ice/snow correlative data sets to obtain information about clouds and the surface. The analysis method first determined the presence or absence of clouds in each individual image pixel and retrieved the radiometric properties of the cloud for each cloudy pixel and of the surface for each clear pixel. The pixel analysis was performed separately for each satellite radiance data set, and the results were reported in the Stage DX data product, which had a nominal resolution of 30 km and 3 hours. The Stage D1 product was produced by summarizing the pixel-level results every 3 hours on an equal-area map with 280 km resolution and merging the results from separate satellites with the atmospheric and ice/snow data sets to produce global coverage at each time. The Stage D2 data product was produced by averaging the Stage D1 data over each month, first at each of the eight three-hour time intervals and then over all time intervals.
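    Purely to illustrate the D1 summarization step described above (averaging ~30 km DX pixel retrievals into 280 km equal-area cells for each 3-hour interval), here is a toy pandas sketch; the table, cell indices, and variables are hypothetical stand-ins, not the ISCCP file format.

```python
import pandas as pd

# Hypothetical DX-level pixel retrievals, already assigned to 280 km equal-area cells.
dx = pd.DataFrame({
    "time_utc": ["1990-07-01 00:00"] * 4 + ["1990-07-01 03:00"] * 2,
    "cell_id":  [101, 101, 102, 102, 101, 102],
    "cloud_fraction": [0.8, 0.6, 0.2, 0.3, 0.9, 0.1],
    "cloud_top_temp_k": [245.0, 250.0, 270.0, 268.0, 240.0, 272.0],
})

# Spatial average of DX quantities per cell and per 3-hour slot, plus a pixel count.
d1 = (dx.groupby(["time_utc", "cell_id"])
        .agg(cloud_fraction=("cloud_fraction", "mean"),
             cloud_top_temp_k=("cloud_top_temp_k", "mean"),
             n_pixels=("cloud_fraction", "size"))
        .reset_index())
print(d1)
```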
