100+ datasets found
  1. Trusted Research Environments: Analysis of Characteristics and Data Availability

    • researchdata.tuwien.ac.at
    bin, csv
    Updated Jun 25, 2024
    Cite: Martin Weise; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
    Available download formats: bin, csv
    Dataset updated: Jun 25, 2024
    Dataset provided by: TU Wien
    Authors: Martin Weise; Andreas Rauber
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical, organizational, and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is publicly available on their architecture, the descriptions of their building blocks, and their slight technical variations. To shed light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to their system descriptions. We further analyze their technical characteristics, especially their commonalities and variations, and provide insight into their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data, of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available the majority of the sensitive data records included in this study.

    Methodology

    We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google, and grey literature, focusing on retrieving the following source material:

    • Peer-reviewed articles where available,
    • TRE websites,
    • TRE metadata catalogs.

    The goal of this literature study is to discover existing TREs and analyze their characteristics and data availability, giving an overview of the infrastructure available for sensitive data research as many European initiatives have been emerging in recent months.

    Technical details

    This dataset consists of five comma-separated values (.csv) files describing our inventory:

    • countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
    • tres.csv: Table of TREs with columns id (number), name (text), countryid (number, referring to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonymized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records], optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
    • access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
    • inclusion.csv: Table of TREs included in the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
    • major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).

    Additionally, a MariaDB (10.5 or higher) schema definition file (.sql) properly models the schema for the database:

    • schema.sql: Schema definition file to create the tables and views used in the analysis.

    The analysis was done in a Jupyter notebook, which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb
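
    As an illustration of how this inventory can be explored (a minimal sketch: pandas is our choice here, not part of the dataset, and only the file and column names above are taken from the description), the tables join on their id columns:

    import pandas as pd

    # Load the inventory tables described above
    countries = pd.read_csv("countries.csv")  # columns: id, name, code
    tres = pd.read_csv("tres.csv")            # columns: id, name, countryid, ...

    # Join each TRE to its country via the countryid foreign key
    inventory = tres.merge(countries, left_on="countryid", right_on="id",
                           suffixes=("_tre", "_country"))
    print(inventory[["name_tre", "name_country"]].head())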

  2. ERA-40 Monthly Means of Isentropic Level Analysis Data

    • data.ucar.edu
    • rda-web-prod.ucar.edu
    • +2 more
    grib
    Updated Oct 9, 2025
    + more versions
    Cite: European Centre for Medium-Range Weather Forecasts (2025). ERA-40 Monthly Means of Isentropic Level Analysis Data [Dataset]. http://doi.org/10.5065/84RB-5G30
    Available download formats: grib
    Dataset updated: Oct 9, 2025
    Dataset provided by: NSF National Center for Atmospheric Research
    Authors: European Centre for Medium-Range Weather Forecasts
    Description

    This dataset contains the monthly means of the ECMWF ERA-40 reanalysis isentropic level analysis data.

  3. Descriptive statistics.

    • plos.figshare.com
    xls
    Updated Oct 31, 2023
    + more versions
    Cite: Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Descriptive statistics. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t003
    Available download formats: xls
    Dataset updated: Oct 31, 2023
    Dataset provided by: PLOS Global Public Health
    Authors: Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83, 3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
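
    A hedged sketch of the modeling pipeline described above (3-fold cross-validation to tune the Random Forest, then a bootstrapped confidence interval for the test RMSE), using scikit-learn; the file name and the vitamin_d column name are hypothetical stand-ins for the study's data:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV, train_test_split

    df = pd.read_csv("sle_patients.csv")  # hypothetical file name
    X = df[["Hb", "CRP", "ESR", "age"]]   # predictors highlighted in the abstract
    y = df["vitamin_d"]                   # hypothetical column name
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # 3-fold cross-validation for Random Forest hyperparameter tuning
    search = GridSearchCV(RandomForestRegressor(random_state=0),
                          {"n_estimators": [100, 300], "max_depth": [3, 5, None]},
                          cv=3, scoring="neg_root_mean_squared_error")
    search.fit(X_tr, y_tr)
    pred = search.predict(X_te)

    # Bootstrapped 95% confidence interval for the test RMSE
    rng = np.random.default_rng(0)
    rmses = []
    for _ in range(1000):
        idx = rng.integers(0, len(y_te), size=len(y_te))
        rmses.append(np.sqrt(mean_squared_error(y_te.iloc[idx], pred[idx])))
    print(np.percentile(rmses, [2.5, 97.5]))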

  4. Model output and data used for analysis

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite: U.S. EPA Office of Research and Development (ORD) (2020). Model output and data used for analysis [Dataset]. https://catalog.data.gov/dataset/model-output-and-data-used-for-analysis
    Dataset updated: Nov 12, 2020
    Dataset provided by: United States Environmental Protection Agency, http://www.epa.gov/
    Description

    The modeled data in these archives are in the NetCDF format (https://www.unidata.ucar.edu/software/netcdf/). NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is also a community standard for sharing scientific data. The Unidata Program Center supports and maintains netCDF programming interfaces for C, C++, Java, and Fortran. Programming interfaces are also available for Python, IDL, MATLAB, R, Ruby, and Perl. Data in netCDF format is:

    • Self-describing. A netCDF file includes information about the data it contains.
    • Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.
    • Scalable. Small subsets of large datasets in various formats may be accessed efficiently through netCDF interfaces, even from remote servers.
    • Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.
    • Sharable. One writer and multiple readers may simultaneously access the same netCDF file.
    • Archivable. Access to all earlier forms of netCDF data will be supported by current and future versions of the software.

    Pub_figures.tar.zip contains the NCL scripts for figures 1-5 and the Chesapeake Bay Airshed shapefile. The directory structure of the archive is ./Pub_figures/Fig#_data, where # is the figure number from 1-5.

    EMISS.data.tar.zip contains two NetCDF files with the emission totals for the 2011ec and 2040ei emission inventories. The file names contain the year of the inventory, and each file header contains a description of the variables and their units.

    EPIC.data.tar.zip contains the monthly mean EPIC data in NetCDF format for ammonium fertilizer application (files with ANH3 in the name) and soil ammonium concentration (files with NH3 in the name) for the historical (Hist directory) and future (RCP-4.5 directory) simulations.

    WRF.data.tar.zip contains mean monthly and seasonal data from the 36 km downscaled WRF simulations in NetCDF format for the historical (Hist directory) and future (RCP-4.5 directory) simulations.

    CMAQ.data.tar.zip contains the mean monthly and seasonal data in NetCDF format from the 36 km CMAQ simulations for the historical (Hist directory), future (RCP-4.5 directory), and future with historical emissions (RCP-4.5-hist-emiss directory) simulations.

    This dataset is associated with the following publication: Campbell, P., J. Bash, C. Nolte, T. Spero, E. Cooter, K. Hinson, and L. Linker. Projections of Atmospheric Nitrogen Deposition to the Chesapeake Bay Watershed. Journal of Geophysical Research - Biogeosciences. American Geophysical Union, Washington, DC, USA, 12(11): 3307-3326, (2019).
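
    Because netCDF files are self-describing, their structure can be inspected before any analysis; a minimal sketch with the netCDF4 Python library (the file name is a hypothetical placeholder, not an actual archive member):

    from netCDF4 import Dataset

    # Open one NetCDF file from the archives and list its self-describing
    # metadata: dimensions, variables, and per-variable units
    with Dataset("example_from_archive.nc") as nc:  # hypothetical file name
        print(list(nc.dimensions))
        for name, var in nc.variables.items():
            print(name, var.dimensions, getattr(var, "units", "n/a"))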

  5. Data from: A Semiotics Analysis Found on Music Video of You Belong with Me by Taylor Swift

    • data.mendeley.com
    Updated Aug 22, 2023
    Cite: PRAGMATICA; Journal of Linguistics and Literature (2023). A Semiotics Analysis Found on Music Video of You Belong with Me by Taylor Swift [Dataset]. http://doi.org/10.17632/fp46m4gvps.1
    Dataset updated: Aug 22, 2023
    Authors: PRAGMATICA; Journal of Linguistics and Literature
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    This research is entitled “A Semiotics Analysis Found on Music Video of You Belong with Me”. The aim of this research was to investigate and analyze the verbal and visual signs, and their meanings, in the music video of “You Belong with Me” by Taylor Swift. The type of this research was qualitative research. In collecting data, the writer used the methods of observation and documentation, classifying the video into pictures in the form of sequences. The results of this study indicate that the semiotic signs contained in this music video take two forms: visual displays, contained in the body language of the video, which tells of a male friend whom Swift likes but who already has a lover; and verbal signs, in the form of written papers used to communicate. Based on the results of the analysis, it can be concluded that there are two classifications, namely verbal signs and visual signs. Eight instances of verbal signs and seven instances of visual signs were found. The concept of the music video of You Belong With Me describes someone who is in love with a person who is already with a lover who does not appreciate them at all. In the instances found, the verbal and visual signs expressed caring, disappointment, jealousy, and the expression of feelings.

  6. Comparison of proteomic sample preparation and data analysis methods by means of human follicular fluids

    • data-staging.niaid.nih.gov
    • ebi.ac.uk
    xml
    Updated Dec 4, 2018
    Cite: Roland Lehmann; Prof. Hortense Slevogt (2018). Comparison of proteomic sample preparation and data analysis methods by means of human follicular fluids [Dataset]. https://data-staging.niaid.nih.gov/resources?id=pxd009061
    Available download formats: xml
    Dataset updated: Dec 4, 2018
    Dataset provided by: Host Septomics Research Centre, Jena University Hospital; University Hospital Jena, Septomics
    Authors: Roland Lehmann; Prof. Hortense Slevogt
    Variables measured: Proteomics
    Description

    In-depth proteome exploration of complex body fluids is a challenging task that requires optimal sample preparation and analysis in order to reach novel and meaningful insights. Analysis of follicular fluid is as difficult as that of blood serum, due to the ubiquitous presence of several highly abundant proteins and a wide range of protein concentrations. Therefore, making this complex body fluid accessible to liquid chromatography-tandem mass spectrometry (LC/MS/MS) analysis is a challenging opportunity to gain insights into physiological status or to identify new diagnostic and prognostic markers, e.g., for the treatment of infertility. We compared different sample preparation methods (FASP, eFASP and in-solution digestion) and three different data analysis software packages (Proteome Discoverer with SEQUEST and Mascot, MaxQuant with Andromeda), in conjunction with semi- and full-tryptic database search approaches, in order to obtain maximum coverage of the proteome.

  7. Netflix 2025: User Data Analysis-Ready

    • kaggle.com
    zip
    Updated Oct 11, 2025
    Cite: Muhammad Irsyad Dimas.A (2025). Netflix 2025: User Data Analysis-Ready [Dataset]. https://www.kaggle.com/datasets/muhammadirsyaddimasa/netflix-merged-cleaned
    Available download formats: zip (4417117 bytes)
    Dataset updated: Oct 11, 2025
    Authors: Muhammad Irsyad Dimas.A
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/

    Description

    General Description

    This dataset is a derived version, produced through preprocessing and data cleaning, of the original dataset:
    Netflix 2025 User Behavior Dataset (210K Records)

    It was developed to support segmentation and clustering analysis of Netflix users based on:
    - Demographic factors (age, gender, location, household size)
    - Viewing preferences (genre, content type, language)
    - Usage behavior (watch duration, progress percentage, primary device, subscription patterns)

    The preprocessed dataset has been cleaned of:
    - Significant missing values
    - Duplicate records based on session_id and user_id
    - Inconsistent date formats and numeric anomalies
    - Non-informative columns and irrelevant noise

    Dataset Structure

    Column | Data Type | Description
    session_id | String | Unique ID of each viewing session
    user_id | String | Unique Netflix user ID
    movie_id | String | Unique ID of the content watched
    watch_date | Date | Date of the viewing activity
    device_type | String | Type of device used
    watch_duration_minutes | Float | Watch duration (minutes)
    progress_percentage | Float | Percentage of the content completed
    action | String | Activity status (started, completed, paused, etc.)
    quality | String | Streaming quality (HD, 4K, etc.)
    location_country | String | Country of the user's location
    is_download | Boolean | Whether the content was downloaded
    user_rating | String | Content rating given by the user
    email | String | User email (anonymized)
    first_name, last_name | String | User name (anonymized)
    age | Float | User age
    gender | String | User gender
    country, state_province, city | String | User's geographic location
    subscription_plan | String | Subscription type (Basic, Standard, Premium)
    subscription_start_date | Date | Subscription start date
    is_active | Boolean | Account activity status
    monthly_spend | Float | User's monthly spend (USD)
    primary_device | String | User's primary device
    household_size | Float | Number of household members
    created_at | DateTime | Time the record was created
    title | String | Title of the content watched
    content_type | String | Content type (Movie, Series, Stand-up, etc.)
    genre_primary, genre_secondary | String | Primary and secondary genres
    release_year | Int | Content release year
    duration_minutes | Float | Total content duration (minutes)
    language | String | Main language of the content
    country_of_origin | String | Country of production
    production_budget, box_office_revenue | Float | Content financial data
    number_of_seasons, number_of_episodes | Int | Series information (if applicable)
    is_netflix_original | Boolean | Whether the content is a Netflix original
    added_to_platform | Date | Date the content was added to the platform
    content_warning | Boolean | Content warning (violence, nudity, etc.)

    Preprocessing Steps

    Step | Description
    Handling missing values | The dataset, merged from three main sources (users, movies, watch history), contained many null/NaN values. Data from a supporting dataset was added to reduce the number of missing values, and imputation was applied where a small proportion of values remained empty.
    Missing-value check | The number of missing values in each column was counted to determine which columns had a significant proportion of missing values.
    Column thresholding | Columns with more than 12% missing values were dropped, as they were considered unsuitable for imputation.
    Age cleaning | User ages were filtered to a plausible range (5 ≤ age < 100); values outside this range were removed as irrelevant for Netflix users.
    Release-year filter | Only content with a release year within Netflix's streaming era was retained (2007 ≤ release year ≤ 2025).
    Imputation | After cleaning, remaining empty values were filled using statistical methods: median or mean for numeric columns, mode for categorical columns.
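
    A minimal pandas sketch of these cleaning steps (the input file name is hypothetical; the thresholds and column names follow the description above):

    import pandas as pd

    df = pd.read_csv("netflix_merged.csv")  # hypothetical file name

    # Drop columns with more than 12% missing values
    df = df.loc[:, df.isna().mean() <= 0.12]

    # Keep plausible user ages (5 <= age < 100)
    df = df[(df["age"] >= 5) & (df["age"] < 100)]

    # Keep content released in Netflix's streaming era (2007-2025)
    df = df[df["release_year"].between(2007, 2025)]

    # Impute remaining gaps: median for numeric, mode for categorical columns
    for col in df.columns[df.isna().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])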

    Intended Uses

    This dataset is prepared for:
    1. Segmentation and clustering analysis of Netflix users using K-Means and DBSCAN.
    2. Exploration of viewing-behavior patterns by age, genre, duration, and device.
    3. Evaluation of clustering effectiveness via metrics such as the silhouette score, the Davies–Bouldin index, and the Calinski–Harabasz score.
    4. Interactive 2D and 3D PCA visualizations to understand the characteristics of each cluster.

  8. Data from: Qualitative analysis of meanings concerning death and dying stemming from the Dutch article series 'the last word' (NRC Handelsblad, 2011-2013)

    • ssh.datastations.nl
    • datasearch.gesis.org
    bin, pdf, xml, zip
    Updated Nov 16, 2016
    Cite: N.P.M. Fortuin; J.B.A.M. Schilderman; H.J.M. Venbrux (2016). Qualitative analysis of meanings concerning death and dying stemming from the Dutch article series 'the last word' (NRC Handelsblad, 2011-2013) [Dataset]. http://doi.org/10.17026/DANS-ZEM-SKCD
    Available download formats: bin (564097), zip (20931), xml (939604), pdf (331037)
    Dataset updated: Nov 16, 2016
    Dataset provided by: DANS Data Station Social Sciences and Humanities
    Authors: N.P.M. Fortuin; J.B.A.M. Schilderman; H.J.M. Venbrux
    License: https://doi.org/10.17026/fp39-0x58

    Description

    This dataset is an ATLAS.ti copy bundle that contains the analysis of 86 articles that appeared between March 2011 and March 2013 in the Dutch quality newspaper NRC Handelsblad in the weekly article series 'the last word' [Dutch: 'het laatste woord'], written by NRC editor Gijsbert van Es. Newspaper texts were retrieved from LexisNexis (http://academic.lexisnexis.nl/). These articles describe the experience of the last phase of life of people who were confronted with approaching death due to cancer or other life-threatening diseases, or due to old age and age-related health losses. The analysis focuses on the meanings concerning death and dying that were expressed by these people in their last phase of life. The dataset was analysed with ATLAS.ti and contains a codebook. The memo manager includes a memo that provides information concerning the analysed data. Culturally embedded meanings concerning death and dying have been interpreted as 'death-related cultural affordances': possibilities for perception and action in the face of death that are offered by the cultural environment. These have been grouped into three different 'cultural niches' (sets of mutually supporting cultural affordances) that are grounded in different mechanisms for determining meaning: a canonical niche (grounding meaning in established (religious) authority and tradition), a utilitarian niche (grounding meaning in rationality and utilitarian function) and an expressive niche (grounding meaning in authentic (and often aesthetic) self-expression). Interviews are in Dutch; codes, analysis and metadata are in English.

  9. MECAnalysisTool: A method to analyze consumer data

    • data.4tu.nl
    txt
    Updated Jul 6, 2022
    + more versions
    Cite: Kirstin Foolen-Torgerson; Fleur Kilwinger (2022). MECAnalysisTool: A method to analyze consumer data [Dataset]. http://doi.org/10.4121/19786900.v1
    Available download formats: txt
    Dataset updated: Jul 6, 2022
    Dataset provided by: 4TU.ResearchData
    Authors: Kirstin Foolen-Torgerson; Fleur Kilwinger
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    This Excel-based tool was developed to analyze means-end chain (MEC) data. The tool consists of a user manual, a data input file to correctly organise your MEC data, a calculator file to analyse your data, and instructional videos. The purpose of this tool is to aggregate laddering data into hierarchical value maps showing means-end chains. The summarized results consist of (1) a summary overview, (2) a matrix, and (3) output for copy/pasting into NodeXL to generate hierarchical value maps (HVMs). To use this tool, you must have collected data via laddering interviews. Ladders are codes linked together consisting of attributes, consequences and values (ACVs).

  10. Collision between biological process and statistical analysis revealed by mean-centering

    • datasetcatalog.nlm.nih.gov
    • data.niaid.nih.gov
    • +1 more
    Updated Sep 8, 2020
    Cite: Dingemanse, Niels; Allegue, Hassen; Westneat, David; Dochtermann, Ned; Class, Barbara; Nakagawa, Shinichi; Schielzeth, Holger; Martin, Julien; Reale, Denis; Garamszegi, Laszlo; Araya-Ajoy, Yimen (2020). Collision between biological process and statistical analysis revealed by mean-centering [Dataset]. http://doi.org/10.5061/dryad.sj3tx9632
    Dataset updated: Sep 8, 2020
    Authors: Dingemanse, Niels; Allegue, Hassen; Westneat, David; Dochtermann, Ned; Class, Barbara; Nakagawa, Shinichi; Schielzeth, Holger; Martin, Julien; Reale, Denis; Garamszegi, Laszlo; Araya-Ajoy, Yimen
    Description

    Animal ecologists often collect hierarchically-structured data and analyze these with linear mixed-effects models. Specific complications arise when the effect sizes of covariates vary on multiple levels (e.g., within vs. among subjects). Mean-centering of covariates within subjects offers a useful approach in such situations, but is not without problems. A statistical model represents a hypothesis about the underlying biological process. Mean-centering within clusters assumes that the lower-level responses (e.g., within subjects) depend on the deviation from the subject mean (relative values) rather than on absolute values of the covariate. This may or may not be biologically realistic. We show that mismatch between the nature of the generating (i.e., biological) process and the form of the statistical analysis produces major conceptual and operational challenges for empiricists. We explored the consequences of mismatches by simulating data with three response-generating processes differing in the source of correlation between a covariate and the response. These data were then analyzed by three different analysis equations. We asked how robustly different analysis equations estimate key parameters of interest and under which circumstances biases arise. Mismatches between generating and analytical equations created several intractable problems for estimating key parameters. The most widely misestimated parameter was the among-subject variance in response. We found that no single analysis equation was robust in estimating all parameters generated by all equations. Importantly, even when response-generating and analysis equations matched mathematically, bias in some parameters arose when sampling across the range of the covariate was limited. Our results have general implications for how we collect and analyze data. They also remind us more generally that conclusions from statistical analysis of data are conditional on a hypothesis, sometimes implicit, for the process(es) that generated the attributes we measure. We discuss strategies for real data analysis in the face of uncertainty about the underlying biological process.
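
    For readers unfamiliar with the technique, within-subject mean-centering splits a covariate into between- and within-subject components; a small illustrative sketch (toy values, pandas used for convenience):

    import pandas as pd

    # Split covariate x into the subject mean (between-subject component)
    # and the deviation from that mean (within-subject component)
    df = pd.DataFrame({"subject": ["a", "a", "a", "b", "b", "b"],
                       "x": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]})
    df["x_between"] = df.groupby("subject")["x"].transform("mean")
    df["x_within"] = df["x"] - df["x_between"]
    print(df)

    A mixed model would then include x_within and x_between as separate predictors, making explicit the assumption that responses depend on relative rather than absolute covariate values.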

  11. Replication data for: Statistical Analysis of List Experiments

    • dataverse.harvard.edu
    Updated Oct 2, 2014
    Cite: Graeme Blair; Kosuke Imai (2014). Replication data for: Statistical Analysis of List Experiments [Dataset]. http://doi.org/10.7910/DVN/7WEJ09
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated: Oct 2, 2014
    Dataset provided by: Harvard Dataverse
    Authors: Graeme Blair; Kosuke Imai
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The validity of empirical research often relies upon the accuracy of self-reported behavior and beliefs. Yet eliciting truthful answers in surveys is challenging, especially when studying sensitive issues such as racial prejudice, corruption, and support for militant groups. List experiments have attracted much attention recently as a potential solution to this measurement problem. Many researchers, however, have used a simple difference-in-means estimator without being able to efficiently examine multivariate relationships between respondents' characteristics and their answers to sensitive items. Moreover, no systematic means exist to investigate the role of underlying assumptions. We fill these gaps by developing a set of new statistical methods for list experiments. We identify the commonly invoked assumptions, propose new multivariate regression estimators, and develop methods to detect and adjust for potential violations of key assumptions. For empirical illustrations, we analyze list experiments concerning racial prejudice. Open-source software is made available to implement the proposed methodology.
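
    For reference, the simple difference-in-means estimator that the paper improves upon can be sketched as follows (the counts are illustrative, not taken from the replication data):

    import numpy as np

    # Control respondents count J baseline items; treatment respondents count
    # the same J items plus the sensitive item. The difference in mean counts
    # estimates the population prevalence of the sensitive item.
    control = np.array([1, 2, 2, 3, 1, 2])    # item counts, control group
    treatment = np.array([2, 3, 2, 3, 2, 3])  # item counts, treatment group
    print(treatment.mean() - control.mean())  # estimated prevalence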

  12. 10 Important Questions on Fundamental Analysis of Stocks – Meaning, Parameters, and Step-by-Step Guide - Data Table

    • smartinvestello.com
    html
    Updated Oct 5, 2025
    Cite: Smart Investello (2025). 10 Important Questions on Fundamental Analysis of Stocks – Meaning, Parameters, and Step-by-Step Guide - Data Table [Dataset]. https://smartinvestello.com/10-questions-on-fundamental-analysis/
    Available download formats: html
    Dataset updated: Oct 5, 2025
    Dataset authored and provided by: Smart Investello
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    Dataset extracted from the post 10 Important Questions on Fundamental Analysis of Stocks – Meaning, Parameters, and Step-by-Step Guide on Smart Investello.

  13. Data from: An Evaluation of the Use of Statistical Procedures in Soil Science

    • scielo.figshare.com
    xls
    Updated Jun 1, 2023
    Cite: Laene de Fátima Tavares; André Mundstock Xavier de Carvalho; Lucas Gonçalves Machado (2023). An Evaluation of the Use of Statistical Procedures in Soil Science [Dataset]. http://doi.org/10.6084/m9.figshare.19944438.v1
    Available download formats: xls
    Dataset updated: Jun 1, 2023
    Dataset provided by: SciELO journals
    Authors: Laene de Fátima Tavares; André Mundstock Xavier de Carvalho; Lucas Gonçalves Machado
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    ABSTRACT Experimental statistical procedures used in almost all scientific papers are fundamental to clearer interpretation of the results of experiments conducted in the agrarian sciences. However, incorrect use of these procedures can lead the researcher to incorrect or incomplete conclusions. Therefore, the aim of this study was to evaluate the characteristics of the experiments and the quality of the use of statistical procedures in soil science, in order to promote better use of statistical procedures. For that purpose, 200 articles published between 2010 and 2014, involving only experimentation and studies by sampling in the soil areas of fertility, chemistry, physics, biology, and use and management, were randomly selected. A questionnaire containing 28 questions was used to assess the characteristics of the experiments, the statistical procedures used, and the quality of selection and use of these procedures. Most of the articles evaluated presented data from studies conducted under field conditions, and 27% of all papers involved studies by sampling. Most studies did not mention testing to verify normality and homoscedasticity, and most used the Tukey test for mean comparisons. Among studies with a factorial structure of treatments, many ignored this structure and compared the data assuming the absence of a factorial structure, or decomposed the interaction without showing or mentioning its significance. Almost none of the papers with split-block factorial designs considered the factorial structure, or they treated it as a split-plot design. Among the articles that performed regression analysis, only a few tested non-polynomial fit models, and none reported verification of the lack of fit in the regressions. The articles evaluated thus reflected poor, and in some cases wrong, generalization in experimental design and in the selection of procedures for statistical analysis.

  14. ECMWF ERA5: ensemble means of surface level analysis parameter data

    • catalogue.ceda.ac.uk
    • data-search.nerc.ac.uk
    Updated Jul 7, 2025
    + more versions
    Cite: European Centre for Medium-Range Weather Forecasts (ECMWF) (2025). ECMWF ERA5: ensemble means of surface level analysis parameter data [Dataset]. https://catalogue.ceda.ac.uk/uuid/d8021685264e43c7a0868396a5f582d0
    Dataset updated: Jul 7, 2025
    Dataset provided by: Centre for Environmental Data Analysis, http://www.ceda.ac.uk/
    Authors: European Centre for Medium-Range Weather Forecasts (ECMWF)
    License: https://artefacts.ceda.ac.uk/licences/specific_licences/ecmwf-era-products.pdf
    Area covered: Earth
    Variables measured: cloud_area_fraction, sea_ice_area_fraction, air_pressure_at_mean_sea_level, lwe_thickness_of_atmosphere_mass_content_of_water_vapor
    Description

    This dataset contains ERA5 surface level analysis parameter data ensemble means (see the linked dataset for spreads). ERA5 is the 5th generation reanalysis project from the European Centre for Medium-Range Weather Forecasts (ECMWF) - see the linked documentation for further details. The ensemble means and spreads are calculated from the ERA5 10-member ensemble, run at a reduced resolution compared with the single high-resolution (hourly output at 31 km grid spacing) 'HRES' realisation, and have been produced to provide an uncertainty estimate for it. This dataset contains a limited selection of the available variables, converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool linked to from this record.

    Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and was thus calculated by dividing by 10 rather than 9 (N−1). See linked datasets for ensemble member and ensemble mean data.
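
    In NumPy terms this is the population standard deviation over the member axis (ddof=0); a small sketch with an illustrative array shape:

    import numpy as np

    # 10 ensemble members on an illustrative lat-lon grid
    members = np.random.default_rng(0).normal(size=(10, 181, 360))
    ens_mean = members.mean(axis=0)
    ens_spread = members.std(axis=0, ddof=0)  # divide by N=10, not N-1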

    The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. It follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects.

    An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These data are subsequently reviewed ahead of being released by ECMWF as quality-assured data within 3 months. CEDA holds a 6-month rolling copy of the latest ERA5t data; see the related datasets linked to from this record. However, for the period 2000-2006 the initial ERA5 release was found to suffer from stratospheric temperature biases, and new runs to address this issue were performed, resulting in the ERA5.1 release (see linked datasets). Note, though, that Simmons et al. 2020 (technical memo 859) report that "ERA5.1 is very close to ERA5 in the lower and middle troposphere", but users of data from this period should read technical memo 859 for further details.

  15. Artificial Symbol Learning With Training - Experiment 2 Data analysis

    • repository.lboro.ac.uk
    zip
    Updated Jan 16, 2025
    + more versions
    Cite: Camilla Gilmore; Matthew Inglis; Hanna Weiers (2025). Artificial Symbol Learning With Training - Experiment 2 Data analysis [Dataset]. http://doi.org/10.17028/rd.lboro.13645850.v1
    Available download formats: zip
    Dataset updated: Jan 16, 2025
    Dataset provided by: Loughborough University
    Authors: Camilla Gilmore; Matthew Inglis; Hanna Weiers
    License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/

    Description

    Zip file containing all data and analysis files for Experiment 2 in: Weiers, H., Inglis, M., & Gilmore, C. (under review). Learning artificial number symbols with ordinal and magnitude information.

    Article abstract: The question of how numerical symbols gain semantic meaning is a key focus of mathematical cognition research. Some have suggested that symbols gain meaning from magnitude information, by being mapped onto the approximate number system, whereas others have suggested symbols gain meaning from their ordinal relations to other symbols. Here we used an artificial symbol learning paradigm to investigate the effects of magnitude and ordinal information on number symbol learning. Across two experiments, we found that after either magnitude or ordinal training, adults successfully learned novel symbols and were able to infer their ordinal and magnitude meanings. Furthermore, adults were able to make relatively accurate judgements about, and map between, the novel symbols and non-symbolic quantities (dot arrays). Although both ordinal and magnitude training were sufficient to attach meaning to the symbols, we found beneficial effects on the ability to learn and make numerical judgements about novel symbols when combining small amounts of magnitude information for a symbol subset with ordinal information about the whole set. These results suggest that a combination of magnitude and ordinal information is a plausible account of the symbol learning process. © The Authors

  16. 🌆 City Lifestyle Segmentation Dataset

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite: UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset
    Available download formats: zip (11274 bytes)
    Dataset updated: Nov 15, 2025
    Authors: UmutUygurr
    License: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    🌆 About This Dataset

    This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

    🎯 Perfect For:

    • 📊 K-Means, DBSCAN, Agglomerative Clustering
    • 🔬 PCA & t-SNE Dimensionality Reduction
    • 🗺️ Geospatial Visualization (Plotly, Folium)
    • 📈 Correlation Analysis & Feature Engineering
    • 🎓 Educational Projects (Beginner to Intermediate)

    📦 What's Inside?

    Feature | Description | Range
    10 Features | Economic, environmental & social indicators | Realistically scaled
    300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions
    Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready
    No Missing Values | Clean, preprocessed data | Ready for analysis
    4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated

    🔥 Key Features

    • Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
    • Regional Diversity: Each region has distinct economic and environmental characteristics
    • Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
    • Beginner-Friendly: No data cleaning required, includes example code
    • Documented: Comprehensive README with methodology and use cases

    🚀 Quick Start Example

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Load and prepare: drop the two categorical columns before scaling
    df = pd.read_csv('city_lifestyle_dataset.csv')
    X = df.drop(['city_name', 'country'], axis=1)
    X_scaled = StandardScaler().fit_transform(X)

    # Cluster into the 4-5 natural groups described above
    kmeans = KMeans(n_clusters=5, random_state=42)
    df['cluster'] = kmeans.fit_predict(X_scaled)

    # Analyze per-cluster feature means (numeric_only skips the string columns)
    print(df.groupby('cluster').mean(numeric_only=True))
    

    🎓 Learning Outcomes

    After working with this dataset, you will be able to:
    1. Apply K-Means, DBSCAN, and Hierarchical Clustering
    2. Use PCA for dimensionality reduction and visualization (see the sketch below)
    3. Interpret correlation matrices and feature relationships
    4. Create geographic visualizations with cluster assignments
    5. Profile and name discovered clusters based on characteristics
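
    Continuing the quick-start example above (X_scaled and kmeans are the objects created there), a short sketch covering outcomes 1-3:

    from sklearn.decomposition import PCA
    from sklearn.metrics import silhouette_score

    # Project the scaled features onto 2 components for plotting
    coords = PCA(n_components=2).fit_transform(X_scaled)

    # Quantify cluster separation (higher silhouette = better-separated clusters)
    print("silhouette:", silhouette_score(X_scaled, kmeans.labels_))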

    📚 Ideal For These Projects

    • 🏆 Kaggle Competitions: Practice clustering techniques
    • 📝 Academic Projects: Urban planning, sociology, environmental science
    • 💼 Portfolio Work: Showcase ML skills to employers
    • 🎓 Learning: Hands-on practice with unsupervised learning
    • 🔬 Research: Urban lifestyle segmentation studies

    🌍 Expected Clusters

    Cluster | Characteristics | Example Cities
    Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore
    Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities
    Developing Centers | Mid income, high density, poor air | Emerging markets
    Low-Income Suburban | Low infrastructure, income | Rural areas
    Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs

    🛠️ Technical Details

    • Format: CSV (UTF-8)
    • Size: ~300 rows × 10 columns
    • Missing Values: 0%
    • Data Types: 2 categorical, 8 numerical
    • Target Variable: None (unsupervised)
    • Correlation Strength: Pre-validated (r: 0.4 to 0.8)

    📖 What Makes This Dataset Special?

    Unlike random synthetic data, this dataset was carefully engineered with:
    • ✨ Realistic correlation structures based on urban research
    • 🌍 Regional characteristics matching real-world patterns
    • 🎯 Optimal cluster separability (validated via silhouette scores)
    • 📚 Comprehensive documentation and starter code

    🏅 Use This Dataset If You Want To:

    ✓ Learn clustering without data cleaning hassles
    ✓ Practice PCA and dimensionality reduction
    ✓ Create beautiful geographic visualizations
    ✓ Understand feature correlation in real-world contexts
    ✓ Build a portfolio project with clear business insights

    📊 Acknowledgments

    This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

    Happy Clustering! 🎉

  17. Data from: A generic gust definition and detection method based on wavelet-analysis

    • data.uni-hannover.de
    • search.datacite.org
    zip
    Updated Jan 20, 2022
    + more versions
    Cite: AG PALM (2022). A generic gust definition and detection method based on wavelet-analysis [Dataset]. https://data.uni-hannover.de/dataset/a-generic-gust-definition-and-detection-method-based-on-wavelet-analysis
    Available download formats: zip
    Dataset updated: Jan 20, 2022
    Dataset authored and provided by: AG PALM
    License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/

    Description

    This dataset is associated with the paper Knoop et al. (2019) titled "A generic gust definition and detection method based on wavelet-analysis" published in "Advances in Science and Research (ASR)" within the Special Issue: 18th EMS Annual Meeting: European Conference for Applied Meteorology and Climatology 2018. It contains the data and analysis software required to recreate all figures in the publication.

  18. Data from: Slug tests data, analysis, and results at wells near the North Shore of Lake Superior, Minnesota

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 12, 2025
    + more versions
    Cite: U.S. Geological Survey (2025). Slug tests data, analysis, and results at wells near the North Shore of Lake Superior, Minnesota [Dataset]. https://catalog.data.gov/dataset/slug-tests-data-analysis-and-results-at-wells-near-the-north-shore-of-lake-superior-minnes
    Dataset updated: Nov 12, 2025
    Dataset provided by: U.S. Geological Survey
    Area covered: Lake Superior, North Shore, Minnesota
    Description

    This dataset contains the original data, analysis data, and a results synopsis of 12 slug tests performed in 7 wells completed in unconfined fractured bedrock near the North Shore of Lake Superior in Minnesota. Aquifers tested include extrusive and intrusive volcanic rocks and slate. Estimated hydraulic conductivities range from 2×10⁻⁶ to 10.2 feet/day. Mean and median hydraulic conductivity are 3.7 and 1.6 feet/day, respectively. The highest and lowest hydraulic conductivities were in slate and fractured lava, respectively. Compressed-air and traditional displacement-tube methods were employed. Water levels were measured with barometrically compensated (11 tests) and absolute pressure transducers (1 test) and recorded with data loggers. Test data were analyzed with AQTESOLV software using the unconfined KGS (Hyder and others, 1994; 9 tests) and Bouwer-Rice (1976; 3 tests) models. Data files include the original recorded data, data files transformed into the form required by AQTESOLV, AQTESOLV analysis and results files, and a compilation of well information and slug-test results. All files are formatted as tab-delimited ASCII except for the AQTESOLV analysis and results files, which are proprietary aqt and PDF files, respectively. For convenience, a Microsoft Excel file is included that contains a synopsis of the well data and slug-test results; the original recorded, transformed, and plotted slug-test data; data formats; constants and variables used in the data analysis; and notes about each test.

  19. Data from: Meanings of work for manicurists and hairdressers: employees and pejotizados

    • scielo.figshare.com
    xls
    Updated Jun 2, 2023
    Cite: Mariana Machado Souza; Livia de Oliveira Borges (2023). Meanings of work for manicurists and hairdressers: employees and pejotizados [Dataset]. http://doi.org/10.6084/m9.figshare.19923743.v1
    Available download formats: xls
    Dataset updated: Jun 2, 2023
    Dataset provided by: SciELO, http://www.scielo.org/
    Authors: Mariana Machado Souza; Livia de Oliveira Borges
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

    Description

    Abstract The research aimed to identify the differentiation of meanings of work among beauty salon workers, considering their work contracts and the functions performed (hairdressers and manicurists), in a context of pejotização (hiring workers as individual legal entities rather than employees) and an internal hierarchy of functions. We applied questionnaires to 171 manicurists and hairdressers with the following types of employment relationship: employee, informal, MEI pejotizado, and MEI não pejotizado. The results indicated that employees perceive work more intensely as a responsibility and as a way of being socially included, and perceive greater proportionality in social and financial retribution. They also indicated that manicurists experience the characteristics of brutalization, discrimination, and demand more intensely.

  20. Data from: Digital analysis of cDNA abundance; expression profiling by means of restriction fragment fingerprinting

    • catalog.data.gov
    • data.virginia.gov
    • +1 more
    Updated Sep 6, 2025
    + more versions
    Cite: National Institutes of Health (2025). Digital analysis of cDNA abundance; expression profiling by means of restriction fragment fingerprinting [Dataset]. https://catalog.data.gov/dataset/digital-analysis-of-cdna-abundance-expression-profiling-by-means-of-restriction-fragment-f
    Dataset updated: Sep 6, 2025
    Dataset provided by: National Institutes of Health
    Description

    Background: Gene expression profiling among different tissues is of paramount interest in various areas of biomedical research. We have developed a novel method (DADA, Digital Analysis of cDNA Abundance) that calculates the relative abundance of genes in cDNA libraries.

    Results: DADA is based upon multiple restriction fragment length analysis of pools of clones from cDNA libraries and the identification of gene-specific restriction fingerprints in the resulting complex fragment mixtures. A specific cDNA cloning vector had to be constructed that controlled for missing or incomplete cDNA inserts, which would generate misleading fingerprints in standard cloning vectors. Double-stranded cDNA was synthesized using an anchored oligo dT primer and uni-directionally inserted into the DADA vector, and cDNA libraries were constructed in E. coli. The cDNA fingerprints were generated in a PCR-free procedure that allows for parallel plasmid preparation, labeling, restriction digest, and fragment separation of pools of 96 colonies each. This multiplexing significantly enhanced the throughput in comparison to sequence-based methods (e.g., the EST approach). The data of the fragment mixtures were integrated into a relational database system and queried with fingerprints experimentally produced by analyzing single colonies. Due to the limited predictability of the position of DNA fragments of a given size on the polyacrylamide gels, fingerprints derived solely from cDNA sequences were not accurate enough to be used for the analysis. We applied DADA to the analysis of gene expression profiles in a model for impaired wound healing (treatment of mice with dexamethasone).

    Conclusions: The method proved to be capable of identifying pharmacologically relevant target genes that had not been identified by other standard methods routinely used to find differentially expressed genes. Due to the above-mentioned limited predictability of the fingerprints, the method was as yet tested only with a limited number of experimentally determined fingerprints and was able to detect differences in gene expression of transcripts representing 0.05% of the total mRNA population (e.g., medium-abundance gene transcripts).
