100+ datasets found

t
Trusted Research Environments: Analysis of Characteristics and Data...
researchdata.tuwien.ac.at
bin, csv
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.48436/cv20m-sg117
Dataset updated
Jun 25, 2024
Dataset provided by
TU Wien
Authors
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical organizational and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is available publicly on the architecture and descriptions of their building blocks & their slight technical variations. To shine light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system description. We further analyze their technical characteristics, especially in their commonalities & variations and provide insight on their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available a majority of available sensitive data records included in this study.
Methodology
We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:
Peer-reviewed articles where available,
TRE websites,
TRE metadata catalogs.
The goal for this literature study is to discover existing TREs, analyze their characteristics and data availability to give an overview on available infrastructure for sensitive data research as many European initiatives have been emerging in recent months.
Technical details
This dataset consists of five comma-separated values (.csv) files describing our inventory:
countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
tres.csv: Table of TREs with columns id (number), name (text), countryid (number, refering to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonomized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records]), optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
inclusion.csv: Table of included TREs into the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).
Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed, properly modelling the schema for databases:
schema.sql: Schema definition file to create the tables and views used in the analysis.
The analysis was done through Jupyter Notebook which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb
f
Descriptive statistics.
plos.figshare.com
xls
Updated Oct 31, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Descriptive statistics. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pgph.0002475.t003
Dataset updated
Oct 31, 2023
Dataset provided by
PLOS Global Public Health
Authors
Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
u
ERA-40 Monthly Means of Isentropic Level Analysis Data
data.ucar.edu
rda-web-prod.ucar.edu
+2more
grib
Updated Oct 9, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
European Centre for Medium-Range Weather Forecasts (2025). ERA-40 Monthly Means of Isentropic Level Analysis Data [Dataset]. http://doi.org/10.5065/84RB-5G30
Explore at:
gribAvailable download formats
Unique identifier
https://doi.org/10.5065/84RB-5G30
Dataset updated
Oct 9, 2025
Dataset provided by
NSF National Center for Atmospheric Research
Authors
European Centre for Medium-Range Weather Forecasts
Description
The monthly means of ECMWF ERA-40 reanalysis isentropic level analysis data are in this dataset.
Model output and data used for analysis
catalog.data.gov
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Model output and data used for analysis [Dataset]. https://catalog.data.gov/dataset/model-output-and-data-used-for-analysis
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
The modeled data in these archives are in the NetCDF format (https://www.unidata.ucar.edu/software/netcdf/). NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is also a community standard for sharing scientific data. The Unidata Program Center supports and maintains netCDF programming interfaces for C, C++, Java, and Fortran. Programming interfaces are also available for Python, IDL, MATLAB, R, Ruby, and Perl. Data in netCDF format is: • Self-Describing. A netCDF file includes information about the data it contains. • Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers. • Scalable. Small subsets of large datasets in various formats may be accessed efficiently through netCDF interfaces, even from remote servers. • Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure. • Sharable. One writer and multiple readers may simultaneously access the same netCDF file. • Archivable. Access to all earlier forms of netCDF data will be supported by current and future versions of the software. Pub_figures.tar.zip Contains the NCL scripts for figures 1-5 and Chesapeake Bay Airshed shapefile. The directory structure of the archive is ./Pub_figures/Fig#_data. Where # is the figure number from 1-5. EMISS.data.tar.zip This archive contains two NetCDF files that contain the emission totals for 2011ec and 2040ei emission inventories. The name of the files contain the year of the inventory and the file header contains a description of each variable and the variable units. EPIC.data.tar.zip contains the monthly mean EPIC data in NetCDF format for ammonium fertilizer application (files with ANH3 in the name) and soil ammonium concentration (files with NH3 in the name) for historical (Hist directory) and future (RCP-4.5 directory) simulations. WRF.data.tar.zip contains mean monthly and seasonal data from the 36km downscaled WRF simulations in the NetCDF format for the historical (Hist directory) and future (RCP-4.5 directory) simulations. CMAQ.data.tar.zip contains the mean monthly and seasonal data in NetCDF format from the 36km CMAQ simulations for the historical (Hist directory), future (RCP-4.5 directory) and future with historical emissions (RCP-4.5-hist-emiss directory). This dataset is associated with the following publication: Campbell, P., J. Bash, C. Nolte, T. Spero, E. Cooter, K. Hinson, and L. Linker. Projections of Atmospheric Nitrogen Deposition to the Chesapeake Bay Watershed. Journal of Geophysical Research - Biogeosciences. American Geophysical Union, Washington, DC, USA, 12(11): 3307-3326, (2019).
m
Data from: A Semiotics Analysis Found on Music Video of You Belong with Me...
data.mendeley.com
Updated Aug 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PRAGMATICA; Journal of Linguistics and Literature (2023). A Semiotics Analysis Found on Music Video of You Belong with Me by Taylor Swift [Dataset]. http://doi.org/10.17632/fp46m4gvps.1
Explore at:
Unique identifier
https://doi.org/10.17632/fp46m4gvps.1
Dataset updated
Aug 22, 2023
Authors
PRAGMATICA; Journal of Linguistics and Literature
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This research entitles “A Semiotics Analysis Found on Music Vidio of You Belong with Me”.The aim of this research was to investigate and analyze the verbal and visual signs and the meaning itself in the music video of “You Belong with Me” by Taylor Swift. The type of this research was qualitative research. In collecting data, the writer used the method of observation and documentation by classifying videos into pictures in the form of sequences.The results of this study indicate that the semiotic signs contained in this music video are in the form of visual displays contained in body language in the music video which tells about a male friend that Swift likes who actually has a lover, and verbal signs contained in the music video is a paper that contains writing that is used to communicate. Based on the result of the analysis,it can be concluded as there are two classifications,namely: verbal sign and visual sign. In verbal sign, it was found eight data. In visual sign, it was found seven data. The concept of music video of You Belong With Me describe someone who is in love with someone where that person has been with a lover who doesn't appreciate it at all. In the data found, verbal and visual sign explained about caring, disappointment, jealousy, and express feelings.

Netflix 2025: User Data Analysis-Ready

kaggle.com

zip

Updated Oct 11, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Muhammad Irsyad Dimas.A (2025). Netflix 2025: User Data Analysis-Ready [Dataset]. https://www.kaggle.com/datasets/muhammadirsyaddimasa/netflix-merged-cleaned

Explore at:

zip(4417117 bytes)Available download formats

Dataset updated

Oct 11, 2025

Authors

Muhammad Irsyad Dimas.A

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

Deskripsi Umum

Dataset ini merupakan versi turunan yang telah melalui proses preprocessing dan data cleaning dari dataset asli:
Netflix 2025 User Behavior Dataset (210K Records)

Dataset ini dikembangkan untuk mendukung analisis segmentasi dan clustering pengguna Netflix berdasarkan:
- Faktor demografis (usia, jenis kelamin, lokasi, ukuran rumah tangga)
- Preferensi tontonan (genre, jenis konten, bahasa)
- Perilaku penggunaan (durasi menonton, persentase progres, perangkat utama, pola langganan)

Dataset hasil preprocessing telah dibersihkan dari:
- Nilai hilang (missing values) signifikan
- Duplikasi data berdasarkan session_id dan user_id
- Inkonsistensi format tanggal serta anomali numerik
- Kolom non-informatif dan noise yang tidak relevan

Struktur Dataset

Kolom	Tipe Data	Deskripsi
`session_id`	String	ID unik setiap sesi menonton
`user_id`	String	ID unik pengguna Netflix
`movie_id`	String	ID unik konten yang ditonton
`watch_date`	Date	Tanggal aktivitas menonton
`device_type`	String	Jenis perangkat yang digunakan
`watch_duration_minutes`	Float	Durasi menonton (menit)
`progress_percentage`	Float	Persentase tontonan selesai
`action`	String	Status aktivitas (started, completed, paused, dll)
`quality`	String	Kualitas streaming (HD, 4K, dll)
`location_country`	String	Negara lokasi pengguna
`is_download`	Boolean	Status apakah konten diunduh
`user_rating`	String	Rating konten oleh pengguna
`email`	String	Email pengguna (disamarkan)
`first_name`, `last_name`	String	Nama pengguna (disamarkan)
`age`	Float	Usia pengguna
`gender`	String	Jenis kelamin pengguna
`country`, `state_province`, `city`	String	Lokasi geografis pengguna
`subscription_plan`	String	Jenis langganan (Basic, Standard, Premium)
`subscription_start_date`	Date	Tanggal mulai langganan
`is_active`	Boolean	Status keaktifan akun
`monthly_spend`	Float	Pengeluaran bulanan pengguna (USD)
`primary_device`	String	Perangkat utama pengguna
`household_size`	Float	Jumlah anggota rumah tangga
`created_at`	DateTime	Waktu pencatatan data
`title`	String	Judul konten yang ditonton
`content_type`	String	Jenis konten (Movie, Series, Stand-up, dll)
`genre_primary`, `genre_secondary`	String	Genre utama dan sekunder
`release_year`	Int	Tahun rilis konten
`duration_minutes`	Float	Durasi total konten (menit)
`language`	String	Bahasa utama konten
`country_of_origin`	String	Negara asal produksi
`production_budget`, `box_office_revenue`	Float	Data finansial konten
`number_of_seasons`, `number_of_episodes`	Int	Informasi serial (jika ada)
`is_netflix_original`	Boolean	Apakah konten merupakan orisinal Netflix
`added_to_platform`	Date	Tanggal konten ditambahkan ke platform
`content_warning`	Boolean	Peringatan konten (violence, nudity, dll)

Proses Preprocessing

⚙️ Proses Preprocessing

Tahap	Deskripsi Proses
Handling Missing Values	Dataset hasil penggabungan tiga sumber utama (users, movies, watch history) mengandung banyak nilai `null`/`NaN`. Untuk mengatasinya, dilakukan penambahan data dari dataset pendukung agar jumlah nilai hilang berkurang, kemudian dilakukan imputasi bila masih terdapat nilai kosong dalam proporsi kecil.
Cek Missing Value	Menghitung jumlah nilai hilang di tiap kolom untuk menentukan proporsi missing values yang signifikan.
Thresholding Kolom	Kolom dengan lebih dari 12% missing values dihapus karena dianggap tidak layak diimputasi.
Pembersihan Data Umur	Nilai usia pengguna difilter agar berada pada rentang logis (5 \leq \text{Usia} < 100). Nilai di luar rentang ini dihapus karena tidak relevan untuk pengguna Netflix.
Filter Tahun Film	Hanya konten dengan tahun rilis dalam rentang operasi Netflix yang dipertahankan: (2007 \leq \text{Tahun Rilis} \leq 2025).
Imputasi Nilai Hilang	Setelah pembersihan, nilai kosong diisi menggunakan metode statistik: • Numerik: median atau mean • Kategorikal: modus.

Tujuan Penggunaan

Dataset ini disiapkan untuk:
1. Analisis segmentasi dan clustering pengguna Netflix menggunakan K-Means dan DBSCAN.
2. Eksplorasi pola perilaku menonton berdasarkan usia, genre, durasi, dan perangkat.
3. Evaluasi efektivitas algoritma clustering melalui metrik seperti silhouette score, Davies–Bouldin index, dan Calinski–Harabasz score.
4. Visualisasi interaktif PCA 2D & 3D untuk memahami karakteristik setiap klaster.

Data from: Basic statistical considerations for physiology: The journal...
tandf.figshare.com
txt
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aaron R. Caldwell; Samuel N. Cheuvront (2023). Basic statistical considerations for physiology: The journal Temperature toolbox [Dataset]. http://doi.org/10.6084/m9.figshare.8320151.v2
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.8320151.v2
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francishttps://taylorandfrancis.com/
Authors
Aaron R. Caldwell; Samuel N. Cheuvront
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The average environmental and occupational physiologist may find statistics are difficult to interpret and use since their formal training in statistics is limited. Unfortunately, poor statistical practices can generate erroneous or at least misleading results and distorts the evidence in the scientific literature. These problems are exacerbated when statistics are used as thoughtless ritual that is performed after the data are collected. The situation is worsened when statistics are then treated as strict judgements about the data (i.e., significant versus non-significant) without a thought given to how these statistics were calculated or their practical meaning. We propose that researchers should consider statistics at every step of the research process whether that be the designing of experiments, collecting data, analysing the data or disseminating the results. When statistics are considered as an integral part of the research process, from start to finish, several problematic practices can be mitigated. Further, proper practices in disseminating the results of a study can greatly improve the quality of the literature. Within this review, we have included a number of reminders and statistical questions researchers should answer throughout the scientific process. Rather than treat statistics as a strict rule following procedure we hope that readers will use this review to stimulate a discussion around their current practices and attempt to improve them. The code to reproduce all analyses and figures within the manuscript can be found at https://doi.org/10.17605/OSF.IO/BQGDH.
D
Data from: Qualitative analysis of meanings concerning death and dying...
ssh.datastations.nl
datasearch.gesis.org
bin, pdf, xml, zip
Updated Nov 16, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
N.P.M. Fortuin; J.B.A.M. Schilderman; H.J.M. Venbrux; N.P.M. Fortuin; J.B.A.M. Schilderman; H.J.M. Venbrux (2016). Qualitative analysis of meanings concerning death and dying stemming from the Dutch article series 'the last word' (NRC Handelsblad, 2011-2013) [Dataset]. http://doi.org/10.17026/DANS-ZEM-SKCD
Explore at:
bin(564097), zip(20931), xml(939604), pdf(331037)Available download formats
Unique identifier
https://doi.org/10.17026/DANS-ZEM-SKCD
Dataset updated
Nov 16, 2016
Dataset provided by
DANS Data Station Social Sciences and Humanities
Authors
N.P.M. Fortuin; J.B.A.M. Schilderman; H.J.M. Venbrux; N.P.M. Fortuin; J.B.A.M. Schilderman; H.J.M. Venbrux
License
https://doi.org/10.17026/fp39-0x58https://doi.org/10.17026/fp39-0x58
Description
This dataset is an ATLAS.ti copy bundle that contains the analysis of 86 articles that appeared between March 2011 and March 2013 in the Dutch quality newspaper NRC Handelsblad in the weekly article series 'the last word' [Dutch: 'het laatste woord'] that were written by NRC editor Gijsbert van Es. Newspaper texts have been retrieved from LexisNexis (http://academic.lexisnexis.nl/). These articles describe the experience of the last phase of life of people who were confronted with approaching death due to cancer or other life-threatening diseases, or due to old age and age-related health losses. The analysis focuses on the meanings concerning death and dying that were expressed by these people in their last phase of life. The data-set was analysed with ATLAS.ti and contains a codebook. In the memo manager a memo is included that provides information concerning the analysed data. Culturally embedded meanings concerning death and dying have been interpreted as 'death-related cultural affordances': possibilities for perception and action in the face of death that are offered by the cultural environment. These have been grouped into three different ‘cultural niches’ (sets of mutually supporting cultural affordances) that are grounded in different mechanisms for determining meaning: a canonical niche (grounding meaning in established (religious) authority and tradition), a utilitarian niche (grounding meaning in rationality and utilitarian function) and an expressive niche (grounding meaning in authentic (and often aesthetic) self-expression. Interviews are in Dutch; Codes, analysis and metadata are in English.
4
MECAnalysisTool: A method to analyze consumer data
data.4tu.nl
txt
Updated Jul 6, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kirstin Foolen-Torgerson; Fleur Kilwinger (2022). MECAnalysisTool: A method to analyze consumer data [Dataset]. http://doi.org/10.4121/19786900.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.4121/19786900.v1
Dataset updated
Jul 6, 2022
Dataset provided by
4TU.ResearchData
Authors
Kirstin Foolen-Torgerson; Fleur Kilwinger
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Excel based tool was developed to analyze means-end chain data. The tool consists of a user manual, a data input file to correctly organise your MEC data, a calculator file to analyse your data, and instructional videos. The purpose of this tool is to aggregate laddering data into hierarchical value maps showing means-end chains. The summarized results consist of (1) a summary overview, (2) a matrix, and (3) output for copy/pasting into NodeXL to generate hierarchal value maps (HVMs). To use this tool, you must have collected data via laddering interviews. Ladders are codes linked together consisting of attributes, consequences and values (ACVs).
s
10 Important Questions on Fundamental Analysis of Stocks – Meaning,...
smartinvestello.com
html
Updated Oct 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Smart Investello (2025). 10 Important Questions on Fundamental Analysis of Stocks – Meaning, Parameters, and Step-by-Step Guide - Data Table [Dataset]. https://smartinvestello.com/10-questions-on-fundamental-analysis/
Explore at:
htmlAvailable download formats
Dataset updated
Oct 5, 2025
Dataset authored and provided by
Smart Investello
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset extracted from the post 10 Important Questions on Fundamental Analysis of Stocks – Meaning, Parameters, and Step-by-Step Guide on Smart Investello.
D
Collision between biological process and statistical analysis revealed by...
datasetcatalog.nlm.nih.gov
data.niaid.nih.gov
+1more
Updated Sep 8, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dingemanse, Niels; Allegue, Hassen; Westneat, David; Dochtermann, Ned; Class, Barbara; Nakagawa, Shinichi; Schielzeth, Holger; Martin, Julien; Reale, Denis; Garamszegi, Laszlo; Araya-Ajoy, Yimen (2020). Collision between biological process and statistical analysis revealed by mean-centering [Dataset]. http://doi.org/10.5061/dryad.sj3tx9632
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.sj3tx9632
Dataset updated
Sep 8, 2020
Authors
Dingemanse, Niels; Allegue, Hassen; Westneat, David; Dochtermann, Ned; Class, Barbara; Nakagawa, Shinichi; Schielzeth, Holger; Martin, Julien; Reale, Denis; Garamszegi, Laszlo; Araya-Ajoy, Yimen
Description
Animal ecologists often collect hierarchically-structured data and analyze these with linear mixed-effects models. Specific complications arise when the effect sizes of covariates vary on multiple levels (e.g., within vs among subjects). Mean-centering of covariates within subjects offers a useful approach in such situations, but is not without problems. A statistical model represents a hypothesis about the underlying biological process. Mean-centering within clusters assumes that the lower level responses (e.g. within subjects) depend on the deviation from the subject mean (relative) rather than on absolute values of the covariate. This may or may not be biologically realistic. We show that mismatch between the nature of the generating (i.e., biological) process and the form of the statistical analysis produce major conceptual and operational challenges for empiricists. We explored the consequences of mismatches by simulating data with three response-generating processes differing in the source of correlation between a covariate and the response. These data were then analyzed by three different analysis equations. We asked how robustly different analysis equations estimate key parameters of interest and under which circumstances biases arise. Mismatches between generating and analytical equations created several intractable problems for estimating key parameters. The most widely misestimated parameter was the among-subject variance in response. We found that no single analysis equation was robust in estimating all parameters generated by all equations. Importantly, even when response-generating and analysis equations matched mathematically, bias in some parameters arose when sampling across the range of the covariate was limited. Our results have general implications for how we collect and analyze data. They also remind us more generally that conclusions from statistical analysis of data are conditional on a hypothesis, sometimes implicit, for the process(es) that generated the attributes we measure. We discuss strategies for real data analysis in face of uncertainty about the underlying biological process.
Z
Assessing the impact of hints in learning formal specification: Research...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jan 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara (2024). Assessing the impact of hints in learning formal specification: Research artifact [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10450608
Explore at:
Dataset updated
Jan 29, 2024
Dataset provided by
Centro de Computação Gráfica
INESC TEC
Authors
Macedo, Nuno; Cunha, Alcino; Campos, José Creissac; Sousa, Emanuel; Margolis, Iara
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study to investigate the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention, but also in the emotional response of the students. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires passed to assess the emotional response and user experience), as well as the collected data and data analysis scripts used for the discussion in the paper.

Dataset

The artifact contains the resources described below.

Experiment resources

The resources needed for replicating the experiment, namely in directory experiment:

alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.

alloy_sheet_en.pdf: a version the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment translated into English.

docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.

api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.

Experiment data

The task database used in our application of the experiment, namely in directory data/experiment:

Model.json, Instance.json, and Link.json: JSON files with to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.

identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.

Collected data

Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared the shape of JSON and CSV files with a header row, namely in directory data/results:

data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).

data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:

participant identification: participant's unique identifier (ID);

socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclosure, and other, respectively), and average academic grade (GRADE, from 0 to 20, NA denotes preference to not disclosure).

data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:

participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);

detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.

data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:

participant identification: participant's unique identifier (ID);

user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).

participants.txt: the list of participant identifiers that have registered for the experiment.

Analysis scripts

The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:

analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.

requirements.r: An R script to install the required libraries for the analysis script.

normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.

normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.

Dockerfile: Docker script to automate the analysis script from the collected data.

Setup

To replicate the experiment and the analysis of the results, only Docker is required.

If you wish to manually replicate the experiment and collect your own data, you'll need to install:

A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.

If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:

Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.

R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.

Usage

Experiment replication

This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.

To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch, wait for the "Started your app" message to show.

cd experimentdocker-compose up

This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.

In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, each for each hint group:

Group N (no hints): http://localhost:3000/0CAN

Group L (error locations): http://localhost:3000/CA0L

Group E (counter-example): http://localhost:3000/350E

Group D (error description): http://localhost:3000/27AD

In the 2nd session, likewise the 1st session, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5mins). The permalink is constructed by prepending the participant's identifier with P-. So participant 0CAN would just access http://localhost:3000/P-0CAN. In the 2nd sessions all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.

Before the 1st session the participants should answer the socio-demographic questionnaire, that should ask the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.

Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the diferent emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment with each of the depicted 14 emotions, expressed in a 5-point Likert scale.

After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user unique identifier and answers for the standard 4 questions in a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric ranging from 0 to 100 score from the answers, please see the original paper:

Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.

Analysis of other applications of the experiment

This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.

The analysis script expects data in 4 CSV files,
Comparison of proteomic sample preparation and data analysis methods by...
data-staging.niaid.nih.gov
ebi.ac.uk
xml
Updated Dec 4, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Roland Lehmann; Prof. Hortense Slevogt (2018). Comparison of proteomic sample preparation and data analysis methods by means of human follicular fluids [Dataset]. https://data-staging.niaid.nih.gov/resources?id=pxd009061
Explore at:
xmlAvailable download formats
Dataset updated
Dec 4, 2018
Dataset provided by
Host Septomics Research Centre Jena University Hospital
University Hospital Jena Septomics
Authors
Roland Lehmann; Prof. Hortense Slevogt
Variables measured
Proteomics
Description
In-depth proteome exploration of complex body fluids is a challenging task that requires optimal sample preparation and analysis in order to reach novel and meaningful insights. Analysis of follicular fluids is similarly difficult as that of blood serum due to the ubiquitous presence of several highly abundant proteins and a wide range of protein concentrations. Therefore, the accessibility of this complex body fluid for liquid chromatography-tandem mass spectrometry (LC/MS/MS) analysis is a challenging opportunity to gain insights into the physiological status or to identify new diagnostic and prognostic markers for e.g. the treatment of infertility. We compared different sample preparation methods (FASP, eFASP and in-solution digestion) and three different data analysis software packages (Proteome Discoverer with SEQUEST and Mascot, Maxquant with Andromeda) in conjunction with semi- and full-tryptic databank search approaches in order to obtain a maximum coverage of the proteome.
H
Replication data for: Statistical Analysis of List Experiments
dataverse.harvard.edu
Updated Oct 2, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Graeme Blair; Kosuke Imai (2014). Replication data for: Statistical Analysis of List Experiments [Dataset]. http://doi.org/10.7910/DVN/7WEJ09
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/7WEJ09
Dataset updated
Oct 2, 2014
Dataset provided by
Harvard Dataverse
Authors
Graeme Blair; Kosuke Imai
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The validity of empirical research often relies upon the accuracy of self-reported behavior and beliefs. Yet, eliciting truthful answers in surveys is challenging especially when studying sensitive issues such as racial prejudice, corruption, and support for militant groups. List experiments have attracted much attention recently as a potential solution to this measurement problem. Many researchers, however, have used a simple difference-in-means estimator without being able to efficiently examine multivariate relationships between respondents' characteristics and their answers to sensitive items. Moreover, no systematic means exist to investigate role of underlying assumptions. We fill these gaps by developing a set of new statistical methods for list experiments. We identify the commonly invoked assumptions, propose new multivariate regression estimators, and develop methods to detect and adjust for potential violations of key assumptions. For empirical illustrations, we analyze list experiments concerning racial prejudice. Open-source software is made available to implement the proposed methodology.
ECMWF ERA5: ensemble means of surface level analysis parameter data
catalogue.ceda.ac.uk
data-search.nerc.ac.uk
Updated Jul 7, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
European Centre for Medium-Range Weather Forecasts (ECMWF) (2025). ECMWF ERA5: ensemble means of surface level analysis parameter data [Dataset]. https://catalogue.ceda.ac.uk/uuid/d8021685264e43c7a0868396a5f582d0
Explore at:
Dataset updated
Jul 7, 2025
Dataset provided by
Centre for Environmental Data Analysishttp://www.ceda.ac.uk/
Authors
European Centre for Medium-Range Weather Forecasts (ECMWF)
License
https://artefacts.ceda.ac.uk/licences/specific_licences/ecmwf-era-products.pdfhttps://artefacts.ceda.ac.uk/licences/specific_licences/ecmwf-era-products.pdf
Area covered
Earth
Variables measured
cloud_area_fraction, sea_ice_area_fraction, air_pressure_at_mean_sea_level, lwe_thickness_of_atmosphere_mass_content_of_water_vapor
Description
This dataset contains ERA5 surface level analysis parameter data ensemble means (see linked dataset for spreads). ERA5 is the 5th generation reanalysis project from the European Centre for Medium-Range Weather Forecasts (ECWMF) - see linked documentation for further details. The ensemble means and spreads are calculated from the ERA5 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables and have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the linked Copernicus Data Store (CDS) data tool, linked to from this record.

Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10-members in the ensemble (i.e., including the control). It is not the sample standard deviation, and thus were calculated by dividing by 10 rather than 9 (N-1). See linked datasets for ensemble member and ensemble mean data.

The ERA5 global atmospheric reanalysis of the covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 rand ERA-interim re-analysis projects.

An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These will be subsequently reviewed ahead of being released by ECMWF as quality assured data within 3 months. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record. However, for the period 2000-2006 the initial ERA5 release was found to suffer from stratospheric temperature biases and so new runs to address this issue were performed resulting in the ERA5.1 release (see linked datasets). Note, though, that Simmons et al. 2020 (technical memo 859) report that "ERA5.1 is very close to ERA5 in the lower and middle troposphere." but users of data from this period should read the technical memo 859 for further details.

🌆 City Lifestyle Segmentation Dataset

kaggle.com

zip

Updated Nov 15, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

UmutUygurr (2025). 🌆 City Lifestyle Segmentation Dataset [Dataset]. https://www.kaggle.com/datasets/umuttuygurr/city-lifestyle-segmentation-dataset

Explore at:

zip(11274 bytes)Available download formats

Dataset updated

Nov 15, 2025

Authors

UmutUygurr

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22121490%2F7189944f8fc292a094c90daa799d08ca%2FChatGPT%20Image%2015%20Kas%202025%2014_07_37.png?generation=1763204959770660&alt=media" alt="">

🌆 About This Dataset

This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.

🎯 Perfect For:

📊 K-Means, DBSCAN, Agglomerative Clustering
🔬 PCA & t-SNE Dimensionality Reduction
🗺️ Geospatial Visualization (Plotly, Folium)
📈 Correlation Analysis & Feature Engineering
🎓 Educational Projects (Beginner to Intermediate)

📦 What's Inside?

Feature	Description	Range
10 Features	Economic, environmental & social indicators	Realistically scaled
300 Cities	Europe, Asia, Americas, Africa, Oceania	Diverse distributions
Strong Correlations	Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6)	ML-ready
No Missing Values	Clean, preprocessed data	Ready for analysis
4-5 Natural Clusters	Metropolitan hubs, eco-towns, developing centers	Pre-validated

🔥 Key Features

✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases

🚀 Quick Start Example

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load and prepare
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)

# Cluster
kmeans = KMeans(n_clusters=5, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze
print(df.groupby('cluster').mean())

🎓 Learning Outcomes

After working with this dataset, you will be able to: 1. Apply K-Means, DBSCAN, and Hierarchical Clustering 2. Use PCA for dimensionality reduction and visualization 3. Interpret correlation matrices and feature relationships 4. Create geographic visualizations with cluster assignments 5. Profile and name discovered clusters based on characteristics

📚 Ideal For These Projects

🏆 Kaggle Competitions: Practice clustering techniques
📝 Academic Projects: Urban planning, sociology, environmental science
💼 Portfolio Work: Showcase ML skills to employers
🎓 Learning: Hands-on practice with unsupervised learning
🔬 Research: Urban lifestyle segmentation studies

🌍 Expected Clusters

Cluster	Characteristics	Example Cities
Metropolitan Tech Hubs	High income, density, rent	Silicon Valley, Singapore
Eco-Friendly Towns	Low density, clean air, high happiness	Nordic cities
Developing Centers	Mid income, high density, poor air	Emerging markets
Low-Income Suburban	Low infrastructure, income	Rural areas
Industrial Mega-Cities	Very high density, pollution	Manufacturing hubs

🛠️ Technical Details

Format: CSV (UTF-8)
Size: ~300 rows × 10 columns
Missing Values: 0%
Data Types: 2 categorical, 8 numerical
Target Variable: None (unsupervised)
Correlation Strength: Pre-validated (r: 0.4 to 0.8)

📖 What Makes This Dataset Special?

Unlike random synthetic data, this dataset was carefully engineered with: - ✨ Realistic correlation structures based on urban research - 🌍 Regional characteristics matching real-world patterns - 🎯 Optimal cluster separability (validated via silhouette scores) - 📚 Comprehensive documentation and starter code

🏅 Use This Dataset If You Want To:

✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights

📊 Acknowledgments

This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.

Happy Clustering! 🎉

f
Data from: An Evaluation of the Use of Statistical Procedures in Soil...
scielo.figshare.com
xls
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Laene de Fátima Tavares; André Mundstock Xavier de Carvalho; Lucas Gonçalves Machado (2023). An Evaluation of the Use of Statistical Procedures in Soil Science [Dataset]. http://doi.org/10.6084/m9.figshare.19944438.v1
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.19944438.v1
Dataset updated
Jun 1, 2023
Dataset provided by
SciELO journals
Authors
Laene de Fátima Tavares; André Mundstock Xavier de Carvalho; Lucas Gonçalves Machado
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT Experimental statistical procedures used in almost all scientific papers are fundamental for clearer interpretation of the results of experiments conducted in agrarian sciences. However, incorrect use of these procedures can lead the researcher to incorrect or incomplete conclusions. Therefore, the aim of this study was to evaluate the characteristics of the experiments and quality of the use of statistical procedures in soil science in order to promote better use of statistical procedures. For that purpose, 200 articles, published between 2010 and 2014, involving only experimentation and studies by sampling in the soil areas of fertility, chemistry, physics, biology, use and management were randomly selected. A questionnaire containing 28 questions was used to assess the characteristics of the experiments, the statistical procedures used, and the quality of selection and use of these procedures. Most of the articles evaluated presented data from studies conducted under field conditions and 27 % of all papers involved studies by sampling. Most studies did not mention testing to verify normality and homoscedasticity, and most used the Tukey test for mean comparisons. Among studies with a factorial structure of the treatments, many had ignored this structure, and data were compared assuming the absence of factorial structure, or the decomposition of interaction was performed without showing or mentioning the significance of the interaction. Almost none of the papers that had split-block factorial designs considered the factorial structure, or they considered it as a split-plot design. Among the articles that performed regression analysis, only a few of them tested non-polynomial fit models, and none reported verification of the lack of fit in the regressions. The articles evaluated thus reflected poor generalization and, in some cases, wrong generalization in experimental design and selection of procedures for statistical analysis.
F
Data from: A generic gust definition and detection method based on...
data.uni-hannover.de
search.datacite.org
zip
Updated Jan 20, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AG PALM (2022). A generic gust definition and detection method based on wavelet-analysis [Dataset]. https://data.uni-hannover.de/dataset/a-generic-gust-definition-and-detection-method-based-on-wavelet-analysis
Explore at:
zipAvailable download formats
Dataset updated
Jan 20, 2022
Dataset authored and provided by
AG PALM
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
This dataset is associated with the paper Knoop et al. (2019) titled "A generic gust definition and detection method based on wavelet-analysis" published in "Advances in Science and Research (ASR)" within the Special Issue: 18th EMS Annual Meeting: European Conference for Applied Meteorology and Climatology 2018. It contains the data and analysis software required to recreate all figures in the publication.
H
Current Population Survey (CPS)
dataverse.harvard.edu
search.dataone.org
Updated May 30, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anthony Damico (2013). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/AK4FDD
Dataset updated
May 30, 2013
Dataset provided by
Harvard Dataverse
Authors
Anthony Damico
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
analyze the current population survey (cps) annual social and economic supplement (asec) with r the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics ( bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups b y state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be t reated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show. this new github repository contains three scripts: 2005-2012 asec - download all microdata.R down load the fixed-width file containing household, family, and person records import by separating this file into three tables, then merge 'em together at the person-level download the fixed-width file containing the person-level replicate weights merge the rectangular person-level file with the replicate weights, then store it in a sql database create a new variable - one - in the data table 2012 asec - analysis examples.R connect to the sql database created by the 'download all microdata' progr am create the complex sample survey object, using the replicate weights perform a boatload of analysis examples replicate census estimates - 2011.R connect to the sql database created by the 'download all microdata' program create the complex sample survey object, using the replicate weights match the sas output shown in the png file below 2011 asec replicate weight sas output.png statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page the bureau of labor statistics' current population survey page the current population survey's wikipedia article notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current populat ion survey to talk about america, subract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
Homestays data
kaggle.com
zip
Updated May 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Priyanshu shukla (2024). Homestays data [Dataset]. https://www.kaggle.com/datasets/priyanshu594/homestays-data
Explore at:
zip(44330689 bytes)Available download formats
Dataset updated
May 25, 2024
Authors
Priyanshu shukla
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Objective: Build a robust predictive model to estimate the log_price of homestay listings based on comprehensive analysis of their characteristics, amenities, and host information. First make sure that the entire dataset is clean and ready to be used. 1. Feature Engineering: Task: Enhance the dataset by creating actionable and insightful features. Calculate Host_Tenure by determining the number of years from host_since to the current date, providing a measure of host experience. Generate Amenities_Count by counting the items listed in the amenities array to quantify property offerings. Determine Days_Since_Last_Review by calculating the days between last_review and today to assess listing activity and relevance. 2. Exploratory Data Analysis (EDA): Task: Conduct a deep dive into the dataset to uncover underlying patterns and relationships. Analyze how pricing (log_price) correlates with both categorical (such as room_type and property_type) and numerical features (like accommodates and number_of_reviews). Utilize statistical tools and visualizations such as correlation matrices, histograms for distribution analysis, and scatter plots to explore relationships between variables. 3. Geospatial Analysis: Task: Investigate the geographical data to understand regional pricing trends. Plot listings on a map using latitude and longitude data to visually assess price distribution. Examine if certain neighbourhoods or proximity to city centres influence pricing, providing a spatial perspective to the pricing strategy. 4. Sentiment Analysis on Textual Data: Task: Apply advanced natural language processing techniques to the description texts to extract sentiment scores. Use sentiment analysis tools to determine whether positive or negative descriptions influence listing prices, incorporating these findings into the predictive model being trained as a feature. 5. Amenities Analysis: Task: Thoroughly parse and analyse the amenities provided in the listings. Identify which amenities are most associated with higher or lower prices by applying statistical tests to determine correlations, thereby informing both pricing strategy and model inputs. 6. Categorical Data Encoding: Task: Convert categorical data into a format suitable for machine learning analysis. Apply one-hot encoding to variables like room_type, city, and property_type, ensuring that the model can interpret these as distinct features without any ordinal implication. 7. Model Development and Training: Task: Design and train predictive models to estimate log_price. Begin with a simple linear regression to establish a baseline, then explore more complex models such as RandomForest and GradientBoosting to better capture non-linear relationships and interactions between features. Document (briefly within Jupyter notebook itself) the model-building process, specifying the choice of algorithms and rationale. 8. Model Optimization and Validation: Task: Systematically optimize the models to achieve the best performance. Employ techniques like grid search to experiment with different hyperparameters settings. Validate model choices through techniques like k-fold cross-validation, ensuring the model generalizes well to unseen data. 9. Feature Importance and Model Insights: Task: Analyze the trained models to identify which features most significantly impact log_price. Utilize model-specific methods like feature importance scores for tree-based models and SHAP values for an in depth understanding of feature contributions. 10. Predictive Performance Assessment: Task: Critically evaluate the performance of the final model on a reserved test set. Use metrics such as Root Mean Squared Error (RMSE) and R-squared to assess accuracy and goodness of fit. Provide a detailed analysis of the residuals to check for any patterns that might suggest model biases or misfit.

Facebook

Twitter

Click to copy link

Link copied

Cite

Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117

Trusted Research Environments: Analysis of Characteristics and Data Availability

Explore at:

bin, csvAvailable download formats

Unique identifier

https://doi.org/10.48436/cv20m-sg117

Dataset updated

Jun 25, 2024

Dataset provided by

TU Wien

Authors

Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical organizational and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is available publicly on the architecture and descriptions of their building blocks & their slight technical variations. To shine light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system description. We further analyze their technical characteristics, especially in their commonalities & variations and provide insight on their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available a majority of available sensitive data records included in this study.

Methodology

We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:

Peer-reviewed articles where available,
TRE websites,
TRE metadata catalogs.

The goal for this literature study is to discover existing TREs, analyze their characteristics and data availability to give an overview on available infrastructure for sensitive data research as many European initiatives have been emerging in recent months.

Technical details

This dataset consists of five comma-separated values (.csv) files describing our inventory:

countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
tres.csv: Table of TREs with columns id (number), name (text), countryid (number, refering to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonomized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records]), optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
inclusion.csv: Table of included TREs into the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).

Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed, properly modelling the schema for databases:

schema.sql: Schema definition file to create the tables and views used in the analysis.

The analysis was done through Jupyter Notebook which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb

Clear search

Close search

Google apps

Main menu

Trusted Research Environments: Analysis of Characteristics and Data...

Methodology

Technical details

Descriptive statistics.

ERA-40 Monthly Means of Isentropic Level Analysis Data

Model output and data used for analysis

Data from: A Semiotics Analysis Found on Music Video of You Belong with Me...

Netflix 2025: User Data Analysis-Ready

Deskripsi Umum

Struktur Dataset

Proses Preprocessing

⚙️ Proses Preprocessing

Tujuan Penggunaan

Data from: Basic statistical considerations for physiology: The journal...

Data from: Qualitative analysis of meanings concerning death and dying...

MECAnalysisTool: A method to analyze consumer data

10 Important Questions on Fundamental Analysis of Stocks – Meaning,...

Collision between biological process and statistical analysis revealed by...

Assessing the impact of hints in learning formal specification: Research...

Comparison of proteomic sample preparation and data analysis methods by...

Replication data for: Statistical Analysis of List Experiments

ECMWF ERA5: ensemble means of surface level analysis parameter data

🌆 City Lifestyle Segmentation Dataset

🌆 About This Dataset

🎯 Perfect For:

📦 What's Inside?

🔥 Key Features

🚀 Quick Start Example

🎓 Learning Outcomes

📚 Ideal For These Projects

🌍 Expected Clusters

🛠️ Technical Details

📖 What Makes This Dataset Special?

🏅 Use This Dataset If You Want To:

📊 Acknowledgments

Data from: An Evaluation of the Use of Statistical Procedures in Soil...

Data from: A generic gust definition and detection method based on...

Current Population Survey (CPS)

Homestays data

Trusted Research Environments: Analysis of Characteristics and Data Availability

Methodology

Technical details