34 datasets found
  1. Descriptive statistics.

    • plos.figshare.com
    xls
    Updated Oct 31, 2023
    Cite
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Descriptive statistics. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    PLOS Global Public Health
    Authors
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
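
    A minimal sketch (not the authors' code) of the modelling workflow described above is shown below: a Random Forest tuned with 3-fold cross-validation, bootstrapped intervals for RMSE and MAE on a held-out split, and impurity-based feature importances standing in for the RF/permutation/SHAP importance analysis. The file and column names (sle_patients.csv, hb, crp, esr, age, sledai, vitamin_d) are placeholders.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.metrics import mean_squared_error, mean_absolute_error

    # Hypothetical input file and column names.
    df = pd.read_csv("sle_patients.csv")
    X = df[["hb", "crp", "esr", "age", "sledai"]]
    y = df["vitamin_d"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # 3-fold cross-validation over a small Random Forest hyperparameter grid.
    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [3, 5, None]},
        cv=3,
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X_train, y_train)
    model = search.best_estimator_

    # Bootstrapped confidence intervals for RMSE and MAE on the held-out set.
    rng = np.random.default_rng(0)
    pred = model.predict(X_test)
    rmse_samples, mae_samples = [], []
    for _ in range(1000):
        idx = rng.integers(0, len(y_test), len(y_test))
        rmse_samples.append(np.sqrt(mean_squared_error(y_test.iloc[idx], pred[idx])))
        mae_samples.append(mean_absolute_error(y_test.iloc[idx], pred[idx]))
    print("RMSE 95% CI:", np.percentile(rmse_samples, [2.5, 97.5]))
    print("MAE 95% CI:", np.percentile(mae_samples, [2.5, 97.5]))

    # Impurity-based importances (permutation importance and SHAP values are analogous checks).
    print(dict(zip(X.columns, model.feature_importances_)))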

  2. Data Use in Academia Dataset

    • datacatalog.worldbank.org
    csv, utf-8
    Updated Nov 27, 2023
    Cite
    Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset
    Explore at:
    Available download formats: utf-8, csv
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Semantic Scholar Open Research Corpus (S2ORC)
    Brian William Stacy
    License

    https://datacatalog.worldbank.org/public-licenses?fragment=cc

    Description

    This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.


    Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.


    We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and parsed PDF or latex file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and latex file are important for extracting important information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical system: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.


    Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.


    The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.


    To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
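
    A minimal sketch of the regular-expression approach described above, assuming a pre-compiled list of ISO 3166 country names (only a few names are shown here for illustration):

    import re

    # Truncated example list; a full implementation would load the complete ISO 3166 name list.
    COUNTRIES = ["Kenya", "Brazil", "Viet Nam", "United Kingdom"]
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRIES)) + r")\b", re.IGNORECASE)

    def countries_mentioned(text: str) -> set[str]:
        """Return the set of listed country names found in a title or abstract."""
        return {match.group(0).title() for match in pattern.finditer(text)}

    print(countries_mentioned("Maternal health outcomes in Kenya and Brazil since 2005"))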


    The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
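
    The sketch below illustrates the NER step with spaCy as described above; it assumes the small English model (en_core_web_sm) is installed, and the extracted GPE mentions would still need to be mapped to ISO country names before being linked to an article.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

    def ner_countries(text: str) -> set[str]:
        """Return GPE (geopolitical entity) mentions found in the text."""
        doc = nlp(text)
        return {ent.text for ent in doc.ents if ent.label_ == "GPE"}

    print(ner_countries("We study household survey data from Viet Nam and the United Kingdom."))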


    The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:


    Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.

    There are two classification tasks in this exercise:

    1. Identifying whether an academic article is using data from any country.

    2. Identifying from which country that data came.

    For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.

    After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]

    For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable, for instance if the research is theoretical and has no specific country application. In other cases, the research article may involve multiple countries; in these cases, select all countries that are discussed in the paper.

    We expect between 10 and 35 percent of all articles to use data.


    The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming $3 per article, as was paid to MTurk workers).


    A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.
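
    The sketch below outlines this classification step (it is not the authors' code): a fine-tuned DistilBERT checkpoint scores an abstract, and the "uses data" label is assigned only when the predicted probability reaches 90%. The checkpoint path and label ordering are placeholders.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    checkpoint = "path/to/fine-tuned-distilbert"  # placeholder for the model trained on the labelled articles
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # two labels: no data / uses data
    model.eval()

    def uses_data(abstract: str, threshold: float = 0.9) -> bool:
        inputs = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        return probs[0, 1].item() >= threshold  # class 1 assumed to mean "uses data"

    print(uses_data("We estimate poverty rates using the 2012 household survey for ..."))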


    The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.


    The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of

  3. ECMWF ERA5t: ensemble means of surface level analysis parameter data

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Jul 28, 2021
    Cite
    (2021). ECMWF ERA5t: ensemble means of surface level analysis parameter data [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?format=Data%20are%20netCDF%20formatted%20with%20internal%20compression.
    Explore at:
    Dataset updated
    Jul 28, 2021
    Description

    This dataset contains ERA5 initial release (ERA5t) surface level analysis parameter data ensemble means (see linked dataset for spreads). ERA5t is the initial release of the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis, available up to 5 days behind the present date. CEDA will maintain a 6 month rolling archive of these data with overlap to the verified ERA5 data - see linked datasets on this record. The ensemble means and spreads are calculated from the ERA5t 10 member ensemble, which is run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, and have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables, which have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool linked to from this record. See linked datasets for ensemble member and spread data. Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and thus was calculated by dividing by 10 rather than 9 (N-1). The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects. An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These data will subsequently be reviewed and, if required, amended before the full ERA5 release. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record.
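
    The following minimal numpy illustration shows the convention described above for the ensemble statistics: the spread is the standard deviation over the 10 members (control included), dividing by N = 10 rather than N - 1 (the array shape here is an arbitrary dummy grid).

    import numpy as np

    ensemble = np.random.rand(10, 181, 360)    # 10 members on a dummy regular lat-lon grid

    ens_mean = ensemble.mean(axis=0)
    ens_spread = ensemble.std(axis=0, ddof=0)  # divide by N = 10: what is stored as "spread"
    sample_std = ensemble.std(axis=0, ddof=1)  # divide by N - 1 = 9: NOT what is stored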

  4. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional EDW such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
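
    As an illustration of the key-value lookup described above (eLAB itself is written in R, so this Python analogue is only a sketch), a few of the potassium subtypes listed above are remapped to a hypothetical data-dictionary code:

    # Hypothetical DD code "k"; subtype names taken from the examples above.
    LAB_LOOKUP = {
        "Potassium": "k",
        "Potassium-External": "k",
        "Potassium(POC)": "k",
        "Potassium,whole-bld": "k",
        "Potassium-Level-External": "k",
        "Potassium,venous": "k",
        "Potassium-whole-bld/plasma": "k",
    }

    def remap_lab(raw_name):
        """Map an EHR lab subtype to the registry data-dictionary code,
        or None if the subtype is not pre-defined (and should be dropped)."""
        return LAB_LOOKUP.get(raw_name)

    print(remap_lab("Potassium(POC)"))  # -> "k"
    print(remap_lab("Sodium,venous"))   # -> None (not in the lookup table)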

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
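
    The sketch below is an illustrative Python analogue of this univariable Cox modelling (eLAB itself uses the R survival/survminer packages); the file name, column names and lab list are placeholders.

    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.read_csv("mcc_baseline_labs.csv")  # hypothetical export: one row per patient

    results = {}
    for lab in ["hemoglobin", "creatinine", "ldh"]:  # hypothetical baseline lab columns
        cph = CoxPHFitter()
        # duration = months of overall survival, event = death indicator (1/0)
        cph.fit(df[["os_months", "death_event", lab]],
                duration_col="os_months", event_col="death_event")
        results[lab] = cph.summary.loc[lab, ["exp(coef)", "p"]]

    print(pd.DataFrame(results).T)  # hazard ratio and exploratory p-value per lab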

  5. Data from: Do Neurochemicals Reflect Psychophysiological Dimensions in...

    • acs.figshare.com
    xlsx
    Updated Feb 18, 2025
    Cite
    Sandrine Parrot (2025). Do Neurochemicals Reflect Psychophysiological Dimensions in Behaviors? A Transdisciplinary Perspective Based on Analogy with Maslow’s Needs Pyramid [Dataset]. http://doi.org/10.1021/acschemneuro.4c00566.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    ACS Publications
    Authors
    Sandrine Parrot
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    All behaviors, including motivated behaviors, result from integration of information in the brain via nerve impulses, with two main means of communication: electrical gap-junctions and chemical signaling. The latter enables information transfer between brain cells through release of biochemical messengers, such as neurotransmitters. Neurochemical studies generate plentiful biochemical data, with many variables per individual, since there are many methods to quantify neurotransmitters, precursors and metabolites. The number of variables can be far higher using other concomitant techniques to monitor behavioral parameters on the same subject of study. Surprisingly, while many quantitative variables are obtained, data analysis and discussion focus on just a few or only on the neurotransmitter known to be involved in the behavior, and the other biochemical data are, at best, regarded as less important for scientific interpretation. The present article aims to provide novel transdisciplinary arguments that all neurochemical data can be regarded as items of psychophysiological dimensions, just as questionnaire items identify modified behaviors or disorders using latent classes. A first proof of concept on nonmotivated and motivated behaviors using a multivariate data-mining approach is presented.

  6. Household Health Survey 2012-2013, Economic Research Forum (ERF)...

    • catalog.ihsn.org
    • datacatalog.ihsn.org
    Updated Jun 26, 2017
    Cite
    Central Statistical Organization (CSO) (2017). Household Health Survey 2012-2013, Economic Research Forum (ERF) Harmonization Data - Iraq [Dataset]. https://catalog.ihsn.org/index.php/catalog/6937
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    Central Statistical Organization (CSO)
    Kurdistan Regional Statistics Office (KRSO)
    Economic Research Forum
    Time period covered
    2012 - 2013
    Area covered
    Iraq
    Description

    Abstract

    The harmonized data set on health, created and published by the ERF, is a subset of the Iraq Household Socio Economic Survey (IHSES) 2012. It was derived from the household, individual and health modules collected in the context of the above-mentioned survey. The sample was then used to create a harmonized health survey, comparable with the Iraq Household Socio Economic Survey (IHSES) 2007 micro data set.

    ----> Overview of the Iraq Household Socio Economic Survey (IHSES) 2012:

    Iraq is considered a leader in household expenditure and income surveys: the first was conducted in 1946, followed by surveys in 1954 and 1961. After the establishment of the Central Statistical Organization, household expenditure and income surveys were carried out every 3-5 years (1971/1972, 1976, 1979, 1984/1985, 1988, 1993, 2002/2007). Implementing the cooperation between the CSO and the World Bank, the Central Statistical Organization (CSO) and the Kurdistan Region Statistics Office (KRSO) launched fieldwork on IHSES on 1/1/2012. The survey was carried out over a full year covering all governorates, including those in the Kurdistan Region.

    The survey has six main objectives. These objectives are:

    1. Provide data for poverty analysis and measurement, and to monitor, evaluate and update the implementation of the Poverty Reduction National Strategy issued in 2009.
    2. Provide a comprehensive data system to assess household social and economic conditions and prepare indicators related to human development.
    3. Provide data that meet the needs and requirements of national accounts.
    4. Provide detailed indicators on consumption expenditure that serve decision-making related to production, consumption, export and import.
    5. Provide detailed indicators on the sources of household and individual income.
    6. Provide data necessary for the formulation of a new consumer price index.

    The raw survey data provided by the Statistical Office were then harmonized by the Economic Research Forum, to create a comparable version with the 2006/2007 Household Socio Economic Survey in Iraq. Harmonization at this stage only included unifying variables' names, labels and some definitions. See: Iraq 2007 & 2012- Variables Mapping & Availability Matrix.pdf provided in the external resources for further information on the mapping of the original variables on the harmonized ones, in addition to more indications on the variables' availability in both survey years and relevant comments.

    Geographic coverage

    National coverage: Covering a sample of urban, rural and metropolitan areas in all the governorates including those in Kurdistan Region.

    Analysis unit

    1- Household/family. 2- Individual/person.

    Universe

    The survey was carried out over a full year covering all governorates including those in Kurdistan Region.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    ----> Design:

    The sample size was 25,488 households for the whole of Iraq: 216 households in each of the 118 districts, organized into 2,832 clusters of 9 households each, distributed across districts and governorates for rural and urban areas.

    ----> Sample frame:

    The listing and numbering results of the 2009-2010 Population and Housing Survey were adopted in all governorates, including the Kurdistan Region, as a frame from which to select households. The sample was selected in two stages. Stage 1: primary sampling units (blocks) within each stratum (district), for urban and rural, were systematically selected with probability proportional to size to reach 2,832 units (clusters). Stage 2: 9 households were selected from each primary sampling unit to create a cluster; the total sample was thus 25,488 households distributed over the governorates, 216 households in each district.

    ----> Sampling Stages:

    In each district, the sample was selected in two stages. Stage 1: based on the 2010 listing and numbering frame, 24 sample points were selected within each stratum through systematic sampling with probability proportional to size, with an implicit breakdown by urban/rural and geography (sub-district, quarter, street, county, village and block). Stage 2: using households as secondary sampling units, 9 households were selected from each sample point using systematic equal-probability sampling. The sampling frames for each stage can be developed from the 2010 building listing and numbering without updating household lists. In some small districts, the random selection of primary sampling units may yield fewer than 24 distinct units; in that case a sampling unit may be selected more than once, so that two or more clusters may be drawn from the same enumeration unit when necessary.
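
    For illustration only (this is not the CSO's selection software), the sketch below shows systematic probability-proportional-to-size selection of primary sampling units as described above, including the possibility that a large block is selected more than once; the block sizes are made up.

    import random

    def systematic_pps(sizes, n_select, seed=1):
        """Select n_select unit indices with probability proportional to size,
        using systematic sampling over the cumulative size totals."""
        random.seed(seed)
        total = sum(sizes)
        step = total / n_select
        start = random.random() * step                  # random start in [0, step)
        points = [start + i * step for i in range(n_select)]
        chosen, cum, idx = [], 0, 0
        for p in points:
            while cum + sizes[idx] <= p:                # walk to the unit containing point p
                cum += sizes[idx]
                idx += 1
            chosen.append(idx)                          # large units can be selected more than once
        return chosen

    block_sizes = [120, 80, 200, 60, 150, 90, 300, 75]  # hypothetical household counts per block
    print(systematic_pps(block_sizes, n_select=4))
    # Within each selected block, 9 households would then be drawn with systematic equal probability.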

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    ----> Preparation:

    The questionnaire of the 2006 survey was adopted in designing the questionnaire of the 2012 survey, to which many revisions were made. Two rounds of pre-testing were carried out. Revisions were made based on feedback from the field work team, World Bank consultants and others, and further revisions were made before a version was implemented in a pilot survey in September 2011. After the pilot survey, additional revisions were made based on the challenges and feedback that emerged during its implementation, and the final version was then used in the actual survey.

    ----> Questionnaire Parts:

    The questionnaire consists of four parts, each with several sections:

    Part 1: Socio-Economic Data:
    - Section 1: Household Roster
    - Section 2: Emigration
    - Section 3: Food Rations
    - Section 4: Housing
    - Section 5: Education
    - Section 6: Health
    - Section 7: Physical Measurements
    - Section 8: Job Seeking and Previous Job

    Part 2: Monthly, Quarterly and Annual Expenditures:
    - Section 9: Expenditures on Non-Food Commodities and Services (past 30 days)
    - Section 10: Expenditures on Non-Food Commodities and Services (past 90 days)
    - Section 11: Expenditures on Non-Food Commodities and Services (past 12 months)
    - Section 12: Expenditures on Frequent Food Stuff and Non-Food Commodities (past 7 days)
    - Section 12, Table 1: Meals Had Within the Residential Unit
    - Section 12, Table 2: Number of Persons Participating in Meals Within Household Expenditure Other Than its Members

    Part 3: Income and Other Data:
    - Section 13: Job
    - Section 14: Paid Jobs
    - Section 15: Agriculture, Forestry and Fishing
    - Section 16: Household Non-Agricultural Projects
    - Section 17: Income from Ownership and Transfers
    - Section 18: Durable Goods
    - Section 19: Loans, Advances and Subsidies
    - Section 20: Shocks and Coping Strategies in the Household
    - Section 21: Time Use
    - Section 22: Justice
    - Section 23: Satisfaction in Life
    - Section 24: Food Consumption During the Past 7 Days

    Part 4: Diary of Daily Expenditures: the diary of expenditure is an essential component of this survey. It is left with the household to record all daily purchases, such as expenditures on food and frequent non-food items (gasoline, newspapers, etc.), over 7 days. Two pages are allocated for recording each day's expenditures, so the diary consists of 14 pages.

    Cleaning operations

    ----> Raw Data:

    Data Editing and Processing: To ensure accuracy and consistency, the data were edited at the following stages: 1. Interviewer: checks all answers on the household questionnaire, confirming that they are clear and correct. 2. Local supervisor: checks to make sure that questions have been correctly completed. 3. Statistical analysis: after exporting data files from Excel to SPSS, the Statistical Analysis Unit uses program commands to identify irregular or non-logical values, in addition to auditing some variables. 4. World Bank consultants in coordination with the CSO data management team: the World Bank technical consultants use additional programs in SPSS and Stata to examine and correct remaining inconsistencies within the data files. The software detects errors by analyzing questionnaire items against the expected parameters for each variable.

    ----> Harmonized Data:

    • The SPSS package is used to harmonize the Iraq Household Socio Economic Survey (IHSES) 2007 with Iraq Household Socio Economic Survey (IHSES) 2012.
    • The harmonization process starts with raw data files received from the Statistical Office.
    • A program is generated for each dataset to create harmonized variables.
    • Data is saved on the household and individual level, in SPSS and then converted to STATA, to be disseminated.

    Response rate

    The Iraq Household Socio Economic Survey (IHSES) reached a total of 25,488 households. The number of households that refused to respond was 305, and the response rate was 98.6%. The highest interview rates were in Ninevah and Muthanna (100%), while the lowest was in Sulaimaniya (92%).

  7. ECMWF ERA5: ensemble means of surface level analysis parameter data

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Sep 16, 2021
    Cite
    (2021). ECMWF ERA5: ensemble means of surface level analysis parameter data [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=ERA5
    Explore at:
    Dataset updated
    Sep 16, 2021
    Description

    This dataset contains ERA5 surface level analysis parameter data ensemble means (see linked dataset for spreads). ERA5 is the 5th generation reanalysis project from the European Centre for Medium-Range Weather Forecasts (ECMWF) - see linked documentation for further details. The ensemble means and spreads are calculated from the ERA5 10 member ensemble, which is run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, and have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables, which have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool linked to from this record. Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and thus was calculated by dividing by 10 rather than 9 (N-1). See linked datasets for ensemble member and ensemble mean data. The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects. An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These data are subsequently reviewed ahead of being released by ECMWF as quality assured data within 3 months. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record. However, for the period 2000-2006 the initial ERA5 release was found to suffer from stratospheric temperature biases, and so new runs to address this issue were performed, resulting in the ERA5.1 release (see linked datasets). Note, though, that Simmons et al. 2020 (technical memo 859) report that "ERA5.1 is very close to ERA5 in the lower and middle troposphere", but users of data from this period should read technical memo 859 for further details.

  8. Simulation data and code for "Optimal Rejection-Free Path Sampling"

    • data-staging.niaid.nih.gov
    • zenodo.org
    Updated Mar 25, 2025
    Cite
    Lazzeri, Gianmarco (2025). Simulation data and code for "Optimal Rejection-Free Path Sampling" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14922167
    Explore at:
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Lazzeri, Gianmarco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the main data of the paper "Optimal Rejection-Free Path Sampling," and the source code for generating/appending the independent RFPS-AIMMD and AIMMD runs.

    Due to size constraints, the data has been split into separate repositories. The following repositories contain the trajectory files generated by the runs:

    all the WQ runs: 10.5281/zenodo.14830317
    chignolin, fps0: 10.5281/zenodo.14826023
    chignolin, fps1: 10.5281/zenodo.14830200
    chignolin, fps2: 10.5281/zenodo.14830224
    chignolin, tps0: 10.5281/zenodo.14830251
    chignolin, tps1: 10.5281/zenodo.14830270
    chignolin, tps2: 10.5281/zenodo.14830280

    The trajectory files are not required for running the main analysis, as all necessary information for machine learning and path reweighting is contained in the "PathEnsemble" object files stored in this repository. However, these trajectories are essential for projecting the path ensemble estimate onto an arbitrary set of collective variables.

    To reconstruct the full dataset, please merge all the data folders you find in the supplemental repositories.

    Data structure and content

    analysis (code for analyzing the data and generating the figures of the paper)
    |- figures.ipynb (Jupyter notebook for the analysis)
    |- figures (the figures created by the Jupyter notebook)
    |  |- ...

    data (all the AIMMD and reference runs, plus general info about the simulated systems)
    |- chignolin
    |  |- *.py (code for generating/appending AIMMD runs on a Workstation or HPC cluster via Slurm; see the "src" folder below)
    |  |- run.gro (full system positions in the native conformation)
    |  |- mol.pdb (only the peptide positions in the native conformation)
    |  |- topol.top (the system's topology for the GROMACS MD engine)
    |  |- charmmm22star.ff (force field parameter files)
    |  |- run.mdp (GROMACS MD parameters when appending a simulation)
    |  |- randomvelocities.mdp (GROMACS MD parameters when initializing a simulation with random velocities)
    |  |- signature.npy, r0.npy (parameters defining the fraction of native contacts used in the folded/unfolded states definition; used by the params.py function "states_function")
    |  |- dmax.npy, dmin.npy (parameters defining the feature representation of the AIMMD NN model; used by the params.py function "descriptors_function")
    |  |- equilibrium (reference long equilibrium trajectory files; only the peptide positions are saved!)
    |  |  |- run0.xtc, ..., run3.xtc
    |  |- validation
    |  |  |- validation.xtc (the validation SPs all together in an XTC file)
    |  |  |- validation.npy (for each SP, collects the cumulative shooting results after 10 two-way shooting simulations)
    |  |- fps0 (the first AIMMD-RFPS independent run)
    |  |  |- equilibriumA (the free simulations around A, already processed into PathEnsemble files)
    |  |  |  |- traj000001.h5
    |  |  |  |- traj000001.tpr (for running the simulation; in that case, please retrieve all the trajectory files from the right supplemental repository first)
    |  |  |  |- traj000001.cpt (for appending the simulation; in that case, please retrieve all the trajectory files from the right supplemental repository first)
    |  |  |  |- traj000002.h5 (in case of re-initialization)
    |  |  |  |- ...
    |  |  |- equilibriumB (the free simulations around B, ...)
    |  |  |  |- ...
    |  |  |- shots0
    |  |  |  |- chain.h5 (the path sampling chain)
    |  |  |  |- pool.h5 (the selection pool, containing the frames from which shooting points are currently selected)
    |  |  |- params.py (file containing the states and descriptors definitions, the NN fit function, and the AIMMD run hyperparameters; it can be modified to allow for RFPS-AIMMD or original-algorithm AIMMD runs)
    |  |  |- initial.trr (the initial transition for path sampling)
    |  |  |- manager.log (reports info about the run)
    |  |  |- network.h5 (NN weights of the model at different path sampling steps)
    |  |- fps1, fps2 (the other RFPS-AIMMD runs)
    |  |- tps0 (the first AIMMD-TPS, or "standard" AIMMD, run)
    |  |  |- ...
    |  |  |- shots0
    |  |  |  |- ...
    |  |  |  |- chain_weights.npy (weights of the trials in TPS; only the trials with non-zero weight have been accepted)
    |  |- tps1, tps2 (the other AIMMD runs, with TPS for the shooting simulations)
    |- wq (Wolfe-Quapp 2D system)
    |  |- *.py (code for generating/appending AIMMD runs on a Workstation or HPC cluster via Slurm)
    |  |- run.gro (dummy gro file produced for compatibility reasons)
    |  |- integrator.py (custom MD engine)
    |  |- equilibrium (reference long simulation)
    |  |  |- transition000001.xtc (extracted from the reference long simulation)
    |  |  |- transition000002.xtc
    |  |  |- ...
    |  |  |- transitions.h5 (PathEnsemble file with all the transitions)
    |  |- reference
    |  |  |- grid_X.npy, grid_Y.npy (X, Y grid for 2D plots)
    |  |  |- grid_V.npy (PES projected on the grid)
    |  |  |- grid_committor_relaxation.npy (true committor on the grid solved with the relaxation method on the backward Kolmogorov equation; the code for doing this is in utils.py)
    |  |  |- grid_boltzmann_distribution.npy (Boltzmann distribution on the grid)
    |  |  |- pe.h5 (equilibrium distribution processed as a PathEnsemble file)
    |  |  |- tpe.h5 (TPE distribution processed as a PathEnsemble file)
    |  |  |- ...
    |  |- uniform_tps (reference TPS run with uniform SP selection)
    |  |  |- chain.h5 (PathEnsemble file containing all the accepted paths with their correct weight)
    |  |- fps0, ..., fps9 (the independent AIMMD-RFPS runs)
    |  |- ...
    |  |- tps0, ..., tps9 (the independent AIMMD-TPS, or "standard" AIMMD, runs)

    src (code for generating/appending AIMMD runs on a Workstation or HPC cluster via Slurm)
    |- generate.py (on a Workstation: initializes the processes; on an HPC cluster: creates the sh file for submitting a job)
    |- slurm_options.py (to customize and use in case of running on HPC)
    |- manager.py (controls SP selection; reweights the paths)
    |- shooter.py (performs path sampling simulations)
    |- equilibrium.py (performs free simulations)
    |- pathensemble.py (code of the PathEnsemble class)
    |- utils.py (auxiliary functions for data production and analysis)

    Running/appending AIMMD runs

    • To initialize a new RFPS-AIMMD (or AIMMD) run for the systems of this paper:
    1. Create a "run directory" folder (same depth as "fps0")

    2. Copy "initial.trr" and "params.py" from another AIMMD run folder. It is possible to change "params.py" to customize the run.

    3. (On a Workstation) call:

    python generate.py <nsteps> <n> <nA> <nB>

    where nsteps is the final number of path sampling steps for the run, n the number of independent path sampling chains, nA the number of independent free simulators around A, and nB that of free simulators around B.

    Or (on an HPC cluster) call:

    python generate.py -s slurm_options.py
    sbatch ._job.sh

    • To append to an existing RFPS-AIMMD or AIMMD run
    1. Merge the supplemental repository with the trajectory files into this one.

    2. Just call again (on a Workstation)

    python generate.py

    or (on an HPC cluster)

    sbatch ._job.sh

    after updating the "nsteps" parameters.

    • To run enhanced sampling for a new system: please keep the data structure as close as possible to the original. Different file names can cause incompatibilities. We are currently working to make this easier.

    Reproducing the analysis

    Run the analysis/figures.ipynb notebook. Some groups of cells have to be run multiple times after changing the parameters in the preamble.

  9. ECMWF ERA5.1: ensemble means of surface level analysis parameter data for...

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Sep 18, 2021
    Cite
    (2021). ECMWF ERA5.1: ensemble means of surface level analysis parameter data for 2000-2006 [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=ensemble%20run
    Explore at:
    Dataset updated
    Sep 18, 2021
    Description

    This dataset contains ERA5.1 surface level analysis parameter data ensemble means over the period 2000-2006. ERA5.1 is the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis re-run for 2000-2006 to improve upon the cold bias in the lower stratosphere seen in ERA5 (see technical memorandum 859 in the linked documentation section for further details). The ensemble means are calculated from the ERA5.1 10 member ensemble, which is run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, and have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables, which have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool linked to from this record. See linked datasets for ensemble member and spread data. Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and thus was calculated by dividing by 10 rather than 9 (N-1). The main ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects. An initial release of ERA5 data, ERA5t, is also available up to 5 days behind the present. A limited selection of data from these runs is also available via CEDA, whilst full access is available via the Copernicus Data Store.

  10. Description of models.

    • figshare.com
    xml
    Updated Jul 17, 2025
    Cite
    Peter Thompson; Benjamin Jan Andersson; Nicolas Sundqvist; Gunnar Cedersund (2025). Description of models. [Dataset]. http://doi.org/10.1371/journal.pone.0327593.s003
    Explore at:
    Available download formats: xml
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Peter Thompson; Benjamin Jan Andersson; Nicolas Sundqvist; Gunnar Cedersund
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background and objective: Practical identifiability analysis, i.e., ascertaining whether a model property can be determined from given data, is central to model-based data analysis in biomedicine. The main approaches used today all require that coverage of the parameter space be exhaustive, which is usually impossible. An alternative could be using structural identifiability methods, since they do not need such coverage. However, current structural methods are unsuited for practical identifiability analysis, since they assume that all higher-order derivatives of the measured variables are available. Herein, we provide new definitions and methods that allow for this assumption to be relaxed. Methods and results: We introduce the concept of -identifiability, which differs from previous definitions in that it assumes that only the first derivatives of the measurement signal yi are available. This new type of identifiability can be determined using our new algorithms, as is demonstrated by applications to various published biomedical models. Our methods allow for identifiability of not only parameters, but of any model property, i.e., observability. These new results provide further strengthening of conclusions made in previous analyses of these models. For the first time, we can quantify the impact of the assumption that all derivatives are available in specific examples. If one, e.g., assumes that only up to third-order derivatives, instead of all derivatives, are available, the number of identifiable parameters drops from 17 to 1 for a Drosophila model, and from 21 to 6 for an NF-κB model. In both models, the previously obtained identifiability is present only if at least 20 derivatives of all measurement signals are available. Conclusion: Our results demonstrate that the assumption regarding availability of derivatives made in traditional structural identifiability analysis implies a large overestimation of the number of parameters that can be estimated. Our new methods and algorithms allow for this assumption to be relaxed, bringing structural identifiability methodology one step closer to practical identifiability analysis.

  11. Intermediate rings and their fraction fields.

    • plos.figshare.com
    xls
    Updated Jul 17, 2025
    Cite
    Peter Thompson; Benjamin Jan Andersson; Nicolas Sundqvist; Gunnar Cedersund (2025). Intermediate rings and their fraction fields. [Dataset]. http://doi.org/10.1371/journal.pone.0327593.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Peter Thompson; Benjamin Jan Andersson; Nicolas Sundqvist; Gunnar Cedersund
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background and objective: Practical identifiability analysis, i.e., ascertaining whether a model property can be determined from given data, is central to model-based data analysis in biomedicine. The main approaches used today all require that coverage of the parameter space be exhaustive, which is usually impossible. An alternative could be using structural identifiability methods, since they do not need such coverage. However, current structural methods are unsuited for practical identifiability analysis, since they assume that all higher-order derivatives of the measured variables are available. Herein, we provide new definitions and methods that allow for this assumption to be relaxed. Methods and results: We introduce the concept of -identifiability, which differs from previous definitions in that it assumes that only the first derivatives of the measurement signal yi are available. This new type of identifiability can be determined using our new algorithms, as is demonstrated by applications to various published biomedical models. Our methods allow for identifiability of not only parameters, but of any model property, i.e., observability. These new results provide further strengthening of conclusions made in previous analyses of these models. For the first time, we can quantify the impact of the assumption that all derivatives are available in specific examples. If one, e.g., assumes that only up to third-order derivatives, instead of all derivatives, are available, the number of identifiable parameters drops from 17 to 1 for a Drosophila model, and from 21 to 6 for an NF-κB model. In both models, the previously obtained identifiability is present only if at least 20 derivatives of all measurement signals are available. Conclusion: Our results demonstrate that the assumption regarding availability of derivatives made in traditional structural identifiability analysis implies a large overestimation of the number of parameters that can be estimated. Our new methods and algorithms allow for this assumption to be relaxed, bringing structural identifiability methodology one step closer to practical identifiability analysis.

  12. Specification and optimization of analytical data flows

    • resodate.org
    Updated May 27, 2016
    Cite
    Fabian Hüske (2016). Specification and optimization of analytical data flows [Dataset]. http://doi.org/10.14279/depositonce-5150
    Explore at:
    Dataset updated
    May 27, 2016
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Fabian Hüske
    Description

    In the past, the majority of data analysis use cases was addressed by aggregating relational data. Since a few years, a trend is evolving, which is called “Big Data” and which has several implications on the field of data analysis. Compared to previous applications, much larger data sets are analyzed using more elaborate and diverse analysis methods such as information extraction techniques, data mining algorithms, and machine learning methods. At the same time, analysis applications include data sets with less or even no structure at all. This evolution has implications on the requirements on data processing systems. Due to the growing size of data sets and the increasing computational complexity of advanced analysis methods, data must be processed in a massively parallel fashion. The large number and diversity of data analysis techniques as well as the lack of data structure determine the use of user-defined functions and data types. Many traditional database systems are not flexible enough to satisfy these requirements. Hence, there is a need for programming abstractions to define and efficiently execute complex parallel data analysis programs that support custom user-defined operations. The success of the SQL query language has shown the advantages of declarative query specification, such as potential for optimization and ease of use. Today, most relational database management systems feature a query optimizer that compiles declarative queries into physical execution plans. Cost-based optimizers choose from billions of plan candidates the plan with the least estimated cost. However, traditional optimization techniques cannot be readily integrated into systems that aim to support novel data analysis use cases. For example, the use of user-defined functions (UDFs) can significantly limit the optimization potential of data analysis programs. Furthermore, lack of detailed data statistics is common when large amounts of unstructured data is analyzed. This leads to imprecise optimizer cost estimates, which can cause sub-optimal plan choices. In this thesis we address three challenges that arise in the context of specifying and optimizing data analysis programs. First, we propose a parallel programming model with declarative properties to specify data analysis tasks as data flow programs. In this model, data processing operators are composed of a system-provided second-order function and a user-defined first-order function. A cost-based optimizer compiles data flow programs specified in this abstraction into parallel data flows. The optimizer borrows techniques from relational optimizers and ports them to the domain of general-purpose parallel programming models. Second, we propose an approach to enhance the optimization of data flow programs that include UDF operators with unknown semantics. We identify operator properties and conditions to reorder neighboring UDF operators without changing the semantics of the program. We show how to automatically extract these properties from UDF operators by leveraging static code analysis techniques. Our approach is able to emulate relational optimizations such as filter and join reordering and holistic aggregation push-down while not being limited to relational operators. Finally, we analyze the impact of changing execution conditions such as varying predicate selectivities and memory budgets on the performance of relational query plans. 
We identify plan patterns that cause significantly varying execution performance under changing execution conditions. Plans that include such risky patterns are prone to cause problems in the presence of imprecise optimizer estimates. Based on our findings, we introduce an approach to avoid risky plan choices. Moreover, we present a method to assess the risk of a query execution plan using a machine-learned prediction model. Experiments show that the prediction model outperforms risk predictions computed from optimizer estimates.
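    To make the operator model concrete, the following is a minimal, illustrative sketch rather than the thesis's actual system or API: two system-provided second-order functions, map and reduce, are parameterized with user-defined first-order functions (UDFs) and composed into a small data flow. All function names here are hypothetical.

```python
# Minimal sketch (hypothetical, not the thesis's system): operators as
# system-provided second-order functions parameterized with user-defined
# first-order functions, composed into a small data flow.
from itertools import groupby
from operator import itemgetter


def map_operator(udf, dataset):
    """Second-order 'map': applies the first-order UDF independently to each record."""
    for record in dataset:
        yield from udf(record)


def reduce_operator(udf, dataset, key):
    """Second-order 'reduce': groups records by key and applies the UDF per group."""
    for k, group in groupby(sorted(dataset, key=key), key=key):
        yield udf(k, list(group))


# User-defined first-order functions with arbitrary, non-relational logic.
def tokenize(line):                      # UDF plugged into map
    for word in line.lower().split():
        yield (word, 1)


def count(word, pairs):                  # UDF plugged into reduce
    return (word, sum(c for _, c in pairs))


if __name__ == "__main__":
    lines = ["big data needs parallel processing",
             "data flow programs need optimization"]
    pairs = map_operator(tokenize, lines)
    counts = reduce_operator(count, pairs, key=itemgetter(0))
    print(dict(counts))
```

    Because the system only sees the second-order functions, an optimizer can choose parallelization and execution strategies, but it needs additional information, such as the UDF code analysis described above, before it may safely reorder neighboring UDF operators.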

  13. Q

    Data for: Debating Algorithmic Fairness

    • data.qdr.syr.edu
    Updated Nov 13, 2023
    Cite
    Melissa Hamilton; Melissa Hamilton (2023). Data for: Debating Algorithmic Fairness [Dataset]. http://doi.org/10.5064/F6JOQXNF
    Explore at:
    pdf(53179), pdf(63339), pdf(285052), pdf(103333), application/x-json-hypothesis(55745), pdf(256399), jpeg(101993), pdf(233414), pdf(536400), pdf(786428), pdf(2243113), pdf(109638), pdf(176988), pdf(59204), pdf(124046), pdf(802960), pdf(82120)Available download formats
    Dataset updated
    Nov 13, 2023
    Dataset provided by
    Qualitative Data Repository
    Authors
    Melissa Hamilton; Melissa Hamilton
    License

    https://qdr.syr.edu/policies/qdr-standard-access-conditions

    Time period covered
    2008 - 2017
    Area covered
    United States
    Description

    This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the Publisher's Website.

    Data Generation. The research project engages a story about perceptions of fairness in criminal justice decisions. The specific focus involves a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval, as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them, as it meant that they were also easily attainable by the general public, thus extending the documents’ reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third-party, open source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe’s main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe’s other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some of the Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions. The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on GitHub. Because ProPublica gathered the data directly from criminal justice officials via Freedom of Information Act requests, the dataset is in the public domain, and thus no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis.

    Data Analysis. The qualitative enquiry used critical discourse analysis, which investigates ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and to other relevant writings by the same authors. Several more specific types of discursive strategies were of interest and attracted further critical examination:

    - Testing claims and rationalizations that appear to serve the speaker’s self-interest
    - Examining conclusions and determining whether sufficient evidence supported them
    - Revealing contradictions and/or inconsistencies within the same text and intertextually
    - Assessing strategies underlying justifications and rationalizations used to promote a party’s assertions and arguments
    - Noticing strategic deployment of lexical phrasings, syntax, and rhetoric
    - Judging sincerity of voice and the objective consideration of alternative perspectives

    Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is, uncovering facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted yet with their significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature. The paper could have been completed with just the critical discourse analysis. However, because one of the salient findings from it highlighted that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. Then, the availability of the same dataset used by the parties in conflict made this opportunity more appealing. Calculating additional algorithmic equity equations would therefore not be troubled by irregularities arising from diverse sample sets. New variables were created as relevant to calculate algorithmic fairness equations. In addition to using various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means (see the sketch after this description).

    Logic of Annotation. Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories. Space considerations. Critical discourse analysis offers a rich method...
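    As a purely illustrative aside, a two-proportion z-test of the kind mentioned above can be computed as follows; the counts used here are invented for demonstration and are not taken from the COMPAS dataset or the article.

```python
# Illustrative sketch only: a two-proportion z-test, the kind of comparison used
# to contrast a group-level rate (e.g., a classification or error rate) between
# two groups. The counts below are made up for demonstration purposes.
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Compare proportions x1/n1 and x2/n2 with a pooled, two-sided z-test."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value from the normal tail
    return z, p_value


# Hypothetical example: 805 of 1,795 cases flagged in group A vs. 349 of 1,488 in group B.
z, p = two_proportion_z_test(805, 1795, 349, 1488)
print(f"z = {z:.2f}, p = {p:.4g}")
```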

  14. Short-term PM2.5 exposure and early-readmission risk in Heart Failure...

    • catalog.data.gov
    Updated Nov 15, 2024
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). Short-term PM2.5 exposure and early-readmission risk in Heart Failure Patients [Dataset]. https://catalog.data.gov/dataset/short-term-pm2-5-exposure-and-early-readmission-risk-in-heart-failure-patients
    Explore at:
    Dataset updated
    Nov 15, 2024
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    In this manuscript EPA researchers used high resolution (1x1 km) modeled air quality data from a model built by Harvard collaborators to estimate the association between short-term exposure to air pollution and the occurrence of 30-day readmissions in a heart failure population. The heart failure population was taken from patients presenting to a University of North Carolina Healthcare System (UNCHCS) affiliated hospital or clinic that reported electronic health records to the Carolina Data Warehouse for Health (CDW-H). A description of the variables used in this analysis is available in the data dictionary (L:/PRIV/EPHD_CRB/Cavin/CARES/Data Dictonaries/HF short term PM25 and readmissions data dictionary.xlsx) associated with this manuscript. Analysis code is available in L:/PRIV/EPHD_CRB/Cavin/CARES/Project Analytic Code/Lauren Wyatt/DailyPM_HF_readmission. This dataset is not publicly accessible because it contains PII in the form of electronic health records. It can be accessed through the following means: data can be accessed with an approved IRB. This dataset is associated with the following publication: Wyatt, L., A. Weaver, J. Moyer, J. Schwartz, Q. Di, D. Diazsanchez, W. Cascio, and C. Ward-Caviness. Short-term PM2.5 exposure and early-readmission risk: A retrospective cohort study in North Carolina Heart Failure Patients. American Heart Journal. Mosby Year Book Incorporated, Orlando, FL, USA, 248: 130-138, (2022).

  15. n

    Dataset of biomechanical effects of the addition of a precision constraint...

    • data-staging.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Nov 4, 2021
    Cite
    Nour Sghaier; Guillaume Fumery; Vincent Fourcassié; Nicolas Alain Turpin; Pierre Moretto (2021). Dataset of biomechanical effects of the addition of a precision constraint on a collective load carriage task [Dataset]. http://doi.org/10.5061/dryad.kprr4xh5n
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 4, 2021
    Dataset provided by
    Université Toulouse III - Paul Sabatier
    University of Reunion Island
    Authors
    Nour Sghaier; Guillaume Fumery; Vincent Fourcassié; Nicolas Alain Turpin; Pierre Moretto
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Team lifting is a complex and collective motor task that possesses both motor and cognitive components. We study the collective load carriage adaptation due to an additional accuracy constraint. Ten dyads performed a first condition in which they collectively transported a load (CC), and a second one in which they transported the same load while maintaining a ball in a target position on its top (PC). The recovery-rate, amplitude, and period of the center-of-mass of the whole system (dyad + table, CoMPACS) were computed. We analyzed the forces and moments exerted at each joint of the upper limbs of the subjects. We observed a decrease in the overall performance of the dyads when the Precision task was added, i.e., i) the velocity and amplitude of CoMPACS decreased by 1.7% and 5.8%, respectively, ii) inter-subject variability of the Moment-Cost-Function decreased by 95% and recovery rate decreased by 19.2% during PC. A kinetic synergy analysis showed that the subjects reorganized their coordination in the PC. Our results demonstrate that adding a precision task affects the economy of collective load carriage. Notwithstanding, the joint moments at the upper limbs are better balanced and co-vary more across the paired subjects during the precision task.

    Methods We recorded two conditions: a first one where the subjects walked side by side at spontaneous speed while carrying a box (CC: Control Condition) and a second one where the individuals were instructed to transport the box while performing an accuracy task consisting of keeping a ball in the center (PC: Precision Condition).

    Motion capture data were collected using thirteen infrared transmitter-receiver video cameras (11 MX3 and 2 TS40; Vicon©, Oxford Metrics, Oxford, United Kingdom) sampled at 200 Hz. Forty-two retro-reflective markers were placed on bony landmarks and on the navel of each subject (according to Wu et al., 2002, 2005), and fourteen on the box. The ball used during the PC tests was reflective as well and was tracked by the Vicon© system. In order to record the gait pattern at constant speed (i.e. to exclude the acceleration and deceleration phases at the beginning and end of each trial), the volume calibrated by the Vicon© system (30 m³) was located in the middle of the 20 m-long walkway crossed by the subjects. The reflective markers were tracked to define the kinematics of the Poly-Articulated Collective System (PACS) formed by the two individuals and the load they carry (22,23). The data were recorded over one gait cycle, defined by the first heel strike of the first subject and the third heel strike of the second subject of the PACS, to ensure a cycle of each subject. The 3D reconstruction was performed using Vicon Nexus 1.8.5© software. The two lateral handles used to transport the box were equipped with Sensix® force sensors sampled at 2000 Hz. A 4th-order Butterworth filter with cut-off frequencies of 5 Hz and 10 Hz was applied to the positions of the markers and to the forces exerted on the box handles, respectively.

    The De Leva anthropometric tables (24) were used to estimate the mass mi and the CoM of each segment i (CoMi) of the PACS and to compute its global CoM (CoMPACS) as follows: $G_{PACS} = \frac{1}{m_{PACS}} \sum_{i=1}^{n} m_i G_i$ (1), with GPACS the 3D position of the CoMPACS in the frame R (the global coordinate system), mPACS the mass of the PACS, n = 33 the number of PACS segments (i.e. 16 segments per volunteer plus one segment for the box) and Gi the 3D position of the CoMi in the frame R. The CoM of the box was determined at the intersection point of the vertical lines obtained by hanging it with a thread fixed at different positions. The material used for the box construction, i.e. wood and aluminium, was considered as not deformable. According to Holt et al. (2003), the amplitude (A = Zmax − Zmin, with Z the height of the CoMPACS, in meters) and the period (peak to peak, in percent of the gait cycle) of the CoMPACS were also assessed. The forward kinetic work (Wkf), as well as the vertical (Wv) and external work (Wext) of the CoMPACS, were computed according to the method of Bastien et al. (2016). Then, based on the external work, the percentage of energy recovered by the CoMPACS in the sagittal plane was computed (called recovery rate RR in Fumery et al., 2018a, 2018b). This parameter assesses the amount of energy transferred between the potential and the kinetic energy (Eqn 2): $RR = 100 \cdot \frac{W_{kf} + W_v - W_{ext}}{W_{kf} + W_v}$ (2). The closer the value of RR is to 100%, the more consistent the locomotor pattern is with the inverted pendulum system (IPS) model of locomotion (26–28,13). In this study, the trajectories of the CoMPACS and of the CoM of an inverted pendulum have been investigated.
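    A minimal sketch of the two reconstructed formulas above, assuming only that segment masses and CoM positions are available as arrays; the numbers in the example are placeholders, not measured data.

```python
# Illustrative sketch: CoM of a poly-articulated system as a mass-weighted mean
# of segment CoMs, and the recovery rate from the external-work decomposition.
import numpy as np


def com_pacs(segment_masses, segment_coms):
    """G_PACS = (1/m_PACS) * sum_i m_i * G_i  (segment_coms is an n x 3 array)."""
    m = np.asarray(segment_masses, dtype=float)   # shape (n,)
    g = np.asarray(segment_coms, dtype=float)     # shape (n, 3)
    return (m[:, None] * g).sum(axis=0) / m.sum()


def recovery_rate(w_kf, w_v, w_ext):
    """RR = 100 * (Wkf + Wv - Wext) / (Wkf + Wv); 100% matches the inverted pendulum model."""
    return 100.0 * (w_kf + w_v - w_ext) / (w_kf + w_v)


# Hypothetical numbers, just to show the call pattern.
print(com_pacs([70.0, 68.0, 5.0],
               [[0.0, 0.0, 1.0], [0.6, 0.0, 1.0], [0.3, 0.0, 0.9]]))
print(recovery_rate(w_kf=25.0, w_v=30.0, w_ext=40.0))
```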

    Sensix force sensors recorded the forces and moments applied by each individual on the two box handles. Before the computation, the data of the sensors, located by specific markers, were transferred to the Galilean frame of the laboratory using a rotation matrix. A cross-correlation method was applied to analyze the coordination between the forces produced by both subjects. To investigate whether the movement of the box results from an action-reaction strategy, we computed the time lag required for the position of the left side and right side of the box to be the same on the medio-lateral, antero-posterior and vertical axes in CC and PC. The coordination was assessed through the forces exerted in three directions (medio-lateral, antero-posterior and vertical axes). These results reflect the level of coordination of the two subjects during collective transport. In order to quantify the muscular constraints produced at the upper limb, the Inverse Dynamic Method was used to estimate forces and moments at each joint of the upper limb. The Moment Cost Function was then computed (kg·m²·s⁻², Costes et al., 2018) as follows: $MCF = \sqrt{M_{L\_wt}^2} + \sqrt{M_{R\_wt}^2} + \sqrt{M_{L\_el}^2} + \sqrt{M_{R\_el}^2} + \sqrt{M_{L\_sh}^2} + \sqrt{M_{R\_sh}^2} + \sqrt{M_{back}^2} + \sqrt{M_{neck}^2}$, where ML_wt, MR_wt, ML_el, MR_el, ML_sh, MR_sh, Mback and Mneck are the mean values over a PACS gait cycle of the three-dimensional left and right wrist, left and right elbow, left and right shoulder, top of the back and neck moments, respectively. $\sqrt{M^2}$ represents the Euclidean norm of M (i.e. $\sqrt{M^2} = \sqrt{\sum_{i=1}^{3} (M_i)^2}$, with Mi the i-th component of the vector M). Then, the MCF values of each individual were summed to obtain the total moment cost function (Total MCF). This Total MCF quantifies the global effort produced at the upper limbs of the PACS during one gait cycle. Finally, the MCF difference (∆MCF) was computed as the difference between the two individuals to investigate whether the subjects produced the same effort in the upper limbs during the load transport.
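    For illustration, the reconstructed Moment Cost Function reduces to summing the Euclidean norms of the mean joint moments over a gait cycle; the sketch below uses placeholder values, not the study's data.

```python
# Sketch of the Moment Cost Function as reconstructed above: sum of Euclidean
# norms of the mean 3D joint moments over one gait cycle (placeholder values).
import numpy as np


def moment_cost_function(mean_joint_moments):
    """mean_joint_moments: dict joint -> mean 3D moment vector over the gait cycle."""
    return sum(np.linalg.norm(m) for m in mean_joint_moments.values())


moments = {
    "L_wrist": [1.2, 0.3, 0.1], "R_wrist": [1.1, 0.2, 0.2],
    "L_elbow": [4.0, 1.0, 0.5], "R_elbow": [3.8, 0.9, 0.6],
    "L_shoulder": [9.5, 2.0, 1.0], "R_shoulder": [9.1, 2.2, 1.1],
    "back": [15.0, 3.0, 2.0], "neck": [1.0, 0.2, 0.1],
}
print(f"MCF = {moment_cost_function(moments):.2f} kg.m^2.s^-2")
```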

    We extracted the synergies using a principal component analysis (PCA) applied to the wrist, elbow, shoulder, back, and neck joint moments on the right and left sides of the body. The PCA was used to reduce data dimensionality. It consisted of the eigen-decomposition of the covariance matrix of the joint moment data (Matlab eig function). The joint moment data from one trial per condition were arranged in time × joint moment matrices. In this analysis we only used the y-component, which is very close to the norm of the 3D joint moments, except that the y-component (medio-lateral) could be positive or negative. The joint moments were normalized by their amplitude and centered (mean removed) before application of the PCA. We called the eigenvectors extracted from the PCA dynamic synergy vectors. We computed the VAF (Variance Accounted For), which corresponded to the cumulative sum of the eigenvalues, ordered from the greatest to the lowest value, normalized by the total variance computed as the sum of all eigenvalues. The retained synergy vectors were then rotated using a Varimax rotation method to improve interpretability. We first extracted the synergy vectors for each experimental condition and each participant separately. In this analysis the initial data matrices consisted of all available time frames arranged in rows, concatenated from one trial per condition, and of eight columns corresponding to each joint moment, namely the right wrist, left wrist, right elbow, left elbow, right shoulder, left shoulder, back, and neck. Based on a previous study, we extracted 3 synergies in this analysis. We then performed a second analysis to identify possible co-variations between the joint moments of the two participants in each pair. The columns of the initial matrices were thus composed of the joint moments of the two loaded arms, i.e., the right wrist, elbow, and shoulder joint moments of participant #1, plus the left wrist, elbow and shoulder joint moments of participant #2. Based on a previous study, we extracted 2 synergies in this analysis. We used Pearson’s r to order the different synergies similarly between the different subjects and conditions.
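    The synergy-extraction step described above can be sketched as follows; this is illustrative only, the Varimax rotation is omitted, and the input is random placeholder data rather than the recorded joint moments.

```python
# Illustrative sketch: eigen-decomposition of the covariance matrix of normalized,
# centered joint-moment time series, plus the cumulative VAF (Varimax omitted).
import numpy as np


def extract_synergies(moments, n_synergies):
    """moments: (time x joints) array of joint-moment time series."""
    x = np.asarray(moments, dtype=float)
    x = x / np.abs(x).max(axis=0)              # normalize by amplitude
    x = x - x.mean(axis=0)                     # center (remove mean)
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric covariance matrix
    order = np.argsort(eigvals)[::-1]          # greatest to lowest eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    vaf = np.cumsum(eigvals) / eigvals.sum()   # cumulative Variance Accounted For
    return eigvecs[:, :n_synergies], vaf


rng = np.random.default_rng(0)
fake_moments = rng.standard_normal((500, 8))   # 500 frames x 8 joint moments (placeholder)
synergy_vectors, vaf = extract_synergies(fake_moments, n_synergies=3)
print(synergy_vectors.shape, vaf[:3])
```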

    A performance score (Scorep) was assigned to each image of the videos captured by the Vicon© system (200 images/s). The score depended on the location of the ball in the target: 1 when the ball was inside the small circle, 0.5 when it was in between the small and large circles, and 0 when it was outside the large circle. The accuracy over the whole gait cycle was measured by an overall score (Scoreaccuracy), expressed as a percentage, and calculated as follows: $Score_{accuracy} = \frac{\sum Score_p \times 100}{t_{gait\ cycle}}$, where tgait cycle represents the number of Vicon© images recorded along one gait cycle.

    The head, shoulders and pelvis rotation angles were computed around the vertical axis of each individual in the two conditions. The angle was positive when the subjects turned towards the box they carried, otherwise it was negative. The distance between the forehead and the sternum (distance FOR-STE) was also computed in order to investigate the flexion of the cervical spine.

    The data were analyzed with Matlab.

  16. h

    Data publication: Generating structured foam via flowing through a wire...

    • rodare.hzdr.de
    7z
    Updated Feb 26, 2025
    Cite
    Skrypnik, Artem; Knüpfer, Leon; Trtik, Pavel; Lappan, Tobias; Ziauddin, Muhammad; Heitkam, Sascha (2025). Data publication: Generating structured foam via flowing through a wire array [Dataset]. http://doi.org/10.14278/rodare.3583
    Explore at:
    7zAvailable download formats
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Technische Universität Dresden
    Paul Scherrer Institute
    Helmholtz-Zentrum Dresden-Rossendorf
    Technische Universität Dresden, Helmholtz-Zentrum Dresden-Rossendorf
    Authors
    Skrypnik, Artem; Knüpfer, Leon; Trtik, Pavel; Lappan, Tobias; Ziauddin, Muhammad; Heitkam, Sascha
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The structure of liquid foam is generally considered random and isotropic. However, when foam flows past a set of wires, an inhomogeneous liquid fraction distribution, or layering, can be observed within the bulk. This dataset presents neutron radiography data of foam flowing past a set of thin metal wires. During the experiments, the gas flow rate and bubble size were varied. Additionally, a dataset for foam flow past a single wire is included for reference.


    The folder includes initial data for the manuscript "Generating structured foam via flowing through a wire array".

    Folder includes:

    01_scripts: scripts used for the data processing
    02_rawdata: initial neutron imaging data (.tif images)
    03_evaluation: folder with MATLAB scripts used for data analysis

    LABBOOK: experimental labbook explaining the experimental sequence.
    Protocol: the neutron imaging protocol, with data on the neutron source and the image resolution.

    The data processing is shown for the O1 bubble generator. It includes:
    1. MASK_... script used to define the cell walls and determine the mask, used further for the liquid fraction calculation.
    2. N13_INIT... scripts to define the normalised image, which is further used to determine the liquid fraction distribution
    3. POST_BOT... scripts used to postprocess the data: define the liquid fraction distribution and the DFT of those distributions.

    Note:

    1. The data were analysed at two positions: bottom (0) and top (100), meaning at the wire grid and 100 mm downstream of the grid. To this end, the mask should also be calculated for the top part of the nozzle, if needed, as shown in the presented examples.

    2. The data for the empty cell were calculated for the foam flow through the cell with a single thin wire. The data were extracted in the ROI before the wire (run 553-557).

    3. Data processing was performed as suggested in https://doi.org/10.1371/journal.pone.0210300
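    As a rough, hypothetical sketch of the kind of processing these scripts perform (the authoritative implementation is the MATLAB code in 01_scripts and 03_evaluation, following the cited reference): normalize a radiograph against a reference image, reduce a region of interest to a horizontal profile standing in for the liquid fraction distribution, and take its DFT to look for the layering imposed by the wire array. The liquid-fraction calibration is deliberately simplified here.

```python
# Hypothetical, simplified sketch of the processing chain described above.
import numpy as np


def liquid_fraction_profile(image, reference, roi):
    """Very simplified stand-in for the calibration in the cited reference."""
    r0, r1, c0, c1 = roi
    norm = image[r0:r1, c0:c1] / reference[r0:r1, c0:c1]   # flat-field style normalization
    attenuation = -np.log(np.clip(norm, 1e-6, None))       # proxy for liquid content
    return attenuation.mean(axis=0)                        # profile across the channel width


def layering_spectrum(profile, pixel_size_mm):
    """DFT of the (mean-removed) profile; peaks indicate a layering periodicity."""
    spectrum = np.abs(np.fft.rfft(profile - profile.mean()))
    freqs = np.fft.rfftfreq(profile.size, d=pixel_size_mm)  # cycles per mm
    return freqs, spectrum


# Synthetic stand-in images, just to show the call pattern.
img = np.random.rand(200, 300) * 0.5 + 0.5
ref = np.ones((200, 300))
profile = liquid_fraction_profile(img, ref, roi=(50, 150, 20, 280))
freqs, spec = layering_spectrum(profile, pixel_size_mm=0.1)
print(freqs[np.argmax(spec[1:]) + 1], "cycles/mm dominates")
```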

  17. Z

    Data for paper "Ultraluminous X-ray sources in Globular Clusters"

    • data-staging.niaid.nih.gov
    Updated Mar 1, 2025
    Cite
    Wiktorowicz, Grzegorz; Giersz, Mirek; Askar, Abbas; Hypki, Arkadiusz; Hellström, Lucas (2025). Data for paper "Ultraluminous X-ray sources in Globular Clusters" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14953836
    Explore at:
    Dataset updated
    Mar 1, 2025
    Dataset provided by
    Nicolaus Copernicus Astronomical Center of the Polish Academy of Sciences
    Nicolaus Copernicus Astronomical Center
    National Astronomical Observatories
    Authors
    Wiktorowicz, Grzegorz; Giersz, Mirek; Askar, Abbas; Hypki, Arkadiusz; Hellström, Lucas
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This repository contains data used in the research published in "Ultraluminous X-ray sources in Globular Clusters" (Wiktorowicz et al. 2025) available at https://arxiv.org/abs/2501.06037.

    The dataset includes simulation data on Ultra-Luminous X-ray (ULX) systems and their host globular clusters (GCs).

    Files

    Data Files
    - ulx_240711_escapers.csv: Contains data on ULX systems for every timestep
    - system_240711_escapers.csv: Contains data on GC properties throughout the simulation
    - snapshot0_240711_escapers.csv: Contains data on initial properties (Zero Age Main Sequence) for binaries whose component or components later become part of a ULX (as accretor or donor)

    Header Files
    - ulx_header.csv: Column definitions for the ULX systems data
    - system_header.csv: Column definitions for the GC properties data
    - snapshot0_header.csv: Column definitions for the initial properties data

    Data Usage

    For analysis and visualizations based on this data, please refer to our paper.
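    A hypothetical loading sketch, assuming each *_header.csv simply lists the column names for the corresponding data file, which itself ships without a header row; check the header files before relying on this.

```python
# Hypothetical loading sketch (not from the paper); the header-file layout is an assumption.
import pandas as pd


def load_with_header(data_file, header_file):
    # Assumed: the header file holds one row of column names for the data file.
    columns = pd.read_csv(header_file).columns.tolist()
    return pd.read_csv(data_file, names=columns, header=None)


ulx = load_with_header("ulx_240711_escapers.csv", "ulx_header.csv")
systems = load_with_header("system_240711_escapers.csv", "system_header.csv")
print(ulx.shape, systems.shape)
```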

    Citation

    If you use this data in your research, please cite:

    @ARTICLE{2025arXiv250106037W,
        author = {{Wiktorowicz}, Grzegorz and {Giersz}, Mirek and {Askar}, Abbas and {Hypki}, Arkadiusz and {Helstrom}, Lucas},
        title = "{Ultraluminous X-ray sources in Globular Clusters}",
        journal = {arXiv e-prints},
        keywords = {Astrophysics - High Energy Astrophysical Phenomena, Astrophysics - Astrophysics of Galaxies},
        year = 2025,
        month = jan,
        eid = {arXiv:2501.06037},
        pages = {arXiv:2501.06037},
        doi = {10.48550/arXiv.2501.06037},
        archivePrefix = {arXiv},
        eprint = {2501.06037},
        primaryClass = {astro-ph.HE},
        adsurl = {https://ui.adsabs.harvard.edu/abs/2025arXiv250106037W},
        adsnote = {Provided by the SAO/NASA Astrophysics Data System}
    }

    Contact

    For questions regarding this dataset, please contact:

    Grzegorz Wiktorowicz (gwiktoro@camk.edu.pl), Nicolaus Copernicus Astronomical Center, Polish Academy of Sciences

  18. Maple code for algorithm 3.

    • plos.figshare.com
    xml
    Updated Jul 17, 2025
    Cite
    Peter Thompson; Benjamin Jan Andersson; Nicolas Sundqvist; Gunnar Cedersund (2025). Maple code for algorithm 3. [Dataset]. http://doi.org/10.1371/journal.pone.0327593.s002
    Explore at:
    xmlAvailable download formats
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Peter Thompson; Benjamin Jan Andersson; Nicolas Sundqvist; Gunnar Cedersund
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background and objective: Practical identifiability analysis, i.e., ascertaining whether a model property can be determined from given data, is central to model-based data analysis in biomedicine. The main approaches used today all require that coverage of the parameter space be exhaustive, which is usually impossible. An alternative could be using structural identifiability methods, since they do not need such coverage. However, current structural methods are unsuited for practical identifiability analysis, since they assume that all higher-order derivatives of the measured variables are available. Herein, we provide new definitions and methods that allow for this assumption to be relaxed. Methods and results: We introduce the concept of -identifiability, which differs from previous definitions in that it assumes that only the first derivatives of the measurement signal yi are available. This new type of identifiability can be determined using our new algorithms, as is demonstrated by applications to various published biomedical models. Our methods allow for identifiability analysis not only of parameters, but of any model property, i.e., observability. These new results provide further strengthening of conclusions made in previous analyses of these models. For the first time, we can quantify the impact of the assumption that all derivatives are available in specific examples. If one, e.g., assumes that only up to third-order derivatives, instead of all derivatives, are available, the number of identifiable parameters drops from 17 to 1 for a Drosophila model, and from 21 to 6 for an NF-κB model. In both models, the previously obtained identifiability is present only if at least 20 derivatives of all measurement signals are available. Conclusion: Our results demonstrate that the assumption regarding availability of derivatives made in traditional structural identifiability analysis leads to a considerable overestimation of the number of parameters that can be estimated. Our new methods and algorithms allow for this assumption to be relaxed, bringing structural identifiability methodology one step closer to practical identifiability analysis.

  19. d

    Data from: Baseline for the coast of Puerto Rico's main island generated to...

    • catalog.data.gov
    • data.usgs.gov
    • +3more
    Updated Nov 21, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Baseline for the coast of Puerto Rico's main island generated to calculate shoreline change rates using the Digital Shoreline Analysis System version 5.1 (ver. 2.0, March 2023) [Dataset]. https://catalog.data.gov/dataset/baseline-for-the-coast-of-puerto-ricos-main-island-generated-to-calculate-shoreline-change
    Explore at:
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    Puerto Rico
    Description

    The U.S. Geological Survey (USGS) maintains shoreline positions for the United States' coasts from both older sources, such as aerial photographs or topographic surveys, and contemporary sources, such as lidar-point clouds and digital elevation models. These shorelines are compiled and analyzed in the USGS Digital Shoreline Analysis System (DSAS), version 5.1 software to calculate rates of change. Keeping a record of historical shoreline positions is an effective method to monitor change over time, enabling scientists to identify areas most susceptible to erosion or accretion. These data can help coastal managers understand which areas of the coast are vulnerable to change. This data release, and other associated products, represent an expansion of the USGS national-scale shoreline database to include Puerto Rico and its islands, Vieques and Culebra. The USGS, in cooperation with the Coastal Research and Planning Institute of Puerto Rico—part of the Graduate School of Planning at the University of Puerto Rico, Rio Piedras Campus—has derived and compiled a database of historical shoreline positions using a variety of methods. These historical shoreline data are then used to measure the rate of shoreline change over time. Rate calculations are computed within a geographic information system (GIS) using the DSAS version 5.1 software. Starting from a user defined baseline, measurement transects are created by DSAS that intersect the shoreline vectors. The resulting intersections provide the location and time information necessary to calculate rates of shoreline change. The overall project contains shorelines, baselines, shoreline change rates (long-term and short-term), and shoreline intersects (long-term and short-term), for Puerto Rico, and the adjacent islands of Vieques and Culebra.
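    To illustrate the idea behind the rate calculation (this is not the DSAS implementation), one can fit a linear regression to the dates and baseline-relative distances at which the historical shorelines intersect a single transect; the numbers below are invented.

```python
# Simplified sketch of a per-transect shoreline change rate: linear regression
# of baseline-relative intersection distances against time (rate in m/yr).
import numpy as np


def shoreline_change_rate(years, distances_m):
    """Linear-regression rate for one transect; negative values indicate erosion."""
    years = np.asarray(years, dtype=float)
    distances_m = np.asarray(distances_m, dtype=float)
    slope, _intercept = np.polyfit(years, distances_m, deg=1)
    return slope


# Hypothetical intersections for one transect.
print(shoreline_change_rate([1970, 1993, 2010, 2018], [152.0, 148.5, 141.2, 138.9]))
```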

  20. w

    Fragmentation Main Model

    • data.wu.ac.at
    • gstore.unm.edu
    • +2more
    html, xml, zip
    Updated Jun 25, 2014
    + more versions
    Cite
    Earth Data Analysis Center, University of New Mexico (2014). Fragmentation Main Model [Dataset]. https://data.wu.ac.at/schema/data_gov/OGIwNWVhZGQtM2FmOC00NzQ4LWJmZWYtNDc0NDg1MjhjZDBh
    Explore at:
    zip, html, xmlAvailable download formats
    Dataset updated
    Jun 25, 2014
    Dataset provided by
    Earth Data Analysis Center, University of New Mexico
    Description

    The fragmentation model combines patch size and patch continuity with diversity of vegetation types per patch and rarity of vegetation types per patch. A patch was defined as an area of natural vegetation not bisected by roads, utilities, or rails. Patch size and continuity were calculated separately for forests, woodlands, shrublands, grasslands and riparian areas. Definitions of each system type can be found in the data atlas (http://allaboutwatersheds.org/groups/SAS/public/data-atlases).
