Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
The deid software package includes code and dictionaries for automated location and removal of protected health information (PHI) in free text from medical records.
This dataset is designed to analyze prescription drug use and spending among New York State residents at the drug product level (pharmacy claims data that have been aggregated by labeler code and product code segments of the National Drug Code). The dataset includes the number of prescriptions filled by unique members by payer type, nonproprietary name, labeler name, dosage characteristics, amount insurer paid, and more.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains the dataset files and the code used for feature engineering in the paper titled "Open Data, Private Learners: A De-Identified Dataset for Learning Analytics Research", submitted to the Nature Scientific Data journal.
https://rdr.kuleuven.be/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.48804/MBT32W
This repository introduces a de-identified and spatially aggregated user activity dataset in the United States, which contains approximately 2 million users and 1.2 billion data points collected from 2012 to 2019. This data is stored in a series of Parquet files, totaling 13GB, in the data/parquet/ directory. A smaller subset with only points within the Denver timezone can be found in data/sample/.
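A minimal sketch of loading the released files in R, assuming the arrow and dplyr packages; the description above does not list column names, so the schema is inspected rather than assumed:

# R sketch: lazily open the Parquet directory described above without reading
# all 1.2 billion rows into memory. Paths follow the repository layout; inspect
# the schema before querying, since column names are not documented here.
library(arrow)
library(dplyr)

activity <- open_dataset("data/parquet/")        # full dataset (~13 GB of Parquet files)
print(activity$schema)                           # list available columns

sample_activity <- open_dataset("data/sample/")  # Denver-timezone subset
sample_activity %>% head(5) %>% collect()        # peek at a few rows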
https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional enterprise data warehouse (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
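The following is a minimal, hedged sketch of invoking that single-line command in R; the sourcing step and the file name of the mock 'untidy-format' CSV are assumptions, not taken from the repository:

# R sketch (assumes the eLAB functions have been sourced from the GitHub repo
# and that the mock untidy-format data is a CSV; the actual file name may differ).
library(tidyverse)
source("eLAB.R")                        # hypothetical path to the eLAB source code

# Expected columns: Patient Name (MRN), Collection Date, Collection Time,
# Lab Results (several lab panels packed into one cell).
dt <- read_csv("mock_untidy_labs.csv")  # hypothetical mock file name

labs_long <- ehr_format(dt)             # reshape the untidy pull, one row per lab result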
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
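The sketch below illustrates the key-value remapping idea in R; the lookup rows shown are illustrative rather than the actual ~300-entry eLAB table, and the input data frame and column names are hypothetical:

# R sketch: join raw EHR lab names against a lookup table that maps each lab
# subtype to its Data Dictionary (DD) code, then keep only DD-approved units.
library(dplyr)
library(tibble)

lab_lookup <- tribble(
  ~raw_name,              ~dd_code,     ~dd_unit,
  "Potassium",            "potassium",  "mmol/L",
  "Potassium-External",   "potassium",  "mmol/L",
  "Potassium(POC)",       "potassium",  "mmol/L",
  "Potassium,whole-bld",  "potassium",  "mmol/L"
)

remapped <- raw_labs %>%                                  # raw_labs: hypothetical EHR pull
  inner_join(lab_lookup, by = c("lab_name" = "raw_name")) %>%
  filter(lab_unit == dd_unit)                             # drop units not defined by the DD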
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
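Because every site shares the DD, multi-site aggregation reduces to stacking files; a small sketch in R, with hypothetical file names:

# R sketch: per-site CSV exports that follow the same Data Dictionary can be
# combined directly, since field names and codes are identical across sites.
library(dplyr)
library(readr)

site_files <- c("site_A_labs.csv", "site_B_labs.csv", "site_C_labs.csv")  # hypothetical
mcc_labs   <- bind_rows(lapply(site_files, read_csv))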
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
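A minimal sketch of this univariable screen using the survival package already listed among eLAB's dependencies; the data frame and column names are hypothetical:

# R sketch: fit one Cox proportional hazards model per baseline lab predictor.
library(survival)

lab_vars <- c("potassium", "hemoglobin", "creatinine")          # example predictors

univariable_fits <- lapply(lab_vars, function(v) {
  coxph(as.formula(paste("Surv(os_months, death_event) ~", v)),
        data = baseline_labs)                                   # hypothetical data frame
})
names(univariable_fits) <- lab_vars
lapply(univariable_fits, summary)   # exploratory p-values; no Bonferroni correction applied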
General information
The dataset contains de-identified messaging meta-data from 78 WhatsApp and 7 Facebook data donations. The dataset was collected in an online study using the data donation platform Dona. After donating their messaging data, the study participants viewed visual summaries of their messaging data and evaluated this visual feedback. The responses to the evaluation questions and the sociodemographic data of the participants are also included in the dataset.
The data was collected from August 2022 to June 2024.
For more information on Dona, the associated publications and updates, please visit https://mbp-lab.github.io/dona-blog/.
File description
donation_table.csv - contains general information about the donations
messages_table.csv - contains the donated messages
messages_filtered_table.csv - same structure as messages_table.csv, except that chats with no considerable interaction were removed; these were defined as chats where the donor's word-count contribution was less than 10% or more than 90% (see the sketch below)
survey.xlsx - contains the survey responses of the participants
survey_table_coding.xlsx - contains the mapping between the column names in survey.xlsx and their meaning, including the original survey questions and response options; different sheets of the Excel file detail the survey questions and responses in one of the study languages (English, German, Armenian)
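The filtering rule behind messages_filtered_table.csv can be expressed compactly; a hedged R sketch, with hypothetical column names (chat_id, is_donor, word_count):

# R sketch: drop chats where the donor contributed <10% or >90% of the words.
library(dplyr)

chat_share <- messages %>%                       # messages: hypothetical long table
  group_by(chat_id) %>%
  summarise(donor_share = sum(word_count[is_donor]) / sum(word_count))

kept_chats <- filter(chat_share, donor_share >= 0.10, donor_share <= 0.90)

messages_filtered <- semi_join(messages, kept_chats, by = "chat_id")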
https://spdx.org/licenses/CC0-1.0.html
Objective: Assess the effectiveness of providing the Logical Observation Identifiers Names and Codes (LOINC®)-to-In Vitro Diagnostic (LIVD) coding specification, required by the United States Department of Health and Human Services for SARS-CoV-2 reporting, in medical center laboratories, and utilize findings to inform future United States Food and Drug Administration policy on the use of real-world evidence in regulatory decisions.

Materials and Methods: We compared gaps and similarities between diagnostic test manufacturers' recommended LOINC® codes and the LOINC® codes used in medical center laboratories for the same tests.

Results: Five medical centers and three test manufacturers extracted data from laboratory information systems (LIS) for prioritized tests of interest. The data submission ranged from 74 to 532 LOINC® codes per site. Three test manufacturers submitted 15 LIVD catalogs representing 26 distinct devices, 6,956 tests, and 686 LOINC® codes. We identified mismatches in how medical centers use LOINC® to encode laboratory tests compared to how test manufacturers encode the same laboratory tests. Of 331 tests available in the LIVD files, 136 (41%) were represented by a mismatched LOINC® code by the medical centers (chi-square 45.0, 4 df, P < .0001).

Discussion: The five medical centers and three test manufacturers vary in how they organize, categorize, and store LIS catalog information. This variation impacts data quality and interoperability.

Conclusion: The results of the study indicate that providing the LIVD mappings was not sufficient to support laboratory data interoperability. National implementation of LIVD and further efforts to promote laboratory interoperability will require a more comprehensive effort and continuing evaluation and quality control.

Methods: Five medical centers and three test manufacturers extracted data from laboratory information systems (LIS) for prioritized tests of interest. The data submission ranged from 74 to 532 LOINC® codes per site. Three test manufacturers submitted 15 LIVD catalogs representing 26 distinct devices, 6,956 tests, and 686 LOINC® codes. We identified mismatches in how medical centers use LOINC® to encode laboratory tests compared to how test manufacturers encode the same laboratory tests. Of 331 tests available in the LIVD files, 136 (41%) were represented by a mismatched LOINC® code by the medical centers (chi-square 45.0, 4 df, P < .0001).

Data Collection from Medical Center Laboratory Pilot Sites: Each medical center was asked to extract about 100 LOINC® codes from their LIS for prioritized tests of interest focused on high-risk conditions and SARS-CoV-2. For each selected test (e.g., SARS-CoV-2 RNA COVID-19), we collected the following data elements: test names/descriptions (e.g., SARS coronavirus 2 RNA [Presence] in Respiratory specimen by NAA with probe detection), associated instruments (e.g., IVD Vendor Model), and LOINC® codes (e.g., 94500-6). High-risk conditions were defined by referencing the CDC's published list of Underlying Medical Conditions Associated with High Risk for Severe COVID-19.[29] A data collection template spreadsheet was created and disseminated to the medical centers to help provide consistency and reporting clarity for data elements from sites.

Data Collection from IVD Manufacturers: We coordinated with SHIELD stakeholders and the IICC to request manufacturer LIVD catalogs containing the LOINC® codes per IVD instrument per test from manufacturers.
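A hedged sketch of the core comparison in R; the data frames, join keys, and column names are hypothetical, and the study's actual matching and chi-square analysis were more involved:

# R sketch: flag laboratory tests whose LIS-assigned LOINC code differs from the
# manufacturer's LIVD-recommended LOINC code for the same device and test.
library(dplyr)

comparison <- lis_catalog %>%                       # medical-center LIS extract (hypothetical)
  inner_join(livd_catalog,                          # manufacturer LIVD catalog (hypothetical)
             by = c("device_id", "test_name"),
             suffix = c("_lis", "_livd"))

mismatches <- filter(comparison, loinc_code_lis != loinc_code_livd)
nrow(mismatches) / nrow(comparison)                 # proportion of mismatched tests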
Users of the V-safe data are required to adhere to the following standards for the analysis and reporting of research data. All research results must be presented and/or published in a manner that protects the confidentiality of participants. V-safe data will not be presented and/or published in any way in which an individual can be identified.
Therefore, users will:
V-safe is an active surveillance program to monitor the safety of COVID-19 vaccines that are authorized for use under U.S. Food and Drug Administration (FDA) Emergency Use Authorization (EUA) and after FDA licensure.
These data include MedDRA coded text responses collected through V-safe from 12/13/2020 to 06/30/2023. Please review the V-safe data user agreement before analyzing any V-safe data.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides de-identified insurance data for hypertension and hyperlipidemia from three managed care organizations in Allegheny County: Gateway Health Plan, Highmark Health, and UPMC. The data represents the insured population for the 2015 and 2016 calendar years.
The dataset includes valuable insights into the health conditions of individuals covered by these plans but comes with several limitations:
Misclassification and Duplicate Individuals: As administrative claims data was not collected for surveillance purposes, there may be errors in categorizing conditions or identifying individuals.
Exclusions: It does not include individuals who were uninsured, not enrolled in one of the represented plans, or were enrolled for less than 90 days.
Missing Data: The dataset excludes individuals who did not seek care within the past two years or were enrolled in plans not represented in the dataset.
Disclaimer: Users should exercise caution when using this data to assess disease prevalence or interpret trends over time, as the data was collected for purposes other than public health surveillance.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In this project, we work on repairing three datasets:
Sites, identified by country_protocol_code, conduct the same clinical trial, which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial. The ground-truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises other unstructured attributes that categorize the inclusion criteria for trial participants, such as inclusion.code. In the allergens dataset, samples with the same code represent the same product but are extracted from a different source. The allergens are indicated by '2' if present, '1' if there are traces of them, and '0' if they are absent in a product. The dataset also includes information on ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients. N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets.
https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/2.2/customlicense?persistentId=doi:10.11588/DATA/10089
This dataset provides the code and the data sets used in the PhD thesis "Identification of Software Features in Issue Tracking System Data", as well as the files that represent the results measured in experiments. For problem studies (e.g. chapters 10 and 11) the folders include the raw data and the data annotations as well as the tools used to extract the data. For solution studies (e.g. chapters 14, 15, and 16) the folders include the raw data, the tools used to extract the data, the gold standards, the code to process the data, and finally the experiment results. This archive contains one folder per chapter, and every folder contains a README.md file describing its contents:
Chapter10: SOFTWARE FEATURES IN ISSUE TRACKING SYSTEMS – AN EMPIRICAL STUDY
Chapter11: ISSUE TYPES AND INFORMATION TYPES – AN EMPIRICAL STUDY
Chapter14: PREPROCESSING ISSUES – AN EMPIRICAL STUDY
Chapter15: RECOVERING RELATED ISSUES IN ISSUE TRACKING SYSTEMS
Chapter16: DETECTING SOFTWARE FEATURE REQUESTS IN ISSUES
Data Overview: ASPR, in partnership with the Centers for Medicare and Medicaid Services (CMS), provides de-identified and aggregated Medicare beneficiary claims data at the state/territory, county, and ZIP code levels in the HHS emPOWER Map and this public HHS emPOWER REST Service. The REST Service includes aggregated data from the Medicare Fee-For-Service (Parts A & B) and Medicare Advantage (Part C) Programs for beneficiaries who rely on electricity-dependent durable medical equipment (DME) and cardiac implantable devices.
Data includes the following DME and devices: cardiac devices (left, right, and bi-ventricular assistive devices (LVAD, RVAD, BIVAD) and total artificial hearts (TAH)), ventilators (invasive, non-invasive, and oscillating vests), bi-level positive airway pressure device (BiPAP), oxygen concentrator, enteral feeding tube, intravenous (IV) infusion pump, suction pump, end-stage renal disease (ESRD) at-home dialysis, motorized wheelchair or scooter, and electric bed.
Purpose: Over 3 million Medicare beneficiaries rely on electricity-dependent medical equipment, such as ventilators, to live independently in their homes. Severe weather and other emergencies, especially those with long power outages, can be life-threatening for these individuals. The HHS emPOWER Map and public REST Service give every public health official, emergency manager, hospital, first responder, electric company, and community member the power to discover the electricity-dependent Medicare population in their state/territory, county, and ZIP Code.
Data Source: The REST Service's data is developed from Medicare Fee-For-Service (Parts A & B) (>33M 65+, blind, ESRD [dialysis], dual-eligible, disabled [adults and children]) and Medicare Advantage (Part C) (>21M 65+, blind, ESRD [dialysis], dual-eligible, disabled [adults and children]) beneficiary administrative claims data. This data does not include individuals that are only enrolled in a State Medicaid Program. Note that Medicare DME are subject to insurance claim reimbursement caps (e.g. rental caps) that differ by type, so the DME may have different "look-back" periods (e.g. ventilators are 13 months and oxygen concentrators are 36 months).
ZIP Code Aggregation: Some ZIP Codes do not have specific geospatial boundary data (e.g., P.O. Box ZIP Codes). To capture the complete population data, the HHS emPOWER Program identified the larger boundary ZIP Code (Parent) within which the non-boundary ZIP Code (Child) resides. The totals are added together and displayed under the parent ZIP Code.
Approved Data Uses: The public HHS emPOWER REST Service is approved for use by all partners and is intended to be used to help inform and support emergency preparedness, response, recovery, and mitigation activities in all communities.
Privacy Protections: Protecting the privacy of Medicare beneficiaries is an essential priority for the HHS emPOWER Program. Therefore, all personally identifiable information is removed from the data and numerous de-identification methods are applied to significantly minimize, if not completely mitigate, any potential for deduction of small cells or re-identification risk. For example, any cell size in the range of 1 to 10 is masked and shown as 11.
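A small sketch of the two rules above (parent/child ZIP Code rollup and small-cell masking) in R, assuming a hypothetical beneficiary-count table and child-to-parent ZIP crosswalk:

# R sketch: roll non-boundary (child) ZIP Codes up into their parent ZIP Code,
# then mask any resulting cell between 1 and 10 by showing it as 11.
library(dplyr)

rolled_up <- zip_counts %>%                          # columns: zip, beneficiaries (hypothetical)
  left_join(zip_crosswalk, by = c("zip" = "child_zip")) %>%
  mutate(report_zip = coalesce(parent_zip, zip)) %>%
  group_by(report_zip) %>%
  summarise(beneficiaries = sum(beneficiaries)) %>%
  mutate(beneficiaries = ifelse(beneficiaries >= 1 & beneficiaries <= 10, 11, beneficiaries))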
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor "A DICOM dataset for evaluation of medical image de-identification". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains around 30,000 basic blocks whose energy consumption and execution time have been measured in isolation on the MSP430FR5969 microcontroller, at 1 MHz. Basic blocks were executed in a worst-case scenario regarding the MSP430 FRAM cache and CPU pipeline. The dataset creation process is described thoroughly in [1].
This dataset is composed of the following files:
basic_blocks.tar.xz contains all basic blocks (BB) used in the dataset, in a custom JSON format
data.csv/data.xlsx contains the measured energy consumption and execution time for each basic block
We first detail how the basic_blocks.tar.xz archive is organized, and then present the CSV/XLSX spreadsheet format.
We extracted the basic blocks from a subset of programs of the AnghaBench benchmark suite [2]. The basic_blocks.tar.xz archive consists of the extracted basic blocks organized as JSON files. Each JSON file corresponds to a C source file from AnghaBench and is given a unique identifier. An example JSON (137.json) is shown below:
{
"extr_pfctl_altq.c_pfctl_altq_init": [
# Basic block 1
[
# Instruction 1 of BB1
[
"MOV.W",
"#queue_map",
"R13"
],
# Instruction 2 of BB1
[
"MOV.B",
"#0",
"R14"
],
# Instruction 3 of BB1
[
"CALL",
"#hcreate_r",
null
]
],
# Basic block 2
[
....
]
]
}
The JSON contains a dict with only one key, pointing to an array of basic blocks. This key is the name of the original C source file in AnghaBench from which the basic blocks were extracted (here extr_pfctl_altq.c_pfctl_altq_init.c). The array contains several basic blocks, each represented as an array of instructions, which are themselves represented as an array [OPCODE, OPERAND1, OPERAND2].
Then, each basic block can be identified uniquely using two ids: its file id and its offset in the file. In our example, basic block 1 can be identified by the JSON file id (137) and its offset in the file (0); its ID is therefore 137_0. This ID is used to map a basic block to its energy consumption/execution time in the data.csv/data.xlsx spreadsheet.
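A minimal R sketch of reading one extracted JSON file and building these IDs, assuming the archive has been unpacked into a local basic_blocks/ directory:

# R sketch (jsonlite): parse 137.json and derive the <file id>_<offset> IDs.
library(jsonlite)

bb_file  <- fromJSON("basic_blocks/137.json", simplifyVector = FALSE)
src_name <- names(bb_file)[1]              # original AnghaBench source file name
blocks   <- bb_file[[src_name]]            # list of basic blocks

bb_ids <- paste0("137_", seq_along(blocks) - 1)   # "137_0", "137_1", ...

first_block <- blocks[[1]]                 # list of [OPCODE, OPERAND1, OPERAND2] triples
sapply(first_block, `[[`, 1)               # opcodes of basic block 137_0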
Energy consumption and execution time data are stored in the data.csv file. Here is the extract of the csv file corresponding to the basic block 137_0. The spreadsheet format is described below.
bb_id;nb_inst;max_energy;max_time;avg_time;avg_energy;energy_per_inst;nb_samples;unroll_factor
137_0;3;8.77;7.08;7.04;8.21;2.92;40;50
Spreadsheet format:
bb_id: the unique identifier of a basic block (cf. Basic Blocks)
nb_inst: the number of instructions in the basic block
max_energy: the maximum energy consumption (in nJ) measured during the experiment
max_time: the maximum execution time (in us) measured during the experiment
avg_time: the average execution time (in us) measured during the experiment
avg_energy: the average energy consumption (in nJ) measured during the experiment
energy_per_inst: the average energy consumption per instruction (corresponds to avg_energy/nb_inst)
nb_samples: how many times the basic block energy consumption/execution time was measured
unroll_factor: how many times the basic block was unrolled (cf. Basic Block Unrolling)
To measure the energy consumption and execution time on the MSP430, we need to be able to handle the scale difference between the measurement tool and the basic block execution time. This is achieved by duplicating the basic block multiple times while making sure to keep the worst-case memory layout, as explained in the paper. The number of times the basic block has been duplicated is called the unroll_factor.
Values of energy and time are always given per basic block, so they have already been divided by the unroll factor.
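A short R sketch of reading the measurement spreadsheet and retrieving the example block; the file is semicolon-separated with a decimal point, so read.csv with sep = ";" is used:

# R sketch: load data.csv and look up basic block 137_0.
measurements <- read.csv("data.csv", sep = ";", stringsAsFactors = FALSE)

subset(measurements, bb_id == "137_0",
       select = c(bb_id, nb_inst, avg_energy, avg_time, unroll_factor))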
The selected features after PCA analysis for both energy and time model are listed here: MOV.W_Rn_Rn, MOV.W_X(Rn)_X(Rn), CALL, MOV.B_#N_Rn, ADD.W_Rn_Rn, MOV.W_@Rn_Rn, MOV.W_X(Rn)_Rn, ADD.W_#N_Rn, PUSHM.W_#N_Rn, MOV.W_X(Rn)_ADDR, CMP.W_#N_Rn, MOV.W_&ADDR_X(Rn), MOV.W_Rn_X(Rn), BIS.W_Rn_Rn, RLAM.W_#N_Rn, SUB.W_#N_Rn, MOV.W_&ADDR_Rn, MOV.W_#N_X(Rn), CMP.W_Rn_Rn, BIT.W_ADDR_Rn, MOV.W_@Rn_X(Rn), ADD.W_#N_X(Rn), MOV.W_#N_Rn, AND.W_Rn_Rn, MOV.W_Rn_ADDR, SUB.W_Rn_Rn, MOV.W_ADDR_Rn, MOV.W_X(Rn)_&ADDR, MOV.W_ADDR_ADDR, JMP, ADD_#N_Rn, BIS.W_Rn_X(Rn), SUB_Rn_Rn, MOV.W_ADDR_X(Rn), ADDC_#N_X(Rn), MOV.B_Rn_Rn, CMP.W_X(Rn)_X(Rn), ADD_Rn_Rn, nb_inst, INV.W_Rn_, NOP_, ADD.W_X(Rn)_X(Rn), ADD.W_Rn_X(Rn), MOV.B_@Rn_Rn, BIS.W_X(Rn)_X(Rn), MOV.B_#N_X(Rn), MOV.W_#N_ADDR, AND.W_#N_ADDR, SUBC_X(Rn)_X(Rn), BIS.W_#N_X(Rn), SUB.W_X(Rn)_X(Rn), AND.B_#N_Rn, ADD_X(Rn)_X(Rn), MOV.W_@Rn_ADDR, MOV.W_&ADDR_ADDR, ADDC_Rn_Rn, AND.W_#N_X(Rn), SUB_#N_Rn, RRUM.W_#N_Rn, AND_ADDR_Rn, CMP.W_X(Rn)_ADDR, MOV.B_#N_ADDR, ADD.W_#N_ADDR, CMP.B_#N_Rn, SXT_Rn_, XOR.W_Rn_Rn, CMP.W_@Rn_Rn, ADD.W_@Rn_Rn, ADD.W_X(Rn)_Rn, AND.W_Rn_X(Rn), CMP.B_Rn_Rn, AND.W_X(Rn)_X(Rn), BIC.W_#N_Rn, BIS.W_#N_Rn, AND.B_#N_X(Rn), MOV.B_X(Rn)_X(Rn), AND.W_@Rn_Rn, MOV.W_#N_&ADDR, BIS.W_Rn_ADDR, SUB.W_X(Rn)_Rn, SUB.W_Rn_X(Rn), SUB_X(Rn)_X(Rn), MOV.B_@Rn_X(Rn), CMP.W_@Rn_X(Rn), ADD.W_X(Rn)_ADDR, CMP.W_Rn_X(Rn), BIS.W_@Rn_X(Rn), CMP.B_X(Rn)_X(Rn), RRC.W_Rn_, MOV.W_@Rn_&ADDR, CMP.W_#N_X(Rn), ADDC_X(Rn)_Rn, CMP.W_X(Rn)_Rn, BIS.W_X(Rn)_Rn, SUB_X(Rn)_Rn, MOV.B_X(Rn)_Rn, MOV.W_ADDR_&ADDR, AND.W_#N_Rn, RLA.W_Rn_, INV.W_X(Rn)_, XOR.W_#N_Rn, SUB.W_Rn_ADDR, BIC.W_#N_X(Rn), MOV.B_X(Rn)_ADDR, ADD_#N_X(Rn), SUB_Rn_X(Rn), MOV.B_&ADDR_Rn, MOV.W_Rn_&ADDR, ADD_X(Rn)_Rn, AND.W_X(Rn)_Rn, PUSHM.A_#N_Rn, RRAM.W_#N_Rn, AND.W_@Rn_X(Rn), BIS.B_Rn_X(Rn), SUB.W_@Rn_Rn, CLRC_, CMP.W_#N_ADDR, XOR.W_Rn_X(Rn), MOV.B_Rn_ADDR, CMP.B_X(Rn)_Rn, BIS.B_Rn_Rn, BIS.W_X(Rn)_ADDR, CMP.B_#N_X(Rn), CMP.W_Rn_ADDR, XOR.W_X(Rn)_Rn, MOV.B_Rn_X(Rn), ADD.B_#N_Rn
The trained machine learning model, tests, and local explanation code can be generated and found here: WORTEX Machine learning code
This work received French government support granted to the Labex CominLabs excellence laboratory and managed by the National Research Agency in the "Investing for the Future" program under reference ANR-10-LABX-07-01.
Copyright 2024 Hector Chabot Copyright 2024 Abderaouf Nassim Amalou Copyright 2024 Hugo Reymond Copyright 2024 Isabelle Puaut
Licensed under the Creative Commons Attribution 4.0 International License
[1] Reymond, H., Amalou, A. N., Puaut, I. “WORTEX: Worst-Case Execution Time and Energy Estimation in Low-Power Microprocessors using Explainable ML” in 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024)
[2] Da Silva, Anderson Faustino, et al. “Anghabench: A suite with one million compilable C benchmarks for code-size reduction.” 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2021.
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; "Simulated_Dataset.RData".

Metadata (including data dictionary):
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code Abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description: "CWVS_LMC.txt": This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities. "Results_Summary.txt": This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the "CWVS_LMC.txt" code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Required R packages:
• For running "CWVS_LMC.txt": msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
• For running "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)

Instructions for Use and Reproducibility: What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
• Load the "Simulated_Dataset.RData" workspace
• Run the code contained in "CWVS_LMC.txt"
• Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt"

Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set.

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
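A minimal R sketch of loading the workspace and checking that its objects match the data dictionary above, before running "CWVS_LMC.txt":

# R sketch: load the simulated workspace and inspect its contents.
load("Simulated_Dataset.RData")

length(y)            # n binary responses (1: adverse outcome, 0: control)
dim(x)               # n x p covariate design matrix
dim(z)               # n x m standardized weekly pollution exposures
c(n = n, m = m, p = p)
length(alpha_true)   # "true" critical window locations/magnitudes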
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
- The user_id assigned to each user is consistent across each of the files (i.e., test windows in test_windows.csv for user_id == 10 correspond to user_id == 10 in para.csv, info.csv, etc.).
- Paradata collection was designed to operate asynchronously, ensuring that no user interactions were disrupted during data collection. As mLab was a browser-based technology, when users use browser navigation rapidly, there can be things that appear out of order (as we noted in our manuscripts).
- Dataframe column datatypes have been converted to satisfy specific analyses. Please check datatypes and convert as needed for your particular needs.
- Due to the sensitive nature of the survey data, the CSV/parquet files for survey are not included in this data repository. They will be made available upon reasonable request.
- For detailed descriptions of the study design and methodology, please refer to the associated publication.

## File Descriptions

### facts.csv / facts.parquet

This file records the educational facts shown to users.

_Column Name_: Description
- display_timestamp: Unix timestamp when an educational fact was displayed to the user.
- session_id: Unique identifier for the user's session when the fact was shown.
- user_id: Unique identifier for the user the fact was shown to.
- fact_category: Category of the educational fact displayed to the user.
- fact_index: Index number of the fact shown to the user.
- fact_text: Text of the educational fact displayed.

### info.csv / info.parquet

This file contains user-specific metadata and repeated data about each user (alerts and pinned facts).

_Column Name_: Description
- user_id: Unique identifier for the user.
- redcap_repeat_instrument: REDCap field indicating the repeat instrument used. For general information about the user (user_location and number_of_logins), redcap_repeat_instrument is blank. For repeated data (alerts, pinned facts, scheduled tests), redcap_repeat_instrument will identify the instrument.
- redcap_repeat_instance: Instance number of the repeat instrument (if applicable).
- user_location: Location of the user (if available) (1: New York City cohort; 2: Chicago cohort).
- alert_date: A Unix timestamp of when an alert was sent to the user.
- number_of_logins: Total number of logins by the user.
- alert_subject: Subject or type of the alert sent.
- alert_read: Indicates whether the alert was read by the user (1: True; 0: False).
- end_date: Unix timestamp of the end date of scheduled tests.
- start_date: Unix timestamp of the start date of scheduled tests.
- fact_category: Category of the educational fact pinned by the user.
- fact_index: Index number of the fact pinned by the user.
- fact_text: Text of the educational fact pinned by the user.
- fact_link: Link to additional information associated with the fact pinned by the user (if available).

### para.csv / para.parquet

This file includes paradata (detailed in-app user interactions) collected during the study.

_Column Name_: Description
- timestamp: A timezone-naive timestamp of the user action or event.
- session_id: Unique identifier for the user's session.
- user_id: Unique identifier for the user.
- user_action: Specific user action (e.g., button press, page navigation). "[]clicked" indicates a pressable element (i.e., button, collapsible/expandable menu) is pressed.
- current_page: Current page of the app being interacted with.
- browser: Browser used to access the app.
- platform: Platform used to access the app (e.g., Windows, iOS).
- platform_description: Detailed description of the platform.
- platform_maker: Manufacturer of the platform.
- device_name: Name of the device used.
- device_maker: Manufacturer of the device used.
- device_brand_name: Brand name of the device used.
- device_type: Type of device used (Mobile, Computer, etc.).
- user_location: Location of the user (1: New York City cohort; 2: Chicago cohort).

### survey.csv / survey.parquet

This file contains survey responses collected from users.

*NOTE: Due to the sensitive nature of this data, CSV/parquet files are not included in this data repository. They will be made available upon reasonable request.*

_Column Name_: Description
- user_id: Unique identifier for the user.
- timepoint: Timepoint of the survey (baseline/0 months, 6 months, 12 months).
- race: Race of the user.
- education: Education level of the user.
- health_literacy: Health literacy score of the user.
- health_efficacy: Health efficacy score of the user.
- itues_mean: Information Technology Usability Evaluation Scale (ITUES) mean score.
- age: Age of the user.

### tests.csv / tests.parquet

This file contains data related to the HIV self-tests performed by users in the mLab App.

_Column Name_: Description
- user_id: Unique identifier for the user that took the test.
- visual_analysis_date: A Unix timestamp of the visual analysis of the test by the user.
- visual_result: Result of the visual analysis (positive, negative).
- mlab_analysis_date: A Unix timestamp of the analysis conducted by the mLab system.
- mlab_result: Result from the mLab analysis (positive, negative).
- signal_ratio: Ratio of the intensity of the test signal to the control signal.
- control_signal: mLab-calculated intensity of the control signal.
- test_signal: mLab-calculated intensity of the test signal.
- browser: Browser used to access the app (from the User Agent string).
- platform: Platform used to access the app (e.g., Windows, iOS) (from the User Agent string).
- platform_description: Detailed description of the platform (from the User Agent string).
- platform_maker: Manufacturer of the platform (from the User Agent string).
- device_name: Name of the device used (from the User Agent string).
- device_maker: Manufacturer of the device used (from the User Agent string).
- device_brand_name: Brand name of the device used (from the User Agent string).
- device_type: Type of device used (Mobile, Computer, etc.) (from the User Agent string).

### test_windows.csv / test_windows.parquet

This file contains information on testing windows assigned to users.

_Column Name_: Description
- user_id: Unique identifier for the user.
- redcap_repeat_instance: Instance of the repeat instrument.
- start_date: Start date of the (hard) testing window.
- end_date: End date of the (hard) testing window.

## Citation

If you use this dataset, please cite the associated mLab and mLab paradata publications.
https://qdr.syr.edu/policies/qdr-standard-access-conditions
Project Summary: As part of a qualitative study of abortion reporting in the United States, the research team conducted cognitive interviews to iteratively assess new question wording and introductions designed to improve the accuracy of abortion reporting in surveys (to be shared on the Qualitative Data Repository in a separate submission). As expectations to share the data that underlie research increase, understanding how participants, particularly those taking part in qualitative research, respond to requests for data sharing is necessary. We assessed research participants' willingness to, understanding of, and motivations for data sharing.

Data Overview: The data consist of excerpts from cognitive interviews with 64 cisgender women in two states in January and February of 2020, in which researchers asked respondents for consent to share de-identified data. Eligibility criteria included: assigned female at birth, currently identified as a woman between the ages of 18-49, English-speaking, and reported ever having penile-vaginal sex. Respondents were screened for abortion history as well to ensure that at least half the sample reported a prior abortion. At the end of interviews, participants were asked to reflect on their motivations for agreeing or declining to share their data. The data included here are coded excerpts of their answers. Most respondents consented to data sharing, citing helping others as a primary motivation for agreeing to share their data. However, a substantial number of participants demonstrated limited understanding of "data sharing."

Data available here include the following materials: overview of methods, cognitive interview consent form (with language for data sharing consent), and data sharing analysis coding scheme.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains de-identified data and analysis code from a study using the 20-item Short Form Health Survey (SF-20). The data were collected to evaluate the validity and reliability of the SF-20 instrument. Included files:
- De-identified data (sf20.sas7bdat)
- Analysis code (iscience_code.sas)
Variables in the dataset correspond to survey domains such as physical functioning, mental health, general health, pain, role functioning, and social functioning. The analysis code includes scripts for descriptive statistics, internal consistency, and test-retest reliability. All information is de-identified to protect participant privacy. Further details can be found in the accompanying README file and manuscript.
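For readers working in R rather than SAS, a hedged sketch of loading the de-identified file with the haven package; variable names are not listed in this description, so they are inspected after loading:

# R sketch: read the SAS dataset and list its survey-domain variables.
library(haven)

sf20 <- read_sas("sf20.sas7bdat")
names(sf20)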
https://qdr.syr.edu/policies/qdr-standard-access-conditions
Project Overview: Trends toward open science practices, along with advances in technology, have promoted increased data archiving in recent years, thus bringing new attention to the reuse of archived qualitative data. Qualitative data reuse can increase efficiency and reduce the burden on research subjects, since new studies can be conducted without collecting new data. Qualitative data reuse also supports larger-scale, longitudinal research by combining datasets to analyze more participants. At the same time, qualitative research data can increasingly be collected from online sources. Social scientists can access and analyze personal narratives and social interactions through social media such as blogs, vlogs, online forums, and posts and interactions from social networking sites like Facebook and Twitter. These big social data have been celebrated as an unprecedented source of data analytics, able to produce insights about human behavior on a massive scale. However, both types of research also present key epistemological, ethical, and legal issues. This study explores the issues of context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership, with a focus on data curation strategies. The research suggests that connecting qualitative researchers, big social researchers, and curators can enhance responsible practices for qualitative data reuse and big social research.

This study addressed the following research questions:
RQ1: How is big social data curation similar to and different from qualitative data curation?
RQ1a: How are epistemological, ethical, and legal issues different or similar for qualitative data reuse and big social research?
RQ1b: How can data curation practices such as metadata and archiving support and resolve some of these epistemological and ethical issues?
RQ2: What are the implications of these similarities and differences for big social data curation and qualitative data curation, and what can we learn from combining these two conversations?

Data Description and Collection Overview: The data in this study were collected using semi-structured interviews that centered around specific incidents of qualitative data archiving or reuse, big social research, or data curation. The participants for the interviews were therefore drawn from three categories: researchers who have used big social data, qualitative researchers who have published or reused qualitative data, and data curators who have worked with one or both types of data. Six key issues were identified in a literature review and were then used to structure three interview guides for the semi-structured interviews. The six issues are context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership. Participants were limited to those working in the United States. Ten participants from each of the three target populations (big social researchers, qualitative researchers who had published or reused data, and data curators) were interviewed. The interviews were conducted between March 11 and October 6, 2021. When scheduling the interviews, participants received an email asking them to identify a critical incident prior to the interview. The "incident" in the critical incident interviewing technique is a specific example that focuses a participant's answers to the interview questions.
The participants were asked their permission to have the interviews recorded, which was completed using the built-in recording technology of Zoom videoconferencing software. The author also took notes during the interviews. Otter.ai speech-to-text software was used to create initial transcriptions of the interview recordings. A hired undergraduate student hand-edited the transcripts for accuracy. The transcripts were manually de-identified. The author analyzed the interview transcripts using a qualitative content analysis approach. This involved using a combination of inductive and deductive coding approaches. After reviewing the research questions, the author used NVivo software to identify chunks of text in the interview transcripts that represented key themes of the research. Because the interviews were structured around each of the six key issues that had been identified in the literature review, the author deductively created a parent code for each of the six key issues. These parent codes were context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership. The author then used inductive coding to create sub-codes beneath each of the parent codes for these key issues.

Selection and Organization of Shared Data: The data files consist of 28 of the interview transcripts themselves – transcripts from Big Social Researchers (BSR), Data Curators (DC), and Qualitative Researchers (QR)...
https://www.technavio.com/content/privacy-notice
Automatic Identification And Data Capture Market Size 2024-2028
The automatic identification and data capture market is projected to increase by USD 21.52 billion, at a CAGR of 8.1%, from 2023 to 2028. Increasing applications of RFID will drive the automatic identification and data capture market.
Market Insights
North America dominated the market and is expected to account for 47% of the market's growth during the 2024-2028 forecast period.
By Product - RFID products segment was valued at USD 18.41 billion in 2022
Market Size & Forecast
Market Opportunities: USD 79.34 million
Market Future Opportunities 2023: USD 21,520.40 million
CAGR from 2023 to 2028: 8.1%
Market Summary
The Automatic Identification and Data Capture (AIDC) market encompasses technologies and solutions that enable businesses to capture and process data in real time. This market is driven by the increasing adoption of RFID technology, which offers benefits such as improved supply chain visibility, inventory management, and operational efficiency. The growing popularity of smart factories, where automation and data-driven processes are integral, further fuels the demand for AIDC solutions. However, the market also faces challenges, including security concerns. With the increasing use of AIDC technologies, there is a growing need to ensure data privacy and security. This has led to the development of advanced encryption techniques and access control mechanisms to mitigate potential risks. A real-world business scenario illustrating the importance of AIDC is in the retail industry. Retailers use AIDC technologies such as RFID tags and barcode scanners to manage inventory levels, track stock movements, and optimize supply chain operations. By automating data capture processes, retailers can reduce manual errors, improve order fulfillment accuracy, and enhance the overall customer experience. Despite the challenges, the AIDC market continues to grow, driven by the need for real-time data processing and automation across various industries.
What will be the size of the Automatic Identification And Data Capture Market during the forecast period?
The Automatic Identification and Data Capture (AIDC) market continues to evolve, driven by advancements in technology and increasing business demands. AIDC solutions, including barcode scanners, RFID systems, and OCR technology, enable organizations to streamline processes, enhance data accuracy, and improve operational efficiency. According to recent research, the use of RFID technology in the retail sector has surged by 25% over the past five years, underpinning its significance in inventory management and supply chain optimization. Moreover, the integration of AIDC technologies with cloud computing services and data visualization dashboards offers real-time data access and analysis, empowering businesses to make informed decisions. For instance, a manufacturing firm can leverage RFID data to monitor production lines, optimize workflows, and ensure compliance with industry regulations. AIDC systems are also instrumental in enhancing data security and privacy, with advanced encryption protocols and access control features ensuring data integrity and confidentiality. By adopting AIDC technologies, organizations can not only improve their operational efficiency but also gain a competitive edge in their respective industries.
Unpacking the Automatic Identification And Data Capture Market Landscape
The market encompasses technologies such as RFID tag identification, data stream management, and data mining techniques. These solutions enable businesses to efficiently process and analyze vast amounts of data from various sources, leading to significant improvements in data quality metrics and workflow optimization strategies. For instance, RFID implementation can result in a 30% increase in inventory accuracy, while data mining techniques can uncover hidden patterns and trends, driving ROI improvement and compliance alignment. Real-time data processing, facilitated by technologies like document understanding AI and image recognition algorithms, ensures swift decision-making and error reduction. Data capture pipelines and database management systems provide a solid foundation for data aggregation and analysis, while semantic web technologies and natural language processing enhance information retrieval and understanding. By integrating sensor data and applying machine vision systems, businesses can achieve high-throughput imaging and object detection, further enhancing their data processing capabilities.
Key Market Drivers Fueling Growth
The significant expansion of RFID (Radio-Frequency Identification) technology applications is the primary market growth catalyst. In the dyna