100+ datasets found
  1. Data from: De-Identification Software Package

    • physionet.org
    Updated Dec 18, 2007
    Cite
    (2007). De-Identification Software Package [Dataset]. http://doi.org/10.13026/C20M3F
    Dataset updated
    Dec 18, 2007
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    The deid software package includes code and dictionaries for automated location and removal of protected health information (PHI) in free text from medical records.

  2. All-Payer Claims Data (APD De-Identified): Prescription Drug Detail 2021

    • healthdata.gov
    • health.data.ny.gov
    csv, xlsx, xml
    Updated Apr 16, 2025
    + more versions
    Cite
    health.data.ny.gov (2025). All-Payer Claims Data (APD De-Identified): Prescription Drug Detail 2021 [Dataset]. https://healthdata.gov/State/All-Payer-Claims-Data-APD-De-Identified-Prescripti/xspv-r9gv
    Explore at:
    xlsx, csv, xml (available download formats)
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    health.data.ny.gov
    Description

    This dataset is designed to analyze prescription drug use and spending among New York State residents at the drug product level (pharmacy claims data that have been aggregated by labeler code and product code segments of the National Drug Code). The dataset includes the number of prescriptions filled by unique members by payer type, nonproprietary name, labeler name, dosage characteristics, amount insurer paid, and more.

  3. Open Data, Private Learners: A De-Identified Dataset for Learning Analytics...

    • zenodo.org
    json, zip
    Updated Sep 23, 2025
    Cite
    Anonymous Authors (2025). Open Data, Private Learners: A De-Identified Dataset for Learning Analytics Research [Dataset]. http://doi.org/10.5281/zenodo.17087849
    Explore at:
    zip, json (available download formats)
    Dataset updated
    Sep 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the dataset files and the code used for feature engineering in the paper titled "Open Data, Private Learners: A De-Identified Dataset for Learning Analytics Research", submitted to the Nature journal Scientific Data.

  4. Data from: A nationwide dataset of de-identified activity spaces derived...

    • rdr.kuleuven.be
    • data.europa.eu
    bin, html, png +3
    Updated Jun 6, 2024
    Cite
    Ate Poorthuis; Qingqing Chen; Matthew Zook (2024). A nationwide dataset of de-identified activity spaces derived from geotagged social media data [Dataset]. http://doi.org/10.48804/MBT32W
    Explore at:
    bin, html, png, text/markdown, text/x-r-notebook, type/x-r-syntax (available download formats)
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    KU Leuven RDR
    Authors
    Ate Poorthuis; Qingqing Chen; Matthew Zook
    License

    https://rdr.kuleuven.be/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.48804/MBT32W

    Area covered
    United States
    Description

    This repository introduces a de-identified and spatially aggregated user activity dataset in the United States, which contains approximately 2 million users and 1.2 billion data points collected from 2012 to 2019. This data is stored in a series of Parquet files, totaling 13GB, in the data/parquet/ directory. A smaller subset with only points within the Denver timezone can be found in data/sample/.

  5. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Massachusetts General Hospital
    Harvard Medical School
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or institutional EDWs such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
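    To make the remapping step concrete, here is a minimal R sketch of a key-value lookup remap in the spirit of the description above. The object and column names (raw_labs, ehr_lab_name, dd_code, "k_serum") are illustrative, not the actual eLAB names; see the eLAB source on GitHub for the real implementation.

    library(dplyr)

    # Illustrative lookup table: many EHR lab subtypes collapse onto one DD code
    lab_lookup <- data.frame(
      ehr_lab_name = c("Potassium", "Potassium-External", "Potassium(POC)"),
      dd_code      = c("k_serum", "k_serum", "k_serum")
    )

    # Mock bulk lab pull (one row per lab result)
    raw_labs <- data.frame(
      record_id    = c(1, 2),
      ehr_lab_name = c("Potassium-External", "Sodium-whole-bld"),
      value        = c(4.1, 139)
    )

    # Remap to DD codes and keep only labs pre-defined by the registry DD
    labs_remapped <- raw_labs %>%
      left_join(lab_lookup, by = "ehr_lab_name") %>%
      filter(!is.na(dd_code))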

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  6. Data from: Data Donation with Dona: De-identified Messaging Data (WhatsApp...

    • pub.uni-bielefeld.de
    • data.niaid.nih.gov
    Updated Oct 10, 2024
    Cite
    Olya Hakobyan; Paul-Julius Hillmann; Hanna Drimalla (2024). Data Donation with Dona: De-identified Messaging Data (WhatsApp and Facebook) and Evaluation Responses [Dataset]. https://pub.uni-bielefeld.de/record/2993360
    Dataset updated
    Oct 10, 2024
    Authors
    Olya Hakobyan; Paul-Julius Hillmann; Hanna Drimalla
    Description

    General information

    The dataset contains de-identified messaging meta-data from 78 WhatsApp and 7 Facebook data donations. The dataset was collected in an online study using the data donation platform Dona. After donating their messaging data, the study participants viewed visual summaries of their messaging data and evaluated this visual feedback. The responses to the evaluation questions and the sociodemographic data of the participants are also included in the dataset.

    The data was collected from August 2022 to June 2024.

    For more information on Dona, the associated publications and updates, please visit https://mbp-lab.github.io/dona-blog/.

    File description

    1. donation_table.csv - contains general information about the donations including
      • donation_id: donation identifier
      • donor_id: the ID of the donor to distinguish the messages sent by them from those sent by contacts
      • source: the messaging platform from which the data is donated (WhatsApp or Facebook)
      • external_id: ID used to connect messaging data with the survey data
    2. messages_table.csv - contains the donated messages including
      • conversation_id: chat identifier
      • sender_id: sender identifier
      • datetime: time of the message, UNIX time for Facebook and device time for WhatsApp
      • word_count: word count of the messages achieved by splitting the text based on whitespace
      • donation_id: donation identifier (also listed in donation_table.csv)
    3. messages_filtered_table.csv - same structure as messages_table.csv, except that chats with no considerable interaction were removed. This was defined as chats where the donor's word count contribution was less than 10% or more than 90% (see the sketch after this list).
    4. survey.xlsx - contains survey responses of the participants.
    5. survey_table_coding.xlsx - contains the mapping between the column names in survey.xlsx and their meaning, including the original survey questions and response options. Different sheets of the Excel file detail the survey questions and responses in one of the study languages (English, German, Armenian).
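    As a worked example of that filtering rule, here is a minimal R sketch (assuming only the column names listed above) that recomputes which chats survive into messages_filtered_table.csv:

    library(dplyr)

    messages  <- read.csv("messages_table.csv")
    donations <- read.csv("donation_table.csv")

    # Share of words contributed by the donor in each chat
    donor_share <- messages %>%
      left_join(donations, by = "donation_id") %>%
      group_by(conversation_id) %>%
      summarise(share = sum(word_count[sender_id == donor_id[1]]) / sum(word_count))

    # Keep only chats with considerable interaction (donor share within 10%-90%)
    kept_chats <- donor_share %>% filter(share >= 0.10, share <= 0.90)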

  7. Data from: Encoding laboratory testing data: case studies of the national...

    • data-staging.niaid.nih.gov
    • search.dataone.org
    • +3 more
    zip
    Updated May 10, 2022
    Cite
    Raja Cholan; Gregory Pappas; Greg Rehwoldt; Andrew Sills; Elizabeth Korte; I. Khalil Appleton; Natalie Scott; Wendy Rubinstein; Sara Brenner; Riki Merrick; Wilbur Hadden; Keith Campbell; Michael Waters (2022). Encoding laboratory testing data: case studies of the national implementation of HHS requirements and related standards in five laboratories [Dataset]. http://doi.org/10.5061/dryad.0cfxpnw55
    Explore at:
    zip (available download formats)
    Dataset updated
    May 10, 2022
    Dataset provided by
    Food and Drug Administration (http://www.fda.gov/)
    Association of Public Health Laboratories (https://www.aphl.org/)
    United States Department of Health and Human Services (http://www.hhs.gov/)
    Office of the National Coordinator for Health Information Technology (http://healthit.gov/)
    Deloitte (United States)
    University of Maryland, College Park
    Authors
    Raja Cholan; Gregory Pappas; Greg Rehwoldt; Andrew Sills; Elizabeth Korte; I. Khalil Appleton; Natalie Scott; Wendy Rubinstein; Sara Brenner; Riki Merrick; Wilbur Hadden; Keith Campbell; Michael Waters
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: Assess the effectiveness of providing the Logical Observation Identifiers Names and Codes (LOINC®)-to-In Vitro Diagnostic (LIVD) coding specification, required by the United States Department of Health and Human Services for SARS-CoV-2 reporting, in medical center laboratories, and utilize findings to inform future United States Food and Drug Administration policy on the use of real-world evidence in regulatory decisions.

    Materials and Methods: We compared gaps and similarities between diagnostic test manufacturers' recommended LOINC® codes and the LOINC® codes used in medical center laboratories for the same tests.

    Results: Five medical centers and three test manufacturers extracted data from laboratory information systems (LIS) for prioritized tests of interest. The data submissions ranged from 74 to 532 LOINC® codes per site. Three test manufacturers submitted 15 LIVD catalogs representing 26 distinct devices, 6,956 tests, and 686 LOINC® codes. We identified mismatches in how medical centers use LOINC® to encode laboratory tests compared to how test manufacturers encode the same laboratory tests. Of 331 tests available in the LIVD files, 136 (41%) were represented by a mismatched LOINC® code at the medical centers (chi-square 45.0, 4 df, P < .0001).

    Discussion: The five medical centers and three test manufacturers vary in how they organize, categorize, and store LIS catalog information. This variation impacts data quality and interoperability.

    Conclusion: The results of the study indicate that providing the LIVD mappings was not sufficient to support laboratory data interoperability. National implementation of LIVD and further efforts to promote laboratory interoperability will require a more comprehensive effort and continuing evaluation and quality control.

    Methods: Data collection from medical center laboratory pilot sites: each medical center was asked to extract about 100 LOINC® codes from their LIS for prioritized tests of interest focused on high-risk conditions and SARS-CoV-2. For each selected test (e.g., SARS-CoV-2 RNA COVID-19), we collected the following data elements: test names/descriptions (e.g., SARS coronavirus 2 RNA [Presence] in Respiratory specimen by NAA with probe detection), associated instruments (e.g., IVD vendor model), and LOINC® codes (e.g., 94500-6). High-risk conditions were defined by referencing the CDC's published list of Underlying Medical Conditions Associated with High Risk for Severe COVID-19 [29]. A data collection template spreadsheet was created and disseminated to the medical centers to help provide consistency and reporting clarity for data elements from sites. Data collection from IVD manufacturers: we coordinated with SHIELD stakeholders and the IICC to request manufacturer LIVD catalogs containing the LOINC® codes per IVD instrument per test.

  8. V-safe COVID-19 MedDRA coded text responses

    • data.cdc.gov
    csv, xlsx, xml
    Updated Jul 24, 2023
    + more versions
    Cite
    (2023). V-safe COVID-19 MedDRA coded text responses [Dataset]. https://data.cdc.gov/Public-Health-Surveillance/v-safe-COVID-19-MedDRA-coded-text-responses/5biu-jjj3
    Explore at:
    xlsx, csv, xml (available download formats)
    Dataset updated
    Jul 24, 2023
    Description

    Users of the V-safe data are required to adhere to the following standards for the analysis and reporting of research data. All research results must be presented and/or published in a manner that protects the confidentiality of participants. V-safe data will not be presented and/or published in any way in which an individual can be identified.

    Therefore, users will:

    1. Not attempt to link or permit others to link the data with individually identified records in another database.
    2. Not attempt to learn the identity of any participant in the data, and not deliberately combine these data with other CDC or non-CDC data for the purpose of matching records to identify individuals. If you inadvertently discover the identity of any participant, you will keep that identity confidential and will not use it in any publications and/or presentations.
    3. Not imply or state, either in written or oral form, that interpretations based on analysis of the data reflect official CDC policies or positions.
    4. Understand that sub-national analyses are not appropriate for this national sample and will not be conducted.
    5. Understand that V-safe is a voluntary self-enrollment program requiring smartphone access; therefore, information from V-safe might not be representative or generalizable to the US population.
    By clicking on the weblink below to download and use these V-safe data, you signify your agreement to comply with the above-stated terms.

    V-safe is an active surveillance program to monitor the safety of COVID-19 vaccines that are authorized for use under U.S. Food and Drug Administration (FDA) Emergency Use Authorization (EUA) and after FDA licensure.

    These data include MedDRA coded text responses collected through V-safe from 12/13/2020 to 06/30/2023. Please review the V-safe data user agreement before analyzing any V-safe data.

  9. Hypertension Dataset

    • kaggle.com
    zip
    Updated Mar 16, 2025
    + more versions
    Cite
    INK (2025). Hypertension Dataset [Dataset]. https://www.kaggle.com/datasets/irakozekelly/hypertension-dataset
    Explore at:
    zip (7708 bytes) (available download formats)
    Dataset updated
    Mar 16, 2025
    Authors
    INK
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides de-identified insurance data for hypertension and hyperlipidemia from three managed care organizations in Allegheny County: Gateway Health Plan, Highmark Health, and UPMC. The data represents the insured population for the 2015 and 2016 calendar years.

    The dataset includes valuable insights into the health conditions of individuals covered by these plans but comes with several limitations:

    Misclassification and Duplicate Individuals: As administrative claims data was not collected for surveillance purposes, there may be errors in categorizing conditions or identifying individuals.
    Exclusions: It does not include individuals who were uninsured, not enrolled in one of the represented plans, or were enrolled for less than 90 days.
    Missing Data: The dataset excludes individuals who did not seek care within the past two years or were enrolled in plans not represented in the dataset.
    

    Disclaimer: Users should exercise caution when using this data to assess disease prevalence or interpret trends over time, as the data was collected for purposes other than public health surveillance.

  10. Data cleaning using unstructured data

    • zenodo.org
    zip
    Updated Jul 30, 2024
    Cite
    Rihem Nasfi; Antoon Bronselaer (2024). Data cleaning using unstructured data [Dataset]. http://doi.org/10.5281/zenodo.13135983
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rihem Nasfi; Antoon Bronselaer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this project, we work on repairing three datasets:

    • Trials design: This dataset was obtained from the European Union Drug Regulating Authorities Clinical Trials Database (EudraCT) register, and the ground truth was created from external registries. In the dataset, multiple countries, identified by the attribute country_protocol_code, conduct the same clinical trial, which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial.
    • Trials population: This dataset delineates the demographic origins of participants in clinical trials primarily conducted across European countries. It includes structured attributes indicating whether the trial pertains to a specific gender, age group, or healthy volunteers. Each of these categories is labeled ('1') or ('0'), denoting whether it is included in the trial or not. It is important to note that the population category should remain consistent across all countries conducting the same clinical trial, identified by its eudract_number. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises other unstructured attributes that categorize the inclusion criteria for trial participants, such as inclusion.
    • Allergens: This dataset contains information about products and their allergens. The data was collected from the German version of 'Alnatura' (access date: 24 November 2020), 'Open Food Facts' (a free database of food products from around the world), and the websites 'Migipedia', 'Piccantino', and 'Das Ist Drin'. There may be overlapping products across these websites. Each product in the dataset is identified by a unique code; samples with the same code represent the same product but are extracted from a different source. The allergens are indicated by ('2') if present, ('1') if there are traces of it, and ('0') if it is absent from a product. The dataset also includes information on ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.

    N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets:

    • "{dataset_name}_train.csv": samples used for the ML-model training. (e.g "allergens_train.csv")
    • "{dataset_name}_test.csv": samples used to test the the ML-model performance. (e.g "allergens_test.csv")
    • "{dataset_name}_golden_standard.csv": samples represent the ground truth of the test samples. (e.g "allergens_golden_standard.csv")
    • "{dataset_name}_parker_train.csv": samples repaired using Parker Engine used for the ML-model training. (e.g "allergens_parker_train.csv")
    • "{dataset_name}_parker_train.csv": samples repaired using Parker Engine used to test the the ML-model performance. (e.g "allergens_parker_test.csv")
  11. Source Code, Data and Additional Material for the Thesis: "Identification of...

    • heidata.uni-heidelberg.de
    Updated Apr 6, 2017
    Cite
    Thorsten Merten; Thorsten Merten (2017). Source Code, Data and Additional Material for the Thesis: "Identification of Software Features in Issue Tracking System Data" [Dataset]. http://doi.org/10.11588/DATA/10089
    Explore at:
    text/plain; charset=utf-8 (1086), zip (110360424) (available download formats)
    Dataset updated
    Apr 6, 2017
    Dataset provided by
    heiDATA
    Authors
    Thorsten Merten; Thorsten Merten
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/2.2/customlicense?persistentId=doi:10.11588/DATA/10089

    Description

    This dataset provides the code and the data sets used in the PhD thesis "Identification of Software Features in Issue Tracking System Data", as well as the files that represent the results measured in experiments. For problem studies (e.g., chapters 10 and 11), the folders include the raw data and the data annotations, as well as the tools used to extract the data. For solution studies (e.g., chapters 14, 15, and 16), the folders include the raw data, the tools used to extract the data, the gold standards, the code to process the data, and finally the experiment results. This archive contains one folder per chapter, and every folder contains a README.md file describing its contents:

    Chapter10: SOFTWARE FEATURES IN ISSUE TRACKING SYSTEMS – AN EMPIRICAL STUDY
    Chapter11: ISSUE TYPES AND INFORMATION TYPES – AN EMPIRICAL STUDY
    Chapter14: PREPROCESSING ISSUES – AN EMPIRICAL STUDY
    Chapter15: RECOVERING RELATED ISSUES IN ISSUE TRACKING SYSTEMS
    Chapter16: DETECTING SOFTWARE FEATURE REQUESTS IN ISSUES

  12. Any Healthcare Service Any DME - ZipLevel

    • arcgis.com
    • nifc.hub.arcgis.com
    Updated Oct 30, 2024
    Cite
    National Interagency Fire Center (2024). Any Healthcare Service Any DME - ZipLevel [Dataset].
    Dataset updated
    Oct 30, 2024
    Dataset authored and provided by
    National Interagency Fire Center
    Description

    Data Overview: ASPR, in partnership with the Centers for Medicare and Medicaid Services (CMS), provides de-identified and aggregated Medicare beneficiary claims data at the state/territory, county, and ZIP code levels in the HHS emPOWER Map and this public HHS emPOWER REST Service. The REST Service includes aggregated data from the Medicare Fee-For-Service (Parts A&B) and Medicare Advantage (Part C) Programs for beneficiaries who rely on electricity-dependent durable medical equipment (DME) and cardiac implantable devices.

    Data includes the following DME and devices: cardiac devices (left, right, and bi-ventricular assistive devices (LVAD, RVAD, BIVAD) and total artificial hearts (TAH)), ventilators (invasive, non-invasive, and oscillating vests), bi-level positive airway pressure devices (BiPAP), oxygen concentrators, enteral feeding tubes, intravenous (IV) infusion pumps, suction pumps, end-stage renal disease (ESRD) at-home dialysis, motorized wheelchairs or scooters, and electric beds.

    Purpose: Over 3 million Medicare beneficiaries rely on electricity-dependent medical equipment, such as ventilators, to live independently in their homes. Severe weather and other emergencies, especially those with long power outages, can be life-threatening for these individuals. The HHS emPOWER Map and public REST Service give every public health official, emergency manager, hospital, first responder, electric company, and community member the power to discover the electricity-dependent Medicare population in their state/territory, county, and ZIP code.

    Data Source: The REST Service's data is developed from Medicare Fee-For-Service (Parts A & B) (>33M: 65+, blind, ESRD [dialysis], dual-eligible, disabled [adults and children]) and Medicare Advantage (Part C) (>21M: 65+, blind, ESRD [dialysis], dual-eligible, disabled [adults and children]) beneficiary administrative claims data. This data does not include individuals who are only enrolled in a State Medicaid Program. Note that Medicare DME are subject to insurance claim reimbursement caps (e.g., rental caps) that differ by type, so the DME may have different "look-back" periods (e.g., ventilators are 13 months and oxygen concentrators are 36 months).

    ZIP Code Aggregation: Some ZIP Codes do not have specific geospatial boundary data (e.g., P.O. Box ZIP Codes). To capture the complete population data, the HHS emPOWER Program identified the larger boundary ZIP Code (parent) within which each non-boundary ZIP Code (child) resides. The totals are added together and displayed under the parent ZIP Code.

    Approved Data Uses: The public HHS emPOWER REST Service is approved for use by all partners and is intended to help inform and support emergency preparedness, response, recovery, and mitigation activities in all communities.

    Privacy Protections: Protecting the privacy of Medicare beneficiaries is an essential priority for the HHS emPOWER Program. Therefore, all personally identifiable information is removed from the data, and numerous de-identification methods are applied to significantly minimize, if not completely mitigate, any potential for deduction of small cells or re-identification risk. For example, any cell size found between the range of 1 and 10 is masked and shown as 11.
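    As a small illustration of that masking rule (hypothetical function name, not part of the service), in R:

    # Any cell size between 1 and 10 is masked and shown as 11
    mask_small_cells <- function(count) ifelse(count >= 1 & count <= 10, 11, count)

    mask_small_cells(c(0, 3, 10, 11, 250))  # returns 0 11 11 11 250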
    
  13. Metadata record for: A DICOM dataset for evaluation of medical image...

    • figshare.com
    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Scientific Data Curation Team (2023). Metadata record for: A DICOM dataset for evaluation of medical image de-identification [Dataset]. http://doi.org/10.6084/m9.figshare.14802774.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Scientific Data Curation Team
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains key characteristics about the data described in the Data Descriptor "A DICOM dataset for evaluation of medical image de-identification". Contents:

    1. human-readable metadata summary table in CSV format
    2. machine-readable metadata file in JSON format
  14. MSP430FR5969 Basic Block Worst Case Energy Consumption (WCEC) and Worst Case...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, xz
    Updated Dec 20, 2024
    Cite
    Hugo Reymond; Hector Chabot; Abderaouf Nassim Amalou; Isabelle Puaut (2024). MSP430FR5969 Basic Block Worst Case Energy Consumption (WCEC) and Worst Case Execution Time (WCET) dataset [Dataset]. http://doi.org/10.5281/zenodo.11066623
    Explore at:
    csv, bin, xz (available download formats)
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hugo Reymond; Hector Chabot; Abderaouf Nassim Amalou; Isabelle Puaut
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains around 30,000 basic blocks whose energy consumption and execution time have been measured in isolation on the MSP430FR5969 microcontroller, at 1 MHz. Basic blocks were executed in a worst-case scenario regarding the MSP430 FRAM cache and CPU pipeline. The dataset creation process is described thoroughly in [1].

    Folder structure

    This dataset is composed of the following files:

    • basic_blocks.tar.xz contains all basic blocks (BB) used in the dataset, in a custom JSON format,
    • data.csv/data.xlsx contains the measured energy consumption and execution time for each basic block

    We first detail how the basic_blocks.tar.xz archive is organized, and then present the CSV/XLSX spreadsheet format.

    Basic Blocks

    We extracted the basic blocks from a subset of programs of the AnghaBench benchmark suite [2]. The basic_blocks.tar.xz archive consists of the extracted basic blocks organized as JSON files. Each JSON file corresponds to a C source file from AnghaBench and is given a unique identifier. An example JSON file (137.json) is shown here:

    {
      "extr_pfctl_altq.c_pfctl_altq_init": [
         # Basic block 1
        [
          # Instruction 1 of BB1
          [
            "MOV.W",
            "#queue_map",
            "R13"
          ],
          # Instruction 2 of BB1
          [
            "MOV.B",
            "#0",
            "R14"
          ],
          # Instruction 3 of BB1
          [
            "CALL",
            "#hcreate_r",
            null
          ]
        ],
        # Basic block 2
        [
          ....
        ]
      ]
    }

    The JSON contains a dict with only one key, pointing to an array of basic blocks. This key is the name of the original C source file in AnghaBench from which the basic blocks were extracted (here extr_pfctl_altq.c_pfctl_altq_init.c). The array contains several basic blocks, which are represented as arrays of instructions, each instruction itself represented as an array [OPCODE, OPERAND1, OPERAND2].

    Then, each basic block can be identified uniquely using two IDs: its file ID and its offset within the file. In our example, basic block 1 is identified by the JSON file ID (137) and its offset in the file (0), so its ID is 137_0. This ID is used to map a basic block to its energy consumption/execution time in the data.csv/data.xlsx spreadsheet.
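    A minimal R sketch of this ID scheme (assuming the JSON layout shown above, without the explanatory # comments, and using the jsonlite package):

    library(jsonlite)

    file_id <- 137
    # Each file holds a dict with one key pointing to an array of basic blocks
    bbs <- fromJSON(sprintf("%d.json", file_id), simplifyVector = FALSE)[[1]]

    # Build the "fileid_offset" identifiers used in data.csv (offsets start at 0)
    bb_ids <- sprintf("%d_%d", file_id, seq_along(bbs) - 1)
    head(bb_ids)  # "137_0" "137_1" ...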

    Energy Consumption and Execution Time

    Energy consumption and execution time data are stored in the data.csv file. Here is an extract of the CSV file corresponding to basic block 137_0. The spreadsheet format is described below.

    bb_id;nb_inst;max_energy;max_time;avg_time;avg_energy;energy_per_inst;nb_samples;unroll_factor
    137_0;3;8.77;7.08;7.04;8.21;2.92;40;50

    Spreadsheet format:

    • bb_id: the unique identifier of a basic block (cf. Basic Blocks)
    • nb_inst: the number of instructions in the basic block
    • max_energy: the maximum energy consumption (in nJ) measured during the experiment
    • max_time: the maximum execution time (in us) measured during the experiment
    • avg_time: the average execution time (in us) measured during the experiment
    • avg_energy: the average energy consumption (in nJ) measured during the experiment
    • energy_per_inst: the average energy consumption per instruction (corresponds to avg_energy/nb_inst)
    • nb_samples: how many times the basic block's energy consumption/execution time was measured
    • unroll_factor: how many times the basic block was unrolled (cf. Basic Block Unrolling)
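    For example, a minimal R sketch that loads the semicolon-separated measurements and looks up one basic block (assuming data.csv sits in the working directory):

    data <- read.csv("data.csv", sep = ";")

    # Look up the measurements for basic block 137_0
    subset(data, bb_id == "137_0")

    # Recompute the per-instruction average energy, per the definition above
    data$avg_energy / data$nb_inst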

    Basic Block Unrolling

    To measure the energy consumption and execution time on the MSP430, we need to handle the scale difference between the measurement tool and the basic block execution time. This is achieved by duplicating the basic block multiple times while making sure to keep the worst-case memory layout, as explained in the paper. The number of times the basic block has been duplicated is called the unroll_factor.

    Values of energy and time are always given per basic block, so they have already been divided by the unroll factor.

    Dataset description

    Features

    The selected features after PCA analysis for both energy and time model are listed here: MOV.W_Rn_Rn, MOV.W_X(Rn)_X(Rn), CALL, MOV.B_#N_Rn, ADD.W_Rn_Rn, MOV.W_@Rn_Rn, MOV.W_X(Rn)_Rn, ADD.W_#N_Rn, PUSHM.W_#N_Rn, MOV.W_X(Rn)_ADDR, CMP.W_#N_Rn, MOV.W_&ADDR_X(Rn), MOV.W_Rn_X(Rn), BIS.W_Rn_Rn, RLAM.W_#N_Rn, SUB.W_#N_Rn, MOV.W_&ADDR_Rn, MOV.W_#N_X(Rn), CMP.W_Rn_Rn, BIT.W_ADDR_Rn, MOV.W_@Rn_X(Rn), ADD.W_#N_X(Rn), MOV.W_#N_Rn, AND.W_Rn_Rn, MOV.W_Rn_ADDR, SUB.W_Rn_Rn, MOV.W_ADDR_Rn, MOV.W_X(Rn)_&ADDR, MOV.W_ADDR_ADDR, JMP, ADD_#N_Rn, BIS.W_Rn_X(Rn), SUB_Rn_Rn, MOV.W_ADDR_X(Rn), ADDC_#N_X(Rn), MOV.B_Rn_Rn, CMP.W_X(Rn)_X(Rn), ADD_Rn_Rn, nb_inst, INV.W_Rn_, NOP_, ADD.W_X(Rn)_X(Rn), ADD.W_Rn_X(Rn), MOV.B_@Rn_Rn, BIS.W_X(Rn)_X(Rn), MOV.B_#N_X(Rn), MOV.W_#N_ADDR, AND.W_#N_ADDR, SUBC_X(Rn)_X(Rn), BIS.W_#N_X(Rn), SUB.W_X(Rn)_X(Rn), AND.B_#N_Rn, ADD_X(Rn)_X(Rn), MOV.W_@Rn_ADDR, MOV.W_&ADDR_ADDR, ADDC_Rn_Rn, AND.W_#N_X(Rn), SUB_#N_Rn, RRUM.W_#N_Rn, AND_ADDR_Rn, CMP.W_X(Rn)_ADDR, MOV.B_#N_ADDR, ADD.W_#N_ADDR, CMP.B_#N_Rn, SXT_Rn_, XOR.W_Rn_Rn, CMP.W_@Rn_Rn, ADD.W_@Rn_Rn, ADD.W_X(Rn)_Rn, AND.W_Rn_X(Rn), CMP.B_Rn_Rn, AND.W_X(Rn)_X(Rn), BIC.W_#N_Rn, BIS.W_#N_Rn, AND.B_#N_X(Rn), MOV.B_X(Rn)_X(Rn), AND.W_@Rn_Rn, MOV.W_#N_&ADDR, BIS.W_Rn_ADDR, SUB.W_X(Rn)_Rn, SUB.W_Rn_X(Rn), SUB_X(Rn)_X(Rn), MOV.B_@Rn_X(Rn), CMP.W_@Rn_X(Rn), ADD.W_X(Rn)_ADDR, CMP.W_Rn_X(Rn), BIS.W_@Rn_X(Rn), CMP.B_X(Rn)_X(Rn), RRC.W_Rn_, MOV.W_@Rn_&ADDR, CMP.W_#N_X(Rn), ADDC_X(Rn)_Rn, CMP.W_X(Rn)_Rn, BIS.W_X(Rn)_Rn, SUB_X(Rn)_Rn, MOV.B_X(Rn)_Rn, MOV.W_ADDR_&ADDR, AND.W_#N_Rn, RLA.W_Rn_, INV.W_X(Rn)_, XOR.W_#N_Rn, SUB.W_Rn_ADDR, BIC.W_#N_X(Rn), MOV.B_X(Rn)_ADDR, ADD_#N_X(Rn), SUB_Rn_X(Rn), MOV.B_&ADDR_Rn, MOV.W_Rn_&ADDR, ADD_X(Rn)_Rn, AND.W_X(Rn)_Rn, PUSHM.A_#N_Rn, RRAM.W_#N_Rn, AND.W_@Rn_X(Rn), BIS.B_Rn_X(Rn), SUB.W_@Rn_Rn, CLRC_, CMP.W_#N_ADDR, XOR.W_Rn_X(Rn), MOV.B_Rn_ADDR, CMP.B_X(Rn)_Rn, BIS.B_Rn_Rn, BIS.W_X(Rn)_ADDR, CMP.B_#N_X(Rn), CMP.W_Rn_ADDR, XOR.W_X(Rn)_Rn, MOV.B_Rn_X(Rn), ADD.B_#N_Rn

    Code

    The trained machine learning model, tests, and local explanation code can be generated and found here: WORTEX Machine learning code

    Acknowledgment

    This work has received a French government support granted to the Labex CominLabs excellence laboratory and managed by the National Research Agency in the “Investing for the Future” program under reference ANR-10-LABX-07-01

    Licensing

    Copyright 2024 Hector Chabot Copyright 2024 Abderaouf Nassim Amalou Copyright 2024 Hugo Reymond Copyright 2024 Isabelle Puaut

    Licensed under the Creative Commons Attribution 4.0 International License

    References

    [1] Reymond, H., Amalou, A. N., Puaut, I. “WORTEX: Worst-Case Execution Time and Energy Estimation in Low-Power Microprocessors using Explainable ML” in 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024)

    [2] Da Silva, Anderson Faustino, et al. “Anghabench: A suite with one million compilable C benchmarks for code-size reduction.” 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2021.

  15. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: file format: R workspace file, "Simulated_Dataset.RData".

    Metadata (including data dictionary):
    • y: Vector of binary responses (1: adverse outcome, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code Abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

    Description: "CWVS_LMC.txt": this code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities. "Results_Summary.txt": this code is also delivered as a .txt file containing R code. Once the "CWVS_LMC.txt" code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Required R packages:
    • For running "CWVS_LMC.txt": msm (sampling from the truncated normal distribution); mnormt (sampling from the multivariate normal distribution); BayesLogit (sampling from the Polya-Gamma distribution)
    • For running "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)

    Instructions for Use and Reproducibility: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study. How to use the information:
    • Load the "Simulated_Dataset.RData" workspace
    • Run the code contained in "CWVS_LMC.txt"
    • Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt"

    Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
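    A minimal R sketch of the replication procedure described above (the .txt files contain R code, so they can be executed with source(), assuming all files sit in the working directory):

    # Load the simulated dataset: provides y, x, z, n, m, p, alpha_true
    load("Simulated_Dataset.RData")

    # Fit the CWVS-LMC model (requires msm, mnormt, BayesLogit)
    source("CWVS_LMC.txt")

    # Summarize/plot critical windows and inclusion probabilities (requires plotrix)
    source("Results_Summary.txt")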

  16. mLab App Paradata

    • figshare.com
    txt
    Updated Nov 13, 2025
    Cite
    Thomas Scherr; Austin N. Hardcastle; Carson Moore; Dheemanth Majji; Lisa M. Kuhns; Robert Garofalo; Rebecca Schnall (2025). mLab App Paradata [Dataset]. http://doi.org/10.6084/m9.figshare.28211636.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Thomas Scherr; Austin N. Hardcastle; Carson Moore; Dheemanth Majji; Lisa M. Kuhns; Robert Garofalo; Rebecca Schnall
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    mLab App Paradata Data

    This data is part of the publications associated with the analysis of paradata and user engagement with the mLab App, a digital health application designed to support at-home HIV testing. The dataset includes detailed user interactions, testing results, and metadata collected during a multi-site clinical trial. The data is divided into several files, each with its own schema and purpose, as described below.

    Notes:

    • The data has been de-identified and cleaned.
    • The datasets originated from a REDCap project. Because the custom reports used to generate the dataset served different needs, there may be redundancies across the different files.
    • The user_id assigned to each user is consistent across each of the files (i.e., test windows in test_windows.csv for user_id == 10 correspond to user_id == 10 in para.csv, info.csv, etc.).
    • Paradata collection was designed to operate asynchronously, ensuring that no user interactions were disrupted during data collection. As mLab was a browser-based technology, rapid use of browser navigation can make some events appear out of order (as noted in our manuscripts).
    • Dataframe column datatypes have been converted to satisfy specific analyses. Please check datatypes and convert as needed for your particular needs.
    • Due to the sensitive nature of the survey data, the CSV/parquet files for the survey are not included in this data repository. They will be made available upon reasonable request.
    • For detailed descriptions of the study design and methodology, please refer to the associated publication.

    File descriptions

    1. facts.csv / facts.parquet - records the educational facts shown to users.
      • display_timestamp: Unix timestamp when an educational fact was displayed to the user
      • session_id: unique identifier for the user's session when the fact was shown
      • user_id: unique identifier for the user the fact was shown to
      • fact_category: category of the educational fact displayed to the user
      • fact_index: index number of the fact shown to the user
      • fact_text: text of the educational fact displayed
    2. info.csv / info.parquet - contains user-specific metadata and repeated data about each user (alerts and pinned facts).
      • user_id: unique identifier for the user
      • redcap_repeat_instrument: REDCap field indicating the repeat instrument used. For general information about the user (user_location and number_of_logins), redcap_repeat_instrument is blank; for repeated data (alerts, pinned facts, scheduled tests), it identifies the instrument
      • redcap_repeat_instance: instance number of the repeat instrument (if applicable)
      • user_location: location of the user, if available (1: New York City cohort; 2: Chicago cohort)
      • alert_date: Unix timestamp of when an alert was sent to the user
      • number_of_logins: total number of logins by the user
      • alert_subject: subject or type of the alert sent
      • alert_read: indicates whether the alert was read by the user (1: true; 0: false)
      • start_date / end_date: Unix timestamps of the start and end dates of scheduled tests
      • fact_category: category of the educational fact pinned by the user
      • fact_index: index number of the fact pinned by the user
      • fact_text: text of the educational fact pinned by the user
      • fact_link: link to additional information associated with the pinned fact (if available)
    3. para.csv / para.parquet - includes paradata (detailed in-app user interactions) collected during the study.
      • timestamp: a timezone-naive timestamp of the user action or event
      • session_id: unique identifier for the user's session
      • user_id: unique identifier for the user
      • user_action: specific user action (e.g., button press, page navigation); "[]clicked" indicates a pressable element (i.e., button, collapsible/expandable menu) was pressed
      • current_page: current page of the app being interacted with
      • browser: browser used to access the app
      • platform: platform used to access the app (e.g., Windows, iOS)
      • platform_description: detailed description of the platform
      • platform_maker: manufacturer of the platform
      • device_name: name of the device used
      • device_maker: manufacturer of the device used
      • device_brand_name: brand name of the device used
      • device_type: type of device used (Mobile, Computer, etc.)
      • user_location: location of the user (1: New York City cohort; 2: Chicago cohort)
    4. survey.csv / survey.parquet - contains survey responses collected from users. NOTE: due to the sensitive nature of this data, the CSV/parquet files are not included in this data repository; they will be made available upon reasonable request.
      • user_id: unique identifier for the user
      • timepoint: timepoint of the survey (baseline/0 months, 6 months, 12 months)
      • race: race of the user
      • education: education level of the user
      • health_literacy: health literacy score of the user
      • health_efficacy: health efficacy score of the user
      • itues_mean: Information Technology Usability Evaluation Scale (ITUES) mean score
      • age: age of the user
    5. tests.csv / tests.parquet - contains data related to the HIV self-tests performed by users in the mLab App.
      • user_id: unique identifier for the user that took the test
      • visual_analysis_date: Unix timestamp of the visual analysis of the test by the user
      • visual_result: result of the visual analysis (positive, negative)
      • mlab_analysis_date: Unix timestamp of the analysis conducted by the mLab system
      • mlab_result: result from the mLab analysis (positive, negative)
      • signal_ratio: ratio of the intensity of the test signal to the control signal
      • control_signal: mLab-calculated intensity of the control signal
      • test_signal: mLab-calculated intensity of the test signal
      • browser: browser used to access the app (from the User Agent string)
      • platform: platform used to access the app (e.g., Windows, iOS) (from the User Agent string)
      • platform_description: detailed description of the platform (from the User Agent string)
      • platform_maker: manufacturer of the platform (from the User Agent string)
      • device_name: name of the device used (from the User Agent string)
      • device_maker: manufacturer of the device used (from the User Agent string)
      • device_brand_name: brand name of the device used (from the User Agent string)
      • device_type: type of device used (Mobile, Computer, etc.) (from the User Agent string)
    6. test_windows.csv / test_windows.parquet - contains information on the testing windows assigned to users.
      • user_id: unique identifier for the user
      • redcap_repeat_instance: instance of the repeat instrument
      • start_date: start date of the (hard) testing window
      • end_date: end date of the (hard) testing window

    Citation: If you use this dataset, please cite the associated mLab and mLab paradata publications.
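    Because user_id is consistent across files, the tables join directly. A minimal R sketch using the arrow and dplyr packages (and remembering the note above about checking column datatypes first):

    library(arrow)
    library(dplyr)

    para    <- read_parquet("para.parquet")
    windows <- read_parquet("test_windows.parquet")

    # Attach each user's assigned testing windows to their paradata events
    para_windows <- inner_join(para, windows, by = "user_id")

    # Events falling inside a (hard) testing window; convert types first if needed
    in_window <- filter(para_windows, timestamp >= start_date, timestamp <= end_date)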

  17. Data for: Qualitative Data Sharing: Participant Understanding, Motivation,...

    • data.qdr.syr.edu
    pdf, tsv, txt
    Updated Nov 1, 2023
    Cite
    Alicia VandeVusse; Jennifer Mueller (2023). Data for: Qualitative Data Sharing: Participant Understanding, Motivation, and Consent [Dataset]. http://doi.org/10.5064/F6YYA3O3
    Explore at:
    tsv(1613), pdf(215887), tsv(33480), txt(3490), pdf(219898)
    Available download formats
    Dataset updated
    Nov 1, 2023
    Dataset provided by
    Qualitative Data Repository
    Authors
    Alicia VandeVusse; Jennifer Mueller
    License

    https://qdr.syr.edu/policies/qdr-standard-access-conditions

    Time period covered
    Jan 2020 - Feb 2020
    Area covered
    Wisconsin, New Jersey, United States
    Dataset funded by
    Eunice Kennedy Shriver National Institute Of Child Health & Human Development of the National Institutes of Health
    Description

    Project Summary

    As part of a qualitative study of abortion reporting in the United States, the research team conducted cognitive interviews to iteratively assess new question wording and introductions designed to improve the accuracy of abortion reporting in surveys (to be shared on the Qualitative Data Repository in a separate submission). As expectations to share the data that underlie research increase, it is necessary to understand how participants, particularly those taking part in qualitative research, respond to requests for data sharing. We assessed research participants' willingness to share data, their understanding of data sharing, and their motivations for doing so.

    Data Overview

    The data consist of excerpts from cognitive interviews with 64 cisgender women in two states in January and February of 2020, in which researchers asked respondents for consent to share de-identified data. Eligibility criteria included: assigned female at birth, currently identifying as a woman between the ages of 18 and 49, English-speaking, and reporting ever having had penile-vaginal sex. Respondents were also screened for abortion history to ensure that at least half the sample reported a prior abortion. At the end of the interviews, participants were asked to reflect on their motivations for agreeing or declining to share their data. The data included here are coded excerpts of their answers. Most respondents consented to data sharing, citing helping others as a primary motivation. However, a substantial number of participants demonstrated limited understanding of "data sharing."

    Data available here include the following materials: an overview of methods, the cognitive interview consent form (with language for data sharing consent), and the data sharing analysis coding scheme.

  18. Replication Data for: SF20

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Oct 15, 2025
    Cite
    Mi-Sun Lee (2025). Replication Data for: SF20 [Dataset]. http://doi.org/10.7910/DVN/AXKNZ2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 15, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Mi-Sun Lee
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains de-identified data and analysis code from a study using the 20-item Short Form Health Survey (SF-20). The data were collected to evaluate the validity and reliability of the SF-20 instrument.

    Included files:
    - De-identified data (sf20.sas7bdat)
    - Analysis code (iscience_code.sas)

    Variables in the dataset correspond to survey domains such as physical functioning, mental health, general health, pain, role functioning, and social functioning. The analysis code includes scripts for descriptive statistics, internal consistency, and test-retest reliability; a minimal illustration of the internal-consistency computation is sketched below. All information is de-identified to protect participant privacy. Further details can be found in the accompanying README file and manuscript.
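    The authors' analysis code is in SAS; as a language-neutral sketch of what the internal-consistency step computes, here is Cronbach's alpha in Python. The formula is standard, but the item column names and the read_sas usage are hypothetical, since the actual variable names live in the dataset's README.

    ```python
    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of the summed scale)."""
        k = items.shape[1]                         # number of items in the domain
        item_var_sum = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
        return (k / (k - 1)) * (1 - item_var_sum / total_var)

    # Hypothetical usage with invented physical-functioning item columns:
    # df = pd.read_sas("sf20.sas7bdat")
    # print(cronbach_alpha(df[["pf1", "pf2", "pf3", "pf4", "pf5", "pf6"]]))
    ```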

  19. Interviews regarding data curation for qualitative data reuse and big social...

    • data.qdr.syr.edu
    bin, pdf, txt
    Updated Apr 26, 2023
    Cite
    Sara Mannheimer (2023). Interviews regarding data curation for qualitative data reuse and big social research [Dataset]. http://doi.org/10.5064/F6GWMU4O
    Explore at:
    pdf(111223), pdf(170851), pdf(174860), pdf(220706), pdf(181317), pdf(155781), pdf(176948), pdf(186400), pdf(216506), pdf(186156), pdf(166627), pdf(204315), pdf(120883), pdf(223955), pdf(197623), pdf(209721), pdf(212401), pdf(111468), pdf(175067), pdf(194133), pdf(194606), bin(254918656), pdf(174896), txt(8346), pdf(180451), pdf(192049), pdf(119959), pdf(214380), bin(2258685), pdf(547705), pdf(189347), pdf(196971), pdf(115127), pdf(213879), pdf(146828), pdf(195493), pdf(177017), pdf(189665), pdf(149437), pdf(183110), pdf(221008), pdf(200024)
    Available download formats
    Dataset updated
    Apr 26, 2023
    Dataset provided by
    Qualitative Data Repository
    Authors
    Sara Mannheimer
    License

    https://qdr.syr.edu/policies/qdr-standard-access-conditions

    Time period covered
    Mar 1, 2019 - Jun 1, 2023
    Area covered
    United States
    Description

    Project Overview

    Trends toward open science practices, along with advances in technology, have promoted increased data archiving in recent years, thus bringing new attention to the reuse of archived qualitative data. Qualitative data reuse can increase efficiency and reduce the burden on research subjects, since new studies can be conducted without collecting new data. Qualitative data reuse also supports larger-scale, longitudinal research by combining datasets to analyze more participants. At the same time, qualitative research data can increasingly be collected from online sources. Social scientists can access and analyze personal narratives and social interactions through social media such as blogs, vlogs, online forums, and posts and interactions from social networking sites like Facebook and Twitter. These big social data have been celebrated as an unprecedented source of data analytics, able to produce insights about human behavior on a massive scale. However, both types of research also present key epistemological, ethical, and legal issues. This study explores the issues of context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership, with a focus on data curation strategies. The research suggests that connecting qualitative researchers, big social researchers, and curators can enhance responsible practices for qualitative data reuse and big social research.

    This study addressed the following research questions:
    RQ1: How is big social data curation similar to and different from qualitative data curation?
    RQ1a: How are epistemological, ethical, and legal issues different or similar for qualitative data reuse and big social research?
    RQ1b: How can data curation practices such as metadata and archiving support and resolve some of these epistemological and ethical issues?
    RQ2: What are the implications of these similarities and differences for big social data curation and qualitative data curation, and what can we learn from combining these two conversations?

    Data Description and Collection Overview

    The data in this study were collected using semi-structured interviews that centered around specific incidents of qualitative data archiving or reuse, big social research, or data curation. Participants were therefore drawn from three categories: researchers who have used big social data, qualitative researchers who have published or reused qualitative data, and data curators who have worked with one or both types of data. Six key issues were identified in a literature review and used to structure three interview guides for the semi-structured interviews: context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership. Participants were limited to those working in the United States. Ten participants from each of the three target populations (big social researchers, qualitative researchers who had published or reused data, and data curators) were interviewed. The interviews were conducted between March 11 and October 6, 2021. When scheduling the interviews, participants received an email asking them to identify a critical incident prior to the interview. The "incident" in the critical incident interviewing technique is a specific example that focuses a participant's answers to the interview questions.

    Participants gave permission for the interviews to be recorded, which was done using the built-in recording feature of Zoom videoconferencing software. The author also took notes during the interviews. Otter.ai speech-to-text software was used to create initial transcriptions of the interview recordings, and a hired undergraduate student hand-edited the transcripts for accuracy. The transcripts were manually de-identified. The author analyzed the interview transcripts using a qualitative content analysis approach, combining inductive and deductive coding. After reviewing the research questions, the author used NVivo software to identify chunks of text in the interview transcripts that represented key themes of the research. Because the interviews were structured around each of the six key issues identified in the literature review, the author deductively created a parent code for each issue. The author then used inductive coding to create sub-codes beneath each parent code.

    Selection and Organization of Shared Data

    The data files consist of 28 of the interview transcripts themselves: transcripts from Big Social Researchers (BSR), Data Curators (DC), and Qualitative Researchers (QR)...

  20. Automatic Identification And Data Capture Market Analysis North America,...

    • technavio.com
    pdf
    Updated Oct 30, 2024
    Cite
    Technavio (2024). Automatic Identification And Data Capture Market Analysis North America, APAC, Europe, South America, Middle East and Africa - China, US, Japan, UK, Germany - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/automatic-identification-and-data-capture-market-industry-analysis
    Explore at:
    pdf
    Available download formats
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2024 - 2028
    Area covered
    United States, United Kingdom
    Description


    Automatic Identification And Data Capture Market Size 2024-2028

    The automatic identification and data capture market is forecast to grow by USD 21.52 billion at a CAGR of 8.1% from 2023 to 2028. Increasing applications of RFID will drive this growth.

    Market Insights

    North America dominated the market and is expected to account for 47% of market growth during 2024-2028.
    By Product - RFID products segment was valued at USD 18.41 billion in 2022
    By segment2 - segment2_1 segment accounted for the largest market revenue share in 2022
    

    Market Size & Forecast

    Market Opportunities: USD 79.34 million
    Market Future Opportunities 2023: USD 21,520.40 million
    CAGR from 2023 to 2028: 8.1%
    

    Market Summary

    The Automatic Identification and Data Capture (AIDC) market encompasses technologies and solutions that enable businesses to capture and process data in real time. This market is driven by the increasing adoption of RFID technology, which offers benefits such as improved supply chain visibility, inventory management, and operational efficiency. The growing popularity of smart factories, where automation and data-driven processes are integral, further fuels the demand for AIDC solutions. However, the market also faces challenges, including security concerns. With the increasing use of AIDC technologies, there is a growing need to ensure data privacy and security. This has led to the development of advanced encryption techniques and access control mechanisms to mitigate potential risks. A real-world business scenario illustrating the importance of AIDC is in the retail industry. Retailers use AIDC technologies such as RFID tags and barcode scanners to manage inventory levels, track stock movements, and optimize supply chain operations. By automating data capture processes, retailers can reduce manual errors, improve order fulfillment accuracy, and enhance the overall customer experience. Despite the challenges, the AIDC market continues to grow, driven by the need for real-time data processing and automation across various industries.

    What will be the size of the Automatic Identification And Data Capture Market during the forecast period?

    The Automatic Identification and Data Capture (AIDC) market continues to evolve, driven by advancements in technology and increasing business demands. AIDC solutions, including barcode scanners, RFID systems, and OCR technology, enable organizations to streamline processes, enhance data accuracy, and improve operational efficiency. According to recent research, the use of RFID technology in the retail sector has surged by 25% over the past five years, underpinning its significance in inventory management and supply chain optimization. Moreover, the integration of AIDC technologies with cloud computing services and data visualization dashboards offers real-time data access and analysis, empowering businesses to make informed decisions. For instance, a manufacturing firm can leverage RFID data to monitor production lines, optimize workflows, and ensure compliance with industry regulations. AIDC systems are also instrumental in enhancing data security and privacy, with advanced encryption protocols and access control features ensuring data integrity and confidentiality. By adopting AIDC technologies, organizations can not only improve their operational efficiency but also gain a competitive edge in their respective industries.

    Unpacking the Automatic Identification And Data Capture Market Landscape

    The market encompasses technologies such as RFID tag identification, data stream management, and data mining techniques. These solutions enable businesses to efficiently process and analyze vast amounts of data from various sources, leading to significant improvements in data quality metrics and workflow optimization strategies. For instance, RFID implementation can result in a 30% increase in inventory accuracy, while data mining techniques can uncover hidden patterns and trends, driving ROI improvement and compliance alignment. Real-time data processing, facilitated by technologies like document understanding AI and image recognition algorithms, ensures swift decision-making and error reduction. Data capture pipelines and database management systems provide a solid foundation for data aggregation and analysis, while semantic web technologies and natural language processing enhance information retrieval and understanding. By integrating sensor data and applying machine vision systems, businesses can achieve high-throughput imaging and object detection, further enhancing their data processing capabilities.

    Key Market Drivers Fueling Growth

    The significant expansion of RFID (Radio-Frequency Identification) technology applications is the primary market growth catalyst...

Cite
(2007). De-Identification Software Package [Dataset]. http://doi.org/10.13026/C20M3F

Data from: De-Identification Software Package

Related Article
Explore at:
14 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Dec 18, 2007
License

Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically

Description

The deid software package includes code and dictionaries for automated location and removal of protected health information (PHI) in free text from medical records.
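To illustrate the general technique named here (dictionary lookups plus pattern matching over free text), a toy sketch follows. This is not the deid package's actual algorithm; the name list, regular expressions, and placeholder tags are invented for the example.

```python
import re

# Toy dictionary- and pattern-based PHI scrubber; the real deid package
# uses far richer dictionaries and context-sensitive rules.
NAME_DICTIONARY = {"alice", "bob"}  # hypothetical name list
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

def scrub(text: str) -> str:
    """Replace phone numbers, dates, and dictionary-listed names with tags."""
    text = PHONE_RE.sub("[**PHONE**]", text)
    text = DATE_RE.sub("[**DATE**]", text)
    tokens = [
        "[**NAME**]" if t.lower().strip(".,") in NAME_DICTIONARY else t
        for t in text.split()
    ]
    return " ".join(tokens)

print(scrub("Alice called on 3/14/2007 at 555-867-5309."))
# -> [**NAME**] called on [**DATE**] at [**PHONE**].
```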
