100+ datasets found
  1. Data from: De-Identification Software Package

    • physionet.org
    Updated Dec 18, 2007
    Cite
    (2007). De-Identification Software Package [Dataset]. http://doi.org/10.13026/C20M3F
    Dataset updated
    Dec 18, 2007
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    The deid software package includes code and dictionaries for automated location and removal of protected health information (PHI) in free text from medical records.

  2. All-Payer Claims Data (APD De-Identified): Prescription Drug Detail 2021

    • healthdata.gov
    • health.data.ny.gov
    csv, xlsx, xml
    Updated Apr 16, 2025
    + more versions
    Cite
    health.data.ny.gov (2025). All-Payer Claims Data (APD De-Identified): Prescription Drug Detail 2021 [Dataset]. https://healthdata.gov/State/All-Payer-Claims-Data-APD-De-Identified-Prescripti/xspv-r9gv
    Explore at:
    xlsx, csv, xml (available download formats)
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    health.data.ny.gov
    Description

    This dataset is designed to analyze prescription drug use and spending among New York State residents at the drug product level (pharmacy claims data that have been aggregated by labeler code and product code segments of the National Drug Code). The dataset includes the number of prescriptions filled by unique members by payer type, nonproprietary name, labeler name, dosage characteristics, amount insurer paid, and more.

  3. Open Data, Private Learners: A De-Identified Dataset for Learning Analytics...

    • zenodo.org
    json, zip
    Updated Sep 23, 2025
    Cite
    Anonymous Authors (2025). Open Data, Private Learners: A De-Identified Dataset for Learning Analytics Research [Dataset]. http://doi.org/10.5281/zenodo.17087849
    Explore at:
    zip, json (available download formats)
    Dataset updated
    Sep 23, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the dataset files and the code used for feature engineering in the paper titled "Open Data, Private Learners: A De-Identified Dataset for Learning Analytics Research", submitted to the Nature journal Scientific Data.

  4. Data from: A nationwide dataset of de-identified activity spaces derived...

    • rdr.kuleuven.be
    • data.europa.eu
    bin, html, png +3
    Updated Jun 6, 2024
    Cite
    Ate Poorthuis; Qingqing Chen; Matthew Zook (2024). A nationwide dataset of de-identified activity spaces derived from geotagged social media data [Dataset]. http://doi.org/10.48804/MBT32W
    Explore at:
    bin, html, png, text/markdown, text/x-r-notebook, type/x-r-syntax (available download formats)
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    KU Leuven RDR
    Authors
    Ate Poorthuis; Qingqing Chen; Matthew Zook
    License

    https://rdr.kuleuven.be/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.48804/MBT32W

    Area covered
    United States
    Description

    This repository introduces a de-identified and spatially aggregated user activity dataset in the United States, which contains approximately 2 million users and 1.2 billion data points collected from 2012 to 2019. This data is stored in a series of Parquet files, totaling 13GB, in the data/parquet/ directory. A smaller subset with only points within the Denver timezone can be found in data/sample/.

  5. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Massachusetts General Hospital
    Harvard Medical School
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or institutional EDWs such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
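    To make the remapping step concrete, here is a minimal R sketch of a key-value lookup remap in the spirit of the description above. The object and column names (raw_labs, ehr_lab_name, dd_code, "k_serum") are illustrative, not the actual eLAB names; see the eLAB source on GitHub for the real implementation.

    library(dplyr)

    # Illustrative lookup table: many EHR lab subtypes collapse onto one DD code
    lab_lookup <- data.frame(
      ehr_lab_name = c("Potassium", "Potassium-External", "Potassium(POC)"),
      dd_code      = c("k_serum", "k_serum", "k_serum")
    )

    # Mock bulk lab pull (one row per lab result)
    raw_labs <- data.frame(
      record_id    = c(1, 2),
      ehr_lab_name = c("Potassium-External", "Sodium-whole-bld"),
      value        = c(4.1, 139)
    )

    # Remap to DD codes and keep only labs pre-defined by the registry DD
    labs_remapped <- raw_labs %>%
      left_join(lab_lookup, by = "ehr_lab_name") %>%
      filter(!is.na(dd_code))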

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  6. Data from: Data Donation with Dona: De-identified Messaging Data (WhatsApp...

    • pub.uni-bielefeld.de
    • data.niaid.nih.gov
    Updated Oct 10, 2024
    Cite
    Olya Hakobyan; Paul-Julius Hillmann; Hanna Drimalla (2024). Data Donation with Dona: De-identified Messaging Data (WhatsApp and Facebook) and Evaluation Responses [Dataset]. https://pub.uni-bielefeld.de/record/2993360
    Dataset updated
    Oct 10, 2024
    Authors
    Olya Hakobyan; Paul-Julius Hillmann; Hanna Drimalla
    Description

    General information

    The dataset contains de-identified messaging meta-data from 78 WhatsApp and 7 Facebook data donations. The dataset was collected in an online study using the data donation platform Dona. After donating their messaging data, the study participants viewed visual summaries of their messaging data and evaluated this visual feedback. The responses to the evaluation questions and the sociodemographic data of the participants are also included in the dataset.

    The data was collected from August 2022 to June 2024.

    For more information on Dona, the associated publications and updates, please visit https://mbp-lab.github.io/dona-blog/.

    File description

    1. donation_table.csv - contains general information about the donations including
      • donation_id: donation identifier
      • donor_id: the ID of the donor to distinguish the messages sent by them from those sent by contacts
      • source: the messaging platform from which the data is donated (WhatsApp or Facebook)
      • external_id: ID used to connect messaging data with the survey data
    2. messages_table.csv - contains the donated messages including
      • conversation_id: chat identifier
      • sender_id: sender identifier
      • datetime: time of the message, UNIX time for Facebook and device time for WhatsApp
      • word_count: word count of the messages achieved by splitting the text based on whitespace
      • donation_id: donation identifier (also listed in donation_table.csv)
    3. messages_filtered_table.csv - same structure as messages_table.csv, except that chats with no considerable interaction were removed. This was defined as chats where the donor's word count contribution was less than 10% or more than 90% (see the sketch after this list).
    4. survey.xlsx - contains survey responses of the participants.
    5. survey_table_coding.xlsx - contains the mapping between the column names in survey.xlsx and their meaning, including the original survey questions and response options. Different sheets of the Excel file detail the survey questions and responses in one of the study languages (English, German, Armenian).
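    As a worked example of that filtering rule, here is a minimal R sketch (assuming only the column names listed above) that recomputes which chats survive into messages_filtered_table.csv:

    library(dplyr)

    messages  <- read.csv("messages_table.csv")
    donations <- read.csv("donation_table.csv")

    # Share of words contributed by the donor in each chat
    donor_share <- messages %>%
      left_join(donations, by = "donation_id") %>%
      group_by(conversation_id) %>%
      summarise(share = sum(word_count[sender_id == donor_id[1]]) / sum(word_count))

    # Keep only chats with considerable interaction (donor share within 10%-90%)
    kept_chats <- donor_share %>% filter(share >= 0.10, share <= 0.90)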

  7. Data from: Encoding laboratory testing data: case studies of the national...

    • data-staging.niaid.nih.gov
    • search.dataone.org
    • +3 more
    zip
    Updated May 10, 2022
    Cite
    Raja Cholan; Gregory Pappas; Greg Rehwoldt; Andrew Sills; Elizabeth Korte; I. Khalil Appleton; Natalie Scott; Wendy Rubinstein; Sara Brenner; Riki Merrick; Wilbur Hadden; Keith Campbell; Michael Waters (2022). Encoding laboratory testing data: case studies of the national implementation of HHS requirements and related standards in five laboratories [Dataset]. http://doi.org/10.5061/dryad.0cfxpnw55
    Explore at:
    zip (available download formats)
    Dataset updated
    May 10, 2022
    Dataset provided by
    Food and Drug Administration (http://www.fda.gov/)
    Association of Public Health Laboratories (https://www.aphl.org/)
    United States Department of Health and Human Services (http://www.hhs.gov/)
    Office of the National Coordinator for Health Information Technology (http://healthit.gov/)
    Deloitte (United States)
    University of Maryland, College Park
    Authors
    Raja Cholan; Gregory Pappas; Greg Rehwoldt; Andrew Sills; Elizabeth Korte; I. Khalil Appleton; Natalie Scott; Wendy Rubinstein; Sara Brenner; Riki Merrick; Wilbur Hadden; Keith Campbell; Michael Waters
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: Assess the effectiveness of providing the Logical Observation Identifiers Names and Codes (LOINC®)-to-In Vitro Diagnostic (LIVD) coding specification, required by the United States Department of Health and Human Services for SARS-CoV-2 reporting, in medical center laboratories, and utilize findings to inform future United States Food and Drug Administration policy on the use of real-world evidence in regulatory decisions.

    Materials and Methods: We compared gaps and similarities between diagnostic test manufacturers' recommended LOINC® codes and the LOINC® codes used in medical center laboratories for the same tests.

    Results: Five medical centers and three test manufacturers extracted data from laboratory information systems (LIS) for prioritized tests of interest. The data submissions ranged from 74 to 532 LOINC® codes per site. Three test manufacturers submitted 15 LIVD catalogs representing 26 distinct devices, 6,956 tests, and 686 LOINC® codes. We identified mismatches in how medical centers use LOINC® to encode laboratory tests compared to how test manufacturers encode the same laboratory tests. Of 331 tests available in the LIVD files, 136 (41%) were represented by a mismatched LOINC® code at the medical centers (chi-square 45.0, 4 df, P < .0001).

    Discussion: The five medical centers and three test manufacturers vary in how they organize, categorize, and store LIS catalog information. This variation impacts data quality and interoperability.

    Conclusion: The results of the study indicate that providing the LIVD mappings was not sufficient to support laboratory data interoperability. National implementation of LIVD and further efforts to promote laboratory interoperability will require a more comprehensive effort and continuing evaluation and quality control.

    Methods: Data collection from medical center laboratory pilot sites: each medical center was asked to extract about 100 LOINC® codes from their LIS for prioritized tests of interest focused on high-risk conditions and SARS-CoV-2. For each selected test (e.g., SARS-CoV-2 RNA COVID-19), we collected the following data elements: test names/descriptions (e.g., SARS coronavirus 2 RNA [Presence] in Respiratory specimen by NAA with probe detection), associated instruments (e.g., IVD vendor model), and LOINC® codes (e.g., 94500-6). High-risk conditions were defined by referencing the CDC's published list of Underlying Medical Conditions Associated with High Risk for Severe COVID-19 [29]. A data collection template spreadsheet was created and disseminated to the medical centers to help provide consistency and reporting clarity for data elements from sites. Data collection from IVD manufacturers: we coordinated with SHIELD stakeholders and the IICC to request manufacturer LIVD catalogs containing the LOINC® codes per IVD instrument per test.

  8. V-safe COVID-19 MedDRA coded text responses

    • data.cdc.gov
    csv, xlsx, xml
    Updated Jul 24, 2023
    + more versions
    Cite
    (2023). V-safe COVID-19 MedDRA coded text responses [Dataset]. https://data.cdc.gov/Public-Health-Surveillance/v-safe-COVID-19-MedDRA-coded-text-responses/5biu-jjj3
    Explore at:
    xlsx, csv, xml (available download formats)
    Dataset updated
    Jul 24, 2023
    Description

    Users of the V-safe data are required to adhere to the following standards for the analysis and reporting of research data. All research results must be presented and/or published in a manner that protects the confidentiality of participants. V-safe data will not be presented and/or published in any way in which an individual can be identified.

    Therefore, users will:

    1. Not attempt to link or permit others to link the data with individually identified records in another database.
    2. Not attempt to learn the identity of any participant in the data, and not deliberately combine these data with other CDC or non-CDC data for the purpose of matching records to identify individuals. If you inadvertently discover the identity of any participant, you will keep that identity confidential and will not use it in any publications and/or presentations.
    3. Not imply or state, either in written or oral form, that interpretations based on analysis of the data reflect official CDC policies or positions.
    4. Understand that sub-national analyses are not appropriate for this national sample and will not be conducted.
    5. Understand that V-safe is a voluntary self-enrollment program requiring smartphone access; therefore, information from V-safe might not be representative or generalizable to the US population.
    By clicking on the weblink below to download and use these V-safe data, you signify your agreement to comply with the above-stated terms.

    V-safe is an active surveillance program to monitor the safety of COVID-19 vaccines that are authorized for use under U.S. Food and Drug Administration (FDA) Emergency Use Authorization (EUA) and after FDA licensure.

    These data include MedDRA coded text responses collected through V-safe from 12/13/2020 to 06/30/2023. Please review the V-safe data user agreement before analyzing any V-safe data.

  9. Hypertension Dataset

    • kaggle.com
    zip
    Updated Mar 16, 2025
    + more versions
    Cite
    INK (2025). Hypertension Dataset [Dataset]. https://www.kaggle.com/datasets/irakozekelly/hypertension-dataset
    Explore at:
    zip (7708 bytes) (available download formats)
    Dataset updated
    Mar 16, 2025
    Authors
    INK
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides de-identified insurance data for hypertension and hyperlipidemia from three managed care organizations in Allegheny County: Gateway Health Plan, Highmark Health, and UPMC. The data represents the insured population for the 2015 and 2016 calendar years.

    The dataset includes valuable insights into the health conditions of individuals covered by these plans but comes with several limitations:

    Misclassification and Duplicate Individuals: As administrative claims data was not collected for surveillance purposes, there may be errors in categorizing conditions or identifying individuals.
    Exclusions: It does not include individuals who were uninsured, not enrolled in one of the represented plans, or were enrolled for less than 90 days.
    Missing Data: The dataset excludes individuals who did not seek care within the past two years or were enrolled in plans not represented in the dataset.
    

    Disclaimer: Users should exercise caution when using this data to assess disease prevalence or interpret trends over time, as the data was collected for purposes other than public health surveillance.

  10. Data cleaning using unstructured data

    • zenodo.org
    zip
    Updated Jul 30, 2024
    Cite
    Rihem Nasfi; Antoon Bronselaer (2024). Data cleaning using unstructured data [Dataset]. http://doi.org/10.5281/zenodo.13135983
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rihem Nasfi; Antoon Bronselaer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this project, we work on repairing three datasets:

    • Trials design: This dataset was obtained from the European Union Drug Regulating Authorities Clinical Trials Database (EudraCT) register, and the ground truth was created from external registries. In the dataset, multiple countries, identified by the attribute country_protocol_code, conduct the same clinical trial, which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial.
    • Trials population: This dataset delineates the demographic origins of participants in clinical trials primarily conducted across European countries. It includes structured attributes indicating whether the trial pertains to a specific gender, age group, or healthy volunteers. Each of these categories is labeled ('1') or ('0'), denoting whether it is included in the trial or not. It is important to note that the population category should remain consistent across all countries conducting the same clinical trial, identified by its eudract_number. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises other unstructured attributes that categorize the inclusion criteria for trial participants, such as inclusion.
    • Allergens: This dataset contains information about products and their allergens. The data was collected from the German version of 'Alnatura' (access date: 24 November 2020), 'Open Food Facts' (a free database of food products from around the world), and the websites 'Migipedia', 'Piccantino', and 'Das Ist Drin'. There may be overlapping products across these websites. Each product in the dataset is identified by a unique code; samples with the same code represent the same product but are extracted from a different source. The allergens are indicated by ('2') if present, ('1') if there are traces of it, and ('0') if it is absent from a product. The dataset also includes information on ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.

    N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets:

    • "{dataset_name}_train.csv": samples used for the ML-model training. (e.g "allergens_train.csv")
    • "{dataset_name}_test.csv": samples used to test the the ML-model performance. (e.g "allergens_test.csv")
    • "{dataset_name}_golden_standard.csv": samples represent the ground truth of the test samples. (e.g "allergens_golden_standard.csv")
    • "{dataset_name}_parker_train.csv": samples repaired using Parker Engine used for the ML-model training. (e.g "allergens_parker_train.csv")
    • "{dataset_name}_parker_train.csv": samples repaired using Parker Engine used to test the the ML-model performance. (e.g "allergens_parker_test.csv")
  11. Source Code, Data and Additional Material for the Thesis: "Identification of...

    • heidata.uni-heidelberg.de
    Updated Apr 6, 2017
    Cite
    Thorsten Merten; Thorsten Merten (2017). Source Code, Data and Additional Material for the Thesis: "Identification of Software Features in Issue Tracking System Data" [Dataset]. http://doi.org/10.11588/DATA/10089
    Explore at:
    text/plain; charset=utf-8 (1086), zip (110360424) (available download formats)
    Dataset updated
    Apr 6, 2017
    Dataset provided by
    heiDATA
    Authors
    Thorsten Merten; Thorsten Merten
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/2.2/customlicense?persistentId=doi:10.11588/DATA/10089

    Description

    This dataset provides the code and the data sets used in the PhD thesis "Identification of Software Features in Issue Tracking System Data", as well as the files that represent the results measured in experiments. For problem studies (e.g., chapters 10 and 11), the folders include the raw data and the data annotations, as well as the tools used to extract the data. For solution studies (e.g., chapters 14, 15, and 16), the folders include the raw data, the tools used to extract the data, the gold standards, the code to process the data, and finally the experiment results. This archive contains one folder per chapter, and every folder contains a README.md file describing its contents:

    Chapter10: SOFTWARE FEATURES IN ISSUE TRACKING SYSTEMS – AN EMPIRICAL STUDY
    Chapter11: ISSUE TYPES AND INFORMATION TYPES – AN EMPIRICAL STUDY
    Chapter14: PREPROCESSING ISSUES – AN EMPIRICAL STUDY
    Chapter15: RECOVERING RELATED ISSUES IN ISSUE TRACKING SYSTEMS
    Chapter16: DETECTING SOFTWARE FEATURE REQUESTS IN ISSUES

  12. Any Healthcare Service Any DME - ZipLevel

    • arcgis.com
    • nifc.hub.arcgis.com
    Updated Oct 30, 2024
    Cite
    National Interagency Fire Center (2024). Any Healthcare Service Any DME - ZipLevel [Dataset].
    Dataset updated
    Oct 30, 2024
    Dataset authored and provided by
    National Interagency Fire Center
    Description

    Data Overview: ASPR, in partnership with the Centers for Medicare and Medicaid Services (CMS), provides de-identified and aggregated Medicare beneficiary claims data at the state/territory, county, and ZIP code levels in the HHS emPOWER Map and this public HHS emPOWER REST Service. The REST Service includes aggregated data from the Medicare Fee-For-Service (Parts A&B) and Medicare Advantage (Part C) Programs for beneficiaries who rely on electricity-dependent durable medical equipment (DME) and cardiac implantable devices.

    Data includes the following DME and devices: cardiac devices (left, right, and bi-ventricular assistive devices (LVAD, RVAD, BIVAD) and total artificial hearts (TAH)), ventilators (invasive, non-invasive, and oscillating vests), bi-level positive airway pressure devices (BiPAP), oxygen concentrators, enteral feeding tubes, intravenous (IV) infusion pumps, suction pumps, end-stage renal disease (ESRD) at-home dialysis, motorized wheelchairs or scooters, and electric beds.

    Purpose: Over 3 million Medicare beneficiaries rely on electricity-dependent medical equipment, such as ventilators, to live independently in their homes. Severe weather and other emergencies, especially those with long power outages, can be life-threatening for these individuals. The HHS emPOWER Map and public REST Service give every public health official, emergency manager, hospital, first responder, electric company, and community member the power to discover the electricity-dependent Medicare population in their state/territory, county, and ZIP code.

    Data Source: The REST Service's data is developed from Medicare Fee-For-Service (Parts A & B) (>33M: 65+, blind, ESRD [dialysis], dual-eligible, disabled [adults and children]) and Medicare Advantage (Part C) (>21M: 65+, blind, ESRD [dialysis], dual-eligible, disabled [adults and children]) beneficiary administrative claims data. This data does not include individuals who are only enrolled in a State Medicaid Program. Note that Medicare DME are subject to insurance claim reimbursement caps (e.g., rental caps) that differ by type, so the DME may have different "look-back" periods (e.g., ventilators are 13 months and oxygen concentrators are 36 months).

    ZIP Code Aggregation: Some ZIP Codes do not have specific geospatial boundary data (e.g., P.O. Box ZIP Codes). To capture the complete population data, the HHS emPOWER Program identified the larger boundary ZIP Code (parent) within which each non-boundary ZIP Code (child) resides. The totals are added together and displayed under the parent ZIP Code.

    Approved Data Uses: The public HHS emPOWER REST Service is approved for use by all partners and is intended to help inform and support emergency preparedness, response, recovery, and mitigation activities in all communities.

    Privacy Protections: Protecting the privacy of Medicare beneficiaries is an essential priority for the HHS emPOWER Program. Therefore, all personally identifiable information is removed from the data, and numerous de-identification methods are applied to significantly minimize, if not completely mitigate, any potential for deduction of small cells or re-identification risk. For example, any cell size found between the range of 1 and 10 is masked and shown as 11.
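    As a small illustration of that masking rule (hypothetical function name, not part of the service), in R:

    # Any cell size between 1 and 10 is masked and shown as 11
    mask_small_cells <- function(count) ifelse(count >= 1 & count <= 10, 11, count)

    mask_small_cells(c(0, 3, 10, 11, 250))  # returns 0 11 11 11 250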
    
  13. Metadata record for: A DICOM dataset for evaluation of medical image...

    • figshare.com
    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Scientific Data Curation Team (2023). Metadata record for: A DICOM dataset for evaluation of medical image de-identification [Dataset]. http://doi.org/10.6084/m9.figshare.14802774.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Scientific Data Curation Team
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains key characteristics about the data described in the Data Descriptor "A DICOM dataset for evaluation of medical image de-identification". Contents:

    1. human-readable metadata summary table in CSV format
    2. machine-readable metadata file in JSON format
  14. MSP430FR5969 Basic Block Worst Case Energy Consumption (WCEC) and Worst Case...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, xz
    Updated Dec 20, 2024
    Cite
    Hugo Reymond; Hector Chabot; Abderaouf Nassim Amalou; Isabelle Puaut (2024). MSP430FR5969 Basic Block Worst Case Energy Consumption (WCEC) and Worst Case Execution Time (WCET) dataset [Dataset]. http://doi.org/10.5281/zenodo.11066623
    Explore at:
    csv, bin, xz (available download formats)
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hugo Reymond; Hector Chabot; Abderaouf Nassim Amalou; Isabelle Puaut
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains around 30,000 basic blocks whose energy consumption and execution time have been measured in isolation on the MSP430FR5969 microcontroller, at 1 MHz. Basic blocks were executed in a worst-case scenario regarding the MSP430 FRAM cache and CPU pipeline. The dataset creation process is described thoroughly in [1].

    Folder structure

    This dataset is composed of the following files:

    • basic_blocks.tar.xz contains all basic blocks (BB) used in the dataset, in a custom JSON format,
    • data.csv/data.xlsx contains the measured energy consumption and execution time for each basic block

    We first detail how the basic_blocks.tar.xz archive is organized, and then present the CSV/XLSX spreadsheet format.

    Basic Blocks

    We extracted the basic blocks from a subset of programs of the AnghaBench benchmark suite [2]. The basic_blocks.tar.xz archive consists of the extracted basic blocks organized as JSON files. Each JSON file corresponds to a C source file from AnghaBench and is given a unique identifier. An example JSON file (137.json) is shown here:

    {
      "extr_pfctl_altq.c_pfctl_altq_init": [
         # Basic block 1
        [
          # Instruction 1 of BB1
          [
            "MOV.W",
            "#queue_map",
            "R13"
          ],
          # Instruction 2 of BB1
          [
            "MOV.B",
            "#0",
            "R14"
          ],
          # Instruction 3 of BB1
          [
            "CALL",
            "#hcreate_r",
            null
          ]
        ],
        # Basic block 2
        [
          ....
        ]
      ]
    }

    The JSON contains a dict with only one key, pointing to an array of basic blocks. This key is the name of the original C source file in AnghaBench from which the basic blocks were extracted (here extr_pfctl_altq.c_pfctl_altq_init.c). The array contains several basic blocks, which are represented as arrays of instructions, each instruction itself represented as an array [OPCODE, OPERAND1, OPERAND2].

    Then, each basic block can be identified uniquely using two IDs: its file ID and its offset within the file. In our example, basic block 1 is identified by the JSON file ID (137) and its offset in the file (0), so its ID is 137_0. This ID is used to map a basic block to its energy consumption/execution time in the data.csv/data.xlsx spreadsheet.
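    A minimal R sketch of this ID scheme (assuming the JSON layout shown above, without the explanatory # comments, and using the jsonlite package):

    library(jsonlite)

    file_id <- 137
    # Each file holds a dict with one key pointing to an array of basic blocks
    bbs <- fromJSON(sprintf("%d.json", file_id), simplifyVector = FALSE)[[1]]

    # Build the "fileid_offset" identifiers used in data.csv (offsets start at 0)
    bb_ids <- sprintf("%d_%d", file_id, seq_along(bbs) - 1)
    head(bb_ids)  # "137_0" "137_1" ...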

    Energy Consumption and Execution Time

    Energy consumption and execution time data are stored in the data.csv file. Here is an extract of the CSV file corresponding to basic block 137_0. The spreadsheet format is described below.

    bb_id;nb_inst;max_energy;max_time;avg_time;avg_energy;energy_per_inst;nb_samples;unroll_factor
    137_0;3;8.77;7.08;7.04;8.21;2.92;40;50

    Spreadsheet format:

    • bb_id: the unique identifier of a basic block (cf. Basic Blocks)
    • nb_inst: the number of instructions in the basic block
    • max_energy: the maximum energy consumption (in nJ) measured during the experiment
    • max_time: the maximum execution time (in us) measured during the experiment
    • avg_time: the average execution time (in us) measured during the experiment
    • avg_energy: the average energy consumption (in nJ) measured during the experiment
    • energy_per_inst: the average energy consumption per instruction (corresponds to avg_energy/nb_inst)
    • nb_samples: how many times the basic block's energy consumption/execution time was measured
    • unroll_factor: how many times the basic block was unrolled (cf. Basic Block Unrolling)
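    For example, a minimal R sketch that loads the semicolon-separated measurements and looks up one basic block (assuming data.csv sits in the working directory):

    data <- read.csv("data.csv", sep = ";")

    # Look up the measurements for basic block 137_0
    subset(data, bb_id == "137_0")

    # Recompute the per-instruction average energy, per the definition above
    data$avg_energy / data$nb_inst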

    Basic Block Unrolling

    To measure the energy consumption and execution time on the MSP430, we need to handle the scale difference between the measurement tool and the basic block execution time. This is achieved by duplicating the basic block multiple times while making sure to keep the worst-case memory layout, as explained in the paper. The number of times the basic block has been duplicated is called the unroll_factor.

    Values of energy and time are always given per basic block, so they have already been divided by the unroll factor.

    Dataset description

    Features

    The selected features after PCA analysis for both energy and time model are listed here: MOV.W_Rn_Rn, MOV.W_X(Rn)_X(Rn), CALL, MOV.B_#N_Rn, ADD.W_Rn_Rn, MOV.W_@Rn_Rn, MOV.W_X(Rn)_Rn, ADD.W_#N_Rn, PUSHM.W_#N_Rn, MOV.W_X(Rn)_ADDR, CMP.W_#N_Rn, MOV.W_&ADDR_X(Rn), MOV.W_Rn_X(Rn), BIS.W_Rn_Rn, RLAM.W_#N_Rn, SUB.W_#N_Rn, MOV.W_&ADDR_Rn, MOV.W_#N_X(Rn), CMP.W_Rn_Rn, BIT.W_ADDR_Rn, MOV.W_@Rn_X(Rn), ADD.W_#N_X(Rn), MOV.W_#N_Rn, AND.W_Rn_Rn, MOV.W_Rn_ADDR, SUB.W_Rn_Rn, MOV.W_ADDR_Rn, MOV.W_X(Rn)_&ADDR, MOV.W_ADDR_ADDR, JMP, ADD_#N_Rn, BIS.W_Rn_X(Rn), SUB_Rn_Rn, MOV.W_ADDR_X(Rn), ADDC_#N_X(Rn), MOV.B_Rn_Rn, CMP.W_X(Rn)_X(Rn), ADD_Rn_Rn, nb_inst, INV.W_Rn_, NOP_, ADD.W_X(Rn)_X(Rn), ADD.W_Rn_X(Rn), MOV.B_@Rn_Rn, BIS.W_X(Rn)_X(Rn), MOV.B_#N_X(Rn), MOV.W_#N_ADDR, AND.W_#N_ADDR, SUBC_X(Rn)_X(Rn), BIS.W_#N_X(Rn), SUB.W_X(Rn)_X(Rn), AND.B_#N_Rn, ADD_X(Rn)_X(Rn), MOV.W_@Rn_ADDR, MOV.W_&ADDR_ADDR, ADDC_Rn_Rn, AND.W_#N_X(Rn), SUB_#N_Rn, RRUM.W_#N_Rn, AND_ADDR_Rn, CMP.W_X(Rn)_ADDR, MOV.B_#N_ADDR, ADD.W_#N_ADDR, CMP.B_#N_Rn, SXT_Rn_, XOR.W_Rn_Rn, CMP.W_@Rn_Rn, ADD.W_@Rn_Rn, ADD.W_X(Rn)_Rn, AND.W_Rn_X(Rn), CMP.B_Rn_Rn, AND.W_X(Rn)_X(Rn), BIC.W_#N_Rn, BIS.W_#N_Rn, AND.B_#N_X(Rn), MOV.B_X(Rn)_X(Rn), AND.W_@Rn_Rn, MOV.W_#N_&ADDR, BIS.W_Rn_ADDR, SUB.W_X(Rn)_Rn, SUB.W_Rn_X(Rn), SUB_X(Rn)_X(Rn), MOV.B_@Rn_X(Rn), CMP.W_@Rn_X(Rn), ADD.W_X(Rn)_ADDR, CMP.W_Rn_X(Rn), BIS.W_@Rn_X(Rn), CMP.B_X(Rn)_X(Rn), RRC.W_Rn_, MOV.W_@Rn_&ADDR, CMP.W_#N_X(Rn), ADDC_X(Rn)_Rn, CMP.W_X(Rn)_Rn, BIS.W_X(Rn)_Rn, SUB_X(Rn)_Rn, MOV.B_X(Rn)_Rn, MOV.W_ADDR_&ADDR, AND.W_#N_Rn, RLA.W_Rn_, INV.W_X(Rn)_, XOR.W_#N_Rn, SUB.W_Rn_ADDR, BIC.W_#N_X(Rn), MOV.B_X(Rn)_ADDR, ADD_#N_X(Rn), SUB_Rn_X(Rn), MOV.B_&ADDR_Rn, MOV.W_Rn_&ADDR, ADD_X(Rn)_Rn, AND.W_X(Rn)_Rn, PUSHM.A_#N_Rn, RRAM.W_#N_Rn, AND.W_@Rn_X(Rn), BIS.B_Rn_X(Rn), SUB.W_@Rn_Rn, CLRC_, CMP.W_#N_ADDR, XOR.W_Rn_X(Rn), MOV.B_Rn_ADDR, CMP.B_X(Rn)_Rn, BIS.B_Rn_Rn, BIS.W_X(Rn)_ADDR, CMP.B_#N_X(Rn), CMP.W_Rn_ADDR, XOR.W_X(Rn)_Rn, MOV.B_Rn_X(Rn), ADD.B_#N_Rn

    Code

    The trained machine learning model, tests, and local explanation code can be generated and found here: WORTEX Machine learning code

    Acknowledgment

    This work has received a French government support granted to the Labex CominLabs excellence laboratory and managed by the National Research Agency in the “Investing for the Future” program under reference ANR-10-LABX-07-01

    Licensing

    Copyright 2024 Hector Chabot Copyright 2024 Abderaouf Nassim Amalou Copyright 2024 Hugo Reymond Copyright 2024 Isabelle Puaut

    Licensed under the Creative Commons Attribution 4.0 International License

    References

    [1] Reymond, H., Amalou, A. N., Puaut, I. “WORTEX: Worst-Case Execution Time and Energy Estimation in Low-Power Microprocessors using Explainable ML” in 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024)

    [2] Da Silva, Anderson Faustino, et al. “Anghabench: A suite with one million compilable C benchmarks for code-size reduction.” 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2021.

  15. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: file format: R workspace file, "Simulated_Dataset.RData".

    Metadata (including data dictionary):
    • y: Vector of binary responses (1: adverse outcome, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code Abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

    Description: "CWVS_LMC.txt": this code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities. "Results_Summary.txt": this code is also delivered as a .txt file containing R code. Once the "CWVS_LMC.txt" code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Required R packages:
    • For running "CWVS_LMC.txt": msm (sampling from the truncated normal distribution); mnormt (sampling from the multivariate normal distribution); BayesLogit (sampling from the Polya-Gamma distribution)
    • For running "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)

    Instructions for Use and Reproducibility: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study. How to use the information:
    • Load the "Simulated_Dataset.RData" workspace
    • Run the code contained in "CWVS_LMC.txt"
    • Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt"

    Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
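    A minimal R sketch of the replication procedure described above (the .txt files contain R code, so they can be executed with source(), assuming all files sit in the working directory):

    # Load the simulated dataset: provides y, x, z, n, m, p, alpha_true
    load("Simulated_Dataset.RData")

    # Fit the CWVS-LMC model (requires msm, mnormt, BayesLogit)
    source("CWVS_LMC.txt")

    # Summarize/plot critical windows and inclusion probabilities (requires plotrix)
    source("Results_Summary.txt")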

  16. mLab App Paradata

    • figshare.com
    txt
    Updated Nov 13, 2025
    Cite
    Thomas Scherr; Austin N. Hardcastle; Carson Moore; Dheemanth Majji; Lisa M. Kuhns; Robert Garofalo; Rebecca Schnall (2025). mLab App Paradata [Dataset]. http://doi.org/10.6084/m9.figshare.28211636.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Thomas Scherr; Austin N. Hardcastle; Carson Moore; Dheemanth Majji; Lisa M. Kuhns; Robert Garofalo; Rebecca Schnall
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    mLab App Paradata Data

    This data is part of the publications associated with the analysis of paradata and user engagement with the mLab App, a digital health application designed to support at-home HIV testing. The dataset includes detailed user interactions, testing results, and metadata collected during a multi-site clinical trial. The data is divided into several files, each with its own schema and purpose, as described below.

    Notes:

    • The data has been de-identified and cleaned.
    • The datasets originated from a REDCap project. Because the custom reports used to generate the dataset served different needs, there may be redundancies across the different files.
    • The user_id assigned to each user is consistent across each of the files (i.e., test windows in test_windows.csv for user_id == 10 correspond to user_id == 10 in para.csv, info.csv, etc.).
    • Paradata collection was designed to operate asynchronously, ensuring that no user interactions were disrupted during data collection. As mLab was a browser-based technology, rapid use of browser navigation can make some events appear out of order (as noted in our manuscripts).
    • Dataframe column datatypes have been converted to satisfy specific analyses. Please check datatypes and convert as needed for your particular needs.
    • Due to the sensitive nature of the survey data, the CSV/parquet files for the survey are not included in this data repository. They will be made available upon reasonable request.
    • For detailed descriptions of the study design and methodology, please refer to the associated publication.

    File descriptions

    1. facts.csv / facts.parquet - records the educational facts shown to users.
      • display_timestamp: Unix timestamp when an educational fact was displayed to the user
      • session_id: unique identifier for the user's session when the fact was shown
      • user_id: unique identifier for the user the fact was shown to
      • fact_category: category of the educational fact displayed to the user
      • fact_index: index number of the fact shown to the user
      • fact_text: text of the educational fact displayed
    2. info.csv / info.parquet - contains user-specific metadata and repeated data about each user (alerts and pinned facts).
      • user_id: unique identifier for the user
      • redcap_repeat_instrument: REDCap field indicating the repeat instrument used. For general information about the user (user_location and number_of_logins), redcap_repeat_instrument is blank; for repeated data (alerts, pinned facts, scheduled tests), it identifies the instrument
      • redcap_repeat_instance: instance number of the repeat instrument (if applicable)
      • user_location: location of the user, if available (1: New York City cohort; 2: Chicago cohort)
      • alert_date: Unix timestamp of when an alert was sent to the user
      • number_of_logins: total number of logins by the user
      • alert_subject: subject or type of the alert sent
      • alert_read: indicates whether the alert was read by the user (1: true; 0: false)
      • start_date / end_date: Unix timestamps of the start and end dates of scheduled tests
      • fact_category: category of the educational fact pinned by the user
      • fact_index: index number of the fact pinned by the user
      • fact_text: text of the educational fact pinned by the user
      • fact_link: link to additional information associated with the pinned fact (if available)
    3. para.csv / para.parquet - includes paradata (detailed in-app user interactions) collected during the study.
      • timestamp: a timezone-naive timestamp of the user action or event
      • session_id: unique identifier for the user's session
      • user_id: unique identifier for the user
      • user_action: specific user action (e.g., button press, page navigation); "[]clicked" indicates a pressable element (i.e., button, collapsible/expandable menu) was pressed
      • current_page: current page of the app being interacted with
      • browser: browser used to access the app
      • platform: platform used to access the app (e.g., Windows, iOS)
      • platform_description: detailed description of the platform
      • platform_maker: manufacturer of the platform
      • device_name: name of the device used
      • device_maker: manufacturer of the device used
      • device_brand_name: brand name of the device used
      • device_type: type of device used (Mobile, Computer, etc.)
      • user_location: location of the user (1: New York City cohort; 2: Chicago cohort)
    4. survey.csv / survey.parquet - contains survey responses collected from users. NOTE: due to the sensitive nature of this data, the CSV/parquet files are not included in this data repository; they will be made available upon reasonable request.
      • user_id: unique identifier for the user
      • timepoint: timepoint of the survey (baseline/0 months, 6 months, 12 months)
      • race: race of the user
      • education: education level of the user
      • health_literacy: health literacy score of the user
      • health_efficacy: health efficacy score of the user
      • itues_mean: Information Technology Usability Evaluation Scale (ITUES) mean score
      • age: age of the user
    5. tests.csv / tests.parquet - contains data related to the HIV self-tests performed by users in the mLab App.
      • user_id: unique identifier for the user that took the test
      • visual_analysis_date: Unix timestamp of the visual analysis of the test by the user
      • visual_result: result of the visual analysis (positive, negative)
      • mlab_analysis_date: Unix timestamp of the analysis conducted by the mLab system
      • mlab_result: result from the mLab analysis (positive, negative)
      • signal_ratio: ratio of the intensity of the test signal to the control signal
      • control_signal: mLab-calculated intensity of the control signal
      • test_signal: mLab-calculated intensity of the test signal
      • browser: browser used to access the app (from the User Agent string)
      • platform: platform used to access the app (e.g., Windows, iOS) (from the User Agent string)
      • platform_description: detailed description of the platform (from the User Agent string)
      • platform_maker: manufacturer of the platform (from the User Agent string)
      • device_name: name of the device used (from the User Agent string)
      • device_maker: manufacturer of the device used (from the User Agent string)
      • device_brand_name: brand name of the device used (from the User Agent string)
      • device_type: type of device used (Mobile, Computer, etc.) (from the User Agent string)
    6. test_windows.csv / test_windows.parquet - contains information on the testing windows assigned to users.
      • user_id: unique identifier for the user
      • redcap_repeat_instance: instance of the repeat instrument
      • start_date: start date of the (hard) testing window
      • end_date: end date of the (hard) testing window

    Citation: If you use this dataset, please cite the associated mLab and mLab paradata publications.
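    Because user_id is consistent across files, the tables join directly. A minimal R sketch using the arrow and dplyr packages (and remembering the note above about checking column datatypes first):

    library(arrow)
    library(dplyr)

    para    <- read_parquet("para.parquet")
    windows <- read_parquet("test_windows.parquet")

    # Attach each user's assigned testing windows to their paradata events
    para_windows <- inner_join(para, windows, by = "user_id")

    # Events falling inside a (hard) testing window; convert types first if needed
    in_window <- filter(para_windows, timestamp >= start_date, timestamp <= end_date)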

  17. Data for: Qualitative Data Sharing: Participant Understanding, Motivation,...

    • data.qdr.syr.edu
    pdf, tsv, txt
    Updated Nov 1, 2023
    Cite
    Alicia VandeVusse; Jennifer Mueller (2023). Data for: Qualitative Data Sharing: Participant Understanding, Motivation, and Consent [Dataset]. http://doi.org/10.5064/F6YYA3O3
    Explore at:
    tsv(1613), pdf(215887), tsv(33480), txt(3490), pdf(219898)
    Available download formats
    Dataset updated
    Nov 1, 2023
    Dataset provided by
    Qualitative Data Repository
    Authors
    Alicia VandeVusse; Jennifer Mueller
    License

    https://qdr.syr.edu/policies/qdr-standard-access-conditions

    Time period covered
    Jan 2020 - Feb 2020
    Area covered
    Wisconsin, New Jersey, United States
    Dataset funded by
    Eunice Kennedy Shriver National Institute Of Child Health & Human Development of the National Institutes of Health
    Description

    Project Summary

    As part of a qualitative study of abortion reporting in the United States, the research team conducted cognitive interviews to iteratively assess new question wording and introductions designed to improve the accuracy of abortion reporting in surveys (to be shared on the Qualitative Data Repository in a separate submission). As expectations to share the data that underlie research increase, it is necessary to understand how participants, particularly those taking part in qualitative research, respond to requests for data sharing. We assessed research participants' willingness to share data, their understanding of data sharing, and their motivations for doing so.

    Data Overview

    The data consist of excerpts from cognitive interviews with 64 cisgender women in two states in January and February of 2020, in which researchers asked respondents for consent to share de-identified data. Eligibility criteria included: assigned female at birth, currently identifying as a woman between the ages of 18 and 49, English-speaking, and reporting ever having had penile-vaginal sex. Respondents were also screened for abortion history to ensure that at least half the sample reported a prior abortion. At the end of the interviews, participants were asked to reflect on their motivations for agreeing or declining to share their data. The data included here are coded excerpts of their answers. Most respondents consented to data sharing, citing helping others as a primary motivation. However, a substantial number of participants demonstrated limited understanding of "data sharing."

    Data available here include the following materials: an overview of methods, the cognitive interview consent form (with language for data sharing consent), and the data sharing analysis coding scheme.

  18. Replication Data for: SF20

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Oct 15, 2025
    Cite
    Mi-Sun Lee (2025). Replication Data for: SF20 [Dataset]. http://doi.org/10.7910/DVN/AXKNZ2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 15, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Mi-Sun Lee
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains de-identified data and analysis code from a study using the 20-item Short Form Health Survey (SF-20). The data were collected to evaluate the validity and reliability of the SF-20 instrument.

    Included files:
    - De-identified data (sf20.sas7bdat)
    - Analysis code (iscience_code.sas)

    Variables in the dataset correspond to survey domains such as physical functioning, mental health, general health, pain, role functioning, and social functioning. The analysis code includes scripts for descriptive statistics, internal consistency, and test-retest reliability; a minimal illustration of the internal-consistency computation is sketched below. All information is de-identified to protect participant privacy. Further details can be found in the accompanying README file and manuscript.
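    The authors' analysis code is in SAS; as a language-neutral sketch of what the internal-consistency step computes, here is Cronbach's alpha in Python. The formula is standard, but the item column names and the read_sas usage are hypothetical, since the actual variable names live in the dataset's README.

    ```python
    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of the summed scale)."""
        k = items.shape[1]                         # number of items in the domain
        item_var_sum = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
        return (k / (k - 1)) * (1 - item_var_sum / total_var)

    # Hypothetical usage with invented physical-functioning item columns:
    # df = pd.read_sas("sf20.sas7bdat")
    # print(cronbach_alpha(df[["pf1", "pf2", "pf3", "pf4", "pf5", "pf6"]]))
    ```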

  19. Interviews regarding data curation for qualitative data reuse and big social...

    • data.qdr.syr.edu
    bin, pdf, txt
    Updated Apr 26, 2023
    Cite
    Sara Mannheimer (2023). Interviews regarding data curation for qualitative data reuse and big social research [Dataset]. http://doi.org/10.5064/F6GWMU4O
    Explore at:
    pdf(111223), pdf(170851), pdf(174860), pdf(220706), pdf(181317), pdf(155781), pdf(176948), pdf(186400), pdf(216506), pdf(186156), pdf(166627), pdf(204315), pdf(120883), pdf(223955), pdf(197623), pdf(209721), pdf(212401), pdf(111468), pdf(175067), pdf(194133), pdf(194606), bin(254918656), pdf(174896), txt(8346), pdf(180451), pdf(192049), pdf(119959), pdf(214380), bin(2258685), pdf(547705), pdf(189347), pdf(196971), pdf(115127), pdf(213879), pdf(146828), pdf(195493), pdf(177017), pdf(189665), pdf(149437), pdf(183110), pdf(221008), pdf(200024)
    Available download formats
    Dataset updated
    Apr 26, 2023
    Dataset provided by
    Qualitative Data Repository
    Authors
    Sara Mannheimer
    License

    https://qdr.syr.edu/policies/qdr-standard-access-conditions

    Time period covered
    Mar 1, 2019 - Jun 1, 2023
    Area covered
    United States
    Description

    Project Overview

    Trends toward open science practices, along with advances in technology, have promoted increased data archiving in recent years, thus bringing new attention to the reuse of archived qualitative data. Qualitative data reuse can increase efficiency and reduce the burden on research subjects, since new studies can be conducted without collecting new data. Qualitative data reuse also supports larger-scale, longitudinal research by combining datasets to analyze more participants. At the same time, qualitative research data can increasingly be collected from online sources. Social scientists can access and analyze personal narratives and social interactions through social media such as blogs, vlogs, online forums, and posts and interactions from social networking sites like Facebook and Twitter. These big social data have been celebrated as an unprecedented source of data analytics, able to produce insights about human behavior on a massive scale. However, both types of research also present key epistemological, ethical, and legal issues. This study explores the issues of context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership, with a focus on data curation strategies. The research suggests that connecting qualitative researchers, big social researchers, and curators can enhance responsible practices for qualitative data reuse and big social research.

    This study addressed the following research questions:
    RQ1: How is big social data curation similar to and different from qualitative data curation?
    RQ1a: How are epistemological, ethical, and legal issues different or similar for qualitative data reuse and big social research?
    RQ1b: How can data curation practices such as metadata and archiving support and resolve some of these epistemological and ethical issues?
    RQ2: What are the implications of these similarities and differences for big social data curation and qualitative data curation, and what can we learn from combining these two conversations?

    Data Description and Collection Overview

    The data in this study were collected using semi-structured interviews that centered around specific incidents of qualitative data archiving or reuse, big social research, or data curation. Participants were therefore drawn from three categories: researchers who have used big social data, qualitative researchers who have published or reused qualitative data, and data curators who have worked with one or both types of data. Six key issues were identified in a literature review and used to structure three interview guides for the semi-structured interviews: context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership. Participants were limited to those working in the United States. Ten participants from each of the three target populations (big social researchers, qualitative researchers who had published or reused data, and data curators) were interviewed. The interviews were conducted between March 11 and October 6, 2021. When scheduling the interviews, participants received an email asking them to identify a critical incident prior to the interview. The "incident" in the critical incident interviewing technique is a specific example that focuses a participant's answers to the interview questions.

    Participants gave permission for the interviews to be recorded, which was done using the built-in recording feature of Zoom videoconferencing software. The author also took notes during the interviews. Otter.ai speech-to-text software was used to create initial transcriptions of the interview recordings, and a hired undergraduate student hand-edited the transcripts for accuracy. The transcripts were manually de-identified. The author analyzed the interview transcripts using a qualitative content analysis approach, combining inductive and deductive coding. After reviewing the research questions, the author used NVivo software to identify chunks of text in the interview transcripts that represented key themes of the research. Because the interviews were structured around each of the six key issues identified in the literature review, the author deductively created a parent code for each issue. The author then used inductive coding to create sub-codes beneath each parent code.

    Selection and Organization of Shared Data

    The data files consist of 28 of the interview transcripts themselves: transcripts from Big Social Researchers (BSR), Data Curators (DC), and Qualitative Researchers (QR)...

  20. Automatic Identification And Data Capture Market Analysis North America,...

    • technavio.com
    pdf
    Updated Oct 30, 2024
    Cite
    Technavio (2024). Automatic Identification And Data Capture Market Analysis North America, APAC, Europe, South America, Middle East and Africa - China, US, Japan, UK, Germany - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/automatic-identification-and-data-capture-market-industry-analysis
    Explore at:
    pdf
    Available download formats
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2024 - 2028
    Area covered
    United States, United Kingdom
    Description


    Automatic Identification And Data Capture Market Size 2024-2028

    The automatic identification and data capture market is forecast to grow by USD 21.52 billion at a CAGR of 8.1% from 2023 to 2028. Increasing applications of RFID will drive this growth.

    Market Insights

    North America dominated the market and is expected to account for 47% of market growth during 2024-2028.
    By Product - RFID products segment was valued at USD 18.41 billion in 2022
    By segment2 - segment2_1 segment accounted for the largest market revenue share in 2022
    

    Market Size & Forecast

    Market Opportunities: USD 79.34 million
    Market Future Opportunities 2023: USD 21,520.40 million
    CAGR from 2023 to 2028: 8.1%
    

    Market Summary

    The Automatic Identification and Data Capture (AIDC) market encompasses technologies and solutions that enable businesses to capture and process data in real time. This market is driven by the increasing adoption of RFID technology, which offers benefits such as improved supply chain visibility, inventory management, and operational efficiency. The growing popularity of smart factories, where automation and data-driven processes are integral, further fuels the demand for AIDC solutions. However, the market also faces challenges, including security concerns. With the increasing use of AIDC technologies, there is a growing need to ensure data privacy and security. This has led to the development of advanced encryption techniques and access control mechanisms to mitigate potential risks. A real-world business scenario illustrating the importance of AIDC is in the retail industry. Retailers use AIDC technologies such as RFID tags and barcode scanners to manage inventory levels, track stock movements, and optimize supply chain operations. By automating data capture processes, retailers can reduce manual errors, improve order fulfillment accuracy, and enhance the overall customer experience. Despite the challenges, the AIDC market continues to grow, driven by the need for real-time data processing and automation across various industries.

    What will be the size of the Automatic Identification And Data Capture Market during the forecast period?

    The Automatic Identification and Data Capture (AIDC) market continues to evolve, driven by advancements in technology and increasing business demands. AIDC solutions, including barcode scanners, RFID systems, and OCR technology, enable organizations to streamline processes, enhance data accuracy, and improve operational efficiency. According to recent research, the use of RFID technology in the retail sector has surged by 25% over the past five years, underpinning its significance in inventory management and supply chain optimization. Moreover, the integration of AIDC technologies with cloud computing services and data visualization dashboards offers real-time data access and analysis, empowering businesses to make informed decisions. For instance, a manufacturing firm can leverage RFID data to monitor production lines, optimize workflows, and ensure compliance with industry regulations. AIDC systems are also instrumental in enhancing data security and privacy, with advanced encryption protocols and access control features ensuring data integrity and confidentiality. By adopting AIDC technologies, organizations can not only improve their operational efficiency but also gain a competitive edge in their respective industries.

    Unpacking the Automatic Identification And Data Capture Market Landscape

    The market encompasses technologies such as RFID tag identification, data stream management, and data mining techniques. These solutions enable businesses to efficiently process and analyze vast amounts of data from various sources, leading to significant improvements in data quality metrics and workflow optimization strategies. For instance, RFID implementation can result in a 30% increase in inventory accuracy, while data mining techniques can uncover hidden patterns and trends, driving ROI improvement and compliance alignment. Real-time data processing, facilitated by technologies like document understanding AI and image recognition algorithms, ensures swift decision-making and error reduction. Data capture pipelines and database management systems provide a solid foundation for data aggregation and analysis, while semantic web technologies and natural language processing enhance information retrieval and understanding. By integrating sensor data and applying machine vision systems, businesses can achieve high-throughput imaging and object detection, further enhancing their data processing capabilities.

    Key Market Drivers Fueling Growth

    The significant expansion of RFID (Radio-Frequency Identification) technology applications is the primary market growth catalyst...

Cite
(2007). De-Identification Software Package [Dataset]. http://doi.org/10.13026/C20M3F

Data from: De-Identification Software Package

Related Article
Explore at:
14 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Dec 18, 2007
License

Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically

Description

The deid software package includes code and dictionaries for automated location and removal of protected health information (PHI) in free text from medical records.
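To illustrate the general technique named here (dictionary lookups plus pattern matching over free text), a toy sketch follows. This is not the deid package's actual algorithm; the name list, regular expressions, and placeholder tags are invented for the example.

```python
import re

# Toy dictionary- and pattern-based PHI scrubber; the real deid package
# uses far richer dictionaries and context-sensitive rules.
NAME_DICTIONARY = {"alice", "bob"}  # hypothetical name list
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

def scrub(text: str) -> str:
    """Replace phone numbers, dates, and dictionary-listed names with tags."""
    text = PHONE_RE.sub("[**PHONE**]", text)
    text = DATE_RE.sub("[**DATE**]", text)
    tokens = [
        "[**NAME**]" if t.lower().strip(".,") in NAME_DICTIONARY else t
        for t in text.split()
    ]
    return " ".join(tokens)

print(scrub("Alice called on 3/14/2007 at 555-867-5309."))
# -> [**NAME**] called on [**DATE**] at [**PHONE**].
```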
