78 datasets found
  1. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    zip
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped, and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling of baseline labs against overall survival (N=176) was performed using this clinical informatics pipeline.

    Conclusion: We demonstrate the feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded into a REDCap-based national registry, enabling real-world data analysis and interoperability.

    Methods: eLAB Development and Source Code (R statistical software)

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g., a medical record number (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names; eLAB converts these to MCCPR-assigned record identification numbers (record_id) before import, de-identifying the data.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or institutional enterprise data warehouses (EDWs) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for the ‘ehr_format(dt)’ single-line command is untidy data assigned to R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results, wherein several lab panels occupy a single data frame cell. A mock dataset in this ‘untidy’ format is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
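
    As a rough illustration of this 'untidy' input layout (the actual eLAB implementation is in R; the column names and sample values below are invented for demonstration), a pandas sketch of splitting a multi-panel cell into one row per lab might look like:

```python
import pandas as pd

# Hypothetical mock-up of the 4-column 'untidy' input described above:
# several lab panels share one data-frame cell.
dt = pd.DataFrame({
    "patient": ["DOE,JANE (MRN001)"],
    "collection_date": ["2020-01-15"],
    "collection_time": ["08:30"],
    "lab_results": ["Sodium 140 mmol/L; Potassium 4.1 mmol/L; Creatinine 0.9 mg/dL"],
})

# Explode the multi-panel cell into one row per lab, mimicking the
# reshape that ehr_format() performs on the R side.
tidy = (
    dt.assign(lab=dt["lab_results"].str.split("; "))
      .explode("lab")
      .drop(columns="lab_results")
      .reset_index(drop=True)
)
parts = tidy["lab"].str.extract(r"^(?P<lab_name>\S+) (?P<value>\S+) (?P<unit>\S+)$")
tidy = pd.concat([tidy.drop(columns="lab"), parts], axis=1)
print(tidy)
```

    This is only a sketch of the reshape; the real pipeline also handles date/time parsing and malformed rows.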

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
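
    The key-value remapping step can be sketched as follows (a minimal Python illustration; the potassium subtype strings come from the description above, but the DD codes and the exact filtering behaviour are assumptions, not the registry's actual codes):

```python
# Illustrative key-value lookup table mapping EHR lab subtypes to a single
# Data Dictionary code; the DD codes here ("k", "na") are invented.
LAB_LOOKUP = {
    "Potassium": "k",
    "Potassium-External": "k",
    "Potassium(POC)": "k",
    "Potassium,whole-bld": "k",
    "Potassium-Level-External": "k",
    "Potassium,venous": "k",
    "Potassium-whole-bld/plasma": "k",
    "Sodium": "na",
}

def remap(lab_name):
    """Return the DD code for a lab subtype, or None for labs the
    registry DD does not pre-define (these are filtered out)."""
    return LAB_LOOKUP.get(lab_name)

labs = ["Potassium(POC)", "Potassium,venous", "Chloride", "Sodium"]
coded = [code for code in (remap(name) for name in labs) if code is not None]
print(coded)  # Chloride has no DD entry and is dropped
```

    The real lookup table holds ~300 subtype entries and also carries the expected unit for each DD code.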

    Data Dictionary (DD)

    EHR clinical laboratory data are captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and its associated lab unit with a set field/variable name. It also restricts/defines the type of data allowed for entry in each data field, such as string or numeric. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value lookup tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contain the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field codes, formats, and relationships in the database are uniform across sites, allowing simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and the different site csv files are simply combined.
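
    Because every site shares the DD, aggregation reduces to concatenating per-site exports with identical columns. A minimal Python sketch, with mocked file contents and invented column names:

```python
import csv
import io

# Mock per-site CSV exports; the headers are identical because every
# participating site uses the same Data Dictionary.
site_a = "record_id,lab_code,value,unit\n1,k,4.1,mmol/L\n"
site_b = "record_id,lab_code,value,unit\n2,k,3.8,mmol/L\n"

rows = []
header = None
for blob in (site_a, site_b):
    reader = csv.reader(io.StringIO(blob))
    site_header = next(reader)
    if header is None:
        header = site_header
    # Same DD -> same columns at every site, so a plain check suffices.
    assert site_header == header
    rows.extend(reader)

print(header, rows)
```

    With real files, the in-memory strings would be replaced by opened site exports; the concatenation logic is unchanged.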

    Study Cohort

    This study was approved by the MGB IRB. The EHR was searched to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data were censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazards modeling was performed for each lab predictor. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
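
    The OS definition with right-censoring can be expressed as a small helper (a Python sketch; the study's analysis was done in R, and the dates and names below are invented for demonstration):

```python
from datetime import date

# Toy illustration of the OS definition above: time from MCC diagnosis to
# death, right-censored at the last follow-up visit when no death occurred.
def os_days(diagnosis, death=None, last_followup=None):
    """Return (time_in_days, event): event is 1 if death was observed,
    0 if the observation is censored at last follow-up."""
    if death is not None:
        return (death - diagnosis).days, 1
    return (last_followup - diagnosis).days, 0

observed = os_days(date(2017, 3, 1), death=date(2019, 3, 1))
censored = os_days(date(2018, 6, 1), last_followup=date(2021, 6, 1))
print(observed, censored)
```

    Pairs of (time, event) in this form are exactly what Cox proportional hazards routines take as their outcome.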

  2. Replication Data for: Data pre-processing pipeline generation for AutoETL

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Giovanelli, Joseph (2023). Replication Data for: Data pre-processing pipeline generation for AutoETL [Dataset]. http://doi.org/10.7910/DVN/B3YNCI
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Giovanelli, Joseph
    Description

    Data pre-processing plays a key role in a data analytics process (e.g., applying a classification algorithm to a predictive task). It encompasses a broad range of activities that span from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, or defined rules, on how pre-processing transformations impact the final results of the analysis. The problem is exacerbated when transformations are combined into pre-processing pipeline prototypes. Data scientists cannot easily foresee the impact of pipeline prototypes and hence require a method to discriminate between them and find the most relevant ones (e.g., those with the highest positive impact) for the study at hand. Once found, these prototypes can be instantiated and optimized, e.g., using Bayesian Optimization. In this work, we study the impact of transformations when chained together into prototypes, and the impact of transformations when instantiated via various operators. We develop and scrutinize a generic method for generating pre-processing pipelines, as a step towards AutoETL. We make use of rules that enable the construction of prototypes (i.e., define the order of transformations), and rules that guide the instantiation of the transformations inside the prototypes (i.e., define the operator for each transformation). Optimizing our pipeline prototypes yields, in the median, 90% of the predictive accuracy of an exhaustive search, at a time cost 24 times smaller.
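
    The prototype-construction idea (ordering rules constraining which transformation sequences are admissible) can be sketched as follows; the transformation names and precedence rules here are invented examples, not the paper's actual rule set:

```python
from itertools import permutations

# Candidate pre-processing transformations (illustrative names).
TRANSFORMATIONS = ["imputation", "encoding", "normalization", "feature_selection"]

# Each rule (a, b) says transformation a must precede transformation b
# in any valid prototype. These rules are invented for demonstration.
RULES = [("imputation", "encoding"), ("encoding", "normalization")]

def valid(prototype):
    """A prototype is valid if every precedence rule is respected."""
    return all(prototype.index(a) < prototype.index(b) for a, b in RULES)

# Enumerate all orderings and keep only rule-respecting prototypes;
# each surviving prototype would then be instantiated with concrete
# operators and optimized (e.g., via Bayesian Optimization).
prototypes = [p for p in permutations(TRANSFORMATIONS) if valid(p)]
for p in prototypes:
    print(" -> ".join(p))
```

    With these example rules, the relative order of three transformations is fixed, so only the placement of feature_selection varies.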

  3. CLM Preliminary Assessment Extent Definition & Report( CLM PAE)

    • researchdata.edu.au
    • data.gov.au
    • +1 more
    Updated Mar 22, 2016
    Cite
    Bioregional Assessment Program (2016). CLM Preliminary Assessment Extent Definition & Report( CLM PAE) [Dataset]. https://researchdata.edu.au/clm-preliminary-assessment-clm-pae/2992711
    Explore at:
    Dataset updated
    Mar 22, 2016
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.

    The Preliminary Assessment Extent (PAE) is a spatial layer that defines the land surface area contained within a bioregion over which coal resource development may have a potential impact on water-dependent assets and the receptors associated with those assets (Barrett et al. 2013).

    Purpose

    The role of the PAE is to optimise research agency effort by focussing on those locations where a material causal link may occur between coal resource development and impacts on water dependent assets. The lists of assets collated by the Program are filtered for "proximity" such that only those assets that intersect with the PAE are considered further in the assessment process. Changes to the PAE such as through the identification of a different development pathway or an improved hydrological understanding may require the proximity of assets to be considered again. Should the assessment process identify a material connection between a water dependent asset outside the PAE and coal resource development impacts, the PAE would need to be amended.

    Dataset History

    The PAE is derived from the intersection of surface hydrology features; groundwater management units; mining development leases and/or CSG tenements; and, directional flows of surface and groundwater.

    The following 5 inputs were used by the Specialists to define the Preliminary Assessment Extent:

    1. Bioregion boundary

    2. Geology and the coal resource

    3. Surface water hydrology

    4. Groundwater hydrology

    5. Flow paths (known available information on gradients of pressure, water table height, stream direction, surface-groundwater interactions and any other available data)

    Dataset Citation

    Bioregional Assessment Programme (2014) CLM Preliminary Assessment Extent Definition & Report( CLM PAE). Bioregional Assessment Derived Dataset. Viewed 28 September 2017, http://data.bioregionalassessments.gov.au/dataset/2cdd0e81-026e-4a41-87b0-ec003eddc5c1.

    Dataset Ancestors

  4. Data Sheet 2_Behavioral and neural effects of temporoparietal...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Feb 25, 2025
    + more versions
    Cite
    Heffernan, Joseph; Youssofzadeh, Vahab; Ustine, Candida; Mueller, Kimberly D.; Shah-Basak, Priyanka P.; Schold, Shelby; Fellmeth, Mason; Pillay, Sara B.; Granadillo, Elias D.; Binder, Jeffrey R.; Kraegel, Peter; Ikonomidou, Chrysanthy; Raghavan, Manoj; Okonkwo, Ozioma (2025). Data Sheet 2_Behavioral and neural effects of temporoparietal high-definition transcranial direct current stimulation in logopenic variant primary progressive aphasia: a preliminary study.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001323810
    Explore at:
    Dataset updated
    Feb 25, 2025
    Authors
    Heffernan, Joseph; Youssofzadeh, Vahab; Ustine, Candida; Mueller, Kimberly D.; Shah-Basak, Priyanka P.; Schold, Shelby; Fellmeth, Mason; Pillay, Sara B.; Granadillo, Elias D.; Binder, Jeffrey R.; Kraegel, Peter; Ikonomidou, Chrysanthy; Raghavan, Manoj; Okonkwo, Ozioma
    Description

    Background: High-definition tDCS (HD-tDCS) is a recent technology that allows for localized cortical stimulation but has not yet been investigated as an augmentative therapy targeting the left temporoparietal cortex in logopenic variant PPA (lvPPA). The changes in neuronal oscillatory patterns and resting-state functional connectivity in response to HD-tDCS also remain poorly understood.

    Objective: We sought to investigate the effects of HD-tDCS with phonologic-based language training on language, cognition, and resting-state functional connectivity in lvPPA.

    Methods: We used a double-blind, within-subject, sham-controlled crossover design with a 4-month between-treatment period in four participants with lvPPA. Participants completed language and cognitive assessments and imaging with magnetoencephalography (MEG) and resting-state functional MRI (fMRI) prior to treatment with either anodal HD-tDCS or sham targeting the left supramarginal gyrus over 10 sessions. Language and cognitive assessments, MEG, and fMRI were repeated after the final session and at 2-month follow-up. Preliminary data on efficacy were evaluated based on relative changes from baseline in language and cognitive scores. Language measures included metrics derived from spontaneous speech from picture description. Changes in resting-state functional connectivity within the phonological network were analyzed using fMRI. Magnitudes of source-level evoked responses and hemispheric laterality indices from language task-based MEG were used to assess changes in cortical engagement induced by HD-tDCS.

    Results: All four participants were retained across the 4-month between-treatment period, with satisfactory blinding of participants and investigators throughout the study. Anodal HD-tDCS was well tolerated, with a side-effect profile that did not extend past the immediate treatment period. No benefit of HD-tDCS over sham on language and cognitive measures was observed in this small sample. Functional imaging results using MEG and fMRI indicated an excitatory effect of anodal HD-tDCS compared to sham and suggested that greater temporoparietal activation and connectivity were positively associated with language outcomes.

    Conclusion: Anodal HD-tDCS to the inferior parietal cortex combined with language training appears feasible and well tolerated in participants with lvPPA. Language outcomes may be explained by regression to the mean and, to a lesser degree, by ceiling effects and differences in baseline disease severity. The intervention has apparent temporoparietal correlates, and its clinical efficacy should be further studied in larger trials.

    Clinical trial registration: ClinicalTrials.gov, Number NCT03805659.

  5. NWTC Ceilometer (1) Pre-campaign / Derived Data

    • catalog.data.gov
    • data.openei.org
    Updated Jul 25, 2023
    + more versions
    Cite
    Wind Energy Technologies Office (WETO) (2023). NWTC Ceilometer (1) Pre-campaign / Derived Data [Dataset]. https://catalog.data.gov/dataset/nwtc-ceilometer-1-pre-campaign-derived-data
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Wind Energy Technologies Office (WETO)
    Description

    Overview: This instrument will be testing the data transfer process before deploying the campaign. The netCDF L3 data file contains level 3 (L3) data that have gone through the calculation service, including mixing layer height values and quality index data. L3 default files contain L3 data that use the default preset for a live plot.

    File naming schema: L3_DEFAULT_YYYYMMDDHHMM_.nc

    Name: Description
    L3: Identification of the data level
    DEFAULT: Identification of the L3 file type (other types: CUSTOM, OFFLINE)
    STATION_NUMBER: WMO station number, if defined
    YYYYMMDDHHMM: UTC time
    ParameterKey: Identification of the advanced algorithm settings (see the table below for an explanation)
    FREE_FORMAT: File suffix, if defined

    Data Quality: These data have been passed through a processing algorithm to determine cloud layer heights and backscatter intensity profiles.
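
    A hypothetical parser for the naming schema above (the optional-suffix handling and the exact field layout are assumptions about how real station files are named, not taken from the dataset documentation):

```python
import re

# Assumed layout: L3_<TYPE>_<YYYYMMDDHHMM>[_<suffix>].nc, where TYPE is
# one of DEFAULT/CUSTOM/OFFLINE and the timestamp is 12 digits of UTC time.
PATTERN = re.compile(
    r"^L3_(?P<ftype>DEFAULT|CUSTOM|OFFLINE)_(?P<timestamp>\d{12})(?:_(?P<suffix>.*))?\.nc$"
)

def parse_l3_name(name):
    """Split an L3 file name into its schema fields, or None if it
    does not match the assumed schema."""
    m = PATTERN.match(name)
    return m.groupdict() if m else None

info = parse_l3_name("L3_DEFAULT_202307250000_site1.nc")
print(info)
```

    Non-matching names return None, so the parser can double as a filter when scanning a directory of mixed files.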

  6. Preliminary definition of the geothermal resources potential of West...

    • data.wu.ac.at
    Updated Apr 9, 2018
    Cite
    (2018). Preliminary definition of the geothermal resources potential of West Virginia [Dataset]. https://data.wu.ac.at/odso/geothermaldata_org/YTI1YjdkMDEtNjA0Ni00ZjljLWE5NjgtMzI4ZGI0YTcxYjEy
    Explore at:
    Dataset updated
    Apr 9, 2018
    Description

    No Publication Abstract is Available

  7. Data from: S1 Dataset -

    • plos.figshare.com
    xlsx
    Updated Jul 18, 2024
    Cite
    Navid Behzadi Koochani; Raúl Muñoz Romo; Ignacio Hernández Palencia; Sergio López Bernal; Carmen Martin Curto; José Cabezas Rodríguez; Almudena Castaño Reguillo (2024). S1 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0305699.s002
    Explore at:
    xlsx
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Navid Behzadi Koochani; Raúl Muñoz Romo; Ignacio Hernández Palencia; Sergio López Bernal; Carmen Martin Curto; José Cabezas Rodríguez; Almudena Castaño Reguillo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: There is a need to develop harmonized procedures and a Minimum Data Set (MDS) for cross-border Multi Casualty Incidents (MCI) in medical emergency scenarios to ensure appropriate management of such incidents, regardless of the place, language and internal processes of the institutions involved. That information should be capable of real-time communication to the command-and-control chain. It is crucial that the models adopted are interoperable between countries so that the rights of patients to cross-border healthcare are fully respected.

    Objective: To optimize management of cross-border Multi Casualty Incidents through a Minimum Data Set collected and communicated in real time to the chain of command and control for each incident, and to determine the degree of agreement among experts.

    Method: We used the modified Delphi method supplemented with the Utstein technique to reach consensus among experts. In the first phase, the minimum requirements of the project, the profile of the experts who were to participate, the basic requirements of each variable chosen and the way of collecting the data were defined by providing bibliography on the subject. In the second phase, the preliminary variables were grouped into 6 clusters, and the objectives, the characteristics of the variables and the logistics of the work were approved. Several meetings were held to reach a consensus on the MDS variables using a modified Delphi technique. Each expert had to score each variable from 1 to 10; non-voted variables were eliminated, and the round of voting ended. In the third phase, the Utstein Style was applied to discuss each group of variables and choose the ones with the highest consensus. After several rounds of discussion, it was agreed to eliminate the variables with a score of less than 5 points. In phase four, the researchers submitted the variables to the external experts for final assessment and validation before their use in the simulations. Data were analysed with SPSS Statistics (IBM, version 2) software.

    Results: Six data entities with 31 sub-entities were defined, generating 127 items representing the final MDS regarded as essential for incident management. The level of consensus for the choice of items was very high and was highest for the category ‘Incident’ with an overall kappa of 0.7401 (95% CI 0.1265–0.5812, p 0.000), a good level of consensus in the Landis and Koch model. The items with the greatest degree of consensus (a score of ten) were those relating to location, type of incident, date, time and identification of the incident. All items met the criteria set, such as digital collection and real-time transmission to the chain of command and control.

    Conclusions: This study documents the development of an MDS through consensus with a high degree of agreement among a group of experts of different nationalities working in different fields. All items in the MDS were digitally collected and forwarded in real time to the chain of command and control. This tool has demonstrated its validity in four large cross-border simulations involving more than eight countries and their emergency services.

  8. VT - Vermont Rational Service Areas

    • hub.arcgis.com
    • geodata.vermont.gov
    • +4 more
    Updated Oct 31, 2016
    Cite
    VT-AHS (2016). VT - Vermont Rational Service Areas [Dataset]. https://hub.arcgis.com/datasets/b28362657f1f4fac8312741ddf601782
    Explore at:
    Dataset updated
    Oct 31, 2016
    Dataset authored and provided by
    VT-AHS
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Description

    Data Layer Name: Vermont Rational Service Areas (RSAs)
    Alternate Name: Vermont RSAs

    Overview: Rational Service Areas (RSAs), originally developed in 2001 and revised in 2011, are generalized catchment areas relating to the delivery of primary health care services. In Vermont, RSA delineations rely primarily on utilization data. The methods used are similar to those used by David Goodman to define primary care service areas based on Medicare data, but include additional sources of utilization data. Using these methods, towns were assigned based on where residents go for their primary care. The process used to delineate Vermont RSAs was iterative. It began by examining utilization patterns based on: (1) the primary care service areas that Goodman had defined for Vermont from Medicare data; (2) Vermont Medicaid assignments of clients to primary care providers; and (3) responses to the “town of residence”/“town of primary care” questions in the Vermont Behavioral Risk Factor survey. Taking into account the limitations of each of these sources of data, VDH statisticians defined preliminary town centers and were able to assign approximately two-thirds of the towns to a town center. For towns with no clear utilization patterns, they examined mileage from these preliminary centers and mileage from towns that had primary care physicians. Contiguity of areas was also examined. A few centers were added and others were deleted. After all towns were assigned to a center and mapped, outliers were identified and reviewed by referring to both mileage maps and utilization patterns. Drive time information was not available. In some cases where the mileage map seemed to indicate one center but the utilization patterns were strongly supportive of another center, utilization was used as a proxy for drive time.

    Preliminary RSAs were presented to the Vermont Primary Care Collaborative, the Vermont Coalition of Clinics for the Uninsured and other community members for their feedback. Department of Health District Directors from the Division of Community Public Health were also consulted. These groups suggested modifications to the areas based on their experience working in the areas in question. As a result of this review a few centers were added, deleted and combined, and several towns were reassigned. The Vermont Primary Care Collaborative reviewed the final version of RSAs. The result of this process is 38 Rational Service Areas. Given the limitations of the information available for this purpose, the delineation approach was deemed reasonable and has resulted in a set of RSAs that have been widely reviewed and accepted. Because of the iterative process, it is recognized that this is not a "pure" methodology in the sense that someone else attempting to replicate this process would probably not produce exactly the same results. RSAs have been reviewed periodically to keep up with changes in demographics and provider practice locations. One revision occurred in 2011; it took towns that had originally been assigned as using out-of-state providers and reassigned them to Vermont RSAs.

    Technical Details: Vermont RSAs were defined using 3 sources of primary care utilization data and mileage maps. Each of the data sources had limitations, and these limitations had to be considered as towns were assigned to an RSA. A description of each of these data sources is provided. Medicare utilization data were obtained from the Primary Care Service Areas developed by David Goodman using 1996 and 1997 Medicare Part B and Outpatient files. Thirty-eight primary care service areas were defined for Vermont. The major limitation of these assignments was that they were based on zip codes rather than town boundaries. Many small towns do not have their own zip code, or the town may be divided into multiple zip codes shared with multiple other towns. As the utilization data were reviewed, consideration was given to whether the zip code in question represented the town, or whether utilization from that town may have been masked by a larger town's utilization patterns. A second consideration was that the Medicare data used 1996 & 1997 utilization. In areas where new practices were established after 1997, the Medicare data could not reflect their utilization.

    Medicaid claims data only included children age 17 and under. The file contained Medicaid clients in 2000 with the town of residence of the client and the town of the primary care provider. The limitation in this file was that although the Medicaid database included a field for the geographic location of the provider separate from the mailing address, after examining the file it was determined that in many cases the mailing address was also being entered into the geographic location. In areas where practices were owned by a larger organization, the utilization patterns could not be determined. For example, in the St. Johnsbury RSA there were practices owned by an out-of-state medical center. Although it is known that there are Medicaid providers in some of the towns in that area, all of the utilization was coded to out of state. Therefore the Medicaid data had to be disregarded in this area. The St. Johnsbury RSA was subsequently defined around three town centers (St. Johnsbury, Lyndon, and Danville) because more precise utilization patterns could not be distinguished.

    The BRFSS data were obtained from the 1998-2000 surveys. Respondents were asked for the town of their primary care provider; the town of residence of the respondent was also collected. These responses represented all Vermonters age 18-64 years old, regardless of type of insurance. The limitation of these data was the small number of respondents in the smaller towns. Mileage information was obtained from the Vermont Medicaid program. This mileage information was derived using GIS mapping software to assess all statewide roads. However, drive-time data could not be determined at that time because there was no distinction between primary and secondary roads. The Medicaid program applied GIS mapping software to assign clients to primary care providers using 15 miles as a proxy for 30-minute drive time. This standard was also used in 2001 when the original RSAs were developed. The VDH Public Health Statistics program periodically updates RSA GIS data (last updated in 2011).

  9. Gross Domestic Product by Kind of Economic Activity at Current Prices -...

    • kapsarc.opendatasoft.com
    • datasource.kapsarc.org
    csv, excel, json
    Updated Feb 15, 2023
    Cite
    (2023). Gross Domestic Product by Kind of Economic Activity at Current Prices - Quarterly [Dataset]. https://kapsarc.opendatasoft.com/explore/dataset/saudi-arabia-gross-domestic-product-by-kind-of-economic-activity-at-current-pric/table/?sort=time_period&flg=en&disjunctive.economic_activity=
    Explore at:
    excel, csv, json
    Dataset updated
    Feb 15, 2023
    Description

    This dataset contains the distribution of Saudi Arabia's Gross Domestic Product by kind of economic activity at current prices. Data are from the General Authority for Statistics. Follow datasource.kapsarc.org for timely data to advance energy economics research.

    Preliminary data: 2021.

    Indicator definition: The total of final goods and services produced during a certain period of time (usually a year) within the geographical borders of the country, whether produced by citizens of the country or by foreigners residing in it.

  10. Replication Data for: Constructing Meaning: Historical Changes in MIHI EST...

    • dataverse.no
    • dataverse.azure.uit.no
    • +1 more
    pdf +2
    Updated Sep 17, 2025
    Cite
    Mihaela Ilioaia; Mihaela Ilioaia (2025). Replication Data for: Constructing Meaning: Historical Changes in MIHI EST and HABEO Constructions in Romanian [Dataset]. http://doi.org/10.18710/JD3EEP
    Explore at:
    pdf(85556), pdf(157449), pdf(81503), pdf(95951), pdf(85021), pdf(66525), pdf(82993), pdf(66480), pdf(83552), txt(10971), pdf(82631), pdf(87467), pdf(87314), pdf(79911), pdf(82604), pdf(45770), pdf(57270), pdf(62741), text/comma-separated-values(1449206)
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    DataverseNO
    Authors
    Mihaela Ilioaia; Mihaela Ilioaia
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    1500 - 2016
    Area covered
    Romania
    Dataset funded by
    Research Foundation-Flanders (FWO Fonds Wetenschappelijk Onderzoek)
    Description

    In the present study, the evolution of two Latin patterns, MIHI EST and HABEO, is investigated for Romanian, paying special attention to the set of nouns with which they occur. In the literature, the MIHI EST pattern is mostly associated with nouns from the field of psychological and physiological states in Romanian, while for HABEO, no semantic categories have been proposed (cf. Ilioaia 2020, 2021; Ilioaia and Van Peteghem 2021; Vangaever and Ilioaia 2021). Bearing this in mind, the question arises as to how the two constructions interact with the set of nouns with which they occur throughout the six centuries documented for Romanian (16th–21st c.). In order to understand the dynamics of the set of nouns that occur with the MIHI EST and HABEO constructions, I carried out a corpus study based on texts from pre-21st-century Romanian and from the present-day language. For pre-21st-century Romanian, I worked with a corpus that I compiled myself, which is accessible on demand for research purposes on the Sketch Engine platform. This corpus, labelled Pre-21st century Romanian, contains nearly six million words and comprises several types of texts: administrative, religious, and literary. As for the present-day language, I worked with the Romanian Web 2016 (roTenTen16) corpus, which contains over two billion words from types of texts found on the web, namely news/commercial/specialized websites, blogs, and forums, and which was compiled and made available on Sketch Engine. In addition to the .csv file with the data, I include in this dataset a codebook for the respective data file, as well as files containing the tables and graphs used in the related publication.

    Article abstract: In this article, I address the evolution of the competition between two Latin patterns, HABEO and MIHI EST, in Romanian. As opposed to the other Romance languages, which replace the MIHI EST pattern with HABEO in possessor and experiencer contexts, Romanian maintains both Latin patterns. The general evolution of these patterns in the Romance languages is well known; however, a detailed usage-based account is currently lacking. Building on the theoretical findings on the role of functional competition in linguistic change, the rivalry between the two patterns in Romanian has already been argued to have settled in terms of differentiation, with each of the two forms specializing in different functional domains, by Vangaever and Ilioaia in their 2021 study “Specialisation through competition: HABEO vs. MIHI EST from Latin to Romanian”. With this idea as a starting point, I investigate, by means of a diachronic corpus study, whether the dynamics in the inventory of state nouns occurring in these constructions can affect their evolution and productivity. The preliminary results show that this is indeed the case. Concomitantly, I explore whether the historical changes that the two patterns have undergone over the centuries can be described in terms of grammaticalization, constructionalization, or constructional change.

  11.

    LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    docx (available download formats)
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation here. After pre-processing, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary was created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, and instructions for using the code, are available in [2]. The code can also be used for lists of texts from other sources, although amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains a title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:

    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the Corpus into R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

    Step 3. Extracting Abstracts and Saving Metadata: Metadata, comprising all fields in a document except the abstract, are separated from the abstracts and saved as MetaData.R. The metadata fields are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approach to pre-processing the abstracts of the LSC.

    1. Removing punctuation and special characters: All non-alphanumeric characters are substituted by a space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose their actual meaning. A process of uniting prefixes with words is performed in later pre-processing steps.
    2. Lowercasing the text data: Lowercasing is performed to avoid treating words like “Corpus”, “corpus” and “CORPUS” differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united into a single word. The prefixes united for this research are listed in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added the commonly used prefixes ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
    4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional substitution process so that their meaning is not lost when the character “-” is removed. Examples of such words are “z-test”, “well-known” and “chi-square”, which are substituted by “ztest”, “wellknown” and “chisquare”. Such words were identified by sampling abstracts from the LSC. The full list of these words and the substitution decisions taken are presented in the file “list_of_substitution.csv”.
    5. Removing the character “-”: All remaining occurrences of the character “-” are replaced by a space.
    6. Removing numbers: All digits that are not part of a word are replaced by a space. Words that contain both digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis; examples are “co2”, “h2o” and “21st”.
    7. Stemming: Stemming converts inflected words into their word stem. This unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: Stop words are words that are extremely common but provide little value in a language, such as ‘I’, ‘the’ and ‘a’ in English. We used the ‘tm’ package in R to remove stop words [6]; the package lists 174 English stop words.

    Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

    The Organisation of the LScD

    The total number of words in the file “LScD.csv” is 974,238. Each field is described below:

    Word: Unique words from the corpus, in lowercase and stemmed form. The field is sorted by the number of documents containing each word, in descending order.
    Number of Documents Containing the Word: A binary count is used: if a word exists in an abstract, it counts as 1; if the word occurs more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
    Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is treated as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. The outputs are:

    Metadata File: All fields in a document excluding the abstract: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of Abstracts: All abstracts after the pre-processing steps defined in Step 4.
    DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times a word occurs in the corresponding document.
    LScD: An ordered list of words from the LSC as defined in the previous section.

    To use the code:

    1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: replace them with the full path of the directory with the source files and the full path of the directory for the output files.
    4. Run the full code.

    References

    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536–4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
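    The Step 4 pre-processing and the two dictionary counts can be sketched as follows. The released pipeline is written in R; this is a minimal Python illustration, with toy prefix, substitution, and stop-word lists standing in for “list_of_prefixes.csv”, “list_of_substitution.csv” and the ‘tm’ stop-word set, and a crude suffix-stripping rule standing in for a real stemmer:

```python
import re
from collections import Counter

# Toy stand-ins for the dataset's prefix/substitution/stop-word lists.
PREFIXES = {"non", "pre", "self", "ultra"}
SUBSTITUTIONS = {"z-test": "ztest", "well-known": "wellknown", "chi-square": "chisquare"}
STOP_WORDS = {"i", "the", "a", "an", "of", "and", "in", "to", "is"}

def preprocess(text: str) -> list[str]:
    """Apply the Step 4 pipeline: keep '-', lowercase, unite prefixes,
    substitute listed words, drop remaining '-', drop bare numbers,
    crude stemming, stop-word removal."""
    text = re.sub(r"[^\w\s-]", " ", text)                 # 1. non-alphanumerics -> space, keep '-'
    text = text.lower()                                    # 2. lowercase
    text = re.sub(r"\b(" + "|".join(sorted(PREFIXES)) + r")-(\w+)", r"\1\2", text)  # 3. unite prefixes
    for old, new in SUBSTITUTIONS.items():                 # 4. listed substitutions
        text = text.replace(old, new)
    text = text.replace("-", " ")                          # 5. remaining '-' -> space
    tokens = [t for t in text.split() if not t.isdigit()]  # 6. drop bare numbers, keep "co2", "21st"
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # 7. toy stemmer
    return [t for t in tokens if t and t not in STOP_WORDS]   # 8. stop-word removal

def build_dictionary(abstracts: list[str]) -> dict:
    """Return {word: (n_docs_containing_word, n_total_appearances)},
    sorted by document count in descending order, as in LScD.csv."""
    doc_freq, term_freq = Counter(), Counter()
    for abstract in abstracts:
        tokens = preprocess(abstract)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # binary per-document count
    return {w: (doc_freq[w], term_freq[w])
            for w in sorted(doc_freq, key=doc_freq.get, reverse=True)}
```

    The binary per-document count mirrors the “Number of Documents Containing the Word” field: a word occurring five times in one abstract still contributes 1 to that field but 5 to the appearance count.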

  12.

    Data from: VSRR Provisional Drug Overdose Death Counts

    • data.amerigeoss.org
    • datalumos.org
    • +8more
    csv, json, rdf, xsl
    Updated Jul 30, 2019
    + more versions
    Cite
    United States (2019). VSRR Provisional Drug Overdose Death Counts [Dataset]. https://data.amerigeoss.org/pl/dataset/vsrr-provisional-drug-overdose-death-counts-54e35
    Explore at:
    csv, rdf, json, xsl (available download formats)
    Dataset updated
    Jul 30, 2019
    Dataset provided by
    United States
    Description

    These data contain provisional counts of drug overdose deaths based on a current flow of mortality data in the National Vital Statistics System. Counts for the most recent final annual data are provided for comparison. National provisional counts include deaths occurring within the 50 states and the District of Columbia as of the date specified, and may not include all deaths that occurred during a given time period. Provisional counts are often incomplete, and causes of death may be pending investigation (see Technical notes), resulting in an underestimate relative to final counts. To address this, methods were developed to adjust provisional counts for reporting delays by generating a set of predicted provisional counts (see Technical notes). Starting in June 2018, this monthly data release will include both reported and predicted provisional counts.

    The provisional data include: (a) the reported and predicted provisional counts of deaths due to drug overdose occurring nationally and in each jurisdiction; (b) the percentage changes in provisional drug overdose deaths for the current 12 month-ending period compared with the 12-month period ending in the same month of the previous year, by jurisdiction; and (c) the reported and predicted provisional counts of drug overdose deaths involving specific drugs or drug classes occurring nationally and in selected jurisdictions. The reported and predicted provisional counts represent the numbers of deaths due to drug overdose occurring in the 12-month periods ending in the month indicated. These counts include all seasons of the year and are insensitive to variations by seasonality. Deaths are reported by the jurisdiction in which the death occurred.

    Several data quality metrics, including the percent completeness in overall death reporting, percentage of deaths with cause of death pending further investigation, and the percentage of drug overdose deaths with specific drugs or drug classes reported are included to aid in interpretation of provisional data as these measures are related to the accuracy of provisional counts (see Technical notes). Reporting of the specific drugs and drug classes involved in drug overdose deaths varies by jurisdiction, and comparisons of death rates involving specific drugs across selected jurisdictions should not be made (see Technical notes). Provisional data will be updated on a monthly basis as additional records are received.

    Technical notes

    Nature and sources of data

    Provisional drug overdose death counts are based on death records received and processed by the National Center for Health Statistics (NCHS) as of a specified cutoff date. The cutoff date is generally the first Sunday of each month. National provisional estimates include deaths occurring within the 50 states and the District of Columbia. NCHS receives the death records from state vital registration offices through the Vital Statistics Cooperative Program (VSCP).

    The timeliness of provisional mortality surveillance data in the National Vital Statistics System (NVSS) database varies by cause of death. The lag time (i.e., the time between when the death occurred and when the data are available for analysis) is longer for drug overdose deaths compared with other causes of death (1). Thus, provisional estimates of drug overdose deaths are reported 6 months after the date of death.

    Provisional death counts presented in this data visualization are for “12-month ending periods,” defined as the number of deaths occurring in the 12-month period ending in the month indicated. For example, the 12-month ending period in June 2017 would include deaths occurring from July 1, 2016, through June 30, 2017. The 12-month ending period counts include all seasons of the year and are insensitive to reporting variations by seasonality. Counts for the 12-month period ending in the same month of the previous year are shown for comparison. These provisional counts of drug overdose deaths and related data quality metrics are provided for public health surveillance and monitoring of emerging trends. Provisional drug overdose death data are often incomplete, and the degree of completeness varies by jurisdiction and 12-month ending period. Consequently, the numbers of drug overdose deaths are underestimated based on provisional data relative to final data and are subject to random variation. Methods to adjust provisional counts have been developed to provide predicted provisional counts of drug overdose deaths, accounting for delayed reporting (see Percentage of records pending investigation and Adjustments for delayed reporting).
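    The “12-month ending period” arithmetic, and the year-over-year comparison described earlier, can be illustrated with a short sketch (the monthly counts here are invented, not NVSS data):

```python
# Hypothetical monthly death counts keyed by (year, month).
monthly = {(2016, m): 100 + m for m in range(1, 13)}
monthly.update({(2017, m): 120 + m for m in range(1, 13)})

def twelve_month_ending(counts: dict, year: int, month: int) -> int:
    """Sum deaths over the 12-month period ending in (year, month);
    e.g. ending June 2017 covers July 1, 2016 through June 30, 2017."""
    total, y, m = 0, year, month
    for _ in range(12):
        total += counts[(y, m)]
        m -= 1
        if m == 0:
            y, m = y - 1, 12
    return total

def pct_change(counts: dict, year: int, month: int) -> float:
    """Percent change vs. the 12-month period ending in the same month last year."""
    cur = twelve_month_ending(counts, year, month)
    prev = twelve_month_ending(counts, year - 1, month)
    return 100 * (cur - prev) / prev
```

    Because every 12-month-ending window spans all seasons, consecutive windows differ only by dropping one month and adding another, which is why these counts are insensitive to seasonal reporting variation.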

    Provisional data are based on available records that meet certain data quality criteria at the time of analysis and may not include all deaths that occurred during a given time period. Therefore, they should not be considered comparable with final data and are subject to change.

    Cause-of-death classification and definition of drug deaths
    Mortality statistics are compiled in accordance with World Health Organization (WHO) regulations specifying that WHO member nations classify and code causes of death with the current revision of the International Statistical Classification of Diseases and Related Health Problems (ICD). ICD provides the basic guidance used in virtually all countries to code and classify causes of death. It provides not only disease, injury, and poisoning categories but also the rules used to select the single underlying cause of death for tabulation from the several diagnoses that may be reported on a single death certificate, as well as definitions, tabulation lists, the format of the death certificate, and regulations on use of the classification. Causes of death for data presented in this report were coded according to ICD guidelines described in annual issues of Part 2a of the NCHS Instruction Manual (2).

    Drug overdose deaths are identified using underlying cause-of-death codes from the Tenth Revision of ICD (ICD–10): X40–X44 (unintentional), X60–X64 (suicide), X85 (homicide), and Y10–Y14 (undetermined). Drug overdose deaths involving selected drug categories are identified by specific multiple cause-of-death codes. Drug categories presented include: heroin (T40.1); natural opioid analgesics, including morphine and codeine, and semisynthetic opioids, including drugs such as oxycodone, hydrocodone, hydromorphone, and oxymorphone (T40.2); methadone, a synthetic opioid (T40.3); synthetic opioid analgesics other than methadone, including drugs such as fentanyl and tramadol (T40.4); cocaine (T40.5); and psychostimulants with abuse potential, which includes methamphetamine (T43.6). Opioid overdose deaths are identified by the presence of any of the following MCOD codes: opium (T40.0); heroin (T40.1); natural opioid analgesics (T40.2); methadone (T40.3); synthetic opioid analgesics other than methadone (T40.4); or other and unspecified narcotics (T40.6). This latter category includes drug overdose deaths where ‘opioid’ is reported without more specific information to assign a more specific ICD–10 code (T40.0–T40.4) (3,4). Among deaths with an underlying cause of drug overdose, the percentage with at least one drug or drug class specified is defined as that with at least one ICD–10 multiple cause-of-death code in the range T36–T50.8.

    Drug overdose deaths may involve multiple drugs; therefore, a single death might be included in more than one category when describing the number of drug overdose deaths involving specific drugs. For example, a death that involved both heroin and fentanyl would be included in both the number of drug overdose deaths involving heroin and the number of drug overdose deaths involving synthetic opioids other than methadone.
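    The classification rules above can be sketched as follows; the record structure is illustrative, not the NVSS file layout, but the ICD-10 code ranges are those listed in the text:

```python
# Underlying cause-of-death codes identifying a drug overdose death.
OVERDOSE_UNDERLYING = (
    {f"X{c}" for c in range(40, 45)}   # X40-X44 unintentional
    | {f"X{c}" for c in range(60, 65)} # X60-X64 suicide
    | {"X85"}                          # homicide
    | {f"Y{c}" for c in range(10, 15)} # Y10-Y14 undetermined
)

# Multiple cause-of-death codes for the drug categories listed above.
DRUG_CATEGORIES = {
    "heroin": {"T40.1"},
    "natural/semisynthetic opioids": {"T40.2"},
    "methadone": {"T40.3"},
    "synthetic opioids other than methadone": {"T40.4"},
    "cocaine": {"T40.5"},
    "psychostimulants": {"T43.6"},
}

def categorize(underlying: str, multiple_causes: set) -> set:
    """Return every drug category involved in a death. A single death may
    fall into several categories, as in the heroin + fentanyl example."""
    if underlying not in OVERDOSE_UNDERLYING:
        return set()
    return {name for name, codes in DRUG_CATEGORIES.items()
            if codes & multiple_causes}
```

    Because `categorize` returns a set rather than a single label, summing category counts over many deaths can exceed the total number of overdose deaths, which is the double-counting caveat the paragraph above describes.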

    Selection of specific states and other jurisdictions to report
    Provisional counts are presented by the jurisdiction in which the death occurred (i.e., the reporting jurisdiction). Data quality and timeliness for drug overdose deaths vary by reporting jurisdiction. Provisional counts are presented for reporting jurisdictions based on measures of data quality: the percentage of records where the manner of death is listed as “pending investigation,” the overall completeness of the data, and the percentage of drug overdose death records with specific drugs or drug classes recorded. These criteria are defined below.

    Percentage of records pending investigation

    Drug overdose deaths often require lengthy investigations, and death certificates may be initially filed with a manner of death “pending investigation” and/or with a preliminary or unknown cause of death. When the percentage of records reported as “pending investigation” is high for a given jurisdiction, the number of drug overdose deaths is likely to be underestimated. For jurisdictions reporting fewer than 1% of records as “pending investigation”, the provisional number of drug overdose deaths occurring in the fourth quarter of 2015 was approximately 5% lower than the final count of drug overdose deaths occurring in that same time period. For jurisdictions reporting greater than 1% of records as “pending investigation” the provisional counts of drug overdose deaths may underestimate the final count of drug overdose deaths by as much as 30%. Thus, jurisdictions are included in Table 2 if 1% or fewer of their records in NVSS are reported as “pending investigation,” following a 6-month lag for the 12-month ending periods included in the dashboard. Values for records pending investigation are updated with each monthly release and reflect the most current data available.
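    The Table 2 inclusion rule amounts to a simple threshold filter; the jurisdiction shares below are invented for illustration:

```python
# Share of records listed as "pending investigation" per jurisdiction
# (hypothetical values). A jurisdiction is reported only if at most 1%
# of its records are pending, per the criterion above.
jurisdictions = {"A": 0.004, "B": 0.032, "C": 0.010}

included = [name for name, pending in sorted(jurisdictions.items())
            if pending <= 0.01]
```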

    Percent completeness

    NCHS receives monthly counts of the estimated number of deaths from each jurisdiction's vital registration office (referred to as “control counts”). This number represents the best estimate of how many

  13.

    KnowCoder-Schema-Understanding-Data

    • huggingface.co
    Updated Mar 21, 2024
    + more versions
    Cite
    ICT-Golaxy (2024). KnowCoder-Schema-Understanding-Data [Dataset]. https://huggingface.co/datasets/golaxy/KnowCoder-Schema-Understanding-Data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 21, 2024
    Dataset authored and provided by
    ICT-Golaxy
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction

    📃 Paper | 🤗 Resource (Schema • Data • Model) | 🚀 Try KnowCoder (coming soon)!

      Schema Understanding Data
    

    The schema understanding data includes schema definition codes and schema instance codes.

      Schema Definition Codes
    

    The schema definition codes are built based on a schema library, with statistical results as follows.

    Due to data protection concerns, here… See the full description on the dataset page: https://huggingface.co/datasets/golaxy/KnowCoder-Schema-Understanding-Data.

  14.

    Data from: Speciation without Pre-Defined Fitness Functions

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Sep 15, 2015
    Cite
    Cristescu, Melania E.; Hendry, Andrew P.; Gras, Robin; Golestani, Abbas (2015). Speciation without Pre-Defined Fitness Functions [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001907714
    Explore at:
    Dataset updated
    Sep 15, 2015
    Authors
    Cristescu, Melania E.; Hendry, Andrew P.; Gras, Robin; Golestani, Abbas
    Description

    The forces promoting and constraining speciation are often studied in theoretical models because the process is hard to observe, replicate, and manipulate in real organisms. Most models analyzed to date include pre-defined functions influencing fitness, leaving open the question of how speciation might proceed without these built-in determinants. To consider the process of speciation without pre-defined functions, we employ the individual-based ecosystem simulation platform EcoSim. The environment is initially uniform across space, and an evolving behavioural model then determines how prey consume resources and how predators consume prey. Simulations including natural selection (i.e., an evolving behavioural model that influences survival and reproduction) frequently led to strong and distinct phenotypic/genotypic clusters between which hybridization was low. This speciation was the result of divergence between spatially-localized clusters in the behavioural model, an emergent property of evolving ecological interactions. By contrast, simulations without natural selection (i.e., with the behavioural model turned off) but with spatial isolation (i.e., limited dispersal) produced weaker and overlapping clusters. Simulations without natural selection or spatial isolation (i.e., with the behavioural model turned off and high dispersal) did not generate clusters. These results confirm the role of natural selection in speciation by showing its importance even in the absence of pre-defined fitness functions.

  15.

    VSRR Provisional County-Level Drug Overdose Death Counts

    • data.cdc.gov
    • data.virginia.gov
    • +5more
    csv, xlsx, xml
    Updated Jul 16, 2025
    + more versions
    Cite
    NCHS/DVS (2025). VSRR Provisional County-Level Drug Overdose Death Counts [Dataset]. https://data.cdc.gov/w/gb4e-yj24/tdwk-ruhb?cur=YuhayouiVq4
    Explore at:
    csv, xml, xlsx (available download formats)
    Dataset updated
    Jul 16, 2025
    Dataset authored and provided by
    NCHS/DVS
    License

    https://www.usa.gov/government-workshttps://www.usa.gov/government-works

    Description

    This data visualization presents county-level provisional counts of drug overdose deaths based on a current flow of mortality data in the National Vital Statistics System. County-level provisional counts include deaths occurring within the 50 states and the District of Columbia as of the date specified, and may not include all deaths that occurred during a given time period. Provisional counts are often incomplete, and causes of death may be pending investigation, resulting in an underestimate relative to final counts (see Technical Notes).

    The provisional data presented on the dashboard below include reported 12 month-ending provisional counts of death due to drug overdose by the decedent’s county of residence and the month in which death occurred.

    Percentages of deaths with a cause of death pending further investigation and a note on historical completeness (e.g. if the percent completeness was under 90% after 6 months) are included to aid in interpretation of provisional data as these measures are related to the accuracy of provisional counts (see Technical Notes). Counts between 1-9 are suppressed in accordance with NCHS confidentiality standards. Provisional data presented on this page will be updated on a quarterly basis as additional records are received.

    Technical Notes

    Nature and Sources of Data

    Provisional drug overdose death counts are based on death records received and processed by the National Center for Health Statistics (NCHS) as of a specified cutoff date. The cutoff date is generally the first Sunday of each month. National provisional estimates include deaths occurring within the 50 states and the District of Columbia. NCHS receives the death records from the state vital registration offices through the Vital Statistics Cooperative Program (VSCP).

    The timeliness of provisional mortality surveillance data in the National Vital Statistics System (NVSS) database varies by cause of death and jurisdiction in which the death occurred. The lag time (i.e., the time between when the death occurred and when the data are available for analysis) is longer for drug overdose deaths compared with other causes of death due to the time often needed to investigate these deaths (1). Thus, provisional estimates of drug overdose deaths are reported 6 months after the date of death.

    Provisional death counts presented in this data visualization are for “12 month-ending periods,” defined as the number of deaths occurring in the 12 month period ending in the month indicated. For example, the 12 month-ending period in June 2020 would include deaths occurring from July 1, 2019 through June 30, 2020. The 12 month-ending period counts include all seasons of the year and are insensitive to reporting variations by seasonality. These provisional counts of drug overdose deaths and related data quality metrics are provided for public health surveillance and monitoring of emerging trends. Provisional drug overdose death data are often incomplete, and the degree of completeness varies by jurisdiction and 12 month-ending period. Consequently, the numbers of drug overdose deaths are underestimated based on provisional data relative to final data and are subject to random variation.

    Cause of Death Classification and Definition of Drug Deaths

    Mortality statistics are compiled in accordance with the World Health Organization's (WHO) regulations specifying that WHO member nations classify and code causes of death with the current revision of the International Statistical Classification of Diseases and Related Health Problems (ICD). ICD provides the basic guidance used in virtually all countries to code and classify causes of death. It provides not only disease, injury, and poisoning categories but also the rules used to select the single underlying cause of death for tabulation from the several diagnoses that may be reported on a single death certificate, as well as definitions, tabulation lists, the format of the death certificate, and regulations on use of the classification. Causes of death for data presented in this report were coded according to ICD guidelines described in annual issues of Part 2a of the NCHS Instruction Manual (2). Drug overdose deaths are identified using underlying cause-of-death codes from the Tenth Revision of ICD (ICD–10): X40–X44 (unintentional), X60–X64 (suicide), X85 (homicide), and Y10–Y14 (undetermined).

    Selection of Specific Jurisdictions to Report

    Provisional counts are presented by the jurisdiction in which the decedent resided (i.e., the decedent's county of residence). Data quality and timeliness for drug overdose deaths vary by reporting jurisdiction. Provisional counts are presented along with measures of data quality: the percentage of records where the manner of death is listed as “pending investigation”, and a note for specific jurisdictions with historically lower levels of data completeness (where provisional 2019 data were less than 90% complete after 6 months).

    Percentage of Records Pending Investigation

    Drug overdose deaths often require lengthy investigations, and death certificates may be initially filed with a manner of death “pending investigation” and/or with a preliminary or unknown cause of death. When the percentage of records reported as “pending investigation” is high for a given jurisdiction, the number of drug overdose deaths is likely to be underestimated. Counts of drug overdose deaths may be underestimated to a greater extent in jurisdictions or counties where more records in NVSS are reported as “pending investigation” for the six most recent 12 month-ending periods.

    Historical Completeness

    The historical percent completeness of provisional data is obtained by dividing the number of death records in the NVSS database for each jurisdiction and county after a 6-month lag for deaths occurring in 2019 by the number of deaths eventually included in the final data files. Counties with historically lower levels of provisional data completeness are flagged with a note to indicate that the data may be incomplete in these areas. However, the completeness of provisional data may change over time, and therefore the degree of underestimation will not be known until data are finalized (typically 11-12 months after the end of the data year).
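    The completeness metric and the 90% flag described above reduce to a single division per county; the counts below are invented for illustration:

```python
# (records in NVSS after the 6-month lag, records in the final file)
# for deaths occurring in 2019 -- hypothetical counties and counts.
counties = {
    "County X": (950, 1000),
    "County Y": (820, 1000),
}

def completeness_flags(data: dict, threshold: float = 0.90) -> dict:
    """Flag (True) counties whose provisional data were under the
    completeness threshold, mirroring the historical-completeness note."""
    return {name: provisional / final < threshold
            for name, (provisional, final) in data.items()}
```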

    Differences between Final and Provisional Data

    There may be differences between provisional and final data for a given data year (e.g., 2020). Final drug overdose death data published annually through NCHS statistical reports (3) and CDC WONDER undergo additional data quality checks and processing. Provisional counts reported here are subject to change as additional data are received.

    Source

    NCHS, National Vital Statistics System. Estimates for 2020 and 2021 are based on provisional data. Estimates for 2019 are based on final data (available from: https://www.cdc.gov/nchs/nvss/mortality_public_use_data.htm).

    References

    1. Spencer MR, Ahmad F. Timeliness of death certificate data for mortality surveillance and provisional estimates. National Center for Health Statistics. 2016. Available from: https://www.cdc.gov/nchs/data/vsrr/report001.pdf
    2. National Vital Statistics System. Instructions for classifying the underlying cause of death. In: NCHS instruction manual; Part 2a. Published annually.
    3. Hedegaard H, Miniño AM, Warner M. Drug overdose deaths in the United States, 1999–2018. NCHS Data Brief, no 356. Hyattsville, MD: National Center for Health Statistics. 2020. Available from: https://www.cdc.gov/nchs/products/databriefs/db356.htm

    Suggested Citation

    Ahmad FB, Anderson RN, Cisewski JA, Rossen LM, Warner M, Sutton P. County-level provisional drug overdose death counts. National Center for Health Statistics. 2021.

Designed by MirLogic Solutions Corp for the National Center for Health Statistics.

  16. GLO Preliminary Assessment Extent

    • data.gov.au
    • researchdata.edu.au
    Updated Nov 20, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bioregional Assessment Program (2019). GLO Preliminary Assessment Extent [Dataset]. https://data.gov.au/data/dataset/activity/c0dba74a-feee-440f-a53a-ea31be8479a5
    Explore at:
    Dataset updated
    Nov 20, 2019
    Dataset provided by
    Bioregional Assessment Program
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

The Preliminary Assessment Extent (PAE) is a spatial layer that defines the land surface area contained within a bioregion (and possibly outside it) over which coal resource development may have a potential impact on water-dependent assets and receptors associated with those assets (Barrett et al., 2013). The Gloucester subregion is defined by the geological Gloucester Basin (Roberts et al., 1991). As this is a geological mapping unit, there has been no consideration beyond the subregion boundary of groundwater and/or surface water connection. From a groundwater perspective it is a closed system, with groundwater discharging to lower portions of the landscape and being evaporated through riparian vegetation (Parsons Brinckerhoff, 2012a, pp. 30-31). Hence, as there is no groundwater connection to assets beyond the boundary of the Gloucester subregion, no further consideration of groundwater connectivity is required. The PAE of the Gloucester subregion comprises the union of the Gloucester subregion boundary and, to account for changes in surface water flows, 1 km buffer zones on either side of the major rivers flowing from the northern component (to the outlet of the Gloucester River) and from the southern component to Port Stephens. It is assumed that water from rivers will be pumped a maximum of 1 km, hence the 1 km buffer around streams. The relative contributions to flow from various streams were examined using climate data and the Budyko framework to determine how far outside the bioregion the surface water might be significantly impacted by changes to the water balance within the bioregion.

    Barrett DJ, Couch CA, Metcalfe DJ, Lytton L, Adhikary DP and Schmidt RK (2013) Methodology for bioregional assessments of the impacts of coal seam gas and coal mining development on water resources. A report prepared for the Independent Expert Scientific Committee on Coal Seam Gas and Large Coal Mining Development through the Department of the Environment. Department of the Environment, Australia. Viewed 2 October 2014, http://iesc.environment.gov.au/publications/methodology-bioregional-assessments-impacts-coal-seam-gas-and-coal-mining-development-water.

    Parsons Brinckerhoff (2012a) Phase 2 groundwater investigations - Stage 1 Gas Field Development Area: Gloucester Gas Project. Technical report by Parsons Brinckerhoff Australia Pty Limited for AGL Upstream Investments Pty Ltd. Parsons Brinckerhoff Australia Pty Limited, Sydney. Viewed 12 Aug 2013, http://agk.com.au/gloucester/assets/pdf/PB%20Gloucester%20Groundwater%20Report%20Phase%202%20Text.pdf

    Roberts J, Engel B and Chapman J (1991) Geology of the Camberwell, Dungog, and Bulahdelah 1:100 000 Geological Sheets 9133, 9233, 9333. New South Wales Geological Survey, Sydney.

    Purpose

    The role of the PAE is to optimise research agency effort by focussing on those locations where a material causal link may occur between coal resource development and impacts on water dependent assets. The lists of assets collated by the Program are filtered for "proximity" such that only those assets that intersect with the PAE are considered further in the assessment process. Changes to the PAE such as through the identification of a different development pathway or an improved hydrological understanding may require the proximity of assets to be considered again. Should the assessment process identify a material connection between a water dependent asset outside the PAE and coal resource development impacts, the PAE would need to be amended.

    Dataset History

The Gloucester subregion is defined by the geological Gloucester Basin (Roberts et al., 1991). As this is a geological mapping unit, there has been no consideration beyond the subregion boundary of groundwater and/or surface water connection. The PAE of the Gloucester subregion comprises the union of the Gloucester subregion boundary and, to account for changes in surface water flows, 1 km buffer zones on either side of the major rivers flowing from the northern component (to the outlet of the Gloucester River) and from the southern component to Port Stephens. The Australian Hydrological Geospatial Fabric (Geofabric), developed by the Bureau of Meteorology (2012), was used to define a set of catchments that flow into and out of the northern and southern components of the Gloucester subregion. This process identified 13 subcatchments. Seven subcatchments define the north-flowing rivers that comprise the Manning river basin, five subcatchments constitute the south-flowing rivers that make up the Karuah river basin, and the remainder is the Wallamba River catchment. As the Wallamba River catchment (in which the town of Forster is located) is not hydrologically connected to surface water flowing from the Gloucester subregion, it is not considered further.
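The 1 km stream-buffer rule can be illustrated with a minimal proximity check in plain Python (the actual PAE was built with GIS buffer and union operations; the coordinates below, in metres, are hypothetical):

```python
import math

def dist_point_segment(p, a, b):
    """Distance from point p to line segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection of p onto the segment to [0, 1]
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def within_buffer(asset, stream, buffer_m=1000.0):
    """True if the asset lies within buffer_m of any stream segment."""
    return any(dist_point_segment(asset, stream[i], stream[i + 1]) <= buffer_m
               for i in range(len(stream) - 1))

# Hypothetical stream polyline and asset locations (projected metres)
stream = [(0, 0), (2000, 0), (4000, 1000)]
print(within_buffer((1000, 800), stream))   # 800 m from the stream -> True
print(within_buffer((1000, 1500), stream))  # >1 km away -> False
```

An asset passing this check would be retained for further assessment; one failing it would be filtered out on proximity grounds.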

    Two versions form the data set:

Gloucester_PAE_20130719.shp - the originally defined area, with a 'doughnut' (a missing area completely contained within the bounding polygon) due to the geological basin boundary.

Gloucester_PAE_for_map_aesthetics.shp - as above, with the doughnut removed.

    Dataset Citation

    Bioregional Assessment Programme (2013) GLO Preliminary Assessment Extent. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/c0dba74a-feee-440f-a53a-ea31be8479a5.

    Dataset Ancestors

  17. Agricultural Census, 2010 - Poland

    • microdata.fao.org
    Updated Jan 20, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Central Statistical Office (CSO) (2021). Agricultural Census, 2010 - Poland [Dataset]. https://microdata.fao.org/index.php/catalog/1706
    Explore at:
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    Central Statistical Office (CSO)
    Time period covered
    2010
    Area covered
    Poland
    Description

    Abstract

The agricultural census and the survey on agricultural production methods were conducted jointly, i.e. within the same organisational structure, at the same time, and using a single electronic questionnaire and the same methods of data collection and processing. The agricultural census covered about 1.8 million agricultural holdings. At all farms participating in the census, respondents were asked about the "other gainful activities carried out by the labour force" (OGA). The frame for the full survey was prepared on the basis of the list of holdings prepared for the census. When creating the list, an object-oriented approach was adopted for the first time, which meant that at the first stage the holdings (objects) were identified, their coordinates defined (they were located spatially) and their holders identified on the basis of data from administrative sources. For domestic purposes, the farms with the smallest area, as well as those of little economic importance (meeting very low national thresholds), were included in the sample survey carried out jointly with the census. The survey on agricultural production methods was conducted on a sample of approximately 200 thousand farms, in line with the precision requirements set out in Regulation (EC) 1166/2008. The frame prepared for the agricultural census was used as the sampling frame.

    Geographic coverage

    National coverage

    Analysis unit

    Households

    Universe

    The statistical unit was the agricultural holding, defined as "an agricultural area, including forest land, buildings or their parts, equipment and stock if they constitute or may constitute an organized economic unit as well as rights related to running the farm". Two types of holding were distinguished (i) the natural persons' holdings (to which thresholds were applied) and (ii) legal persons holdings (no threshold applied).

    Kind of data

    Census/enumeration data [cen]

    Sampling procedure

(a) Frame. The frame for the agricultural census and the survey on agricultural production methods was based on the list of agricultural holdings. In creating the list of farms for the 2010 AC and SAPM, an object-oriented approach was used for the first time: at the first stage, agricultural holdings were identified, their coordinates defined (farms were located spatially), and their holders determined from administrative data, as described below. List creation started with the identification of all land parcels used for agricultural purposes. Land parcels found in the records of the Agency for Restructuring and Modernisation of Agriculture (including the Records of Holdings and the Records of Producers) were combined into holdings and had their holders defined. For the remaining land parcels, holders were defined from the Records of Land and Buildings, after which the data on users were updated from the Real Property Tax Record.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    A single electronic questionnaire was used for data collection, combining information related to both the AC 2010 and the SAPM. The census covered all 16 core items recommended in the WCA 2010.

    Questionnaire:

Section 0. Identifying characters
Section 1. Land use
Section 2. Economic activity
Section 3. Income structure
Section 4. Sown and other area
Section 5. Livestock
Section 6. Tractors, machines and equipment
Section 7. Use of fertilizers
Section 8. Labour force
Section 9. Agricultural production methods

    Cleaning operations

a. DATA PROCESSING AND ARCHIVING

The data captured through the CAPI, CATI and CAWI channels were gathered in the Operational Microdata Base (OMB) built for the AC 2010 and processed there (including control and correction of data, as well as completing the file obtained in the AC with the data obtained from administrative sources, imputed units and estimation for the SAPM). The data, depersonalized and validated in the OMB, were exported to an Analytical Microdata Base (AMB) to conduct analyses, prepare the data set for transmission to Eurostat and develop multidimensional tables for internal and external users.

b. CENSUS DATA QUALITY

Except for a few isolated cases, the CAPI and CATI methods resulted in fully completed questionnaires. The computer applications used enabled controls for completeness and correctness of the data already at the collection stage, also facilitating the use of necessary definitions and clarifications during the questionnaire completion process. A set of detailed questionnaire completion guidelines was developed and delivered during training sessions.

    Data appraisal

The preliminary results of the agricultural census were published in February 2011 (basic data at the national level), and then in July 2011 in the publication entitled "Report on the Results of the 2010 Agricultural Census" (in a broader thematic scope, at NUTS 2 level). The final results of the AC 2010 were disseminated through a sequence of publications covering the main thematic areas of the census. The reference publications were released in paper form, and are available online (www.stat.gov.pl) and on CD-ROMs.

  18. MBC Analysis Boundaries 20160718 v01

    • data.gov.au
    • researchdata.edu.au
    • +1more
    zip
    Updated Apr 13, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bioregional Assessment Program (2022). MBC Analysis Boundaries 20160718 v01 [Dataset]. https://data.gov.au/data/dataset/922bfcc7-d4fb-44ec-bfb1-cd24e4b31a0a
    Explore at:
    zip(343160)Available download formats
    Dataset updated
    Apr 13, 2022
    Dataset authored and provided by
    Bioregional Assessment Program
    License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

This dataset includes the current boundary data required for the bioregional assessment impact analysis for the Maranoa-Balonne-Condamine (MBC) subregion. These data are (1) the current Preliminary Assessment Extent (PAE), (2) the Analysis Extent (AE), (3) the Analysis Domain Extent (ADE) and (4) the Focal Analysis Extent (FAE).

    The PAE is defined and explained in the BA submethodology (1.3 Description of the water-dependent asset register) and, specifically for the MBC subregion in product 1.3 Water-dependent asset register for the MBC subregion. The Analysis Extent (AE) is defined as the geographic area that encompasses all the possible areas that may be reported as part of the impact analysis component of a bioregional assessment, specifically, the subregion boundary, the PAE and the relevant groundwater model domain. The Focal Analysis Extent (FAE) is based on the defined water balance areas which encompass the modelled limits of potential groundwater change due to additional coal resource development. The Analysis Domain Extent (ADE) is defined as the geographic area used for geoprocessing and data preparation purposes that encompasses the Analysis Extent plus additional areas sufficient to ensure all relevant data is included for the impact analysis component of a bioregional assessment. For MBC, the ADE had at least an additional 20 km geographic buffer added to the AE boundary.

    All data are in the Australian Albers coordinate system (EPSG 3577).

    Purpose

    The purpose of this dataset is to provide the boundaries needed for the impact analysis component of the BA.

    Dataset History

This dataset includes the current boundary data required for the bioregional assessment impact analysis for the Maranoa-Balonne-Condamine (MBC) subregion. These data are (1) the current Preliminary Assessment Extent (PAE), (2) the Analysis Extent (AE), (3) the Analysis Domain Extent (ADE) and (4) the Focal Analysis Extent (FAE). All data are in the Australian Albers coordinate system (EPSG 3577).

    The PAE is defined and explained in the BA submethodology (1.3 Description of the water-dependent asset register) and, specifically for the MBC subregion in product 1.3 Water-dependent asset register for the MBC subregion. The Analysis Extent (AE) is defined as the geographic area that encompasses all the possible areas that may be reported as part of the impact analysis component of a bioregional assessment, specifically, the subregion boundary, the PAE and the relevant groundwater model domain. The Focal Analysis Extent (FAE) is based on the defined water balance areas which encompass the modelled limits of potential groundwater change due to additional coal resource development. The Analysis Domain Extent (ADE) is defined as the geographic area used for geoprocessing and data preparation purposes that encompasses the Analysis Extent plus additional areas sufficient to ensure all relevant data is included for the impact analysis component of a bioregional assessment. For MBC, the ADE had at least an additional 20 km geographic buffer added to the AE boundary.

The Analysis Domain Extent (ADE) is defined as the geographic area used for geoprocessing and data preparation purposes that encompasses the Analysis Extent plus additional areas sufficient to ensure all relevant data is included for the impact analysis component of a bioregional assessment. For Gloucester, at least an additional 1.5 km was added to the AE boundary and, in places, further extensions were required to include all of the groundwater model domain.

    Dataset Citation

    Bioregional Assessment Programme (2016) MBC Analysis Boundaries 20160718 v01. Bioregional Assessment Derived Dataset. Viewed 25 October 2017, http://data.bioregionalassessments.gov.au/dataset/922bfcc7-d4fb-44ec-bfb1-cd24e4b31a0a.

    Dataset Ancestors

  19. NAM Analysis Boundaries 20160908 v01

    • researchdata.edu.au
    Updated Dec 10, 2018
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bioregional Assessment Program (2018). NAM Analysis Boundaries 20160908 v01 [Dataset]. https://researchdata.edu.au/nam-analysis-boundaries-20160908-v01/2985400
    Explore at:
    Dataset updated
    Dec 10, 2018
    Dataset provided by
Data.gov (https://data.gov/)
    Authors
    Bioregional Assessment Program
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

This dataset includes the current boundary data required for the bioregional assessment impact analysis for the Namoi (NAM) subregion. These data are (1) the current Preliminary Assessment Extent (PAE), (2) the Analysis Extent (AE) and (3) the Analysis Domain Extent (ADE).

The PAE is defined and explained in the BA submethodology (1.3 Description of the water-dependent asset register) and, specifically for the NAM subregion, in product 1.3 Water-dependent asset register for the NAM subregion. The Analysis Extent (AE) is defined as the geographic area that encompasses all the possible areas that may be reported as part of the impact analysis component of a bioregional assessment, specifically, the subregion boundary and the PAE. The Analysis Domain Extent (ADE) is defined as the geographic area used for geoprocessing and data preparation purposes that encompasses the Analysis Extent plus additional areas sufficient to ensure all relevant data is included for the impact analysis component of a bioregional assessment. For NAM, the ADE had at least an additional 20 km geographic buffer added to the AE boundary.

    All data are in the Australian Albers coordinate system (EPSG 3577).

    Purpose

The purpose of the various boundary polygons is to assist in the efficient spatial analysis of the impact of coal resource development in the Namoi subregion.

    Dataset History

This dataset includes the current boundary data required for the bioregional assessment impact analysis for the Namoi (NAM) subregion. These data are (1) the current Preliminary Assessment Extent (PAE), (2) the Analysis Extent (AE) and (3) the Analysis Domain Extent (ADE).

The PAE is defined and explained in the BA submethodology (1.3 Description of the water-dependent asset register) and, specifically for the NAM subregion, in product 1.3 Water-dependent asset register for the NAM subregion. The Analysis Extent (AE) is defined as the geographic area that encompasses all the possible areas that may be reported as part of the impact analysis component of a bioregional assessment, specifically, the subregion boundary and the PAE. The Analysis Domain Extent (ADE) is defined as the geographic area used for geoprocessing and data preparation purposes that encompasses the Analysis Extent plus additional areas sufficient to ensure all relevant data is included for the impact analysis component of a bioregional assessment. For NAM, the ADE had at least an additional 20 km geographic buffer added to the AE boundary.

    All data are in the Australian Albers coordinate system (EPSG 3577).

    Dataset Citation

    Bioregional Assessment Programme (XXXX) NAM Analysis Boundaries 20160908 v01. Bioregional Assessment Derived Dataset. Viewed 11 December 2018, http://data.bioregionalassessments.gov.au/dataset/b71e38ac-a7cd-4781-a255-0b13548e6a90.

    Dataset Ancestors

  20. WIBR Crime Data (Current)

    • data.milwaukee.gov
    csv
    Updated Dec 2, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Milwaukee Police Department (2025). WIBR Crime Data (Current) [Dataset]. https://data.milwaukee.gov/dataset/wibr
    Explore at:
    csv(117409245)Available download formats
    Dataset updated
    Dec 2, 2025
    Dataset authored and provided by
Milwaukee Police Department (http://city.milwaukee.gov/police)
    License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Update Frequency: Daily

    Current year to date. The data included in this dataset has been reviewed and approved by a Milwaukee Police Department supervisor and the Milwaukee Police Department’s Records Management Division. This approval process can take a few weeks from the reported date of the crime. For preliminary crime data, please visit the Milwaukee Police Department’s Crime Maps and Statistics dashboard at https://city.milwaukee.gov/police/Information-Services/Crime-Maps-and-Statistics.

    Wisconsin Incident Based Report (WIBR) Group A Offenses.

The Crime Data represents incident-level data defined by Wisconsin Incident Based Reporting System (WIBRS) codes. WIBRS reporting is a crime reporting standard and cannot be compared to any previous UCR report. Therefore, the Crime Data may reflect:

    • Information not yet verified by further investigation
    • Preliminary crime classifications that may be changed at a later date based upon further investigation
    • Information that may include mechanical or human error

    Neither the City of Milwaukee nor the Milwaukee Police Department guarantee (either express or implied) the accuracy, completeness, timeliness, or correct sequencing of the Crime Data. The City of Milwaukee and the Milwaukee Police Department shall have no liability for any error or omission, or for the use of, or the results obtained from the use of the Crime Data. In addition, the City of Milwaukee and the Milwaukee Police Department caution against using the Crime Data to make decisions/comparisons regarding the safety of or the amount of crime occurring in a particular area. When reviewing the Crime Data, the site user should consider that:

    • The information represents only police services where a report was made and does not include other calls for police service
    • The information does not reflect or certify "safe" or "unsafe" areas
    • The information will sometimes reflect where the crime was reported versus where the crime occurred

This data is not intended to represent a total number/sum of crimes; rather, each offense field is a flag where 1 = True and 0 = False.
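A short Python sketch of how such flag fields behave (the column names and rows below are hypothetical, not actual WIBR fields):

```python
import csv
import io

# Hypothetical extract: each offense column is a flag (1 = True, 0 = False),
# so one incident row can carry several offense flags.
data = """IncidentNum,Robbery,Burglary,Theft
A1,1,0,1
A2,0,1,0
A3,0,0,1
"""

rows = list(csv.DictReader(io.StringIO(data)))
offenses = ["Robbery", "Burglary", "Theft"]

# Summing a flag column counts incidents where that offense applied.
totals = {off: sum(int(r[off]) for r in rows) for off in offenses}
print(totals)  # {'Robbery': 1, 'Burglary': 1, 'Theft': 2}

# Summing across all flags gives 4, which exceeds the 3 incidents:
# the flags must not be treated as a total number of crimes.
```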

    The use of the Crime Data indicates the site user's unconditional acceptance of all risks associated with the use of the Crime Data.

To download XML and JSON files, click the CSV option below, then click the down arrow next to the Download button in the upper right of its page. The XY fields in the data are in the Wisconsin State Plane South NAD27 projection (WKID 32054).


Data from: Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry


Methods

eLAB Development and Source Code (R statistical software)

eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
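Although eLAB itself is written in R, the MRN-to-record_id conversion step can be sketched in Python; the crosswalk values and field names below are hypothetical, not the registry's actual identifiers:

```python
def deidentify(rows, crosswalk):
    """Replace MRNs with registry-assigned record_ids before import.
    `crosswalk` maps MRN -> record_id and is kept outside the registry,
    since the MCCPR never hosts MRNs or names (hypothetical sketch)."""
    out = []
    for row in rows:
        mapped = dict(row)
        mrn = mapped.pop("MRN")          # drop the identifier entirely
        mapped["record_id"] = crosswalk[mrn]
        out.append(mapped)
    return out

crosswalk = {"12345": "MCC-001", "67890": "MCC-002"}  # illustrative values
labs = [{"MRN": "12345", "lab": "Potassium", "value": 4.1}]
print(deidentify(labs, crosswalk))
# [{'lab': 'Potassium', 'value': 4.1, 'record_id': 'MCC-001'}]
```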

Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional enterprise data warehouse (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
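The key-value remapping can be sketched as follows (Python rather than R, for illustration; this lookup slice and the DD codes are hypothetical stand-ins for the ~300-entry table described above):

```python
# Hypothetical slice of the key-value lookup table:
# EHR lab subtype -> Data Dictionary (DD) code.
LAB_LOOKUP = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Sodium": "sodium",
}

def remap_labs(records):
    """Keep only labs defined in the lookup table, remapped to the DD code;
    anything outside the table is filtered out."""
    return [dict(r, lab=LAB_LOOKUP[r["lab"]])
            for r in records if r["lab"] in LAB_LOOKUP]

pulls = [{"lab": "Potassium(POC)", "value": 4.2},
         {"lab": "Magnesium", "value": 1.9}]  # Magnesium not in this slice
print(remap_labs(pulls))  # [{'lab': 'potassium', 'value': 4.2}]
```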

Data Dictionary (DD)

EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and its associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry in each data field, such as string or numeric. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and the different site csv files are simply combined.
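Because every site shares the same DD, aggregation reduces to concatenating extracts with identical headers. A minimal sketch (Python for illustration; the site files and column names are hypothetical):

```python
import csv
import io

def combine_sites(site_csvs):
    """Concatenate per-site extracts that share the same DD-defined header."""
    combined, header = [], None
    for text in site_csvs:
        rows = list(csv.reader(io.StringIO(text)))
        if header is None:
            header = rows[0]
            combined.append(header)
        # A mismatched header would mean the site diverged from the shared DD.
        assert rows[0] == header, "site file does not match the shared DD"
        combined.extend(rows[1:])
    return combined

site_a = "record_id,lab,value\nMCC-001,potassium,4.1\n"
site_b = "record_id,lab,value\nMCC-002,potassium,3.9\n"
print(combine_sites([site_a, site_b]))
```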

Study Cohort

This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

Statistical Analysis

OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
