11 datasets found
  1. SQL code.

    • plos.figshare.com
    7z
    Updated Jun 21, 2023
    Cite
    Dengao Li; Jian Fu; Jumin Zhao; Junnan Qin; Lihui Zhang (2023). SQL code. [Dataset]. http://doi.org/10.1371/journal.pone.0276835.s001
    Explore at:
    Available download formats: 7z
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Dengao Li; Jian Fu; Jumin Zhao; Junnan Qin; Lihui Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The code shows how to extract data from the MIMIC-III database. (7Z)
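
    As a rough illustration of this kind of extraction (a hedged sketch, not the authors' code from the 7z archive), the query below joins the core MIMIC-III tables; table and column names follow the public MIMIC-III schema.

    -- Hedged sketch: pull demographics, admission times, and ICU length of
    -- stay from MIMIC-III. Not the query shipped with this dataset.
    SELECT p.subject_id,
           p.gender,
           a.hadm_id,
           a.admittime,
           a.dischtime,
           i.icustay_id,
           i.los AS icu_los_days
    FROM patients p
    JOIN admissions a ON a.subject_id = p.subject_id
    JOIN icustays   i ON i.hadm_id    = a.hadm_id
    WHERE a.hospital_expire_flag = 0;  -- keep admissions the patient survived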

  2. Cleaned Retail Customer Dataset (SQL-based ETL)

    • kaggle.com
    Updated May 3, 2025
    Cite
    Rizwan Bin Akbar (2025). Cleaned Retail Customer Dataset (SQL-based ETL) [Dataset]. https://www.kaggle.com/datasets/rizwanbinakbar/cleaned-retail-customer-dataset-sql-based-etl/versions/2
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2025
    Dataset provided by
    Kaggle, http://kaggle.com/
    Authors
    Rizwan Bin Akbar
    License

    CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description

    This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:

    1. Customer Information (s_crm_cust_info)

    This table contains information about customers, including their unique identifiers and demographic details.

    Columns:
    
      cst_id: Customer ID (Primary Key)
    
      cst_gndr: Gender
    
      cst_marital_status: Marital status
    
      cst_create_date: Customer account creation date
    
    Cleaning Steps:
    
      Removed duplicates and handled missing or null cst_id values.
    
      Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
    
      Standardized gender values and identified inconsistencies in marital status.
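
    A minimal SQL sketch of these customer-cleaning steps is shown below; the dialect and the exact gender mapping values are assumptions, not the author's published scripts.

    -- Hedged sketch: drop null IDs, keep the latest row per cst_id,
    -- trim text fields, and standardize gender codes.
    SELECT cst_id,
           CASE UPPER(TRIM(cst_gndr))
                WHEN 'M' THEN 'Male'
                WHEN 'F' THEN 'Female'
                ELSE 'n/a'
           END                      AS cst_gndr,
           TRIM(cst_marital_status) AS cst_marital_status,
           cst_create_date
    FROM (
        SELECT c.*,
               ROW_NUMBER() OVER (PARTITION BY cst_id
                                  ORDER BY cst_create_date DESC) AS rn
        FROM s_crm_cust_info c
        WHERE cst_id IS NOT NULL
    ) AS ranked
    WHERE rn = 1;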
    
    2. Product Information (s_crm_prd_info / b_crm_prd_info)

    This table contains information about products, including product identifiers, names, costs, and lifecycle dates.

    Columns:
    
      prd_id: Product ID
    
      prd_key: Product key
    
      prd_nm: Product name
    
      prd_cost: Product cost
    
      prd_start_dt: Product start date
    
      prd_end_dt: Product end date
    
    Cleaning Steps:
    
      Checked for duplicates and null values in the prd_key column.
    
      Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
    
      Corrected product costs to remove invalid entries (e.g., negative values).
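
    The product checks could look roughly like the following sketch; the zero-cost default and nulling out inverted end dates are assumptions for illustration.

    -- Hedged sketch: find duplicate or missing product keys ...
    SELECT prd_key, COUNT(*) AS n
    FROM s_crm_prd_info
    GROUP BY prd_key
    HAVING COUNT(*) > 1 OR prd_key IS NULL;

    -- ... then emit a cleaned result with valid costs and ordered dates.
    SELECT prd_id, prd_key, prd_nm,
           CASE WHEN prd_cost IS NULL OR prd_cost < 0
                THEN 0 ELSE prd_cost END AS prd_cost,   -- assumed default of 0
           prd_start_dt,
           CASE WHEN prd_end_dt >= prd_start_dt
                THEN prd_end_dt ELSE NULL END AS prd_end_dt
    FROM s_crm_prd_info;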
    
    3. Sales Details (s_crm_sales_details / b_crm_sales_details)

    This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.

    Columns:
    
      sls_order_dt: Sales order date
    
      sls_due_dt: Sales due date
    
      sls_sales: Total sales amount
    
      sls_quantity: Number of products sold
    
      sls_price: Product unit price
    
    Cleaning Steps:
    
      Validated sales order dates and corrected invalid entries.
    
      Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
    
      Removed null and negative values from sls_sales, sls_quantity, and sls_price.
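
    The sales integrity rule (sls_sales = sls_price * sls_quantity) can be enforced roughly as in this sketch; recomputing the total on a mismatch is an assumption about how the corrections were made.

    -- Hedged sketch: recompute sls_sales when it is missing, non-positive,
    -- or disagrees with quantity * price; discard unusable rows.
    SELECT sls_order_dt,
           sls_due_dt,
           sls_quantity,
           ABS(sls_price) AS sls_price,
           CASE
               WHEN sls_sales IS NULL
                    OR sls_sales <= 0
                    OR sls_sales <> sls_quantity * ABS(sls_price)
               THEN sls_quantity * ABS(sls_price)
               ELSE sls_sales
           END AS sls_sales
    FROM s_crm_sales_details
    WHERE sls_quantity > 0
      AND sls_price IS NOT NULL;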
    
    4. ERP Customer Data (b_erp_cust_az12, s_erp_cust_az12)

    This table contains additional customer demographic data, including gender and birthdate.

    Columns:
    
      cid: Customer ID
    
      gen: Gender
    
      bdate: Birthdate
    
    Cleaning Steps:
    
      Checked for missing or null gender values and standardized inconsistent entries.
    
      Removed leading/trailing spaces from gen and bdate.
    
      Validated birthdates to ensure they were within a realistic range.
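
    A hedged sketch of the ERP customer cleaning; the birthdate cutoff and the gender label variants are assumptions.

    -- Hedged sketch: normalize gender labels and null out implausible
    -- birthdates (ANSI date literal; the cutoff is assumed).
    SELECT cid,
           CASE UPPER(TRIM(gen))
                WHEN 'M'      THEN 'Male'
                WHEN 'MALE'   THEN 'Male'
                WHEN 'F'      THEN 'Female'
                WHEN 'FEMALE' THEN 'Female'
                ELSE 'n/a'
           END AS gen,
           CASE WHEN bdate BETWEEN DATE '1924-01-01' AND CURRENT_DATE
                THEN bdate ELSE NULL
           END AS bdate
    FROM s_erp_cust_az12;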
    
    5. Location Information (b_erp_loc_a101)

    This table contains country information related to the customers' locations.

    Columns:
    
      cntry: Country
    
    Cleaning Steps:
    
      Standardized country names (e.g., "US" and "USA" were mapped to "United States").
    
      Removed special characters (e.g., carriage returns) and trimmed whitespace.
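
    A hedged sketch of the country standardization; only the US/USA mapping is taken from the description, the rest is illustrative (CHAR(13) strips carriage returns; use CHR(13) in PostgreSQL).

    -- Hedged sketch: strip carriage returns, trim whitespace, and map
    -- country-code variants onto one display name.
    SELECT CASE UPPER(TRIM(REPLACE(cntry, CHAR(13), '')))
                WHEN 'US'  THEN 'United States'
                WHEN 'USA' THEN 'United States'
                WHEN ''    THEN 'n/a'
                ELSE TRIM(REPLACE(cntry, CHAR(13), ''))
           END AS cntry
    FROM b_erp_loc_a101;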
    
    6. Product Category (b_erp_px_cat_g1v2)

    This table contains product category information.

    Columns:
    
      Product category data (no significant cleaning required).
    

    Key Features:

    Customer demographics, including gender and marital status
    
    Product details such as cost, start date, and end date
    
    Sales data with order dates, quantities, and sales amounts
    
    ERP-specific customer and location data
    

    Data Cleaning Process:

    This dataset underwent extensive cleaning and validation, including:

    Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
    
    Date Validations: Ensuring correct date ranges and chronological consistency.
    
    Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
    
    Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
    

    This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.

  3. Source Code Archiving to the Rescue of Reproducible Deployment — Replication Package

    • data.niaid.nih.gov
    Updated May 23, 2024
    Cite
    Zacchiroli, Stefano (2024). Source Code Archiving to the Rescue of Reproducible Deployment — Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11243113
    Explore at:
    Dataset updated
    May 23, 2024
    Dataset provided by
    Simon, Tournier
    Zacchiroli, Stefano
    Courtès, Ludovic
    Sample, Timothy
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication package for the paper:

    Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli. Source Code Archiving to the Rescue of Reproducible Deployment. ACM REP'24, June 18-20, 2024, Rennes, France. https://doi.org/10.1145/3641525.3663622

    Generating the paper

    The paper can be generated using the following command:

    guix time-machine -C channels.scm \
      -- shell -C -m manifest.scm \
      -- make

    This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.

    It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):

    GNU Make

    SQLite 3

    GNU AWK

    Rubber

    Graphviz

    TeXLive

    Structure

    data/ contains the data examined in the paper

    scripts/ contains dedicated code for the paper

    logs/ contains logs generated during certain computations

    Preservation of Guix

    Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. This monitoring has been carried out by Timothy Sample who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.

    The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.

    Analysis

    Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.

    The pog-types.sql query gives the counts of each source type (e.g. “git” or “tar-gz”) for each commit covered by the database.

    The pog-status.sql query gives the archival status of the sources by commit. For each commit, it produces a count of how many sources are stored in the Software Heritage archive, missing from it, or unknown if stored or missing. The pog-status-total.sql query does the same thing but over all sources without sorting them into individual commits.

    The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.

    Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.
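
    Purely as an illustration of the kind of aggregation these queries perform (the real SQL ships with the package and the actual schema is in data/schema.sql), a per-commit source-type count in the spirit of pog-types.sql might look like this; the table and column names (sources, commit_id, source_type) are hypothetical.

    -- Hypothetical sketch only; see data/schema.sql and the bundled
    -- *.sql files for the real schema and queries.
    SELECT commit_id, source_type, COUNT(*) AS n_sources
    FROM sources
    GROUP BY commit_id, source_type
    ORDER BY commit_id, n_sources DESC;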

    Estimating missing sources

    The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix’s Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.

    A naïve search of Git history results in an overestimate due to Guix’s branch development model: it finds hashes that were never exposed to users of ‘guix pull’. To work around this, we also approximate the history of commits available to ‘guix pull’ by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete; missing history is reconstructed in the data/missing-links.txt file.

    This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.

    To generate the estimate, use:

    guix time-machine -C channels.scm \
      -- shell -C -m manifest.scm \
      -- make data/missing-sources.txt

    If not using Guix, you will need additional software beyond what is used to generate the paper:

    GNU Guile

    GNU Bash

    GNU Mailutils

    GNU Parallel

    Measuring link rot

    In order to measure link rot, we ran Guix Scheme scripts, i.e., scripts that use Guix as a Scheme library. The scripts depend on the state of the world at the specific moment they were run, so it is not possible to reproduce exactly the same outputs; however, the trend over time should be very similar. Running them requires an installation of Guix. For instance,

    guix repl -q scripts/table-per-origin.scm

    When running these scripts for the paper, we tracked their output and saved it inside the logs directory.

  4. Domestic Electrical Load Survey Secure Data 1994-2014 - South Africa

    • datafirst.uct.ac.za
    Updated Jun 20, 2019
    Cite
    Eskom (2019). Domestic Electrical Load Survey Secure Data 1994-2014 - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/757
    Explore at:
    Dataset updated
    Jun 20, 2019
    Dataset provided by
    Eskom, http://www.eskom.co.za/
    Stellenbosch University
    University of Cape Town
    Time period covered
    1995 - 2014
    Area covered
    South Africa
    Description

    Abstract

    This dataset contains sensitive data that has not been disclosed in the online version of the Domestic Electrical Load Survey (DELS) 1994-2014 dataset. In contrast to the DELS dataset, the DELS Secure Data contains partially anonymised survey responses with only the names of respondents and home owners removed. The DELSS contains street and postal addresses, as well as GPS level location data for households from 2000 onwards. The GPS data is obtained through an auxiliary dataset, the Site Reference database. Like the DELS, the DELSS dataset has been retrieved and anonymised from the original SQL database with the python package delretrieve.

    Geographic coverage

    The study had national coverage.

    Analysis unit

    Households and individuals

    Universe

    The survey covers electrified households that received electricity either directly from Eskom or from their local municipality. Particular attention was devoted to rural and low-income households, as well as to households electrified over a range of years, so that the sample includes households that have had access to electricity for anywhere from a few years to several decades.

    Kind of data

    Sample survey data

    Sampling procedure

    See sampling procedure for DELS 1994-2014

    Mode of data collection

    Face-to-face [f2f]

    Cleaning operations

    This dataset has been produced by extracting only the survey responses from the original NRS Load Research SQL database using the saveAnswers function from the delretrieve python package (https://github.com/wiebket/delretrieve: release v1.0). Full instructions on how to use delretrieve to extract data are in the README file contained in the package.

    PARTIAL DE-IDENTIFICATION Partial de-identification was done in the process of extracting the data from the SQL database with the delretrieve package. Only the names of respondents and home owners have been removed from the survey responses by replacing responses with an 'a' in the dataset. Documents with full details of the variables that have been anonymised are included as external resources.

    MISSING VALUES Other than partial de-identification no post-processing was done and all database records, including missing values, are stored exactly as retrieved.

    Data appraisal

    See notes on data quality for DELS 1994-2014

  5. Fields of Format 1 which cannot be transformed.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    + more versions
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). Fields of Format 1 which cannot be transformed. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Because the data formats within the German Health Data Lab have evolved over time, the data needs to be standardized into a common data model to facilitate collaborative health research and to minimize the need for researchers to adapt to multiple data formats. For this purpose we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

    Methods: We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment covering field coverage and concept mapping accuracy using example data.

    Results: For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. Mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts, while the Condition domain had 99.8% of unique codes mapped. The absence of real data limits a comprehensive assessment of quality.

    Conclusion: The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.
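
    As a rough illustration of the concept-mapping step described above (a hedged sketch, not the published ETL code in the linked repository), a source diagnosis code can be translated to a standard OMOP concept via the vocabulary tables; the staging table claims_diagnoses and its columns are hypothetical, while concept and concept_relationship are standard OMOP CDM vocabulary tables.

    -- Hedged sketch: map ICD-10-GM source codes to standard OMOP concepts.
    -- claims_diagnoses is a hypothetical staging table; unmapped codes
    -- fall back to concept_id 0 (no matching concept).
    SELECT d.person_id,
           d.diagnosis_date             AS condition_start_date,
           d.icd10gm_code               AS condition_source_value,
           COALESCE(std.concept_id, 0)  AS condition_concept_id
    FROM claims_diagnoses d
    LEFT JOIN concept src
           ON src.vocabulary_id = 'ICD10GM'
          AND src.concept_code  = d.icd10gm_code
    LEFT JOIN concept_relationship rel
           ON rel.concept_id_1    = src.concept_id
          AND rel.relationship_id = 'Maps to'
    LEFT JOIN concept std
           ON std.concept_id       = rel.concept_id_2
          AND std.standard_concept = 'S';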

  6. DQD results of Format 3.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    + more versions
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). DQD results of Format 3. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Because the data formats within the German Health Data Lab have evolved over time, the data needs to be standardized into a common data model to facilitate collaborative health research and to minimize the need for researchers to adapt to multiple data formats. For this purpose we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

    Methods: We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment covering field coverage and concept mapping accuracy using example data.

    Results: For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. Mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts, while the Condition domain had 99.8% of unique codes mapped. The absence of real data limits a comprehensive assessment of quality.

    Conclusion: The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.

  7. Execution time of ETL process for example data 3.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    + more versions
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). Execution time of ETL process for example data 3. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Because the data formats within the German Health Data Lab have evolved over time, the data needs to be standardized into a common data model to facilitate collaborative health research and to minimize the need for researchers to adapt to multiple data formats. For this purpose we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

    Methods: We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment covering field coverage and concept mapping accuracy using example data.

    Results: For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. Mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts, while the Condition domain had 99.8% of unique codes mapped. The absence of real data limits a comprehensive assessment of quality.

    Conclusion: The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.

  8. Fields of Format 3 which cannot be transformed.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). Fields of Format 3 which cannot be transformed. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Because the data formats within the German Health Data Lab have evolved over time, the data needs to be standardized into a common data model to facilitate collaborative health research and to minimize the need for researchers to adapt to multiple data formats. For this purpose we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

    Methods: We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment covering field coverage and concept mapping accuracy using example data.

    Results: For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. Mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts, while the Condition domain had 99.8% of unique codes mapped. The absence of real data limits a comprehensive assessment of quality.

    Conclusion: The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.

  9. Open Trade Statistics Database

    • zenodo.org
    bin
    Updated Aug 25, 2024
    Cite
    Mauricio Vargas Sepulveda; Mauricio Vargas Sepulveda (2024). Open Trade Statistics Database [Dataset]. http://doi.org/10.5281/zenodo.13370487
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 25, 2024
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Mauricio Vargas Sepulveda; Mauricio Vargas Sepulveda
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The Open Trade Statistics initiative was developed to ease access to international trade data by providing downloadable SQL database dumps, a public API, a dashboard, and an R package for data retrieval. This project was born out of the recognition that many academic institutions in Latin America lack access to academic subscriptions and comprehensive datasets like the United Nations Commodity Trade Statistics Database. The OTS project not only offers a solution to this problem regarding international trade data but also emphasizes the importance of reproducibility in data processing. Through the use of open-source tools, the project ensures that its datasets are accessible and easy to use for research and analysis.

    OTS, based on the official correlation tables, provides a harmonized dataset in which the values are converted to HS revision 2012 for the years 1980-2021; this involved transforming some of the reported data to find equivalent codes between the different classifications. For instance, the HS revision 1992 code '271011' (aviation spirit) does not have a direct equivalent in HS revision 2012 and is converted to the more general code '271000' (oils petroleum, bituminous, distillates, except crude). The same process was applied to the SITC codes.

    Country codes are also standardized in OTS: missing ISO-3 country codes in the raw data were replaced by the values given in the UN COMTRADE documentation. For example, the numeric code '490' appears as a blank value in the raw data, and the UN COMTRADE documentation indicates that it corresponds to 'e-490', i.e. 'Other Asia, Not Elsewhere Specified (NES)'.
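
    A hedged SQL sketch of that harmonization step; the table names here (trade_raw, hs92_to_hs12, country_codes) are hypothetical stand-ins for the official correlation tables, not the project's actual schema.

    -- Illustrative only: convert reported HS92 codes to HS 2012 via a
    -- correspondence table and standardize reporter codes to ISO-3.
    SELECT t.year,
           c.iso3                 AS reporter_iso,   -- e.g. numeric 490 -> 'e-490' (Other Asia, NES)
           m.hs12_code            AS commodity_code, -- e.g. HS92 '271011' -> HS12 '271000'
           SUM(t.trade_value_usd) AS trade_value_usd
    FROM trade_raw t
    JOIN hs92_to_hs12  m ON m.hs92_code  = t.commodity_code
    JOIN country_codes c ON c.un_numeric = t.reporter_numeric
    GROUP BY t.year, c.iso3, m.hs12_code;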

    Commercial use of this data is strictly prohibited under the UN Comtrade dissemination clauses.

    Visit tradestatistics.io to access the dashboard and R package for data retrieval.

  10. Execution time of ETL process for example data 1.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    + more versions
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). Execution time of ETL process for example data 1. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Because the data formats within the German Health Data Lab have evolved over time, the data needs to be standardized into a common data model to facilitate collaborative health research and to minimize the need for researchers to adapt to multiple data formats. For this purpose we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

    Methods: We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment covering field coverage and concept mapping accuracy using example data.

    Results: For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. Mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts, while the Condition domain had 99.8% of unique codes mapped. The absence of real data limits a comprehensive assessment of quality.

    Conclusion: The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.

  11. Code mapping coverage of Format 3.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). Code mapping coverage of Format 3. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    It comprises all OMOP CDM domains, including local concepts, as well as the ICD-10-GM and OPS to standard-concept mappings. Example data 3 contains fictional OPS codes. A PZN mapping for converting codes in the Drug domain to standard concepts is lacking.
