11 datasets found
  1. SQL code.

    • plos.figshare.com
    7z
    Updated Jun 21, 2023
    Cite
    Dengao Li; Jian Fu; Jumin Zhao; Junnan Qin; Lihui Zhang (2023). SQL code. [Dataset]. http://doi.org/10.1371/journal.pone.0276835.s001
    Explore at:
    Available download formats: 7z
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Dengao Li; Jian Fu; Jumin Zhao; Junnan Qin; Lihui Zhang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The code shows how to extract data from the MIMIC-III database. (7Z)
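
    As a rough illustration of this kind of extraction (a hedged sketch, not the authors' code from the 7z archive), the query below joins the core MIMIC-III tables; table and column names follow the public MIMIC-III schema.

    -- Hedged sketch: pull demographics, admission times, and ICU length of
    -- stay from MIMIC-III. Not the query shipped with this dataset.
    SELECT p.subject_id,
           p.gender,
           a.hadm_id,
           a.admittime,
           a.dischtime,
           i.icustay_id,
           i.los AS icu_los_days
    FROM patients p
    JOIN admissions a ON a.subject_id = p.subject_id
    JOIN icustays   i ON i.hadm_id    = a.hadm_id
    WHERE a.hospital_expire_flag = 0;  -- keep admissions the patient survived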

  2. Cleaned Retail Customer Dataset (SQL-based ETL)

    • kaggle.com
    Updated May 3, 2025
    Cite
    Rizwan Bin Akbar (2025). Cleaned Retail Customer Dataset (SQL-based ETL) [Dataset]. https://www.kaggle.com/datasets/rizwanbinakbar/cleaned-retail-customer-dataset-sql-based-etl/versions/2
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2025
    Dataset provided by
    Kaggle, http://kaggle.com/
    Authors
    Rizwan Bin Akbar
    License

    CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description

    This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:

    1. Customer Information (s_crm_cust_info)

    This table contains information about customers, including their unique identifiers and demographic details.

    Columns:
    
      cst_id: Customer ID (Primary Key)
    
      cst_gndr: Gender
    
      cst_marital_status: Marital status
    
      cst_create_date: Customer account creation date
    
    Cleaning Steps:
    
      Removed duplicates and handled missing or null cst_id values.
    
      Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
    
      Standardized gender values and identified inconsistencies in marital status.
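
    A minimal SQL sketch of these customer-cleaning steps is shown below; the dialect and the exact gender mapping values are assumptions, not the author's published scripts.

    -- Hedged sketch: drop null IDs, keep the latest row per cst_id,
    -- trim text fields, and standardize gender codes.
    SELECT cst_id,
           CASE UPPER(TRIM(cst_gndr))
                WHEN 'M' THEN 'Male'
                WHEN 'F' THEN 'Female'
                ELSE 'n/a'
           END                      AS cst_gndr,
           TRIM(cst_marital_status) AS cst_marital_status,
           cst_create_date
    FROM (
        SELECT c.*,
               ROW_NUMBER() OVER (PARTITION BY cst_id
                                  ORDER BY cst_create_date DESC) AS rn
        FROM s_crm_cust_info c
        WHERE cst_id IS NOT NULL
    ) AS ranked
    WHERE rn = 1;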
    
    2. Product Information (s_crm_prd_info / b_crm_prd_info)

    This table contains information about products, including product identifiers, names, costs, and lifecycle dates.

    Columns:
    
      prd_id: Product ID
    
      prd_key: Product key
    
      prd_nm: Product name
    
      prd_cost: Product cost
    
      prd_start_dt: Product start date
    
      prd_end_dt: Product end date
    
    Cleaning Steps:
    
      Checked for duplicates and null values in the prd_key column.
    
      Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
    
      Corrected product costs to remove invalid entries (e.g., negative values).
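
    The product checks could look roughly like the following sketch; the zero-cost default and nulling out inverted end dates are assumptions for illustration.

    -- Hedged sketch: find duplicate or missing product keys ...
    SELECT prd_key, COUNT(*) AS n
    FROM s_crm_prd_info
    GROUP BY prd_key
    HAVING COUNT(*) > 1 OR prd_key IS NULL;

    -- ... then emit a cleaned result with valid costs and ordered dates.
    SELECT prd_id, prd_key, prd_nm,
           CASE WHEN prd_cost IS NULL OR prd_cost < 0
                THEN 0 ELSE prd_cost END AS prd_cost,   -- assumed default of 0
           prd_start_dt,
           CASE WHEN prd_end_dt >= prd_start_dt
                THEN prd_end_dt ELSE NULL END AS prd_end_dt
    FROM s_crm_prd_info;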
    
    3. Sales Details (s_crm_sales_details / b_crm_sales_details)

    This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.

    Columns:
    
      sls_order_dt: Sales order date
    
      sls_due_dt: Sales due date
    
      sls_sales: Total sales amount
    
      sls_quantity: Number of products sold
    
      sls_price: Product unit price
    
    Cleaning Steps:
    
      Validated sales order dates and corrected invalid entries.
    
      Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
    
      Removed null and negative values from sls_sales, sls_quantity, and sls_price.
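
    The sales integrity rule (sls_sales = sls_price * sls_quantity) can be enforced roughly as in this sketch; recomputing the total on a mismatch is an assumption about how the corrections were made.

    -- Hedged sketch: recompute sls_sales when it is missing, non-positive,
    -- or disagrees with quantity * price; discard unusable rows.
    SELECT sls_order_dt,
           sls_due_dt,
           sls_quantity,
           ABS(sls_price) AS sls_price,
           CASE
               WHEN sls_sales IS NULL
                    OR sls_sales <= 0
                    OR sls_sales <> sls_quantity * ABS(sls_price)
               THEN sls_quantity * ABS(sls_price)
               ELSE sls_sales
           END AS sls_sales
    FROM s_crm_sales_details
    WHERE sls_quantity > 0
      AND sls_price IS NOT NULL;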
    
    4. ERP Customer Data (b_erp_cust_az12, s_erp_cust_az12)

    This table contains additional customer demographic data, including gender and birthdate.

    Columns:
    
      cid: Customer ID
    
      gen: Gender
    
      bdate: Birthdate
    
    Cleaning Steps:
    
      Checked for missing or null gender values and standardized inconsistent entries.
    
      Removed leading/trailing spaces from gen and bdate.
    
      Validated birthdates to ensure they were within a realistic range.
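
    A hedged sketch of the ERP customer cleaning; the birthdate cutoff and the gender label variants are assumptions.

    -- Hedged sketch: normalize gender labels and null out implausible
    -- birthdates (ANSI date literal; the cutoff is assumed).
    SELECT cid,
           CASE UPPER(TRIM(gen))
                WHEN 'M'      THEN 'Male'
                WHEN 'MALE'   THEN 'Male'
                WHEN 'F'      THEN 'Female'
                WHEN 'FEMALE' THEN 'Female'
                ELSE 'n/a'
           END AS gen,
           CASE WHEN bdate BETWEEN DATE '1924-01-01' AND CURRENT_DATE
                THEN bdate ELSE NULL
           END AS bdate
    FROM s_erp_cust_az12;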
    
    5. Location Information (b_erp_loc_a101)

    This table contains country information related to the customers' locations.

    Columns:
    
      cntry: Country
    
    Cleaning Steps:
    
      Standardized country names (e.g., "US" and "USA" were mapped to "United States").
    
      Removed special characters (e.g., carriage returns) and trimmed whitespace.
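
    A hedged sketch of the country standardization; only the US/USA mapping is taken from the description, the rest is illustrative (CHAR(13) strips carriage returns; use CHR(13) in PostgreSQL).

    -- Hedged sketch: strip carriage returns, trim whitespace, and map
    -- country-code variants onto one display name.
    SELECT CASE UPPER(TRIM(REPLACE(cntry, CHAR(13), '')))
                WHEN 'US'  THEN 'United States'
                WHEN 'USA' THEN 'United States'
                WHEN ''    THEN 'n/a'
                ELSE TRIM(REPLACE(cntry, CHAR(13), ''))
           END AS cntry
    FROM b_erp_loc_a101;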
    
    6. Product Category (b_erp_px_cat_g1v2)

    This table contains product category information.

    Columns:
    
      Product category data (no significant cleaning required).
    

    Key Features:

    Customer demographics, including gender and marital status
    
    Product details such as cost, start date, and end date
    
    Sales data with order dates, quantities, and sales amounts
    
    ERP-specific customer and location data
    

    Data Cleaning Process:

    This dataset underwent extensive cleaning and validation, including:

    Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
    
    Date Validations: Ensuring correct date ranges and chronological consistency.
    
    Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
    
    Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
    

    This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.

  3. Source Code Archiving to the Rescue of Reproducible Deployment — Replication Package

    • data.niaid.nih.gov
    Updated May 23, 2024
    Cite
    Zacchiroli, Stefano (2024). Source Code Archiving to the Rescue of Reproducible Deployment — Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11243113
    Explore at:
    Dataset updated
    May 23, 2024
    Dataset provided by
    Simon, Tournier
    Zacchiroli, Stefano
    Courtès, Ludovic
    Sample, Timothy
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication package for the paper:

    Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli. Source Code Archiving to the Rescue of Reproducible Deployment. ACM REP'24, June 18-20, 2024, Rennes, France. https://doi.org/10.1145/3641525.3663622

    Generating the paper

    The paper can be generated using the following command:

    guix time-machine -C channels.scm \
      -- shell -C -m manifest.scm \
      -- make

    This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.

    It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):

    GNU Make

    SQLite 3

    GNU AWK

    Rubber

    Graphviz

    TeXLive

    Structure

    data/ contains the data examined in the paper

    scripts/ contains dedicated code for the paper

    logs/ contains logs generated during certain computations

    Preservation of Guix

    Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. This monitoring has been carried out by Timothy Sample who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.

    The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.

    Analysis

    Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.

    The pog-types.sql query gives the counts of each source type (e.g. “git” or “tar-gz”) for each commit covered by the database.

    The pog-status.sql query gives the archival status of the sources by commit. For each commit, it produces a count of how many sources are stored in the Software Heritage archive, missing from it, or unknown if stored or missing. The pog-status-total.sql query does the same thing but over all sources without sorting them into individual commits.

    The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.

    Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.
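
    Purely as an illustration of the kind of aggregation these queries perform (the real SQL ships with the package and the actual schema is in data/schema.sql), a per-commit source-type count in the spirit of pog-types.sql might look like this; the table and column names (sources, commit_id, source_type) are hypothetical.

    -- Hypothetical sketch only; see data/schema.sql and the bundled
    -- *.sql files for the real schema and queries.
    SELECT commit_id, source_type, COUNT(*) AS n_sources
    FROM sources
    GROUP BY commit_id, source_type
    ORDER BY commit_id, n_sources DESC;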

    Estimating missing sources

    The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix’s Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.

    A naïve search of Git history results in an overestimate due to Guix’s branch development model: it finds hashes that were never exposed to users of ‘guix pull’. To work around this, we also approximate the history of commits available to ‘guix pull’ by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete; missing history is reconstructed in the data/missing-links.txt file.

    This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.

    To generate the estimate, use:

    guix time-machine -C channels.scm \
      -- shell -C -m manifest.scm \
      -- make data/missing-sources.txt

    If not using Guix, you will need additional software beyond what is used to generate the paper:

    GNU Guile

    GNU Bash

    GNU Mailutils

    GNU Parallel

    Measuring link rot

    In order to measure link rot, we ran Guix Scheme scripts, i.e., scripts that use Guix as a Scheme library. The scripts depend on the state of the world at the specific moment they were run, so it is not possible to reproduce exactly the same outputs; however, the trend over time should be very similar. Running them requires an installation of Guix. For instance,

    guix repl -q scripts/table-per-origin.scm

    When running these scripts for the paper, we tracked their output and saved it inside the logs directory.

  4. Domestic Electrical Load Survey Secure Data 1994-2014 - South Africa

    • datafirst.uct.ac.za
    Updated Jun 20, 2019
    Cite
    Eskom (2019). Domestic Electrical Load Survey Secure Data 1994-2014 - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/757
    Explore at:
    Dataset updated
    Jun 20, 2019
    Dataset provided by
    Eskom, http://www.eskom.co.za/
    Stellenbosch University
    University of Cape Town
    Time period covered
    1995 - 2014
    Area covered
    South Africa
    Description

    Abstract

    This dataset contains sensitive data that has not been disclosed in the online version of the Domestic Electrical Load Survey (DELS) 1994-2014 dataset. In contrast to the DELS dataset, the DELS Secure Data contains partially anonymised survey responses with only the names of respondents and home owners removed. The DELSS contains street and postal addresses, as well as GPS level location data for households from 2000 onwards. The GPS data is obtained through an auxiliary dataset, the Site Reference database. Like the DELS, the DELSS dataset has been retrieved and anonymised from the original SQL database with the python package delretrieve.

    Geographic coverage

    The study had national coverage.

    Analysis unit

    Households and individuals

    Universe

    The survey covers electrified households that received electricity either directly from Eskom or from their local municipality. Particular attention was devoted to rural and low-income households, as well as to households electrified over a range of years, so that the sample includes households that have had access to electricity for anywhere from a few years to several decades.

    Kind of data

    Sample survey data

    Sampling procedure

    See sampling procedure for DELS 1994-2014

    Mode of data collection

    Face-to-face [f2f]

    Cleaning operations

    This dataset has been produced by extracting only the survey responses from the original NRS Load Research SQL database using the saveAnswers function from the delretrieve python package (https://github.com/wiebket/delretrieve: release v1.0). Full instructions on how to use delretrieve to extract data are in the README file contained in the package.

    PARTIAL DE-IDENTIFICATION Partial de-identification was done in the process of extracting the data from the SQL database with the delretrieve package. Only the names of respondents and home owners have been removed from the survey responses by replacing responses with an 'a' in the dataset. Documents with full details of the variables that have been anonymised are included as external resources.

    MISSING VALUES Other than partial de-identification no post-processing was done and all database records, including missing values, are stored exactly as retrieved.

    Data appraisal

    See notes on data quality for DELS 1994-2014

  5. Fields of Format 1 which cannot be transformed.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    + more versions
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). Fields of Format 1 which cannot be transformed. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Because the data formats within the German Health Data Lab have evolved over time, the data needs to be standardized into a common data model to facilitate collaborative health research and to minimize the need for researchers to adapt to multiple data formats. For this purpose we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

    Methods: We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment covering field coverage and concept mapping accuracy using example data.

    Results: For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. Mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts, while the Condition domain had 99.8% of unique codes mapped. The absence of real data limits a comprehensive assessment of quality.

    Conclusion: The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.
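
    As a rough illustration of the concept-mapping step described above (a hedged sketch, not the published ETL code in the linked repository), a source diagnosis code can be translated to a standard OMOP concept via the vocabulary tables; the staging table claims_diagnoses and its columns are hypothetical, while concept and concept_relationship are standard OMOP CDM vocabulary tables.

    -- Hedged sketch: map ICD-10-GM source codes to standard OMOP concepts.
    -- claims_diagnoses is a hypothetical staging table; unmapped codes
    -- fall back to concept_id 0 (no matching concept).
    SELECT d.person_id,
           d.diagnosis_date             AS condition_start_date,
           d.icd10gm_code               AS condition_source_value,
           COALESCE(std.concept_id, 0)  AS condition_concept_id
    FROM claims_diagnoses d
    LEFT JOIN concept src
           ON src.vocabulary_id = 'ICD10GM'
          AND src.concept_code  = d.icd10gm_code
    LEFT JOIN concept_relationship rel
           ON rel.concept_id_1    = src.concept_id
          AND rel.relationship_id = 'Maps to'
    LEFT JOIN concept std
           ON std.concept_id       = rel.concept_id_2
          AND std.standard_concept = 'S';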

  6. DQD results of Format 3.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    + more versions
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). DQD results of Format 3. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Because the data formats within the German Health Data Lab have evolved over time, the data needs to be standardized into a common data model to facilitate collaborative health research and to minimize the need for researchers to adapt to multiple data formats. For this purpose we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

    Methods: We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment covering field coverage and concept mapping accuracy using example data.

    Results: For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. Mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts, while the Condition domain had 99.8% of unique codes mapped. The absence of real data limits a comprehensive assessment of quality.

    Conclusion: The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.

  7. Execution time of ETL process for example data 3.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    + more versions
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). Execution time of ETL process for example data 3. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Because the data formats within the German Health Data Lab have evolved over time, the data needs to be standardized into a common data model to facilitate collaborative health research and to minimize the need for researchers to adapt to multiple data formats. For this purpose we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

    Methods: We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment covering field coverage and concept mapping accuracy using example data.

    Results: For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. Mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts, while the Condition domain had 99.8% of unique codes mapped. The absence of real data limits a comprehensive assessment of quality.

    Conclusion: The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.

  8. Fields of Format 3 which cannot be transformed.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). Fields of Format 3 which cannot be transformed. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Because the data formats within the German Health Data Lab have evolved over time, the data needs to be standardized into a common data model to facilitate collaborative health research and to minimize the need for researchers to adapt to multiple data formats. For this purpose we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

    Methods: We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment covering field coverage and concept mapping accuracy using example data.

    Results: For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. Mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts, while the Condition domain had 99.8% of unique codes mapped. The absence of real data limits a comprehensive assessment of quality.

    Conclusion: The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.

  9. Open Trade Statistics Database

    • zenodo.org
    bin
    Updated Aug 25, 2024
    Cite
    Mauricio Vargas Sepulveda; Mauricio Vargas Sepulveda (2024). Open Trade Statistics Database [Dataset]. http://doi.org/10.5281/zenodo.13370487
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 25, 2024
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Mauricio Vargas Sepulveda; Mauricio Vargas Sepulveda
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The Open Trade Statistics initiative was developed to ease access to international trade data by providing downloadable SQL database dumps, a public API, a dashboard, and an R package for data retrieval. This project was born out of the recognition that many academic institutions in Latin America lack access to academic subscriptions and comprehensive datasets like the United Nations Commodity Trade Statistics Database. The OTS project not only offers a solution to this problem regarding international trade data but also emphasizes the importance of reproducibility in data processing. Through the use of open-source tools, the project ensures that its datasets are accessible and easy to use for research and analysis.

    OTS, based on the official correlation tables, provides a harmonized dataset in which the values are converted to HS revision 2012 for the years 1980-2021; this involved transforming some of the reported data to find equivalent codes between the different classifications. For instance, the HS revision 1992 code '271011' (aviation spirit) does not have a direct equivalent in HS revision 2012 and is converted to the more general code '271000' (oils petroleum, bituminous, distillates, except crude). The same process was applied to the SITC codes.

    Country codes are also standardized in OTS: missing ISO-3 country codes in the raw data were replaced by the values given in the UN COMTRADE documentation. For example, the numeric code '490' appears as a blank value in the raw data, and the UN COMTRADE documentation indicates that it corresponds to 'e-490', i.e. 'Other Asia, Not Elsewhere Specified (NES)'.
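
    A hedged SQL sketch of that harmonization step; the table names here (trade_raw, hs92_to_hs12, country_codes) are hypothetical stand-ins for the official correlation tables, not the project's actual schema.

    -- Illustrative only: convert reported HS92 codes to HS 2012 via a
    -- correspondence table and standardize reporter codes to ISO-3.
    SELECT t.year,
           c.iso3                 AS reporter_iso,   -- e.g. numeric 490 -> 'e-490' (Other Asia, NES)
           m.hs12_code            AS commodity_code, -- e.g. HS92 '271011' -> HS12 '271000'
           SUM(t.trade_value_usd) AS trade_value_usd
    FROM trade_raw t
    JOIN hs92_to_hs12  m ON m.hs92_code  = t.commodity_code
    JOIN country_codes c ON c.un_numeric = t.reporter_numeric
    GROUP BY t.year, c.iso3, m.hs12_code;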

    Commercial use of this data is strictly prohibited under the UN Comtrade dissemination clauses.

    Visit tradestatistics.io to access the dashboard and R package for data retrieval.

  10. Execution time of ETL process for example data 1.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    + more versions
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). Execution time of ETL process for example data 1. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Because the data formats within the German Health Data Lab have evolved over time, the data needs to be standardized into a common data model to facilitate collaborative health research and to minimize the need for researchers to adapt to multiple data formats. For this purpose we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

    Methods: We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment covering field coverage and concept mapping accuracy using example data.

    Results: For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. Mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts, while the Condition domain had 99.8% of unique codes mapped. The absence of real data limits a comprehensive assessment of quality.

    Conclusion: The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.

  11. Code mapping coverage of Format 3.

    • plos.figshare.com
    xls
    Updated Jan 6, 2025
    Cite
    Melissa Finster; Maxim Moinat; Elham Taghizadeh (2025). Code mapping coverage of Format 3. [Dataset]. http://doi.org/10.1371/journal.pone.0311511.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Melissa Finster; Maxim Moinat; Elham Taghizadeh
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    It comprises all OMOP CDM domains, including local concepts, as well as the ICD-10-GM and OPS to standard-concept mappings. Example data 3 contains fictional OPS codes. A PZN mapping for converting codes in the Drug domain to standard concepts is lacking.
