Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Analyzing metabolites using mass spectrometry provides valuable insight into an individual’s health or disease status. However, various sources of experimental variation can be introduced during sample handling, preparation, and measurement, which can negatively affect the data. Quality assurance and quality control practices are essential to ensuring accurate and reproducible metabolomics data. These practices include measuring reference samples to monitor instrument stability, blank samples to evaluate the background signal, and strategies to correct for changes in instrumental performance. In this context, we introduce mzQuality, a user-friendly, open-source R-Shiny app designed to assess and correct technical variations in mass spectrometry-based metabolomics data. It processes peak-integrated data independently of vendor software and provides essential quality control features, including batch correction, outlier detection, and background signal assessment, and it visualizes trends in signal or retention time. We demonstrate its functionality using a data set of 419 samples measured across six batches, including quality control samples. mzQuality visualizes data through sample plots, PCA plots, and violin plots, which illustrate its ability to reduce the effect of experimental variation. Compound quality is further assessed by evaluating the relative standard deviation of quality control samples and the background signal from blank samples. Based on these quality metrics, compounds are classified into confidence levels. mzQuality provides an accessible solution to improve the data quality without requiring prior programming skills. Its customizable settings integrate seamlessly into research workflows, enhancing the accuracy and reproducibility of the metabolomics data. Additionally, with an R-compatible output, the data are ready for statistical analysis and biological interpretation.
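As a generic illustration of the two compound-level quality metrics described above (the relative standard deviation of quality control samples and the background signal from blank samples), a short R sketch follows. It is not mzQuality's actual implementation; the object names, sample-type labels, and thresholds are assumptions for illustration only.

# Hypothetical inputs: `areas` is a compounds-by-samples matrix of integrated peak areas,
# and `sample_type` labels each column as "QC", "BLANK", or "SAMPLE".
qc    <- areas[, sample_type == "QC", drop = FALSE]
blank <- areas[, sample_type == "BLANK", drop = FALSE]

# Relative standard deviation (%) of each compound across QC samples
rsd_qc <- apply(qc, 1, function(x) 100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))

# Background ratio: mean blank signal relative to mean QC signal
blank_ratio <- rowMeans(blank, na.rm = TRUE) / rowMeans(qc, na.rm = TRUE)

# Illustrative classification into confidence levels (thresholds are not mzQuality defaults)
confidence <- ifelse(rsd_qc < 15 & blank_ratio < 0.1, "high",
              ifelse(rsd_qc < 30 & blank_ratio < 0.4, "medium", "low"))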
A comprehensive Quality Assurance (QA) and Quality Control (QC) statistical framework consists of three major phases: Phase 1—preliminary exploration of the raw datasets, including time formatting and combining datasets of different lengths and different time intervals; Phase 2—QA of the datasets, including detecting and flagging duplicates, outliers, and extreme values; and Phase 3—development of a time series of the desired frequency, imputation of missing values, visualization, and a final statistical summary. The time series data collected at the Billy Barr meteorological station (East River Watershed, Colorado) were analyzed. The developed statistical framework is suitable for both real-time and post-data-collection QA/QC analysis of meteorological datasets.

The files in this data package include one Excel file converted to CSV format (Billy_Barr_raw_qaqc.csv) that contains the raw meteorological data, i.e., the input data used for the QA/QC analysis. The second CSV file (Billy_Barr_1hr.csv) contains the quality-controlled and flagged meteorological data, i.e., the output data from the QA/QC analysis. The last file (QAQC_Billy_Barr_2021-03-22.R) is a script written in R that implements the QA/QC and flagging process. The purpose of the CSV data files included in this package is to provide the input and output files used by the R script.
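As a rough sketch of the Phase 2 flagging and Phase 3 aggregation steps described above (not the code in QAQC_Billy_Barr_2021-03-22.R itself), duplicates and extreme values in a meteorological series might be flagged in base R as follows; the timestamp and air-temperature column names are assumptions.

met <- read.csv("Billy_Barr_raw_qaqc.csv")              # raw input distributed with this package
met$timestamp <- as.POSIXct(met$timestamp)              # assumed timestamp column name

# Phase 2: flag duplicated timestamps and extreme values (illustrative z-score threshold)
met$flag_duplicate <- duplicated(met$timestamp)
z <- (met$air_temp - mean(met$air_temp, na.rm = TRUE)) / sd(met$air_temp, na.rm = TRUE)
met$flag_extreme <- !is.na(z) & abs(z) > 4

# Phase 3: build an hourly series from unflagged records
ok <- !met$flag_duplicate & !met$flag_extreme
met$hour <- format(met$timestamp, "%Y-%m-%d %H:00")
hourly <- aggregate(air_temp ~ hour, data = met[ok, ], FUN = mean)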
• Load and view a real-world dataset in RStudio
• Calculate “Measure of Frequency” metrics
• Calculate “Measure of Central Tendency” metrics
• Calculate “Measure of Dispersion” metrics
• Use R’s in-built functions for additional data quality metrics
• Create a custom R function to calculate descriptive statistics on any given dataset (a sketch of such a function follows below)
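A minimal sketch of such a custom function, using only base R; the metric choices are illustrative and not tied to any particular course dataset.

describe <- function(x) {
  x <- x[!is.na(x)]
  freq <- table(x)                                  # measure of frequency
  c(n      = length(x),
    mean   = mean(x),                               # measures of central tendency
    median = median(x),
    mode   = as.numeric(names(freq)[which.max(freq)]),
    range  = diff(range(x)),                        # measures of dispersion
    var    = var(x),
    sd     = sd(x),
    iqr    = IQR(x))
}

# Apply to every numeric column of a data frame loaded in RStudio, e.g. df <- read.csv("data.csv"):
# sapply(Filter(is.numeric, df), describe)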
The dataset includes a PDF file containing the results and an Excel file with the following tables:
- Table S1: Results of comparing the performance of MetaFetcheR to MetaboAnalystR using Diamanti et al.
- Table S2: Results of comparing the performance of MetaFetcheR to MetaboAnalystR for Priolo et al.
- Table S3: Results of comparing the performance of MetaFetcheR to the MetaboAnalyst 5.0 webtool using Diamanti et al.
- Table S4: Results of comparing the performance of MetaFetcheR to the MetaboAnalyst 5.0 webtool for Priolo et al.
- Table S5: Data quality test results for running 100 iterations on the HMDB database.
- Table S6: Data quality test results for running 100 iterations on the KEGG database.
- Table S7: Data quality test results for running 100 iterations on the ChEBI database.
- Table S8: Data quality test results for running 100 iterations on the PubChem database.
- Table S9: Data quality test results for running 100 iterations on the LIPID MAPS database.
- Table S10: The list of metabolites that were not mapped by MetaboAnalystR for Diamanti et al.
- Table S11: An example of an input matrix for MetaFetcheR.
- Table S12: Results of comparing the performance of MetaFetcheR to MS_targeted using Diamanti et al.
- Table S13: Data set from Diamanti et al.
- Table S14: Data set from Priolo et al.
- Table S15: Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Diamanti et al.
- Table S16: Results of comparing the performance of MetaFetcheR to CTS using LIPID MAPS identifiers available in Diamanti et al.
- Table S17: Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al.
- Table S18: Results of comparing the performance of MetaFetcheR to CTS using KEGG identifiers available in Priolo et al.

(See the "index" tab in the Excel file for more information.)
Small-compound databases contain a large amount of information about metabolites and metabolic pathways. However, the plethora of such databases and the redundancy of their information lead to major issues with analysis and standardization. Failing to establish a means of data access at the early stages of a project can lead to mislabelled compounds, reduced statistical power, and long delays in the delivery of results.
We developed MetaFetcheR, an open-source R package that links metabolite data from several small-compound databases, resolves inconsistencies, and covers a variety of data-fetching use cases. We showed that MetaFetcheR outperformed existing approaches and databases by benchmarking the algorithm in three independent case studies based on two published datasets.
The dataset was originally published in DiVA and moved to SND in 2024.
https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods
eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional enterprise data warehouse (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
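As a generic illustration of this kind of key-value remapping (not the eLAB source code itself), a lookup table of raw EHR lab names can be joined to registry codes in R; the table contents and column names below are hypothetical.

library(dplyr)

# Hypothetical lookup table mapping raw EHR lab names to Data Dictionary codes and units
lookup <- data.frame(
  raw_lab = c("Potassium", "Potassium-External", "Potassium(POC)", "Potassium,whole-bld"),
  dd_code = "potassium",
  dd_unit = "mmol/L"
)

# `raw_labs` stands in for a bulk EHR pull with raw_lab and unit columns (assumed names)
labs_remapped <- raw_labs %>%
  inner_join(lookup, by = "raw_lab") %>%   # drops lab subtypes not defined in the lookup
  filter(unit == dd_unit)                  # keep only units pre-defined by the registry DD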
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as strings or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contain the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data, you probably do not want to download all of the files, and depending on your computation resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse mediawiki xml dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis; this file is expensive to generate and, at 1.5GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003.

These instructions work backwards from building the manuscript using knitr, to loading the datasets, running the analysis, and building the intermediate datasets.

Building the manuscript using knitr
This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways; on Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar, which has everything you need to typeset the manuscript, and unpack the tar archive (on a unix system this can be done by running tar xf code.tar). Navigate to code/paper_source and install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should then be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

Loading intermediate datasets
The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z; the files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

Running the analysis
Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a unix system, tar xf code.tar && 7z x intermediate_data.7z). Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots and create the RDS files.

Generating datasets: building the intermediate files
The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a unix system, tar xf code.tar && 7z x userroles_data.7z). Install the R dependencies: in R run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R.

Building all.edits.RDS
The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a landing page. To access the datasets, expand the RELATED DATASETS section below, and follow the link to the dataset you require.

The Remote Sensing Organisational Unit as part of the Water Group, within the NSW Department of Climate Change, Energy, the Environment and Water (NSW DCCEEW) is dedicated to harnessing the power of satellite earth observations, aerial imagery, in-situ data, and advanced modelling techniques to produce cutting-edge remote sensing information products. Our team employs a multi-faceted approach, integrating remote sensing data captured by satellites operating at various temporal and spatial scales with on-the-ground observations and key spatial datasets, including land-use mapping, weather data, and ancillary verification datasets. This synthesis of diverse information sources enables us to derive critical insights that significantly contribute to water resource planning, policy formulation, and advancements in scientific research.

Drawing upon satellite imagery from reputable sources such as NASA, the European Space Agency, and commercial providers like Planet and SPOT, our team places a special emphasis on leveraging Landsat and Sentinel satellite imagery. Renowned for their archived, calibrated, and consistent datasets, these sources provide a significant advantage in our pursuit of delivering accurate and reliable information. To ensure the robustness of our information products, we implement thorough validation processes, incorporating semi-automation techniques that facilitate rapid turnaround times.

Our operational efficiency is further enhanced through strategic interventions in our workflows, including the automation of processes through efficient computing scripts and the utilization of Google Earth Engine for cloud computing. This integrated approach allows us to maintain high standards of data quality while meeting the increasing demand for timely and accurate information.

Our commitment to providing high-quality, professional, and technically accurate Remote Sensing - Geographic Information System (RS-GIS) data packages, maps, and information is underscored by our recognition of the growing role of technology in information transfer and the promotion of information sharing. Moreover, our dedication to ensuring the currency of RS-GIS methods, interpretation techniques, and 3D modelling enables us to continually deliver innovative products that align with evolving client expectations. Through these efforts, our team strives to contribute meaningfully to the advancement of remote sensing applications for improved environmental understanding and informed decision-making.

Note: If you would like to ask a question, make any suggestions, or tell us how you are using this dataset, please visit the NSW Water Hub which has an online forum you can join.
Continuous PM2.5 and tVOC sensor data paired with coincidental RH and temp measurements. This dataset is associated with the following publication: Clements, A., S. Reece, T. Conner, and R. Williams. Observed Data Quality Concerns Involving Low-Cost Air Sensors. Atmospheric Environment: X. Elsevier B.V., Amsterdam, NETHERLANDS, 3: 100034, (2019).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes R code, specifically the package WaterML, to download water quality data from iUTAH GAMUT station sensors installed to look at water quality/quantity along three montane-to-urban watersheds: Logan River, Red Butte Creek, and Provo River. An explanation of the GAMUT sensor network can be found at gamut.iutahepscor.org. The code requires installation of packages 'plyr' and 'WaterML'. Instructions for modifying code to extract sensor data for your timepoint of interest are included in the README file. The code has the option to write sensor data to .csv files in your working directory.
Additional code available at https://github.com/erinfjones/GAMUTdownload
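A rough sketch of the retrieval pattern this code implements, based on the WaterML package's WaterOneFlow client functions; the server URL, site code, and variable code below are placeholders, and the actual GAMUT values should be taken from the README.

library(WaterML)

server <- "http://example-gamut-server/cuahsi_1_1.asmx?WSDL"   # placeholder WaterOneFlow endpoint
sites     <- GetSites(server)                                  # list available stations
variables <- GetVariables(server)                              # list available variables

values <- GetValues(server,
                    siteCode     = "network:SITE_CODE",        # placeholder site code
                    variableCode = "network:VARIABLE_CODE",    # placeholder variable code
                    startDate    = "2017-06-01",
                    endDate      = "2017-06-30")

write.csv(values, "sensor_data.csv", row.names = FALSE)        # optional: write to working directory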
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a value-added product based on 'Up-to-date air quality station measurements', administered by the European Environmental Agency (EEA) and collected by its member states. The original hourly measurement data (NO2, SO2, O3, PM10, PM2.5 in µg/m³) was reshaped, gapfilled and aggregated to different temporal resolutions, making it ready to use in time series analysis or spatial interpolation tasks.
Reproducible code for accessing and processing this data and notebooks for demonstration can be found on Github.
Hourly data was retrieved through the API of the EEA Air Quality Download Service. Measurements (single files per station and pollutant) were joined to create a single time series per station with observations for multiple pollutants. As PM2.5 data is sparse but correlates well with PM10, gapfilling was performed according to methods described in Horálek et al., 2023¹. Validity and verification flags from the original data were passed on for quality filtering. Reproducible computational notebooks using the R programming language are available for the data access and the gapfilling procedure.
Data was aggregated to three coarser temporal resolutions: day, month, and year. Coverage (the ratio of non-missing values) was calculated for each pollutant and temporal increment, and a threshold of 75% was applied to generate reliable aggregates. All pollutants were aggregated by their arithmetic mean. Additionally, two pollutants were aggregated using a percentile method, which has been shown to be more appropriate for mapping applications. PM10 was summarized using the 90.41th percentile. Daily O3 was further summarized as the maximum of the 8-hour running mean. Based thereon, monthly and annual O3 was aggregated using the 93.15th percentile of the daily maxima. For more details refer to the reproducible computational notebook on temporal aggregation.
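The following R sketch illustrates the daily aggregation rule described above (arithmetic mean with a 75% coverage threshold) for a single pollutant; it is a simplified illustration rather than the project's notebook code, and assumes an hourly table `hourly` with Start and NO2 columns.

library(dplyr)

daily <- hourly %>%
  mutate(date = as.Date(Start)) %>%
  group_by(Air.Quality.Station.EoI.Code, date) %>%
  summarise(
    cov.day_NO2 = mean(!is.na(NO2)),                  # coverage = share of non-missing hours
    NO2_mean    = mean(NO2, na.rm = TRUE),
    .groups     = "drop"
  ) %>%
  mutate(NO2_mean = ifelse(cov.day_NO2 >= 0.75, NO2_mean, NA))  # apply the 75% threshold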
| column | hourly | daily | monthly | annual | description |
|---|---|---|---|---|---|
| Air.Quality.Station.EoI.Code | x | x | x | x | Unique station ID |
| Countrycode | x | x | x | x | Two-letter ISO country code |
| Start | x |  |  |  | Start time of (hourly) measurement period |
| (pollutant name) | x | x | x | x | One of NO2; SO2; O3; O3_max8h_93.15; PM10; PM10_90.41; PM2.5 in µg/m³ |
| Validity_ | x |  |  |  | Validity flag of the respective pollutant |
| Verification_ | x |  |  |  | Verification flag of the respective pollutant |
| filled_PM2.5 | x |  |  |  | Flag indicating if PM2.5 value is measured or supplemented through gapfilling (boolean) |
| year |  | x | x | x | Year (2015-2023) |
| cov.year_ |  |  | x | x | Data coverage throughout the year (0-1) |
| month |  | x | x |  | Month (1-12) |
| cov.month_ |  | x | x |  | Data coverage throughout the month (0-1) |
| doy |  | x |  |  | Day of year (0-366) |
| cov.day_ |  | x |  |  | Data coverage throughout the day (0-1) |
To avoid redundant information and optimize file size, some relevant metadata is not stored in the air quality data tables, but rather separately (in a file named "EEA_stations_meta_table.parquet"). This includes the type and area of the measurement stations, as well as their coordinates.
| column | description |
|---|---|
| Air.Quality.Station.EoI.Code | Unique station ID (required for join) |
| Countrycode | Two-letter ISO country code |
| Station.Type | One of "background", "industrial", or "traffic" |
| Station.Area | One of "urban", "suburban", "rural", "rural-nearcity", "rural-regional", "rural-remote" |
| Longitude & Latitude | Geographic coordinates of the station |
This dataset is shipped as Parquet files. Hourly and aggregated data are distributed as four individual datasets. Daily and hourly data are partitioned by `Countrycode` (one file per country) to enable reading smaller subsets. Monthly and annual data files are small (< 20 MB) and stored in a single file each. Parquet is a relatively new and very memory-efficient format that differs from traditional tabular file formats (e.g. CSV) in the sense that it is binary and cannot be opened and displayed by common tabular software (e.g. MS Excel, Libre Office, etc.). Users rather have to use an Apache Arrow implementation, for example in Python, R, C++, or another scripting language. Reading the data there is straightforward (see the code samples below).
R code:

# required libraries
library(arrow)
library(dplyr)

# read air quality and meta data
aq = read_parquet("airquality.no2.o3.so2.pm10.pm2p5_4.annual_pnt_20150101_20231231_eu_epsg.3035_v20240718.parquet")
meta = read_parquet("EEA_stations_meta_table.parquet")

# join the two for further analysis
aq_meta = inner_join(aq, meta, by = join_by(Air.Quality.Station.EoI.Code))

Python code:

# required libraries
import pandas as pd

# read air quality and meta data
aq = pd.read_parquet("airquality.no2.o3.so2.pm10.pm2p5_4.annual_pnt_20150101_20231231_eu_epsg.3035_v20240718.parquet")
meta = pd.read_parquet("EEA_stations_meta_table.parquet")

# join the two for further analysis
aq_meta = aq.merge(meta, on=["Air.Quality.Station.EoI.Code", "Countrycode"])
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide here the observed monthly total streamflow (q) and precipitation (p) records at 161 unregulated catchments in Victoria, Australia. All observations are in units of millimeters per month. The number of missing days of streamflow per month is also provided as a measure of data quality (see “gap days”).
This data is sourced from Peterson et al. (2021), wherein the data is provided within the R package, HydroState. Its original use was for investigating non-recovery of streamflow after droughts and catchment resilience.
The precipitation data is the area-weighted catchment average depth and was derived using the R package, AWAPer (see Peterson et al., 2019). The compilation of the streamflow data was led by Dr Margarita Saft and the observed data was sourced from https://data.water.vic.gov.au/. For additional details of the data see Peterson et al. (2021) – Supplemental file.
Also provided here are the following GIS shapefiles of the study catchments: boundaries, stream gauge location, state boundary, major lakes and watercourse of Victoria. The spatial data files were derived from https://www.data.vic.gov.au/. The units of area and elevation therein are square meter and meter, respectively.
The files here were used in Goswami et al. (2022) and were uploaded to Zenodo to comply with the journal data availability requirements.
Update: Temperature and PET datasets for the 161 catchments were added to support analysis carried out in a follow-up paper, Goswami et al. (2023).
https://www.datainsightsmarket.com/privacy-policy
The global data catalog market is experiencing steady growth, driven by the increasing volume and complexity of enterprise data. As organizations face the challenge of managing multiple data sources and ensuring data quality and governance, the adoption of data catalogs has become increasingly important. According to market research, the total value of the market in 2025 was approximately $2.61 billion, with a projected CAGR of 2.50% from 2025 to 2033. This growth is primarily attributed to the growing need for data-driven decision-making and the proliferation of big data and artificial intelligence (AI) technologies.

Key industry trends indicate a growing emphasis on cloud-based data catalog solutions, as well as the integration of AI and machine learning (ML) capabilities. These technologies enhance the automation and efficiency of data cataloging processes, while providing advanced features such as data lineage tracking and data quality monitoring. Furthermore, the convergence of data catalog solutions with other enterprise applications, such as data governance and data analytics platforms, creates opportunities for comprehensive data management and improved data utilization. The competitive landscape is characterized by a mix of established vendors and emerging players, with companies such as Tamr Inc, Collibra NV, TIBCO Software Inc, and IBM Corporation holding significant market share. Ongoing innovations and strategic acquisitions are shaping the market dynamics, as vendors strive to differentiate their offerings and meet evolving customer requirements. The global data catalog market size was valued at USD 2.0 billion in 2022 and is expected to expand at a compound annual growth rate (CAGR) of 24.3% from 2023 to 2030, reaching USD 12.0 billion by 2030.

Recent developments include:
- November 2022 - Amazon EMR customers can now use AWS Glue Data Catalog from their streaming and batch SQL workflows on Flink. The AWS Glue Data Catalog is an Apache Hive metastore-compatible catalog. With this release, companies can directly run Flink SQL queries against the tables stored in the Data Catalog.
- September 2022 - Syniti, a global leader in enterprise data management, updated new data quality and catalog capabilities available in its industry-leading Syniti Knowledge Platform, building on the enhancements in data migration and data matching added earlier this year. The Syniti Knowledge Platform now includes data quality, catalog, matching, replication, migration, and governance, all available under one login in a single cloud solution. It provides users with a complete and unified data management platform enabling them to deliver faster and better business outcomes with data they can trust.
- August 2022 - Oracle Cloud Infrastructure collaborated with Anaconda, the world's most recognized data science platform provider. By permitting and integrating the latter company's repository throughout OCI Machine Learning and Artificial Intelligence services, the collaboration aimed to give safe, open-source Python and R tools and packages.

Key drivers for this market are: Growing adoption of Cloud Based Solutions, Solutions Segment is Expected to Hold a Larger Market Size. Potential restraints include: Lack of Standardization and Security Concerns. Notable trends are: Solutions Segment is Expected to Hold a Larger Market Size.
This resource contains a set of Jupyter Notebooks that provide Python code examples for using the Python dataretrieval package for retrieving data from the United States Geological Survey's (USGS) National Water Information System (NWIS). The dataretrieval package is a Python alternative to USGS-R's dataRetrieval package for the R Statistical Computing Environment, used for obtaining USGS or Environmental Protection Agency (EPA) water quality data, streamflow data, and metadata directly from web services. The dataretrieval Python package is an alternative to the R package, not a port, in that it reproduces the functionality of the R package but its organization and functionality differ to some degree. The dataretrieval package was originally created by Timothy Hodson at USGS. Additional contributions to the Python package and these Jupyter Notebook examples were created at Utah State University under funding from the National Science Foundation. A link to the GitHub source code repository for the dataretrieval package is provided in the related resources section below.
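For context, the R dataRetrieval workflow that the Python package mirrors looks roughly like the sketch below (daily-value retrieval from NWIS); the site number, parameter code, and dates are arbitrary examples.

library(dataRetrieval)

# Daily discharge (USGS parameter code 00060) for one gage over one water year
site_no <- "01646500"                        # example gage: Potomac River near Washington, DC
q_daily <- readNWISdv(siteNumbers = site_no,
                      parameterCd = "00060",
                      startDate   = "2020-10-01",
                      endDate     = "2021-09-30")

site_info <- readNWISsite(site_no)           # site metadata
head(q_daily)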
https://www.marketreportanalytics.com/privacy-policy
The Data Catalog Market, valued at $2.61 billion in 2025, is projected to experience steady growth, driven by the escalating need for data governance, improved data quality, and the rising adoption of cloud-based data solutions. The Compound Annual Growth Rate (CAGR) of 2.50% over the forecast period (2025-2033) indicates a consistent, albeit moderate, expansion. This growth is fueled by several key factors. Organizations are increasingly recognizing the strategic value of their data assets and are investing heavily in tools and technologies that enhance data discoverability, accessibility, and usability. The increasing complexity of data landscapes, with data residing across diverse sources and formats, further necessitates the implementation of robust data cataloging solutions. The market's growth is also being propelled by the growing adoption of big data analytics, machine learning, and artificial intelligence, all of which rely heavily on the efficient management and organization of data. Furthermore, stringent data privacy regulations such as GDPR and CCPA are driving demand for solutions that ensure data compliance and traceability. Leading players like IBM, Microsoft, and Informatica are actively shaping the market landscape through continuous innovation, strategic partnerships, and acquisitions.

While the market enjoys consistent growth, challenges remain. The high initial investment costs associated with implementing and maintaining data cataloging solutions can pose a barrier for smaller organizations. Furthermore, ensuring data quality and consistency across diverse data sources remains a significant hurdle. Despite these challenges, the long-term outlook for the data catalog market remains positive, driven by the ongoing digital transformation initiatives undertaken by businesses worldwide and the growing realization of the strategic imperative to effectively manage and leverage data assets. The market is expected to reach approximately $3.3 billion by 2033.

Recent developments include:
- November 2022 - Amazon EMR customers can now use AWS Glue Data Catalog from their streaming and batch SQL workflows on Flink. The AWS Glue Data Catalog is an Apache Hive metastore-compatible catalog. With this release, companies can directly run Flink SQL queries against the tables stored in the Data Catalog.
- September 2022 - Syniti, a global leader in enterprise data management, updated new data quality and catalog capabilities available in its industry-leading Syniti Knowledge Platform, building on the enhancements in data migration and data matching added earlier this year. The Syniti Knowledge Platform now includes data quality, catalog, matching, replication, migration, and governance, all available under one login in a single cloud solution. It provides users with a complete and unified data management platform enabling them to deliver faster and better business outcomes with data they can trust.
- August 2022 - Oracle Cloud Infrastructure collaborated with Anaconda, the world's most recognized data science platform provider. By permitting and integrating the latter company's repository throughout OCI Machine Learning and Artificial Intelligence services, the collaboration aimed to give safe, open-source Python and R tools and packages.

Key drivers for this market are: Growing adoption of Cloud Based Solutions, Solutions Segment is Expected to Hold a Larger Market Size. Potential restraints include: Growing adoption of Cloud Based Solutions, Solutions Segment is Expected to Hold a Larger Market Size. Notable trends are: Solutions Segment is Expected to Hold a Larger Market Size.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides code, data, and instructions for replicating the analysis of Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression, published in OpenSym 2021 (link to come). The paper introduces a method for transforming scores from the ORES quality models into a single-dimensional measure of quality amenable to statistical analysis that is well-calibrated to a dataset. The purpose is to improve the validity of research into article quality through more precise measurement. The code and data for replicating the paper are found in this dataverse repository. If you wish to use the method on a new dataset, you should obtain the actively maintained version of the code from this git repository. If you attempt to replicate part of this repository please let me know via an email to nathante@uw.edu.

Replicating the Analysis from the OpenSym Paper
This project analyzes a sample of articles with quality labels from the English Wikipedia XML dumps from March 2020. Copies of the dumps are not provided in this dataset; they can be obtained via https://dumps.wikimedia.org/. Everything else you need to replicate the project (other than a sufficiently powerful computer) should be available here. The project is organized into stages, and the prerequisite data files are provided at each stage so you do not need to rerun the entire pipeline from the beginning, which is not easily done without a high-performance computer. If you start replicating at an intermediate stage, this should overwrite the inputs to the downstream stages, which should make it easier to verify a partial replication. To help manage the size of the dataverse, all code files are included in code.tar.gz. Extracting this with tar xzvf code.tar.gz is the first step.

Getting Set Up
You need a version of R >= 4.0 and a version of Python >= 3.7.8. You also need a bash shell, tar, gzip, and make installed as they should be installed on any Unix system. To install brms you need a working C++ compiler; if you run into trouble see the instructions for installing Rstan. The datasets were built on CentOS 7, except for the ORES scoring which was done on Ubuntu 18.04.5 and the building which was done on Debian 9. The RemembR and pyRembr projects provide simple tools for saving intermediate variables for building papers with LaTeX. First, extract the articlequality.tar.gz, RemembR.tar.gz and pyRembr.tar.gz archives. Then, install the following:

Python Packages
Running the following steps in a new Python virtual environment is strongly recommended. Run pip3 install -r requirements.txt to install the Python dependencies. Then navigate into the pyRembr directory and run python3 setup.py install.

R Packages
Run Rscript install_requirements.R to install the necessary R libraries. If you run into trouble installing brms, see the instructions for installing Rstan above.

Drawing a Sample of Labeled Articles
I provide steps and intermediate data files for replicating the sampling of labeled articles. The steps in this section are quite computationally intensive; those only interested in replicating the models and analyses should skip this section.

Extracting Metadata from Wikipedia Dumps
Metadata from the Wikipedia dumps is required for calibrating models to the revision and article levels of analysis. You can use the wikiq Python script from the mediawiki dump tools git repository to extract metadata from the XML dumps as TSV files. The version of wikiq that was used is provided here. Running Wikiq on a full dump of English Wikipedia in a reasonable amount of time requires considerable computing resources. For this project, Wikiq was run on Hyak, a high-performance computer at the University of Washington. The code for doing so is highly specific to Hyak; for transparency, and in case it helps others using similar academic computers, this code is included in WikiqRunning.tar.gz. A copy of the wikiq output is included in this dataset in the multi-part archive enwiki202003-wikiq.tar.gz. To extract this archive, download all the parts and then run cat enwiki202003-wikiq.tar.gz* > enwiki202003-wikiq.tar.gz && tar xvzf enwiki202003-wikiq.tar.gz.

Obtaining Quality Labels for Articles
We obtain up-to-date labels for each article using the articlequality python package included in articlequality.tar.gz. The XML dumps are also the input to this step, and while it does not require a great deal of memory, a powerful computer (we used 28 cores) is helpful so that it completes in a reasonable amount of time. extract_quality_labels.sh runs the command to extract the labels from the xml dumps. The resulting files have the format data/enwiki-20200301-pages-meta-history*.xml-p*.7z_article_labelings.json and are included in this dataset in the archive enwiki202003-article_labelings-json.tar.gz.

Taking a Sample of Quality Labels
I used Apache Spark to merge the metadata from Wikiq with the quality labels and to draw a sample of articles where each quality class is equally represented. To...
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset Description
This dataset and accompanying R scripts support the intensive carbon dynamics observation platform conducted in tropical alpine peatlands of Guatavita, Colombia. The data include half-hourly and cumulative greenhouse-gas fluxes (CO₂, CH₄, N₂O), dissolved organic carbon (DOC) transport, and related hydrological and meteorological measurements, together with model outputs and analysis scripts. All analyses were performed in R (version ≥ 4.2). The repository is organized into two main components: (1) the chamber and Bayesian analysis pipeline (root folder) and (2) the tower flux gap-filling and uncertainty analysis (folder golden/).

1. Chamber and Bayesian Workflow
This section integrates chamber measurements, water-table data, and modeled fluxes for both conserved and degraded peatland plots. The scripts allow data preparation, prediction of half-hourly fluxes, Bayesian partitioning of net ecosystem exchange (NEE) into gross primary production (GPP) and ecosystem respiration (ER), and generation of publication-quality figures.

Main steps:
- Data preparation – cleaning and merging chamber and tower data (flux_chamber3.r, flux_wt_guatavita_jc.r, waterlevel.r).
- Prediction dataset construction – builds model input datasets (flux predict.R, flux predict2.R).
- Bayesian flux partitioning – separates NEE into GPP and ER using hierarchical Bayesian models (bayesian models.r, bayesianflux.r). This step must be run separately for each station (ST1 and ST2) by modifying the station code inside the scripts.
- Trace gas analyses – quantifies N₂O and DOC fluxes (N2Oflux.r, DOC_flux.r).
- Visualization and summaries – produces the cumulative and seasonal flux figures and summary tables (final plot.r).

Primary outputs:
- Modelled CO₂ and CH₄ fluxes (*_Model_EC_long.csv, _pred_30min_.csv)
- Seasonal and cumulative carbon balance summaries (Final_Cumulative_CO2_CH4_CO2eq_2023_2024_bySeason_Method_Station.csv, Summary_CO2_CH4_CO2eq_byMethod_Station_Season_Year.csv)
- Mean and confidence-interval tables for each gas (PerGas_CO2_CH4_with_CO2eq_Mg_ha_mean95CI.csv, Totals_CO2eq_across_gases_Mg_ha_mean95CI.csv)
- Publication figures (figure.png, figure_transparent.png, figure.svg)

2. Tower Flux (Eddy-Covariance) Workflow
The folder golden/ contains the workflow used for tower-based fluxes, including gap-filling, uncertainty analysis, and manuscript-quality visualization. These scripts use the REddyProc R package and standard meteorological variables.

Scripts:
- REddyProc_Guatavita_Station1_Gold.R – gap-filling for Station 1
- REddyProc_Guatavita_Station2_Gold.R – gap-filling for Station 2
- Guatavita_gapfilling_uncertainty.R – quantifies gap-filling uncertainty
- Guatavita_plot_manuscript.R – generates final tower flux figures

Each station's eddy-covariance data were processed independently following standard u-star filtering and uncertainty propagation routines.

Data Files
Input data include chamber fluxes (co2flux.csv, ch4flux.csv, db_gutavita_N2O_all.csv), water-table and hydrological measurements (WaterTable.csv, wtd_martos_21_25.csv), DOC transport (DOC transport.csv), and auxiliary meteorological variables (tower_var.csv). Intermediate model results are stored in .rds files, and cumulative or seasonal summaries are provided in .csv and .xlsx formats.

Reproducibility Notes
All scripts assume relative paths from the project root. To reproduce the complete analyses:
1. Install required R packages (tidyverse, ggplot2, rjags, coda, REddyProc, among others).
2. Run the chamber workflow in the order listed above.
3. Repeat the Bayesian modeling step for both stations.
4. Execute the tower scripts in the golden/ folder for gap-filling and visualization.

Large intermediate .rds files are retained for reproducibility and should not be deleted unless re-running the models from scratch.

Citation and Contact
Principal Investigator: Juan C. Benavides, Pontificia Universidad Javeriana, Bogotá, Colombia
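As a rough illustration of the REddyProc gap-filling step used in the golden/ scripts (not the project's actual code), a minimal sketch might look like the following; the input file name, column names, and location values are placeholders.

library(REddyProc)

# Hypothetical half-hourly input with a DateTime column plus NEE, Rg, Tair, VPD, Ustar
tower <- read.csv("station1_halfhourly.csv")
tower$DateTime <- as.POSIXct(tower$DateTime, tz = "Etc/GMT+5")   # assumed timestamp format/zone

EProc <- sEddyProc$new("GuatavitaST1", tower, c("NEE", "Rg", "Tair", "VPD", "Ustar"))
EProc$sSetLocationInfo(LatDeg = 4.9, LongDeg = -73.8, TimeZoneHour = -5)  # placeholder coordinates
EProc$sMDSGapFill("NEE", FillAll = TRUE)    # marginal-distribution-sampling gap-filling
filled <- EProc$sExportResults()            # gap-filled columns, e.g. NEE_f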
This dataset package is focused on U.S. construction materials and three construction companies: Cemex, Martin Marietta, and Vulcan.
In this package, SpaceKnow tracks manufacturing and processing facilities for construction material products all over the US. By tracking these facilities, we are able to give you near-real-time data on spending on these materials, which helps to predict residential and commercial real estate construction and spending in the US.
The dataset includes 40 indices focused on asphalt, cement, concrete, and building materials in general. You can look forward to receiving country-level and regional data (activity in the North, East, West, and South of the country) and the aforementioned company data.
SpaceKnow uses satellite (SAR) data to capture activity at building material manufacturing and processing facilities in the US.
Data is updated daily, has an average lag of 4-6 days, and history back to 2017.
The insights provide you with level and change data for refineries, storage, manufacturing, logistics, and employee parking-based locations.
SpaceKnow offers 3 delivery options: CSV, API, and Insights Dashboard
Available Indices
Companies:
- Cemex (CX): Construction Materials (covers all manufacturing facilities of the company in the US), Concrete, Cement (refinery and storage) indices, and aggregates
- Martin Marietta (MLM): Construction Materials (covers all manufacturing facilities of the company in the US), Concrete, Cement (refinery and storage) indices, and aggregates
- Vulcan (VMC): Construction Materials (covers all manufacturing facilities of the company in the US), Concrete, Cement (refinery and storage) indices, and aggregates
USA Indices:
- Aggregates USA
- Asphalt USA
- Cement USA
- Cement Refinery USA
- Cement Storage USA
- Concrete USA
- Construction Materials USA
- Construction Mining USA
- Construction Parking Lots USA
- Construction Materials Transfer Hub US
- Cement - Midwest, Northeast, South, West
- Cement Refinery - Midwest, Northeast, South, West
- Cement Storage - Midwest, Northeast, South, West
Why get SpaceKnow's U.S Construction Materials Package?
Monitor Construction Market Trends: Near-real-time insights into the construction industry allow clients to understand and anticipate market trends better.
Track Company Performance: Monitor operational activities, such as the volume of sales.
Assess Risk: Use satellite activity data to assess the risks associated with investing in the construction industry.
Index Methodology Summary
Continuous Feed Index (CFI) is a daily aggregation of the area of metallic objects in square meters. There are two types of CFI indices: the CFI-R index gives the data in levels, showing how many square meters are covered by metallic objects (for example, employee cars at a facility); the CFI-S index gives the change in the data, showing how many square meters have changed within the locations between two consecutive satellite images.

How to interpret the data
SpaceKnow indices can be compared with related economic indicators or KPIs. If the economic indicator is in monthly terms, perform a 30-day rolling sum and pick the last day of the month to compare with the economic indicator; each data point will then reflect approximately the sum of the month. If the economic indicator is in quarterly terms, perform a 90-day rolling sum and pick the last day of the 90-day window to compare with the economic indicator; each data point will then reflect approximately the sum of the quarter.
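A minimal sketch of the 30-day rolling-sum comparison described above, assuming a daily index delivered as a data frame `cfi` with `date` and `value` columns (placeholder names, not SpaceKnow's delivery schema):

library(zoo)

cfi <- cfi[order(cfi$date), ]
cfi$roll30 <- zoo::rollsumr(cfi$value, k = 30, fill = NA)   # 30-day rolling sum, right-aligned

# Keep the last available day of each month to compare with a monthly economic indicator
month_id <- format(cfi$date, "%Y-%m")
monthly_proxy <- cfi[!duplicated(month_id, fromLast = TRUE), c("date", "roll30")]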
Where the data comes from
SpaceKnow brings you the data edge by applying machine learning and AI algorithms to synthetic aperture radar and optical satellite imagery. The company’s infrastructure searches and downloads new imagery every day, and the computations of the data take place within less than 24 hours.
In contrast to traditional economic data, which are released in monthly and quarterly terms, SpaceKnow data is high-frequency and available daily. It is possible to observe the latest movements in the construction industry with just a 4-6 day lag, on average.
The construction materials data help you to estimate the performance of the construction sector and the business activity of the selected companies.
The foundation of delivering high-quality data is based on the success of defining each location to observe and extract the data. All locations are thoroughly researched and validated by an in-house team of annotators and data analysts.
See below how our Construction Materials index performs against the US Non-residential construction spending benchmark
Each individual location is precisely defined to avoid noise in the data, which may arise from traffic or changing vegetation due to seasonal reasons.
SpaceKnow uses radar imagery and its own unique algorithms, so the indices do not lose their significance in bad weather conditions such as rain or heavy clouds.
→ Reach out to get a free trial
...
https://creativecommons.org/publicdomain/zero/1.0/
By Reddit [source]
This dataset explores the media content on Reddit and how it is received by its community, providing detailed insights into both the popularity and quality of subreddit videos. Here you will find data about videos posted on Reddit, compiled from various metrics such as their upvotes, number of comments, date and time posted, body text and more. With this data you can dive deeper into the types of videos being shared and the topics being discussed – gaining a better understanding of what resonates with the Reddit community. This information allows us to gain insight into what kind of content has potential to reach a wide audience on Reddit; it also reveals which types of videos have been enjoying popularity amongst users over time. These insights can help researchers uncover valuable findings about media trends on popular social media sites such as Reddit – so don't hesitate to explore!
How To Use This Dataset
This dataset is a great resource for analyzing the content and popularity of videos posted on Reddit. It provides various metrics such as score, url, comment count and creation date that let you compare the different types of content being shared on the subredditvideos subreddit.
To get started, take a look at the title field for each post. This gives you an idea of what type of video is being shared, which can be helpful in understanding what topics are popular on the platform.
Next, use the score field to identify posts that have done well in terms of receiving upvotes from users. The higher its score, the more popular a post has been with viewers. A higher score does not necessarily indicate higher quality, however; take a closer look at each post's body field to get an idea of its content quality before making assumptions about its value based solely on its high score. That said, top-scoring posts could be considered further when researching popular topics or trends in media consumption behavior across Reddit's userbase (e.g., trending topics among young adults).

The url field provides links to directly access the videos, so you can review them yourself before sharing them or forwarding them to friends or colleagues for their feedback (something that could be done further depending on how detailed your research project requires). The comms_num column records how many comments each video has received, which may give insight into how engaged viewers have been with the stories submitted by this subreddit's members – useful information if interactions and conversations surrounding particular types of content are part of your research objective too. Finally, make sure to check out the timestamp column, as this records when each story was created – important information whenever attempting to draw conclusive insights from time-oriented data points (a time series analysis would serve very handy here).

Together, these fields give researchers an easily accessible way to explore the popularity and quality of the media shared on Reddit – and to uncover potentially useful insights specifically about the video stories found within this subreddit.
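A short base-R sketch of the kind of exploration described above, assuming videos.csv contains the fields mentioned in this description (title, score, comms_num, timestamp); the exact column names in the file may differ.

videos <- read.csv("videos.csv", stringsAsFactors = FALSE)

# Most popular posts by upvote score
head(videos[order(-videos$score), c("title", "score", "comms_num")], 10)

# Relationship between popularity and discussion volume
cor(videos$score, videos$comms_num, use = "complete.obs")

# Posting activity over time (assumes the timestamp column parses as a date-time string)
videos$posted <- as.POSIXct(videos$timestamp)
table(format(videos$posted, "%Y-%m"))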
- Identifying and tracking trends in the popularity of different genres of videos posted on Reddit, such as interviews, music videos, or educational content.
- Investigating audience engagement with certain types of content to determine the types of posts that resonate most with users on Reddit.
- Examining correlations between video score or comment count and specific video characteristics such as length, topic or visual style
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: videos.csv
| Column name | Description |
|:--------------|:--------------------------------------------------------------------------|
| title | The title of ... |
Trends in nutrient fluxes and streamflow for selected tributaries in the Lake Erie watershed were calculated using monitoring data at 10 locations. Trends in flow-normalized nutrient fluxes were determined by applying a weighted regression approach called WRTDS (Weighted Regression on Time, Discharge, and Season). Site information and streamflow and water-quality records are contained in 3 zipped files named as follows: INFO (site information), Daily (daily streamflow records), and Sample (water-quality records). The INFO, Daily (flow), and Sample files contain the input data, by water-quality parameter and by site as .csv files, used to run trend analyses. These files were generated by the R (version 3.1.2) software package called EGRET - Exploration and Graphics for River Trends (version 2.5.1) (Hirsch and De Cicco, 2015), and can be used directly as input to run graphical procedures and WRTDS trend analyses using the EGRET R software. The .csv files are identified according to water-quality parameter (TP, SRP, TN, NO23, and TKN) and site reference number (e.g. TPfiles.1.INFO.csv, SRPfiles.1.INFO.csv, TPfiles.2.INFO.csv, etc.). Water-quality parameter abbreviations and site reference numbers are defined in the file "Site-summary_table.csv" on the landing page, where there is also a site-location map ("Site_map.pdf"). Parameter information details, including abbreviation definitions, appear in the abstract on the landing page. SRP data records were available at only 6 of the 10 trend sites, which are identified in the file "site-summary_table.csv" (see landing page) as monitored by the organization NCWQR (National Center for Water Quality Research). The SRP sites are: RAIS, MAUW, SAND, HONE, ROCK, and CUYA.

The model-input dataset is presented in 3 parts:
1. INFO.zip (site information)
2. Daily.zip (daily streamflow records)
3. Sample.zip (water-quality records)

Reference: Hirsch, R.M., and De Cicco, L.A., 2015 (revised). User Guide to Exploration and Graphics for RivEr Trends (EGRET) and dataRetrieval: R Packages for Hydrologic Data, Version 2.0, U.S. Geological Survey Techniques and Methods, 4-A10. U.S. Geological Survey, Reston, VA, 93 p. (at: http://dx.doi.org/10.3133/tm4A10).
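A rough sketch of how these input files can be loaded into EGRET to rerun a WRTDS analysis; the Daily and Sample file names below follow the INFO naming pattern described above but are assumptions, and the path is a placeholder.

library(EGRET)

path   <- "./"                                         # folder holding the unzipped INFO, Daily, and Sample files
INFO   <- readUserInfo(path, "TPfiles.1.INFO.csv")
Daily  <- readUserDaily(path, "TPfiles.1.Daily.csv")   # assumed file name pattern
Sample <- readUserSample(path, "TPfiles.1.Sample.csv") # assumed file name pattern

eList <- mergeReport(INFO, Daily, Sample)
eList <- modelEstimation(eList)                        # fit the WRTDS model
plotConcHist(eList)                                    # flow-normalized concentration history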