CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This paper demonstrates the flexibility of a general approach for the analysis of discrete time competing risks data that can accommodate complex data structures, different time scales for different causes, and nonstandard sampling schemes. The data may involve a single data source where all individuals contribute to analyses of both cause-specific hazard functions, overlapping datasets where some individuals contribute to the analysis of the cause-specific hazard function of only one cause while other individuals contribute to analyses of both cause-specific hazard functions, or separate data sources where each individual contributes to the analysis of the cause-specific hazard function of only a single cause. The approach is modularized into estimation and prediction. For the estimation step, the parameters and the variance-covariance matrix can be estimated using widely available software. The prediction step utilizes a generic program with plug-in estimates from the estimation step. The approach is illustrated with three prognostic models for stage IV male oral cancer using different data structures. The first model uses only men with stage IV oral cancer from population-based registry data. The second model strategically extends the cohort to improve the efficiency of the estimates. The third model improves the accuracy for those with a lower risk of other causes of death, by bringing in an independent data source collected under a complex sampling design with additional other-cause covariates. These analyses represent novel extensions of existing methodology, broadly applicable for the development of prognostic models capturing both the cancer and non-cancer aspects of a patient's health.
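The estimation/prediction split described above lends itself to a short illustration. The sketch below is not the paper's program; it assumes a hypothetical person-period data layout, uses two logistic regressions (statsmodels) as the "widely available software" for the cause-specific discrete-time hazards, and then computes a plug-in cumulative incidence for the prediction step.

```python
# Minimal sketch (not the authors' program): two discrete-time cause-specific
# hazards fit by logistic regression on an assumed person-period layout, then a
# plug-in cumulative incidence prediction.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical person-period data: one row per subject per discrete interval,
# with event1 (cancer death) and event2 (other-cause death) indicators.
pp = pd.DataFrame({
    "time":   [1, 2, 3, 1, 2, 1, 1, 2, 3],
    "age":    [60, 60, 60, 72, 72, 55, 68, 68, 68],
    "event1": [0, 0, 1, 0, 1, 0, 0, 0, 0],
    "event2": [0, 0, 0, 0, 0, 1, 0, 0, 1],
})

def fit_hazard(df, event_col):
    """Logistic regression for one cause-specific discrete-time hazard."""
    X = sm.add_constant(df[["time", "age"]])
    return sm.GLM(df[event_col], X, family=sm.families.Binomial()).fit()

models = [fit_hazard(pp, "event1"), fit_hazard(pp, "event2")]

def cumulative_incidence(models, covariates, horizon):
    """Plug-in cumulative incidence for each cause over intervals 1..horizon."""
    cif = np.zeros(len(models))
    surv = 1.0
    for t in range(1, horizon + 1):
        row = pd.DataFrame([{"time": t, **covariates}])[["time", "age"]]
        X = sm.add_constant(row, has_constant="add")
        hazards = np.array([float(m.predict(X)[0]) for m in models])
        cif += surv * hazards          # probability of failing from each cause in interval t
        surv *= 1.0 - hazards.sum()    # probability of surviving interval t
    return cif

print(cumulative_incidence(models, {"age": 65}, horizon=3))
```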
The statistic shows the number of internal and external data sources used for decision-making in organizations worldwide as of 2018. Around 56 percent of respondents stated that their organization used fewer than five external data sources in its decision-making process as of 2018.
There has been a tremendous increase in the volume of Earth Science data over the last decade from modern satellites, in-situ sensors and different climate models. All these datasets need to be co-analyzed for finding interesting patterns or for searching for extremes or outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations. Moving these petabytes of data over the network to a single location may waste a lot of bandwidth, and can take days to finish. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the global data without moving all the data to one location. The algorithm is highly accurate (close to 99%) and requires centralizing less than 5% of the entire dataset. We demonstrate the performance of the algorithm using data obtained from the NASA MODerate-resolution Imaging Spectroradiometer (MODIS) satellite images.
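The abstract does not spell out the algorithm, so the toy sketch below only illustrates the general idea it alludes to: each site ships tiny summary statistics plus a handful of locally extreme candidate points to a coordinator, which then confirms global outliers without centralizing the bulk of the data. The z-score criterion and threshold are assumptions for illustration, not the paper's method.

```python
# Toy sketch (not the paper's algorithm): distributed outlier detection where each
# site sends only summary statistics and a small set of local candidates, so far
# less than the full data is centralized.
import numpy as np

rng = np.random.default_rng(0)
sites = [rng.normal(0, 1, 10_000) for _ in range(4)]      # data held at 4 locations
sites[2][:5] += 12                                         # inject a few global outliers

# Step 1: each site reports count, sum, and sum of squares (tiny messages).
stats = [(x.size, x.sum(), (x ** 2).sum()) for x in sites]
n = sum(s[0] for s in stats)
mean = sum(s[1] for s in stats) / n
std = np.sqrt(sum(s[2] for s in stats) / n - mean ** 2)

# Step 2: each site ships only points that are extreme relative to the global
# mean/std (a small fraction of its data).
threshold = 6.0
candidates = np.concatenate([x[np.abs(x - mean) > threshold * std] for x in sites])

# Step 3: the coordinator confirms global outliers among the candidates.
outliers = candidates[np.abs(candidates - mean) > threshold * std]
print(f"outliers found: {outliers.size}, "
      f"fraction of data centralized: {candidates.size / n:.4%}")
```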
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/FOUVEL
We introduce a method for scaling two data sets from different sources. The proposed method estimates a latent factor common to both datasets as well as an idiosyncratic factor unique to each. In addition, it offers a flexible modeling strategy that permits the scaled locations to be a function of covariates, and efficient implementation allows for inference through resampling. A simulation study shows that our proposed method improves over existing alternatives in capturing the variation common to both datasets, as well as the latent factors specific to each. We apply our proposed method to vote and speech data from the 112th U.S. Senate. We recover a shared subspace that aligns with a standard ideological dimension running from liberals to conservatives while recovering the words most associated with each senator's location. In addition, we estimate a word-specific subspace that ranges from national security to budget concerns, and a vote-specific subspace with Tea Party senators on one extreme and senior committee leaders on the other.
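The estimator in the paper is richer than this (covariate-dependent locations, resampling-based inference); the sketch below, on simulated data, only illustrates the underlying decomposition: a shared latent factor recovered from the stacked data and a source-specific factor from the residual. The SVD-based recovery and all names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' estimator): recover a latent dimension
# shared by two data sources plus a source-specific factor, using plain SVDs on the
# column-stacked and residual matrices.
import numpy as np

rng = np.random.default_rng(1)
n = 100                                      # e.g. legislators
shared = rng.normal(size=n)                  # common latent factor
votes = np.outer(shared, rng.normal(size=50)) + rng.normal(scale=0.5, size=(n, 50))
words = (np.outer(shared, rng.normal(size=80))
         + np.outer(rng.normal(size=n), rng.normal(size=80))   # idiosyncratic factor
         + rng.normal(scale=0.5, size=(n, 80)))

# Shared factor: leading left singular vector of the column-stacked, centered data.
stacked = np.hstack([votes - votes.mean(0), words - words.mean(0)])
u, s, vt = np.linalg.svd(stacked, full_matrices=False)
shared_hat = u[:, 0] * s[0]

# Word-specific factor: leading factor of the word residual after projecting out
# the shared dimension.
resid = (words - words.mean(0)) - np.outer(shared_hat, vt[0, votes.shape[1]:])
uw, sw, _ = np.linalg.svd(resid, full_matrices=False)
word_specific_hat = uw[:, 0] * sw[0]

print("correlation with true shared factor:",
      abs(np.corrcoef(shared_hat, shared)[0, 1]))
```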
https://dataintelo.com/privacy-and-policy
The global market size for Big Data and Data Engineering Services was valued at approximately USD 45.6 billion in 2023 and is expected to reach USD 136.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 13.2% during the forecast period. This robust growth is primarily driven by the increasing volume of data being generated across industries, advancements in data analytics technologies, and the rising importance of data-driven decision-making. Enterprises of all sizes are progressively leveraging big data solutions to gain strategic insights and maintain competitive advantage, thereby fueling market growth.
One of the pivotal growth factors for the Big Data and Data Engineering Services market is the exponential rise in data generation. With the advent of the Internet of Things (IoT), social media, and digital interactions, the volume of data generated daily is staggering. This data, if harnessed effectively, can provide invaluable insights into consumer behaviors, market trends, and operational efficiencies. Companies are increasingly investing in data engineering services to streamline and manage this data effectively. Additionally, the adoption of advanced analytics and machine learning techniques is enabling organizations to derive actionable insights, further driving the market's expansion.
Another significant growth driver is the technological advancements in data processing and analytics. The development of sophisticated data engineering tools and platforms has made it easier to collect, store, and analyze large datasets. Cloud computing has played a crucial role in this regard, offering scalable and cost-effective solutions for data management. The integration of artificial intelligence (AI) and machine learning (ML) in data analytics is enhancing the ability to predict trends and make informed decisions, thereby contributing to the market's growth. Furthermore, continuous innovations in data security and privacy measures are instilling confidence among businesses to invest in big data solutions.
The increasing emphasis on regulatory compliance and data governance is also propelling the market forward. Industries such as BFSI, healthcare, and government are subject to stringent regulatory requirements for data management and protection. Big Data and Data Engineering Services are essential in ensuring compliance with these regulations by maintaining data accuracy, integrity, and security. The implementation of data governance frameworks is becoming a top priority for organizations to mitigate risks associated with data breaches and ensure ethical data usage. This regulatory landscape is creating a conducive environment for the adoption of comprehensive data engineering services.
Regionally, North America dominates the Big Data and Data Engineering Services market, owing to the presence of major technology companies, high adoption of advanced analytics, and significant investments in R&D. However, the Asia Pacific region is expected to exhibit the highest growth rate due to rapid digital transformation, increasing internet penetration, and growing awareness about the benefits of data-driven decision-making among businesses. Europe also represents a significant market share, driven by the strong presence of industrial and technological sectors that rely heavily on data analytics.
Data Integration is a critical component of Big Data and Data Engineering Services, encompassing the process of combining data from different sources to provide a unified view. This service type is instrumental for organizations aiming to harness data from various departments, applications, and geographies. The increasing complexity of data landscapes, characterized by disparate data sources and formats, necessitates efficient data integration solutions. Companies are investing heavily in data integration technologies to consolidate their data, improve accessibility, and enhance the quality of insights derived from analytical processes. This segment's growth is further fueled by advancements in integration tools that support real-time data processing and seamless connectivity.
Data Quality services ensure the accuracy, completeness, and reliability of data, which is essential for effective decision-making. Poor data quality can lead to misinformed decisions, operational inefficiencies, and regulatory non-compliance. As organizations increasingly recognize the criticality of data quality, there is a growing demand for robust data quality solutions. These services include da
https://www.promarketreports.com/privacy-policy
The global Data Integration Machines market is experiencing robust growth, driven by the increasing need for real-time data processing and analysis across diverse sectors. The market, currently estimated at $15 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 12% from 2025 to 2033. This expansion is fueled by several key factors: the proliferation of IoT devices generating massive data volumes, the rising adoption of cloud-based data integration solutions, and the growing demand for advanced analytics capabilities in industries like healthcare, e-commerce, and industrial automation. The Federated Database Mode segment currently holds a significant market share, owing to its ability to integrate data from disparate sources without requiring data migration. However, Middleware and Data Warehouse modes are also gaining traction, driven by their scalability and flexibility for handling large datasets. Geographically, North America and Europe currently dominate the market, but the Asia-Pacific region is poised for significant growth, fueled by rapid technological advancements and increasing digitalization efforts. The market's growth is not without its challenges. High implementation costs and the complexity of integrating diverse data formats pose significant restraints. Furthermore, data security and privacy concerns, along with the lack of skilled professionals capable of managing these complex systems, can hinder wider adoption. However, the ongoing development of user-friendly interfaces and robust security protocols is expected to mitigate these challenges. The increasing focus on data-driven decision-making across all industries will be a key driver for market expansion in the coming years. Leading players like SICK AG, Oracle, IBM, and Microsoft are actively investing in research and development to enhance their offerings and maintain their competitive edge. The market is expected to witness increased consolidation as companies seek to expand their market reach and capabilities.
The Predictive Analytics market is experiencing robust growth, projected to reach $15.05 billion in 2025 and exhibiting a remarkable Compound Annual Growth Rate (CAGR) of 28.97%. This expansion is driven by several key factors. Firstly, the increasing availability of vast datasets from various sources, coupled with advancements in machine learning and artificial intelligence, fuels the development of more sophisticated and accurate predictive models. Secondly, businesses across diverse sectors—including BFSI (Banking, Financial Services, and Insurance), retail and e-commerce, telecom and IT, and transportation and logistics—are increasingly adopting predictive analytics to gain a competitive edge. This adoption is motivated by the need for improved decision-making, enhanced operational efficiency, optimized resource allocation, and proactive risk management. The growing need for fraud detection, personalized customer experiences, and supply chain optimization further contributes to market growth. Despite the significant growth trajectory, the market faces some challenges. Data security and privacy concerns remain paramount, requiring robust data governance and compliance measures. Furthermore, the successful implementation of predictive analytics necessitates significant investments in infrastructure, skilled personnel, and data integration capabilities, potentially hindering adoption for smaller businesses. The integration complexity of disparate data sources and the need for experienced professionals skilled in data science and analytics can also act as restraints. However, the increasing accessibility of cloud-based solutions and the emergence of user-friendly analytics platforms are mitigating these challenges, driving wider adoption across sectors and company sizes. The market is expected to continue its strong growth, fueled by ongoing technological innovations and the increasing demand for data-driven decision-making across all sectors. Geographical expansion, particularly in developing economies with burgeoning digital infrastructure, further strengthens this positive outlook.
https://doi.org/10.5061/dryad.qfttdz0jm
To perform and replicate this study, this dataset provides all needed files (as tables) to fit SDMs: i) the Iberian bird species occurrences at 10 km UTM square as the response or dependent variable; ii) the geographic layers of environmental information at 10 km UTM square for the Iberian Peninsula as predictors or independent variables, such as climate data, ecosystem functioning attributes (EFAs) and the combined climate and EFA data. The dataset is provided as four .csv files, named as follows (a loading sketch is given after the citation below):
1) The_Iberian_bird_species_occurrences_dataset_10km.csv
2) CHELSA_bioclimate_variables_IP10km.csv
3) MODIS_EVI-based_EFAs_IP10km.csv
4) Combined_bioclimate_EFA_dataset_IP10km.csv
Recommended citation for this dataset: Arenas-Castro, S. et al. (2024), Data from: Effects ...
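A minimal loading sketch for the files listed above follows. It assumes the occurrence table holds one presence/absence column per species keyed by a 10 km UTM-square identifier and that the combined predictor table shares that key; the key name "UTM10" and the species column are hypothetical placeholders to be checked against the actual headers, and the logistic-regression SDM is only a baseline stand-in.

```python
# Minimal loading sketch; column names marked as assumptions should be verified
# against the actual .csv headers.
import pandas as pd
from sklearn.linear_model import LogisticRegression

occ = pd.read_csv("The_Iberian_bird_species_occurrences_dataset_10km.csv")
env = pd.read_csv("Combined_bioclimate_EFA_dataset_IP10km.csv")

data = occ.merge(env, on="UTM10", how="inner")       # "UTM10" join key is an assumption
y = data["Alectoris_rufa"]                           # hypothetical species column
X = data.drop(columns=occ.columns)                   # keep only the predictor columns

sdm = LogisticRegression(max_iter=1000).fit(X, y)    # simple baseline SDM
data["suitability"] = sdm.predict_proba(X)[:, 1]
```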
This data release is a compilation of construction depth information for 12,383 active and inactive public-supply wells (PSWs) in California from various data sources. Construction data from multiple sources were indexed by the California State Water Resources Control Board Division of Drinking Water (DDW) primary station code (PS Code). Five different data sources were compared with the following priority order: 1, Local sources from select municipalities and water purveyors (Local); 2, Local DDW district data (DDW); 3, The United States Geological Survey (USGS) National Water Information System (NWIS); 4, The California State Water Resources Control Board Groundwater Ambient Monitoring and Assessment Groundwater Information System (SWRCB); and 5, USGS attribution of California Department of Water Resources well completion report data (WCR). For all data sources, the uppermost depth to the well's open or perforated interval was attributed as depth to top of perforations (ToP). The composite depth to bottom of well (Composite BOT) field was attributed from available construction data in the following priority order: 1, Depth to bottom of perforations (BoP); 2, Depth of completed well (Well Depth); 3, Borehole depth (Hole Depth). PSW ToPs and Composite BOTs from each of the five data sources were then compared, and summary construction depths for both fields were selected for wells with multiple data sources according to the data-source priority order listed above. Case-by-case modifications to the final selected summary construction depths were made after priority order-based selection to ensure internal logical consistency (for example, ToP must not exceed Composite BOT). This data release contains eight tab-delimited text files. WellConstructionSourceData_Local.txt contains well construction-depth data, Composite BOT data-source attribution, and local agency data-source attribution for the Local data. WellConstructionSourceData_DDW.txt contains well construction-depth data and Composite BOT data-source attribution for the DDW data. WellConstructionSourceData_NWIS.txt contains well construction-depth data, Composite BOT data-source attribution, and USGS site identifiers for the NWIS data. WellConstructionSourceData_SWRCB.txt contains well construction-depth data and Composite BOT data-source attribution for the SWRCB data. WellConstructionSourceData_WCR.txt contains well construction-depth data and Composite BOT data-source attribution for the WCR data. WellConstructionCompilation_ToP.txt contains all ToP data listed by data source. WellConstructionCompilation_BOT.txt contains all Composite BOT data listed by data source. WellConstructionCompilation_Summary.txt contains summary ToP and Composite BOT values for each well with data-source attribution for both construction fields. All construction depths are in units of feet below land surface and are reported to the nearest foot.
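The priority-order selection and the consistency rule described above can be expressed compactly. The sketch below is illustrative rather than the release's processing code: column names are assumed, and it resolves a ToP/Composite BOT conflict by dropping the ToP, whereas the actual compilation handled such conflicts case by case.

```python
# Sketch of priority-order selection of summary construction depths for one well
# (PS Code), given one row per contributing data source.
import pandas as pd

PRIORITY = ["Local", "DDW", "NWIS", "SWRCB", "WCR"]

def summarize_well(records: pd.DataFrame) -> pd.Series:
    """records: one row per data source with columns 'source', 'ToP', 'CompositeBOT'."""
    ordered = records.set_index("source").reindex(PRIORITY)   # highest priority first
    top = ordered["ToP"].dropna()
    bot = ordered["CompositeBOT"].dropna()
    summary = pd.Series({
        "ToP": top.iloc[0] if not top.empty else pd.NA,
        "CompositeBOT": bot.iloc[0] if not bot.empty else pd.NA,
    })
    # Consistency rule: top of perforations must not exceed the composite bottom.
    if pd.notna(summary["ToP"]) and pd.notna(summary["CompositeBOT"]) \
            and summary["ToP"] > summary["CompositeBOT"]:
        summary["ToP"] = pd.NA          # simplified resolution for illustration
    return summary

example = pd.DataFrame({
    "source": ["NWIS", "WCR"],
    "ToP": [120, 100],
    "CompositeBOT": [None, 400],
})
print(summarize_well(example))          # ToP from NWIS, Composite BOT from WCR
```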
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An updated and improved version of a global, vertically resolved, monthly mean zonal mean ozone database has been calculated, hereafter referred to as the BSVertOzone database. Like its predecessor, it combines measurements from several satellite-based instruments and ozone profile measurements from the global ozonesonde network. Monthly mean zonal mean ozone concentrations in mixing ratio and number density are provided in 5° latitude zones, spanning 70 altitude levels (1 to 70 km), or 70 pressure levels that are approximately 1 km apart (878.4 hPa to 0.046 hPa). Different data sets or "Tiers" are provided: "Tier 0" is based only on the available measurements and therefore does not completely cover the whole globe or the full vertical range uniformly; the "Tier 0.5" monthly mean zonal means are calculated from a filled version of the Tier 0 database, where missing monthly mean zonal mean values are estimated from correlations at level 20 against a total column ozone database and then, at levels above and below, from correlations with lower and upper levels respectively. The Tier 0.5 database includes the full range of measurement variability and is created as an intermediate step for the calculation of the "Tier 1" data, where a least squares regression model is used to attribute variability to various known forcing factors for ozone. Regression model fit coefficients are expanded in Fourier series and Legendre polynomials (to account for seasonality and latitudinal structure, respectively). Four different combinations of contributions from selected regression model basis functions result in four different "Tier 1" data sets that can be used for comparisons with chemistry-climate model simulations that do not exhibit the same unforced variability as reality (unless they are nudged towards reanalyses). Compared to previous versions of the database, this update includes additional satellite data sources and ozonesonde measurements to extend the database period to 2016. Additional improvements over the previous version of the database include: (i) adjustments of measurements to account for biases and drifts between different data sources (using a chemistry-transport model simulation as a transfer standard), (ii) a more objective way to determine the optimum number of Fourier and Legendre expansions for the basis function fit coefficients, and (iii) methodological and measurement uncertainties on each database value that are traced through all data modification steps. Comparisons with the ozone database from SWOOSH (Stratospheric Water and OzOne Satellite Homogenized data set) show excellent agreement in many regions of the globe, with minor differences caused by different bias adjustment procedures for the two databases. However, compared to SWOOSH, BSVertOzone additionally covers the troposphere.
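The Fourier/Legendre expansion of regression fit coefficients mentioned above can be illustrated with a small basis-construction sketch. The expansion orders, the month phase convention, and the latitude normalization below are assumptions chosen for illustration, not the BSVertOzone configuration.

```python
# Illustrative sketch (not the BSVertOzone code): let a regression coefficient vary
# with season and latitude by expanding it in a Fourier series (month) and
# Legendre polynomials (latitude), then taking the tensor product of the two bases.
import numpy as np
from numpy.polynomial import legendre

def design_row(month, lat_deg, n_fourier=2, n_legendre=4):
    """Basis values for one (month, latitude) cell."""
    phase = 2 * np.pi * (month - 0.5) / 12.0
    fourier = [1.0]
    for k in range(1, n_fourier + 1):
        fourier += [np.sin(k * phase), np.cos(k * phase)]
    x = lat_deg / 90.0                                            # map latitude to [-1, 1]
    leg = [legendre.Legendre.basis(d)(x) for d in range(n_legendre + 1)]
    return np.outer(fourier, leg).ravel()

months = np.arange(1, 13)
lats = np.arange(-87.5, 90, 5.0)        # centers of 5-degree latitude zones
X = np.array([design_row(m, la) for m in months for la in lats])
print(X.shape)                          # (12 * 36 cells, 5 * 5 basis terms) = (432, 25)
```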
Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to have an overview, and sometimes many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, after collecting different data from various sources and processing them, we have generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset in an open way, aiming to be useful for a wide range of researchers, such as informetricians, sociologists or data scientists. There are a total of 9 files, all of them in tsv format, and they have been built under a relational structure. The main one, which acts as the core of the dataset, is the page file; after it there are 4 files with different entities related to the Wikipedia pages (category, url, pub and page_property files) and 4 other files that act as "intermediate tables", making it possible to connect the pages both with the latter and between pages (page_category, page_url, page_pub and page_link files). The document Dataset_summary includes a detailed description of the dataset. Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.
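A minimal join sketch for the relational layout described above (core page file plus intermediate tables) is given below. The table names follow the description, but the .tsv file names and join-key column names are assumptions that should be checked against the Dataset_summary document.

```python
# Join sketch for the relational structure: page <- page_category -> category.
# File names and key columns ('page_id', 'category_id') are assumptions.
import pandas as pd

page = pd.read_csv("page.tsv", sep="\t")
category = pd.read_csv("category.tsv", sep="\t")
page_category = pd.read_csv("page_category.tsv", sep="\t")

# Pages with their categories, connected via the intermediate table.
pages_with_cats = (page.merge(page_category, on="page_id")
                       .merge(category, on="category_id"))
print(pages_with_cats.head())
```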
Lightning Talk at the International Digital Curation Conference 2025. The presentation examines OpenAIRE's solution to the “entity disambiguation” problem, presenting a hybrid data curation method that combines deduplication algorithms with the expertise of human curators to ensure high-quality, interoperable scholarly information. Entity disambiguation is invaluable to building a robust and interconnected open scholarly communication system. It involves accurately identifying and differentiating entities such as authors, organisations, data sources and research results across various entity providers. This task is particularly complex in contexts like the OpenAIRE Graph, where metadata is collected from over 100,000 data sources. Different metadata describing the same entity can be collected multiple times, potentially providing different information, such as different Persistent Identifiers (PIDs) or names, for the same entity. This heterogeneity poses several challenges to the disambiguation process. For example, the same organisation may be referenced using different names in different languages, or abbreviations. In some cases, even the use of PIDs might not be effective, as different identifiers may be assigned by different data providers. Therefore, accurate entity disambiguation is essential for ensuring data quality, improving search and discovery, facilitating knowledge graph construction, and supporting reliable research impact assessment. To address this challenge, OpenAIRE employs a deduplication algorithm to identify and merge duplicate entities, configured to handle different entity types. While the algorithm proves effective for research results, when applied to organisations and data sources, it needs to be complemented with human curation and validation since additional information may be needed. OpenAIRE's data source disambiguation relies primarily on the OpenAIRE technical team overseeing the deduplication process and ensuring accurate matches across DRIS, FAIRSharing, re3data, and OpenDOAR registries. While the algorithm automates much of the process, human experts verify matches, address discrepancies and actively search for matches not proposed by the algorithm. External stakeholders, such as data source managers, can also contribute by submitting suggestions through a dedicated ticketing system. So far OpenAIRE curated almost 3 935 groups for a total of 8 140 data sources. To address organisational disambiguation, OpenAIRE developed OpenOrgs, a hybrid system combining automated processes and human expertise. The tool works on organisational data aggregated from multiple sources (ROR registry, funders databases, CRIS systems, and others) by the OpenAIRE infrastructure, automatically compares metadata, and suggests potential merged entities to human curators. These curators, authorised experts in their respective research landscapes, validate merged entities, identify additional duplicates, and enrich organisational records with missing information such as PIDs, alternative names, and hierarchical relationships. With over 100 curators from 40 countries, OpenOrgs has curated more than 100,000 organisations to date. A dataset containing all the OpenOrgs organizations can be found on Zenodo (https://doi.org/10.5281/zenodo.13271358). This presentation demonstrates how OpenAIRE's entity disambiguation techniques and OpenOrgs aim to be game-changers for the research community by building and maintaining an integrated open scholarly communication system in the years to come.
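As a purely hypothetical illustration of the first, automated pass of such a disambiguation workflow (not OpenAIRE's deduplication algorithm or OpenOrgs), the sketch below groups organisation records that share a persistent identifier or a normalised name and leaves the resulting candidate groups to human curators.

```python
# Hypothetical first-pass grouping of organisation records before human curation.
import re
from collections import defaultdict

def norm(name):
    """Crude name normalisation: lowercase, strip punctuation."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

records = [
    {"id": 1, "name": "University of Examples", "pid": "ror:00aaa"},
    {"id": 2, "name": "Univ. of Examples",       "pid": "ror:00aaa"},
    {"id": 3, "name": "Example Institute",       "pid": None},
]

groups = defaultdict(set)
for r in records:
    key = r["pid"] or norm(r["name"])     # prefer the persistent identifier when present
    groups[key].add(r["id"])

# Candidate merged entities to present to curators for validation:
print([ids for ids in groups.values() if len(ids) > 1])
```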
This dataset was created for the following publication: Cheruvelil, K.S., S. Yuan, K.E. Webster, P.-N. Tan, J.-F. Lapierre, S.M. Collins, C.E. Fergus, C.E. Scott, E.N. Henry, P.A. Soranno, C.T. Filstrup, T. Wagner. Under review. Creating multi-themed ecological regions for macrosystems ecology: Testing a flexible, repeatable, and accessible clustering method. Submitted to Ecology and Evolution July 2016. This dataset includes lake total phosphorus (TP) and Secchi data from summer, epilimnetic water samples, as well as 52 geographic variables at the HU-12 scale; it is a subset of the larger LAGOS-NE database (Lake multi-scaled geospatial and temporal database, described in Soranno et al. 2015). LAGOS-NE compiles multiple, individual lake water chemistry datasets into an integrated database. We accessed LAGOSLIMNO version 1.054.1 for lake water chemistry data and LAGOSGEO version 1.03 for geographic data. In the LAGOSLIMNO database, lake water chemistry data were collected from individual state agency sampling and volunteer programs designed to monitor lake water quality. Water chemistry analyses follow standard lab methods. In the LAGOSGEO database geographic data were collected from national scale geographic information systems (GIS) data layers. The dataset is a subset of the following integrated databases: LAGOSLIMNO v.1.054.1 and LAGOSGEO v.1.03. For full documentation of these databases, please see the publication below: Soranno, P.A., E.G. Bissell, K.S. Cheruvelil, S.T. Christel, S.M. Collins, C.E. Fergus, C.T. Filstrup, J.F. Lapierre, N.R. Lottig, S.K. Oliver, C.E. Scott, N.J. Smith, S. Stopyak, S. Yuan, M.T. Bremigan, J.A. Downing, C. Gries, E.N. Henry, N.K. Skaff, E.H. Stanley, C.A. Stow, P.-N. Tan, T. Wagner, K.E. Webster. 2015. Building a multi-scaled geospatial temporal ecology database from disparate data sources: Fostering open science and data reuse. GigaScience 4:28 doi:10.1186/s13742-015-0067-4 .
There are a number of sources for estimates of the size and distribution of ethnic group populations in England. These estimates vary in quality, accuracy, timeliness, and detail; in some cases, the underlying definition of what constitutes the resident population is different. This document outlines in some detail the major sources of ethnic group information currently available at the national and regional level. It also gives a brief summary of the estimates themselves.
Time series of mean summer total nitrogen (TN), total phosphorus (TP), stoichiometry (TN:TP) and chlorophyll values from 2913 unique lakes in the Midwest and Northeast United States. Epilimnetic nutrient and chlorophyll observations were derived from the Lake Multi-Scaled Geospatial and Temporal Database LAGOS-NE LIMNO version 1.054.1, and come from 54 disparate data sources. These data were used to assess long-term monotonic changes in water quality from 1990-2013, and the potential drivers of those trends (Oliver et al., submitted). Summer was used to approximate the stratified period, which was defined as June 15 to September 15. The median number of observations per summer for a given lake was 2, but ranged from 1 to 83. The rules for inclusion in the database were that, for a given water quality parameter, a lake must have an observation in each period of 1990-2000 and 2001-2011. Additionally, observations must span at least 5 years. Each unique lake with nutrient or chlorophyll data also has supporting geophysical data, including climate, atmospheric deposition, land use, hydrology, and topography derived at the lake watershed (variable prefix “iws”) and HUC 4 (variable prefix “hu4”) scale. Lake-specific characteristics, such as depth and area, are also reported. The geospatial data came from LAGOS-NE GEO version 1.03. For more specific information on how LAGOS-NE was created, see Soranno et al. 2015. Soranno P.A., Bissell E.G., Cheruvelil K.S., Christel S.T., Collins S.M., Fergus C.E., Filstrup C.T., Lapierre J.-F., Lottig N.R., Oliver S.K., Scott C.E., Smith N.J., Stopyak S., Yuan S., Bremigan M.T., Downing J.A., Gries C., Henry E.N., Skaff N.K., Stanley E.H., Stow C.A., Tan P.-N., Wagner T., and Webster K.E. 2015. Building a multi-scaled geospatial temporal ecology database from disparate data sources: fostering open science and data reuse. Gigascience 4: 28. doi: 10.1186/s13742-015-0067-4.
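The inclusion rules quoted above (an observation in each of 1990-2000 and 2001-2011, and observations spanning at least 5 years) translate into a simple filter. The sketch below uses hypothetical column names, not the LAGOS-NE table schema.

```python
# Sketch of the stated inclusion rules, applied per lake and per parameter.
import pandas as pd

def passes_inclusion(years: pd.Series) -> bool:
    years = years.dropna()
    has_early = years.between(1990, 2000).any()
    has_late = years.between(2001, 2011).any()
    return bool(has_early and has_late and (years.max() - years.min()) >= 5)

# obs: one row per summer observation; column names are assumptions.
obs = pd.DataFrame({
    "lagoslakeid": [1, 1, 1, 2, 2],
    "year": [1992, 1999, 2008, 2005, 2009],
    "tp": [30.0, 25.0, 28.0, 12.0, 15.0],
})
kept = obs.groupby("lagoslakeid").filter(lambda g: passes_inclusion(g["year"]))
print(kept["lagoslakeid"].unique())   # lake 1 qualifies, lake 2 does not
```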
The peer-reviewed publication for this dataset has been presented in the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite this when using the dataset.
This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim. The claims have been obtained from online fact-checking sources, existing datasets and research challenges. It combines different data sources with different foci, thus enabling a comprehensive approach that combines different media (Twitter, Facebook, general websites, academia), information domains (health, scholar, media), information types (news, claims) and applications (information retrieval, veracity evaluation).
The processing of the claims included an extensive de-duplication process eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, accounting for different degrees of similarity between the remaining claims (excluding, respectively, claims with a 90% and 99% probability of being similar, as obtained through the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019). The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English.
The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including: PERSON (people, including fictional); ORGANIZATION (companies, agencies, institutions, etc.); GPE (countries, cities, states); FACILITY (buildings, highways, etc.). These entities have been detected using a RoBERTa base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013) using Spacy (a small tagging sketch follows this entry). The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels.
The data sources used are:
- The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/
- CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID
- MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID
- CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data
- TREC Health Misinformation track https://trec-health-misinfo.github.io/
- TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html
The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True). The entries in the dataset contain the following information:
- Claim. Text of the claim.
- Claim label. The labels are: False, and True.
- Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals.
- Original information source. Information about which general information source was used to obtain the claim.
- Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.
Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant no. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by UK Research and Innovation (grant no. EP/V030302/1, EP/V020579/1).
References
- Arana-Catania M., Kochkina E., Zubiaga A., Liakata M., Procter R., He Y. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022. https://arxiv.org/abs/2205.02596
- Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication Sp, 109:109.
- Fabio Crestani, Mounia Lalmas, Cornelis J Van Rijsbergen, and Iain Campbell. 1998. "Is this document relevant?... probably": a survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.
- Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.
- Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Pro...
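As noted in the entry above, named-entity claim types were detected with a RoBERTa-based spaCy pipeline. The sketch below shows what such tagging might look like; the model name en_core_web_trf, its local availability, and the mapping of labels (spaCy uses ORG and FAC for ORGANIZATION and FACILITY) are assumptions, and this is not the dataset's actual processing script.

```python
# Sketch of tagging a claim with named-entity types using a spaCy pipeline.
# en_core_web_trf (RoBERTa-based, OntoNotes labels) must be installed separately.
import spacy

nlp = spacy.load("en_core_web_trf")

WANTED = {"PERSON", "ORG", "GPE", "FAC"}   # spaCy's short forms of the listed types

def claim_entity_types(claim):
    doc = nlp(claim)
    return {ent.label_ for ent in doc.ents if ent.label_ in WANTED}

print(claim_entity_types("The WHO confirmed the first COVID-19 case in Wuhan."))
```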
https://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global Data Fabric market size is USD 2.6521 billion in 2024 and will expand at a compound annual growth rate (CAGR) of 15.98% from 2024 to 2031.
Market Dynamics of Data Fabric Market
Key Drivers for Data Fabric Market
Data Explosion and Complexity: One of the main drivers is the exponential growth of data volumes across various sources, including IoT devices, social media, and enterprise applications, which fuels the demand for data fabric solutions. These solutions offer seamless data integration, management, and accessibility across heterogeneous environments, enabling organizations to harness the value of their data assets efficiently. As data becomes increasingly diverse and distributed, the need for cohesive data fabric architectures becomes paramount to facilitate data-driven decision-making and unlock new business insights.
Cloud adoption and hybrid multi-cloud strategies drive data fabric growth.
Key Restraints for Data Fabric Market
The complexity of integration with existing infrastructure and applications poses a serious challenge to the Data Fabric industry.
Concerns about data privacy, security, and regulatory compliance impact the data fabric market growth.
Introduction of the Data Fabric Market
Data Fabric contributes to market growth by offering a comprehensive solution for managing and leveraging data across hybrid and multi-cloud environments. As organizations grapple with the challenges of data fragmentation, siloed systems, and disparate data sources, data fabric provides a unified architecture that seamlessly integrates data from various sources, formats, and locations. By providing a holistic view of data assets and enabling real-time access and analysis, data fabric empowers organizations to derive actionable insights, make informed decisions, and drive innovation. With the proliferation of data-driven initiatives, digital transformation efforts, and the increasing adoption of cloud technologies, the Data Fabric Market is poised for substantial growth as businesses recognize the value of a cohesive data strategy in unlocking the full potential of their data assets.
https://spdx.org/licenses/CC0-1.0.html
Integrated distribution models (IDMs), in which datasets with different properties are analysed together, are becoming widely used to model species distributions and abundance in space and time. To date, the IDM literature has focused on technical and statistical issues, such as the precision of parameter estimates and mitigation of biases arising from unstructured data sources. However, IDMs have an unrealised potential to estimate ecological properties that could not be properly derived from the source datasets if analysed separately. We present a model that estimates community alpha diversity metrics by integrating one species-level dataset of presence-absence records with a co-located dataset of group-level counts (i.e. lacking information about species identity). We illustrate the ability of community IDMs to capture the true alpha diversity through simulation studies and apply the model to data from the UK Pollinator Monitoring Scheme, to describe spatial variation in the diversity of solitary bees, bumblebees and hoverflies. The simulation and case studies showed that the proposed IDM produced more precise estimates of community diversity than the single-dataset models, and the analysis of the real dataset further showed that the alpha diversity estimates from the IDM were averages of the single models. Our findings also revealed that IDMs had a higher prediction accuracy for all the insect groups in most cases, with this performance linked to the amount of information each data source contributes to the IDM.
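The integration idea, in which species-level presence-absence records and group-level counts inform a shared set of species intensities, can be shown in a toy maximum-likelihood sketch. This is not the paper's model; the simple occupancy and count likelihoods below are assumptions chosen only to illustrate joint estimation from the two data types.

```python
# Toy sketch of data integration: species intensities lambda_s are shared between a
# presence-absence likelihood (species-level data) and a Poisson likelihood for
# group-level counts (no species identity), and are estimated jointly.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(2)
S = 5
lam_true = rng.uniform(0.2, 1.5, S)

# Dataset 1: presence/absence of each species at 200 sites.
pa = rng.random((200, S)) < (1 - np.exp(-lam_true))
# Dataset 2: group-level counts (species identity unknown) at 300 sites.
counts = rng.poisson(lam_true.sum(), 300)

def negloglik(log_lam):
    lam = np.exp(log_lam)
    log_p = np.log(-np.expm1(-lam))      # log P(species present)
    log_q = -lam                         # log P(species absent)
    ll_pa = (pa * log_p + (~pa) * log_q).sum()
    ll_ct = poisson.logpmf(counts, lam.sum()).sum()
    return -(ll_pa + ll_ct)

fit = minimize(negloglik, np.zeros(S), method="L-BFGS-B", bounds=[(-5, 5)] * S)
lam_hat = np.exp(fit.x)
print("estimated intensities:", lam_hat.round(2), "true:", lam_true.round(2))
```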
The explanations behind observations of global patterning in species diversity pre-date the field of ecology itself. The generation of new species-area theories, in particular, far outpaces their falsification, resulting in a centuries-old accumulation of species diversity theories. We use historical assessment and new data analysis to argue that one of the earliest recognized and most consistent patterns in species diversity is not strictly an ecological phenomenon and, when ecological mechanism is invoked, the range of potential mechanisms is too broad for tractable hypothesis falsification. We provide a historical parallel in that the normal distribution was once treated as a pattern assuming a biological mechanism, rather than a statistical distribution that can be generated by biological and non-biological forces. Similarly, power law distributions are ubiquitous in aggregated data, such as the species-area relationship. That nearly identical broad-scale aggregation patterns are ...
This data package, LAGOS-NE-GEO v1.05, is 1 of 5 data packages associated with the LAGOS-NE database -- the LAke multi-scaled GeOSpatial and temporal database. Three of the data packages each contain different types of data for 51,101 lakes and reservoirs larger than 4 ha in 17 lake-rich U.S. states to support research on thousands of lakes. These three packages are: (1) LAGOS-NE-LOCUS: lake location and physical characteristics for all lakes. (2) LAGOS-NE-GEO: ecological context (i.e., the land use, geologic, climatic, and hydrologic setting of lakes) for all lakes. These geospatial data were created by processing national-scale and publicly-accessible datasets to quantify numerous metrics at multiple spatial resolutions. And, (3) LAGOS-NE-LIMNO: in-situ measurements of lake water quality from the past three decades for approximately 2,600-12,000 lakes, depending on the variable. This module was created by harmonizing 87 water quality datasets from federal, state, tribal, and non-profit agencies, university researchers, and citizen scientists. The other two data packages contain supporting data for the LAGOS-NE database: (4) LAGOS-NE-GIS v1.0: the GIS data layers for lakes, wetlands, and streams, as well as the spatial resolutions that were used to create the LAGOS-NE-GEO module. (5) LAGOS-NE-RAWDATA: the original 87 datasets of lake water quality prior to processing, the R code that converts the original data formats into LAGOS-NE data format, and the log file from this procedure to create LAGOS-NE. This latter data package supports the reproducibility of LAGOS-NE-LIMNO.
The LAGOS-NE-GEO v1.05 module includes information on the ecological context of the census lakes, all lakes > 4 ha in the study extent, their watersheds, and their regions. The information provided in the data tables for this module is organized into three main themes: CHAG - climate, hydrology, atmospheric deposition of nitrogen and sulfur, and surficial geology; LULC - land use/cover, impervious cover, canopy cover, slope and terrain indices, and dam density; and CONN - lake, stream, and wetland abundance and connectivity measures.
Citation for the full documentation of this database:
Soranno, P.A., E.G. Bissell, K.S. Cheruvelil, S.T. Christel, S.M. Collins, C.E. Fergus, C.T. Filstrup, J.F. Lapierre, N.R. Lottig, S.K. Oliver, C.E. Scott, N.J. Smith, S. Stopyak, S. Yuan, M.T. Bremigan, J.A. Downing, C. Gries, E.N. Henry, N.K. Skaff, E.H. Stanley, C.A. Stow, P.-N. Tan, T. Wagner, K.E. Webster. 2015. Building a multi-scaled geospatial temporal ecology database from disparate data sources: Fostering open science and data reuse. GigaScience 4:28 doi:10.1186/s13742-015-0067-4
Citation for the data paper for this database:
Soranno, P.A., L.C. Bacon, M. Beauchene, K.E. Bednar, E.G. Bissell, C.K. Boudreau, M.G. Boyer, M.T. Bremigan, S.R. Carpenter, J.W. Carr, K.S. Cheruvelil, S.T. Christel, M. Claucherty, S.M.Collins, J.D. Conroy, J.A. Downing, J. Dukett, C.E. Fergus, C.T. Filstrup, C. Funk, M.J. Gonzalez, L.T. Green, C. Gries, J.D. Halfman, S.K. Hamilton, P.C. Hanson, E.N. Henry, E.M. Herron, C. Hockings, J.R. Jackson, K. Jacobson-Hedin, L.L. Janus, W.W. Jones, J.R. Jones, C.M. Keson, K.B.S. King, S.A. Kishbaugh, J.-F. Lapierre, B. Lathrop, J.A. Latimore, Y. Lee, N.R. Lottig, J.A. Lynch, L.J. Matthews, W.H. McDowell, K.E.B. Moore, B.P. Neff, S.J. Nelson, S.K. Oliver, M.L. Pace, D.C. Pierson, A.C. Poisson, A.I. Pollard, D.M. Post, P.O. Reyes, D.O. Rosenberry, K.M. Roy, L.G. Rudstam, O. Sarnelle, N.J. Schuldt, C.E. Scott, N.K. Skaff, N.J. Smith, N.R. Spinelli, J.J. Stachelek, E.H. Stanley, J.L. Stoddard, S.B. Stopyak, C.A. Stow, J.M. Tallant, P.-N. Tan, A.P. Thorpe, M.J. Vanni, T. Wagner, G. Watkins, K.C. Weathers, K.E. Webster, J.D. White, M.K. Wilmes, S. Yuan. In Review. LAGOS-NE: A multi-scaled geospatial and temporal database of lake ecological context and water quality for thousands of U.S. lakes. In Review at GigaScience. Submitted April 2017.