Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Algeria DZ: SPI: Pillar 4 Data Sources Score: Scale 0-100 data was reported at 45.958 NA in 2022. This records a decrease from the previous number of 49.075 NA for 2021. Algeria DZ: SPI: Pillar 4 Data Sources Score: Scale 0-100 data is updated yearly, averaging 49.892 NA from Dec 2016 (Median) to 2022, with 7 observations. The data reached an all-time high of 52.417 NA in 2018 and a record low of 45.958 NA in 2022. Algeria DZ: SPI: Pillar 4 Data Sources Score: Scale 0-100 data remains active status in CEIC and is reported by World Bank. The data is categorized under Global Database’s Algeria – Table DZ.World Bank.WDI: Governance: Policy and Institutions. The data sources overall score is a composite measure of whether countries have data available from the following sources: censuses and surveys, administrative data, geospatial data, and private sector/citizen generated data. The data sources (input) pillar is segmented by four types of sources: (i) those generated by the statistical office (censuses and surveys), and sources accessed from elsewhere, such as (ii) administrative data, (iii) geospatial data, and (iv) private sector data and citizen generated data. The appropriate balance between these source types will vary depending on a country’s institutional setting and the maturity of its statistical system. High scores should reflect the extent to which the sources being utilized enable the necessary statistical indicators to be generated. For example, a low score on environment statistics (in the data production pillar) may reflect a lack of use of (and a low score for) geospatial data (in the data sources pillar). This type of linkage is inherent in the data cycle approach and can help highlight areas for investment required if country needs are to be met.;Statistical Performance Indicators, The World Bank (https://datacatalog.worldbank.org/dataset/statistical-performance-indicators);Weighted average;
A 2023 report on data breaches in the healthcare system in the United States revealed that in most incidents, the leaked data was located on a network server, with almost 70 percent of data breaches indicating this location. The second-most common location of breached data was e-mail, accounting for over 18 percent of cases, followed by paper or films, with nearly six percent of cases.
PredictLeads Job Openings Data provides high-quality hiring insights sourced directly from company websites - not job boards. Using advanced web scraping technology, our dataset offers real-time access to job trends, salaries, and skills demand, making it a valuable resource for B2B sales, recruiting, investment analysis, and competitive intelligence.
Key Features:
✅ 232M+ Job Postings Tracked – Data sourced from 92 million company websites worldwide.
✅ 7.1M+ Active Job Openings – Updated in real-time to reflect hiring demand.
✅ Salary & Compensation Insights – Extract salary ranges, contract types, and job seniority levels.
✅ Technology & Skill Tracking – Identify emerging tech trends and industry demands.
✅ Company Data Enrichment – Link job postings to employer domains, firmographics, and growth signals.
✅ Web Scraping Precision – Directly sourced from employer websites for unmatched accuracy.
Primary Attributes:
Job Metadata:
Salary Data (salary_data)
Occupational Data (onet_data) (object, nullable)
Additional Attributes:
📌 Trusted by enterprises, recruiters, and investors for high-precision job market insights.
PredictLeads Dataset: https://docs.predictleads.com/v3/guide/job_openings_dataset
This report compares specific health conditions, overall health, and health care utilization prevalence estimates from the 2006 National Survey on Drug Use and Health (NSDUH) and other national data sources. Methodological differences among these data sources that may contribute to differences in estimates are described. In addition to NSDUH, three of the data sources use respondent self-reports to measure health characteristics and service utilization: the National Health Interview Survey (NHIS), the Behavioral Risk Factor Surveillance System (BRFSS), and the Medical Expenditure Panel Survey (MEPS). One survey, the National Health and Nutrition Examination Survey (NHANES), conducts initial interviews in respondents' homes, collecting further data at nearby locations. Five data sources provide health care utilization data extracted from hospital records; these sources include the National Hospital Discharge Survey (NHDS), the Nationwide Inpatient Sample (NIS), the Nationwide Emergency Department Sample (NEDS), the National Health and Ambulatory Medical Care Survey (NHAMCS), and the Drug Abuse Warning Network (DAWN). Several methodological differences that could cause differences in estimates are discussed, including type and mode of data collection; weighting and representativeness of the sample; question placement, wording, and format; and use of proxy reporting for adolescents.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Following the identification of Local Area Energy Planning (LAEP) use cases, this dataset lists the data sources and/or information that could help facilitate this research. View our dedicated page to find out how we derived this list: Local Area Energy Plan — UK Power Networks (opendatasoft.com)
Methodological Approach
Data upload: a list of datasets and ancillary details is uploaded into a static Excel file before being uploaded onto the Open Data Portal.
Quality Control Statement
Quality Control Measures include:
Manual review and correction of data inconsistencies
Use of additional verification steps to ensure accuracy in the methodology
Assurance Statement
The Open Data Team and Local Net Zero Team worked together to ensure data accuracy and consistency.
Other
Download dataset information: Metadata (JSON)
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/
Please note that "number of records" in the top left corner is higher than the number of datasets available, as many datasets are indexed against multiple use cases, leading to them being counted as multiple records.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sources’ characteristics*.
https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and LaTeX file are important for extracting information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries' national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
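As a rough illustration of this first approach, a minimal regular-expression country search might look like the sketch below; the country list, field handling, and matching details are illustrative assumptions rather than the project's actual code.

```python
import re

# Illustrative subset of ISO 3166 country names; the real list covers every country
# and would also need to handle common variants (e.g. "United States" vs. "USA").
COUNTRY_NAMES = ["Algeria", "Japan", "Kenya", "Brazil", "Viet Nam"]

# One word-boundary pattern per country, matched case-insensitively.
PATTERNS = {name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
            for name in COUNTRY_NAMES}

def countries_mentioned(text: str) -> list[str]:
    """Return the country names found in a title, abstract, or topic string."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

print(countries_mentioned("Maize yields and rainfall variability in Kenya, 2000-2015"))
# ['Kenya']
```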
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify named entities in text, utilizing the spaCy Python library. In this project, NER is used to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
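A corresponding sketch of the NER-based approach is shown below; the choice of the en_core_web_sm model and the restriction to GPE (geopolitical entity) labels are assumptions for illustration, not necessarily the project's configuration.

```python
import spacy

# Assumes the small English pipeline has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def ner_locations(text: str) -> set[str]:
    """Return the geopolitical entities (GPE) that spaCy detects in the text."""
    doc = nlp(text)
    return {ent.text for ent in doc.ents if ent.label_ == "GPE"}

abstract = "We analyse household survey data from Viet Nam and the Philippines."
print(ner_locations(abstract))
```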
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. identifying whether an academic article is using data from any country
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that indicate a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244, assuming a cost of $3 per article as was paid to the MTurk workers.
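For reference, the arithmetic behind these figures can be reproduced as follows, assuming the 50-year estimate refers to continuous (around-the-clock) review time:

```python
n_articles = 1_037_748
cost_per_article = 3        # USD paid per article to MTurk workers
minutes_per_article = 25.4  # median classification time

total_cost = n_articles * cost_per_article           # 3,113,244 USD
total_hours = n_articles * minutes_per_article / 60
years_continuous = total_hours / (24 * 365.25)       # roughly 50 years of continuous review
print(total_cost, round(years_continuous, 1))
```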
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.
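The sketch below shows how such a classifier could be applied at inference time using the Hugging Face Transformers implementation of DistilBERT; the checkpoint path, label ordering, and maximum sequence length are illustrative assumptions rather than details taken from the study.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical path to a DistilBERT checkpoint fine-tuned on the labelled abstracts.
MODEL_DIR = "path/to/distilbert-data-use-classifier"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def uses_data(abstract: str, threshold: float = 0.90) -> bool:
    """Label an abstract as 'uses data' only when the positive-class probability reaches the threshold."""
    inputs = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    prob_uses_data = torch.softmax(logits, dim=-1)[0, 1].item()  # assumes class 1 = "uses data"
    return prob_uses_data >= threshold
```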
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
https://www.verifiedmarketresearch.com/privacy-policy/
Real World Evidence Solutions Market size was valued at USD 1.30 Billion in 2024 and is projected to reach USD 3.71 Billion by 2032, growing at a CAGR of 13.92% during the forecast period 2026-2032.
Global Real World Evidence Solutions Market Drivers
The market drivers for the Real World Evidence Solutions Market can be influenced by various factors. These may include:
Growing Need for Evidence-Based Healthcare: Real-world evidence (RWE) is becoming more and more important in healthcare decision-making, according to stakeholders such as payers, providers, and regulators. In addition to traditional clinical trial data, RWE solutions offer important insights into the efficacy, safety, and value of healthcare interventions in real-world situations.
Growing Use of RWE by Pharmaceutical Companies: RWE solutions are being used by pharmaceutical companies to assist with market entry, post-marketing surveillance, and drug development initiatives. Pharmaceutical businesses can find new indications for their current medications, improve clinical trial designs, and convince payers and providers of the worth of their products with the use of RWE.
Increasing Priority for Value-Based Healthcare: The emphasis on proving the cost- and benefit-effectiveness of healthcare interventions in real-world settings is growing as value-based healthcare models gain traction. To assist value-based decision-making, RWE solutions are essential in evaluating the economic effect and real-world consequences of healthcare interventions.
Technological and Data Analytics Advancements: RWE solutions are becoming more capable due to advances in machine learning, artificial intelligence, and big data analytics. With the use of these technologies, healthcare stakeholders can obtain actionable insights from the analysis of vast and varied datasets, including patient-generated data, claims data, and electronic health records.
Regulatory Support for RWE Integration: RWE is being progressively integrated into regulatory decision-making processes by regulatory organisations including the European Medicines Agency (EMA) and the U.S. Food and Drug Administration (FDA). The FDA's Real-World Evidence Programme and the EMA's Adaptive Pathways and PRIority MEdicines (PRIME) programme are two examples of initiatives that are making it easier to incorporate RWE into regulatory submissions and drug development.
Increasing Emphasis on Patient-Centric Healthcare: The value of patient-reported outcomes and real-world experiences in healthcare decision-making is becoming more widely acknowledged. RWE technologies facilitate the collection and examination of patient-centered data, offering valuable insights into treatment efficacy, patient inclinations, and quality of life consequences.
Extension of RWE Use Cases: RWE solutions are being used in medication development, post-market surveillance, health economics and outcomes research (HEOR), comparative effectiveness research, and market access, among other healthcare fields. The necessity for a variety of RWE solutions catered to the needs of different stakeholders is being driven by the expansion of RWE use cases.
Maximum Analysis Sample Sizes by Analysis Type and Data Source.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sources for 54 use cases, which were used to identify the impacts of Additive Manufacturing on the value creation system.
The Public Health Emergency (PHE) declaration for COVID-19 expired on May 11, 2023. As a result, the Aggregate Case and Death Surveillance System will be discontinued. Although these data will continue to be publicly available, this dataset will no longer be updated.
On October 20, 2022, CDC began retrieving aggregate case and death data from jurisdictional and state partners weekly instead of daily.
This dataset includes the URLs that were used by the aggregate county data collection process that compiled aggregate case and death counts by county. Within this file, each of the states (plus select jurisdictions and territories) are listed along with the county web sources which were used for pulling these numbers. Some states had a single statewide source for collecting the county data, while other states and local health jurisdictions may have had standalone sources for individual counties. In the cases where both local and state web sources were listed, a composite approach was taken so that the maximum value reported for a location from either source was used. The initial raw data were sourced from these links and ingested into the CDC aggregate county dataset before being published on the COVID Data Tracker.
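A minimal sketch of that maximum-of-sources rule, assuming the scraped counts have been loaded into a pandas DataFrame (the column names and values below are purely illustrative):

```python
import pandas as pd

# Toy example: the same county reported by both a statewide source and a local source.
reports = pd.DataFrame({
    "fips":   ["06037", "06037", "06059"],
    "date":   ["2022-10-20", "2022-10-20", "2022-10-20"],
    "source": ["state", "local", "state"],
    "cases":  [3_500_000, 3_512_345, 900_000],
})

# Keep the maximum value reported for each county and date, regardless of source.
composite = reports.groupby(["fips", "date"], as_index=False)["cases"].max()
print(composite)
```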
Doorda's UK Vulnerability Data provides a comprehensive database of over 1.8M postcodes sourced from 30 data sources, offering unparalleled insights for location intelligence and analytics purposes.
Volume and stats:
- 1.8M Postcodes
- 5 Vulnerability areas covered
- 1-100 Vulnerability rating
Our Residential Real Estate Data offers a multitude of use cases:
- Market Analysis
- Identify Vulnerable Consumers
- Mitigate Lead Generation Risk
- Risk Management
- Location Planning
The key benefits of leveraging our Residential Real Estate Data include:
- Data Accuracy
- Informed Decision-Making
- Competitive Advantage
- Efficiency
- Single Source
Covering a wide range of industries and sectors, our data empowers organisations to make informed decisions, uncover market trends, and gain a competitive edge in the UK market.
Data source code: Description
10: original FRIS sample based on cores for older forest (origin_year ≤ 1900)
11: original FRIS sample based on cores for areas lacking RS-FRIS
12: original FRIS sample non-core data (e.g., stratified sample, LULC) for areas lacking RS-FRIS
21: Leave trees, identified using remotely-sensed data, within LRM completed even-aged harvests. Origin year based on RS-FRIS predicted origin year.
22: RS-FRIS 2.0 predicted origin year.
23: RS-FRIS 3.0 predicted origin year.
24: RS-FRIS 4.0 predicted origin year.
25: RS-FRIS 5.0 predicted origin year.
30: LRM stand origin date or activity completion date within completed even-aged harvests.
40: Overrides. These are areas where new data were collected or other data sources override all existing sources.
53, 54: High severity (53) and very high severity (54) fires 2012-2021
61: GNN
71: Establishment year from inherited inventory for Deep River Woods land transaction.
This data set is periodically updated. Last update performed 2025-08-31.
Objective: This paper introduces a novel framework for evaluating phenotype algorithms (PAs) using the open-source tool, Cohort Diagnostics.
Materials and methods: The method is based on several diagnostic criteria to evaluate a patient cohort returned by a PA. Diagnostics include estimates of incidence rate, index date entry code breakdown, and prevalence of all observed clinical events prior to, on, and after index date. We test our framework by evaluating one PA for systemic lupus erythematosus (SLE) and two PAs for Alzheimer’s disease (AD) across 10 different observational data sources.
Results: By utilizing CohortDiagnostics, we found that the population-level characteristics of individuals in the cohort of SLE closely matched the disease’s anticipated clinical profile. Specifically, the incidence rate of SLE was consistently higher in occurrence among females. Moreover, expected clinical events like laboratory tests, treatments, and repeated diagnoses were also observed. For AD, although one PA identified considerably fewer patients, absence of notable differences in clinical characteristics between the two cohorts suggested similar specificity.
Discussion: We provide a practical and data-driven approach to evaluate PAs, using two clinical diseases as examples, across a network of OMOP data sources. Cohort Diagnostics can ensure the subjects identified by a specific PA align with those intended for inclusion in a research study.
Conclusion: Diagnostics based on large-scale population-level characterization can offer insights into the misclassification errors of PAs.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
I wanted to make some geospatial visualizations to convey the current severity of COVID-19 in different parts of the U.S.
I liked the NYTimes COVID dataset, but it was lacking information on county boundary shapes, population per county, new cases / deaths per day, per capita calculations, and county demographics.
After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in python, I wanted to open-source the final enriched data set in order to give others a head start in their COVID-19 related analytic, modeling, and visualization efforts.
This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per-day cases / deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.
UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.
Geospatial analysis and visualization:
- Which counties are currently getting hit the hardest (per capita and totals)?
- What patterns are there in the spread of the virus across counties? (network-based spread simulations using county center lat / lons)
- Do county population densities play a role in how quickly the virus spreads?
- How do a specific county's or state's cases and deaths compare to other counties/states?
- Join with other county-level datasets easily (with fips code column)
See the column descriptions for more details on the dataset
COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)
https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information related to diet and energy flow is fundamental to a diverse range of Antarctic and Southern Ocean biological and ecosystem studies. This metadata record describes a database of such information being collated by the SCAR Expert Groups on Antarctic Biodiversity Informatics (EG-ABI) and Birds and Marine Mammals (EG-BAMM) to assist the scientific community in this work. It includes data related to diet and energy flow from conventional (e.g. gut content) and modern (e.g. molecular) studies, stable isotopes, fatty acids, and energetic content. It is a product of the SCAR community and open for all to participate in and use.
Data have been drawn from published literature, existing trophic data collections, and unpublished data. The database comprises five principal tables, relating to (i) direct sampling methods of dietary assessment (e.g. gut, scat, and bolus content analyses, stomach flushing, and observed predation), (ii) stable isotopes, (iii) lipids, (iv) DNA-based diet assessment, and (v) energetics values. The schemas of these tables are described below, and a list of the sources used to populate the tables is provided with the data.
A range of manual and automated checks were used to ensure that the entered data were as accurate as possible. These included visual checking of transcribed values, checking of row or column sums against known totals, and checking for values outside of allowed ranges. Suspicious entries were re-checked against the original source.
Notes on names: Names have been validated against the World Register of Marine Species (http://www.marinespecies.org/). For uncertain taxa, the most specific taxonomic name has been used (e.g. prey reported in a study as "Pachyptila sp." will appear here as "Pachyptila"; "Cephalopods" will appear as "Cephalopoda"). Uncertain species identifications (e.g. "Notothenia rossii?" or "Gymnoscopelus cf. piabilis") have been assigned the genus name (e.g. "Notothenia", "Gymnoscopelus"). Original names have been retained in a separate column to allow future cross-checking. WoRMS identifiers (APHIA_ID numbers) are given where possible.
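For illustration, a single name can be resolved to its AphiaID through the WoRMS REST service; the endpoint shown below is the publicly documented one, but the exact path, parameters, and response handling should be treated as assumptions.

```python
import requests

def aphia_id_by_name(name: str):
    """Look up the WoRMS AphiaID for a scientific name; returns None if no match is found."""
    url = f"https://www.marinespecies.org/rest/AphiaIDByName/{name}"
    resp = requests.get(url, params={"marine_only": "true"}, timeout=30)
    if resp.status_code == 200 and resp.text.strip():
        return int(resp.text)
    return None

# Genus-level lookup, matching the convention of recording "Pachyptila sp." as "Pachyptila".
print(aphia_id_by_name("Pachyptila"))
```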
Grouped prey data in the diet sample table need to be handled with a bit of care. Papers commonly report prey statistics aggregated over groups of prey - e.g. one might give the diet composition by individual cephalopod prey species, and then an overall record for all cephalopod prey. The PREY_IS_AGGREGATE column identifies such records. This allows us to differentiate grouped data like this from unidentified prey items from a certain prey group - for example, an unidentifiable cephalopod record would be entered as Cephalopoda (the scientific name), with "N" in the PREY_IS_AGGREGATE column. A record that groups together a number of cephalopod records, possibly including some unidentifiable cephalopods, would also be entered as Cephalopoda, but with "Y" in the PREY_IS_AGGREGATE column. See the notes on PREY_IS_AGGREGATE, below.
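In practice this usually means dropping aggregate rows before counting prey; a minimal pandas sketch, assuming the diet table has been exported to a CSV with the column names described in the schema below (the file name is hypothetical):

```python
import pandas as pd

diet = pd.read_csv("diet_sample.csv")  # hypothetical export of the diet data table

# Exclude rows that aggregate other rows so prey are not counted twice,
# then count distinct prey taxa per predator sample.
non_aggregate = diet[diet["PREY_IS_AGGREGATE"] == "N"]
prey_richness = (non_aggregate
                 .groupby(["SOURCE_ID", "PREDATOR_SAMPLE_ID"])["PREY_NAME"]
                 .nunique())
print(prey_richness.head())
```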
There are two related R packages that provide data access and functionality for working with these data. See the package home pages for more information: https://github.com/SCAR/sohungry and https://github.com/SCAR/solong.
Data table schemas
Sources data table
SOURCE_ID: The unique identifier of this source
DETAILS: The bibliographic details for this source (e.g. "Hindell M (1988) The diet of the royal penguin Eudyptes schlegeli at Macquarie Island. Emu 88:219–226")
NOTES: Relevant notes about this source – if it’s a published paper, this is probably the abstract
DOI: The DOI of the source (paper or dataset), in the form "10.xxxx/yyyy"
Diet data table
RECORD_ID: The unique identifier of this record
SOURCE_ID: The identifier of the source study from which this record was obtained (see corresponding entry in the sources data table)
SOURCE_DETAILS, SOURCE_DOI: The details and DOI of the source, copied from the sources data table for convenience
ORIGINAL_RECORD_ID: The identifier of this data record in its original source, if it had one
LOCATION: The name of the location at which the data was collected
WEST: The westernmost longitude of the sampling region, in decimal degrees (negative values for western hemisphere longitudes)
EAST: The easternmost longitude of the sampling region, in decimal degrees (negative values for western hemisphere longitudes)
SOUTH: The southernmost latitude of the sampling region, in decimal degrees (negative values for southern hemisphere latitudes)
NORTH: The northernmost latitude of the sampling region, in decimal degrees (negative values for southern hemisphere latitudes)
ALTITUDE_MIN: The minimum altitude of the sampling region, in metres
ALTITUDE_MAX: The maximum altitude of the sampling region, in metres
DEPTH_MIN: The shallowest depth of the sampling, in metres
DEPTH_MAX: The deepest depth of the sampling, in metres
OBSERVATION_DATE_START: The start of the sampling period
OBSERVATION_DATE_END: The end of the sampling period. If sampling was carried out over multiple seasons (e.g. during January of 2002 and January of 2003), this will be the first and last dates (in this example, from 1-Jan-2002 to 31-Jan-2003)
PREDATOR_NAME: The name of the predator. This may differ from predator_name_original if, for example, taxonomy has changed since the original publication, if the original publication had spelling errors or used common (not scientific) names
PREDATOR_NAME_ORIGINAL: The name of the predator, as it appeared in the original source
PREDATOR_APHIA_ID: The numeric identifier of the predator in the WoRMS taxonomic register
PREDATOR_WORMS_RANK, PREDATOR_WORMS_KINGDOM, PREDATOR_WORMS_PHYLUM, PREDATOR_WORMS_CLASS, PREDATOR_WORMS_ORDER, PREDATOR_WORMS_FAMILY, PREDATOR_WORMS_GENUS: The taxonomic details of the predator, from the WoRMS taxonomic register
PREDATOR_GROUP_SOKI: A descriptive label of the group to which the predator belongs (currently used in the Southern Ocean Knowledge and Information wiki, http://soki.aq)
PREDATOR_LIFE_STAGE: Life stage of the predator, e.g. "adult", "chick", "larva", "juvenile". Note that if a food sample was taken from an adult animal, but that food was destined for a juvenile, then the life stage will be "juvenile" (this is common with seabirds feeding chicks)
PREDATOR_BREEDING_STAGE: Stage of the breeding season of the predator, if applicable, e.g. "brooding", "chick rearing", "nonbreeding", "posthatching"
PREDATOR_SEX: Sex of the predator: "male", "female", "both", or "unknown"
PREDATOR_SAMPLE_COUNT: The number of predators for which data are given. If (say) 50 predators were caught but only 20 analysed, this column will contain 20. For scat content studies, this will be the number of scats analysed
PREDATOR_SAMPLE_ID: The identifier of the predator(s). If predators are being reported at the individual level (i.e. PREDATOR_SAMPLE_COUNT = 1) then PREDATOR_SAMPLE_ID is the individual animal ID. Alternatively, if the data values being entered here are from a group of predators, then the PREDATOR_SAMPLE_ID identifies that group of predators. PREDATOR_SAMPLE_ID values are unique within a source (i.e. SOURCE_ID, PREDATOR_SAMPLE_ID pairs are globally unique). Rows with the same SOURCE_ID and PREDATOR_SAMPLE_ID values relate to the same predator individual or group of individuals, and so can be combined (e.g. for prey diversity analyses). Subsamples are indicated by a decimal number S.nnn, where S is the parent PREDATOR_SAMPLE_ID, and nnn (001-999) is the subsample number. Studies will sometimes report detailed prey information for a large sample, but then report prey information for various subsamples of that sample (e.g. broken down by predator sex, or sampling season). In the simplest case, the diet of each predator will be reported only once in the study, and in this scenario the PREDATOR_SAMPLE_ID values will simply be 1 to N (for N predators).
PREDATOR_SIZE_MIN, PREDATOR_SIZE_MAX, PREDATOR_SIZE_MEAN, PREDATOR_SIZE_SD: The minimum, maximum, mean, and standard deviation of the size of the predators in the sample
PREDATOR_SIZE_UNITS: The units of size (e.g. "mm")
PREDATOR_SIZE_NOTES: Notes on the predator size information, including a definition of what the size value represents (e.g. "total length", "standard length")
PREDATOR_MASS_MIN, PREDATOR_MASS_MAX, PREDATOR_MASS_MEAN, PREDATOR_MASS_SD: The minimum, maximum, mean, and standard deviation of the mass of the predators in the sample
PREDATOR_MASS_UNITS: The units of mass (e.g. "g", "kg")
PREDATOR_MASS_NOTES: Notes on the predator mass information, including a definition of what the mass value represents
PREY_NAME: The scientific name of the prey item (corrected, if necessary)
PREY_NAME_ORIGINAL: The name of the prey item, as it appeared in the original source
PREY_APHIA_ID: The numeric identifier of the prey in the WoRMS taxonomic register
PREY_WORMS_RANK, PREY_WORMS_KINGDOM, PREY_WORMS_PHYLUM, PREY_WORMS_CLASS, PREY_WORMS_ORDER, PREY_WORMS_FAMILY, PREY_WORMS_GENUS: The taxonomic details of the prey, from the WoRMS taxonomic register
PREY_GROUP_SOKI: A descriptive label of the group to which the prey belongs (currently used in the Southern Ocean Knowledge and Information wiki, http://soki.aq)
PREY_IS_AGGREGATE: "Y" indicates that this row is an aggregation of other rows in this data source. For example, a study might give a number of individual squid species records, and then an overall squid record that encompasses the individual records. Use the PREY_IS_AGGREGATE information to avoid double-counting during analyses
PREY_LIFE_STAGE: Life stage of the prey (e.g. "adult", "chick", "larva")
PREY_SEX: The sex of the prey ("male", "female", "both", or "unknown"). Note that this is generally "unknown"
PREY_SAMPLE_COUNT: The number of prey individuals from which size and mass measurements were made (note: this is NOT the total number of individuals of
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IncRML resources
This Zenodo dataset contains all the resources of the paper 'IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources' submitted to the Semantic Web Journal's Special Issue on Knowledge Graph Construction. This resource aims to make the paper's experiments fully reproducible through our experiment tool, written in Python, which was previously used in the Knowledge Graph Construction Challenge by the ESWC 2023 Workshop on Knowledge Graph Construction. The exact RMLMapper JAR file (rmlmapper.jar) used to execute the experiments is also provided in this dataset. This JAR file was executed with Java OpenJDK 11.0.20.1 on Ubuntu 22.04.1 LTS (Linux 5.15.0-53-generic). Each experiment was executed 5 times and the median values are reported together with the standard deviation of the measurements.
Datasets
We provide both dataset dumps of the GTFS-Madrid-Benchmark and of real-life use cases from Open Data in Belgium. GTFS-Madrid-Benchmark dumps are used to analyze the impact on execution time and resources, while the real-life use cases aim to verify the approach on different types of datasets, since the GTFS-Madrid-Benchmark is a single type of dataset which does not advertise changes at all.
Benchmarks
GTFS-Madrid-Benchmark: change types with fixed data size and amount of changes: additions-only, modifications-only, deletions-only (11 versions)
GTFS-Madrid-Benchmark: amount of changes with fixed data size: 0%, 25%, 50%, 75%, and 100% changes (11 versions)
GTFS-Madrid-Benchmark: data size with fixed amount of changes: scales 1, 10, 100 (11 versions)
Real-world datasets
Traffic control center Vlaams Verkeerscentrum (Belgium): traffic board messages data (1 day, 28760 versions)
Meteorological institute KMI (Belgium): weather sensor data (1 day, 144 versions)
Public transport agency NMBS (Belgium): train schedule data (1 week, 7 versions)
Public transport agency De Lijn (Belgium): bus schedule data (1 week, 7 versions)
Bike-sharing company BlueBike (Belgium): bike-sharing availability data (1 day, 1440 versions)
Bike-sharing company JCDecaux (EU): bike-sharing availability data (1 day, 1440 versions)
OpenStreetMap (World): geographical map data (1 day, 1440 versions)
Ingestion
The LDES output of the real-world datasets was converted into SPARQL UPDATE queries and executed against Virtuoso to estimate, for non-LDES clients, how incremental generation impacts ingestion into triplestores.
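As a sketch of what that ingestion step can look like for a non-LDES client, the snippet below posts one SPARQL UPDATE to a Virtuoso SPARQL endpoint; the endpoint URL, credentials, graph name, and triple are illustrative assumptions, not the configuration used in the experiments.

```python
import requests
from requests.auth import HTTPDigestAuth

# Deployment-specific assumptions: Virtuoso's authenticated endpoint and demo credentials.
ENDPOINT = "http://localhost:8890/sparql-auth"
AUTH = HTTPDigestAuth("dba", "dba")

# One LDES change (e.g. a new traffic-board message) expressed as a SPARQL UPDATE.
update = """
PREFIX ex: <http://example.org/>
INSERT DATA {
  GRAPH <http://example.org/graph/traffic> {
    ex:boardMessage123 ex:status "ACTIVE" .
  }
}
"""

resp = requests.post(ENDPOINT, data=update,
                     headers={"Content-Type": "application/sparql-update"},
                     auth=AUTH)
resp.raise_for_status()
```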
Remarks
The first version of each dataset is always used as a baseline. All next versions are applied as an update on the existing version. The reported results are only focusing on the updates since these are the actual incremental generation.
GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz datasets are not uploaded as GTFS-Madrid-Benchmark scale 100 because both share the same parameters (50% changes, scale 100). Please use GTFS-Scale-100-{ALL, CHANGE}.tar.xz for GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz
All datasets are compressed with XZ and provided as a TAR archive, be aware that you need sufficient space to decompress these archives! 2 TB of free space is advised to decompress all benchmarks and use cases. The expected output is provided as a ZIP file in each TAR archive, decompressing these requires even more space (4 TB).
Reproducing
By using our experiment tool, you can easily reproduce the experiments as follows:
Download one of the TAR.XZ archives and unpack them.
Clone the GitHub repository of our experiment tool and install the Python dependencies with 'pip install -r requirements.txt'.
Download the rmlmapper.jar JAR file from this Zenodo dataset and place it inside the experiment tool root folder.
Execute the tool by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive --runs=5 run'. The argument '--runs=5' is used to perform the experiment 5 times.
Once executed, you can generate the statistics by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive stats'.
Testcases
Testcases to verify the integration of RML and LDES with IncRML, see https://doi.org/10.5281/zenodo.10171394
Scripts and data acquired at the Mirror Lake Research Site, cited by the article submitted to Water Resources Research: Distributed Acoustic Sensing (DAS) as a Distributed Hydraulic Sensor in Fractured Bedrock. M. W. Becker(1), T. I. Coleman(2), and C. C. Ciervo(1). 1 California State University, Long Beach, Geology Department, 1250 Bellflower Boulevard, Long Beach, California, 90840, USA. 2 Silixa LLC, 3102 W Broadway St, Suite A, Missoula MT 59808, USA. Corresponding author: Matthew W. Becker (matt.becker@csulb.edu).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Japan JP: SPI: Pillar 4 Data Sources Score: Scale 0-100 data was reported at 84.050 NA in 2024. This stayed constant from the previous number of 84.050 NA for 2023. Japan JP: SPI: Pillar 4 Data Sources Score: Scale 0-100 data is updated yearly, averaging 78.317 NA from Mar 2017 (Median) to 2024, with 8 observations. The data reached an all-time high of 84.050 NA in 2024 and a record low of 71.542 NA in 2017. Japan JP: SPI: Pillar 4 Data Sources Score: Scale 0-100 data remains active status in CEIC and is reported by World Bank. The data is categorized under Global Database’s Japan – Table JP.World Bank.WDI: Governance: Policy and Institutions. The data sources overall score is a composite measure of whether countries have data available from the following sources: censuses and surveys, administrative data, geospatial data, and private sector/citizen generated data. The data sources (input) pillar is segmented by four types of sources: (i) those generated by the statistical office (censuses and surveys), and sources accessed from elsewhere, such as (ii) administrative data, (iii) geospatial data, and (iv) private sector data and citizen generated data. The appropriate balance between these source types will vary depending on a country’s institutional setting and the maturity of its statistical system. High scores should reflect the extent to which the sources being utilized enable the necessary statistical indicators to be generated. For example, a low score on environment statistics (in the data production pillar) may reflect a lack of use of (and a low score for) geospatial data (in the data sources pillar). This type of linkage is inherent in the data cycle approach and can help highlight areas for investment required if country needs are to be met.;Statistical Performance Indicators, The World Bank (https://datacatalog.worldbank.org/dataset/statistical-performance-indicators);Weighted average;
This data release is a compilation of construction depth information for 12,383 active and inactive public-supply wells (PSWs) in California from various data sources. Construction data from multiple sources were indexed by the California State Water Resources Control Board Division of Drinking Water (DDW) primary station code (PS Code). Five different data sources were compared with the following priority order: 1, Local sources from select municipalities and water purveyors (Local); 2, Local DDW district data (DDW); 3, The United States Geological Survey (USGS) National Water Information System (NWIS); 4, The California State Water Resources Control Board Groundwater Ambient Monitoring and Assessment Groundwater Information System (SWRCB); and 5, USGS attribution of California Department of Water Resources well completion report data (WCR). For all data sources, the uppermost depth to the well's open or perforated interval was attributed as depth to top of perforations (ToP). The composite depth to bottom of well (Composite BOT) field was attributed from available construction data in the following priority order: 1, Depth to bottom of perforations (BoP); 2, Depth of completed well (Well Depth); 3, Borehole depth (Hole Depth). PSW ToPs and Composite BOTs from each of the five data sources were then compared and summary construction depths for both fields were selected for wells with multiple data sources according to the data-source priority order listed above. Case-by-case modifications to the final selected summary construction depths were made after priority order-based selection to ensure internal logical consistency (for example, ToP must not exceed Composite BOT). This data release contains eight tab-delimited text files. WellConstructionSourceData_Local.txt contains well construction-depth data, Composite BOT data-source attribution, and local agency data-source attribution for the Local data. WellConstructionSourceData_DDW.txt contains well construction-depth data and Composite BOT data-source attribution for the DDW data. WellConstructionSourceData_NWIS.txt contains well construction-depth data, Composite BOT data-source attribution, and USGS site identifiers for the NWIS data. WellConstructionSourceData_SWRCB.txt contains well construction-depth data and Composite BOT data-source attribution for the SWRCB data. WellConstructionSourceData_WCR.txt contains well construction-depth data and Composite BOT data-source attribution for the WCR data. WellConstructionCompilation_ToP.txt contains all ToP data listed by data source. WellConstructionCompilation_BOT.txt contains all Composite BOT data listed by data source. WellConstructionCompilation_Summary.txt contains summary ToP and Composite BOT values for each well with data-source attribution for both construction fields. All construction depths are in units of feet below land surface and are reported to the nearest foot.
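A compact sketch of that priority-order selection, using pandas and toy values (the table layout and PS Codes are illustrative, not the release's actual file format):

```python
import pandas as pd

# Stated priority order of the five construction data sources.
SOURCE_PRIORITY = ["Local", "DDW", "NWIS", "SWRCB", "WCR"]

# Toy table: one ToP column per source, indexed by DDW primary station code (PS Code).
tops = pd.DataFrame({
    "Local": [120.0, None,  None],
    "DDW":   [110.0, 200.0, None],
    "NWIS":  [None,  210.0, 330.0],
    "SWRCB": [None,  None,  340.0],
    "WCR":   [100.0, 190.0, 320.0],
}, index=["PS001", "PS002", "PS003"])

# For each well, take the first non-missing value scanning the columns in priority order.
summary_top = tops[SOURCE_PRIORITY].bfill(axis=1).iloc[:, 0]
print(summary_top)  # PS001: 120 (Local), PS002: 200 (DDW), PS003: 330 (NWIS)
```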