Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Unlike in West Germany, high morbidity and mortality have been observed in East Germany over the last century. The regional population-based Study of Health in Pomerania (SHIP) therefore investigates the long-term progression of sub-clinical findings, their determinants and prognostic values, to acquire knowledge that facilitates early diagnosis and thus helps prevent the progression of disease. SHIP covers various areas of patient health. Each SHIP data set is accompanied by a data dictionary (DD) which provides descriptions of variables and definitions.
This work shows the detailed mapping results of the semantic enrichment of the SHIP-START-4 medical laboratory data dictionary with LOINC codes. It also provides detailed descriptions of the concepts applied in the semantic enrichment. The results are a critical step towards improving the interoperability, and hence the FAIRness, of the SHIP laboratory-related measurements.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data in this dataset were collected through a survey of Latvian society (2021) aimed at identifying high-value data sets for Latvia, i.e. data sets that, in the view of Latvian society, could create value for the Latvian economy and society. The survey was designed for both individuals and businesses. It is being made public both to act as supplementary data for the paper "Towards enrichment of the open government data: a stakeholder-centered determination of High-Value Data sets for Latvia" (author: Anastasija Nikiforova, University of Latvia) and so that other researchers can use these data in their own work.
The survey was distributed among Latvian citizens and organisations. The structure of the survey is available in the supplementary file (see Survey_HighValueDataSets.odt).
Description of the data in this data set: structure of the survey and pre-defined answers (if any)
1. Have you ever used open (government) data? - {(1) yes, once; (2) yes, there has been a little experience; (3) yes, continuously; (4) no, it wasn’t needed for me; (5) no, I have tried but failed}
2. How would you assess the value of open government data that are currently available for your personal use or your business? - 5-point Likert scale, where 1 – any to 5 – very high
3. If you ever used the open (government) data, what was the purpose of using them? - {(1) Have not had to use; (2) to identify the situation for an object or an event (e.g. Covid-19 current state); (3) data-driven decision-making; (4) for the enrichment of my data, i.e. by supplementing them; (5) for better understanding of decisions of the government; (6) awareness of governments’ actions (increasing transparency); (7) forecasting (e.g. trends etc.); (8) for developing data-driven solutions that use only the open data; (9) for developing data-driven solutions, using open data as a supplement to existing data; (10) for training and education purposes; (11) for entertainment; (12) other (open-ended question)}
4. What category(ies) of “high value datasets” is, in your opinion, able to create added value for society or the economy? - {(1) Geospatial data; (2) Earth observation and environment; (3) Meteorological; (4) Statistics; (5) Companies and company ownership; (6) Mobility}
5. To what extent do you think the current data catalogue of Latvia’s Open data portal corresponds to the needs of data users/ consumers? - 10-point Likert scale, where 1 – no data are useful, but 10 – fully correspond, i.e. all potentially valuable datasets are available
6. Which of the current data categories in Latvia’s open data portals, in your opinion, most corresponds to the “high value dataset”? - {(1) Foreign affairs; (2) business economy; (3) energy; (4) citizens and society; (5) education and sport; (6) culture; (7) regions and municipalities; (8) justice, internal affairs and security; (9) transports; (10) public administration; (11) health; (12) environment; (13) agriculture, food and forestry; (14) science and technologies}
7. Which of them form your TOP-3? - {(1) Foreign affairs; (2) business economy; (3) energy; (4) citizens and society; (5) education and sport; (6) culture; (7) regions and municipalities; (8) justice, internal affairs and security; (9) transports; (10) public administration; (11) health; (12) environment; (13) agriculture, food and forestry; (14) science and technologies}
8. How would you assess the value of the following data categories?
8.1. sensor data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
8.2. real-time data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
8.3. geospatial data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
9. What would be these datasets? I.e. what (sub)topic could these data be associated with? - open-ended question
10. Which of the data sets currently available could be valuable and useful for society and businesses? - open-ended question
11. Which of the data sets currently NOT available in Latvia’s open data portal could, in your opinion, be valuable and useful for society and businesses? - open-ended question
12. How did you define them? - {(1) Subjective opinion; (2) experience with data; (3) filtering out the most popular datasets, i.e. basing them on public opinion; (4) other (open-ended question)}
13. How high could the value of these data sets be for you or your business? - 5-point Likert scale, where 1 – not valuable, 5 – highly valuable
14. Do you represent any company/ organization (are you working anywhere)? (if “yes”, please, fill out the survey twice, i.e. as an individual user AND a company representative) - {yes; no; I am an individual data user; other (open-ended)}
15. What industry/ sector does your company/ organization belong to? (if you do not work at the moment, please, choose the last option) - {Information and communication services; Financial and insurance activities; Accommodation and catering services; Education; Real estate operations; Wholesale and retail trade; repair of motor vehicles and motorcycles; transport and storage; construction; water supply; waste water; waste management and recovery; electricity, gas supply, heating and air conditioning; manufacturing industry; mining and quarrying; agriculture, forestry and fisheries; professional, scientific and technical services; operation of administrative and service services; public administration and defence; compulsory social insurance; health and social care; art, entertainment and recreation; activities of households as employers; CSO/NGO; I am not a representative of any company}
16. To which category does your company/ organization belong in terms of its size? - {small; medium; large; self-employed; I am not a representative of any company}
17. What is the age group that you belong to? (if you are an individual user, not a company representative) - {11..15, 16..20, 21..25, 26..30, 31..35, 36..40, 41..45, 46+, “do not want to reveal”}
18. Please, indicate your education or a scientific degree that corresponds most to you? (if you are an individual user, not a company representative) - {master degree; bachelor’s degree; Dr. and/ or PhD; student (bachelor level); student (master level); doctoral candidate; pupil; do not want to reveal these data}
Format of the file .xls, .csv (for the first spreadsheet only), .odt
Licenses or restrictions CC-BY
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The availability of proteomics datasets in the public domain, and in the PRIDE database, in particular, has increased dramatically in recent years. This unprecedented large-scale availability of data provides an opportunity for combined analyses of datasets to get organism-wide protein abundance data in a consistent manner. We have reanalyzed 24 public proteomics datasets from healthy human individuals to assess baseline protein abundance in 31 organs. We defined tissue as a distinct functional or structural region within an organ. Overall, the aggregated dataset contains 67 healthy tissues, corresponding to 3,119 mass spectrometry runs covering 498 samples from 489 individuals. We compared protein abundances between different organs and studied the distribution of proteins across these organs. We also compared the results with data generated in analogous studies. Additionally, we performed gene ontology and pathway-enrichment analyses to identify organ-specific enriched biological processes and pathways. As a key point, we have integrated the protein abundance results into the resource Expression Atlas, where they can be accessed and visualized either individually or together with gene expression data coming from transcriptomics datasets. We believe this is a good mechanism to make proteomics data more accessible for life scientists.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset represents the results of the experimentation of a method for evaluating semantic similarity between concepts in a taxonomy. The method is based on the information-theoretic approach and allows senses of concepts in a given context to be considered. Relevance of senses is calculated in terms of semantic relatedness with the compared concepts. In a previous work [9], the adopted semantic relatedness method was the one described in [10], while in this work we also adopted the ones described in [11], [12], [13], [14], [15], and [16].
We applied our proposal by extending 7 methods for computing semantic similarity in a taxonomy, selected from the literature. The methods considered in the experiment are referred to as R [2], W&P [3], L [4], J&C [5], P&S [6], A [7], and A&M [8].
The experiment was run on the well-known Miller and Charles benchmark dataset [1] for assessing semantic similarity.
The results are organized in seven folders, each with the results related to one of the above semantic relatedness methods. In each folder there is a set of files, each referring to one pair of the Miller and Charles dataset. In fact, for each pair of concepts, all the 28 pairs are considered as possible different contexts.
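As a point of reference for the information-theoretic measures named above, the following minimal sketch computes the standard Resnik (R [2]) and Lin (L [4]) similarities on a toy taxonomy. It shows only the unextended baseline measures, not the sense-aware extension evaluated in this dataset. The taxonomy, the concept probabilities, and all function names are hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "fruit": "entity",
}

# Hypothetical concept probabilities; a concept's probability covers its descendants.
PROB = {"entity": 1.0, "vehicle": 0.5, "car": 0.25, "bicycle": 0.25, "fruit": 0.5}


def ancestors(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def information_content(concept):
    """IC(c) = -log p(c)."""
    return -math.log(PROB[concept])


def least_common_subsumer(a, b):
    """Most specific shared ancestor (first one met when walking up from a)."""
    b_ancestors = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in b_ancestors:
            return candidate
    raise ValueError("concepts share no ancestor")


def resnik(a, b):
    """Resnik similarity: IC of the least common subsumer."""
    return information_content(least_common_subsumer(a, b))


def lin(a, b):
    """Lin similarity: 2 * IC(lcs) / (IC(a) + IC(b))."""
    denominator = information_content(a) + information_content(b)
    if denominator == 0:
        return 0.0  # both concepts carry no information in this toy setup
    return 2 * resnik(a, b) / denominator
```

For example, `lin("car", "bicycle")` uses `vehicle` as the least common subsumer, whereas `car` and `fruit` only share the root, whose information content is zero.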
REFERENCES
[1] Miller G.A., Charles W.G. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1).
[2] Resnik P. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Int. Joint Conf. on Artificial Intelligence, Montreal.
[3] Wu Z., Palmer M. 1994. Verb semantics and lexical selection. 32nd Annual Meeting of the Association for Computational Linguistics.
[4] Lin D. 1998. An Information-Theoretic Definition of Similarity. Int. Conf. on Machine Learning.
[5] Jiang J.J., Conrath D.W. 1997. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Inter. Conf. Research on Computational Linguistics.
[6] Pirrò G. 2009. A Semantic Similarity Metric Combining Features and Intrinsic Information Content. Data Knowl. Eng, 68(11).
[7] Adhikari A., Dutta B., Dutta A., Mondal D., Singh S. 2018. An intrinsic information content-based semantic similarity measure considering the disjoint common subsumers of concepts of an ontology. J. Assoc. Inf. Sci. Technol. 69(8).
[8] Adhikari A., Singh S., Mondal D., Dutta B., Dutta A. 2016. A Novel Information Theoretic Framework for Finding Semantic Similarity in WordNet. CoRR, arXiv:1607.05422, abs/1607.05422.
[9] Formica A., Taglino F. 2021. An Enriched Information-Theoretic Definition of Semantic Similarity in a Taxonomy. IEEE Access, vol. 9.
[10] Information Content-based approach [Schuhmacher and Ponzetto, 2014].
[11] Linked Data Semantic Distance (LDSD) [Passant, 2010].
[12] Wikipedia Link-based Measure (WLM) [Witten and Milne, 2008].
[13] Linked Open Data Description Overlap-based approach (LODDO) [Zhou et al. 2012].
[14] Exclusivity-based [Hulpuş et al. 2015].
[15] ASRMP [El Vaigh et al. 2020].
[16] LDSDGN [Piao and Breslin, 2016].
A corpus-based dictionary, enriched with historical data. The dictionary was not only built on data from the corpus of spoken language that was compiled in the same project, but also on a range of additional sources: data elicited from complementary interviews with young Tunisians and lexical material taken from various published historical sources dating from the middle of the 20th century and earlier. See also: https://hdl.handle.net/11022/0000-0007-C265-C
This feature service was created as part of an effort to provide a single source of truth for accessing City of Chelsea Tax Parcels. This dataset can be used in a variety of ways, including by applications, staff across the city, and citizens, as the authoritative tax parcel dataset enriched with computer-assisted mass appraisal (CAMA) information.
Data Dictionary: Data dictionary for the sole source of authoritative and maintained city parcel dataset.
Layers
- Tax Parcel Addresses (0): Centroids of the tax parcel boundaries.
- Tax Parcel Boundaries (1): Polygon geography of the tax parcels.
- Unjoined_Parcels_NoCAMA (2): Records in this layer represent tax parcels with a geography (polygon), but with no information present in CAMA.
- Rejected_Features (3): Records in this layer represent tax parcels with a geography (polygon), but without attribute information such as a valid Map Parcel ID.
Tables
- Unjoined_CAMA_NoGeometry (100): This table represents the curated data provided from CAMA that was used to join to the tax parcel boundaries for enrichment.
- Latest_CAMA_Input (101): This table represents an export of the latest CAMA information at the time of update.
The Relationship Display extension for CKAN appears to enhance the platform's ability to manage and visualize relationships between datasets or other entities managed within CKAN. While the provided documentation lacks specific details, the extension likely introduces functionalities to define, store, and display connections between different resources to improve data discovery and understanding.
Key Features (Inferred/Potential):
- Relationship Definition: Enables administrators and users to define various types of relationships between datasets, such as "derived from," "related to," or "supersedes".
- Visual Display: Offers a user interface component to visualize these relationships, possibly through graphs, tables, or other interactive elements. This offers a way for users to intuitively comprehend how an entire set of datasets are related.
- Metadata Enrichment: Augments dataset metadata with relationship information, allowing users to find connected datasets based on their relationships.
- Enhanced Data Discovery: Aids in the discovery of related datasets, thus improving data exploration and enabling users to understand dataset context better.
Technical Integration: The extension implements a CKAN plugin, as indicated by the ckan.plugins configuration setting. This ensures that the functionality is integrated within the core CKAN application workflow, including the way datasets can be managed. Thus, this extension modifies/adds to the platform's existing functionality. Specific integration details would require deeper examination of the extension's code, but it likely enhances the existing user interfaces and API endpoints.
Benefits & Impact (Potential): By visualizing and managing relationships between datasets, this extension potentially improves data governance and data traceability. Organizations can also improve dataset understanding and potentially improve collaborative data usage. The enhancement to data discovery means that new insights can be gained and workflows improved.
https://www.usa.gov/government-works
This dataset combines annual files from 2005 to 2017 published by the IRS. ZIP Code data show selected income and tax items classified by State, ZIP Code, and size of adjusted gross income. Data are based on individual income tax returns filed with the IRS. The data include items, such as:
- Number of returns, which approximates the number of households
- Number of personal exemptions, which approximates the population
- Adjusted gross income (AGI)
- Wages and salaries
- Dividends before exclusion
- Interest received
Enrichment and notes:
- the original data sheets (a column per variable, a line per year, zipcode and AGI group) have been transposed to get a record per year, zipcode, AGI group and variable
- the data for Wyoming in 2006 was removed because AGI classes were not correctly defined, making the resulting data unfit for analysis.
- the AGI groups have seen their definitions change: the variable "AGI Class" was used until 2008, with various intervals of AGI; "AGI Stub" replaced it in 2009. We provided the literal intervals (e.g. "$50,000 under $75,000") as "AGI Group" in each case to help the analysis.
- the codes for each tax item have been joined with a dataset of variables to provide full names.
- some tax items are available since 2005, others since more recent years, depending on their introduction date (available in the dataset of variables); as a consequence, the time range of the plots or graphs may vary.
- the unit for amounts and AGIs is a thousand dollars.
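The transposition mentioned in the enrichment notes (from a column per variable to a record per year, zipcode, AGI group and variable) can be sketched as follows. This is only an illustration of the wide-to-long reshaping under assumed names. The field names (`year`, `zipcode`, `agi_group`, `returns`, `agi`), the sample values, and the `wide_to_long` helper are hypothetical and do not come from the published files.

```python
def wide_to_long(rows, id_fields):
    """Turn one row per (year, zipcode, AGI group) into one record per variable."""
    records = []
    for row in rows:
        # Identifier columns are copied onto every output record.
        ids = {field: row[field] for field in id_fields}
        for name, value in row.items():
            if name in id_fields:
                continue
            records.append({**ids, "variable": name, "value": value})
    return records


# Hypothetical sample row; the column names and figures are illustrative only.
wide = [
    {
        "year": 2010,
        "zipcode": "00000",
        "agi_group": "$50,000 under $75,000",
        "returns": 120,
        "agi": 7400,
    },
]

long_records = wide_to_long(wide, ("year", "zipcode", "agi_group"))
```

Each input row yields one output record per non-identifier column, so the single sample row above becomes two records, one for `returns` and one for `agi`.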
https://cdla.io/permissive-1-0
The IBM CSTInsight dataset is a relational database along with schema metadata, useful for benchmarking, evaluation, and demonstration purposes related to relational data analysis, semantic enrichment, and business intelligence use cases.
Contents:
- cstinsight.db – SQLite database file containing the CSTInsight dataset.
- cstinsight_schema_db2.ddl – Database schema definition in IBM DB2 DDL format.
- cstinsight_schema_sqlite.ddl – Database schema definition in SQLite DDL format.
- cstinsight-schema.json – JSON file describing the schema of the CSTInsight database.
- tables/ – Directory containing CSV files with table data.
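Since the dataset ships as a SQLite file, a first look at it could go through Python's built-in `sqlite3` module. The sketch below lists user tables and their row counts. It is demonstrated on an in-memory database with hypothetical tables, because the actual CSTInsight table names are not given here. For the real file, one would presumably connect with `sqlite3.connect("cstinsight.db")` instead.

```python
import sqlite3


def list_tables(connection):
    """Return the names of user tables, skipping SQLite's internal sqlite_* tables."""
    cursor = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    )
    return [row[0] for row in cursor]


def row_counts(connection):
    """Map each user table name to its number of rows."""
    counts = {}
    for table in list_tables(connection):
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + table.replace('"', '""') + '"'
        counts[table] = connection.execute(
            f"SELECT COUNT(*) FROM {quoted}"
        ).fetchone()[0]
    return counts


# Hypothetical in-memory stand-in for the real database file.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
demo.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
demo.execute("INSERT INTO customers (name) VALUES ('example')")
demo_counts = row_counts(demo)
```

The same two helpers should work unchanged on any SQLite connection, which makes them a quick sanity check before comparing the database against the CSV files in `tables/`.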
License:
Data is shared under the Community Data License Agreement (CDLA) 2.0: https://cdla.dev/permissive-2-0/
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
experiments_phyto_12_26_15: cell densities of phytoplankton taxa in initial lake water and in each experimental replicate. Column definitions are provided in ReadMe.txt.
park_lakes_db: 675 dissolved inorganic nitrogen (NO3-N and NH3-N) observations from lakes above 1,200 m sampled between 1988 and 2014 across Mount Rainier, North Cascades, and Olympic National Parks. The database was compiled by Jason Williams. Column definitions and information about data origins and compilation are provided in ReadMe.txt.
Chla: chlorophyll a concentrations in initial lake water and in experimental replicates. Column definitions are provided in ReadMe.txt.
The orleans extension for CKAN enhances dataset metadata by providing the capability to add and manage a dataset's geographical extent. By offering the functionality to define the spatial coverage of a dataset, this extension improves discoverability and usability, particularly for geospatial datasets. This enhanced metadata can aid users in understanding the geographic scope of the data before accessing it, improving data selection and application.
Key Features:
- Dataset Extent Management: Enables administrators and data publishers to define the spatial extent, or geographic boundaries, of their datasets. This functionality ensures that users can easily understand the geographic coverage of the data being provided.
- Spatial Metadata Enrichment: Integrates spatial metadata directly into CKAN's dataset descriptions to improve the searchability and understanding of geospatial data holdings within the catalog.
- Geospatial Context: Adds valuable geospatial context to datasets, thereby improving the overall quality of metadata and allowing users to filter searches based on geographical coverage.
Technical Integration: Although the readme provides limited details regarding the exact technical integration, the orleans extension likely leverages CKAN's plugin architecture to introduce new fields or sections in the dataset editing form to accommodate the extent information.
Benefits & Impact: Implementing the orleans extension can significantly improve the discoverability of spatially referenced datasets within a CKAN catalog, giving users an easy way to assess the geographical relevance of a dataset before spending time downloading and processing it. Overall, adding dataset extent data through this extension enhances the utility of CKAN as a geospatial data catalog.
The Tax Parcel Boundaries feature layer was created as part of an effort to provide a single source of truth for accessing City of Chelsea Tax Parcels. This dataset can be used in a variety of ways, including by applications, staff across the city, and citizens, as the authoritative tax parcel dataset enriched with computer-assisted mass appraisal (CAMA) information.
Data Dictionary: Data dictionary for the sole source of authoritative and maintained city parcel dataset.
Layers
- Tax Parcel Addresses (0): Centroids of the tax parcel boundaries.
- Tax Parcel Boundaries (1): Polygon geography of the tax parcels.
- Unjoined_Parcels_NoCAMA (2): Records in this layer represent tax parcels with a geography (polygon), but with no information present in CAMA.
- Rejected_Features (3): Records in this layer represent tax parcels with a geography (polygon), but without attribute information such as a valid Map Parcel ID.
Tables
- Unjoined_CAMA_NoGeometry (100): This table represents the curated data provided from CAMA that was used to join to the tax parcel boundaries for enrichment.
- Latest_CAMA_Input (101): This table represents an export of the latest CAMA information at the time of update.
The Extrafield extension for CKAN likely enables users to add custom fields to CKAN entities, such as datasets or organizations. This enhances the flexibility of CKAN, allowing users to store and manage additional information specific to their data and organizational needs. The extension likely provides a straightforward way to extend the metadata schema without requiring extensive code modifications for core CKAN components.
Key Features (Inferred):
- Custom Field Addition: Allows administrators to define and add extra fields to datasets, organizations, or other CKAN entities.
- Metadata Enrichment: Provides the means to extend existing metadata options, enabling thorough data handling and descriptive accuracy.
- User Interface Integration: Integrates into the CKAN user interface, likely allowing users to input and view extra field data through the platform’s web portal.
- Configurable: Provides configuration options for field types using forms, ensuring the fields fit project requirements.
- Extensible Metadata Schema: Enables the addition of structured data using custom fields to the standard schema, allowing the schema to evolve to the real-world needs of any environment.
Technical Integration (Inferred): The Extrafield extension probably integrates with CKAN through plugins, which extend the existing CKAN data model and user interface elements. Configuration would typically involve editing CKAN's configuration file to register the plugin and define the available extra fields and their data types. Changes may necessitate CKAN's database schema to be updated to accommodate new data fields.
Benefits & Impact (Inferred): By using the Extrafield extension, users will gain the ability to capture and make available rich, domain-specific information inside their CKAN instance. This results in better data discovery, higher data quality, and improved overall experience across operations.
Collection of single miRNAs that regulate pathways, gene ontologies and other categories, complementing the available miRNA target enrichment programs, which are tailored for miRNA sets. New dictionary on microRNAs and target pathways. Database that augments available target pathway web servers by giving researchers access to information on which pathways are regulated by a miRNA, which miRNAs target a pathway, and how specific these regulations are.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The detection of founder pathogenic variants, those observed in high frequency only in a group of individuals with increased inter-relatedness, can help improve delivery of health care for that community. We identified 16 groups with shared ancestry, based on genomic segments that are shared through identity by descent (IBD), in New York City using the genomic data of 25,366 residents from the All Of Us Research Program and the Mount Sinai BioMe biobank. From these groups we defined 7 as founder populations, mostly communities currently under-represented in medical genomics research, such as Puerto Rican and Garifuna. The enrichment analysis of ClinVar pathogenic or likely pathogenic (P/LP) variants in each group identified 201 of these damaging variants across the seven founder populations. We confirmed disease-causing variants previously reported to occur at increased frequencies in Ashkenazi Jewish and Puerto Rican genetic ancestry groups, but most of the damaging variants identified have not been previously associated with any such founder populations, and most of these founder populations have not been described to have increased prevalence of the associated rare disease. Twenty-two of 47 variants meeting Tier 2 prenatal screening criteria (1/100 carrier frequency within these founder groups) have never previously been reported. We show how population structure studies can provide insights into rare diseases disproportionately affecting under-represented founder populations, delivering a health care benefit but also a potential source of stigmatization of these communities, who should be part of the decision-making about implementation into health care delivery.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sources used for this study.