Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the experimental results for a method that evaluates semantic similarity between concepts in a taxonomy. The method follows the information-theoretic approach and takes into account the senses of concepts in a given context. The relevance of each sense is computed as its semantic relatedness to the compared concepts. In a previous work [9], the semantic relatedness method adopted was the one described in [10]; in this work we also adopt those described in [11], [12], [13], [14], [15], and [16].
We applied our proposal by extending 7 methods for computing semantic similarity in a taxonomy, selected from the literature. The methods considered in the experiment are referred to as R [2], W&P [3], L [4], J&C [5], P&S [6], A [7], and A&M [8].
The experiment was run on the well-known Miller and Charles benchmark dataset [1] for assessing semantic similarity.
The results are organized in seven folders, each containing the results for one of the semantic relatedness methods above. Each folder holds a set of files, one per pair of the Miller and Charles dataset, because for each pair of concepts all 28 pairs are considered as possible contexts.
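As background for the information-theoretic approach named above, the following is a minimal sketch of two of the baseline measures, Resnik [2] and Lin [4], on a toy taxonomy. The taxonomy, the concept probabilities, and all function names are hypothetical and chosen only for illustration; the sketch does not reproduce the sense-weighted extension evaluated in this dataset.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "fruit": "entity",
}

# Hypothetical concept probabilities p(c). A subsumer is at least as
# probable as any concept it subsumes, and the root has probability 1.
PROB = {
    "entity": 1.0,
    "vehicle": 0.25,
    "car": 0.125,
    "bicycle": 0.0625,
    "fruit": 0.5,
}


def ancestors(concept):
    """Return the concept followed by all its subsumers up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def information_content(concept):
    """IC(c) = -log p(c), written as log(1 / p(c))."""
    return math.log(1.0 / PROB[concept])


def resnik(c1, c2):
    """Resnik similarity: highest IC among the common subsumers."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c) for c in common)


def lin(c1, c2):
    """Lin similarity: 2 * IC(best common subsumer) / (IC(c1) + IC(c2))."""
    denominator = information_content(c1) + information_content(c2)
    # Two root concepts both have IC 0; treat them as identical.
    return 2 * resnik(c1, c2) / denominator if denominator else 1.0


if __name__ == "__main__":
    print(resnik("car", "bicycle"), lin("car", "bicycle"))
    print(resnik("car", "fruit"), lin("car", "fruit"))
```

With these toy probabilities, "car" and "bicycle" share the subsumer "vehicle" and receive a positive score, while "car" and "fruit" share only the root, whose information content is zero.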
REFERENCES
[1] Miller G.A., Charles W.G. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1).
[2] Resnik P. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Int. Joint Conf. on Artificial Intelligence, Montreal.
[3] Wu Z., Palmer M. 1994. Verb semantics and lexical selection. 32nd Annual Meeting of the Association for Computational Linguistics.
[4] Lin D. 1998. An Information-Theoretic Definition of Similarity. Int. Conf. on Machine Learning.
[5] Jiang J.J., Conrath D.W. 1997. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Int. Conf. Research on Computational Linguistics.
[6] Pirrò G. 2009. A Semantic Similarity Metric Combining Features and Intrinsic Information Content. Data Knowl. Eng. 68(11).
[7] Adhikari A., Dutta B., Dutta A., Mondal D., Singh S. 2018. An intrinsic information content-based semantic similarity measure considering the disjoint common subsumers of concepts of an ontology. J. Assoc. Inf. Sci. Technol. 69(8).
[8] Adhikari A., Singh S., Mondal D., Dutta B., Dutta A. 2016. A Novel Information Theoretic Framework for Finding Semantic Similarity in WordNet. CoRR, abs/1607.05422 (arXiv:1607.05422).
[9] Formica A., Taglino F. 2021. An Enriched Information-Theoretic Definition of Semantic Similarity in a Taxonomy. IEEE Access, vol. 9.
[10] Information Content-based approach [Schuhmacher and Ponzetto, 2014].
[11] Linked Data Semantic Distance (LDSD) [Passant, 2010].
[12] Wikipedia Link-based Measure (WLM) [Witten and Milne, 2008].
[13] Linked Open Data Description Overlap-based approach (LODDO) [Zhou et al., 2012].
[14] Exclusivity-based [Hulpuş et al., 2015].
[15] ASRMP [El Vaigh et al., 2020].
[16] LDSDGN [Piao and Breslin, 2016].
https://www.usa.gov/government-works
This dataset combines the annual files from 2005 to 2017 published by the IRS. ZIP Code data show selected income and tax items classified by State, ZIP Code, and size of adjusted gross income. Data are based on individual income tax returns filed with the IRS. The data include items such as:
- Number of returns, which approximates the number of households
- Number of personal exemptions, which approximates the population
- Adjusted gross income (AGI)
- Wages and salaries
- Dividends before exclusion
- Interest received
Enrichment and notes:
- The original data sheets (a column per variable, a line per year, ZIP Code and AGI group) have been transposed to give one record per year, ZIP Code, AGI group and variable.
- The data for Wyoming in 2006 were removed because the AGI classes were not correctly defined, making the resulting data unfit for analysis.
- The definitions of the AGI groups have changed over time: the variable "AGI Class" was used until 2008, with various intervals of AGI; "AGI Stub" replaced it in 2009. The literal intervals (e.g. "$50,000 under $75,000") are provided as "AGI Group" in each case to help the analysis.
- The codes for each tax item have been joined with a dataset of variables to provide full names.
- Some tax items are available since 2005, others only since more recent years, depending on their introduction date (available in the dataset of variables); as a consequence, the time range of the plots or graphs may vary.
- The unit for amounts and AGIs is a thousand dollars.
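The transposition described in the notes (from one column per variable to one record per year, ZIP Code, AGI group and variable) can be sketched as follows. This is only an illustration: the column names, item codes and values are hypothetical and are not taken from the published files.

```python
# Hypothetical wide rows: one line per year, ZIP Code and AGI group,
# with one column per tax item (codes and values are illustrative).
wide_rows = [
    {"year": 2016, "zipcode": "35004", "agi_group": "$50,000 under $75,000",
     "N1": 1200, "A00100": 73500},
    {"year": 2017, "zipcode": "35004", "agi_group": "$50,000 under $75,000",
     "N1": 1250, "A00100": 77100},
]

# Columns that identify a record; every other column is a tax item.
KEY_COLUMNS = ("year", "zipcode", "agi_group")


def to_long(rows, key_columns=KEY_COLUMNS):
    """Turn each wide row into one record per tax-item column."""
    long_rows = []
    for row in rows:
        keys = {name: row[name] for name in key_columns}
        for variable, value in row.items():
            if variable in key_columns:
                continue
            long_rows.append({**keys, "variable": variable, "value": value})
    return long_rows


long_rows = to_long(wide_rows)

if __name__ == "__main__":
    for record in long_rows:
        print(record)
```

Each wide row with two item columns yields two long records. Amounts stay in thousands of dollars, as in the source.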
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data in this dataset were collected through a 2021 survey of Latvian society aimed at identifying high-value data sets for Latvia, i.e. data sets that, in the view of Latvian society, could create value for the Latvian economy and society. The survey was designed for both individuals and businesses. It is being made public both to act as supplementary data for the paper "Towards enrichment of the open government data: a stakeholder-centered determination of High-Value Data sets for Latvia" (author: Anastasija Nikiforova, University of Latvia) and so that other researchers can use these data in their own work.
The survey was distributed among Latvian citizens and organisations. The structure of the survey is available in the supplementary file (see Survey_HighValueDataSets.odt).
Description of the data in this data set: structure of the survey and pre-defined answers (if any)
1. Have you ever used open (government) data? - {(1) yes, once; (2) yes, there has been a little experience; (3) yes, continuously; (4) no, it wasn’t needed for me; (5) no, have tried but failed}
2. How would you assess the value of open government data that are currently available for your personal use or your business? - 5-point Likert scale, where 1 – any to 5 – very high
3. If you ever used the open (government) data, what was the purpose of using them? - {(1) Have not had to use; (2) to identify the situation for an object or an event (e.g. Covid-19 current state); (3) data-driven decision-making; (4) for the enrichment of my data, i.e. by supplementing them; (5) for better understanding of decisions of the government; (6) awareness of governments’ actions (increasing transparency); (7) forecasting (e.g. trends etc.); (8) for developing data-driven solutions that use only the open data; (9) for developing data-driven solutions, using open data as a supplement to existing data; (10) for training and education purposes; (11) for entertainment; (12) other (open-ended question)}
4. What category(ies) of “high value datasets” is, in your opinion, able to create added value for society or the economy? - {(1) Geospatial data; (2) Earth observation and environment; (3) Meteorological; (4) Statistics; (5) Companies and company ownership; (6) Mobility}
5. To what extent do you think the current data catalogue of Latvia’s Open data portal corresponds to the needs of data users/ consumers? - 10-point Likert scale, where 1 – no data are useful, but 10 – fully correspond, i.e. all potentially valuable datasets are available
6. Which of the current data categories in Latvia’s open data portals, in your opinion, most corresponds to the “high value dataset”? - {(1) Foreign affairs; (2) business economy; (3) energy; (4) citizens and society; (5) education and sport; (6) culture; (7) regions and municipalities; (8) justice, internal affairs and security; (9) transports; (10) public administration; (11) health; (12) environment; (13) agriculture, food and forestry; (14) science and technologies}
7. Which of them form your TOP-3? - {(1) Foreign affairs; (2) business economy; (3) energy; (4) citizens and society; (5) education and sport; (6) culture; (7) regions and municipalities; (8) justice, internal affairs and security; (9) transports; (10) public administration; (11) health; (12) environment; (13) agriculture, food and forestry; (14) science and technologies}
8. How would you assess the value of the following data categories?
8.1. sensor data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
8.2. real-time data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
8.3. geospatial data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
9. What would be these datasets? I.e. what (sub)topic could these data be associated with? - open-ended question
10. Which of the data sets currently available could be valuable and useful for society and businesses? - open-ended question
11. Which of the data sets currently NOT available in Latvia’s open data portal could, in your opinion, be valuable and useful for society and businesses? - open-ended question
12. How did you define them? - {(1) Subjective opinion; (2) experience with data; (3) filtering out the most popular datasets, i.e. based on public opinion; (4) other (open-ended question)}
13. How high could the value of these data sets be for you or your business? - 5-point Likert scale, where 1 – not valuable, 5 – highly valuable
14. Do you represent any company/ organization (are you working anywhere)? (if “yes”, please fill out the survey twice, i.e. as an individual user AND a company representative) - {yes; no; I am an individual data user; other (open-ended)}
15. What industry/ sector does your company/ organization belong to? (if you do not work at the moment, please choose the last option) - {Information and communication services; Financial and insurance activities; Accommodation and catering services; Education; Real estate operations; Wholesale and retail trade; repair of motor vehicles and motorcycles; transport and storage; construction; water supply; waste water; waste management and recovery; electricity, gas supply, heating and air conditioning; manufacturing industry; mining and quarrying; agriculture, forestry and fisheries; professional, scientific and technical services; operation of administrative and service services; public administration and defence; compulsory social insurance; health and social care; art, entertainment and recreation; activities of households as employers; CSO/NGO; I am not a representative of any company}
16. To which category does your company/ organization belong in terms of its size? - {small; medium; large; self-employed; I am not a representative of any company}
17. What is the age group that you belong to? (if you are an individual user, not a company representative) - {11..15, 16..20, 21..25, 26..30, 31..35, 36..40, 41..45, 46+, “do not want to reveal”}
18. Please indicate the education or scientific degree that corresponds most to you (if you are an individual user, not a company representative) - {master’s degree; bachelor’s degree; Dr. and/ or PhD; student (bachelor level); student (master level); doctoral candidate; pupil; do not want to reveal these data}
Format of the file: .xls, .csv (for the first spreadsheet only), .odt
Licenses or restrictions: CC-BY
The Tax Parcel Boundaries feature layer was created as part of an effort to provide a single source of truth for accessing City of Chelsea Tax Parcels. This dataset can be used in a variety of ways, including by applications, staff across the city, and citizens, as the authoritative tax parcel dataset enriched with computer-assisted mass appraisal (CAMA) information.
Data Dictionary: Data dictionary for the sole source of authoritative and maintained city parcel dataset.
Layers
- Tax Parcel Addresses (0): Centroids of the tax parcel boundaries.
- Tax Parcel Boundaries (1): Polygon geography of the tax parcels.
- Unjoined_Parcels_NoCAMA (2): Records in this layer represent tax parcels with a geography (polygon) but with no information present in CAMA.
- Rejected_Features (3): Records in this layer represent tax parcels with a geography (polygon) but without attribute information such as a valid Map Parcel ID.
Tables
- Unjoined_CAMA_NoGeometry (100): This table represents the curated data provided from CAMA that was used to join to the tax parcel boundaries for enrichment.
- Latest_CAMA_Input (101): This table represents an export of the latest CAMA information at the time of update.
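The layer and table names suggest a join between parcel polygons and CAMA records on the Map Parcel ID. The sketch below shows one way such a partition could work. It is an assumption for illustration, not the City's actual process: the field names, the rule that a "valid" ID is simply a non-empty one, and the reading of Unjoined_CAMA_NoGeometry as "CAMA records that matched no polygon" are all hypothetical.

```python
def partition_parcels(parcels, cama_records):
    """Split parcels and CAMA records into joined, unjoined and rejected groups."""
    cama_by_id = {rec["map_parcel_id"]: rec for rec in cama_records}
    joined, no_cama, rejected = [], [], []
    matched_ids = set()

    for parcel in parcels:
        parcel_id = parcel.get("map_parcel_id")
        if not parcel_id:
            # Polygon without a usable Map Parcel ID.
            rejected.append(parcel)
        elif parcel_id in cama_by_id:
            # Polygon enriched with the matching CAMA attributes.
            joined.append({**parcel, **cama_by_id[parcel_id]})
            matched_ids.add(parcel_id)
        else:
            # Polygon with an ID that CAMA does not know.
            no_cama.append(parcel)

    # CAMA records that never matched a polygon.
    no_geometry = [
        rec for rec in cama_records if rec["map_parcel_id"] not in matched_ids
    ]
    return joined, no_cama, rejected, no_geometry


# Hypothetical sample data.
parcels = [
    {"map_parcel_id": "10-1", "geometry": "POLYGON A"},
    {"map_parcel_id": "10-2", "geometry": "POLYGON B"},
    {"map_parcel_id": None, "geometry": "POLYGON C"},
]
cama_records = [
    {"map_parcel_id": "10-1", "assessed_value": 250000},
    {"map_parcel_id": "99-9", "assessed_value": 90000},
]

joined, no_cama, rejected, no_geometry = partition_parcels(parcels, cama_records)
```

In this sample, one parcel is enriched with CAMA attributes, one has no CAMA match, one is rejected for lacking an ID, and one CAMA record has no polygon.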
This feature service was created as part of an effort to provide a single source of truth for accessing City of Chelsea Tax Parcels. This dataset can be used in a variety of ways, including by applications, staff across the city, and citizens, as the authoritative tax parcel dataset enriched with computer-assisted mass appraisal (CAMA) information.
Data Dictionary: Data dictionary for the sole source of authoritative and maintained city parcel dataset.
Layers
- Tax Parcel Addresses (0): Centroids of the tax parcel boundaries.
- Tax Parcel Boundaries (1): Polygon geography of the tax parcels.
- Unjoined_Parcels_NoCAMA (2): Records in this layer represent tax parcels with a geography (polygon) but with no information present in CAMA.
- Rejected_Features (3): Records in this layer represent tax parcels with a geography (polygon) but without attribute information such as a valid Map Parcel ID.
Tables
- Unjoined_CAMA_NoGeometry (100): This table represents the curated data provided from CAMA that was used to join to the tax parcel boundaries for enrichment.
- Latest_CAMA_Input (101): This table represents an export of the latest CAMA information at the time of update.
The Relationship Display extension for CKAN appears to enhance the platform's ability to manage and visualize relationships between datasets or other entities managed within CKAN. While the provided documentation lacks specific details, the extension likely introduces functionality to define, store, and display connections between different resources to improve data discovery and understanding.
Key Features (Inferred/Potential):
- Relationship Definition: Enables administrators and users to define various types of relationships between datasets, such as "derived from," "related to," or "supersedes".
- Visual Display: Offers a user interface component to visualize these relationships, possibly through graphs, tables, or other interactive elements, so that users can see how a whole set of datasets is related.
- Metadata Enrichment: Augments dataset metadata with relationship information, allowing users to find connected datasets based on their relationships.
- Enhanced Data Discovery: Aids in the discovery of related datasets, improving data exploration and helping users understand dataset context.
Technical Integration: The extension implements a CKAN plugin, as indicated by the ckan.plugins configuration setting. The functionality is therefore integrated into the core CKAN application workflow, including the way datasets are managed, and modifies or adds to the platform's existing functionality. Specific integration details would require deeper examination of the extension's code, but it likely enhances the existing user interfaces and API endpoints.
Benefits & Impact (Potential): By visualizing and managing relationships between datasets, this extension potentially improves data governance and data traceability. Organizations can also improve dataset understanding and, potentially, collaborative data usage. Better data discovery can in turn lead to new insights and improved workflows.
The Extrafield extension for CKAN likely enables users to add custom fields to CKAN entities, such as datasets or organizations. This makes CKAN more flexible, allowing users to store and manage additional information specific to their data and organizational needs. The extension likely provides a straightforward way to extend the metadata schema without extensive code modifications to core CKAN components.
Key Features (Inferred):
- Custom Field Addition: Allows administrators to define and add extra fields to datasets, organizations, or other CKAN entities.
- Metadata Enrichment: Extends the existing metadata options so that data can be described more completely and accurately.
- User Interface Integration: Integrates into the CKAN user interface, likely allowing users to input and view extra field data through the platform’s web portal.
- Configurable: Provides configuration options for field types using forms, so that the fields fit project requirements.
- Extensible Metadata Schema: Enables the addition of structured data to the standard schema through custom fields, allowing the schema to evolve with the needs of a given environment.
Technical Integration (Inferred): The Extrafield extension probably integrates with CKAN through plugins, which extend the existing CKAN data model and user interface elements. Configuration would typically involve editing CKAN's configuration file to register the plugin and define the available extra fields and their data types. Changes may require CKAN's database schema to be updated to accommodate the new fields.
Benefits & Impact (Inferred): With the Extrafield extension, users can capture and publish rich, domain-specific information inside their CKAN instance. This should result in better data discovery, higher data quality, and an improved overall experience.
The orleans extension for CKAN enhances dataset metadata by providing the capability to add and manage a dataset's geographical extent. By letting publishers define the spatial coverage of a dataset, the extension improves discoverability and usability, particularly for geospatial datasets. This metadata helps users understand the geographic scope of the data before accessing it, improving data selection and application.
Key Features:
- Dataset Extent Management: Enables administrators and data publishers to define the spatial extent, or geographic boundaries, of their datasets, so that users can easily understand the geographic coverage of the data provided.
- Spatial Metadata Enrichment: Integrates spatial metadata directly into CKAN's dataset descriptions to improve the searchability and understanding of geospatial data holdings within the catalog.
- Geospatial Context: Adds geospatial context to datasets, improving the overall quality of metadata and allowing users to filter searches by geographical coverage.
Technical Integration: Although the readme provides limited details on the exact technical integration, the orleans extension likely uses CKAN's plugin architecture to introduce new fields or sections in the dataset editing form to hold the extent information.
Benefits & Impact: Implementing the orleans extension can significantly improve the discoverability of spatially referenced datasets within a CKAN catalog, giving users an easy way to assess the geographical relevance of a dataset before spending time downloading and processing it. Overall, adding dataset extent data through this extension enhances the utility of CKAN as a geospatial data catalog.
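To make the idea of a dataset extent concrete, here is a minimal sketch that represents an extent as a GeoJSON polygon built from a bounding box and checks whether it overlaps a search area. The readme does not say how the extension stores extents, so the GeoJSON representation is an assumption, and the function names and coordinates are hypothetical.

```python
import json


def bbox_to_geojson(min_lon, min_lat, max_lon, max_lat):
    """Return a GeoJSON Polygon (closed, counter-clockwise ring) for a box."""
    ring = [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],  # repeat the first point to close the ring
    ]
    return {"type": "Polygon", "coordinates": [ring]}


def bboxes_intersect(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap or touch.

    Boxes that cross the antimeridian are not handled in this sketch.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical extent of a dataset and two hypothetical search areas.
dataset_extent = (1.80, 47.80, 2.05, 47.95)
nearby_search = (1.90, 47.85, 2.50, 48.20)
distant_search = (10.0, 50.0, 11.0, 51.0)

extent_geojson = bbox_to_geojson(*dataset_extent)

if __name__ == "__main__":
    print(json.dumps(extent_geojson))
    print(bboxes_intersect(dataset_extent, nearby_search))
    print(bboxes_intersect(dataset_extent, distant_search))
```

A catalog search restricted to an area would keep only the datasets whose extent passes a test of this kind.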
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of differentially expressed genes upon MLL and SETD1A knockdown for GO term—Cell Cycle. (XLS)