Empower your machine learning models with a curated dataset of Warehouses, Distribution Centers, Fulfillment Centers, and Logistics Centers in the USA.
Featured attributes of the data:
- Visualization data: Number of devices detected, time spent, income level, possible workers, trucks, among others.
- Structure data: Built areas, parking lot areas, POIs.
- Activity data: Commercial/industrial activities at each address.
- Device count data: Device count for every location.
- Visitation data: Visitation data for each address.
This dataset has proven successful in enriching machine learning models for defining POI/land parcel activity, sales prediction, site selection, red-flag systems, procurement, and commercial/industrial activity measurement, among others.
The dataset includes more than 54 attributes for each of its more than 5 million addresses. Some of them are:
How have our clients used this dataset?
Cold Storage Company:
- Data requirement: Our client needed more data about company locations that provided cold storage solutions, as part of their sales & marketing strategy.
- Solution: They ingested the US address dataset in their ML model to enhance the process of identifying all facilities within the US with cold storage solutions (from specialized cold storage facilities to distribution centers that handle refrigerated products).
The DCAT extension for CKAN enhances data portals by enabling the exposure and consumption of metadata using the DCAT vocabulary, facilitating interoperability with other data catalogs. It provides tools for serializing CKAN datasets as RDF documents and harvesting RDF data from external sources, promoting data sharing and reuse. The extension supports various DCAT Application Profiles, and includes features for adapting schemas, validating data, and integrating with search engines like Google Dataset Search. Key Features: DCAT Schemas: Offers pre-built CKAN schemas for common Application Profiles (DCAT AP v1, v2, and v3), which can be customized to align with site-specific requirements. These schemas include tailored form fields and validation rules to ensure DCAT compatibility. DCAT Endpoints: Exposes catalog datasets in different RDF serializations, allowing external systems to easily consume CKAN metadata in a standardized format. RDF Harvester: Enables the import of RDF serializations from other catalogs, automatically creating CKAN datasets based on the harvested metadata. This promotes data aggregation and discovery across different data sources. DCAT-CKAN Mapping: Establishes a base mapping between DCAT and CKAN datasets, facilitating bidirectional transformation of metadata. The mapping is compatible with DCAT-AP v1.1, v2.1, and v3. RDF Parser and Serializer: Includes an RDF parser for extracting CKAN dataset dictionaries from RDF serializations and an RDF serializer for transforming CKAN dataset metadata into different semantic formats. Both components are customizable through profiles. Command Line Interface (CLI): Provides a command-line interface for managing and interacting with the extension's features, such as harvesting and data transformation tasks. Google Dataset Search Integration: Offers support for indexing datasets in Google Dataset Search, improving the visibility of CKAN datasets to a wider audience. 
Technical Integration: The ckanext-dcat extension extends CKAN's functionality by adding new plugins for RDF harvesting and serialization, allowing users to expose and consume DCAT metadata through the portal and enabling dataset enrichment from external sources. This integration can be customized through profiles that define custom data mappings. Benefits & Impact: By implementing the DCAT extension, CKAN-based data portals can significantly improve their interoperability with other data catalogs and data repositories that support DCAT. This facilitates data sharing, reuse, and discovery, as well as improves the visibility of datasets through indexing in services like Google Dataset Search. The extension's built-in schemas and validation rules ensure that CKAN metadata conforms to DCAT standards, while the RDF harvester simplifies the process of importing data from external sources. Funded by organizations like the Government of Sweden, Vinnova, and FIWARE, the extension has been developed for production use cases and promotes a data-driven ecosystem.
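The idea of serializing a CKAN dataset as DCAT metadata can be illustrated with a small sketch. This is not the ckanext-dcat implementation or its actual mapping: the function name `to_dcat_jsonld`, the sample dataset, and the `base_uri` are hypothetical, and only a handful of well-known DCAT and Dublin Core terms are mapped.

```python
import json

# Hypothetical CKAN-style dataset dictionary (illustrative values only).
ckan_dataset = {
    "name": "example-dataset",
    "title": "Example dataset",
    "notes": "A short description.",
    "tags": [{"name": "transport"}, {"name": "open-data"}],
    "resources": [
        {"url": "https://example.org/data.csv", "format": "CSV", "name": "CSV export"},
    ],
}


def to_dcat_jsonld(dataset, base_uri="https://example.org/dataset/"):
    """Map a CKAN-style dict to a minimal DCAT description in JSON-LD.

    The field-to-term mapping here is an assumption made for illustration.
    """
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": base_uri + dataset["name"],
        "@type": "dcat:Dataset",
        "dct:title": dataset.get("title", ""),
        "dct:description": dataset.get("notes", ""),
        "dcat:keyword": [tag["name"] for tag in dataset.get("tags", [])],
        # Each CKAN resource becomes one dcat:Distribution.
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dct:title": res.get("name", ""),
                "dcat:accessURL": {"@id": res["url"]},
                "dct:format": res.get("format", ""),
            }
            for res in dataset.get("resources", [])
        ],
    }


doc = to_dcat_jsonld(ckan_dataset)
print(json.dumps(doc, indent=2))
```

A harvester would perform the reverse step, reading such a description and building a dataset dictionary from it.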
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data in this dataset were collected through a survey of Latvian society (2021) aimed at identifying high-value data sets for Latvia, i.e. data sets that, in the view of Latvian society, could create value for the Latvian economy and society. The survey was designed for both individuals and businesses. It is being made public both to act as supplementary data for the paper "Towards enrichment of the open government data: a stakeholder-centered determination of High-Value Data sets for Latvia" (author: Anastasija Nikiforova, University of Latvia) and so that other researchers can use these data in their own work.
The survey was distributed among Latvian citizens and organisations. The structure of the survey is available in the supplementary file (see Survey_HighValueDataSets.odt).
Description of the data in this data set: structure of the survey and pre-defined answers (if any)
1. Have you ever used open (government) data? - {(1) yes, once; (2) yes, there has been a little experience; (3) yes, continuously; (4) no, it wasn’t needed for me; (5) no, have tried but failed}
2. How would you assess the value of open government data that are currently available for your personal use or your business? - 5-point Likert scale, where 1 – any to 5 – very high
3. If you ever used the open (government) data, what was the purpose of using them? - {(1) Have not had to use; (2) to identify the situation for an object or an event (e.g. Covid-19 current state); (3) data-driven decision-making; (4) for the enrichment of my data, i.e. by supplementing them; (5) for better understanding of decisions of the government; (6) awareness of governments’ actions (increasing transparency); (7) forecasting (e.g. trends etc.); (8) for developing data-driven solutions that use only the open data; (9) for developing data-driven solutions, using open data as a supplement to existing data; (10) for training and education purposes; (11) for entertainment; (12) other (open-ended question)}
4. What category(ies) of “high value datasets” is, in your opinion, able to create added value for society or the economy? - {(1) Geospatial data; (2) Earth observation and environment; (3) Meteorological; (4) Statistics; (5) Companies and company ownership; (6) Mobility}
5. To what extent do you think the current data catalogue of Latvia’s Open data portal corresponds to the needs of data users/consumers? - 10-point Likert scale, where 1 – no data are useful, but 10 – fully correspond, i.e. all potentially valuable datasets are available
6. Which of the current data categories in Latvia’s open data portals, in your opinion, most corresponds to the “high value dataset”? - {(1) Foreign affairs; (2) business economy; (3) energy; (4) citizens and society; (5) education and sport; (6) culture; (7) regions and municipalities; (8) justice, internal affairs and security; (9) transports; (10) public administration; (11) health; (12) environment; (13) agriculture, food and forestry; (14) science and technologies}
7. Which of them form your TOP-3? - {(1) Foreign affairs; (2) business economy; (3) energy; (4) citizens and society; (5) education and sport; (6) culture; (7) regions and municipalities; (8) justice, internal affairs and security; (9) transports; (10) public administration; (11) health; (12) environment; (13) agriculture, food and forestry; (14) science and technologies}
8. How would you assess the value of the following data categories?
8.1. sensor data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
8.2. real-time data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
8.3. geospatial data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
9. What would be these datasets? I.e. what (sub)topic could these data be associated with? - open-ended question
10. Which of the data sets currently available could be valuable and useful for society and businesses? - open-ended question
11. Which of the data sets currently NOT available in Latvia’s open data portal could, in your opinion, be valuable and useful for society and businesses? - open-ended question
12. How did you define them? - {(1) Subjective opinion; (2) experience with data; (3) filtering out the most popular datasets, i.e. basing them on public opinion; (4) other (open-ended question)}
13. How high could the value of these data sets be for you or your business? - 5-point Likert scale, where 1 – not valuable, 5 – highly valuable
14. Do you represent any company/organization (are you working anywhere)? (if “yes”, please fill out the survey twice, i.e. as an individual user AND a company representative) - {yes; no; I am an individual data user; other (open-ended)}
15. What industry/sector does your company/organization belong to? (if you do not work at the moment, please choose the last option) - {Information and communication services; Financial and insurance activities; Accommodation and catering services; Education; Real estate operations; Wholesale and retail trade; repair of motor vehicles and motorcycles; transport and storage; construction; water supply; waste water; waste management and recovery; electricity, gas supply, heating and air conditioning; manufacturing industry; mining and quarrying; agriculture, forestry and fisheries; professional, scientific and technical services; operation of administrative and service services; public administration and defence; compulsory social insurance; health and social care; art, entertainment and recreation; activities of households as employers; CSO/NGO; I am not a representative of any company}
16. To which category does your company/organization belong in terms of its size? - {small; medium; large; self-employed; I am not a representative of any company}
17. What is the age group that you belong to? (if you are an individual user, not a company representative) - {11..15, 16..20, 21..25, 26..30, 31..35, 36..40, 41..45, 46+, “do not want to reveal”}
18. Please indicate the education or scientific degree that corresponds most to you. (if you are an individual user, not a company representative) - {master degree; bachelor’s degree; Dr. and/or PhD; student (bachelor level); student (master level); doctoral candidate; pupil; do not want to reveal these data}
Format of the file: .xls, .csv (for the first spreadsheet only), .odt
Licenses or restrictions: CC-BY
The Tax Parcel Boundaries feature layer was created as part of an effort to provide a single source of truth for accessing City of Chelsea Tax Parcels. This dataset can be used in a variety of ways, including by applications, staff across the city, and citizens, as the authoritative tax parcel dataset enriched with computer-assisted mass appraisal (CAMA) information.
Data Dictionary: Data dictionary for the sole source of authoritative and maintained city parcel dataset.
Layers
Tax Parcel Addresses (0): Centroids of the tax parcel boundaries.
Tax Parcel Boundaries (1): Polygon geography of the tax parcels.
Unjoined_Parcels_NoCAMA (2): Records in this layer represent tax parcels with a geography (polygon), but with no information present in CAMA.
Rejected_Features (3): Records in this layer represent tax parcels with a geography (polygon), but without attribute information such as a valid Map Parcel ID.
Tables
Unjoined_CAMA_NoGeometry (100): This table represents the curated data provided from CAMA that was used to join to the tax parcel boundaries for enrichment.
Latest_CAMA_Input (101): This table represents an export of the latest CAMA information at the time of update.
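How a parcel-to-CAMA join could sort records into layers like these can be sketched as follows. This is an assumption about the general technique, not the city's actual workflow: the function `join_parcels_with_cama`, the field name `map_parcel_id`, and the sample records are all hypothetical, and treating leftover CAMA records as "no geometry" is a reading suggested only by the table name.

```python
# Hypothetical parcel polygons and CAMA records, keyed by a Map Parcel ID.
parcels = [
    {"map_parcel_id": "10-1", "geometry": "polygon A"},
    {"map_parcel_id": "10-2", "geometry": "polygon B"},
    {"map_parcel_id": None, "geometry": "polygon C"},  # no valid ID
]
cama_records = [
    {"map_parcel_id": "10-1", "assessed_value": 250000},
    {"map_parcel_id": "10-9", "assessed_value": 410000},  # no matching polygon
]


def join_parcels_with_cama(parcels, cama_records):
    """Split records into joined, unjoined, and rejected groups."""
    cama_by_id = {rec["map_parcel_id"]: rec for rec in cama_records}
    joined, unjoined_parcels, rejected = [], [], []
    matched_ids = set()
    for parcel in parcels:
        pid = parcel.get("map_parcel_id")
        if not pid:
            # Geometry exists but there is no usable ID to join on.
            rejected.append(parcel)
        elif pid in cama_by_id:
            # Enrich the polygon with the CAMA attributes.
            joined.append({**parcel, **cama_by_id[pid]})
            matched_ids.add(pid)
        else:
            unjoined_parcels.append(parcel)
    # CAMA records that never matched a polygon.
    unjoined_cama = [
        rec for rec in cama_records if rec["map_parcel_id"] not in matched_ids
    ]
    return joined, unjoined_parcels, rejected, unjoined_cama


joined, unjoined_parcels, rejected, unjoined_cama = join_parcels_with_cama(
    parcels, cama_records
)
print(len(joined), len(unjoined_parcels), len(rejected), len(unjoined_cama))
```

Keeping the unmatched records in their own groups, rather than dropping them, makes gaps on either side of the join visible for later correction.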
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The availability of proteomics datasets in the public domain, and in the PRIDE database, in particular, has increased dramatically in recent years. This unprecedented large-scale availability of data provides an opportunity for combined analyses of datasets to get organism-wide protein abundance data in a consistent manner. We have reanalyzed 24 public proteomics datasets from healthy human individuals to assess baseline protein abundance in 31 organs. We defined tissue as a distinct functional or structural region within an organ. Overall, the aggregated dataset contains 67 healthy tissues, corresponding to 3,119 mass spectrometry runs covering 498 samples from 489 individuals. We compared protein abundances between different organs and studied the distribution of proteins across these organs. We also compared the results with data generated in analogous studies. Additionally, we performed gene ontology and pathway-enrichment analyses to identify organ-specific enriched biological processes and pathways. As a key point, we have integrated the protein abundance results into the resource Expression Atlas, where they can be accessed and visualized either individually or together with gene expression data coming from transcriptomics datasets. We believe this is a good mechanism to make proteomics data more accessible for life scientists.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results of experiments with a method for evaluating semantic similarity between concepts in a taxonomy. The method is based on the information-theoretic approach and allows the senses of concepts in a given context to be considered. The relevance of senses is calculated in terms of semantic relatedness with the compared concepts. In a previous work [9], the adopted semantic relatedness method was the one described in [10], while in this work we also adopted the ones described in [11], [12], [13], [14], [15], and [16].
We applied our proposal by extending 7 methods for computing semantic similarity in a taxonomy, selected from the literature. The methods considered in the experiment are referred to as R[2], W&P[3], L[4], J&C[5], P&S[6], A[7], and A&M[8].
The experiment was run on the well-known Miller and Charles benchmark dataset [1] for assessing semantic similarity.
The results are organized in seven folders, each with the results related to one of the above semantic relatedness methods. In each folder there is a set of files, each referring to one pair of the Miller and Charles dataset. In fact, for each pair of concepts, all the 28 pairs are considered as possible different contexts.
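For readers unfamiliar with the information-theoretic approach, the following sketch computes two of the baseline measures named above, Resnik's [2] and Lin's [4], on a toy taxonomy. It does not implement the context-aware extension evaluated in this dataset; the taxonomy, the concept probabilities, and the function names are all hypothetical.

```python
import math

# Toy taxonomy (child -> parent) and made-up concept probabilities.
parent = {"car": "vehicle", "bicycle": "vehicle", "vehicle": "entity", "fork": "entity"}
prob = {"entity": 1.0, "vehicle": 0.25, "car": 0.1, "bicycle": 0.05, "fork": 0.02}


def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    chain = [concept]
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain


def ic(concept):
    """Information content: IC(c) = -log p(c), written as log(1 / p(c))."""
    return math.log(1.0 / prob[concept])


def lcs(c1, c2):
    """Least common subsumer: the first ancestor of c1 that also subsumes c2."""
    subsumers_of_c2 = set(ancestors(c2))
    for candidate in ancestors(c1):
        if candidate in subsumers_of_c2:
            return candidate
    raise ValueError("concepts share no common subsumer")


def resnik(c1, c2):
    """Resnik similarity: the information content of the least common subsumer."""
    return ic(lcs(c1, c2))


def lin(c1, c2):
    """Lin similarity: 2 * IC(lcs) / (IC(c1) + IC(c2))."""
    return 2 * ic(lcs(c1, c2)) / (ic(c1) + ic(c2))


print(resnik("car", "bicycle"), lin("car", "bicycle"))
```

In a tree-shaped taxonomy like this one, the first shared ancestor is also the shared ancestor with the highest information content, which is why a simple upward walk suffices.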
REFERENCES
[1] Miller G.A., Charles W.G. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1).
[2] Resnik P. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Int. Joint Conf. on Artificial Intelligence, Montreal.
[3] Wu Z., Palmer M. 1994. Verb semantics and lexical selection. 32nd Annual Meeting of the Association for Computational Linguistics.
[4] Lin D. 1998. An Information-Theoretic Definition of Similarity. Int. Conf. on Machine Learning.
[5] Jiang J.J., Conrath D.W. 1997. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Inter. Conf. Research on Computational Linguistics.
[6] Pirrò G. 2009. A Semantic Similarity Metric Combining Features and Intrinsic Information Content. Data Knowl. Eng. 68(11).
[7] Adhikari A., Dutta B., Dutta A., Mondal D., Singh S. 2018. An intrinsic information content-based semantic similarity measure considering the disjoint common subsumers of concepts of an ontology. J. Assoc. Inf. Sci. Technol. 69(8).
[8] Adhikari A., Singh S., Mondal D., Dutta B., Dutta A. 2016. A Novel Information Theoretic Framework for Finding Semantic Similarity in WordNet. CoRR, arXiv:1607.05422, abs/1607.05422.
[9] Formica A., Taglino F. 2021. An Enriched Information-Theoretic Definition of Semantic Similarity in a Taxonomy. IEEE Access, vol. 9.
[10] Information Content-based approach [Schuhmacher and Ponzetto, 2014].
[11] Linked Data Semantic Distance (LDSD) [Passant, 2010].
[12] Wikipedia Link-based Measure (WLM) [Witten and Milne, 2008].
[13] Linked Open Data Description Overlap-based approach (LODDO) [Zhou et al. 2012].
[14] Exclusivity-based [Hulpuş et al. 2015].
[15] ASRMP [El Vaigh et al. 2020].
[16] LDSDGN [Piao and Breslin, 2016].
A corpus-based dictionary, enriched with historical data. The dictionary was not only built on data from the corpus of spoken language that was compiled in the same project, but also on a range of additional sources: data elicited from complementary interviews with young Tunisians and lexical material taken from various published historical sources dating from the middle of the 20th century and earlier. See also: https://hdl.handle.net/11022/0000-0007-C265-C
This feature service was created as part of an effort to provide a single source of truth for accessing City of Chelsea Tax Parcels. This dataset can be used in a variety of ways, including by applications, staff across the city, and citizens, as the authoritative tax parcel dataset enriched with computer-assisted mass appraisal (CAMA) information.
Data Dictionary: Data dictionary for the sole source of authoritative and maintained city parcel dataset.
Layers
Tax Parcel Addresses (0): Centroids of the tax parcel boundaries.
Tax Parcel Boundaries (1): Polygon geography of the tax parcels.
Unjoined_Parcels_NoCAMA (2): Records in this layer represent tax parcels with a geography (polygon), but with no information present in CAMA.
Rejected_Features (3): Records in this layer represent tax parcels with a geography (polygon), but without attribute information such as a valid Map Parcel ID.
Tables
Unjoined_CAMA_NoGeometry (100): This table represents the curated data provided from CAMA that was used to join to the tax parcel boundaries for enrichment.
Latest_CAMA_Input (101): This table represents an export of the latest CAMA information at the time of update.
The Relationship Display extension for CKAN appears to enhance the platform's ability to manage and visualize relationships between datasets or other entities managed within CKAN. While the provided documentation lacks specific details, the extension likely introduces functionalities to define, store, and display connections between different resources to improve data discovery and understanding. Key Features (Inferred/Potential): Relationship Definition: Enables administrators and users to define various types of relationships between datasets, such as "derived from," "related to," or "supersedes". Visual Display: Offers a user interface component to visualize these relationships, possibly through graphs, tables, or other interactive elements. This offers a way for users to intuitively comprehend how an entire set of datasets are related. Metadata Enrichment: Augments dataset metadata with relationship information, allowing users to find connected datasets based on their relationships. Enhanced Data Discovery: Aids in the discovery of related datasets, thus improving data exploration and enabling users to understand dataset context better. Technical Integration: The extension implements a CKAN plugin, as indicated by the ckan.plugins configuration setting. This ensures that the functionality is integrated within the core CKAN application workflow, including the way datasets can be managed. Thus, this extension modifies/adds to the platform's existing functionality. Specific integration details would require deeper examination of the extension's code, but it likely enhances the existing user interfaces and API endpoints. Benefits & Impact (Potential): By visualizing and managing relationships between datasets, this extension potentially improves data governance and data traceability. Organizations can also improve dataset understanding and potentially improve collaborative data usage. 
The enhancement to data discovery means that new insights can be gained and workflows improved.
https://www.usa.gov/government-works
This dataset combines annual files from 2005 to 2017 published by the IRS. ZIP Code data show selected income and tax items classified by State, ZIP Code, and size of adjusted gross income. Data are based on individual income tax returns filed with the IRS. The data include items, such as:
- Number of returns, which approximates the number of households
- Number of personal exemptions, which approximates the population
- Adjusted gross income (AGI)
- Wages and salaries
- Dividends before exclusion
- Interest received

Enrichment and notes:
- The original data sheets (a column per variable, a line per year, zipcode and AGI group) have been transposed to get a record per year, zipcode, AGI group and variable.
- The data for Wyoming in 2006 was removed because AGI classes were not correctly defined, making the resulting data unfit for analysis.
- The definitions of the AGI groups have changed over time: the variable "AGI Class" was used until 2008, with various intervals of AGI; "AGI Stub" replaced it in 2009. We provided the literal intervals (e.g. "$50,000 under $75,000") as "AGI Group" in each case to help the analysis.
- The codes for each tax item have been joined with a dataset of variables to provide full names.
- Some tax items are available since 2005, others only since more recent years, depending on their introduction date (available in the dataset of variables); as a consequence, the time range of the plots or graphs may vary.
- The unit for amounts and AGIs is a thousand dollars.
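The transposition described in the first note, from one column per variable to one record per year, zipcode, AGI group and variable, can be sketched as below. The column names, variable codes, and values are hypothetical and stand in for the real sheets; the function `to_long` is an illustration, not the publisher's code.

```python
# Hypothetical wide-format rows: one column per tax item code.
wide_rows = [
    {"year": 2009, "zipcode": "12345", "agi_group": "$50,000 under $75,000",
     "N1": 120, "A00100": 7400},
    {"year": 2010, "zipcode": "12345", "agi_group": "$50,000 under $75,000",
     "N1": 131, "A00100": 8050},
]

ID_COLUMNS = ("year", "zipcode", "agi_group")


def to_long(rows, id_columns=ID_COLUMNS):
    """Turn each non-identifier column into its own (variable, value) record."""
    long_rows = []
    for row in rows:
        ids = {key: row[key] for key in id_columns}
        for variable, value in row.items():
            if variable in id_columns:
                continue
            long_rows.append({**ids, "variable": variable, "value": value})
    return long_rows


long_rows = to_long(wide_rows)
for record in long_rows:
    print(record)
```

The long layout makes it straightforward to join each variable code against a lookup table of full names, as the notes describe.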
experiments_phyto_12_26_15: Cell densities of phytoplankton taxa in initial lake water and in each experimental replicate. Column definitions are provided in ReadMe.txt.
park_lakes_db: 675 dissolved inorganic nitrogen (NO3-N and NH3-N) observations from lakes above 1,200 m sampled between 1988 and 2014 across Mount Rainier, North Cascades, and Olympic National Parks. The database was compiled by Jason Williams. Column definitions and information about data origins and compilation are provided in ReadMe.txt.
Chla: Chlorophyll a concentrations in initial lake water and in experimental replicates. Column definitions are provided in ReadMe.txt.
The orleans extension for CKAN enhances dataset metadata by providing the capability to add and manage a dataset's geographical extent. By offering the functionality to define the spatial coverage of a dataset, this extension improves discoverability and usability, particularly for geospatial datasets. This enhanced metadata can aid users in understanding the geographic scope of the data before accessing it, improving data selection and application. Key Features: Dataset Extent Management: Enables administrators and data publishers to define the spatial extent, or geographic boundaries, of their datasets. This functionality ensures that users can easily understand the geographic coverage of the data being provided. Spatial Metadata Enrichment: Integrates spatial metadata directly into CKAN's dataset descriptions to improve the searchability and understanding of geospatial data holdings within the catalog. Geospatial Context: Adds valuable geospatial context to datasets, thereby improving the overall quality of metadata and allowing users to filter searches by geographical coverage. Technical Integration: Although the readme provides limited details regarding the exact technical integration, the orleans extension likely leverages CKAN's plugin architecture to introduce new fields or sections in the dataset editing form to accommodate the extent information. Benefits & Impact: Implementing the orleans extension can significantly improve the discoverability of spatially referenced datasets within a CKAN catalog, giving users an easy way to assess the geographical relevance of a dataset before spending time downloading and processing it. Overall, adding dataset extent data through this extension enhances the utility of CKAN as a geospatial data catalog.
A collection of single miRNAs that regulate pathways, gene ontologies, and other categories, complementing available miRNA target enrichment programs, which are tailored to miRNA sets. It serves as a new dictionary of microRNAs and their target pathways. The database augments available target-pathway web servers by giving researchers access to information on which pathways are regulated by a given miRNA, which miRNAs target a given pathway, and how specific these regulations are.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We demonstrate that semantic modeling with ontologies provides a robust and enduring approach to achieving FAIR data in our experimental environment. By endowing data with self‑describing semantics through ontological definitions and inference, we enable them to ‘speak’ for themselves. Building on PaNET, we define techniques in ESRFET by their characteristic building blocks. The outcome is a standards‑based framework (RDF, OWL, SWRL, SPARQL, SHACL) that encodes experimental techniques’ semantics and underpins a broader facility ontology. Our approach illustrates that by using differential definitions, semantic enrichment through linking to multiple ontologies, and documented semantic negotiation, we standardize experimental techniques' descriptions and annotations, ensuring enhanced discoverability, reproducibility, and integration within the FAIR data ecosystem. This talk was held in the course of the DAPHNE4NFDI TA1 "Data for science" lecture series on April 29, 2025.
Gene set enrichment analysis of association of nischarin expression with defined gene data sets in pancreatic ductal adenocarcinoma
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Recent advances in targeted covalent inhibitors have aroused significant interest for their potential in drug development for difficult therapeutic targets. Proteome-wide profiling of functional residues is an integral step of covalent drug discovery aimed at defining actionable sites and evaluating compound selectivity in cells. A classical workflow for this purpose is called isoTOP-ABPP, which employs an activity-based probe and two isotopically labeled azide-TEV-biotin tags to mark, enrich, and quantify the proteome from two samples. Here we report a novel isobaric 11plex-AzidoTMT reagent and a new workflow, named AT-MAPP, that significantly expands multiplexing power as compared to the original isoTOP-ABPP. We demonstrate its application in identifying cysteine on- and off-targets using a KRAS G12C covalent inhibitor, ARS-1620. However, changes in some of these hits can be explained by modulation at the protein and post-translational levels. Thus, it would be crucial to interrogate site-level bona fide changes concurrently with proteome-level changes for corroboration. In addition, we perform a multiplexed covalent fragment screening using four acrylamide-based compounds as a proof-of-concept. This study identifies a diverse set of liganded cysteine residues in a compound-dependent manner with an average hit rate of 0.07% in intact cells. Lastly, we screened 20 sulfonyl fluoride-based compounds to demonstrate that the AT-MAPP assay is flexible for noncysteine functional residues such as tyrosine and lysine. Overall, we envision that 11plex-AzidoTMT will be a useful addition to the current toolbox for activity-based protein profiling and covalent drug development.