This dataset contains the historical Unidata Internet Data Distribution (IDD) Global Observational Data that are derived from real-time Global Telecommunications System (GTS) reports distributed via the Unidata Internet Data Distribution System (IDD). Reports include surface station (SYNOP) reports at 3-hour intervals, upper air (RAOB) reports at 3-hour intervals, surface station (METAR) reports at 1-hour intervals, and marine surface (BUOY) reports at 1-hour intervals. Select variables found in all report types include pressure, temperature, wind speed, and wind direction. Data may be available at mandatory or significant levels from 1000 millibars to 1 millibar, and at surface levels. Online archives are populated daily with reports generated two days prior to the current date.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The table reports for each dataset: the reference to the journal article/book where the study was published, the type of data (LBSN stands for Location Based Social Networks, CDR for Call Detail Record), the number of individuals (or vehicles in the case of car/taxi data) involved in the data collection, the duration of the data collection (M → months, Y → years, D → days, W → weeks), the minimum and maximum length of spatial displacements, the shape of the probability distribution of displacements with the corresponding parameters, the temporal sampling, and the shape of the distribution of waiting times with the corresponding parameters. Power-law (T) indicates a truncated power-law. The table can also be found at http://lauraalessandretti.weebly.com/plosmobilityreview.html.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Related article: Bergroth, C., Järv, O., Tenkanen, H., Manninen, M., Toivonen, T., 2022. A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland. Scientific Data 9, 39.
In this dataset:
We present temporally dynamic population distribution data from the Helsinki Metropolitan Area, Finland, at the level of 250 m by 250 m statistical grid cells. Three population distribution datasets at hourly resolution are provided: one each for regular workdays (Mon–Thu), Saturdays, and Sundays. The data are based on aggregated mobile phone data collected by the biggest mobile network operator in Finland. Mobile phone data are assigned to statistical grid cells using an advanced dasymetric interpolation method based on ancillary data about land cover, buildings and a time use survey. The data were validated against population register data from Statistics Finland for night-time hours and against a daytime workplace registry. The resulting 24-hour population data can be used to reveal the temporal dynamics of the city and examine population variations relevant to, for instance, spatial accessibility analyses, crisis management and planning.
Please cite this dataset as:
Bergroth, C., Järv, O., Tenkanen, H., Manninen, M., Toivonen, T., 2022. A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland. Scientific Data 9, 39. https://doi.org/10.1038/s41597-021-01113-4
Organization of data
The dataset is packaged into a single zip file, Helsinki_dynpop_matrix.zip, which contains the following files:
HMA_Dynamic_population_24H_workdays.csv represents the dynamic population for an average workday in the study area.
HMA_Dynamic_population_24H_sat.csv represents the dynamic population for an average Saturday in the study area.
HMA_Dynamic_population_24H_sun.csv represents the dynamic population for an average Sunday in the study area.
target_zones_grid250m_EPSG3067.geojson represents the statistical grid in ETRS89/ETRS-TM35FIN projection that can be used to visualize the data on a map using e.g. QGIS.
Column names
YKR_ID : a unique identifier for each statistical grid cell (n=13,231). The identifier is compatible with the statistical YKR grid cell data by Statistics Finland and Finnish Environment Institute.
H0, H1 ... H23 : Each field represents the proportional distribution of the total population in the study area between grid cells during a one-hour period. In total, 24 fields are formatted as “Hx”, where x stands for the hour of the day (values ranging from 0 to 23). For example, H0 stands for the first hour of the day: 00:00 - 00:59. The sum of all cell values for each field equals 100 (i.e. 100% of the total population for each one-hour period).
In order to visualize the data on a map, the result tables can be joined with the target_zones_grid250m_EPSG3067.geojson data. The data can be joined by using the field YKR_ID as a common key between the datasets.
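The join described above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' workflow: the two-cell sample, the cell identifiers, the comma delimiter, and the assumption that each GeoJSON feature stores YKR_ID in its properties are all hypothetical and should be checked against the actual files. The sketch also checks the property stated above, that each hour column sums to 100.

```python
import csv
import io
import json

# Hypothetical two-cell excerpt; the real tables have 13,231 cells and
# columns H0..H23. The comma delimiter is an assumption.
SAMPLE_CSV = """YKR_ID,H0,H1
5785640,60.0,55.5
5785641,40.0,44.5
"""

# Hypothetical grid excerpt; geometries are omitted (null) for brevity.
SAMPLE_GEOJSON = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"YKR_ID": 5785640}, "geometry": None},
        {"type": "Feature", "properties": {"YKR_ID": 5785641}, "geometry": None},
    ],
}


def read_population(csv_text):
    """Return {YKR_ID: {hour_column: share_in_percent}} from CSV text."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell_id = int(row.pop("YKR_ID"))
        table[cell_id] = {col: float(val) for col, val in row.items()}
    return table


def column_sums(table):
    """Sum each hour column over all grid cells (expected to be about 100)."""
    sums = {}
    for shares in table.values():
        for col, val in shares.items():
            sums[col] = sums.get(col, 0.0) + val
    return sums


def join_to_grid(geojson, table):
    """Copy the hourly shares into the properties of the matching grid cell."""
    joined = json.loads(json.dumps(geojson))  # deep copy; input stays untouched
    for feature in joined["features"]:
        shares = table.get(feature["properties"]["YKR_ID"])
        if shares is not None:
            feature["properties"].update(shares)
    return joined


population = read_population(SAMPLE_CSV)
sums = column_sums(population)
grid = join_to_grid(SAMPLE_GEOJSON, population)
```

The joined feature collection can be written back out with json.dump and opened in a GIS tool such as QGIS for visualization.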
License: Creative Commons Attribution 4.0 International.
Related datasets
Järv, Olle; Tenkanen, Henrikki & Toivonen, Tuuli. (2017). Multi-temporal function-based dasymetric interpolation tool for mobile phone data. Zenodo. https://doi.org/10.5281/zenodo.252612
Tenkanen, Henrikki, & Toivonen, Tuuli. (2019). Helsinki Region Travel Time Matrix [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3247564
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network and loading data for a real-world distribution network in the North-East of England.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Understanding the performance of machine learning models across diverse data distributions is critically important for reliable applications. Motivated by this, there is a growing focus on curating benchmark datasets that capture distribution shifts. In this work, we present MetaShift, a collection of 12,868 sets of natural images across 410 classes, to address this challenge. We leverage the natural heterogeneity of Visual Genome and its annotations to construct MetaShift. The key construction idea is to cluster images using their metadata, which provides a context for each image (e.g. cats with cars or cats in bathroom); each context represents a distinct data distribution. MetaShift has two important benefits. First, it contains orders of magnitude more natural data shifts than previously available. Second, it provides explicit explanations of what is unique about each of its data sets and a distance score that measures the amount of distribution shift between any two of its data sets. Importantly, to support evaluating ImageNet-trained models on MetaShift, we match MetaShift with the ImageNet hierarchy. The matched version covers 867 out of 1,000 classes in ImageNet-1k. Each class in the ImageNet-matched MetaShift contains 2301.6 images on average, and 19.3 subsets capturing images in different contexts. We also propose a method to construct tasks on the matched version, giving an example that constructs 19,024 binary classification tasks on it.
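The idea of grouping images by metadata context can be illustrated with a toy sketch. It is not MetaShift's actual construction pipeline: the image identifiers, labels, and context tags below are hypothetical, and the real dataset derives its contexts from Visual Genome annotations. The sketch only shows how one class can yield several context-specific subsets.

```python
from collections import defaultdict

# Hypothetical annotations: image id -> (class label, set of context tags).
ANNOTATIONS = {
    "img_001": ("cat", {"car"}),
    "img_002": ("cat", {"bathroom"}),
    "img_003": ("cat", {"car", "street"}),
    "img_004": ("dog", {"car"}),
}


def build_subsets(annotations):
    """Group image ids into subsets keyed by (class label, context tag)."""
    subsets = defaultdict(set)
    for image_id, (label, contexts) in annotations.items():
        for context in contexts:
            # An image with several context tags joins several subsets.
            subsets[(label, context)].add(image_id)
    return dict(subsets)


subsets = build_subsets(ANNOTATIONS)
```

Here the subsets keyed ("cat", "car") and ("cat", "bathroom") play the role of two distinct data distributions for the same class.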
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Transparency in data visualization is an essential ingredient for scientific communication. The traditional approach of visualizing continuous quantitative data solely in the form of summary statistics (i.e., measures of central tendency and dispersion) has repeatedly been criticized for not revealing the underlying raw data distribution. Remarkably, however, systematic and easy-to-use solutions for raw data visualization using the most commonly reported statistical software package for data analysis, IBM SPSS Statistics, are missing. Here, a comprehensive collection of more than 100 SPSS syntax files and an SPSS dataset template are presented and made freely available. Together they allow the creation of transparent graphs for one-sample designs, for one- and two-factorial between-subject designs, for selected one- and two-factorial within-subject designs as well as for selected two-factorial mixed designs and, with some creativity, even beyond (e.g., three-factorial mixed designs). Depending on graph type (e.g., pure dot plot, box plot, and line plot), raw data can be displayed along with standard measures of central tendency (arithmetic mean and median) and dispersion (95% CI and SD). The free-to-use syntax can also be modified to match individual needs. A variety of example applications of the syntax are illustrated in a tutorial-like fashion along with fictitious datasets accompanying this contribution. The syntax collection is hoped to provide researchers, students, teachers, and others working with SPSS with a valuable tool to move towards more transparency in data visualization.
Annual distribution of the number of in situ measurements obtained from instruments carried aboard oceanographic research and merchant ships. The spatial and temporal coverage of nitrate data in the Gulf of Mexico is not uniform, and most of the historical data were collected over the continental shelf near shallow intertidal areas (<200 m depth).
Simulated transmission curves illustrating an efficient new calculation method. Data was produced for a publication, and is indexed by figure.
November 2022 Version
This dataset represents the "Observed Distribution" for coho salmon in California, using only observations made between 1990 and the present. It was developed for the express purpose of assisting with species recovery planning efforts. The process for developing this dataset was to collect as many observations of the species as possible and derive the stream-based geographic distribution for the species based solely on these positive observations.
For the purpose of this dataset, an observation is defined as a report of a sighting or other evidence of the presence of the species at a given place and time. As such, observations are modeled by year observed as point locations in the GIS. All such observations were collected with information regarding who reported the observation, their agency/organization/affiliation, the date that they observed the species, who compiled the information, etc. This information is maintained in the developer's file geodatabase (©Environmental Systems Research Institute (ESRI) 2016).
To develop this distribution dataset, the species observations were applied to California Streams, a CDFW derivative of USGS National Hydrography Dataset (NHD) High Resolution hydrography. For each observation, a path was traced down the hydrography from the point of observation to the ocean, thereby deriving the shortest migration route from the point of observation to the sea. By appending all of these migration paths together, the "Observed Distribution" for the species is developed.
It is important to note that this layer does not attempt to model the entire possible distribution of the species. Rather, it only represents the known distribution based on where the species has been observed and reported. While some observations indeed represent the upstream extent of the species (e.g., an observation made at a hard barrier), the majority of observations only indicate where the species was sampled for or otherwise observed. Because of this, this dataset likely underestimates the absolute geographic distribution of the species.
It is also important to note that the species may not be found on an annual basis in all indicated reaches due to natural variations in run size, water conditions, and other environmental factors. As such, the information in this dataset should not be used to verify that the species is currently present in a given stream. Conversely, the absence of distribution linework for a given stream does not necessarily indicate that the species does not occur in that stream.
The observation data were compiled from a variety of disparate sources including but not limited to CDFW, USFS, NMFS, timber companies, and the public. Forms of documentation include CDFW administrative reports, personal communications with biologists, observation reports, and literature reviews. The source of each feature (to the best available knowledge) is included in the data attributes for the observations in the geodatabase, but not for the resulting linework.
The spatial data has been referenced to California Streams, a CDFW derivative of USGS National Hydrography Dataset (NHD) High Resolution hydrography.
Usage of this dataset:
Examples of appropriate uses include:
- Species recovery planning.
- Evaluation of future survey sites for the species.
- Validating species distribution models.
Examples of inappropriate uses include:
- Assuming absence of a line feature means that the species is not present in that stream.
- Using this data to make parcel or ground level land use management decisions.
- Using this dataset to prove or support non-existence of the species at any spatial scale.
- Assuming that the line feature represents the maximum possible extent of species distribution.
All users of this data should seek the assistance of qualified professionals such as surveyors, hydrologists, or fishery biologists as needed to ensure that such users possess complete, precise, and up to date information on species distribution and water body location.
Any copy of this dataset is considered to be a snapshot of the species distribution at the time of release. It is incumbent upon the user to ensure that they have the most recent version prior to making management or planning decisions.
Please refer to "Use Constraints" section below.
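The derivation of the "Observed Distribution" (trace each observation downstream to the ocean, then combine all traced paths) can be sketched on a toy network. This is an illustration under stated assumptions, not the CDFW GIS procedure: the segment names are hypothetical, and the network is simplified so that every segment has exactly one downstream neighbour, which makes the downstream path unique and therefore also the shortest one.

```python
# Hypothetical toy network: each stream segment maps to the segment it flows
# into; None marks a segment that drains directly to the ocean.
DOWNSTREAM = {
    "headwater_a": "mainstem_1",
    "headwater_b": "mainstem_1",
    "mainstem_1": "mainstem_2",
    "mainstem_2": None,
    "tributary_c": "mainstem_2",
}


def trace_to_ocean(segment, downstream):
    """Return the list of segments from an observation down to the ocean."""
    path = []
    seen = set()
    while segment is not None:
        if segment in seen:
            raise ValueError(f"cycle in network at {segment!r}")
        seen.add(segment)
        path.append(segment)
        segment = downstream[segment]
    return path


def observed_distribution(observed_segments, downstream):
    """Union of all downstream paths traced from the observed segments."""
    distribution = set()
    for segment in observed_segments:
        distribution.update(trace_to_ocean(segment, downstream))
    return distribution


# Two hypothetical observations; headwater_b has none.
observations = ["headwater_a", "tributary_c"]
distribution = observed_distribution(observations, DOWNSTREAM)
```

In this toy case headwater_b is left out of the result even though the species might occur there, which mirrors the caveat that missing linework does not indicate absence.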
Dataset Card for "DSR-Bench-spatial"
DSR-Bench-spatial extends 3 data structures in DSR-Bench (K-D Heap, K-D Tree, Geometric Graphs) into variants that differ in dimensionality and data distribution. It contains 1D, 2D, 3D, and 5D versions of all three data structures, plus versions of K-D Tree with 3 non-uniform data distributions (moons, circles, blobs), each with short, medium, and long prompts, yielding a total of 450 questions. DSR-Bench-spatial is designed to highlight… See the full description on the dataset page: https://huggingface.co/datasets/vitercik-lab/DSR-Bench-spatial.
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
https://www.datainsightsmarket.com/privacy-policy
The Big Data Processing and Distribution Software market is experiencing robust growth, driven by the exponential increase in data volume across industries and the rising need for efficient data management and analytics. The market, estimated at $50 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $150 billion by 2033. This growth is fueled by several key factors, including the increasing adoption of cloud-based solutions, the proliferation of Internet of Things (IoT) devices generating massive data streams, and the growing demand for real-time analytics and data-driven decision-making across various sectors like finance, healthcare, and retail. Large enterprises are leading the adoption, followed by a rapidly growing segment of Small and Medium-sized Enterprises (SMEs) leveraging cloud-based solutions for cost-effectiveness and scalability. The market is characterized by a competitive landscape with both established players like Google, Amazon Web Services, and Microsoft, and emerging niche providers offering specialized solutions. While the North American market currently holds a significant share, regions like Asia-Pacific are showing exceptional growth potential, driven by rapid digitalization and increasing investments in data infrastructure. However, the market also faces certain restraints. These include the complexities associated with data integration and management, the high costs of implementing and maintaining big data solutions, and the need for skilled professionals to manage and analyze the data effectively. Furthermore, ensuring data security and compliance with evolving regulations poses a challenge for organizations. Despite these hurdles, the overall market outlook remains positive, fueled by continuous technological advancements, increasing data generation, and the growing understanding of the value of data-driven insights. 
The shift towards cloud-based solutions continues to be a significant trend, facilitating easier access, scalability, and reduced infrastructure costs. The market's future hinges on the continued development of innovative solutions addressing security, scalability, and ease of use, catering to the diverse needs of various industry segments and geographical locations.
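The growth figures quoted above can be cross-checked with the standard compound growth formula. The sketch below uses only the numbers stated in the text ($50 billion in 2025, 15% CAGR, 2033 horizon); the function names are illustrative.

```python
def project_value(start_value, annual_rate, years):
    """Compound a starting value at a fixed annual growth rate."""
    return start_value * (1.0 + annual_rate) ** years


def implied_cagr(start_value, end_value, years):
    """Annual growth rate that turns start_value into end_value over years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


YEARS = 2033 - 2025  # eight compounding periods

# Value in billions of dollars reached from 50 at 15% per year.
projected_2033 = project_value(50.0, 0.15, YEARS)

# Rate implied by growing from 50 to exactly 150 over the same period.
rate_for_150 = implied_cagr(50.0, 150.0, YEARS)
```

Compounding 50 at 15% for eight years gives roughly 153, consistent with the "approximately $150 billion" in the text; reaching exactly 150 would correspond to a rate slightly below 15%.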
Public Domain Mark 1.0: https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
This is a MaxEnt model map of the global distribution of the seagrass biome. Species occurrence records were extracted from the Global Biodiversity Information Facility (GBIF), the United Nations Environment Programme-World Conservation Monitoring Centre (UNEP-WCMC) Ocean Data Viewer, and the Ocean Biogeographic Information System (OBIS). This map shows the suitable habitats for seagrass at a global scale.
Citation: Jayathilake D.R.M., Costello M.J. 2018. A modelled global distribution of the seagrass biome. Biological Conservation. https://doi.org/10.1016/j.biocon.2018.07.009
Use Constraints: Creative Commons Attribution 4.0 Unported (CC BY 4.0). https://creativecommons.org/licenses/by/4.0/.
Free to (1) copy and redistribute the material in any medium or format, (2) remix, transform, and build upon the material for any purpose, even commercially. You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Hazardous Substances Data Bank (HSDB) was a toxicology database that focused on the toxicology of potentially hazardous chemicals. It provided information on human exposure, industrial hygiene, emergency handling procedures, environmental fate, regulatory requirements, nanomaterials, and related areas. The information in HSDB has been assessed by a Scientific Review Panel.
This version of HSDB data includes a subset of HSDB for downloading, but is no longer updated. HSDB data has been incorporated into PubChem.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The National Health and Nutrition Examination Surveys (NHANES), conducted by the National Center for Health Statistics, Centers for Disease Control (NCHS/CDC), were designed to assess the health and nutritional status of adults and children in the United States through interviews and direct physical examinations. The NHANES radiographs were scanned by Dr. Bernie Huang at the University of California at Los Angeles and the University of California at San Francisco. Dr. Huang’s group used a Lumysis 100 with a 175 micron spot size to scan the first 6000 radiographs. The remaining radiographs were scanned on the Lumysis 150, again with a 175 micron spot size. NOTE: This dataset is no longer updated with new content.
We introduce misinfo-general, a benchmark dataset for evaluating misinformation models’ ability to perform out-of-distribution generalisation. Misinformation changes rapidly, much more quickly than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. As a result, misinformation models need to be able to perform out-of-distribution generalisation, an understudied problem in existing datasets.
Constructed on top of the various NELA corpora (2017, 2018, 2019, 2020, 2021, 2022), misinfo-general is a large, diverse dataset consisting of news articles from reliable and unreliable publishers. Unlike NELA, we apply several rounds of deduplication and filtering to ensure all articles are of reasonable quality.
We use distant labelling to provide each publisher with rich metadata annotations. These annotations allow for simulating various generalisation splits that misinformation models are confronted with during deployment. We focus on 6 such splits (time, event, topic, publisher, political bias, and misinformation type), but more are possible.
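How metadata annotations can drive a generalisation split is sketched below. This is a toy illustration, not the dataset's own tooling: the article records and the field names publisher and year are hypothetical. The point is that holding out every article with a given metadata value (a year, a publisher) yields a test set whose distribution differs from the training set.

```python
# Hypothetical article records; the field names are illustrative only.
ARTICLES = [
    {"id": 1, "publisher": "pub_a", "year": 2017},
    {"id": 2, "publisher": "pub_b", "year": 2018},
    {"id": 3, "publisher": "pub_a", "year": 2021},
    {"id": 4, "publisher": "pub_c", "year": 2022},
]


def split_by(articles, key, held_out_values):
    """Hold out every article whose metadata value for key is in held_out_values."""
    held_out = set(held_out_values)
    train = [a for a in articles if a[key] not in held_out]
    test = [a for a in articles if a[key] in held_out]
    return train, test


# Time split: train on earlier years, evaluate on later ones.
time_train, time_test = split_by(ARTICLES, "year", {2021, 2022})

# Publisher split: evaluate on a publisher never seen during training.
pub_train, pub_test = split_by(ARTICLES, "publisher", {"pub_a"})
```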
By releasing this dataset publicly, we hope to encourage future works that design misinformation models specifically with out-of-distribution generalisation in mind.
The Clinical Questions Collection is a repository of questions that were collected between 1991 and 2003 from healthcare providers in clinical settings across the country. The questions have been submitted by investigators who wish to share their data with other researchers. This dataset is no longer updated with new content. The collection is used in developing approaches to clinical and consumer-health question answering, as well as researching information needs of clinicians and the language they use to express their information needs. All files are formatted in XML.
The Meta-Dataset benchmark is a large few-shot learning benchmark and consists of multiple datasets of different data distributions. It does not restrict few-shot tasks to have fixed ways and shots, thus representing a more realistic scenario. It consists of 10 datasets from diverse domains:
ILSVRC-2012 (the ImageNet dataset, consisting of natural images with 1000 categories)
Omniglot (hand-written characters, 1623 classes)
Aircraft (dataset of aircraft images, 100 classes)
CUB-200-2011 (dataset of Birds, 200 classes)
Describable Textures (different kinds of texture images with 43 categories)
Quick Draw (black and white sketches of 345 different categories)
Fungi (a large dataset of mushrooms with 1500 categories)
VGG Flower (dataset of flower images with 102 categories)
Traffic Signs (German traffic sign images with 43 classes)
MSCOCO (images collected from Flickr, 80 classes)
All datasets except Traffic Signs and MSCOCO have a training, validation and test split (proportioned roughly into 70%, 15%, 15%). The datasets Traffic Signs and MSCOCO are reserved for testing only.