Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HPC-ODA is a collection of datasets acquired on production HPC systems, representative of several real-world use cases in the field of Operational Data Analytics (ODA) for improving reliability and energy efficiency. The datasets consist of monitoring sensor data acquired from the components of different HPC systems, depending on the specific use case. Two tools with proven low overhead were used to acquire the data in HPC-ODA: the DCDB and LDMS monitoring frameworks.
The aim of HPC-ODA is to provide several vertical slices (here named segments) of the monitoring data available in a large-scale HPC installation. The segments all have different granularities, in terms of data sources and time scale, and provide several use cases on which models and approaches to data processing can be evaluated. While having a production dataset from a whole HPC system - from the infrastructure down to the CPU core level - at a fine time granularity would be ideal, this is often not feasible due to the confidentiality of the data, as well as the sheer amount of storage space required. HPC-ODA includes 6 different segments:
Power Consumption Prediction: a fine-granularity dataset collected from a single compute node in an HPC system. It contains both node-level data and per-CPU-core metrics, and can be used for regression tasks such as power consumption prediction.
Fault Detection: a medium-granularity dataset collected from a single compute node while it was subjected to fault injection. It contains only node-level data, together with time-stamped labels for the applications and faults being executed on the node. This dataset can be used to perform fault classification.
Application Classification: a medium-granularity dataset collected from 16 compute nodes in an HPC system while running different parallel MPI applications. Data is at the compute-node level, separated per node, and is paired with the labels of the applications being executed. This dataset can be used for tasks such as application classification.
Infrastructure Management: a coarse-granularity dataset containing cluster-wide data from an HPC system, covering its warm-water cooling system and power consumption. The data is at the rack level and can be used for regression tasks such as predicting outlet water temperature or removed heat.
Cross-architecture: a medium-granularity dataset that is a variant of the Application Classification one, and shares the same ODA use case. Here, however, single-node configurations of the applications were executed on three different compute node types with different CPU architectures. This dataset can be used to perform cross-architecture application classification, or performance comparison studies.
DEEP-EST Dataset: this medium-granularity dataset was collected on the modular DEEP-EST HPC system and consists of three parts. These were each collected on 16 compute nodes while running several MPI applications under different warm-water cooling configurations. This dataset can be used for CPU and GPU temperature prediction, or for thermal characterization.
The HPC-ODA dataset collection includes a readme document containing all necessary usage information, as well as a lightweight Python framework to carry out the ODA tasks described for each dataset.
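As an illustration, the sketch below shows the kind of regression workflow the Power Consumption Prediction segment supports; the file name and column names are hypothetical placeholders, and the bundled readme and Python framework describe the actual layout and tasks.

```python
# A minimal sketch of the kind of regression task the Power Consumption
# Prediction segment supports. File name and column names are hypothetical;
# consult the bundled readme for the actual data layout.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("power_segment.csv")          # hypothetical export of the segment
X = df.drop(columns=["node_power_w"])          # per-core and node-level sensor readings
y = df["node_power_w"]                         # target: node power consumption

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print("MAE [W]:", mean_absolute_error(y_test, model.predict(X_test)))
```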
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed to find outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any one location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose centralizes only a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
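As a toy illustration of the general sample-and-centralize idea (not the paper's actual algorithm), the sketch below has each site ship a small random sample to a coordinator, which derives a global distance-based outlier threshold that the sites then apply locally.

```python
# Toy illustration of the sample-and-centralize idea (NOT the paper's algorithm):
# each site centralizes only a small random sample; the coordinator derives a
# global outlier threshold that the sites then apply to their own data locally.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
sites = [rng.normal(size=(10_000, 5)) for _ in range(4)]   # data held at 4 locations
sites[2][:20] += 8.0                                       # inject a few outliers at one site

# 1) Each site ships a 1% random sample to the coordinator.
pooled = np.vstack([s[rng.choice(len(s), size=len(s) // 100, replace=False)] for s in sites])

# 2) Coordinator scores the pooled sample (k-NN distance) and picks a cutoff.
nn = NearestNeighbors(n_neighbors=5).fit(pooled)
sample_scores = nn.kneighbors(pooled)[0][:, -1]
threshold = np.quantile(sample_scores, 0.999)

# 3) Sites score their own points against the sample and flag outliers locally.
for i, s in enumerate(sites):
    scores = nn.kneighbors(s)[0][:, -1]
    print(f"site {i}: {np.sum(scores > threshold)} flagged outliers")
```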
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Different data sources and their characteristics.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset comprises a collection of example DMPs from a wide array of fields, obtained from a number of different sources outlined in the README. Data extracted from the examples includes the discipline and field of study, author, institutional affiliation and funding information, location, date modified, title, research and data type, description of the project, a link to the DMP, and, where possible, external links to related publications, grant pages, or French-language versions. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.
This data release is a compilation of construction depth information for 12,383 active and inactive public-supply wells (PSWs) in California from various data sources. Construction data from multiple sources were indexed by the California State Water Resources Control Board Division of Drinking Water (DDW) primary station code (PS Code). Five different data sources were compared with the following priority order: 1, Local sources from select municipalities and water purveyors (Local); 2, Local DDW district data (DDW); 3, The United States Geological Survey (USGS) National Water Information System (NWIS); 4, The California State Water Resources Control Board Groundwater Ambient Monitoring and Assessment Groundwater Information System (SWRCB); and 5, USGS attribution of California Department of Water Resources well completion report data (WCR). For all data sources, the uppermost depth to the well's open or perforated interval was attributed as depth to top of perforations (ToP). The composite depth to bottom of well (Composite BOT) field was attributed from available construction data in the following priority order: 1, Depth to bottom of perforations (BoP); 2, Depth of completed well (Well Depth); 3, Borehole depth (Hole Depth). PSW ToPs and Composite BOTs from each of the five data sources were then compared, and summary construction depths for both fields were selected for wells with multiple data sources according to the data-source priority order listed above. Case-by-case modifications to the final selected summary construction depths were made after priority-order-based selection to ensure internal logical consistency (for example, ToP must not exceed Composite BOT).
This data release contains eight tab-delimited text files:
WellConstructionSourceData_Local.txt contains well construction-depth data, Composite BOT data-source attribution, and local agency data-source attribution for the Local data.
WellConstructionSourceData_DDW.txt contains well construction-depth data and Composite BOT data-source attribution for the DDW data.
WellConstructionSourceData_NWIS.txt contains well construction-depth data, Composite BOT data-source attribution, and USGS site identifiers for the NWIS data.
WellConstructionSourceData_SWRCB.txt contains well construction-depth data and Composite BOT data-source attribution for the SWRCB data.
WellConstructionSourceData_WCR.txt contains well construction-depth data and Composite BOT data-source attribution for the WCR data.
WellConstructionCompilation_ToP.txt contains all ToP data listed by data source.
WellConstructionCompilation_BOT.txt contains all Composite BOT data listed by data source.
WellConstructionCompilation_Summary.txt contains summary ToP and Composite BOT values for each well with data-source attribution for both construction fields.
All construction depths are in units of feet below land surface and are reported to the nearest foot.
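A hedged sketch of the priority-order selection logic described above; it assumes the source tables have been read into DataFrames keyed by PS Code, and the column names used here ("PS Code", "ToP") are illustrative rather than the release's exact schema.

```python
# Hedged sketch of priority-order selection for the top-of-perforations field.
# The files are the tab-delimited source tables named in the description;
# the "PS Code" and "ToP" column names are assumptions, not the exact schema.
import pandas as pd

priority = ["Local", "DDW", "NWIS", "SWRCB", "WCR"]   # 1 = highest priority
sources = {name: pd.read_csv(f"WellConstructionSourceData_{name}.txt", sep="\t")
           for name in priority}

def select_summary_top(ps_code: str) -> tuple[str, float] | None:
    """Return (data source, ToP) from the highest-priority source with a value."""
    for name in priority:
        df = sources[name]
        row = df.loc[(df["PS Code"] == ps_code) & df["ToP"].notna()]
        if not row.empty:
            return name, float(row["ToP"].iloc[0])
    return None
```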
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike many existing statistical software packages, R has the added benefit of allowing users to write more efficient code using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions for how they expect the program to behave while handling the data, which can also be stored in the simple object system.
For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, targeted to help inform and guide the work of R users and statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for using each of them in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples ranging from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, a congruence of statistics and computer programming for research.
https://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global Data Preparation Tools market size will be USD XX million in 2025. It will expand at a compound annual growth rate (CAGR) of XX% from 2025 to 2031.
North America held the major market share for more than XX% of the global revenue with a market size of USD XX million in 2025 and will grow at a CAGR of XX% from 2025 to 2031. Europe accounted for a market share of over XX% of the global revenue with a market size of USD XX million in 2025 and will grow at a CAGR of XX% from 2025 to 2031. Asia Pacific held a market share of around XX% of the global revenue with a market size of USD XX million in 2025 and will grow at a CAGR of XX% from 2025 to 2031. Latin America had a market share of more than XX% of the global revenue with a market size of USD XX million in 2025 and will grow at a CAGR of XX% from 2025 to 2031. Middle East and Africa had a market share of around XX% of the global revenue and was estimated at a market size of USD XX million in 2025 and will grow at a CAGR of XX% from 2025 to 2031.
KEY DRIVERS
Increasing Volume of Data and Growing Adoption of Business Intelligence (BI) and Analytics Driving the Data Preparation Tools Market
As organizations grow more data-driven, the integration of data preparation tools with Business Intelligence (BI) and advanced analytics platforms is becoming a critical driver of market growth. Clean, well-structured data is the foundation for accurate analysis, predictive modeling, and data visualization. Without proper preparation, even the most advanced BI tools may deliver misleading or incomplete insights. Businesses are now realizing that to fully capitalize on the capabilities of BI solutions such as Power BI, Qlik, or Looker, their data must first be meticulously prepared. Data preparation tools bridge this gap by transforming disparate raw data sources into harmonized, analysis-ready datasets. In the financial services sector, for example, firms use data preparation tools to consolidate customer financial records, transaction logs, and third-party market feeds to generate real-time risk assessments and portfolio analyses. The seamless integration of these tools with analytics platforms enhances organizational decision-making and contributes to the widespread adoption of such solutions.
The integration of advanced technologies such as artificial intelligence (AI) and machine learning (ML) into data preparation tools has significantly improved their efficiency and functionality. These technologies automate complex tasks like anomaly detection, data profiling, semantic enrichment, and even the suggestion of optimal transformation paths based on patterns in historical data. AI-driven data preparation not only speeds up workflows but also reduces errors and human bias. In May 2022, Alteryx introduced AiDIN, a generative AI engine embedded into its analytics cloud platform. This innovation allows users to automate insights generation and produce dynamic documentation of business processes, revolutionizing how businesses interpret and share data. Similarly, platforms like DataRobot integrate ML models into the data preparation stage to improve the quality of predictions and outcomes. These innovations are positioning data preparation tools as not just utilities but as integral components of the broader AI ecosystem, thereby driving further market expansion.
Data preparation tools address these needs by offering robust solutions for data cleaning, transformation, and integration, enabling telecom and IT firms to derive real-time insights. For example, Bharti Airtel, one of India’s largest telecom providers, implemented AI-based data preparation tools to streamline customer data and automate insights generation, thereby improving customer support and reducing operational costs. As major market players continue to expand and evolve their services, the demand for advanced data analytics powered by efficient data preparation tools will only intensify, propelling market growth.
The exponential growth in global data generation is another major catalyst for the rise in demand for data preparation tools. As organizations adopt digital technologies and connected devices proliferate, the volume of data produced has surged beyond what traditional tools can handle. This deluge of information necessitates modern solutions capable of preparing vast and complex datasets efficiently. According to a report by the Lin...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lightning Talk at the International Digital Curation Conference 2025. The presentation examines OpenAIRE's solution to the “entity disambiguation” problem, presenting a hybrid data curation method that combines deduplication algorithms with the expertise of human curators to ensure high-quality, interoperable scholarly information. Entity disambiguation is invaluable to building a robust and interconnected open scholarly communication system. It involves accurately identifying and differentiating entities such as authors, organisations, data sources and research results across various entity providers. This task is particularly complex in contexts like the OpenAIRE Graph, where metadata is collected from over 100,000 data sources. Metadata describing the same entity can be collected multiple times, potentially providing different information, such as different Persistent Identifiers (PIDs) or names, for the same entity. This heterogeneity poses several challenges to the disambiguation process. For example, the same organisation may be referenced using different names, languages, or abbreviations. In some cases, even the use of PIDs might not be effective, as different identifiers may be assigned by different data providers. Accurate entity disambiguation is therefore essential for ensuring data quality, improving search and discovery, facilitating knowledge graph construction, and supporting reliable research impact assessment.
To address this challenge, OpenAIRE employs a deduplication algorithm to identify and merge duplicate entities, configured to handle different entity types. While the algorithm proves effective for research results, when applied to organisations and data sources it needs to be complemented with human curation and validation, since additional information may be needed. OpenAIRE's data source disambiguation relies primarily on the OpenAIRE technical team overseeing the deduplication process and ensuring accurate matches across the DRIS, FAIRSharing, re3data, and OpenDOAR registries. While the algorithm automates much of the process, human experts verify matches, address discrepancies and actively search for matches not proposed by the algorithm. External stakeholders, such as data source managers, can also contribute by submitting suggestions through a dedicated ticketing system. So far, OpenAIRE has curated almost 3,935 groups covering a total of 8,140 data sources.
To address organisational disambiguation, OpenAIRE developed OpenOrgs, a hybrid system combining automated processes and human expertise. The tool works on organisational data aggregated from multiple sources (the ROR registry, funder databases, CRIS systems, and others) by the OpenAIRE infrastructure, automatically compares metadata, and suggests potential merged entities to human curators. These curators, authorised experts in their respective research landscapes, validate merged entities, identify additional duplicates, and enrich organisational records with missing information such as PIDs, alternative names, and hierarchical relationships. With over 100 curators from 40 countries, OpenOrgs has curated more than 100,000 organisations to date. A dataset containing all the OpenOrgs organisations can be found on Zenodo (https://doi.org/10.5281/zenodo.13271358). This presentation demonstrates how OpenAIRE's entity disambiguation techniques and OpenOrgs aim to be game-changers for the research community by building and maintaining an integrated open scholarly communication system in the years to come.
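As a toy illustration of name-based duplicate grouping (not OpenAIRE's production deduplication algorithm), the sketch below normalizes organisation names and groups records whose names are highly similar; real systems would also use PIDs, blocking, and curator review.

```python
# Toy illustration of name-based duplicate grouping (NOT OpenAIRE's production
# deduplication algorithm): normalize organisation names, then group records
# whose normalized names are highly similar. Record IDs and names are invented.
from difflib import SequenceMatcher

records = [
    {"id": "openorgs::1", "name": "Ludwig-Maximilians-Universität München"},
    {"id": "ror::2",      "name": "Ludwig Maximilians University Munich"},
    {"id": "cris::3",     "name": "University of Helsinki"},
]

def normalize(name: str) -> str:
    # Lowercase and strip punctuation, keeping letters, digits and spaces.
    return "".join(c.lower() for c in name if c.isalnum() or c.isspace())

def similar(a: str, b: str, threshold: float = 0.7) -> bool:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

groups: list[list[dict]] = []
for rec in records:
    for group in groups:
        if any(similar(rec["name"], other["name"]) for other in group):
            group.append(rec)
            break
    else:
        groups.append([rec])

print([[r["id"] for r in g] for g in groups])
```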
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Different data sources for uncovering failures in the reporting of safety and effectiveness for selected examples of new drugs.
GapMaps Live is an easy-to-use location intelligence platform available across 25 countries globally that allows you to visualise your own store data combined with the latest demographic, economic and population-movement intel, right down to the micro level, so you can make faster, smarter and surer decisions when planning your network growth strategy.
With one single login, you can access the latest estimates on resident and worker populations, census metrics (e.g. age, income, ethnicity), consuming class, retail spend insights and point-of-interest data across a range of categories including fast food, cafe, fitness, supermarket/grocery and more.
Some of the world's biggest brands including McDonalds, Subway, Burger King, Anytime Fitness and Dominos use GapMaps Live Map Data as a vital strategic tool where business success relies on up-to-date, easy-to-understand location intel that can power business case validation and drive rapid decision making.
Primary Use Cases for GapMaps Live Map Data include:
Some of the features our clients love about GapMaps Live Map Data include:
- View business locations, competitor locations, demographic, economic and social data around your business or selected location
- Understand consumer visitation patterns (“where from” and “where to”), frequency of visits, dwell time of visits, profiles of consumers and much more
- Save searched locations and drop pins
- Turn on/off all location listings by category
- View and filter data by metadata tags, for example hours of operation, contact details, services provided
- Combine public data in GapMaps with views of private data layers
- View data in layers to understand the impact of different data sources
- Share maps with teams
- Generate demographic reports and comparative analyses on different locations based on drive time, walk time or radius
- Access multiple countries and brands with a single login
- Access multiple brands under a parent login
- Capture field data such as photos, notes and documents using GapMaps Connect and integrate with GapMaps Live to get detailed insights on existing and proposed store locations
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
City of Austin Open Data Terms of Use https://data.austintexas.gov/stories/s/ranj-cccq
This data set contains information about the site plan case applications submitted for review to the City of Austin. The data set includes information about case status in the permit review system, case number, proposed use, applicant, owner, and location.
Our Geospatial Dataset connects people's movements to over 200M physical locations globally. These are aggregated and anonymized data that are only used to offer context for the volume and patterns of visits to certain locations. This data feed is compiled from different data sources around the world.
It includes information such as the name, address, coordinates, and category of these locations, which can range from restaurants and hotels to parks and tourist attractions.
Location Intelligence Data Reach: Location Intelligence data provides POI/Place/OOH-level insights calculated from Factori’s Mobility & People Graph data aggregated from multiple data sources globally. To achieve the desired foot-traffic attribution, specific attributes are combined to bring forward the desired reach data. For instance, to calculate the foot traffic for a specific location, a combination of location ID, day of the week, and part of the day can be used to give specific location intelligence data. There can be a maximum of 56 data records for one POI based on the combination of these attributes.
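A quick sketch of where the stated maximum of 56 records per POI could come from, assuming the part-of-day attribute has 8 buckets (7 days x 8 parts of day = 56); the bucket names are illustrative, not Factori's actual schema.

```python
# Illustration of the "maximum of 56 records per POI" figure, assuming the
# part-of-day attribute has 8 buckets (7 days x 8 parts = 56). Bucket names
# and the location ID are made up for illustration.
from itertools import product

days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
parts_of_day = ["early_morning", "morning", "late_morning", "noon",
                "afternoon", "evening", "night", "late_night"]

records = [{"location_id": "poi_123", "day_of_week": d, "part_of_day": p}
           for d, p in product(days, parts_of_day)]
print(len(records))  # 56
```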
Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method at a suitable interval (daily/weekly/monthly).
Use Cases:
- Credit Scoring: Financial services can use alternative data to score an underbanked or unbanked customer by validating locations and persona.
- Retail Analytics: Analyze footfall trends in various locations and gain an understanding of customer personas.
- Market Intelligence: Study various market areas, the proximity of points of interest, and the competitive landscape.
- Urban Planning: Build cases for urban development, public infrastructure needs, and transit planning based on fresh population data.
- Marketing Campaign Strategy: By analyzing visitor demographics and behavior patterns around POIs, businesses can tailor their marketing strategies to effectively reach their target audience.
- OOH/DOOH Campaign Planning: Identify high-traffic locations and understand consumer behavior in specific areas to execute targeted advertising strategies effectively.
- Geofencing: Geofencing involves creating virtual boundaries around physical locations, enabling businesses to trigger actions when users enter or exit these areas.
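As a minimal illustration of the geofencing use case above, the sketch below flags whether reported coordinates fall within a fixed radius of a POI; the coordinates and radius are invented for the example.

```python
# Minimal sketch of the geofencing idea: flag when reported coordinates fall
# within a radius of a POI. The POI coordinates and radius are made up.
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

poi = (40.7580, -73.9855)          # hypothetical POI location
fence_radius_m = 150

def inside_geofence(lat, lon):
    return haversine_m(lat, lon, *poi) <= fence_radius_m

print(inside_geofence(40.7585, -73.9850))   # True: roughly 70 m away
print(inside_geofence(40.7680, -73.9850))   # False: roughly 1.1 km away
```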
Data Attributes Included:
LocationID
name
website
BrandID
Phone
streetAddress
city
state
country_code
zip
lat
lng
poi_status
geoHash8
poi_id
category
category_id
full_address
address
additional_categories
url
domain
rating
price_level
rating_distribution
is_claimed
photo_url
attributes
brand_name
brand_id
status
total_photos
popular_times
places_topics
people_also_search
work_hours
local_business_links
contact_info
reviews_count
naics_code
naics_code_description
sis_code
sic_code_description
shape_polygon
building_id
building_type
building_name
geometry_location_type
geometry_viewport_northeast_lat
geometry_viewport_northeast_lng
geometry_viewport_southwest_lat
geometry_viewport_southwest_lng
geometry_location_lat
geometry_location_lng
calculated_geo_hash_8
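For orientation, a hypothetical record combining a subset of the attributes listed above might look like the following; all values are invented for illustration and do not reflect real data.

```python
# Illustrative example of a single POI record using a subset of the attributes
# listed above. Every value below is invented for illustration only.
example_poi = {
    "LocationID": "loc_000123",
    "name": "Example Coffee House",
    "BrandID": "brand_042",
    "streetAddress": "123 Main St",
    "city": "Austin",
    "state": "TX",
    "country_code": "US",
    "zip": "78701",
    "lat": 30.2672,
    "lng": -97.7431,
    "category": "cafe",
    "poi_status": "open",
    "reviews_count": 187,
    "rating": 4.4,
    "work_hours": {"mon": "07:00-18:00", "sun": "08:00-16:00"},
}
```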
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy. Here we present Medical Information Mart for Intensive Care (MIMIC)-IV, a large deidentified dataset of patients admitted to the emergency department or an intensive care unit at the Beth Israel Deaconess Medical Center in Boston, MA. MIMIC-IV contains data for over 65,000 patients admitted to an ICU and over 200,000 patients admitted to the emergency department. MIMIC-IV incorporates contemporary data and adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.
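A hedged sketch of how the modular tables can be combined with pandas once the credentialed files are downloaded; the file paths and column names follow the commonly documented MIMIC-IV layout but should be verified against your local copy.

```python
# Hedged sketch of combining MIMIC-IV's modular tables with pandas, assuming a
# local copy of the credentialed files. Paths and column names follow the
# commonly documented layout and should be checked against your download.
import pandas as pd

patients = pd.read_csv("mimiciv/hosp/patients.csv.gz")       # one row per patient
admissions = pd.read_csv("mimiciv/hosp/admissions.csv.gz")   # hospital admissions
icustays = pd.read_csv("mimiciv/icu/icustays.csv.gz")        # ICU stays

# Link the modules via the shared identifiers (subject_id, hadm_id).
cohort = (icustays
          .merge(admissions, on=["subject_id", "hadm_id"], how="left")
          .merge(patients, on="subject_id", how="left"))
print(cohort[["subject_id", "hadm_id", "stay_id", "los", "anchor_age"]].head())
```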
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.1/customlicense?persistentId=doi:10.7910/DVN/NQSHQD
This contains data and software for the following paper: Hill, Benjamin Mako and Shaw, Aaron. (2014) "Consider the Redirect: A Missing Dimension of Wikipedia Research." In Proceedings of the 10th International Symposium on Open Collaboration (OpenSym 2014). ACM Press. doi: 10.1145/2641580.2641616 This is an archival version of the data and software released with the paper. All of these data were originally (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-redirects/
In wikis, redirects are special pages that silently take readers from the page they are visiting to another page in the wiki. In the English Wikipedia, redirects make up more than half of all article pages. Different Wikipedia data sources handle redirects differently. For example, the MediaWiki API will automatically "follow" redirects, but the XML database dumps treat redirects like normal articles. In both cases, redirects are often invisible to researchers. Because redirects constitute a majority of all pages and see a large portion of all traffic, Wikipedia researchers need to take redirects into account or their findings may be incomplete or incorrect. For example, the histogram on the project page shows the distribution of edits across pages in Wikipedia for every page, and for non-redirects only. Because redirects are almost never edited, the distributions are very different. Similarly, because redirects are viewed but almost never edited, any study of views over articles should also take redirects into account. Because redirects can change over time, the snapshots of redirects stored by Wikimedia and published by the Wikimedia Foundation are incomplete. Taking redirects into account fully involves looking at the content of every single revision of every article to determine both when and where pages redirect. Much more detail can be found in Consider the Redirect: A Missing Dimension of Wikipedia Research, a short paper that we have written to accompany this dataset and these tools. If you use this software or these data, we would appreciate it if you cite the paper. This dataset was previously hosted at this now obsolete URL: http://networkcollectiv.es/wiki-redirects/
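As a minimal illustration of handling redirects when processing the XML dumps, the sketch below detects the common "#REDIRECT [[Target]]" form in raw wikitext; localized redirect keywords and other edge cases in real dumps would need additional handling, and this is not the authors' released software.

```python
# Minimal sketch (not the authors' released software) of detecting redirects in
# raw wikitext, as needed when the XML dumps treat redirects like normal pages.
# Covers the common "#REDIRECT [[Target]]" form only.
import re

REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]\|#]+)", re.IGNORECASE)

def redirect_target(wikitext: str) -> str | None:
    """Return the redirect target if the revision text is a redirect, else None."""
    m = REDIRECT_RE.match(wikitext)
    return m.group(1).strip() if m else None

print(redirect_target("#REDIRECT [[Barack Obama]]"))   # 'Barack Obama'
print(redirect_target("'''Barack Obama''' is ..."))    # None
```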
https://creativecommons.org/publicdomain/zero/1.0/
Project Atlas - São Paulo is a Data Science and Engineering initiative that aims at developing relevant and curated geospatial features about the city of São Paulo, Brazil. Its ultimate use is varied, but it is mainly focused on Machine Learning tasks, such as real estate price prediction.
It aggregates several attributes from many public data sources at different levels of interest, which can be used to match geospatially referenced data (lat, long pairs, for example).
A breakdown of the data sources currently used and their original references can be found below, but the official documentation of the project contains the full list of data sources.
tb_district.parquet: the dataset with all derived features aggregated at the District level;
tb_neighborhood.parquet: the dataset with all derived features aggregated at the Neighborhood level;
tb_zipcode.parquet: the dataset with all derived features aggregated at the Zipcode level;
tb_area_of_ponderation: the dataset with all derived features aggregated at the Area of Ponderation level.
This project had various inspirations, such as the Boston Housing Dataset. While I was studying relevant features for the real estate market, I noticed that the classic Boston Housing dataset included several sociodemographic variables, which gave me the idea to do the same for São Paulo using the Brazilian Census data.
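A hedged sketch of matching lat/long points to the district-level features, assuming tb_district.parquet can be read as a GeoDataFrame with district geometries (the column names and coordinates below are guesses for illustration; consult the project documentation for the actual schema).

```python
# Hedged sketch of matching lat/long points to the district-level features.
# Assumes tb_district.parquet carries a geometry column readable by geopandas;
# column names and sample coordinates are illustrative guesses.
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

districts = gpd.read_parquet("tb_district.parquet")   # expected: GeoDataFrame with geometry

listings = pd.DataFrame({
    "listing_id": [1, 2],
    "lat": [-23.5614, -23.6820],
    "lng": [-46.6559, -46.7010],
})
points = gpd.GeoDataFrame(
    listings,
    geometry=[Point(xy) for xy in zip(listings["lng"], listings["lat"])],
    crs=districts.crs,
)

# Attach district-level features to each listing via a spatial join.
enriched = gpd.sjoin(points, districts, how="left", predicate="within")
print(enriched.head())
```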
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Enhanced Modeling
https://www.datainsightsmarket.com/privacy-policy
The Location Intelligence Analytics market is experiencing robust growth, driven by the increasing need for businesses to leverage location data for strategic decision-making. The market's expansion is fueled by several key factors. Firstly, the proliferation of readily available location data from various sources, including GPS, mobile devices, and IoT sensors, provides rich insights for businesses across diverse sectors. Secondly, advancements in technologies like AI and machine learning are enhancing the analytical capabilities of location intelligence platforms, enabling more sophisticated predictions and optimized resource allocation. This is further amplified by the growing adoption of cloud-based solutions offering scalability and cost-effectiveness. Finally, the demand for real-time insights and personalized experiences is driving companies to incorporate location intelligence into their operations, ranging from supply chain optimization and targeted marketing to risk management and urban planning. We estimate the market size in 2025 to be approximately $15 billion, considering the rapid technological advancements and high adoption rates across various industries. A compound annual growth rate (CAGR) of 15% from 2025 to 2033 is projected, indicating significant market potential.
However, despite the positive growth trajectory, the market faces certain challenges. Data privacy and security concerns are paramount, requiring robust compliance measures. The complexity of integrating disparate data sources and the need for skilled professionals to interpret the analytical outputs can hinder adoption for some businesses. Furthermore, the high initial investment costs associated with implementing location intelligence solutions may deter smaller organizations. Nevertheless, the strategic advantages of location intelligence are undeniable, and we expect the market to continue expanding significantly over the forecast period, with continued innovation in analytics technologies and expanding use cases driving its future growth. The competitive landscape is marked by a blend of established players like SAP, IBM, and Oracle, alongside emerging technology firms. This fosters innovation and provides a diverse range of solutions for businesses of all sizes.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
City of Austin Open Data Terms of Use https://data.austintexas.gov/stories/s/ranj-cccq
This data set contains information about the subdivision case applications submitted for review to the City of Austin. The data set includes information about case status in the permit review system, case number, proposed use, applicant, owner, and location.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The NBA and WNBA dataset is a large-scale play-by-play and shot-detail dataset covering both NBA and WNBA games, collected from multiple public sources (e.g., official league APIs and stats sites). It provides every in-game event—from period starts, jump balls, fouls, turnovers, rebounds, and field-goal attempts through free throws—along with detailed shot metadata (shot location, distance, result, assisting player, etc.).
You can also download the dataset from GitHub or Google Drive.
Tutorials
I will be grateful for ratings and stars on GitHub, but the best gratitude is using the dataset in your projects.
Useful links:
I made this dataset because I want to simplify and speed up work with play-by-play data, so that researchers spend their time studying data, not collecting it. Because of the request limits on the NBA and WNBA websites, and because you can get the play-by-play of only one game per request, collecting this data is a very long process.
Using this dataset, you can reduce the time to get information about one season from a few hours to a couple of seconds and spend more time analyzing data or building models.
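For example, a few lines of pandas are enough to load a season file and summarize shots; the file name and column names below are illustrative guesses, not the dataset's exact schema.

```python
# Hedged sketch of loading a season of play-by-play data with pandas. The file
# name, game ID, and column names are illustrative guesses, not the exact schema.
import pandas as pd

pbp = pd.read_csv("nba_pbp_2023_24.csv")   # one row per in-game event

# Example: field-goal attempts and makes per team for a single game.
shots = pbp[(pbp["game_id"] == 22300001) & (pbp["event_type"] == "shot")]
summary = shots.groupby("team")["shot_made"].agg(attempts="count", makes="sum")
print(summary)
```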
I also added play-by-play information from other sources: pbpstats.com, data.nba.com, cdnnba.com. This data will enrich information about the progress of each game and hopefully add opportunities to do interesting things.
If you have any questions or suggestions about the dataset, you can write to me through whichever channel is convenient for you:
Our POI Data connects people's movements to over 14M physical locations globally. These are aggregated and anonymized data that are only used to offer context for the volume and patterns of visits to certain locations. This data feed is compiled from different data sources around the world.
Reach: Location Intelligence data brings POI/Place/OOH-level insights calculated based on Factori’s Mobility & People Graph data aggregated from multiple data sources globally. To achieve the desired foot-traffic attribution, specific attributes are combined to bring forward the desired reach data. For instance, to calculate the foot traffic for a specific location, a combination of location ID, day of the week, and part of the day can be combined to give specific location intelligence data. There can be a maximum of 40 data records possible for one POI based on the combination of these attributes.
Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method at a suitable interval (daily/weekly/monthly).
Use Cases:
- Credit Scoring: Financial services can use alternative data to score an underbanked or unbanked customer by validating locations and persona.
- Retail Analytics: Analyze footfall trends in various locations and gain an understanding of customer personas.
- Market Intelligence: Study various market areas, the proximity of points of interest, and the competitive landscape.
- Urban Planning: Build cases for urban development, public infrastructure needs, and transit planning based on fresh population data.
Data Attributes: Location ID, n_visitors, day_of_week, distance_from_home, do_date, month, part_of_day, travelled_countries, Visitor_country_origin, Visitor_home_origin, Visitor_work_origin, year