The Means of Transportation to Work dataset was compiled using information from December 31, 2023 (updated December 12, 2024) from the Bureau of Transportation Statistics (BTS) and is part of the U.S. Department of Transportation (USDOT)/BTS National Transportation Atlas Database (NTAD). The Means of Transportation to Work table from the 2023 American Community Survey (ACS) 5-year estimates was joined to 2023 tract-level geographies for all 50 states, the District of Columbia, and Puerto Rico provided by the Census Bureau. A new file was created that combines the demographic variables of the former with the cartographic boundaries of the latter. The national-level census tract layer contains data on the number and percentage of commuters (workers 16 years and over) who used various transportation modes to get to work. A data dictionary, or other source of attribute information, is accessible at https://doi.org/10.21949/1529037
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction

Transparency and traceability are essential for establishing trustworthy artificial intelligence (AI). The lack of transparency in the data preparation process is a significant obstacle to developing reliable AI systems and can lead to issues with reproducibility, debugging of AI models, bias and fairness, and compliance and regulation. We introduce a formal data preparation pipeline specification to improve upon the manual and error-prone data extraction processes used in AI and data analytics applications, with a focus on traceability.

Methods

We propose a declarative language to define the extraction of AI-ready datasets from health data adhering to a common data model, particularly those conforming to HL7 Fast Healthcare Interoperability Resources (FHIR). We utilize FHIR profiling to develop a common data model tailored to an AI use case, enabling the explicit declaration of the needed information, such as phenotype and AI feature definitions. In our pipeline model, we convert complex, high-dimensional electronic health record data with irregular time-series sampling to a flat structure by defining a target population, feature groups, and final datasets. Our design considers the requirements of various AI use cases from different projects, which led to the implementation of many feature types exhibiting intricate temporal relations.

Results

We implement a scalable and high-performance feature repository to execute the data preparation pipeline definitions. This software not only ensures reliable, fault-tolerant distributed processing to produce AI-ready datasets and their metadata, including accompanying statistics, but also serves as a pluggable component of a decision support application based on a trained AI model, automatically preparing feature values of individual entities during online prediction. We deployed and tested the proposed methodology and implementation in three different research projects. We present the developed FHIR profiles as a common data model, together with the feature group definitions and feature definitions of a data preparation pipeline used while training an AI model for “predicting complications after cardiac surgeries”.

Discussion

Through the implementation across various pilot use cases, it has been demonstrated that our framework possesses the necessary breadth and flexibility to define a diverse array of features, each tailored to specific temporal and contextual criteria.
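As a rough illustration of the flattening idea described in the Methods (converting irregularly sampled observations into fixed-width features per entity), here is a minimal sketch; the function, field names, and window logic are our own assumptions for illustration, not the paper's actual declarative language:

```python
from datetime import datetime
from statistics import mean

def flatten(observations, window_start, window_end):
    """Reduce one patient's irregular (timestamp, value) series to fixed features.

    This mirrors the feature-group idea: count, mean, and latest value
    within a lookback window (a hypothetical feature set, for illustration).
    """
    in_window = sorted((t, v) for t, v in observations if window_start <= t <= window_end)
    values = [v for _, v in in_window]
    if not values:
        return {"count": 0, "mean": None, "last": None}
    return {"count": len(values), "mean": mean(values), "last": values[-1]}

# Example: three lab measurements, only two fall inside the feature window.
obs = [
    (datetime(2023, 1, 5), 7.1),
    (datetime(2023, 2, 9), 6.8),
    (datetime(2023, 6, 1), 7.4),
]
features = flatten(obs, datetime(2023, 1, 1), datetime(2023, 3, 1))
```

In a real pipeline the window bounds would be declared relative to an index event (e.g., the surgery date) rather than hard-coded.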
| Field Name | Data Type | Description |
|---|---|---|
| Statefp | Number | US Census Bureau unique identifier of the state |
| Countyfp | Number | US Census Bureau unique identifier of the county |
| Countynm | Text | County name |
| Tractce | Number | US Census Bureau unique identifier of the census tract |
| Geoid | Number | US Census Bureau unique identifier of the state + county + census tract |
| Aland | Number | US Census Bureau defined land area of the census tract |
| Awater | Number | US Census Bureau defined water area of the census tract |
| Asqmi | Number | Area in square miles, calculated from Aland |
| MSSAid | Text | ID of the Medical Service Study Area (MSSA) the census tract belongs to |
| MSSAnm | Text | Name of the Medical Service Study Area (MSSA) the census tract belongs to |
| Definition | Text | Type of MSSA; possible values are urban, rural, and frontier |
| TotalPovPop | Number | US Census Bureau total population for whom poverty status is determined, for the census tract, taken from the 2020 ACS 5-year S1701 |
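Since Asqmi is derived from Aland, and the Census Bureau reports ALAND in square meters, the conversion can be sketched as follows (a minimal illustration; the constant follows from the exact definition 1 mile = 1609.344 m):

```python
# ALAND is in square meters in Census geography files; Asqmi is square miles.
SQ_METERS_PER_SQ_MILE = 1609.344 ** 2  # 2,589,988.110336

def aland_to_sqmi(aland_m2):
    """Convert a Census ALAND value (square meters) to square miles."""
    return aland_m2 / SQ_METERS_PER_SQ_MILE

# An area of exactly 2 square miles, expressed in square meters.
area = aland_to_sqmi(2 * SQ_METERS_PER_SQ_MILE)
```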
The virtual R/Pharma Conference is happening this week! To celebrate, we're exploring Patient Risk Profiles. Thank you to Jenna Reps for preparing this data!
This dataset contains 100 simulated patients' medical history features and the predicted 1-year risk of 14 outcomes based on each patient's medical history features. The predictions used real logistic regression models developed on a large real-world healthcare dataset.
patient_risk_profiles.csv

| variable | class | description |
|---|---|---|
| personId | integer | A unique identifier for the simulated patient |
| age group: 10 - 14 | integer | A binary column where 1 means the patient is aged between 10-14 (inclusive) and 0 means the patient is not in that age group |
| age group: 15 - 19 | integer | A binary column where 1 means the patient is aged between 15-19 (inclusive) and 0 means the patient is not in that age group |
| age group: 20 - 24 | integer | A binary column where 1 means the patient is aged between 20-24 (inclusive) and 0 means the patient is not in that age group |
| age group: 65 - 69 | integer | A binary column where 1 means the patient is aged between 65-69 (inclusive) and 0 means the patient is not in that age group |
| age group: 40 - 44 | integer | A binary column where 1 means the patient is aged between 40-44 (inclusive) and 0 means the patient is not in that age group |
| age group: 45 - 49 | integer | A binary column where 1 means the patient is aged between 45-49 (inclusive) and 0 means the patient is not in that age group |
| age group: 55 - 59 | integer | A binary column where 1 means the patient is aged between 55-59 (inclusive) and 0 means the patient is not in that age group |
| age group: 85 - 89 | integer | A binary column where 1 means the patient is aged between 85-89 (inclusive) and 0 means the patient is not in that age group |
| age group: 75 - 79 | integer | A binary column where 1 means the patient is aged between 75-79 (inclusive) and 0 means the patient is not in that age group |
| age group: 5 - 9 | integer | A binary column where 1 means the patient is aged between 5-9 (inclusive) and 0 means the patient is not in that age group |
| age group: 25 - 29 | integer | A binary column where 1 means the patient is aged between 25-29 (inclusive) and 0 means the patient is not in that age group |
| age group: 0 - 4 | integer | A binary column where 1 means the patient is aged between 0-4 (inclusive) and 0 means the patient is not in that age group |
| age group: 70 - 74 | integer | A binary column where 1 means the patient is aged between 70-74 (inclusive) and 0 means the patient is not in that age group |
| age group: 50 - 54 | integer | A binary column where 1 means the patient is aged between 50-54 (inclusive) and 0 means the patient is not in that age group |
| age group: 60 - 64 | integer | A binary column where 1 means the patient is aged between 60-64 (inclusive) and 0 means the patient is not in that age group |
| age group: 35 - 39 | integer | A binary column where 1 means the patient is aged between 35-39 (inclusive) and 0 means the patient is not in that age group |
| age group: 30 - 34 | integer | A binary column where 1 means the patient is aged between 30-34 (inclusive) and 0 means the patient is not in that age group |
| age group: 80 - 84 | integer | A binary column where 1 means the patient is aged between 80-84 (inclusive) and 0 means the patient is not in that age group |
| age group: 90 - 94 | integer | A binary column where 1 means the patient is aged between 90-94 (inclusive) and 0 means the patient is not in that age group |
| Sex = FEMALE | integer | A binary column where 1 means the patient has a female sex |
| sex = MALE | integer | A binary column where 1 means the patient has a male sex |
| Acetaminophen exposures in prior year | integer | A binary column where 1 means the patient had a record for acetaminophen in the prior year and 0 means they did not |
| Occurrence of Alcoholism in prior year | integer | A binary column where 1 means the patient had a record for alcoholism in the prior year and 0 means they did not |
| Anemia in prior year | integer | A binary column where 1 means the patient had a record for anemia in the prior year and 0 means they did not |
| Angina events in prior year | integer | A binary column where 1 means the patient had a record for angina in the prior year and 0 means they did not |
| ANTIEPILEPTICS in prior year | integer | A binary column where 1 means the patient had a record for a drug in the category ANTIEPILEPTICS in the prior year and 0 means they did not |
| Occurrence of Anxiety in prior year | integer | A binary column where 1 means the patient had a record for anxiety in the prior year and 0 means... |
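A quick way to work with the one-hot age-group columns is to recover a single categorical label per patient. A minimal sketch, using a hypothetical subset of the columns and an in-memory row rather than the real CSV:

```python
# Illustrative subset of the dataset's "age group: X - Y" one-hot columns.
AGE_COLS = ["age group: 0 - 4", "age group: 40 - 44", "age group: 85 - 89"]

def age_group(row):
    """Return the single age-group column flagged 1, or None if ambiguous."""
    matches = [c for c in AGE_COLS if row.get(c) == 1]
    return matches[0] if len(matches) == 1 else None

# One simulated patient row (a made-up example, not real dataset values).
patient = {
    "personId": 1,
    "age group: 0 - 4": 0,
    "age group: 40 - 44": 1,
    "age group: 85 - 89": 0,
}
label = age_group(patient)
```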
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
By Health [source]
This dataset is a valuable resource for gaining insight into Inpatient Prospective Payment System (IPPS) utilization, average charges, and average Medicare payments across the top 100 Diagnosis-Related Groups (DRG). With columns such as DRG Definition, Hospital Referral Region Description, Total Discharges, Average Covered Charges, Average Medicare Payments, and Average Medicare Payments 2, this dataset enables researchers to discover and assess healthcare trends, such as comparing provider payments by geographic location or comparing service costs across hospitals. Visualize the data using various methods to uncover unique information and drive further hospital research.
This dataset provides a provider level summary of Inpatient Prospective Payment System (IPPS) discharges, average charges and average Medicare payments for the Top 100 Diagnosis-Related Groups (DRG). This data can be used to analyze cost and utilization trends across hospital DRGs.
To make the most use of this dataset, here are some steps to consider:
- Understand what each column means in the table: Each column provides different information from the DRG Definition to Hospital Referral Region Description and Average Medicare Payments.
- Analyze the data by looking for patterns amongst the relevant columns: Compare different aspects such as total discharges or average Medicare payments by hospital referral region or DRG Definition. This can help identify any potential trends amongst different categories within your analysis.
- Generate visualizations: Create charts, graphs, or maps that display your data in an easy-to-understand format using tools such as Microsoft Excel or Tableau. Such visuals may reveal more insights into patterns within your data than simply reading numerical values on a spreadsheet could provide alone.
- Identify potential areas of cost savings by drilling down to particular DRGs and hospital regions with the highest average covered charges compared to average Medicare payments.
- Establish benchmarks for typical charges and payments across different DRGs and hospital regions to help providers set market-appropriate prices.
- Analyze trends in total discharges, charges, and Medicare payments over time, allowing healthcare organizations to measure their performance against regional peers.
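The cost-savings idea above (comparing average covered charges to average Medicare payments) can be sketched as a simple ratio ranking; the rows below are illustrative values, not taken from the dataset:

```python
# Illustrative (DRG, region) rows with made-up charge/payment figures.
rows = [
    {"drg_definition": "039 - EXTRACRANIAL PROCEDURES",
     "hospital_referral_region_description": "AL - Birmingham",
     "average_covered_charges": 32963.07, "average_medicare_payments": 5777.24},
    {"drg_definition": "039 - EXTRACRANIAL PROCEDURES",
     "hospital_referral_region_description": "CA - San Francisco",
     "average_covered_charges": 20313.00, "average_medicare_payments": 6960.00},
]

def charge_to_payment_ratio(row):
    """Higher ratios flag regions where billed charges most exceed payments."""
    return row["average_covered_charges"] / row["average_medicare_payments"]

ranked = sorted(rows, key=charge_to_payment_ratio, reverse=True)
top_region = ranked[0]["hospital_referral_region_description"]
```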
If you use this dataset in your research, please credit the original authors. Data Source
License: Open Database License (ODbL) v1.0. You are free to:
- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material for any purpose, even commercially.

You must:
- Give appropriate credit: provide a link to the license, and indicate if changes were made.
- ShareAlike: distribute your contributions under the same license as the original.
- Keep intact all notices that refer to this license, including copyright notices.
- No additional restrictions: you may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
File: 97k6-zzx3.csv

| Column name | Description |
|---|---|
| drg_definition | Diagnosis-Related Group (DRG) definition. (String) |
| average_medicare_payments | Average Medicare payments for each DRG. (Numeric) |
| hospital_referral_region_description | Description of the hospital referral region. (String) |
| total_discharges | Total number of discharges for each DRG. (Numeric) |
| average_covered_charges | Average covered charges for each DRG. (Numeric) |
| average_medicare_payments_2 | Average Medicare payments for each DRG. (Numeric) |
**File: Inpatient_Prospective_Payment_System_IPPS_Provider_Summary_for_the_Top_100_Diagnosis-Related_Groups_DRG...
All code and input files used in the k-means clustering analysis of Opportunity Atlas data. This dataset is associated with the following publication: Zelasky, S., C. Martin, C. Weaver, L. Baxter, and K. Rappazzo. Identifying groups of children's social mobility opportunity for public health applications using k-means clustering. Heliyon. Elsevier B.V., Amsterdam, NETHERLANDS, 9(9): E20250, (2023).
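For readers unfamiliar with the method, a minimal 1-D k-means (Lloyd's algorithm) sketch is shown below; the data points are made up, and this is not the authors' published code:

```python
import random
from statistics import mean

def kmeans_1d(points, k, iters=100, seed=0):
    """Cluster 1-D values with Lloyd's algorithm: assign, re-center, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return sorted(centers)

# Two well-separated groups of illustrative "mobility measure" values.
centers = kmeans_1d([1.0, 1.2, 0.9, 8.0, 8.3, 7.9], k=2)
```

The published analysis clusters multi-dimensional tract-level measures; the 1-D case just makes the assign/re-center loop easy to follow.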
ACF Agency Wide resource. Metadata-only record linking to the original dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun and Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean), and Andrea Miletič (University of Zagreb). It is being made public both to act as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (the pre-print is available in Open Access here -> https://arxiv.org/abs/2305.10234) and so that other researchers can use these data in their own work.
The protocol is intended for the systematic literature review on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, including the indicators used in them, the stakeholders involved, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.
Methodology
To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).
These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the papers to those in which these topics were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 articles were found to be unique and were further checked for relevance. As a result, a total of 9 articles were examined in depth. Each study was independently examined by at least two authors.
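The deduplication step across database exports can be sketched as keying records on a normalized DOI, falling back to the title; the field names and records here are illustrative, not the study's actual tooling:

```python
def deduplicate(records):
    """Keep the first record per key; key on lowercased DOI, else title."""
    seen, unique = set(), []
    for r in records:
        key = (r.get("doi") or "").lower() or r["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# The same paper exported from two databases, plus one DOI-less record.
hits = [
    {"doi": "10.1000/xyz1", "title": "High-value datasets: a review"},
    {"doi": "10.1000/XYZ1", "title": "High-Value Datasets: A Review"},
    {"doi": "", "title": "Determining HVD indicators"},
]
unique = deduplicate(hits)
```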
To attain the objective of our study, we developed a protocol in which the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.
Test procedure

Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the survey is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.
Description of the data in this data set
Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for relevant studies. Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e., before filtering out irrelevant studies.
The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information
Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
2) Complete reference - the complete source information to refer to the study
3) Year of publication - the year in which the study was published
4) Journal article / conference paper / book chapter - the type of the paper: {journal article, conference paper, book chapter}
5) DOI / Website - a link to the website where the study can be found
6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
7) Availability in OA - availability of an article in the Open Access
8) Keywords - keywords of the paper as indicated by the authors
9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}
Approach- and research design-related information
10) Objective / RQ - the research objective / aim and the established research questions
11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, or other specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR, etc.)
12) Contributions - the contributions of the study
13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach
14) Availability of the underlying research data - whether there is a reference to the publicly available underlying research data, e.g., transcriptions of interviews or collected data, or an explanation of why these data are not shared
15) Period under investigation - the period (or moment) in which the study was conducted
16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?
Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)?
18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused on HVD determination; secondary - mentioned but not studied (e.g., as part of discussion, future work, etc.))
HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
23) Data - what data do HVD cover?
24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)
Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx
Licenses or restrictions CC-BY
For more info, see README.txt
CERES_EBAF_Edition4.1 is the Clouds and the Earth's Radiant Energy System (CERES) Energy Balanced and Filled (EBAF) Top-of-Atmosphere (TOA) and surface monthly means data product in netCDF format, Edition 4.1. Data were collected using the CERES scanner instruments on both the Terra and Aqua platforms, and data collection for this product is ongoing. CERES_EBAF_Edition4.1 data are monthly and climatological averages of TOA clear-sky (spatially complete) fluxes and all-sky fluxes, where the TOA net flux is constrained to the ocean heat storage. It also provides computed monthly mean surface radiative fluxes consistent with the CERES EBAF-TOA product and some basic cloud properties derived from MODIS. Cloud radiative effects are provided at both the TOA and surface, as determined using a cloud-free profile in the Fu-Liou Radiative Transfer Model (RTM). Observed fluxes are obtained using cloud properties derived from narrow-band imagers onboard both the EOS Terra and Aqua satellites, as well as geostationary satellites, to fully model the diurnal cycle of clouds. The computations are also based on meteorological assimilation data from the Goddard Earth Observing System (GEOS) Version 5.4.1 model. Unlike other CERES Level 3 clear-sky regional data sets that contain clear-sky data gaps, the clear-sky fluxes in the EBAF-TOA product are regionally complete. The EBAF-TOA product is the CERES project's best estimate of the fluxes based on all available satellite platforms and input data. CERES is a key Earth Observing System (EOS) program component. The CERES instruments provide radiometric measurements of the Earth's atmosphere from three broadband channels. The CERES missions follow the successful Earth Radiation Budget Experiment (ERBE) mission. The first CERES instrument, the proto flight model (PFM), was launched on November 27, 1997, as part of the Tropical Rainfall Measuring Mission (TRMM).
Two CERES instruments (FM1 and FM2) were launched into polar orbit on board the Earth Observing System (EOS) flagship Terra on December 18, 1999. Two additional CERES instruments (FM3 and FM4) were launched on board Earth Observing System (EOS) Aqua on May 4, 2002. The CERES FM5 instrument was launched on board the Suomi National Polar-orbiting Partnership (NPP) satellite on October 28, 2011. The newest CERES instrument (FM6) was launched on board the Joint Polar-Orbiting Satellite System 1 (JPSS-1) satellite, now called NOAA-20, on November 18, 2017.
The NOAA National Centers for Environmental Information (formerly the National Geophysical Data Center) / World Data Center, Boulder, maintains an active database of worldwide geomagnetic observatory data. Historically, magnetic observatories were established to monitor the secular change (variation) of the Earth's magnetic field, and this remains one of their most important functions. This generally involves absolute measurements sufficient in number to monitor instrumental drift and to produce annual means. While the current global network of geomagnetic observatories involves over 70 countries operating more than 200 observatories, the historic database includes observations from more than 600 observatories since the early 1800s. The magnetic observatory data are crucial to studies of secular change, investigations into the Earth's interior, navigation, communication, and global modeling efforts. The Earth's magnetic field is described by seven parameters: declination (D), inclination (I), horizontal intensity (H), vertical intensity (Z), total intensity (F), and the north (X) and east (Y) components of the horizontal intensity. By convention, declination is considered positive when measured east of north, inclination and vertical intensity positive down, X positive north, and Y positive east. The magnetic field observed on Earth is constantly changing.
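Given the X, Y, and Z components, the remaining four elements follow from standard relations (H = sqrt(X² + Y²), F = sqrt(X² + Y² + Z²), D = atan2(Y, X), I = atan2(Z, H)); a small sketch:

```python
import math

def geomagnetic_elements(x, y, z):
    """Derive H, F, D, I from the X (north), Y (east), Z (down) components."""
    h = math.hypot(x, y)                  # horizontal intensity
    f = math.sqrt(x * x + y * y + z * z)  # total intensity
    d = math.degrees(math.atan2(y, x))    # declination, positive east of north
    i = math.degrees(math.atan2(z, h))    # inclination, positive down
    return {"H": h, "F": f, "D": d, "I": i}

# A simple illustrative field vector in nanoteslas (not observatory data).
elems = geomagnetic_elements(x=20000.0, y=0.0, z=20000.0)
```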
This is an update to the MSSA geometries and demographics to reflect the new 2020 Census tract data. The Medical Service Study Area (MSSA) polygon layer represents the best-fit mapping of all new 2020 California census tract boundaries to the original 2010 census tract boundaries used in the construction of the original 2010 MSSA file. Each of the state's 9,129 new census tracts was assigned to one of the previously established Medical Service Study Areas (excluding tracts with no land area), as identified in this data layer. The MSSA census tract data are aggregated by HCAI to create this MSSA data layer. This represents the final re-mapping of 2020 Census tracts to the original 2010 MSSA geometries. The 2010 MSSAs were based on U.S. Census 2010 data and public meetings held throughout California.
https://spdx.org/licenses/etalab-2.0.html
This dataset is a mapping between MEANS-InOut input data and Life Cycle Inventories (LCIs) from reference databases (Agribalyse, ecoinvent). The MEANS-InOut input data are agricultural production system inputs (fertilisers, plant protection products, agricultural operations, livestock feed, ingredients to be incorporated into livestock feed, etc.). Each input is associated with one or more LCIs, which represent the impacts of the production of this input, and with the database from which the LCI(s) come. This version of the dataset corresponds to the following versions of the databases: Agribalyse v3.1.1 and ecoinvent v3.9. The correspondence file (mapping_data.tab) is accompanied by: a document describing the input types in the MEANS-InOut software (file: Input_type_description.pdf), and a document describing how the value of the input flow of an LCI for an agricultural system studied in MEANS-InOut is obtained from the value taken by this input in MEANS-InOut (file: LCI_value_construction.pdf).
MOP03JM_9 is the Measurements Of Pollution In The Troposphere (MOPITT) Carbon Monoxide (CO) gridded monthly means (Near and Thermal Infrared Radiances) version 9 data product. It contains monthly mean gridded versions of the daily Level 2 CO profile and total column retrievals. For this data product, the averaging kernels associated with each retrieval are also gridded and included in the Level 3 files. For a description of the file contents, refer to the File Spec Document. The MOPITT Level 2 Data Quality Statement contains additional information about the quality and limitations of the retrievals. MOPITT was successfully launched into sun-synchronous polar orbit aboard Terra, NASA's first Earth Observing System spacecraft, on December 18, 1999. The MOPITT instrument was constructed by a consortium of Canadian companies and funded by the Space Science Division of the Canadian Space Agency. Data collection for this product is ongoing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data collected during a study "Understanding the development of public data ecosystems: from a conceptual model to a six-generation model of the evolution of public data ecosystems" conducted by Martin Lnenicka (University of Hradec Králové, Czech Republic), Anastasija Nikiforova (University of Tartu, Estonia), Mariusz Luterek (University of Warsaw, Warsaw, Poland), Petar Milic (University of Pristina - Kosovska Mitrovica, Serbia), Daniel Rudmark (Swedish National Road and Transport Research Institute, Sweden), Sebastian Neumaier (St. Pölten University of Applied Sciences, Austria), Karlo Kević (University of Zagreb, Croatia), Anneke Zuiderwijk (Delft University of Technology, Delft, the Netherlands), Manuel Pedro Rodríguez Bolívar (University of Granada, Granada, Spain).
As there is a lack of understanding of the elements that constitute different types of value-adding public data ecosystems, and of how these elements form and shape the development of these ecosystems over time, which can lead to misguided efforts to develop future public data ecosystems, the aims of the study are: (1) to explore how public data ecosystems have developed over time and (2) to identify the value-adding elements and formative characteristics of public data ecosystems. Using an exploratory retrospective analysis and a deductive approach, we systematically review 148 studies published between 1994 and 2023. Based on the results, this study presents a typology of public data ecosystems and develops a conceptual model of the elements and formative characteristics that contribute most to value-adding public data ecosystems, as well as a conceptual model of the evolutionary generations of public data ecosystems, represented by six generations and called the Evolutionary Model of Public Data Ecosystems (EMPDE). Finally, three avenues for a future research agenda are proposed.
This dataset is being made public both to act as supplementary data for "Understanding the development of public data ecosystems: from a conceptual model to a six-generation model of the evolution of public data ecosystems" (Telematics and Informatics) and for the Systematic Literature Review component that informs the study.
Description of the data in this data set
PublicDataEcosystem_SLR provides the structure of the protocol.
Spreadsheet #1 provides the list of results after the search over three indexing databases and filtering out irrelevant studies.
Spreadsheet #2 provides the protocol structure.
Spreadsheet #3 provides the filled protocol for relevant studies.
The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, (4) HVD determination-related information.
Descriptive Information
Article number
A study number, corresponding to the study number assigned in an Excel worksheet
Complete reference
The complete source information to refer to the study (in APA style), including the author(s) of the study, the year in which it was published, the study's title and other source information.
Year of publication
The year in which the study was published.
Journal article / conference paper / book chapter
The type of the paper, i.e., journal article, conference paper, or book chapter.
Journal / conference / book
The journal, conference, or book in which the paper is published.
DOI / Website
A link to the website where the study can be found.
Number of words
The number of words in the study.
Number of citations in Scopus and WoS
The number of citations of the paper in Scopus and WoS digital libraries.
Availability in Open Access
Whether the study is available in Open Access or Free / Full Access.
Keywords
Keywords of the paper as indicated by the authors (in the paper).
Relevance for our study (high / medium / low)
The relevance level of the paper for our study.
Approach- and research design-related information
Objective / Aim / Goal / Purpose & Research Questions
The research objective and established RQs.
Research method (including unit of analysis)
The methods used to collect data in the study, including the unit of analysis, i.e., the country, organisation, or other specific unit that has been analysed (e.g., the number of use cases or policy documents, the number and scope of the SLR, etc.).
Study’s contributions
The study’s contribution as defined by the authors
Qualitative / quantitative / mixed method
Whether the study uses a qualitative, quantitative, or mixed-methods approach.
Availability of the underlying research data
Whether the paper references the public availability of the underlying research data (e.g., transcriptions of interviews, collected data), or explains why these data are not openly shared.
Period under investigation
Period (or moment) in which the study was conducted (e.g., January 2021-March 2022)
Use of theory / theoretical concepts / approaches? If yes, specify them
Does the study mention any theory / theoretical concepts / approaches? If yes, what theory / concepts / approaches? If any theory is mentioned, how is theory used in the study? (e.g., mentioned to explain a certain phenomenon, used as a framework for analysis, tested theory, theory mentioned in the future research section).
Quality-related information
Quality concerns
Whether there are any quality concerns (e.g., limited information about the research methods used).
Public Data Ecosystem-related information
Public data ecosystem definition
How the public data ecosystem is defined in the paper, including any equivalent term used (most often, infrastructure). If an alternative term is used, what is the public data ecosystem called in the paper?
Public data ecosystem evolution / development
Does the paper define the evolution of the public data ecosystem? If yes, how is it defined and what factors affect it?
What constitutes a public data ecosystem?
What constitutes a public data ecosystem (components & relationships), i.e., its "FORM / OUTPUT" as presented in the paper (general description, with more detailed answers in the additional questions below).
Components and relationships
What components does the public data ecosystem consist of and what are the relationships between these components? Alternative names for components - element, construct, concept, item, helix, dimension etc. (detailed description).
Stakeholders
What stakeholders (e.g., governments, citizens, businesses, Non-Governmental Organisations (NGOs) etc.) does the public data ecosystem involve?
Actors and their roles
What actors does the public data ecosystem involve? What are their roles?
Data (data types, data dynamism, data categories etc.)
What data does the public data ecosystem cover (i.e., is intended / designed for)? This refers to all data-related aspects, including but not limited to data types, data dynamism (static data, dynamic data, real-time data, streams), and prevailing data categories / domains / topics.
Processes / activities / dimensions, data lifecycle phases
What processes, activities, dimensions and data lifecycle phases (e.g., locate, acquire, download, reuse, transform, etc.) does the public data ecosystem involve or refer to?
Level (if relevant)
What is the level of the public data ecosystem covered in the paper? (e.g., city, municipal, regional, national (=country), supranational, international).
Other elements or relationships (if any)
What other elements or relationships does the public data ecosystem consist of?
Additional comments
Additional comments (e.g., what other topics affected the public data ecosystems and their elements, what is expected to affect the public data ecosystems in the future, what were important topics by which the period was characterised etc.).
New papers
Does the study refer to any other potentially relevant papers?
Additional references to potentially relevant papers that were found in the analysed paper (snowballing).
Format of the file: .xls, .csv (for the first spreadsheet only), .docx
Licenses or restrictions: CC-BY
For more info, see README.txt
Please note, this dataset has been superseded by a newer version (see below). Users should not use this version except in rare cases (e.g., when reproducing previous studies that used this version). The Integrated Global Radiosonde Archive (IGRA) is a digital data set archived at the former National Climatic Data Center (NCDC), now the National Centers for Environmental Information (NCEI). This dataset contains monthly means of geopotential height, temperature, zonal wind, and meridional wind derived from IGRA, which consists of radiosonde and pilot balloon observations at over 1500 globally distributed stations; monthly means are available for the surface and mandatory levels at many of these stations. The period of record varies from station to station, with many extending from 1970 to 2016. Monthly means are computed separately for the nominal times of 0000 and 1200 UTC, considering data within two hours of each nominal time. A mean is provided, along with the number of values used to calculate it, whenever there are at least 10 values for a particular station, month, nominal time, and level.
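The monthly-mean rule described above (values within two hours of a nominal time, with at least 10 values required before a mean is reported) can be sketched in Python. The column names and observation values below are invented for illustration and do not reflect the actual IGRA file format:

```python
import pandas as pd
import numpy as np

# Hypothetical observations for one station, month, and level.
obs = pd.DataFrame({
    "hour_utc": [23, 0, 1, 2, 0, 23, 1, 0, 0, 1, 2, 0],
    "temperature_c": [5.1, 4.8, 4.9, 5.3, 5.0, 4.7, 5.2, 4.9, 5.0, 5.1, 4.8, 5.0],
})

# Keep observations within +/- 2 hours of the 0000 UTC nominal time
# (i.e., hours 22-23 or 0-2).
near_midnight = obs[(obs["hour_utc"] >= 22) | (obs["hour_utc"] <= 2)]

# Report the mean, along with the count used, only if >= 10 values exist.
n = len(near_midnight)
monthly_mean = near_midnight["temperature_c"].mean() if n >= 10 else np.nan
print(n, round(monthly_mean, 2))
```

The same threshold logic would be applied per station, month, nominal time, and level in the real product.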
This dataset is a custom reference of Google Analytics field definitions.
It was specifically compiled to enhance datasets like the Google Analytics 360 data from the Google Merchandise Store, which lacks field descriptions in its original BigQuery schema. By providing detailed definitions for each field, this reference aims to improve the interpretability of the data—especially when used by language models or analytics tools that rely on contextual understanding to process and answer queries effectively.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread. ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). 
If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later; users are notified when this occurs. So far this has only been the case for September 2021, and it will also be the case for October, November, and December 2021. For months prior to September 2021 the final release has always been equal to ERA5T, and the goal is to align the two again after December 2021. The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is held online on spinning disk, which should ensure fast and easy access, and it should satisfy the requirements of most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data have been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper-air fields) and single levels (atmospheric, ocean-wave and land-surface quantities). The present entry is "ERA5 monthly mean data on pressure levels from 1940 to present".
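The pre-computed ensemble mean and spread products mentioned above are, conceptually, the member-wise mean and standard deviation of the 10-member ensemble. A minimal sketch on a toy grid (this is illustrative NumPy, not the ECMWF processing code; the grid size and field values are invented):

```python
import numpy as np

# Simulate a 10-member ensemble of a temperature-like field on a toy 4x8 grid.
rng = np.random.default_rng(0)
members = rng.normal(loc=280.0, scale=0.5, size=(10, 4, 8))

# "Ensemble mean" and "ensemble spread" as described in the ERA5 documentation:
# the mean and standard deviation taken across the member dimension.
ens_mean = members.mean(axis=0)
ens_spread = members.std(axis=0)
print(ens_mean.shape, ens_spread.shape)
```

Because the spread reflects the available observing system, in the real product it varies both geographically and over the decades covered by the reanalysis.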
Kaggle Card: FHIR Profiles-Resources JSON File
Overview
Fast Healthcare Interoperability Resources (FHIR, pronounced "fire") is a standard developed by Health Level Seven International (HL7) for transferring electronic health records. The FHIR Profiles-Resources JSON file is an essential part of this standard. It provides a schema that defines the structure of FHIR resource types, including their properties and attributes.
Dataset Structure
This file is structured in the JSON format, known for its versatility and human-readable nature. Each JSON object corresponds to a unique FHIR resource type, outlining its structure and providing a blueprint for the properties and attributes each resource type should contain.
Fields Description
While the precise properties and attributes differ for each FHIR resource type, the typical elements you may encounter in this file include:
- Id: The unique identifier for the resource type.
- Url: A global identifier URI for the resource type.
- Version: The business version of the resource.
- Name: The human-readable name for the resource type.
- Status: The publication status of the resource (draft, active, retired).
- Experimental: A boolean value indicating whether this resource type is experimental.
- Date: The date of the resource type's last change.
- Publisher: The individual or organization that published the resource type.
- Contact: Contact details for the publishers.
- Description: A natural-language description of the resource type.
- UseContext: A list outlining the usability context for the resource type.
- Jurisdiction: Identifies the region/country where the resource type is defined.
- Purpose: An explanation of why the resource type is necessary.
- Element: A list defining the structure of the properties for the resource type, including data types and relationships with other resource types.
Potential Use Cases
- Schema Validation: Use the schema to validate FHIR data and ensure it aligns with the defined structure and types for each resource.
- Interoperability: Facilitate the exchange of healthcare information with other FHIR-compatible systems by providing a standardized structure.
- Data Mapping: Utilize the schema to map data from other formats into the FHIR format, or vice versa.
- System Design: Aid the design and development of healthcare systems by offering a template for data structure.
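As a sketch of how the schema might be consumed, the following indexes StructureDefinition entries from a bundle shaped like the profiles-resources file. The inline data is a hypothetical two-entry excerpt, not the real file contents, and the fields read (`id`, `status`) are among those described above:

```python
import json

# Hypothetical miniature bundle mimicking the profiles-resources structure.
raw = json.dumps({
    "entry": [
        {"resource": {"resourceType": "StructureDefinition",
                      "id": "Patient", "name": "Patient", "status": "active"}},
        {"resource": {"resourceType": "StructureDefinition",
                      "id": "Observation", "name": "Observation", "status": "active"}},
    ]
})

bundle = json.loads(raw)

# Build an id -> publication-status index over StructureDefinition entries,
# the kind of lookup a schema-validation or data-mapping tool might need.
index = {e["resource"]["id"]: e["resource"]["status"]
         for e in bundle["entry"]
         if e["resource"].get("resourceType") == "StructureDefinition"}
print(index)
```

Against the real file, `raw` would be replaced by reading the JSON from disk; the iteration logic stays the same.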
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates 300 global cities across 6 major geographic regions, designed specifically for unsupervised machine learning and clustering analysis. It explores how economic status, environmental quality, infrastructure, and digital access shape urban lifestyles worldwide.
| Aspect | Description | Notes |
|---|---|---|
| 10 Features | Economic, environmental & social indicators | Realistically scaled |
| 300 Cities | Europe, Asia, Americas, Africa, Oceania | Diverse distributions |
| Strong Correlations | Income ↔ Rent (+0.8), Density ↔ Pollution (+0.6) | ML-ready |
| No Missing Values | Clean, preprocessed data | Ready for analysis |
| 4-5 Natural Clusters | Metropolitan hubs, eco-towns, developing centers | Pre-validated |
✅ Realistic Correlations: Income strongly predicts rent (+0.8), internet access (+0.7), and happiness (+0.6)
✅ Regional Diversity: Each region has distinct economic and environmental characteristics
✅ Clustering-Ready: Naturally separable into 4-5 lifestyle archetypes
✅ Beginner-Friendly: No data cleaning required, includes example code
✅ Documented: Comprehensive README with methodology and use cases
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Load and prepare
df = pd.read_csv('city_lifestyle_dataset.csv')
X = df.drop(['city_name', 'country'], axis=1)
X_scaled = StandardScaler().fit_transform(X)
# Cluster
kmeans = KMeans(n_clusters=5, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
# Analyze
print(df.groupby('cluster').mean())
After working with this dataset, you will be able to:
1. Apply K-Means, DBSCAN, and Hierarchical Clustering
2. Use PCA for dimensionality reduction and visualization
3. Interpret correlation matrices and feature relationships
4. Create geographic visualizations with cluster assignments
5. Profile and name discovered clusters based on characteristics
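The dataset is described as pre-validated via silhouette scores. A self-contained sketch of that kind of check, run on toy blobs rather than the actual CSV (the k range tried here is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the city data: 300 points drawn from 5 separable groups.
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X)

# Score each candidate k by the mean silhouette of its clustering.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Swapping `X` for the scaled feature matrix from the starter code above applies the same check to the real dataset.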
| Cluster | Characteristics | Example Cities |
|---|---|---|
| Metropolitan Tech Hubs | High income, density, rent | Silicon Valley, Singapore |
| Eco-Friendly Towns | Low density, clean air, high happiness | Nordic cities |
| Developing Centers | Mid income, high density, poor air | Emerging markets |
| Low-Income Suburban | Low infrastructure, income | Rural areas |
| Industrial Mega-Cities | Very high density, pollution | Manufacturing hubs |
Unlike random synthetic data, this dataset was carefully engineered with:
- ✨ Realistic correlation structures based on urban research
- 🌍 Regional characteristics matching real-world patterns
- 🎯 Optimal cluster separability (validated via silhouette scores)
- 📚 Comprehensive documentation and starter code
✓ Learn clustering without data cleaning hassles
✓ Practice PCA and dimensionality reduction
✓ Create beautiful geographic visualizations
✓ Understand feature correlation in real-world contexts
✓ Build a portfolio project with clear business insights
This dataset was designed for educational purposes in machine learning and data science. While synthetic, it reflects real patterns observed in global urban development research.
Happy Clustering! 🎉
https://www.reddit.com/wiki/api
Data Details
Each row in your shift data is a shift; the following are helpful descriptions of columns within that dataset:
● "Agent ID": HCP ID
● "Facility ID": HCF ID
● "Start": The shift start time
● "Agent Req": The type of HCP being requested for this shift
● "End": The shift end time
● "Shift Type": Specifies whether the shift is in the morning (AM), afternoon (PM), overnight (NOC), or custom (CUSTOM)
● "Deleted": Whether the shift was deleted
○ Note: "deleted" means "canceled by facility"
● "Created At": When the shift was created
● "Charge": Per-hour charge rate
● "Time": How many hours the shift lasts
● "Verified": Indicates that the shift was worked, as confirmed by a signed timesheet
Each row in your cancellation logs is a unique cancellation event; the following are helpful descriptions of columns within that dataset:
● "Action": The type of cancellation action
○ "WORKER_CANCEL": The HCP canceled a shift they booked
○ "NO_CALL_NO_SHOW": The HCP canceled a shift they booked after the shift commenced, or otherwise did not show up to the shift and did not inform the facility about their absence
● "Created At": When the action took place
● "Facility ID": HCF ID
● "Worker ID": The ID of the HCP that was previously associated with the shift
● "Shift ID": The shift ID
● "Lead Time": The time from "action" to "shift start" (in hours)
Each row in your shift claim logs is a unique booking event; the following are helpful descriptions for columns within that dataset:
Note that we only included claim actions for a subset of the date range in the "shifts" data. Thus, there are likely shifts that don't have associated claim actions. That's OK; we're only providing this data so you can observe HCP booking behavior.
● "Action": The type of booking action
○ "SHIFT_CLAIM": The HCP instantly booked the shift. As soon as they booked the shift, it was theirs.
Business Problem
You’ll likely want to know more about how the marketplace is currently operating to form your own mental model.
Data
● In this "Data" folder, you can find the below:
○ Shift data for one of the metropolitan statistical areas in which we have a presence
○ A list of cancellation logs for shifts that were canceled by HCPs
○ A list of shift claim logs
● We define the fields in these files below
Assumptions and Business Context
● The most damaging type of cancellation for the HCF is one in which the HCP does what we call a "No-Call-No-Show"; this means they canceled the shift after the shift started, or otherwise did not show up to the shift and did not inform the facility of their absence
● The top reasons why HCPs cancel shifts last minute are: sick, family emergency, transportation issue (e.g., car broke down), facility issue
● From interviews, the most important things to HCPs are: will there be shifts that fit my erratic schedule, that are close enough to home, that pay enough, and that pay on time?
● HCPs currently receive a set of notifications prior to their shift to remind them of their upcoming shift
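As a hedged sketch of how the cancellation-log fields defined above might be analyzed, the following computes the No-Call-No-Show share and flags last-minute cancellations. The data values and the 24-hour "last-minute" threshold are invented for illustration; real column names should match the files in the Data folder:

```python
import pandas as pd

# Hypothetical cancellation-log rows using the fields described above.
# Negative lead time means the cancellation happened after the shift started.
logs = pd.DataFrame({
    "Action": ["WORKER_CANCEL", "NO_CALL_NO_SHOW", "WORKER_CANCEL",
               "NO_CALL_NO_SHOW", "WORKER_CANCEL"],
    "Lead Time": [72.0, -1.5, 24.0, 0.0, 4.0],  # hours before shift start
})

# Share of cancellations that were No-Call-No-Shows (the most damaging type).
ncns_rate = (logs["Action"] == "NO_CALL_NO_SHOW").mean()

# Flag "last-minute" cancellations, assumed here to be < 24 hours of lead time.
logs["last_minute"] = logs["Lead Time"] < 24
print(ncns_rate, int(logs["last_minute"].sum()))
```

Grouping the same flags by "Facility ID" or "Worker ID" would surface which facilities or HCPs are most affected.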