Overview

The Office of the Geographer and Global Issues at the U.S. Department of State produces the Large Scale International Boundaries (LSIB) dataset. The current edition is version 11.4 (published 24 February 2025). The 11.4 release contains updated boundary lines and data refinements designed to extend the functionality of the dataset. These data and generalized derivatives are the only international boundary lines approved for U.S. Government use. The contents of this dataset reflect U.S. Government policy on international boundary alignment, political recognition, and dispute status. They do not necessarily reflect de facto limits of control.

National Geospatial Data Asset

This dataset is a National Geospatial Data Asset (NGDAID 194) managed by the Department of State. It is part of the International Boundaries Theme created by the Federal Geographic Data Committee.

Dataset Source Details

Sources for these data include treaties, relevant maps, and data from boundary commissions and national mapping agencies. Where available and applicable, the dataset incorporates information from courts, tribunals, and international arbitrations. The research and recovery process includes analysis of satellite imagery and elevation data. Due to the limitations of source materials and processing techniques, most lines are within 100 meters of their true position on the ground.

Cartographic Visualization

The LSIB is a geospatial dataset that, when used for cartographic purposes, requires additional styling. The LSIB download package contains example style files for commonly used software applications. The attribute table also contains embedded information to guide the cartographic representation. Additional discussion of these considerations can be found in the Use of Core Attributes in Cartographic Visualization section below.
Additional cartographic information pertaining to the depiction and description of international boundaries or areas of special sovereignty can be found in Guidance Bulletins published by the Office of the Geographer and Global Issues: https://data.geodata.state.gov/guidance/index.html

Contact

Direct inquiries to internationalboundaries@state.gov.

Direct download: https://data.geodata.state.gov/LSIB.zip

Attribute Structure

The dataset uses the following attributes, divided into two categories:

ATTRIBUTE NAME | ATTRIBUTE STATUS
CC1 | Core
CC1_GENC3 | Extension
CC1_WPID | Extension
COUNTRY1 | Core
CC2 | Core
CC2_GENC3 | Extension
CC2_WPID | Extension
COUNTRY2 | Core
RANK | Core
LABEL | Core
STATUS | Core
NOTES | Core
LSIB_ID | Extension
ANTECIDS | Extension
PREVIDS | Extension
PARENTID | Extension
PARENTSEG | Extension

These attributes have external data sources that update separately from the LSIB:

ATTRIBUTE NAME | EXTERNAL SOURCE
CC1 | GENC
CC1_GENC3 | GENC
CC1_WPID | World Polygons
COUNTRY1 | DoS Lists
CC2 | GENC
CC2_GENC3 | GENC
CC2_WPID | World Polygons
COUNTRY2 | DoS Lists
LSIB_ID | BASE
ANTECIDS | BASE
PREVIDS | BASE
PARENTID | BASE
PARENTSEG | BASE

The core attributes listed above describe the boundary lines contained within the LSIB dataset. Removal of core attributes from the dataset will change the meaning of the lines. An attribute status of "Extension" denotes a field containing data interoperability information. Other attributes not listed above, such as "FID", "Shape_length", and "Shape", are components of the shapefile format and do not form an intrinsic part of the LSIB.

Core Attributes

The eight core attributes listed above contain unique information which, when combined with the line geometry, comprises the LSIB dataset. These core attributes are further divided into Country Code and Name Fields and Descriptive Fields.

Country Code and Country Name Fields

The "CC1" and "CC2" fields are machine-readable fields that contain political entity codes.
These are two-character codes derived from the Geopolitical Entities, Names, and Codes Standard (GENC), Edition 3 Update 18. The "CC1_GENC3" and "CC2_GENC3" fields contain the corresponding three-character GENC codes and are extension attributes discussed below. The codes "Q2" and "QX2" denote a line in the LSIB representing a boundary associated with areas not contained within the GENC standard.

The "COUNTRY1" and "COUNTRY2" fields contain the names of the corresponding political entities. These fields contain names approved by the U.S. Board on Geographic Names (BGN) as incorporated in the "Independent States in the World" and "Dependencies and Areas of Special Sovereignty" lists maintained by the Department of State. To ensure maximum compatibility, names are presented without diacritics, and certain names are rendered using common cartographic abbreviations. Names for lines associated with the code "Q2" are descriptive and not necessarily BGN-approved. Names rendered in all CAPITAL LETTERS denote independent states. Names rendered in normal text represent dependencies, areas of special sovereignty, or are otherwise presented for the convenience of the user.

Descriptive Fields

The following text fields are part of the core attributes of the LSIB dataset and do not update from external sources. They provide additional information about each of the lines and are as follows:

ATTRIBUTE NAME | CONTAINS NULLS
RANK | No
STATUS | No
LABEL | Yes
NOTES | Yes

Neither the "RANK" nor "STATUS" fields contain null values; the "LABEL" and "NOTES" fields do. The "RANK" field is a numeric expression of the "STATUS" field. Combined with the line geometry, these fields encode the views of the United States Government on the political status of the boundary line.
RANK | STATUS
1 | International Boundary
2 | Other Line of International Separation
3 | Special Line

A value of "1" in the "RANK" field corresponds to an "International Boundary" value in the "STATUS" field. Values of "2" and "3" correspond to "Other Line of International Separation" and "Special Line," respectively.

The "LABEL" field contains the text required to describe the line segment on all finished cartographic products, including but not limited to print and interactive maps.

The "NOTES" field contains an explanation of special circumstances modifying the lines. This information can pertain to the origins of the boundary lines, limitations regarding the purpose of the lines, or the original source of the line.

Use of Core Attributes in Cartographic Visualization

Several of the core attributes provide information required for the proper cartographic representation of the LSIB dataset. Cartographic usage of the LSIB requires a visual differentiation between the three categories of boundary lines: International Boundaries (Rank 1), Other Lines of International Separation (Rank 2), and Special Lines (Rank 3). Rank 1 lines must be the most visually prominent. Rank 2 lines must be less visually prominent than Rank 1 lines. Rank 3 lines must be shown in a manner visually subordinate to Ranks 1 and 2. Where scale permits, Rank 2 and 3 lines must be labeled in accordance with the "LABEL" field. Data marked with a Rank 2 or 3 designation do not necessarily correspond to a disputed boundary. Please consult the style files in the download package for examples of this depiction.

The requirement to incorporate the contents of the "LABEL" field on cartographic products is scale dependent. If a label is legible at the scale of a given static product, proper use of this dataset would encourage the application of that label.
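The rank-based visual hierarchy described above can be sketched as a simple style lookup. The STATUS strings are taken from the dataset documentation; the widths and dash patterns are illustrative assumptions, not prescribed values:

```python
# Illustrative symbology lookup for the three LSIB line categories.
# STATUS values come from the dataset; widths/dashes are assumptions
# chosen only to satisfy the Rank 1 > Rank 2 > Rank 3 prominence rule.
STYLE_BY_RANK = {
    1: {"status": "International Boundary", "width_pt": 1.2, "dash": None},
    2: {"status": "Other Line of International Separation", "width_pt": 0.8, "dash": (4, 2)},
    3: {"status": "Special Line", "width_pt": 0.5, "dash": (2, 2)},
}

def style_for(rank):
    """Return the (assumed) symbology for an LSIB RANK value."""
    return STYLE_BY_RANK[rank]
```

Any real style files in the download package supersede this sketch; the point is only that line prominence should decrease monotonically from Rank 1 to Rank 3.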
Using the contents of the "COUNTRY1" and "COUNTRY2" fields in the generation of a line segment label is not required. The "STATUS" field contains the preferred description for the three LSIB line types when they are incorporated into a map legend but is otherwise not to be used for labeling. Use of the "CC1," "CC1_GENC3," "CC2," "CC2_GENC3," "RANK," or "NOTES" fields for cartographic labeling purposes is prohibited.

Extension Attributes

Certain elements of the attributes within the LSIB dataset extend data functionality to make the data more interoperable or to provide clearer linkages to other datasets. The "CC1_GENC3" and "CC2_GENC3" fields contain the three-character GENC codes corresponding to the "CC1" and "CC2" attributes. The code "QX2" is the three-character counterpart of the code "Q2," which denotes a line in the LSIB representing a boundary associated with a geographic area not contained within the GENC standard.

To allow for linkage between individual lines in the LSIB and the World Polygons dataset, the "CC1_WPID" and "CC2_WPID" fields contain a version 4 Universally Unique Identifier (UUID) that provides a stable identifier for each geographic entity in a boundary pair relationship. Each UUID corresponds to a geographic entity listed in the World Polygons dataset.

Five additional fields in the LSIB expand on the UUID concept and either describe features that have changed across space and time or indicate relationships between previous versions of a feature. The "LSIB_ID" attribute is a UUID value that defines a specific instance of a feature. Any change to a feature in a lineset requires a new "LSIB_ID." The "ANTECIDS," or antecedent ID, is a UUID that references line geometries from which a given line is descended in time. It is used when a feature is entirely new, not when there is a new version of a previous feature.
This is generally used to reference countries that have dissolved. The "PREVIDS," or previous ID, is a UUID field that contains old versions of a line. This is an additive field that houses all previous IDs. A new version of a feature is defined by any change to the
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods for gathering CVE fix commits. As a consequence of our improvements, we have been able to gather the largest programming-language-independent real-world dataset of CVE vulnerabilities with their associated fix commits. Our dataset, containing 29,203 unique CVEs coming from 7,238 unique GitHub projects, is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 35,276 unique commits stored as SQL and 39,931 patch commit files that fixed those vulnerabilities (some patch files cannot be saved as SQL due to several technical reasons). Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used the NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources of our pipeline.
We release to the community a 16 GB PostgreSQL database that contains information on CVEs up to 2024-09-26, the CWEs of each CVE, the files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we also make our dataset collection tool available to the community.
The cvedataset-patches.zip file contains the fix patches, and postgrescvedumper.sql.zip contains a PostgreSQL dump of the fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, methods changed, etc.
The MoreFixes data-storage strategy is based on CVEFixes for storing CVE fix commits from open-source repositories, and it uses a modified version of Prospector (part of Project KB from SAP) as a module to detect the fix commits of a CVE. Our full methodology is presented in the paper "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", published at the PROMISE conference (2024).
For more information about usage and sample queries, visit the Github repository: https://github.com/JafarAkhondali/Morefixes
If you are using this dataset, please be aware that the repositories we mined are under various licenses, and you are responsible for handling any licensing issues. The same applies to CVEFixes.
This product uses the NVD API but is not endorsed or certified by the NVD.
This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).
To restore the dataset, you can use the docker-compose file available at the GitHub repository. Dataset default credentials after restoring the dump:
POSTGRES_USER=postgrescvedumper POSTGRES_DB=postgrescvedumper POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
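As a sketch, the documented defaults can be assembled into a libpq-style connection string; the host and port here are assumptions for a local restore, and any PostgreSQL client (psql, psycopg2, etc.) can consume the result:

```python
# Build a libpq-style connection string for the restored MoreFixes database.
# NOTE: host and port are assumptions (local restore on the default port);
# the user, database name, and password are the dataset's documented defaults.
def morefixes_dsn(host="localhost", port=5432):
    return (
        f"host={host} port={port} "
        "dbname=postgrescvedumper "
        "user=postgrescvedumper "
        "password=a42a18537d74c3b7e584c769152c3d"
    )

# Example usage (client library is your choice):
#   import psycopg2; conn = psycopg2.connect(morefixes_dsn())
```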
Please use this for citation:
@inproceedings{morefixes2024,
  title={MoreFixes: A large-scale dataset of CVE fix commits mined through enhanced repository discovery},
  author={Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga},
  booktitle={Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering},
  pages={42--51},
  year={2024}
}
Abstract copyright UK Data Service and data collection copyright owner.

The aim of this survey was to map the attitudes, opinions and values of the future elite in as many nations as possible on a number of problems raised in the youth revolt of the late sixties, in order to discern national patterns in these attitudes, opinions and values.

Main Topics:

Attitudinal/Behavioural Questions

Assessment of own personal academic performance/political position in relation to own classmates, membership of any boards of organisations, experience of speaking at meetings, participation in political demonstrations, whether ever written to a newspaper, experience of national service (type). Interest in politics (foreign, national and international), preferred political party, frequency of discussion and media contact with international affairs. Respondent's knowledge of foreign affairs, assessment of own personal knowledge of foreign affairs in relation to own classmates. Respondents were asked to rate their present, past and future situations, and their own country's situation, in terms of lifestyle satisfaction. Most important priorities for respondent's own nation. Respondents were asked to agree/disagree with a number of statements about defence, drugs, social and technological progress, religion and human values. Famous personality admired most. Opinion on: major issues facing the world as a whole, major internal and external issues facing own country. Issues most important to respondent.

Background Variables

Age, sex, race, country of citizenship, college or university, major area of study, expected occupation. Father's occupation (International Standard Classification of Occupations and Hollingshead Index of Social Position). Income of respondent's family compared to range of income in country. Religious affiliation.

Quota sample within each nation: quota samples of three or more groups of male students who held different opinions (i.e. representative of the student population).
This is an integration of 10 independent multi-country, multi-region, multi-cultural social surveys fielded by Gallup International between 2000 and 2013. The integrated data file contains responses from 535,159 adults living in 103 countries. In total, the harmonization project combined 571 social surveys.
These data have value in a number of longitudinal multi-country, multi-regional, and multi-cultural (L3M) research designs. They can be understood as independent, though non-random, L3M samples containing a number of multiple-indicator ASQ (ask-same-questions) and ADQ (ask-different-questions) measures of human development, the environment, international relations, gender equality, security, international organizations, and democracy, among others [see full list below].
The data can be used for exploratory and descriptive analysis, with the greatest utility at low levels of resolution (e.g. nation-states, supranational groupings). The level of resolution in analyses of these data should be sufficiently low to approximate confidence intervals.
These data can be used for teaching 3M methods, including data harmonization in L3M, 3M research design, survey design, 3M measurement invariance, analysis, visualization, and reporting. There are also opportunities to teach about paradata, metadata, and data management in L3M designs.
The country units are an unbalanced panel derived from non-probability samples of countries and respondents. Panels (countries) have left and right censorship and are thus unbalanced. This design limitation can be overcome to the extent that VOTP panels are harmonized with public measurements from other 3M surveys to establish balance in terms of panels and occasions of measurement. Should L3M harmonization occur, these data can be assigned confidence weights to reflect the amount of error in these surveys.
Pooled public opinion surveys (country means), when combined with higher-quality country measurements of the same concepts (ASQ, ADQ), can be leveraged to increase the statistical power of pooled public opinion research designs (multiple L3M datasets), that is, in studies of public, rather than personal, beliefs.
The Gallup Voice of the People survey data are based on uncertain, underspecified sampling methods. Country sampling is non-random. The sampling method appears to be primarily probability and quota sampling, with occasional oversamples of urban populations in difficult-to-survey populations. The sampling units (countries and individuals) are poorly defined, suggesting these data have more value in research designs calling for independent-samples replication and repeated-measures frameworks.
The Voice of the People Survey Series is WIN/Gallup International Association's End of Year survey and is a global study that collects the public's view on the challenges that the world faces today. Ongoing since 1977, the purpose of WIN/Gallup International's End of Year survey is to provide a platform for respondents to speak out concerning government and corporate policies. The Voice of the People, End of Year Surveys for 2012, fielded June 2012 to February 2013, were conducted in 56 countries to solicit public opinion on social and political issues. Respondents were asked whether their country was governed by the will of the people, as well as their attitudes about their society. Additional questions addressed respondents' living conditions and feelings of safety around their living area, as well as personal happiness. Respondents' opinions were also gathered in relation to business development and their views on the effectiveness of the World Health Organization. Respondents were also surveyed on ownership and use of mobile devices. Demographic information includes sex, age, income, education level, employment status, and type of living area.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The Comparative Political Economy Database (CPEDB) began at the Centre for Learning, Social Economy and Work (CLSEW) at the Ontario Institute for Studies in Education at the University of Toronto (OISE/UT) as part of the Changing Workplaces in a Knowledge Economy (CWKE) project. The database was initially conceived and developed by Dr. Wally Seccombe (independent scholar) and Dr. D.W. Livingstone (Professor Emeritus at the University of Toronto). Seccombe has conducted internationally recognized historical research on evolving family structures of the labouring classes (A Millennium of Family Change: Feudalism to Capitalism in Northwestern Europe and Weathering the Storm: Working Class Families from the Industrial Revolution to the Fertility Decline). Livingstone has conducted decades of empirical research on class and labour relations. A major part of this research has used the Canadian Class Structure survey done at the Institute of Political Economy (IPE) at Carleton University in 1982 as a template for Canadian national surveys in 1998, 2004, 2010 and 2016, culminating in Tipping Point for Advanced Capitalism: Class, Class Consciousness and Activism in the ‘Knowledge Economy’ (https://fernwoodpublishing.ca/book/tipping-point-for-advanced-capitalism) and a publicly accessible database including all five of these Canadian surveys (https://borealisdata.ca/dataverse/CanadaWorkLearningSurveys1998-2016). Seccombe and Livingstone have collaborated on a number of research studies that recognize the need to take account of expanded modes of production and reproduction. Both Seccombe and Livingstone are Research Associates of CLSEW at OISE/UT.

The CPEDB Main File (an SPSS data file) covers the following areas (in order): demography, family/household, class/labour, government, electoral democracy, inequality (economic, political and gender), health, environment, internet, and macro-economic and financial variables.
In its present form, it contains annual data on 725 variables from 12 countries (listed alphabetically): Canada, Denmark, France, Germany, Greece, Italy, Japan, Norway, Spain, Sweden, the United Kingdom and the United States. A few of the variables date back to 1928, and the majority date from 1960 to 1990. Where these years are not covered in the source, a minority of variables begin with more recent years. All the variables end at the most recent available year (1999 to 2022). In the next version, developed in 2025, the most recent years (2023 and 2024) will be added whenever they are present in the sources' datasets. Researchers who are not using SPSS should refer to the Chart files for overviews, summaries and information on the dataset. For a current list of the variable names and their labels in the CPEDB database, see the Excel file: Outline of SPSS file Main CPEDB, Nov 6, 2023. At the end of each variable label in this file and the SPSS datafile, you will find the source of that variable in a bracket. Where two variables from a given source have been combined, the bracket begins with WS and then lists the variables combined. For the 14 variables created by Livingstone at the beginning of the Class Labour section, the bracket contains DWL along with his description of how each was derived. The CPEDB's variables have been derived from many databases; the main ones are the OECD (their Statistics and Family Databases), World Bank, ILO, IMF, WHO, WIID (World Income Inequality Database), OWID (Our World in Data), ParlGov (Parliaments and Governments Database), and V-Dem (Varieties of Democracy). The Institute of Political Economy at Carleton University is currently the main site for continuing refinement of the CPEDB. IPE Director Justin Paulson and other members are involved, along with Seccombe and Livingstone, in the further development and safe storage of this updated database, both at the IPE at Carleton and the UT dataverse.
All those who explore the CPEDB are invited to share their perceptions of the entire database, or any of its sections, with Seccombe generally (wseccombe@sympatico.ca) and with Livingstone for class/labour issues (davidlivingstone@utoronto.ca). They welcome any suggestions for additional variables, together with their data sources. A new version of the CPEDB will be created in the spring of 2025 and installed as soon as the revision is completed. This revised version is intended to be a valuable resource for researchers in Canada and all of the other included countries.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of text excerpts, which were validated by over 1,400 OSDG Community Platform (OSDG-CP) citizen scientists from over 140 countries, with respect to the Sustainable Development Goals (SDGs).
Dataset Information
In support of the global effort to achieve the Sustainable Development Goals (SDGs), OSDG is realising a series of SDG-labelled text datasets. The OSDG Community Dataset (OSDG-CD) is the direct result of the work of more than 1,400 volunteers from over 130 countries who have contributed to our understanding of SDGs via the OSDG Community Platform (OSDG-CP). The dataset contains tens of thousands of text excerpts (henceforth: texts) which were validated by the Community volunteers with respect to SDGs. The data can be used to derive insights into the nature of SDGs using either ontology-based or machine learning approaches.
📘 The file contains 43,021 (+390) text excerpts and a total of 310,328 (+3,733) assigned labels.
To learn more about the project, please visit the OSDG website and the official GitHub page. Explore a detailed overview of the OSDG methodology in our recent paper "OSDG 2.0: a multilingual tool for classifying text data by UN Sustainable Development Goals (SDGs)".
Source Data
The dataset consists of paragraph-length text excerpts derived from publicly available documents, including reports, policy documents and publication abstracts. A significant number of documents (more than 3,000) originate from UN-related sources such as SDG-Pathfinder and SDG Library. These sources often contain documents that already have SDG labels associated with them. Each text consists of 3 to 6 sentences and is about 90 words long on average.
Methodology
All the texts are evaluated by volunteers on the OSDG-CP. The platform is an ambitious attempt to bring together researchers, subject-matter experts and SDG advocates from all around the world to create a large and accurate source of textual information on the SDGs. The Community volunteers use the platform to participate in labelling exercises where they validate each text's relevance to SDGs based on their background knowledge.
In each exercise, the volunteer is shown a text together with an SDG label associated with it – this usually comes from the source – and asked to either accept or reject the suggested label.
There are 3 types of exercises:
1. The introductory exercise. All volunteers start with this mandatory exercise, which consists of 10 pre-selected texts. Each volunteer must complete it before they can access the 2 other exercise types. Upon completion, the volunteer reviews the exercise by comparing their answers with those of the rest of the Community using aggregated statistics we provide, i.e., the share of those who accepted and rejected the suggested SDG label for each of the 10 texts. This helps the volunteer get a feel for the platform.
2. SDG-specific exercises, where the volunteer validates texts with respect to a single SDG, e.g., SDG 1 No Poverty.
3. The All SDGs exercise, where the volunteer validates a random sequence of texts in which each text can have any SDG as its associated label.
After finishing the introductory exercise, the volunteer is free to select either SDG-specific or All SDGs exercises. Each exercise, regardless of its type, consists of 100 texts. Once the exercise is finished, the volunteer can either label more texts or exit the platform. Of course, the volunteer can also finish an exercise early; all progress is still saved and recorded.
To ensure quality, each text is validated by up to 9 different volunteers and all texts included in the public release of the data have been validated by at least 3 different volunteers.
It is worth keeping in mind that all exercises present the volunteers with a binary decision problem, i.e., either accept or reject a suggested label. The volunteers are never asked to select one or more SDGs that a certain text might relate to. The rationale behind this set-up is that asking a volunteer to select from 17 SDGs is extremely inefficient. Currently, all texts are validated against only one associated SDG label.
Column Description
doi - Digital Object Identifier of the original document
text_id - unique text identifier
text - text excerpt from the document
sdg - the SDG the text is validated against
labels_negative - the number of volunteers who rejected the suggested SDG label
labels_positive - the number of volunteers who accepted the suggested SDG label
agreement - agreement score, computed as agreement = |labels_positive - labels_negative| / (labels_positive + labels_negative)
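The agreement score defined above can be sketched as a small function (the function name is ours; the formula is the one given in the column description):

```python
def agreement(labels_positive, labels_negative):
    """Agreement score: |positive - negative| / (positive + negative)."""
    total = labels_positive + labels_negative
    return abs(labels_positive - labels_negative) / total

# A text accepted by 7 volunteers and rejected by 2 scores 5/9 (~0.56);
# an even 3-3 split scores 0.0, i.e. maximal disagreement.
```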
Further Information
Do not hesitate to share with us your outputs, be it a research paper, a machine learning model, a blog post, or just an interesting observation. All queries can be directed to community@osdg.ai.
The Associated Press is sharing data from the COVID Impact Survey, which provides statistics about physical health, mental health, economic security and social dynamics related to the coronavirus pandemic in the United States.
Conducted by NORC at the University of Chicago for the Data Foundation, the probability-based survey provides estimates for the United States as a whole, as well as in 10 states (California, Colorado, Florida, Louisiana, Minnesota, Missouri, Montana, New York, Oregon and Texas) and eight metropolitan areas (Atlanta, Baltimore, Birmingham, Chicago, Cleveland, Columbus, Phoenix and Pittsburgh).
The survey is designed to allow for an ongoing gauge of public perception, health and economic status to see what is shifting during the pandemic. When multiple sets of data are available, it will allow for the tracking of how issues ranging from COVID-19 symptoms to economic status change over time.
The survey is focused on three core areas of research:
Instead, use our queries linked below or statistical software such as R or SPSS to weight the data.
If you'd like to create a table to see how people nationally or in your state or city feel about a topic in the survey, use the survey questionnaire and codebook to match a question (the variable label) to a variable name. For instance, "How often have you felt lonely in the past 7 days?" is variable "soc5c".
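The question-to-variable matching step can be sketched as a codebook lookup; only the soc5c example comes from the text above, and the dictionary structure and function name are illustrative assumptions:

```python
# Tiny codebook lookup: map a question's text (the variable label) to its
# variable name. Only the soc5c entry is from the survey documentation;
# a real codebook would contain every questionnaire item.
CODEBOOK = {
    "How often have you felt lonely in the past 7 days?": "soc5c",
}

def variable_for(question):
    """Return the variable name for a questionnaire item, if known."""
    return CODEBOOK.get(question)
```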
Nationally: Go to this query and enter soc5c as the variable. Hit the blue Run Query button in the upper right hand corner.
Local or State: To find figures for that response in a specific state, go to this query and type in a state name and soc5c as the variable, and then hit the blue Run Query button in the upper right hand corner.
The resulting sentence you could write out of these queries is: "People in some states are less likely to report loneliness than others. For example, 66% of Louisianans report feeling lonely on none of the last seven days, compared with 52% of Californians. Nationally, 60% of people said they hadn't felt lonely."
The margin of error for the national and regional surveys is found in the attached methods statement. You will need the margin of error to determine if the comparisons are statistically significant. If the difference is:
The survey data will be provided under embargo in both comma-delimited and statistical formats.
Each set of survey data will be numbered and have the date the embargo lifts in front of it, in the format: 01_April_30_covid_impact_survey. The survey has been organized by the Data Foundation, a non-profit, non-partisan think tank, and is sponsored by the Federal Reserve Bank of Minneapolis and the Packard Foundation. It is conducted by NORC at the University of Chicago, a non-partisan research organization. (NORC is not an abbreviation; it is part of the organization's formal name.)
Data for the national estimates are collected using the AmeriSpeak Panel, NORC’s probability-based panel designed to be representative of the U.S. household population. Interviews are conducted with adults age 18 and over representing the 50 states and the District of Columbia. Panel members are randomly drawn from AmeriSpeak with a target of achieving 2,000 interviews in each survey. Invited panel members may complete the survey online or by telephone with an NORC telephone interviewer.
Once all the study data have been made final, an iterative raking process is used to adjust for any survey nonresponse as well as any noncoverage or under- and oversampling resulting from the study-specific sample design. Raking variables include age, gender, census division, race/ethnicity, education, and county groupings based on county-level counts of the number of COVID-19 deaths. Demographic weighting variables were obtained from the 2020 Current Population Survey. The count of COVID-19 deaths by county was obtained from USA Facts. The weighted data reflect the U.S. population of adults age 18 and over.
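Iterative raking (iterative proportional fitting) can be sketched as follows. This is a generic illustration of the technique, not NORC's actual implementation, and the variables and target shares in any example are assumptions:

```python
# Minimal sketch of iterative raking: repeatedly rescale respondent
# weights so the weighted marginals of each raking variable match
# known population targets. Illustrative only; not NORC's implementation.
def rake(rows, weights, targets, iters=50):
    """rows: one dict per respondent mapping variable -> category.
    weights: initial design weights (one per respondent).
    targets: {variable: {category: population share}}, shares summing to 1.
    Returns weights adjusted so weighted marginals match the targets."""
    w = list(weights)
    for _ in range(iters):
        for var, shares in targets.items():
            # current weighted total per category of this variable
            cur = {}
            for r, wi in zip(rows, w):
                cur[r[var]] = cur.get(r[var], 0.0) + wi
            s = sum(w)
            # rescale each weight so this variable's marginals hit the targets
            w = [wi * shares[r[var]] * s / cur[r[var]]
                 for r, wi in zip(rows, w)]
    return w
```

Cycling over the variables repeatedly is what makes raking converge when adjusting one margin disturbs another; production weighting systems add trimming and convergence checks on top of this core loop.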
Data for the regional estimates are collected using a multi-mode address-based sampling (ABS) approach that allows residents of each area to complete the interview via web or with an NORC telephone interviewer. All sampled households are mailed a postcard inviting them to complete the survey either online using a unique PIN or via telephone by calling a toll-free number. Interviews are conducted with adults age 18 and over with a target of achieving 400 interviews in each region in each survey. Additional details on the survey methodology and the survey questionnaire are attached below or can be found at https://www.covid-impact.org.
Results should be credited to the COVID Impact Survey, conducted by NORC at the University of Chicago for the Data Foundation.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This repository contains the MetaGraspNet Dataset described in the paper "MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis" (https://arxiv.org/abs/2112.14663 ).
There has been increasing interest in smart factories powered by robotics systems to tackle repetitive, laborious tasks. One particularly impactful yet challenging task in robotics-powered smart factory applications is robotic grasping: using robotic arms to grasp objects autonomously in different settings. Robotic grasping requires a variety of computer vision tasks such as object detection, segmentation, grasp prediction, pick planning, etc. While significant progress has been made in leveraging machine learning for robotic grasping, particularly with deep learning, a big challenge remains in the need for large-scale, high-quality RGBD datasets that cover a wide diversity of scenarios and permutations.
To tackle this big, diverse data problem, we are inspired by the recent rise of the metaverse concept, which has greatly closed the gap between virtual worlds and the physical world. In particular, metaverses allow us to create digital twins of real-world manufacturing scenarios and to virtually create different scenarios from which large volumes of data can be generated for training models. We present MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis. The proposed dataset contains 100,000 images and 25 different object types, and is split into 5 difficulties to evaluate object detection and segmentation model performance in different grasping scenarios. We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance in a manner that is more appropriate for robotic grasp applications compared to existing general-purpose performance metrics. This repository contains the first phase of the MetaGraspNet benchmark dataset, which includes detailed object detection, segmentation, and layout annotations, and a script for the layout-weighted performance metric (https://github.com/y2863/MetaGraspNet ).
If you use MetaGraspNet dataset or metric in your research, please use the following BibTeX entry.
BibTeX
@article{chen2021metagraspnet,
  author  = {Yuhao Chen and E. Zhixuan Zeng and Maximilian Gilles and Alexander Wong},
  title   = {MetaGraspNet: a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis},
  journal = {arXiv preprint arXiv:2112.14663},
  year    = {2021}
}
This dataset is arranged in the following file structure:
root
|-- meta-grasp
|-- scene0
|-- 0_camera_params.json
|-- 0_depth.png
|-- 0_rgb.png
|-- 0_order.csv
...
|-- scene1
...
|-- difficulty-n-coco-label.json
Each scene is a unique arrangement of objects, which we then display from various angles. For each shot of a scene, we provide the camera parameters (x_camera_params.json), a depth image (x_depth.png), an RGB image (x_rgb.png), as well as a matrix representation of the ordering of the objects (x_order.csv). The full labels for the images are available in difficulty-n-coco-label.json (where n is the difficulty level of the dataset) in the COCO data format.
The matrix describes a pairwise obstruction relationship between the objects within the image. Given a "parent" object covering a "child" object:
relationship_matrix[child_id, parent_id] = -1
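A minimal sketch of how such an order matrix might be loaded and queried. The -1 convention follows the description above; the CSV-reading helper and function names are illustrative assumptions, not part of the official tooling.

```python
import csv
import numpy as np

def load_order_matrix(path):
    """Load an x_order.csv pairwise obstruction matrix as a NumPy array."""
    with open(path, newline="") as f:
        return np.array([[int(v) for v in row] for row in csv.reader(f)])

def is_obstructed(matrix, child_id, parent_id):
    """True if `parent_id` covers (obstructs) `child_id`, per the -1 convention."""
    return matrix[child_id, parent_id] == -1

# Illustrative 3-object scene: object 2 covers object 0.
m = np.zeros((3, 3), dtype=int)
m[0, 2] = -1
assert is_obstructed(m, child_id=0, parent_id=2)
assert not is_obstructed(m, child_id=2, parent_id=0)
```

A grasp planner could walk this matrix to pick only objects with no remaining parents, i.e. rows containing no -1 entries.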
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome–often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate an idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. 
First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
https://earth.esa.int/eogateway/documents/20142/1564626/Terms-and-Conditions-for-the-use-of-ESA-Data.pdf
The Fundamental Data Record (FDR) for Atmospheric Composition UVN v.1.0 dataset is a cross-instrument Level-1 product [ATMOS_L1B] generated in 2023 and resulting from the ESA FDR4ATMOS project. The FDR contains selected Earth Observation Level 1b parameters (irradiance/reflectance) from the nadir-looking measurements of the ERS-2 GOME and Envisat SCIAMACHY missions for the period ranging from 1995 to 2012. The data record offers harmonised cross-calibrated spectra with focus on spectral windows in the Ultraviolet-Visible-Near Infrared regions for the retrieval of critical atmospheric constituents like ozone (O3), sulphur dioxide (SO2), nitrogen dioxide (NO2) column densities, alongside cloud parameters. The FDR4ATMOS products should be regarded as experimental due to the innovative approach and the current use of a limited-sized test dataset to investigate the impact of harmonization on the Level 2 target species, specifically SO2, O3 and NO2. Presently, this analysis is being carried out within follow-on activities. The FDR4ATMOS V1 is currently being extended to include the MetOp GOME-2 series.
Product format
For many aspects, the FDR product has improved compared to the existing individual mission datasets:
- GOME solar irradiances are harmonised using a validated SCIAMACHY solar reference spectrum, solving the problem of the fast-changing etalon present in the original GOME Level 1b data;
- Reflectances for both GOME and SCIAMACHY are provided in the FDR product. GOME reflectances are harmonised to degradation-corrected SCIAMACHY values, using collocated data from the CEOS PIC sites;
- SCIAMACHY data are scaled to the lowest integration time within the spectral band using high-frequency PMD measurements from the same wavelength range.
This simplifies the use of the SCIAMACHY spectra, which were split into a complex cluster structure (each with its own integration time) in the original Level 1b data; The harmonisation process applied mitigates the viewing-angle dependency observed in the UV spectral region for GOME data; Uncertainties are provided. Each FDR product provides, within the same file, irradiance/reflectance data for the UV-VIS-NIR spectral regions across all orbits on a single day, including information from the individual ERS-2 GOME and Envisat SCIAMACHY measurements. The FDR has been generated in two formats, Level 1A and Level 1B, targeting expert users and nominal applications respectively. The Level 1A [ATMOS_L1A] data include additional parameters such as harmonisation factors, PMD, and polarisation data extracted from the original mission Level 1 products. The ATMOS_L1A dataset is not part of the nominal dissemination to users. In case of specific requirements, please contact EOHelp. Please refer to the README file for essential guidance before using the data. All the new products are conveniently formatted in NetCDF. Free standard tools, such as Panoply, can be used to read NetCDF data. Panoply is sourced and updated by external entities. For further details, please consult our Terms and Conditions page.
Uncertainty characterisation
One of the main aspects of the project was the characterisation of Level 1 uncertainties for both instruments, based on metrological best practices.
The following documents are provided:
- General guidance on a metrological approach to Fundamental Data Records (FDR)
- Uncertainty Characterisation document
- Effect tables
- NetCDF files containing example uncertainty propagation analysis and spectral error correlation matrices for SCIAMACHY (Atlantic and Mauretania scene for 2003 and 2010) and GOME (Atlantic scene for 2003): reflectance_uncertainty_example_FDR4ATMOS_GOME.nc, reflectance_uncertainty_example_FDR4ATMOS_SCIA.nc
Known Issues
Non-monotonous wavelength axis for SCIAMACHY in FDR data version 1.0: In the SCIAMACHY OBSERVATION group of the atmospheric FDR v1.0 dataset (DOI: 10.5270/ESA-852456e), the wavelength axis (lambda variable) is not monotonically increasing. This issue affects all spectral channels (UV, VIS, NIR) in the SCIAMACHY group, while GOME OBSERVATION data remain unaffected. The root cause of the issue lies in the incorrect indexing of the lambda variable during the NetCDF writing process. Notably, the wavelength values themselves are calculated correctly within the processing chain.
Temporary Workaround
The wavelength axis is correct in the first record of each product. As a workaround, users can extract the wavelength axis from the first record and apply it to all subsequent measurements within the same product. The first record can be retrieved by setting the first two indices (time and scanline) to 0 (assuming counting of array indices starts at 0). Note that this process must be repeated separately for each spectral range (UV, VIS, NIR) and every daily product. Since the wavelength axis of SCIAMACHY is highly stable over time, using the first record introduces no expected impact on retrieval results. Python pseudo-code example: lambda_...
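The pseudo-code example above is truncated in the source. The sketch below shows one plausible NumPy implementation of the described workaround, assuming the lambda variable has already been read (e.g. with the netCDF4 package) into an array of shape (time, scanline, spectral_pixel); the function name and array shapes are my own assumptions.

```python
import numpy as np

def fix_lambda(lam):
    """Workaround for the non-monotonic SCIAMACHY wavelength axis:
    take the (correct) first record and broadcast it to all measurements.
    `lam` is assumed to have shape (time, scanline, spectral_pixel)."""
    correct = lam[0, 0, :]                 # first record is correct
    return np.broadcast_to(correct, lam.shape).copy()

# Illustrative data: first record monotonic, one later record scrambled.
good = np.linspace(240.0, 314.0, 8)        # mock UV wavelength grid, nm
lam = np.tile(good, (2, 3, 1))             # (time=2, scanline=3, pixel=8)
lam[1, 2] = good[::-1]                     # simulate a mis-indexed record
fixed = fix_lambda(lam)
```

As the issue description notes, this must be repeated per spectral range (UV, VIS, NIR) and per daily product.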
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of ‘World Bank WDI 2.12 - Health Systems’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/danevans/world-bank-wdi-212-health-systems on 21 November 2021.
--- Dataset description provided by original source is as follows ---
This is a digest of the information described at http://wdi.worldbank.org/table/2.12#. It describes various measures of health spending per capita by country, as well as doctors, nurses and midwives, and specialist surgical staff per capita.
Notes, explanations, etc.
1. There are countries/regions in the World Bank data not in the Covid-19 data, and countries/regions in the Covid-19 data with no World Bank data. This is unavoidable.
2. There were political decisions made in both datasets that may cause problems. I chose to go forward with the data as presented, and did not attempt to modify the decisions made by the dataset creators (e.g., the names of countries, what is and is not a country, etc.).
Columns are as follows:
1. Country_Region: the region as used in Kaggle Covid-19 spread data challenges.
2. Province_State: the region as used in Kaggle Covid-19 spread data challenges.
3. World_Bank_Name: the name of the country used by the World Bank.
4. Health_exp_pct_GDP_2016: Level of current health expenditure expressed as a percentage of GDP. Estimates of current health expenditures include healthcare goods and services consumed during each year. This indicator does not include capital health expenditures such as buildings, machinery, IT and stocks of vaccines for emergency or outbreaks.
Health_exp_public_pct_2016: Share of current health expenditures funded from domestic public sources for health. Domestic public sources include domestic revenue as internal transfers and grants, transfers, subsidies to voluntary health insurance beneficiaries, non-profit institutions serving households (NPISH) or enterprise financing schemes as well as compulsory prepayment and social health insurance contributions. They do not include external resources spent by governments on health.
Health_exp_out_of_pocket_pct_2016: Share of out-of-pocket payments of total current health expenditures. Out-of-pocket payments are spending on health directly out-of-pocket by households.
Health_exp_per_capita_USD_2016: Current expenditures on health per capita in current US dollars. Estimates of current health expenditures include healthcare goods and services consumed during each year.
per_capita_exp_PPP_2016: Current expenditures on health per capita expressed in international dollars at purchasing power parity (PPP).
External_health_exp_pct_2016: Share of current health expenditures funded from external sources. External sources consist of direct foreign transfers and foreign transfers distributed by the government, encompassing all financial inflows into the national health system from outside the country. External sources either flow through the government scheme or are channeled through non-governmental organizations or other schemes.
Physicians_per_1000_2009-18: Physicians include generalist and specialist medical practitioners.
Nurse_midwife_per_1000_2009-18: Nurses and midwives include professional nurses, professional midwives, auxiliary nurses, auxiliary midwives, enrolled nurses, enrolled midwives and other associated personnel, such as dental nurses and primary care nurses.
Specialist_surgical_per_1000_2008-18: Specialist surgical workforce is the number of specialist surgical, anaesthetic, and obstetric (SAO) providers who are working in each country per 100,000 population.
Completeness_of_birth_reg_2009-18: Completeness of birth registration is the percentage of children under age 5 whose births were registered at the time of the survey. The numerator of completeness of birth registration includes children whose birth certificate was seen by the interviewer or whose mother or caretaker says the birth has been registered.
Completeness_of_death_reg_2008-16: Completeness of death registration is the estimated percentage of deaths that are registered with their cause of death information in the vital registration system of a country.
Do health spending levels (public or private) or hospital staffing have any effect on the rate at which Covid-19 spreads in a country? Can we use this data to predict the rate at which cases or fatalities will grow?
--- Original source retains full ownership of the source dataset ---
The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters, including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous, binary, or categorical measurements recorded in one-second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. Here, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large databases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequences of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, novel algorithms, and also briefly discuss results on synthetic and real-world data sets. Our algorithm uncovers operationally significant events in high-dimensional data streams in the aviation industry which are not detectable using state-of-the-art methods.
Climate change is one of the biggest challenges facing the world. It is also recognized as one of the challenges where use of Earth observations (EO) can make the most difference, as EO has the capability to capture environmental and socio-economic data over a range of spatial, spectral and temporal resolutions. The impacts of climate change are faced by all, but poor and vulnerable communities and groups are the most affected. Supporting sustainable development agendas while tackling the effects of climate change illustrates the inter-linkages between GEO’s Engagement Priorities. GEO is well positioned to support its Members and the broader community with the requisite Earth observations to support: effective assessment and monitoring of climate and related socio-economic variables; assessment and evaluation of different policy responses and actions; and tracking of the implementation of those actions and responses. GEO uses its unique convening power to connect Members and key partners such as the UN Framework Convention on Climate Change (UNFCCC), the Intergovernmental Panel on Climate Change (IPCC), the World Meteorological Organization (WMO), the United Nations Environment Programme (UN Environment), and the Committee on Earth Observation Satellites (CEOS) to lead national, regional and global climate action efforts.
GEO and the UN Framework Convention on Climate Change (UNFCCC)
The Earth observations community can play a crucial role in global efforts to address climate change and implement the UNFCCC and the 2015 Paris Agreement.
The data and knowledge derived from Earth observations help governments and other stakeholders at regional, national and sub-national levels to respond in many areas, including mitigation, adaptation and other specific provisions of the Paris Agreement, as well as provide input to the process, including through the global stocktake. Earth observations contribute near real-time data on greenhouse gas (GHG) concentrations and emissions for carbon accounting in relation to mitigation responses. When Earth observation data is combined with other critical socio-economic information at the local scale and over extended timescales, efforts to monitor progress on adaptation responses can all be enhanced in addition to impact, vulnerability and risk assessments and the development of measures to increase resilience. GEO, through the first GEO Climate Workshop in June 2018, has engaged with the UNFCCC Secretariat to identify key areas where coordinated Earth observations through GEO could support climate action including: mitigation; activities relating to reducing emissions from deforestation and forest degradation (REDD+); adaptation; loss and damage; technology development and transfer; capacity building; and the global stocktake. The Paris Agreement has also highlighted the need to strengthen scientific knowledge on climate, including research, systematic observation of the climate system and early warning systems, in a manner that informs climate services and supports decision-making. These are all areas where GEO is active through the 2020-2022 Work Programme activities and can provide support.
GEO and the Intergovernmental Panel on Climate Change (IPCC)
Earth observations are important for the work of the IPCC. The IPCC provides scientific input to inform the Conference of the Parties (COP) of the UNFCCC and the Convention bodies, in particular the Subsidiary Body for Scientific and Technological Advice (SBSTA).
The SBSTA has been increasingly emphasizing the value of systematic observations - a term that encompasses Earth observations in the UNFCCC context. Earth observations, and in particular satellite data, provide benchmark measurements on variables which contribute to the accuracy of climate models and projections that inform policy decisions. The refined IPCC Guidelines for National Greenhouse Gas Inventories include information on the potential contributions of space-based Earth observations for comparison with GHG emission estimates. Parties to the UNFCCC have agreed to use the IPCC Guidelines in reporting to the Convention. GEO could further support Parties in using Earth observations for their reporting. The findings of the latest IPCC special reports have already benefited from the enhanced use of Earth observation data and there is scope for improvement. For instance, satellite observations were used to monitor the frequency of marine heatwaves over several decades, and other variables including ice flows. Satellite data was also used to monitor vegetation greening/browning. Increased availability of open Earth observation datasets can increase the quality of monitoring and help address the gaps identified by the IPCC. Current IPCC analyses do not include urban ecosystem dynamics in detail. Urban areas, urban expansion, and other urban processes and their relation to land-related processes are considered “extensive, dynamic, and complex”. The provision of Earth observation information in support of urban resilience is an area of potential GEO support. IPCC analyses have identified “a lack of knowledge of adaptation limits and potential maladaptation to combined effects of climate change and desertification.” A similar issue was identified for limits to adaptation in relation to sea level rise and ice loss.
Earth observations can provide an understanding of real-time physical risk exposure, notably where other sources of data are sparse, to support decision analysis and the development of adaptation solutions. Furthermore, the recent IPCC special reports have pointed out that the expanded use of new Information and Communication Technologies (ICTs), climate services and remote-sensing is critical for near-term actions for capacity-building, as well as technology transfer and deployment to strengthen adaptation and mitigation. GEO is working to address the challenges around big data involving the community to use new ICTs for the development of new ways to monitor climate and non-climate variables. Investments in human, technical and institutional capacities on the expanded use of digital technologies are crucial and are expected to bring high returns.
Group on Earth Observations
7 bis, avenue de la Paix
Case postale 2300
CH-1211 Geneva, Switzerland
Tel. +41 22 730 8505
Fax +41 22 730 8520
secretariat@geosec.org
According to our latest research, the global Artificial Intelligence (AI) Training Dataset market size reached USD 3.15 billion in 2024, reflecting robust industry momentum. The market is expanding at a notable CAGR of 20.8% and is forecasted to attain USD 20.92 billion by 2033. This impressive growth is primarily attributed to the surging demand for high-quality, annotated datasets to fuel machine learning and deep learning models across diverse industry verticals. The proliferation of AI-driven applications, coupled with rapid advancements in data labeling technologies, is further accelerating the adoption and expansion of the AI training dataset market globally.
One of the most significant growth factors propelling the AI training dataset market is the exponential rise in data-driven AI applications across industries such as healthcare, automotive, retail, and finance. As organizations increasingly rely on AI-powered solutions for automation, predictive analytics, and personalized customer experiences, the need for large, diverse, and accurately labeled datasets has become critical. Enhanced data annotation techniques, including manual, semi-automated, and fully automated methods, are enabling organizations to generate high-quality datasets at scale, which is essential for training sophisticated AI models. The integration of AI in edge devices, smart sensors, and IoT platforms is further amplifying the demand for specialized datasets tailored for unique use cases, thereby fueling market growth.
Another key driver is the ongoing innovation in machine learning and deep learning algorithms, which require vast and varied training data to achieve optimal performance. The increasing complexity of AI models, especially in areas such as computer vision, natural language processing, and autonomous systems, necessitates the availability of comprehensive datasets that accurately represent real-world scenarios. Companies are investing heavily in data collection, annotation, and curation services to ensure their AI solutions can generalize effectively and deliver reliable outcomes. Additionally, the rise of synthetic data generation and data augmentation techniques is helping address challenges related to data scarcity, privacy, and bias, further supporting the expansion of the AI training dataset market.
The market is also benefiting from the growing emphasis on ethical AI and regulatory compliance, particularly in data-sensitive sectors like healthcare, finance, and government. Organizations are prioritizing the use of high-quality, unbiased, and diverse datasets to mitigate algorithmic bias and ensure transparency in AI decision-making processes. This focus on responsible AI development is driving demand for curated datasets that adhere to strict quality and privacy standards. Moreover, the emergence of data marketplaces and collaborative data-sharing initiatives is making it easier for organizations to access and exchange valuable training data, fostering innovation and accelerating AI adoption across multiple domains.
From a regional perspective, North America currently dominates the AI training dataset market, accounting for the largest revenue share in 2024, driven by significant investments in AI research, a mature technology ecosystem, and the presence of leading AI companies and data annotation service providers. Europe and Asia Pacific are also witnessing rapid growth, with increasing government support for AI initiatives, expanding digital infrastructure, and a rising number of AI startups. While North America sets the pace in terms of technological innovation, Asia Pacific is expected to exhibit the highest CAGR during the forecast period, fueled by the digital transformation of emerging economies and the proliferation of AI applications across various industry sectors.
The AI training dataset market is segmented by data type into Text, Image/Video, Audio, and Others, each playing a crucial role in powering different AI applications. Text da
Judgement on American and Soviet foreign policy. Attitude to selected countries and NATO. Topics: Most important problems of the country; attitude to France, Germany, Great Britain, the USSR and the USA as well as perceived changes in the last few years; assumed reputation of one´s own country abroad; trust in the USA and the USSR to solve world problems; judgement on the agreement of words and deeds in foreign policy as well as the seriousness of the peace efforts of the two great powers; the USSR or the USA as current and as future world power in the military and scientific area as well as in space research; benefit of space travel; attitude to a strengthening of space flight efforts; knowledge about the landing on the moon; necessity of NATO; trust in NATO; judgement on the contribution of one´s own country to NATO; preference for acceptance of political functions by NATO; attitude to a reduction in US soldiers stationed in Western Europe; expected reductions of American obligations in Europe; probability of European unification; desired activities of government in the direction of European unification; preference for a European nuclear force; judgement on the disarmament negotiations between the USA and the USSR; expected benefit of such negotiations for one´s own country and expected consideration of European interests; increased danger of war from the new missile defense systems; prospects of the so-called Budapest recommendation; attitude to the American Vietnam policy; negotiating party that can be held responsible for the failure of the Paris talks; sympathy for Arabs or Israelis in the Middle East Conflict; preference for withdrawal of the Israelis from the occupied territories; attitude to an increase in the total population in one´s country and in the whole world; attitude to birth control in one´s country; attitude to economic aid for lesser developed countries; judgement on the influence and advantageousness of American investments as well as American 
way of life for one´s own country; autostereotype and description of the American character by means of the same list of characteristics (stereotype); general attitude to American culture; perceived increase in American prosperity; trust in the ability of American politics to solve their own economic and social problems; judgement on the treatment of blacks in the USA and determined changes; proportion of poor in the USA; comparison of proportion of violence or crime in the USA with one´s own country; general judgement on the youth in one´s country in comparison to the USA; assessment of the persuasiveness of the American or Soviet view; religiousness; city size. Also encoded was: length of interview; number of contact attempts; presence of other persons during the interview; willingness of respondent to cooperate; understanding difficulties of respondent.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We would like to inform you that the updated GlobPOP dataset (2021-2022) is now available in version 2.0. The 2021-2022 data in the current version are therefore not recommended for use. The 1990-2020 data in the current version are identical to version 1.0.
Thank you for your continued support of GlobPOP.
If you encounter any issues, please contact us via email at lulingliu@mail.bnu.edu.cn.
Continuously monitoring global population spatial dynamics is essential for implementing effective policies related to sustainable development in fields such as epidemiology, urban planning, and the study of global inequality.
Here, we present GlobPOP, a new continuous global gridded population product with a high-precision spatial resolution of 30 arc-seconds, spanning 1990 to 2020. Our data-fusion framework is based on cluster analysis and statistical learning approaches and fuses five existing products (the Global Human Settlements Layer Population (GHS-POP), the Global Rural-Urban Mapping Project (GRUMP), the Gridded Population of the World Version 4 (GPWv4), the LandScan population datasets, and the WorldPop datasets) into a new continuous global gridded population product (GlobPOP). The spatial validation results demonstrate that the GlobPOP dataset is highly accurate. To validate the temporal accuracy of GlobPOP at the country level, we have developed an interactive web application, accessible at https://globpop.shinyapps.io/GlobPOP/, where data users can explore the country-level population time series of interest and compare them with census data.
With the availability of GlobPOP dataset in both population count and population density formats, researchers and policymakers can leverage our dataset to conduct time-series analysis of population and explore the spatial patterns of population development at various scales, ranging from national to city level.
The product is produced at 30 arc-second resolution (approximately 1 km at the equator) and is made available in GeoTIFF format. There are two population formats: 'Count' (population count per grid cell) and 'Density' (population count per square kilometer in each grid cell).
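The two formats are related through the area of each grid cell, which shrinks toward the poles. The sketch below illustrates that relationship under a simple spherical-Earth assumption; it is our own back-of-the-envelope helper, not the projection used to produce GlobPOP.

```python
import math

# Approximate area of a 30 arc-second grid cell at a given latitude,
# assuming a spherical Earth of mean radius 6371.0088 km. Illustrative
# only; not the official GlobPOP conversion.
KM_PER_DEG = 2 * math.pi * 6371.0088 / 360  # ~111.2 km per degree of arc
CELL_DEG = 30 / 3600                        # 30 arc seconds in degrees

def cell_area_km2(lat_deg: float) -> float:
    """Approximate area in km^2 of a 30 arc-second pixel at latitude lat_deg."""
    side_ns = KM_PER_DEG * CELL_DEG                                  # north-south extent
    side_ew = KM_PER_DEG * CELL_DEG * math.cos(math.radians(lat_deg))  # east-west extent
    return side_ns * side_ew

def density_to_count(density: float, lat_deg: float) -> float:
    """Convert persons per km^2 ('Density') to persons per cell ('Count')."""
    return density * cell_area_km2(lat_deg)
```

At the equator a cell covers roughly 0.86 km², so 'Count' and 'Density' values are numerically close there; at higher latitudes they diverge as the east-west cell extent shrinks with cos(latitude).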
Each GeoTIFF filename has five fields separated by an underscore ("_"), followed by the filename extension. The fields are described below using the example filename:
GlobPOP_Count_30arc_1990_I32
Field 1: GlobPOP (global gridded population)
Field 2: Pixel unit is population "Count" or population "Density"
Field 3: Spatial resolution is 30 arc seconds
Field 4: Year, e.g. "1990"
Field 5: Data type is I32 (Int32) or F32 (Float32)
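The five-field naming scheme above is easy to parse programmatically. A small illustrative helper (the dictionary keys are our own labels, not part of the dataset):

```python
# Parse a GlobPOP GeoTIFF filename into its five underscore-separated
# fields, following the naming scheme described above.
def parse_globpop_name(filename: str) -> dict:
    # Strip the filename extension, if present.
    stem = filename.rsplit(".", 1)[0] if "." in filename else filename
    product, unit, resolution, year, dtype = stem.split("_")
    return {
        "product": product,        # Field 1: "GlobPOP"
        "unit": unit,              # Field 2: "Count" or "Density"
        "resolution": resolution,  # Field 3: "30arc"
        "year": int(year),         # Field 4: e.g. 1990
        "dtype": dtype,            # Field 5: "I32" or "F32"
    }

info = parse_globpop_name("GlobPOP_Count_30arc_1990_I32.tif")
```

For the example filename, `info["unit"]` is `"Count"`, `info["year"]` is `1990`, and `info["dtype"]` is `"I32"`.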
Please refer to the paper for detailed information:
Liu, L., Cao, X., Li, S. et al. A 31-year (1990–2020) global gridded population dataset generated by cluster analysis and statistical learning. Sci Data 11, 124 (2024). https://doi.org/10.1038/s41597-024-02913-0.
The fully reproducible codes are publicly available at GitHub: https://github.com/lulingliu/GlobPOP.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for FloorSet
Dataset Summary
FloorSet is a dataset containing large-scale training data that reflect the real-world constraints and objectives of the floorplanning problem in the chip-design flow, a crucial component of the structural design flow. The dataset contains synthetic fixed-outline floorplan layouts in pickle (.pkl) format that reflect the distribution of real SoC and sub-system layouts. It has 1M training samples and 100 test samples, with hard… See the full description on the dataset page: https://huggingface.co/datasets/IntelLabs/FloorSet.
The survey on social and political attitudes was conducted by forsa on behalf of the Press and Information Office of the German Federal Government. In the first quarter of 2022, 19,542 persons aged 14 and older were surveyed in telephone interviews (CATI) on the following topics: assessment of one's own financial situation, assessment of the general living situation and perceptions of the federal government's policies, and assessment of the world and European political situation. Respondents were selected by a multi-stage random sample. 1. Assessment of one's own financial situation: assessment of one's own financial situation compared with that of a year ago; expected change in one's own financial situation in a year's time; currently a favorable time for major purchases vs. rather reluctant; assessment of how most people in the respondent's social environment view their economic situation: rather optimistic or rather pessimistic. 2. Assessment of the general living situation and perceptions of the federal government's policies: development of things in the country in the right direction; perceived issues of the federal government (e.g. debates or legislative proposals) in recent weeks (open); satisfaction in selected areas of life and problems (situation in the labor market, protection against violence and crime, extent of social justice, quality of life in Germany, financial situation of public budgets, school and education system in Germany, integration of migrants and foreigners, reception and handling of refugees and asylum seekers, securing old-age pensions, care for those in need of long-term care, protection of the environment and climate, health care in Germany). 3. Assessment of the world and European political situation: concerns about world peace; worldwide crises with threat potential for Germany (open); opinion on Germany's foreign-policy role in the world with regard to the world political situation (take more vs. less responsibility, already does enough); opinion on Germany's role in the EU (takes too little vs. too much consideration of other member states, just right). Demography: sex; age; employment; education; net household income (grouped); federal election voting intention; federal election voting behavior. Additionally coded: quarter; region East/West; weighting factor.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cancer treatment has become one of the biggest challenges in the world today. Different treatments are used against cancer; drug-based treatments have shown better results. On the other hand, designing new drugs for cancer is costly and time-consuming. Some computational methods, such as machine learning and deep learning, have been suggested to address these challenges through drug repurposing. Although classical machine-learning methods have shown promise in repurposing cancer drugs and predicting responses, deep-learning methods have performed better. This study aims to develop a deep-learning model that predicts cancer drug response from multi-omics data, drug descriptors, and drug fingerprints, and that facilitates the repurposing of drugs based on those responses. To reduce the dimensionality of the multi-omics data, we use autoencoders, which are connected to MLPs to form a multi-task learning model. We extensively tested our model on three primary datasets (GDSC, CTRP, and CCLE) to determine its efficacy. In multiple experiments, our model consistently outperforms existing state-of-the-art methods, achieving an AUPRC of 0.99. Furthermore, in a cross-dataset evaluation, where the model is trained on GDSC and tested on CCLE, it surpasses the performance of three previous works, achieving an AUPRC of 0.72. In conclusion, we present a deep-learning model that outperforms the current state of the art in generalization. Using this model, we can assess drug responses and explore drug repurposing, leading to the discovery of novel cancer drugs. Our study highlights the potential of advanced deep learning to improve the precision of cancer therapeutics.
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/cc-by/cc-by_f24dc630aa52ab8c52a0ac85c03bc35e0abc850b4d7453bdc083535b41d5a5c3.pdf
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread. ERA5 is updated daily with a latency of about 5 days. In case that serious flaws are detected in this early release (called ERA5T), this data could be different from the final release 2 to 3 months later. 
Should this occur, users are notified. The dataset presented here is a regridded subset of the full ERA5 dataset at native resolution. It is held online on spinning disk, which ensures fast and easy access, and should satisfy the requirements of most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data at native resolution is provided in these guidelines. Data have been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper-air fields) and single levels (atmospheric, ocean-wave, and land-surface quantities). The present entry is "ERA5 hourly data on single levels from 1940 to present".
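This entry is typically retrieved programmatically through the Climate Data Store's `cdsapi` package (which requires a free CDS account and an API key in `~/.cdsapirc`). The sketch below builds a minimal single-field request; the dataset name and request keys follow the CDS web form for this entry, but treat the exact values here as illustrative rather than authoritative.

```python
# Build a minimal request for one field from "ERA5 hourly data on single
# levels". Illustrative sketch; consult the CDS web form for valid values.
def build_request(variable: str, year: str, month: str, day: str, time: str) -> dict:
    return {
        "product_type": "reanalysis",
        "variable": variable,  # e.g. "2m_temperature"
        "year": year,
        "month": month,
        "day": day,
        "time": time,
        "format": "netcdf",
    }

def download_example(target: str = "era5_t2m.nc") -> None:
    # Not executed here: requires network access and CDS credentials.
    import cdsapi
    cdsapi.Client().retrieve(
        "reanalysis-era5-single-levels",
        build_request("2m_temperature", "2020", "01", "01", "12:00"),
        target,
    )
```

The request dictionary mirrors the selections made on the CDS download form; the API queues the request server-side and writes the result to `target` when it completes.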
Overview The Office of the Geographer and Global Issues at the U.S. Department of State produces the Large Scale International Boundaries (LSIB) dataset. The current edition is version 11.4 (published 24 February 2025). The 11.4 release contains updated boundary lines and data refinements designed to extend the functionality of the dataset. These data and generalized derivatives are the only international boundary lines approved for U.S. Government use. The contents of this dataset reflect U.S. Government policy on international boundary alignment, political recognition, and dispute status. They do not necessarily reflect de facto limits of control. National Geospatial Data Asset This dataset is a National Geospatial Data Asset (NGDAID 194) managed by the Department of State. It is a part of the International Boundaries Theme created by the Federal Geographic Data Committee. Dataset Source Details Sources for these data include treaties, relevant maps, and data from boundary commissions, as well as national mapping agencies. Where available and applicable, the dataset incorporates information from courts, tribunals, and international arbitrations. The research and recovery process includes analysis of satellite imagery and elevation data. Due to the limitations of source materials and processing techniques, most lines are within 100 meters of their true position on the ground. Cartographic Visualization The LSIB is a geospatial dataset that, when used for cartographic purposes, requires additional styling. The LSIB download package contains example style files for commonly used software applications. The attribute table also contains embedded information to guide the cartographic representation. Additional discussion of these considerations can be found in the Use of Core Attributes in Cartographic Visualization section below. 
Additional cartographic information pertaining to the depiction and description of international boundaries or areas of special sovereignty can be found in Guidance Bulletins published by the Office of the Geographer and Global Issues: https://data.geodata.state.gov/guidance/index.html Contact Direct inquiries to internationalboundaries@state.gov. Direct download: https://data.geodata.state.gov/LSIB.zip Attribute Structure The dataset uses the following attributes, divided into two categories:
ATTRIBUTE NAME | ATTRIBUTE STATUS
CC1 | Core
CC1_GENC3 | Extension
CC1_WPID | Extension
COUNTRY1 | Core
CC2 | Core
CC2_GENC3 | Extension
CC2_WPID | Extension
COUNTRY2 | Core
RANK | Core
LABEL | Core
STATUS | Core
NOTES | Core
LSIB_ID | Extension
ANTECIDS | Extension
PREVIDS | Extension
PARENTID | Extension
PARENTSEG | Extension
These attributes have external data sources that update separately from the LSIB:
ATTRIBUTE NAME | EXTERNAL SOURCE
CC1 | GENC
CC1_GENC3 | GENC
CC1_WPID | World Polygons
COUNTRY1 | DoS Lists
CC2 | GENC
CC2_GENC3 | GENC
CC2_WPID | World Polygons
COUNTRY2 | DoS Lists
LSIB_ID | BASE
ANTECIDS | BASE
PREVIDS | BASE
PARENTID | BASE
PARENTSEG | BASE
The core attributes listed above describe the boundary lines contained within the LSIB dataset. Removal of core attributes from the dataset will change the meaning of the lines. An attribute status of “Extension” denotes a field containing data-interoperability information. Other attributes not listed above include “FID”, “Shape_length” and “Shape.” These are components of the shapefile format and do not form an intrinsic part of the LSIB. Core Attributes The eight core attributes listed above contain unique information which, when combined with the line geometry, comprises the LSIB dataset. These Core Attributes are further divided into Country Code and Name Fields and Descriptive Fields. Country Code and Country Name Fields The “CC1” and “CC2” fields are machine-readable fields that contain political entity codes.
These are two-character codes derived from the Geopolitical Entities, Names, and Codes Standard (GENC), Edition 3 Update 18. The “CC1_GENC3” and “CC2_GENC3” fields contain the corresponding three-character GENC codes and are extension attributes discussed below. The codes “Q2” and “QX2” denote a line in the LSIB representing a boundary associated with areas not contained within the GENC standard. The “COUNTRY1” and “COUNTRY2” fields contain the names of the corresponding political entities. These fields contain names approved by the U.S. Board on Geographic Names (BGN) as incorporated in the "Independent States in the World" and "Dependencies and Areas of Special Sovereignty" lists maintained by the Department of State. To ensure maximum compatibility, names are presented without diacritics and certain names are rendered using common cartographic abbreviations. Names for lines associated with the code "Q2" are descriptive and not necessarily BGN-approved. Names rendered in all CAPITAL LETTERS denote independent states. Names rendered in normal text represent dependencies, areas of special sovereignty, or are otherwise presented for the convenience of the user. Descriptive Fields The following text fields are a part of the core attributes of the LSIB dataset and do not update from external sources. They provide additional information about each of the lines and are as follows:
ATTRIBUTE NAME | CONTAINS NULLS
RANK | No
STATUS | No
LABEL | Yes
NOTES | Yes
Neither the "RANK" nor "STATUS" fields contain null values; the "LABEL" and "NOTES" fields do. The "RANK" field is a numeric expression of the "STATUS" field. Combined with the line geometry, these fields encode the views of the United States Government on the political status of the boundary line.
ATTRIBUTE NAME | VALUES
RANK | 1 | 2 | 3
STATUS | International Boundary | Other Line of International Separation | Special Line
A value of “1” in the “RANK” field corresponds to an "International Boundary" value in the “STATUS” field. Values of “2” and “3” correspond to “Other Line of International Separation” and “Special Line,” respectively. The “LABEL” field contains required text to describe the line segment on all finished cartographic products, including but not limited to print and interactive maps. The “NOTES” field contains an explanation of special circumstances modifying the lines. This information can pertain to the origins of the boundary lines, limitations regarding the purpose of the lines, or the original source of the line. Use of Core Attributes in Cartographic Visualization Several of the Core Attributes provide information required for the proper cartographic representation of the LSIB dataset. The cartographic usage of the LSIB requires a visual differentiation between the three categories of boundary lines. Specifically, this differentiation must be between: International Boundaries (Rank 1); Other Lines of International Separation (Rank 2); and Special Lines (Rank 3). Rank 1 lines must be the most visually prominent. Rank 2 lines must be less visually prominent than Rank 1 lines. Rank 3 lines must be shown in a manner visually subordinate to Ranks 1 and 2. Where scale permits, Rank 2 and 3 lines must be labeled in accordance with the “Label” field. Data marked with a Rank 2 or 3 designation does not necessarily correspond to a disputed boundary. Please consult the style files in the download package for examples of this depiction. The requirement to incorporate the contents of the "LABEL" field on cartographic products is scale dependent. If a label is legible at the scale of a given static product, a proper use of this dataset would encourage the application of that label.
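When styling the LSIB programmatically, the RANK-to-STATUS correspondence described above can be encoded as a simple lookup table. The snippet below is a convenience sketch of our own, not part of the LSIB distribution, and the line weights are hypothetical placeholder values illustrating the required visual hierarchy (Rank 1 most prominent, Rank 3 most subordinate):

```python
# Lookup table encoding the RANK-to-STATUS correspondence from the
# LSIB documentation.
RANK_TO_STATUS = {
    1: "International Boundary",
    2: "Other Line of International Separation",
    3: "Special Line",
}

# Hypothetical line weights: values chosen only to illustrate that
# visual prominence must decrease as RANK increases.
RANK_TO_WEIGHT = {1: 1.2, 2: 0.8, 3: 0.5}

def status_for_rank(rank: int) -> str:
    """Return the STATUS text corresponding to a RANK value (1, 2, or 3)."""
    return RANK_TO_STATUS[rank]
```

A renderer could then look up each feature's RANK to select both its legend text (via STATUS) and its stroke weight.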
Using the contents of the "COUNTRY1" and "COUNTRY2" fields in the generation of a line segment label is not required. The "STATUS" field contains the preferred description for the three LSIB line types when they are incorporated into a map legend but is otherwise not to be used for labeling. Use of the “CC1,” “CC1_GENC3,” “CC2,” “CC2_GENC3,” “RANK,” or “NOTES” fields for cartographic labeling purposes is prohibited. Extension Attributes Certain elements of the attributes within the LSIB dataset extend data functionality to make the data more interoperable or to provide clearer linkages to other datasets. The fields “CC1_GENC3” and “CC2_GENC3” contain the three-character GENC codes corresponding to the “CC1” and “CC2” attributes. The code “QX2” is the three-character counterpart of the code “Q2,” which denotes a line in the LSIB representing a boundary associated with a geographic area not contained within the GENC standard. To allow for linkage between individual lines in the LSIB and the World Polygons dataset, the “CC1_WPID” and “CC2_WPID” fields contain a Universally Unique Identifier (UUID), version 4, which provides a stable description of each geographic entity in a boundary pair relationship. Each UUID corresponds to a geographic entity listed in the World Polygons dataset. These fields allow for linkage between individual lines in the LSIB and the overall World Polygons dataset. Five additional fields in the LSIB expand on the UUID concept and either describe features that have changed across space and time or indicate relationships between previous versions of the feature. The “LSIB_ID” attribute is a UUID value that defines a specific instance of a feature. Any change to the feature in a lineset requires a new “LSIB_ID.” The “ANTECIDS,” or antecedent ID, is a UUID that references line geometries from which a given line is descended in time. It is used when there is a feature that is entirely new, not when there is a new version of a previous feature. 
This is generally used to reference countries that have dissolved. The “PREVIDS,” or Previous ID, is a UUID field that contains old versions of a line. This is an additive field that houses all Previous IDs. A new version of a feature is defined by any change to the