The United States Census Bureau’s international dataset provides estimates of country populations since 1950 and projections through 2050. Specifically, the dataset includes midyear population figures broken down by age and gender assignment at birth. Additionally, time-series data is provided for attributes including fertility rates, birth rates, death rates, and migration rates.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.census_bureau_international.
What countries have the longest life expectancy? In this query, 2016 census information is retrieved by joining the mortality_life_expectancy and country_names_area tables for countries larger than 25,000 km2. Without the size constraint, Monaco is the top result with an average life expectancy of over 89 years!
SELECT
  age.country_name,
  age.life_expectancy,
  size.country_area
FROM (
  SELECT
    country_name,
    life_expectancy
  FROM
    `bigquery-public-data.census_bureau_international.mortality_life_expectancy`
  WHERE
    year = 2016) age
INNER JOIN (
  SELECT
    country_name,
    country_area
  FROM
    `bigquery-public-data.census_bureau_international.country_names_area`
  WHERE
    country_area > 25000) size
ON
  age.country_name = size.country_name
ORDER BY
  2 DESC
/* Remove the limit for Data Studio visualization */
LIMIT
  10
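As a rough illustration of the client-library route mentioned above, the following sketch wraps the life-expectancy query in Python. It assumes the google-cloud-bigquery package (plus pandas) is installed and that credentials are already configured; the names LIFE_EXPECTANCY_SQL and run_query are illustrative and not part of the dataset documentation.

```python
# Minimal sketch: run the life-expectancy query with the BigQuery Python client.
# Assumption: google-cloud-bigquery and pandas are installed and credentials
# are available in the environment.

LIFE_EXPECTANCY_SQL = """
SELECT
  age.country_name,
  age.life_expectancy,
  size.country_area
FROM (
  SELECT country_name, life_expectancy
  FROM `bigquery-public-data.census_bureau_international.mortality_life_expectancy`
  WHERE year = 2016) age
INNER JOIN (
  SELECT country_name, country_area
  FROM `bigquery-public-data.census_bureau_international.country_names_area`
  WHERE country_area > 25000) size
ON age.country_name = size.country_name
ORDER BY age.life_expectancy DESC
LIMIT 10
"""


def run_query(sql):
    """Execute `sql` and return the result as a pandas DataFrame."""
    # Imported lazily so this module loads even where the client is absent.
    from google.cloud import bigquery

    client = bigquery.Client()
    return client.query(sql).to_dataframe()
```

Calling run_query(LIFE_EXPECTANCY_SQL) should return one row per country; the query only reads data, in line with the restriction noted above.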
Which countries have the largest proportion of their population under 25? Over 40% of the world’s population is under 25 and greater than 50% of the world’s population is under 30! This query retrieves the countries with the largest proportion of young people by joining the age-specific population table with the midyear (total) population table.
SELECT
  age.country_name,
  SUM(age.population) AS under_25,
  pop.midyear_population AS total,
  ROUND((SUM(age.population) / pop.midyear_population) * 100, 2) AS pct_under_25
FROM (
  SELECT
    country_name,
    population,
    country_code
  FROM
    `bigquery-public-data.census_bureau_international.midyear_population_agespecific`
  WHERE
    year = 2017
    AND age < 25) age
INNER JOIN (
  SELECT
    midyear_population,
    country_code
  FROM
    `bigquery-public-data.census_bureau_international.midyear_population`
  WHERE
    year = 2017) pop
ON
  age.country_code = pop.country_code
GROUP BY
  1,
  3
ORDER BY
  4 DESC /* Remove limit for visualization */
LIMIT
  10
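The filter, sum, join, and ratio steps of this query can be mirrored locally in pandas. The sketch below uses made-up toy rows (the countries and numbers are hypothetical, not census figures) purely to show how the percentage is derived.

```python
import pandas as pd

# Hypothetical toy rows standing in for midyear_population_agespecific (2017).
age_specific = pd.DataFrame({
    "country_code": ["AA", "AA", "AA", "BB", "BB", "BB"],
    "country_name": ["Country A"] * 3 + ["Country B"] * 3,
    "age": [10, 24, 40, 10, 24, 40],
    "population": [300, 200, 500, 100, 100, 800],
})

# Hypothetical toy rows standing in for midyear_population (2017).
totals = pd.DataFrame({
    "country_code": ["AA", "BB"],
    "midyear_population": [1000, 1000],
})

# Keep ages below 25 and sum the population per country.
under_25 = (
    age_specific[age_specific["age"] < 25]
    .groupby(["country_code", "country_name"], as_index=False)["population"]
    .sum()
    .rename(columns={"population": "under_25"})
)

# Join with the totals and compute the share of people under 25.
result = under_25.merge(totals, on="country_code", how="inner")
result["pct_under_25"] = (
    result["under_25"] / result["midyear_population"] * 100
).round(2)
result = result.sort_values("pct_under_25", ascending=False).reset_index(drop=True)
```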
The International Census dataset contains growth information in the form of birth rates, death rates, and migration rates. Net migration is the net number of migrants per 1,000 population, an important component of total population and one that often drives the work of the United Nations Refugee Agency. This query joins the growth rate table with the area table to retrieve 2017 data for countries greater than 500 km2.
SELECT
growth.country_name,
growth.net_migration,
CAST(area.country_area AS INT64) AS country_area
FROM (
SELECT
country_name,
net_migration,
country_code
FROM
bigquery-public-data.census_bureau_international.birth_death_growth_rates
WHERE
year = 2017) growth
INNER JOIN (
SELECT
country_area,
country_code
FROM
bigquery-public-data.census_bureau_international.country_names_area
Historic (none)
United States Census Bureau
Terms of use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
See the GCP Marketplace listing for more details and sample queries: https://console.cloud.google.com/marketplace/details/united-states-census-bureau/international-census-data
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All cities with a population > 1000 or seats of administrative divisions (ca. 80,000).
Sources and Contributions
Sources: GeoNames aggregates over a hundred different data sources.
Ambassadors: GeoNames Ambassadors help in many countries.
Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.
Enrichment: add country name
By Arthur Keen [source]
This dataset contains the top 100 global banks ranked by total assets on December 31, 2017. With a detailed list of key information for each bank's rank, country, balance sheet and US Total Assets (in billions), this data will be invaluable for those looking to research and study the current status of some of the world's leading financial organizations. From billion-dollar mega-banks such as JP Morgan Chase to small, local savings & loans institutions like BancorpSouth, this comprehensive overview allows researchers and analysts to gain a better understanding of who holds power in the world economy today.
This dataset contains the rank and total asset information of the top 100 global banks as of December 31, 2017. It is a useful resource for researchers who wish to study how key financial institutions' asset information relate to each other across countries.
Using this dataset is relatively straightforward – it consists of three columns - rank (the order in which each bank appears in the list), country (the country in which the bank is located) and total assets US billions (the total value expressed in US dollars). Additionally, there is a fourth column containing the balance sheet information for each bank as well.
In order to make full use of this dataset, one should analyse it by creating comparison grids based on different factors such as region, size or ownership structures. This can provide an interesting insight into how financial markets are structured within different economies and allow researchers to better understand some banking sector dynamics that are particularly relevant for certain countries or regions. Additionally, one can compare any two banks side-by-side using their respective balance sheets or distribution plot graphs based on size or concentration metrics by leverage or other financial ratios as well.
Overall, this dataset provides a useful reference point for data visualization, trend analysis, and forecasting focused on banking activities worldwide.
Analyzing the differences in total assets across countries. By comparing and contrasting data, patterns could be found that give insight into the factors driving differences in banks’ assets between different markets.
Using predictive models to identify which banks are more likely to perform better based on their balance sheet data, such as by predicting future profits or cashflows of said banks.
Leveraging the information on holdings and investments of “top-ranked” banks as a guide for personal investment decisions or to inform the investment strategies of large financial institutions or hedge funds.
If you use this dataset in your research, please credit the original authors.
License: Dataset copyright by authors
You are free to:
- Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
- Adapt - remix, transform, and build upon the material for any purpose, even commercially.
You must:
- Give appropriate credit - Provide a link to the license, and indicate if changes were made.
- ShareAlike - You must distribute your contributions under the same license as the original.
- Keep intact - all notices that refer to this license, including copyright notices.
File: top50banks2017-03-31.csv
| Column name | Description |
|:----------------------|:------------------------------------------------------------------------|
| rank | The rank of the bank globally based on total assets. (Integer) |
| country | The country where the bank is located. (String) |
| total_assets_us_b | The total assets of a bank expressed in billions of US dollars. (Float) |
| balance_sheet | A snapshot of banking activities for a specific date. (Date) |
File: top100banks2017-12-31.csv | Column name | Description | |:----------------------|:--------------------------------------------...
Overview
The Office of the Geographer and Global Issues at the U.S. Department of State produces the Large Scale International Boundaries (LSIB) dataset. The current edition is version 11.4 (published 24 February 2025). The 11.4 release contains updated boundary lines and data refinements designed to extend the functionality of the dataset. These data and generalized derivatives are the only international boundary lines approved for U.S. Government use. The contents of this dataset reflect U.S. Government policy on international boundary alignment, political recognition, and dispute status. They do not necessarily reflect de facto limits of control.
National Geospatial Data Asset
This dataset is a National Geospatial Data Asset (NGDAID 194) managed by the Department of State. It is a part of the International Boundaries Theme created by the Federal Geographic Data Committee.
Dataset Source Details
Sources for these data include treaties, relevant maps, and data from boundary commissions, as well as national mapping agencies. Where available and applicable, the dataset incorporates information from courts, tribunals, and international arbitrations. The research and recovery process includes analysis of satellite imagery and elevation data. Due to the limitations of source materials and processing techniques, most lines are within 100 meters of their true position on the ground.
Cartographic Visualization
The LSIB is a geospatial dataset that, when used for cartographic purposes, requires additional styling. The LSIB download package contains example style files for commonly used software applications. The attribute table also contains embedded information to guide the cartographic representation. Additional discussion of these considerations can be found in the Use of Core Attributes in Cartographic Visualization section below.
Additional cartographic information pertaining to the depiction and description of international boundaries or areas of special sovereignty can be found in Guidance Bulletins published by the Office of the Geographer and Global Issues: https://data.geodata.state.gov/guidance/index.html
Contact
Direct inquiries to internationalboundaries@state.gov.
Direct download: https://data.geodata.state.gov/LSIB.zip
Attribute Structure
The dataset uses the following attributes divided into two categories:
ATTRIBUTE NAME | ATTRIBUTE STATUS
CC1 | Core
CC1_GENC3 | Extension
CC1_WPID | Extension
COUNTRY1 | Core
CC2 | Core
CC2_GENC3 | Extension
CC2_WPID | Extension
COUNTRY2 | Core
RANK | Core
LABEL | Core
STATUS | Core
NOTES | Core
LSIB_ID | Extension
ANTECIDS | Extension
PREVIDS | Extension
PARENTID | Extension
PARENTSEG | Extension
These attributes have external data sources that update separately from the LSIB:
ATTRIBUTE NAME | ATTRIBUTE STATUS
CC1 | GENC
CC1_GENC3 | GENC
CC1_WPID | World Polygons
COUNTRY1 | DoS Lists
CC2 | GENC
CC2_GENC3 | GENC
CC2_WPID | World Polygons
COUNTRY2 | DoS Lists
LSIB_ID | BASE
ANTECIDS | BASE
PREVIDS | BASE
PARENTID | BASE
PARENTSEG | BASE
The core attributes listed above describe the boundary lines contained within the LSIB dataset. Removal of core attributes from the dataset will change the meaning of the lines. An attribute status of “Extension” represents a field containing data interoperability information. Other attributes not listed above include “FID”, “Shape_length” and “Shape.” These are components of the shapefile format and do not form an intrinsic part of the LSIB.
Core Attributes
The eight core attributes listed above contain unique information which, when combined with the line geometry, comprise the LSIB dataset. These Core Attributes are further divided into Country Code and Name Fields and Descriptive Fields.
Country Code and Country Name Fields
“CC1” and “CC2” fields are machine readable fields that contain political entity codes.
These are two-character codes derived from the Geopolitical Entities, Names, and Codes Standard (GENC), Edition 3 Update 18. “CC1_GENC3” and “CC2_GENC3” fields contain the corresponding three-character GENC codes and are extension attributes discussed below. The codes “Q2” or “QX2” denote a line in the LSIB representing a boundary associated with areas not contained within the GENC standard. The “COUNTRY1” and “COUNTRY2” fields contain the names of corresponding political entities. These fields contain names approved by the U.S. Board on Geographic Names (BGN) as incorporated in the "Independent States in the World" and "Dependencies and Areas of Special Sovereignty" lists maintained by the Department of State. To ensure maximum compatibility, names are presented without diacritics and certain names are rendered using common cartographic abbreviations. Names for lines associated with the code "Q2" are descriptive and not necessarily BGN-approved. Names rendered in all CAPITAL LETTERS denote independent states. Names rendered in normal text represent dependencies, areas of special sovereignty, or are otherwise presented for the convenience of the user.
Descriptive Fields
The following text fields are a part of the core attributes of the LSIB dataset and do not update from external sources. They provide additional information about each of the lines and are as follows:
ATTRIBUTE NAME | CONTAINS NULLS
RANK | No
STATUS | No
LABEL | Yes
NOTES | Yes
Neither the "RANK" nor "STATUS" fields contain null values; the "LABEL" and "NOTES" fields do. The "RANK" field is a numeric expression of the "STATUS" field. Combined with the line geometry, these fields encode the views of the United States Government on the political status of the boundary line.
ATTRIBUTE NAME | VALUE
RANK | 1 | 2 | 3
STATUS | International Boundary | Other Line of International Separation | Special Line
A value of “1” in the “RANK” field corresponds to an "International Boundary" value in the “STATUS” field. Values of “2” and “3” correspond to “Other Line of International Separation” and “Special Line,” respectively. The “LABEL” field contains required text to describe the line segment on all finished cartographic products, including but not limited to print and interactive maps. The “NOTES” field contains an explanation of special circumstances modifying the lines. This information can pertain to the origins of the boundary lines, limitations regarding the purpose of the lines, or the original source of the line.
Use of Core Attributes in Cartographic Visualization
Several of the Core Attributes provide information required for the proper cartographic representation of the LSIB dataset. The cartographic usage of the LSIB requires a visual differentiation between the three categories of boundary lines. Specifically, this differentiation must be between: International Boundaries (Rank 1); Other Lines of International Separation (Rank 2); and Special Lines (Rank 3). Rank 1 lines must be the most visually prominent. Rank 2 lines must be less visually prominent than Rank 1 lines. Rank 3 lines must be shown in a manner visually subordinate to Ranks 1 and 2. Where scale permits, Rank 2 and 3 lines must be labeled in accordance with the “LABEL” field. Data marked with a Rank 2 or 3 designation does not necessarily correspond to a disputed boundary. Please consult the style files in the download package for examples of this depiction. The requirement to incorporate the contents of the "LABEL" field on cartographic products is scale dependent. If a label is legible at the scale of a given static product, a proper use of this dataset would encourage the application of that label.
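The rank ordering described above can be captured in a small lookup. This is only a sketch: the rank-to-status mapping comes from the text, while the line widths and the style_for helper are hypothetical placeholders and do not reproduce the style files shipped in the download package.

```python
# Rank-to-status correspondence as stated in the dataset description.
RANK_TO_STATUS = {
    1: "International Boundary",
    2: "Other Line of International Separation",
    3: "Special Line",
}

# Hypothetical line widths (points), chosen only so that visual prominence
# decreases from Rank 1 to Rank 3. They are not official values.
HYPOTHETICAL_WIDTH_PT = {1: 1.5, 2: 1.0, 3: 0.5}


def style_for(rank, label=None):
    """Return an illustrative style record for one LSIB line."""
    if rank not in RANK_TO_STATUS:
        raise ValueError(f"unexpected RANK value: {rank!r}")
    return {
        # STATUS text is the preferred legend description.
        "legend_text": RANK_TO_STATUS[rank],
        "width_pt": HYPOTHETICAL_WIDTH_PT[rank],
        # LABEL may be null; Rank 2 and 3 lines are labeled where scale permits.
        "label_text": label or None,
        "label_expected": rank in (2, 3),
    }
```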
Using the contents of the "COUNTRY1" and "COUNTRY2" fields in the generation of a line segment label is not required. The "STATUS" field contains the preferred description for the three LSIB line types when they are incorporated into a map legend but is otherwise not to be used for labeling. Use of the “CC1,” “CC1_GENC3,” “CC2,” “CC2_GENC3,” “RANK,” or “NOTES” fields for cartographic labeling purposes is prohibited.
Extension Attributes
Certain elements of the attributes within the LSIB dataset extend data functionality to make the data more interoperable or to provide clearer linkages to other datasets. The fields “CC1_GENC3” and “CC2_GENC3” contain the corresponding three-character GENC code to the “CC1” and “CC2” attributes. The code “QX2” is the three-character counterpart of the code “Q2,” which denotes a line in the LSIB representing a boundary associated with a geographic area not contained within the GENC standard. To allow for linkage between individual lines in the LSIB and World Polygons dataset, the “CC1_WPID” and “CC2_WPID” fields contain a Universally Unique Identifier (UUID), version 4, which provides a stable description of each geographic entity in a boundary pair relationship. Each UUID corresponds to a geographic entity listed in the World Polygons dataset. These fields allow for linkage between individual lines in the LSIB and the overall World Polygons dataset. Five additional fields in the LSIB expand on the UUID concept and either describe features that have changed across space and time or indicate relationships between previous versions of the feature. The “LSIB_ID” attribute is a UUID value that defines a specific instance of a feature. Any change to the feature in a lineset requires a new “LSIB_ID.” The “ANTECIDS,” or antecedent ID, is a UUID that references line geometries from which a given line is descended in time. It is used when there is a feature that is entirely new, not when there is a new version of a previous feature.
This is generally used to reference countries that have dissolved. The “PREVIDS,” or Previous ID, is a UUID field that contains old versions of a line. This is an additive field, that houses all Previous IDs. A new version of a feature is defined by any change to the
Census data reveals that population density varies noticeably from area to area. Small area census data do a better job depicting where the crowded neighborhoods are. In this map, the yellow areas of highest density range from 30,000 to 150,000 persons per square kilometer. In those areas, if the people were spread out evenly across the area, there would be just 4 to 9 meters between them. Very high density areas exceed 7,000 persons per square kilometer. High density areas exceed 5,200 persons per square kilometer. The last categories break at 3,330 persons per square kilometer, and 1,500 persons per square kilometer.
This dataset is comprised of multiple sources. All of the demographic data are from Michael Bauer Research with the exception of the following countries:
Australia: Esri Australia and MapData Services
Canada: Esri Canada and Environics
France: Esri France
Germany: Esri Germany and Nexiga
India: Esri India and Indicus
Japan: Esri Japan
South Korea: Esri Korea and OPENmate
Spain: Esri España and AIS
United States: Esri Demographics
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides values for GOLD RESERVES reported in several countries. The data includes current values, previous releases, historical highs and record lows, release frequency, reported unit and currency.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides values for GDP reported in several countries. The data includes current values, previous releases, historical highs and record lows, release frequency, reported unit and currency.
Public domain: https://en.wikipedia.org/wiki/Public_domain
Country codes: ISO 2, ISO 3, UN, LANG, LABEL (EN, FR, SP)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is adapted from raw data with fully anonymized results on the State Examination of Dutch as a Second Language. This exam is officially administered by the Board of Tests and Examinations (College voor Toetsen en Examens, or CvTE). See cvte.nl/about-cvte. The Board of Tests and Examinations is mandated by the Dutch government.
The article accompanying the dataset:
Schepens, Job, Roeland van Hout, and T. Florian Jaeger. “Big Data Suggest Strong Constraints of Linguistic Similarity on Adult Language Learning.” Cognition 194 (January 1, 2020): 104056. https://doi.org/10.1016/j.cognition.2019.104056.
Every row in the dataset represents the first official testing score of a unique learner. The columns contain the following information as based on questionnaires filled in at the time of the exam:
"L1" - The first language of the learner
"C" - The country of birth
"L1L2" - The combination of first and best additional language besides Dutch
"L2" - The best additional language besides Dutch
"AaA" - Age at Arrival in the Netherlands in years (starting date of residence)
"LoR" - Length of residence in the Netherlands in years
"Edu.day" - Duration of daily education (1 low, 2 middle, 3 high, 4 very high). From 1992 until 2006, learners' education was measured by means of a side-by-side matrix question in a learner's questionnaire. Learners were asked to mark which type of education they had had (elementary, secondary, or tertiary schooling) by filling in how many years they had been enrolled, in which country, and whether or not they had graduated. Based on this information we were able to estimate how many years learners had had education on a daily basis from six years of age onwards. Since 2006, the question about learners' education has been altered and it is asked directly how many years learners have had formal education on a daily basis from six years of age onwards. Possible answering categories are: 1) 0 thru 5 years; 2) 6 thru 10 years; 3) 11 thru 15 years; 4) 16 years or more. The answers have been merged into the categorical answer.
"Sex" - Gender
"Family" - Language Family
"ISO639.3" - Language ID code according to Ethnologue
"Enroll" - Proportion of school-aged youth enrolled in secondary education according to the World Bank. The World Bank reports on education data in a wide number of countries around the world on a regular basis. We took the gross enrollment rate in secondary schooling per country in the year the learner arrived in the Netherlands as an indicator for a country's educational accessibility at the time learners left their country of origin.
"STEX_speaking_score" - The STEX test score for speaking proficiency.
"Dissimilarity_morphological" - Morphological similarity
"Dissimilarity_lexical" - Lexical similarity
"Dissimilarity_phonological_new_features" - Phonological similarity (in terms of new features)
"Dissimilarity_phonological_new_categories" - Phonological similarity (in terms of new sounds)
A few rows of the data:
"L1","C","L1L2","L2","AaA","LoR","Edu.day","Sex","Family","ISO639.3","Enroll","STEX_speaking_score","Dissimilarity_morphological","Dissimilarity_lexical","Dissimilarity_phonological_new_features","Dissimilarity_phonological_new_categories"
"English","UnitedStates","EnglishMonolingual","Monolingual",34,0,4,"Female","Indo-European","eng ",94,541,0.0094,0.083191,11,19
"English","UnitedStates","EnglishGerman","German",25,16,3,"Female","Indo-European","eng ",94,603,0.0094,0.083191,11,19
"English","UnitedStates","EnglishFrench","French",32,3,4,"Male","Indo-European","eng ",94,562,0.0094,0.083191,11,19
"English","UnitedStates","EnglishSpanish","Spanish",27,8,4,"Male","Indo-European","eng ",94,537,0.0094,0.083191,11,19
"English","UnitedStates","EnglishMonolingual","Monolingual",47,5,3,"Male","Indo-European","eng ",94,505,0.0094,0.083191,11,19
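As a small illustration of working with these rows, the sketch below parses three of the sample rows shown above with pandas. The variable names are illustrative; note that the language code in the sample carries a trailing space ("eng "), which the sketch strips before use.

```python
import io

import pandas as pd

COLUMNS = [
    "L1", "C", "L1L2", "L2", "AaA", "LoR", "Edu.day", "Sex", "Family",
    "ISO639.3", "Enroll", "STEX_speaking_score",
    "Dissimilarity_morphological", "Dissimilarity_lexical",
    "Dissimilarity_phonological_new_features",
    "Dissimilarity_phonological_new_categories",
]
HEADER = ",".join(f'"{name}"' for name in COLUMNS)

# Three of the sample rows quoted in the dataset description.
ROWS = [
    '"English","UnitedStates","EnglishMonolingual","Monolingual",34,0,4,"Female","Indo-European","eng ",94,541,0.0094,0.083191,11,19',
    '"English","UnitedStates","EnglishGerman","German",25,16,3,"Female","Indo-European","eng ",94,603,0.0094,0.083191,11,19',
    '"English","UnitedStates","EnglishMonolingual","Monolingual",47,5,3,"Male","Indo-European","eng ",94,505,0.0094,0.083191,11,19',
]

df = pd.read_csv(io.StringIO("\n".join([HEADER] + ROWS)))

# The language code has trailing whitespace in the sample; strip it.
df["ISO639.3"] = df["ISO639.3"].str.strip()

# Mean speaking score per best additional language (L2).
mean_by_l2 = df.groupby("L2")["STEX_speaking_score"].mean()
```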
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The average for 2023 based on 193 countries was -0.07 points. The highest value was in Liechtenstein: 1.61 points and the lowest value was in Syria: -2.75 points. The indicator is available from 1996 to 2023. Below is a chart for all countries where data are available.
The current dataset is a subset of a large data collection based on a purpose-built survey conducted in seven middle-income countries in the Global South: Chile, Colombia, India, Kenya, Nigeria, Tanzania, South Africa and Vietnam. The variables collected in the present dataset aim to capture public preferences, an understanding of which is critical to any effort to reduce greenhouse gas emissions. There are many studies of public preferences regarding climate change in the Global North. However, survey work in low- and middle-income countries is limited. Survey work facilitating cross-country comparisons not using the major omnibus surveys is relatively rare.
We designed the Environment for Development (EfD) Seven-country Global South Climate Survey (the EfD Survey) which collected information on respondents’ knowledge about climate change, the information sources that respondents rely on, and opinions on climate policy. The EfD survey contains a battery of well-known climate knowledge questions and questions concerning the attention to and degree of trust in various sources for climate information. Respondents faced several ranking tasks using a best-worst elicitation format. This approach offers greater robustness to cultural differences in how questions are answered than the Likert-scale questions commonly asked in omnibus surveys. We examine: (a) priorities for spending in thirteen policy areas including climate and COVID-19, (b) how respiratory diseases due to air pollution rank relative to six other health problems, (c) agreement with ten statements characterizing various aspects of climate policies, and (d) prioritization of uses for carbon tax revenue. The company YouGov collected data for the EfD Survey in 2023 from 8400 respondents, 1200 in each country. It supplements an earlier survey wave (administered a year earlier) that focused on COVID-19. Respondents were drawn from YouGov’s online panels. During the COVID-19 pandemic almost all surveys were conducted online. This has advantages and disadvantages. Online survey administration reduces costs and data collection times and allows for experimental designs assigning different survey stimuli. With substantial incentive payments, high response rates within the sampling frame are achievable and such incentivized respondents are hopefully motivated to carefully answer the questions posed. The main disadvantage is that the sampling frame is comprised of the internet-enabled portion of the population in each country (e.g., with computers, mobile phones, and tablets). This sample systematically underrepresents those with lower incomes and living in rural areas. 
This large segment of the population is, however, of considerable interest in its own right due to its exposure to online media and outsized influence on public opinion.
The data includes respondents’ preferences for climate change mitigation policies and competing policy issues like health. The data also includes questions such as how respondents think revenues from carbon taxes should be used. The outcomes provide important information for policymakers to understand, evaluate, and shape national climate policies. It is worth noting that the data from Tanzania is only present in Wave 1 and that the data from Chile is only present in Wave 2.
Country scientific indicators developed from the information contained in the Scopus® database (Elsevier B.V.). These indicators can be used to assess and analyze scientific domains. Country rankings may be compared or analysed separately. Indicators offered for each country: H Index, Documents, Citations, Citation per Document and Citable Documents.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MGD: Music Genre Dataset
Over recent years, the world has seen a dramatic change in the way people consume music, moving from physical records to streaming services. Since 2017, such services have become the main source of revenue within the global recorded music market.
Therefore, this dataset is built by using data from Spotify. It provides a weekly chart of the 200 most streamed songs for each country and territory in which Spotify is present, as well as an aggregated global chart.
Considering that countries behave differently when it comes to musical tastes, we use chart data from global and regional markets from January 2017 to December 2019, considering eight of the top 10 music markets according to IFPI: United States (1st), Japan (2nd), United Kingdom (3rd), Germany (4th), France (5th), Canada (8th), Australia (9th), and Brazil (10th).
We also provide information about the hit songs and artists present in the charts, such as all collaborating artists within a song (since the charts only provide the main ones) and their respective genres, which is the core of this work. MGD also provides data about musical collaboration, as we build collaboration networks based on artist partnerships in hit songs. Therefore, this dataset contains:
This dataset was originally built for a conference paper at ISMIR 2020. If you make use of the dataset, please also cite the following paper:
Gabriel P. Oliveira, Mariana O. Silva, Danilo B. Seufitelli, Anisio Lacerda, and Mirella M. Moro. Detecting Collaboration Profiles in Success-based Music Genre Networks. In Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR 2020), 2020.
@inproceedings{ismir/OliveiraSSLM20,
title = {Detecting Collaboration Profiles in Success-based Music Genre Networks},
author = {Gabriel P. Oliveira and
Mariana O. Silva and
Danilo B. Seufitelli and
Anisio Lacerda and
Mirella M. Moro},
booktitle = {21st International Society for Music Information Retrieval Conference},
pages = {726--732},
year = {2020}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
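To make the per-unit convention concrete, here is a minimal sketch that converts a table from per-unit to megawatts using the 100 MW base stated above. The toy table and its column identifiers are hypothetical stand-ins for a real load or generator file.

```python
import pandas as pd

BASE_MW = 100.0               # base unit given in the dataset description
HOURS_PER_YEAR = 52 * 7 * 24  # 8736 hourly steps in a 364-day year


def per_unit_to_mw(table):
    """Convert a per-unit time-series table to megawatts."""
    return table * BASE_MW


# Hypothetical toy table with two element identifiers as column headers.
toy = pd.DataFrame({
    "1": [0.5] * HOURS_PER_YEAR,
    "2": [1.25] * HOURS_PER_YEAR,
})
toy_mw = per_unit_to_mw(toy)
```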
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
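One way to keep labels consistent is to build the three file names from a single (year, index) pair. The sketch below follows the naming pattern of the files quoted above; the helper name table_names is hypothetical and is not part of the dataset's own tooling.

```python
from itertools import product

YEARS = range(2016, 2021)  # reference years 2016 to 2020
INDICES = range(1, 5)      # indices 1 to 4


def table_names(year: int, index: int) -> dict:
    """Return the matching loads, gens, and lines file names for one label."""
    label = f"{year}_{index}"
    return {kind: f"{kind}_{label}.csv" for kind in ("loads", "gens", "lines")}


# One entry per synthetic year of the dataset.
all_tables = [table_names(year, index) for year, index in product(YEARS, INDICES)]
```

Reading the three files of one entry together guarantees that loads, generation, and line flows come from the same dispatch.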
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with:
CH_gens_series = pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
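Because the yearly series are periodic, a time window that runs past the last row of a table can wrap around to its first rows. The sketch below illustrates this with row positions taken modulo the table length; the helper name periodic_window and the toy table are hypothetical.

```python
import numpy as np
import pandas as pd


def periodic_window(table: pd.DataFrame, start: int, length: int) -> pd.DataFrame:
    """Return `length` consecutive rows starting at `start`, wrapping around the end."""
    positions = (start + np.arange(length)) % len(table)
    return table.iloc[positions].reset_index(drop=True)


# Toy stand-in for an 8736-row table: ten time steps of a single series.
toy = pd.DataFrame({"load": np.arange(10.0)})
window = periodic_window(toy, start=8, length=4)  # rows 8, 9, 0, 1
```

On a real table, a start of 8736 - 24 with a length of 48 would return the last day of the year followed by the first.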
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Netflix "Top 10" TV Shows and Films’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/dhruvildave/netflix-top-10-tv-shows-and-films on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Every Tuesday, Netflix publishes four global Top 10 lists for films and TV: Film (English), TV (English), Film (Non-English), and TV (Non-English). These lists rank titles based on weekly hours viewed: the total number of hours that members around the world watched each title from Monday to Sunday of the previous week.
Each season of a series and each film is considered on its own, so you might see both Stranger Things seasons 2 and 3 in the Top 10. Because titles sometimes move in and out of the Top 10, there is also the total number of weeks that a season of a series or film has spent on the list.
Netflix also publishes Top 10 lists for nearly 100 countries and territories (the same locations where there are Top 10 rows on Netflix). Country lists are also ranked based on hours viewed but don’t show country-level viewing directly.
Finally, Netflix provides a list of the Top 10 most popular Netflix films and TV (branded Netflix in any country) in each of the four categories based on the hours that each title was viewed during its first 28 days.
--- Original source retains full ownership of the source dataset ---
Finland was ranked the happiest country in the world, according to the World Happiness Report from 2025. The Nordic country scored 7.74 on a scale from 0 to 10. Two other Nordic countries, Denmark and Iceland, followed in second and third place, respectively. The World Happiness Report is a landmark survey of the state of global happiness that ranks countries by how happy their citizens perceive themselves to be.
Criticism
The index has received criticism from different perspectives. Some argue that it is impossible to measure general happiness in a country. Others argue that the index places too much emphasis on material well-being as well as freedom from oppression. As a result, the Happy Planet Index was introduced, which takes life expectancy, experienced well-being, inequality of outcomes, and ecological footprint into account. Here, Costa Rica was ranked as the happiest country in the world.
Afghanistan is the least happy country
Nevertheless, most people agree that high levels of poverty, lack of access to food and water, as well as a prevalence of conflict are factors hindering public happiness. Hence, it comes as no surprise that Afghanistan was ranked as the least happy country in the world in 2024. The South Asian country is ridden by poverty and undernourishment, and topped the Global Terrorism Index in 2024.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present the GLOBAL ROADKILL DATA, the largest worldwide compilation of roadkill data on terrestrial vertebrates. We outline the workflow (Fig. 1) to illustrate the sequential steps of the study, in which we merged local-scale survey datasets and opportunistic records into a unified large roadkill dataset comprising 208,570 roadkill records. These records include 2283 species and subspecies from 54 countries across six continents, ranging from 1971 to 2024.
Large roadkill datasets offer the advantage of preventing the collection of redundant data and are valuable resources for both local and macro-scale analyses regarding roadkill rates, road and landscape features associated with roadkill risk, species more vulnerable to road traffic, and populations at risk due to additional mortality. The standardization of data, such as scientific names, projection coordinates, and units, in a user-friendly format makes them readily accessible to a broader scientific and non-scientific community, including NGOs, consultants, public administration officials, and road managers. The open-access approach promotes collaboration among researchers and road practitioners, facilitating the replication of studies, validation of findings, and expansion of previous work. Moreover, researchers can utilize such datasets to develop new hypotheses, conduct meta-analyses, address pressing challenges more efficiently and strengthen the robustness of road ecology research. Ensuring widespread access to roadkill data fosters a more diverse and inclusive research community. This not only provides researchers in emerging economies with more data for analysis, but also cultivates a diverse array of perspectives and insights, promoting the advance of infrastructure ecology.
Methods
Information sources: A core team from different continents performed a systematic literature search in Web of Science and Google Scholar for published peer-reviewed papers and dissertations.
The search used the following terms: “roadkill*” OR “road-kill” OR “road mortality” AND (country) in English, Portuguese, Spanish, French and/or Mandarin. This initiative was also disseminated to the mailing lists associated with transport infrastructure: The CCSG Transport Working Group (WTG), Infrastructure & Ecology Network Europe (IENE) and Latin American & Caribbean Transport Working Group (LACTWG) (Fig. 1). The core team identified 750 scientific papers and dissertations with information on roadkill and contacted the first authors of the publications to request georeferenced locations of roadkill and offer co-authorship to this data paper. Of the 824 authors contacted, 145 agreed to share georeferenced roadkill locations, often involving additional colleagues who contributed to data collection. Since our main goal was to provide open access to data that had never been shared in this format before, data from citizen science projects (e.g., globalroakill.net) that are already available were not included.
Data compilation: A total of 423 co-authors compiled the following information: continent, country, latitude and longitude in WGS 84 decimal degrees of the roadkill, coordinates uncertainty, class, order, family, scientific name of the roadkill, vernacular name, IUCN status, number of roadkill, year, month, and day of the record, identification of the road, type of road, survey type, references, and observers that recorded the roadkill (Supplementary Information Table S1 - description of the fields and Table S2 - reference list). When roadkill data were derived from systematic surveys, the dataset included additional information on road length that was surveyed, latitude and longitude of the road (initial and final part of the road segment), survey period, start year of the survey, final year of the survey, 1st month of the year surveyed, last month of the year surveyed, and frequency of the survey. We consolidated 142 valid datasets into a single dataset.
We complemented this data with OccurenceID (a UUID generated using Java code), basisOfRecord, countryCode, locality using OpenStreetMap’s API (https://www.openstreetmap.org), geodeticDatum, verbatimScientificName, Kingdom, phylum, genus, specificEpithet, infraspecificEpithet, acceptedNameUsage, scientific name authorship, matchType, taxonRank using Darwin Core Reference Guide (https://dwc.tdwg.org/terms/#dwc:coordinateUncertaintyInMeters) and link of the associatedReference (URL).
Data standardization - We conducted a clustering analysis on all text fields to identify similar entries with minor variations, such as typos, and corrected them using OpenRefine (http://openrefine.org). We also standardized all date values using OpenRefine. Coordinate uncertainties listed as 0 m were adjusted to either 30 m or 100 m, depending on whether they were recorded after or before 2000, respectively, following the recommendation in the Darwin Core Reference Guide (https://dwc.tdwg.org/terms/#dwc:coordinateUncertaintyInMeters).
Taxonomy - We cross-referenced all species names with the Global Biodiversity Information Facility (GBIF) Backbone Taxonomy using Java and GBIF’s API (https://doi.org/10.15468/39omei). This process aimed to rectify classification errors, include additional fields such as Kingdom, Phylum, and scientific authorship, and gather comprehensive taxonomic information to address any gap within the datasets. For species not automatically matched (matchType - Table S1), we manually searched for correct synonyms when available.
Species conservation status - Using the species names, we retrieved their conservation status and also vernacular names by cross-referencing with the database downloaded from the IUCN Red List of Threatened Species (https://www.iucnredlist.org). Species without a match were categorized as "Not Evaluated".
Data Records
GLOBAL ROADKILL DATA is available at Figshare [27]: https://doi.org/10.6084/m9.figshare.25714233.
The dataset incorporates opportunistic (collected incidentally without data collection efforts) and systematic data (collected through planned, structured, and controlled methods designed to ensure consistency and reliability). In total, it comprises 208,570 roadkill records across 177,428 different locations (Fig. 2). Data were collected from the road network of 54 countries from 6 continents: Europe (n = 19), Asia (n = 16), South America (n = 7), North America (n = 4), Africa (n = 6) and Oceania (n = 2).
All data are georeferenced in WGS84 decimal degrees with a maximum uncertainty of 5000 m. Approximately 92% of records have a location uncertainty of 30 m or less, with only 1138 records having location uncertainties ranging from 1000 to 5000 m. Mammals have the highest number of roadkill records (61%), followed by amphibians (21%), reptiles (10%) and birds (8%). The species with the highest number of records were roe deer (Capreolus capreolus, n = 44,268), pool frog (Pelophylax lessonae, n = 11,999) and European fallow deer (Dama dama, n = 7,426).
We collected information on 126 threatened species with a total of 4570 records. Among the threatened species, the giant anteater (Myrmecophaga tridactyla, VULNERABLE) has the highest number of records (n = 1199), followed by the common fire salamander (Salamandra salamandra, VULNERABLE, n = 1043), and European rabbit (Oryctolagus cuniculus, ENDANGERED, n = 440). Records ranged from 1971 to 2024, with 72% of the roadkill recorded since 2013. Over 46% of the records were obtained from systematic surveys, with road length and survey period averaging, respectively, 66 km (min-max: 0.09-855 km) and 780 days (1-25,720 days).
Technical Validation
We employed the OpenStreetMap API through Java to detect location inaccuracies, and validate whether the geographic coordinates aligned with the specified country.
We calculated the distance of each occurrence to the nearest road using the GRIP global roads database [28], ensuring that all records were within the defined coordinate uncertainty. We verified if the survey duration matched the provided initial and final survey dates. We calculated the distance between the provided initial and final road coordinates and cross-checked it with the given road length. We identified and merged duplicate entries within the same dataset (same location, species, and date), aggregating the number of roadkills for each occurrence.
Usage Notes
The GLOBAL ROADKILL DATA is a compilation of roadkill records and was designed to serve as a valuable resource for a wide range of analyses. Nevertheless, to prevent the generation of meaningless results, users should be aware of the following limitations:
- Geographic representation - There is an evident bias in the distribution of records. Data originated predominantly from Europe (60% of records), South America (22%), and North America (12%). Conversely, there is a notable lack of records from Asia (5%), Oceania (1%) and Africa (0.3%). This dataset represents 36% of the initial contacts that provided geo-referenced records, which may not necessarily correspond to locations where high-impact roads are present.
- Location accuracy - Insufficient location accuracy was observed for 1% of the data (ranging from 1000 to 5000 m), which was associated with various factors, such as survey methods, recording practices, or timing of the survey.
- Sampling effort - This dataset comprised both opportunistic data and records from systematic surveys, with a high variability in survey duration and frequency.
As a result, the use of both opportunistic and systematic surveys may affect the relative abundance of roadkill, making it hard to make sound comparisons among species or areas.
- Detectability and carcass removal bias - Although several studies had a high frequency of road surveys, the duration of carcass persistence on roads may vary with species size and environmental conditions, affecting detectability. Accordingly, several approaches account for survey frequency and target species to estimate more
The global gender gap index benchmarks national gender gaps on economic, political, education, and health-based criteria. In 2025, the country offering the most gender-equal conditions was Iceland, with a score of 0.93. Overall, the Nordic countries make up 3 of the 5 most gender-equal countries in the world. The Nordic countries are known for their high levels of gender equality, including high female employment rates and evenly divided parental leave.
Sudan is the second-least gender equal country
Pakistan is found on the other end of the scale, ranked as the least gender-equal country in the world. Conditions for civilians in Sudan, a North African country, have worsened significantly after a civil war broke out in April 2023. Girls and women in particular are suffering and have become victims of sexual violence. Moreover, nearly 9 million people are estimated to be at acute risk of famine.
The Middle East and North Africa has the largest gender gap
Looking at the different world regions, the Middle East and North Africa has the largest gender gap as of 2023, just ahead of South Asia. Moreover, it is estimated that it will take another 152 years before the gender gap in the Middle East and North Africa is closed. On the other hand, Europe has the lowest gender gap in the world.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides values for MANUFACTURING PMI reported in several countries. The data includes current values, previous releases, historical highs and record lows, release frequency, reported unit and currency.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides values for GDP PER CAPITA PPP reported in several countries. The data includes current values, previous releases, historical highs and record lows, release frequency, reported unit and currency.
The United States Census Bureau’s international dataset provides estimates of country populations since 1950 and projections through 2050. Specifically, the dataset includes midyear population figures broken down by age and gender assignment at birth. Additionally, time-series data is provided for attributes including fertility rates, birth rates, death rates, and migration rates.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.census_bureau_international.
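A minimal sketch of querying one of these tables from Python is shown below. It assumes the google-cloud-bigquery package and valid credentials are available in the environment; the helper name run_query and the simplified query are illustrative and not part of the original description.

```python
def run_query(client, sql):
    """Run `sql` with a BigQuery client and return the result as a pandas DataFrame."""
    return client.query(sql).to_dataframe()


# Simplified illustration: ten highest life expectancies in 2016, no area filter.
SQL = """
SELECT country_name, life_expectancy
FROM `bigquery-public-data.census_bureau_international.mortality_life_expectancy`
WHERE year = 2016
ORDER BY life_expectancy DESC
LIMIT 10
"""

# Usage (assumes google-cloud-bigquery is installed and credentials are configured):
#   from google.cloud import bigquery
#   top_countries = run_query(bigquery.Client(), SQL)
```

Passing the client in as an argument keeps the helper independent of how authentication is set up in a given environment.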
What countries have the longest life expectancy? In this query, 2016 census information is retrieved by joining the mortality_life_expectancy and country_names_area tables for countries larger than 25,000 km2. Without the size constraint, Monaco is the top result with an average life expectancy of over 89 years!
SELECT
age.country_name,
age.life_expectancy,
size.country_area
FROM (
SELECT
country_name,
life_expectancy
FROM
`bigquery-public-data.census_bureau_international.mortality_life_expectancy`
WHERE
year = 2016) age
INNER JOIN (
SELECT
country_name,
country_area
FROM
`bigquery-public-data.census_bureau_international.country_names_area`
WHERE
country_area > 25000) size
ON
age.country_name = size.country_name
ORDER BY
2 DESC
/* Limit removed for Data Studio Visualization */
LIMIT
10
Which countries have the largest proportion of their population under 25? Over 40% of the world’s population is under 25 and greater than 50% of the world’s population is under 30! This query retrieves the countries with the largest proportion of young people by joining the age-specific population table with the midyear (total) population table.
SELECT
age.country_name,
SUM(age.population) AS under_25,
pop.midyear_population AS total,
ROUND((SUM(age.population) / pop.midyear_population) * 100,2) AS pct_under_25
FROM (
SELECT
country_name,
population,
country_code
FROM
`bigquery-public-data.census_bureau_international.midyear_population_agespecific`
WHERE
year = 2017
AND age < 25) age
INNER JOIN (
SELECT
midyear_population,
country_code
FROM
`bigquery-public-data.census_bureau_international.midyear_population`
WHERE
year = 2017) pop
ON
age.country_code = pop.country_code
GROUP BY
1,
3
ORDER BY
4 DESC /* Remove limit for visualization*/
LIMIT
10
The International Census dataset contains growth information in the form of birth rates, death rates, and migration rates. Net migration is the net number of migrants per 1,000 population, an important component of total population and one that often drives the work of the United Nations Refugee Agency. This query joins the growth rate table with the area table to retrieve 2017 data for countries greater than 500 km2.
SELECT
growth.country_name,
growth.net_migration,
CAST(area.country_area AS INT64) AS country_area
FROM (
SELECT
country_name,
net_migration,
country_code
FROM
`bigquery-public-data.census_bureau_international.birth_death_growth_rates`
WHERE
year = 2017) growth
INNER JOIN (
SELECT
country_area,
country_code
FROM
`bigquery-public-data.census_bureau_international.country_names_area`
WHERE
country_area > 500) area
ON
growth.country_code = area.country_code
Historic (none)
United States Census Bureau
Terms of use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
See the GCP Marketplace listing for more details and sample queries: https://console.cloud.google.com/marketplace/details/united-states-census-bureau/international-census-data