34 datasets found
  1. Geonames - All Cities with a population > 1000

    • public.opendatasoft.com
    • data.smartidf.services
    • +2 more
    csv, excel, geojson +1
    Updated Mar 10, 2024
    + more versions
    Cite
    (2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
    Explore at:
    Available download formats: csv, json, geojson, excel
    Dataset updated
    Mar 10, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All cities with a population > 1000 or seats of administrative divisions (ca. 80,000). Sources and contributions: GeoNames aggregates over a hundred different data sources. Ambassadors: GeoNames ambassadors help in many countries. Wiki: a wiki allows viewing the data and quickly fixing errors and adding missing places. Donations and sponsoring: costs for running GeoNames are covered by donations and sponsoring. Enrichment: country name added.

  2. World Bank Unemployment Data (1991-2017)

    • kaggle.com
    Updated Mar 8, 2018
    Cite
    Uddipta@IIT_Hyd (2018). World Bank Unemployment Data (1991-2017) [Dataset]. https://www.kaggle.com/uddipta/world-bank-unemployment-data-19912017/activity
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 8, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Uddipta@IIT_Hyd
    License

    https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets

    Description

    Context

    This data contains the unemployment rates (1991-2017) of different countries, across different regions and income groups, as supplied by the World Bank.

    Content

    The dataset has the following features:
    1. Country -> country name
    2. Region -> the region of the country
    3. Income Group -> the income group to which the country belongs
    4. Special Notes -> any special notes about the country
    5. Years (1991-2017) -> the unemployment rate for each year
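    The wide layout described above (one column per year) is often easier to analyze after reshaping to long format. A minimal pandas sketch, using fabricated rows and column names taken from the feature list:

```python
import pandas as pd

# Fabricated rows mimicking the described layout: one column per year.
# Column names follow the feature list above; the values are invented.
wide = pd.DataFrame({
    "Country": ["Atlantis", "Borduria"],
    "Region": ["Oceania", "Europe"],
    "Income Group": ["High income", "Upper middle income"],
    "1991": [5.2, 9.1],
    "1992": [5.4, 8.7],
})

# Melt the year columns into a single (Year, Unemployment Rate) pair per row.
long = wide.melt(
    id_vars=["Country", "Region", "Income Group"],
    var_name="Year",
    value_name="Unemployment Rate",
)
print(long)
```

    One row per country-year observation makes grouping and plotting by region or income group straightforward.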

    Acknowledgements

    The main source of this dataset is the World Bank. I have just combined important features from multiple files into a single data file.

  3. CORESIDENCE_GLAD: The Global Living Arrangements Database, 1960-2021

    • zenodo.org
    bin, csv
    Updated Mar 17, 2025
    Cite
    Juan Galeano; Juan Galeano; Albert Esteve; Albert Esteve (2025). CORESIDENCE_GLAD: The Global Living Arrangements Database, 1960-2021 [Dataset]. http://doi.org/10.5281/zenodo.15038210
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juan Galeano; Juan Galeano; Albert Esteve; Albert Esteve
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Global Living Arrangements Database (GLAD) is a global resource designed to fill a critical gap in the availability of statistical information for examining patterns and changes in living arrangements by age, sex, marital status and educational attainment. Utilizing comprehensive census microdata from IPUMS International and the European Labour Force Survey (EU-LFS), GLAD summarizes over 740 million individual records across 107 countries, covering the period from 1960 to 2021. This database has been constructed using an innovative algorithm that reconstructs kinship relationships among all household members, providing a robust and scalable methodology for studying living arrangements. GLAD is expected to be a valuable resource for both researchers and policymakers, supporting evidence-based decision-making in areas such as housing, social services, and healthcare, as well as offering insights into long-term transformations in family structures. The open-source R code used in this project is publicly available, promoting transparency and enabling the creation of new ego-centred typologies based on interfamily relationships.

    The repository is composed of the following elements: an Rda file named CORESIDENCE_GLAD_2025.Rda in the form of a List. In R, a List object is a versatile data structure that can contain a collection of different data types, including vectors, matrices, data frames, other lists, spatial objects or even functions. It allows storing and organizing heterogeneous data elements within a single object. The CORESIDENCE_GLAD_2025 R-list object is composed of seven elements:

    1. SINGLE AGES: a data frame where data is aggregated by single ages, marital status, educational attainment and living arrangement types. Source of the original data: IPUMS-I
    2. AGE GROUPS IPUMS: a data frame where data is aggregated by five-year age groups, marital status, educational attainment and living arrangement types. Source of the original data: IPUMS-I.
    3. AGE GROUPS LFS: a data frame where data is aggregated by five-year age groups, marital status, educational attainment and living arrangement types. Source of the original data: EU-LFS.
    4. HARMONIZED: a data frame where data is aggregated by five-year age groups, marital status, educational attainment and living arrangement types. The categories of marital status and educational attainment have been harmonized between the two data sources. Source of the original data: IPUMS-I and EU-LFS
    5. CODEBOOK: a data frame with the complete list of variables included, their names, descriptions and categories.
    6. LABELS LAT: an R function to add the qualitative labels to Living Arrangement Types (LAT).
    7. ATLAS LIVING ARRANGEMENTS: The url of the folder with leaflet of living arrangements for each sample included in GLAD.
  4. The World English Bible

    • kaggle.com
    Updated Feb 27, 2018
    Cite
    Kyubyong Park (2018). The World English Bible [Dataset]. https://www.kaggle.com/datasets/bryanpark/the-world-english-bible-speech-dataset/discussion?sortBy=hot
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 27, 2018
    Dataset provided by
    Kaggle
    Authors
    Kyubyong Park
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    The World English Bible is a public domain update of the American Standard Version of 1901 into modern English. Its audio recordings are freely available at http://www.audiotreasure.com/. The only problem when you use those in speech-relevant tasks is that each file is too long. That's why I split each audio file such that an audio clip is equivalent to a verse. Subsequently I aligned them to the text.

    Content

    This dataset is composed of the following:
    - README.md
    - wav files sampled at 12,000 Hz
    - transcript.txt.

    transcript.txt is in a tab-delimited format. The first column is the audio file paths. The second one is the script. Finally, the rightmost column is the duration of the audio file.
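    Given that tab-delimited layout, the transcript can be parsed with the standard csv module. A small sketch; the file names and durations below are fabricated for illustration, and the duration unit is assumed to be seconds:

```python
import csv
import io

# Two fabricated lines in the layout described above:
# audio file path <TAB> script <TAB> duration (seconds assumed).
sample = (
    "wav/verse_0001.wav\tIn the beginning God created the heavens and the earth.\t7.84\n"
    "wav/verse_0002.wav\tThe earth was formless and empty.\t4.12\n"
)

rows = []
for path, script, duration in csv.reader(io.StringIO(sample), delimiter="\t"):
    rows.append((path, script, float(duration)))

print(len(rows), rows[0][0])
```

    For the real file, replace the StringIO with `open("transcript.txt", newline="")`.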

    Acknowledgements

    I would like to show my respect to Dave, the host of www.audiotreasure.com and the reader of the audio files.

    Reference

    You may want to check my project using this dataset at https://github.com/Kyubyong/tacotron.

  5. Overcrowding rate by age group - population without single-person households...

    • service.tib.eu
    • opendata.marche.camcom.it
    • +3 more
    Updated Jan 8, 2025
    + more versions
    Cite
    (2025). Overcrowding rate by age group - population without single-person households - EU-SILC survey [Dataset]. https://service.tib.eu/ldmservice/dataset/eurostat_ub4yxjwglcni8bf3erimq
    Explore at:
    Dataset updated
    Jan 8, 2025
    Description

    This indicator is defined as the percentage of the population living in an overcrowded household (excluding single-person households). A person is considered to be living in an overcrowded household if the household does not have at its disposal a minimum number of rooms equal to:
    - one room for the household;
    - one room per couple in the household;
    - one room for each single person aged 18 or more;
    - one room per pair of single people of the same sex between 12 and 17 years of age;
    - one room for each single person between 12 and 17 years of age not included in the previous category;
    - one room per pair of children under 12 years of age.
    The indicator is presented by age group.
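    The room requirement above is mechanical enough to sketch as a function. This is an illustrative interpretation, not Eurostat's implementation: the caller is assumed to have already paired the 12-17-year-olds by sex, and a lone child under 12 is counted as still needing a room (pairs rounded up).

```python
def required_rooms(n_couples, singles_18plus, pairs_12_17_same_sex,
                   singles_12_17_unpaired, n_children_under_12):
    """Minimum rooms a household needs under the overcrowding rule above.

    Illustrative only: pairing of 12-17-year-olds by sex is assumed to be
    done by the caller; under-12 children are grouped in pairs, rounded up.
    """
    rooms = 1                                # one room for the household
    rooms += n_couples                       # one per couple
    rooms += singles_18plus                  # one per single person aged 18+
    rooms += pairs_12_17_same_sex            # one per same-sex pair aged 12-17
    rooms += singles_12_17_unpaired          # one per remaining 12-17-year-old
    rooms += -(-n_children_under_12 // 2)    # one per pair of under-12s, rounded up
    return rooms

# A couple with one 8-year-old: 1 (household) + 1 (couple) + 1 (under-12) = 3 rooms
print(required_rooms(1, 0, 0, 0, 1))
```

    A household is then flagged as overcrowded when its actual room count falls below this minimum.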

  6. Global Dataset of Cyber Incidents V.1.2

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 3, 2024
    Cite
    European Repository of Cyber Incidents (EuRepoC) (2024). Global Dataset of Cyber Incidents V.1.2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7848940
    Explore at:
    Dataset updated
    May 3, 2024
    Dataset authored and provided by
    European Repository of Cyber Incidents (EuRepoC)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains data on 2889 cyber incidents between 01.01.2000 and 02.05.2024 using 60 variables, including the start date, names and categories of receivers along with names and categories of initiators. The database was compiled as part of the European Repository of Cyber Incidents (EuRepoC) project.

    EuRepoC gathers, codes, and analyses publicly available information from over 200 sources and 600 Twitter accounts daily to report on dynamic trends in the global, and particularly the European, cyber threat environment. For more information on the scope and data collection methodology see: https://eurepoc.eu/methodology. Codebook available here. Information about each file:

    Global Database (csv or xlsx): This file includes all variables coded for each incident, organised such that one row corresponds to one incident - our main unit of investigation. Where multiple codes are present for a single variable for a single incident, these are separated with semi-colons within the same cell.

    Receiver Dataset (csv): In this file, the data of affected entities and individuals (receivers) is restructured to facilitate analysis. Each cell contains only a single code, with the data "unpacked" across multiple rows. Thus, a single incident can span several rows, identifiable through the unique identifier assigned to each incident (incident_id).

    Attribution Dataset (csv): This file follows a similar approach to the receiver dataset. The attribution data is "unpacked" over several rows, allowing each cell to contain only one code. Here too, a single incident may occupy several rows, with the unique identifier enabling easy tracking of each incident (incident_id). In addition, some attributions may have multiple possible codes for one variable; these are also "unpacked" over several rows, with the attribution_id enabling tracking of each attribution.

    eurepoc_global_database_1.2 (json): This file contains the whole database in JSON format.
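    The semicolon-separated cells of the Global Database can be "unpacked" into the one-code-per-row form the receiver and attribution files use. A pandas sketch; the column name and values below are illustrative, not EuRepoC's actual variable names:

```python
import pandas as pd

# Toy "global database" rows: multiple codes per cell, separated by
# semicolons (the column name and codes here are invented for illustration).
db = pd.DataFrame({
    "incident_id": [1, 2],
    "receiver_category": ["State institutions; Critical infrastructure",
                          "Corporate targets"],
})

# Split each cell into a list of codes, then unpack one code per row.
# The incident_id column is repeated, so incidents remain traceable.
unpacked = (
    db.assign(receiver_category=db["receiver_category"].str.split(";"))
      .explode("receiver_category")
)
unpacked["receiver_category"] = unpacked["receiver_category"].str.strip()
print(unpacked)
```

    Incident 1 now occupies two rows, matching the structure described for the Receiver Dataset.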

  7. ERA5 hourly data on pressure levels from 1940 to present

    • cds.climate.copernicus.eu
    • cds-test-cci2.copernicus-climate.eu
    grib
    Updated Jul 14, 2025
    + more versions
    Cite
    ECMWF (2025). ERA5 hourly data on pressure levels from 1940 to present [Dataset]. http://doi.org/10.24381/cds.bd0915c6
    Explore at:
    Available download formats: grib
    Dataset updated
    Jul 14, 2025
    Dataset provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    Authors
    ECMWF
    License

    https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/cc-by/cc-by_f24dc630aa52ab8c52a0ac85c03bc35e0abc850b4d7453bdc083535b41d5a5c3.pdf

    Time period covered
    Jan 1, 1940 - Jul 8, 2025
    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis.

    Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called the analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution, to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations and, when going further back in time, to ingest improved versions of the original observations, all of which benefits the quality of the reanalysis product.

    ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

    ERA5 is updated daily with a latency of about 5 days. If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later. If this occurs, users are notified.

    The dataset presented here is a regridded subset of the full ERA5 dataset on native resolution. It is online on spinning disk, which should ensure fast and easy access, and it should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper-air fields) and single levels (atmospheric, ocean-wave and land-surface quantities). The present entry is "ERA5 hourly data on pressure levels from 1940 to present".
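    Subsets like this are typically fetched through the CDS API. A hedged sketch of a request: the parameter names follow the CDS web-form conventions and may differ on newer CDS deployments, and the actual retrieve call (commented out) requires a registered Copernicus account with an API key configured in ~/.cdsapirc.

```python
# Illustrative CDS API request for one pressure-level field.
# Field choices here (temperature at 500 hPa for one timestep) are examples.
request = {
    "product_type": "reanalysis",
    "variable": "temperature",
    "pressure_level": "500",
    "year": "2024",
    "month": "01",
    "day": "01",
    "time": "12:00",
    "format": "grib",
}

# Requires `pip install cdsapi` and CDS credentials; commented out so the
# sketch stays runnable offline:
# import cdsapi
# client = cdsapi.Client()
# client.retrieve("reanalysis-era5-pressure-levels", request, "era5_t500.grib")
print(sorted(request))
```

    Keeping requests to small time/level subsets avoids long queue times on the CDS.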

  8. Death Profiles by County

    • data.chhs.ca.gov
    • data.ca.gov
    • +3 more
    csv, zip
    Updated May 28, 2025
    + more versions
    Cite
    California Department of Public Health (2025). Death Profiles by County [Dataset]. https://data.chhs.ca.gov/dataset/death-profiles-by-county
    Explore at:
    Available download formats: csv(28125832), csv(60517511), csv(75015194), csv(60201673), csv(60676655), csv(74351424), csv(52019564), csv(60023260), csv(74689382), csv(51592721), csv(73906266), csv(15127221), csv(1128641), csv(5095), csv(11738570), zip, csv(74043128), csv(24235858), csv(74497014), csv(21575405)
    Dataset updated
    May 28, 2025
    Dataset authored and provided by
    California Department of Public Health
    Description

    This dataset contains counts of deaths for California counties based on information entered on death certificates. Final counts are derived from static data and include out-of-state deaths to California residents, whereas provisional counts are derived from incomplete and dynamic data. Provisional counts are based on the records available when the data was retrieved and may not represent all deaths that occurred during the time period. Deaths involving injuries from external or environmental forces, such as accidents, homicide and suicide, often require additional investigation that tends to delay certification of the cause and manner of death. This can result in significant under-reporting of these deaths in provisional data.

    The final data tables include both deaths that occurred in each California county regardless of the place of residence (by occurrence) and deaths to residents of each California county (by residence), whereas the provisional data table only includes deaths that occurred in each county regardless of the place of residence (by occurrence). The data are reported as totals, as well as stratified by age, gender, race-ethnicity, and death place type. Deaths due to all causes (ALL) and selected underlying cause of death categories are provided. See temporal coverage for more information on which combinations are available for which years.

    The cause of death categories are based solely on the underlying cause of death as coded by the International Classification of Diseases. The underlying cause of death is defined by the World Health Organization (WHO) as "the disease or injury which initiated the train of events leading directly to death, or the circumstances of the accident or violence which produced the fatal injury." It is a single value assigned to each death based on the details as entered on the death certificate. When more than one cause is listed, the order in which they are listed can affect which cause is coded as the underlying cause. This means that similar events could be coded with different underlying causes of death depending on variations in how they were entered. Consequently, while underlying cause of death provides a convenient comparison between cause of death categories, it may not capture the full impact of each cause of death as it does not always take into account all conditions contributing to the death.

  9. Global Health and Development (2012-2021)

    • kaggle.com
    Updated Nov 30, 2024
    Cite
    Martina Galasso (2024). Global Health and Development (2012-2021) [Dataset]. https://www.kaggle.com/datasets/martinagalasso/global-health-and-development-2012-2021/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Martina Galasso
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset provides a curated and comprehensive overview of global health, demographic, economic, and environmental metrics for 188 recognized countries over a period of 10 years (2012-2021). It was created by combining reliable data from the World Bank and the World Health Organization (WHO). Due to the absence of a single source containing all necessary indicators, over 60 datasets were analyzed, cleaned, and merged, prioritizing completeness and significance.

    The dataset includes 29 key indicators, ranging from life expectancy, population metrics, and economic factors to environmental conditions and health-related behaviors. Missing values were carefully handled, and only the most relevant data with substantial coverage were retained.

    This dataset is ideal for researchers, analysts, and policymakers interested in exploring relationships between economic development, health outcomes, and environmental factors at a global scale.

  10. Data from: DOO-RE: A dataset of ambient sensors in a meeting room for...

    • figshare.com
    zip
    Updated Feb 23, 2024
    Cite
    Hyunju Kim (2024). DOO-RE: A dataset of ambient sensors in a meeting room for activity recognition [Dataset]. http://doi.org/10.6084/m9.figshare.24558619.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 23, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Hyunju Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We release the DOO-RE dataset, which consists of data streams from 11 types of ambient sensors collected 24/7 from a real-world meeting room. Four types of ambient sensors, called environment-driven sensors, measure continuous state changes in the environment (e.g. sound); four types, called user-driven sensors, capture user state changes (e.g. motion); and the remaining three types, called actuator-driven sensors, check whether the attached actuators are active (e.g. projector on/off). The values of each sensor are automatically collected by IoT agents, each responsible for one sensor in our IoT system. A part of the collected sensor data stream representing a user activity is extracted as an activity episode in the DOO-RE dataset. Each episode's activity labels are annotated and validated by cross-checking and the consent of multiple annotators. A total of 9 activity types appear in the space: 3 based on single users and 6 based on groups (i.e. 2 or more people). As a result, DOO-RE is constructed from 696 labeled episodes of single and group activities from the meeting room. DOO-RE is a novel dataset created in a public space that captures the properties of a real-world environment and has the potential to be a good resource for developing powerful activity recognition approaches.

  11. ‘Vehicle Miles Traveled During Covid-19 Lock-Downs’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 4, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Vehicle Miles Traveled During Covid-19 Lock-Downs ’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-vehicle-miles-traveled-during-covid-19-lock-downs-636d/latest
    Explore at:
    Dataset updated
    Jan 4, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Vehicle Miles Traveled During Covid-19 Lock-Downs ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/vehicle-miles-travelede on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    **This data set was last updated 3:30 PM ET Monday, January 4, 2021. The last date of data in this dataset is December 31, 2020.**

    Overview

    Data shows that mobility declined nationally since states and localities began shelter-in-place strategies to stem the spread of COVID-19. The numbers began climbing as more people ventured out and traveled further from their homes, but in parallel with the rise of COVID-19 cases in July, travel declined again.

    This distribution contains county level data for vehicle miles traveled (VMT) from StreetLight Data, Inc, updated three times a week. This data offers a detailed look at estimates of how much people are moving around in each county.

    Data available has a two day lag - the most recent data is from two days prior to the update date. Going forward, this dataset will be updated by AP at 3:30pm ET on Monday, Wednesday and Friday each week.

    This data has been made available to members of AP’s Data Distribution Program. To inquire about access for your organization - publishers, researchers, corporations, etc. - please click Request Access in the upper right corner of the page or email kromano@ap.org. Be sure to include your contact information and use case.

    Findings

    • Nationally, data shows that vehicle travel in the US has doubled compared to the seven-day period ending April 13, which was the lowest VMT since the COVID-19 crisis began. In early December, travel reached a low not seen since May, with a small rise leading up to the Christmas holiday.
    • Average vehicle miles traveled continues to be below what would be expected without a pandemic - down 38% compared to January 2020. September 4 reported the largest single day estimate of vehicle miles traveled since March 14.
    • New Jersey, Michigan and New York are among the states with the largest relative uptick in travel at this point of the pandemic - they report almost two times the miles traveled compared to their lowest seven-day period. However, travel in New Jersey and New York is still much lower than expected without a pandemic. Other states such as New Mexico, Vermont and West Virginia have rebounded the least.

    About This Data

    The county level data is provided by StreetLight Data, Inc, a transportation analysis firm that measures travel patterns across the U.S.. The data is from their Vehicle Miles Traveled (VMT) Monitor which uses anonymized and aggregated data from smartphones and other GPS-enabled devices to provide county-by-county VMT metrics for more than 3,100 counties. The VMT Monitor provides an estimate of total vehicle miles travelled by residents of each county, each day since the COVID-19 crisis began (March 1, 2020), as well as a change from the baseline average daily VMT calculated for January 2020. Additional columns are calculations by AP.

    Included Data

    01_vmt_nation.csv - Data summarized to provide a nationwide look at vehicle miles traveled. Includes single day VMT across counties, daily percent change compared to January and seven day rolling averages to smooth out the trend lines over time.

    02_vmt_state.csv - Data summarized to provide a statewide look at vehicle miles traveled. Includes single day VMT across counties, daily percent change compared to January and seven day rolling averages to smooth out the trend lines over time.

    03_vmt_county.csv - Data providing a county level look at vehicle miles traveled. Includes VMT estimate, percent change compared to January and seven day rolling averages to smooth out the trend lines over time.
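    The seven-day rolling averages and percent-change-from-January columns described for these files can be reproduced with pandas. A sketch on a fabricated daily series with an assumed January baseline value:

```python
import pandas as pd

# Fabricated daily VMT series to illustrate the smoothing described above.
days = pd.date_range("2020-03-01", periods=10, freq="D")
vmt = pd.Series([100, 90, 80, 85, 70, 60, 65, 75, 80, 90],
                index=days, dtype=float)

baseline_jan = 110.0                       # assumed January 2020 daily average
seven_day = vmt.rolling(window=7).mean()   # 7-day rolling mean (first 6 are NaN)
pct_change = (vmt - baseline_jan) / baseline_jan * 100

print(round(seven_day.iloc[-1], 2), round(pct_change.iloc[-1], 1))
```

    The rolling window needs 7 observations before it produces a value, which is why the first six days of any smoothed column are empty.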

    Additional Data Queries

    * Filter for specific state - filters 02_vmt_state.csv daily data for specific state.

    * Filter counties by state - filters 03_vmt_county.csv daily data for counties in specific state.

    * Filter for specific county - filters 03_vmt_county.csv daily data for specific county.

    Interactive

    The AP has designed an interactive map to show the percent change in vehicle miles traveled by county since each county's lowest point during the pandemic.

    This dataset was created by Angeliki Kastanis and contains features such as Date At Low, Mean7 County Vmt At Low, County Name, County Fips, technical information and more.

    How to use this dataset

    • Analyze State Name in relation to Baseline Jan Vmt
    • Study the influence of Date At Low on Mean7 County Vmt At Low
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit Angeliki Kastanis


    --- Original source retains full ownership of the source dataset ---

  12. Global Dataset of Cyber Incidents

    • zenodo.org
    bin, csv, pdf, txt
    Updated Apr 1, 2025
    Cite
    Kerstin Zettl-Schabath; Kerstin Zettl-Schabath; Jakob Bund; Jakob Bund; Martin Müller; Martin Müller; Camille Borrett; Jonas Hemmelskamp; Jonas Hemmelskamp; Asaf Alibegovic; Enis Bajra; Alisa Jazxhi; Erik Kellenter; Annika Sachs; Callahan Shelley; Camille Borrett; Asaf Alibegovic; Enis Bajra; Alisa Jazxhi; Erik Kellenter; Annika Sachs; Callahan Shelley (2025). Global Dataset of Cyber Incidents [Dataset]. http://doi.org/10.5281/zenodo.14965395
    Explore at:
    Available download formats: pdf, bin, txt, csv
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    European Repository of Cyber Incidents
    Authors
    Kerstin Zettl-Schabath; Kerstin Zettl-Schabath; Jakob Bund; Jakob Bund; Martin Müller; Martin Müller; Camille Borrett; Jonas Hemmelskamp; Jonas Hemmelskamp; Asaf Alibegovic; Enis Bajra; Alisa Jazxhi; Erik Kellenter; Annika Sachs; Callahan Shelley; Camille Borrett; Asaf Alibegovic; Enis Bajra; Alisa Jazxhi; Erik Kellenter; Annika Sachs; Callahan Shelley
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The European Repository of Cyber Incidents (EuRepoC) is releasing the Global Dataset of Cyber Incidents in Version 1.3 as an extract of our backend database. This official release contains fully consolidated cyber incident data reviewed by our interdisciplinary experts in the fields of politics, law and technology across all 60 variables covered by the European Repository. Version 1.3 covers the years 2000 – 2024 entirely. The Global Dataset is meant for reliable, evidence-based analysis. If you require real-time data, please refer to the download option in our TableView or contact us for special requirements (including API access).

    The dataset now contains data on 3416 cyber incidents which started between 01.01.2000 and 31.12.2024. The European Repository of Cyber Incidents (EuRepoC) gathers, codes, and analyses publicly available information from over 220 sources and 600 Twitter accounts daily to report on dynamic trends in the global, and particularly the European, cyber threat environment.

    For more information on the scope and data collection methodology see: https://eurepoc.eu/methodology

    Full Codebook available here

    Information about each file

    Please scroll down this page to see all available files; Zenodo only displays the attribution dataset by default.

    Global Database (csv or xlsx):
    This file includes all variables coded for each incident, organised such that one row corresponds to one incident - our main unit of investigation. Where multiple codes are present for a single variable for a single incident, these are separated with semi-colons within the same cell.

    Receiver Dataset (csv or xlsx):
    In this file, the data of affected entities and individuals (receivers) is restructured to facilitate analysis. Each cell contains only a single code, with the data "unpacked" across multiple rows. Thus, a single incident can span several rows, identifiable through the unique identifier assigned to each incident (incident_id).

    Attribution Dataset (csv or xlsx):
    This file follows a similar approach to the receiver dataset. The attribution data is "unpacked" over several rows, allowing each cell to contain only one code. Here too, a single incident may occupy several rows, with the unique identifier enabling easy tracking of each incident (incident_id). In addition, some attributions may also have multiple possible codes for one variable, these are also "unpacked" over several rows, with the attribution_id enabling to track each attribution.

    Dyadic Dataset (csv or xlsx):
    The dyadic dataset puts state dyads in focus. Each row represents one cyber incident within a specific dyad. Because an incident may affect receivers in multiple countries, the same incident can appear in several rows, one per dyad.

  13. Overcrowding rate by poverty status - population without single-person households - EU-SILC survey

    • db.nomics.world
    • opendata.marche.camcom.it
    • +3more
    Updated Feb 2, 2022
    + more versions
    Cite
    DBnomics (2022). Overcrowding rate by poverty status - population without single-person households - EU-SILC survey [Dataset]. https://db.nomics.world/Eurostat/tessi178
    Explore at:
    Dataset updated
    Feb 2, 2022
    Dataset provided by
    Eurostathttps://ec.europa.eu/eurostat
    Authors
    DBnomics
    Description

    This indicator is defined as the percentage of the population living in an overcrowded household (excluding single-person households). A person is considered to live in an overcrowded household if the household does not have at its disposal a minimum number of rooms equal to:

    - one room for the household;
    - one room per couple in the household;
    - one room for each single person aged 18 or over;
    - one room per pair of single people of the same sex aged between 12 and 17;
    - one room for each single person aged between 12 and 17 not included in the previous category;
    - one room per pair of children under 12 years of age.

    The indicator is presented by poverty status.
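    The room requirement above is effectively a small algorithm. A minimal sketch follows; the function name and input encoding are my own, and 12-17-year-olds are assumed to be single:

    ```python
    import math

    def minimum_rooms(couples, single_adults, boys_12_17, girls_12_17, under_12):
        """Minimum number of rooms a household needs under the definition above.
        All arguments are head counts; `couples` counts pairs, not people."""
        rooms = 1                         # one room for the household
        rooms += couples                  # one room per couple
        rooms += single_adults            # one room per single person aged 18+
        # one room per same-sex pair aged 12-17; an unpaired teenager gets own room
        rooms += math.ceil(boys_12_17 / 2) + math.ceil(girls_12_17 / 2)
        rooms += math.ceil(under_12 / 2)  # one room per pair of children under 12
        return rooms

    def is_overcrowded(rooms_available, *household):
        return rooms_available < minimum_rooms(*household)
    ```

    For example, a couple with two children under 12 needs at least three rooms, so a two-room dwelling would count as overcrowded.
    
    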

  14. ‘List of Top Data Breaches (2004 - 2021)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 14, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘List of Top Data Breaches (2004 - 2021)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-list-of-top-data-breaches-2004-2021-e7ac/746cf4e2/?iid=002-608&v=presentation
    Explore at:
    Dataset updated
    Feb 14, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘List of Top Data Breaches (2004 - 2021)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/hishaamarmghan/list-of-top-data-breaches-2004-2021 on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    This is a dataset containing all the major data breaches in the world from 2004 to 2021.

    As we know, data privacy is a major issue. Many major companies in the world still face this problem every single day, and even with great security teams, many still suffer breaches. In order to tackle this situation, it is only right that we study the issue in depth, and therefore I pulled this data from Wikipedia to conduct data analysis. I would encourage others to take a look at this as well and find as many insights as possible.

    This data contains 5 columns:
    1. Entity: The name of the company, organization or institute
    2. Year: The year in which the data breach took place
    3. Records: How many records were compromised (can include information like emails, passwords etc.)
    4. Organization type: Which sector the organization belongs to
    5. Method: Was it hacked? Were the files lost? Was it an inside job?
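    Rows with these five columns lend themselves to simple aggregations, for instance totalling compromised records per breach method. The rows below are invented for illustration:

    ```python
    from collections import defaultdict

    # Hypothetical rows following the five columns described above.
    breaches = [
        {"Entity": "ExampleCorp", "Year": 2013, "Records": 3_000_000,
         "Organization type": "web", "Method": "hacked"},
        {"Entity": "SampleBank", "Year": 2013, "Records": 500_000,
         "Organization type": "financial", "Method": "lost"},
        {"Entity": "DemoHealth", "Year": 2015, "Records": 1_200_000,
         "Organization type": "healthcare", "Method": "hacked"},
    ]

    # Total number of compromised records per breach method.
    records_by_method = defaultdict(int)
    for b in breaches:
        records_by_method[b["Method"]] += b["Records"]
    ```
    
    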

    Here is the source for the dataset: https://en.wikipedia.org/wiki/List_of_data_breaches

    Here is the GitHub link for a guide on how it was scraped: https://github.com/hishaamarmghan/Data-Breaches-Scraping-Cleaning

    --- Original source retains full ownership of the source dataset ---

  15. CommitBench

    • zenodo.org
    csv, json
    Updated Feb 14, 2024
    Cite
    Maximilian Schall; Maximilian Schall; Tamara Czinczoll; Tamara Czinczoll; Gerard de Melo; Gerard de Melo (2024). CommitBench [Dataset]. http://doi.org/10.5281/zenodo.10497442
    Explore at:
    json, csvAvailable download formats
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Maximilian Schall; Maximilian Schall; Tamara Czinczoll; Tamara Czinczoll; Gerard de Melo; Gerard de Melo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Dec 15, 2023
    Description

    Data Statement for CommitBench

    - Dataset Title: CommitBench
    - Dataset Curator: Maximilian Schall, Tamara Czinczoll, Gerard de Melo
    - Dataset Version: 1.0, 15.12.2023
    - Data Statement Author: Maximilian Schall, Tamara Czinczoll
    - Data Statement Version: 1.0, 16.01.2023

    EXECUTIVE SUMMARY

    We provide CommitBench as an open-source, reproducible, privacy- and license-aware benchmark for commit message generation. The dataset is gathered from GitHub repositories with licenses that permit redistribution. We cover six programming languages: Java, Python, Go, JavaScript, PHP and Ruby. The commit messages in natural language are restricted to English, as it is the working language in many software development projects. The dataset has 1,664,590 examples that were generated by using extensive quality-focused filtering techniques (e.g. excluding bot commits). Additionally, we provide a version with longer sequences for benchmarking models with more extended sequence input, as well as a version with

    CURATION RATIONALE

    We created this dataset due to quality and legal issues with previous commit message generation datasets. Given a git diff displaying code changes between two file versions, the task is to predict the accompanying commit message describing these changes in natural language. We base our GitHub repository selection on that of a previous dataset, CodeSearchNet, but apply a large number of filtering techniques to improve the data quality and eliminate noise. Due to the original repository selection, we are also restricted to the aforementioned programming languages. It was important to us, however, to provide a number of programming languages, to accommodate any changes in the task due to the degree of hardware-relatedness of a language. The dataset is provided as a large CSV file containing all samples, with the following fields: Diff, Commit Message, Hash, Project, Split.
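    A minimal sketch of reading such a CSV and partitioning rows by the Split field. Only the field names come from the description above; the two sample rows are invented:

    ```python
    import csv
    import io

    # A tiny stand-in for the CommitBench CSV (values are invented).
    sample = io.StringIO(
        "Diff,Commit Message,Hash,Project,Split\n"
        '"- a\n+ b",Fix typo,abc123,org/repo,train\n'
        '"- x\n+ y",Add test,def456,org/repo,test\n'
    )

    # Group rows by their train/valid/test assignment.
    splits = {"train": [], "valid": [], "test": []}
    for row in csv.DictReader(sample):
        splits.setdefault(row["Split"], []).append(row)
    ```

    Note that the quoted Diff field may contain embedded newlines, which `csv.DictReader` handles as long as the field is quoted.
    
    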

    DOCUMENTATION FOR SOURCE DATASETS

    Repository selection based on CodeSearchNet, which can be found under https://github.com/github/CodeSearchNet

    LANGUAGE VARIETIES

    Since GitHub hosts software projects from all over the world, there is no single uniform variety of English used across all commit messages. This means that phrasing can be regional or subject to influences from the programmer's native language. It also means that different spelling conventions may co-exist and that different terms may be used for the same concept. Any model trained on this data should take these factors into account. For the number of samples per programming language, see the table below:

    Language    Number of Samples
    Java        153,119
    Ruby        233,710
    Go          137,998
    JavaScript  373,598
    Python      472,469
    PHP         294,394

    SPEAKER DEMOGRAPHIC

    Due to the extremely diverse (geographically, but also socio-economically) backgrounds of the software development community, there is no single demographic the data comes from. Of course, this does not entail that there are no biases when it comes to the data origin. Globally, the average software developer tends to be male and has obtained higher education. Due to the anonymous nature of GitHub profiles, gender distribution information cannot be extracted.

    ANNOTATOR DEMOGRAPHIC

    Due to the automated generation of the dataset, no annotators were used.

    SPEECH SITUATION AND CHARACTERISTICS

    The public nature and often business-related creation of the data by the original GitHub users fosters a more neutral, information-focused and formal language. As it is not uncommon for developers to find the writing of commit messages tedious, there can also be commit messages representing the frustration or boredom of the commit author. While our filtering is supposed to catch these types of messages, there can be some instances still in the dataset.

    PREPROCESSING AND DATA FORMATTING

    See paper for all preprocessing steps. We do not provide the un-processed raw data due to privacy concerns, but it can be obtained via CodeSearchNet or requested from the authors.

    CAPTURE QUALITY

    While our dataset is completely reproducible at the time of writing, there are external dependencies that could restrict this. If GitHub shuts down and someone with a software project in the dataset deletes their repository, there can be instances that are non-reproducible.

    LIMITATIONS

    While our filters are meant to ensure a high quality for each data sample in the dataset, we cannot ensure that only low-quality examples were removed. Similarly, we cannot guarantee that our extensive filtering methods catch all low-quality examples. Some might remain in the dataset. Another limitation of our dataset is the low number of programming languages (there are many more) as well as our focus on English commit messages. There might be some people that only write commit messages in their respective languages, e.g., because the organization they work at has established this or because they do not speak English (confidently enough). Perhaps some languages' syntax better aligns with that of programming languages. These effects cannot be investigated with CommitBench.

    Although we anonymize the data as far as possible, the required information for reproducibility, including the organization, project name, and project hash, makes it possible to refer back to the original authoring user account, since this information is freely available in the original repository on GitHub.

    METADATA

    License: Dataset under the CC BY-NC 4.0 license

    DISCLOSURES AND ETHICAL REVIEW

    While we put substantial effort into removing privacy-sensitive information, our solutions cannot find 100% of such cases. This means that researchers and anyone using the data need to incorporate their own safeguards to effectively reduce the amount of personal information that can be exposed.

    ABOUT THIS DOCUMENT

    A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.

    This data statement was written based on the template for the Data Statements Version 2 schema. The template was prepared by Angelina McMillan-Major, Emily M. Bender, and Batya Friedman and can be found at https://techpolicylab.uw.edu/data-statements/ and was updated from the community Version 1 Markdown template by Leon Dercyznski.

  16. Household Demographic Surveillance System, Cause-Specific Mortality 1992-2012 - World

    • catalog.ihsn.org
    • datacatalog.ihsn.org
    Updated Mar 29, 2019
    Cite
    Ali Sie (2019). Household Demographic Surveillance System, Cause-Specific Mortality 1992-2012 - World [Dataset]. https://catalog.ihsn.org/catalog/5541
    Explore at:
    Dataset updated
    Mar 29, 2019
    Dataset provided by
    Abba Bhuiya
    Sanjay Juvekar
    P. Kim Streatfield
    Ali Sie
    Abraham J. Herbst
    Nurul Alam
    Momodou Jasseh
    Abdramane Soura
    Bassirou Bonfoh
    Valérie Delaunay
    Amelia Crampin
    Abraham Oduro
    Marcel Tanner
    Shashi Kant
    Peter Byass
    Berhe Weldearegawi
    Stephen M. Tollman
    Frank O. Odhiambo
    Osman A. Sankoh
    Margaret Gyapong
    Siswanto Wilopo
    Nguyen T.K. Chuc
    Alex Ezeh
    Thomas N. Williams
    Wasif A. Khan
    Time period covered
    1992 - 2012
    Area covered
    World, World
    Description

    Abstract

    Cause of death data based on VA interviews were contributed by fourteen INDEPTH HDSS sites in sub-Saharan Africa and eight sites in Asia. The principles of the Network and its constituent population surveillance sites have been described elsewhere [1]. Each HDSS site is committed to long-term longitudinal surveillance of circumscribed populations, typically each covering around 50,000 to 100,000 people. Households are registered and visited regularly by lay field-workers, with a frequency varying from once per year to several times per year. All vital events are registered at each such visit, and any deaths recorded are followed up with verbal autopsy interviews, usually undertaken by specially trained lay interviewers. A few sites were already operational in the 1990s, but in this dataset 95% of the person-time observed related to the period from 2000 onwards, with 58% from 2007 onwards. Two sites, in Nairobi and Ouagadougou, followed urban populations, while the remainder covered areas that were generally more rural in character, although some included local urban centres. Sites covered entire populations, although the Karonga, Malawi, site only contributed VAs for deaths of people aged 12 years and older. Because the sites were not located or designed in a systematic way to be representative of national or regional populations, it is not meaningful to aggregate results over sites.

    All cause of death assignments in this dataset were made using the InterVA-4 model version 4.02 [2]. InterVA-4 uses probabilistic modelling to arrive at likely cause(s) of death for each VA case, the workings of the model being based on a combination of expert medical opinion and relevant available data. InterVA-4 is the only model currently available that processes VA data according to the WHO 2012 standard and categorises causes of death according to ICD-10. Since the VA data reported here were collected before the WHO 2012 standard was formulated, they were all retrospectively transformed into the WHO 2012 and InterVA-4 input format for processing.

    The InterVA-4 model was applied to the data from each site, yielding, for each case, up to three possible causes of death or an indeterminate result. Each cause for a case is a single record in the dataset. In a minority of cases, for example where symptoms were vague, contradictory or mutually inconsistent, it was impossible for InterVA-4 to determine a cause of death, and these deaths were attributed as entirely indeterminate. For the remaining cases, one to three likely causes and their likelihoods were assigned by InterVA-4, and if the sum of their likelihoods was less than one, the residual component was then assigned as being indeterminate. This was an important process for capturing uncertainty in cause of death outcome(s) from the model at the individual level, thus avoiding over-interpretation of specific causes. As a consequence there were three sources of unattributed cause of death: deaths registered for which VAs were not successfully completed; VAs completed but where the cause was entirely indeterminate; and residual components of deaths attributed as indeterminate.

    In this dataset each case has between one and four records, each with its own cause and likelihood. Cases for which VAs were not successfully completed have a single record with the cause of death recorded as “VA not completed” and a likelihood of one. Thus the overall sum of the likelihoods equates to the total number of deaths. Each record also contains a population weighting factor reflecting the ratio of the population fraction for its site, age group, sex and year to the corresponding age group and sex fraction in the standard population (see section on weighting).
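    The record structure described above (one row per death-cause pair, likelihoods summing to one within each death) can be aggregated into cause-specific mortality fractions by summing likelihoods. A sketch with invented records, ignoring the population weighting factor for simplicity:

    ```python
    from collections import defaultdict

    # Hypothetical records: 1-4 rows per death, likelihoods sum to 1.0 per death.
    records = [
        {"death_id": 1, "cause": "Malaria", "likelihood": 0.7},
        {"death_id": 1, "cause": "Indeterminate", "likelihood": 0.3},
        {"death_id": 2, "cause": "VA not completed", "likelihood": 1.0},
        {"death_id": 3, "cause": "Malaria", "likelihood": 0.5},
        {"death_id": 3, "cause": "HIV/AIDS", "likelihood": 0.4},
        {"death_id": 3, "cause": "Indeterminate", "likelihood": 0.1},
    ]

    total_deaths = len({r["death_id"] for r in records})

    # Unweighted cause-specific mortality fractions: summed likelihoods
    # per cause, divided by the number of deaths.
    csmf = defaultdict(float)
    for r in records:
        csmf[r["cause"]] += r["likelihood"] / total_deaths
    ```

    Because likelihoods sum to one within each death, the fractions across all causes (including the indeterminate and not-completed categories) sum to one.
    
    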

    In this context, all of these data are secondary datasets derived from primary data collected separately by each participating site. In all cases the primary data collection was covered by site-level ethical approvals relating to on-going demographic surveillance in those specific locations. No individual identity or household location data are included in this secondary data.

    1. Sankoh O, Byass P. The INDEPTH Network: filling vital gaps in global epidemiology. International Journal of Epidemiology 2012; 41:579-588.

    2. Byass P, Chandramohan D, Clark SJ, D’Ambruoso L, Fottrell E, Graham WJ, et al. Strengthening standardised interpretation of verbal autopsy data: the new InterVA-4 tool. Global Health Action 2012; 5:19281.

    Geographic coverage

    Demographic surveillance areas (countries from Africa, Asia and Oceania) of the following HDSSs:

    Code  Country        INDEPTH Centre
    BD011 Bangladesh     ICDDR-B : Matlab
    BD012 Bangladesh     ICDDR-B : Bandarban
    BD013 Bangladesh     ICDDR-B : Chakaria
    BD014 Bangladesh     ICDDR-B : AMK
    BF031 Burkina Faso   Nouna
    BF041 Burkina Faso   Ouagadougou
    CI011 Côte d'Ivoire  Taabo
    ET031 Ethiopia       Kilite Awlaelo
    GH011 Ghana          Navrongo
    GH031 Ghana          Dodowa
    GM011 The Gambia     Farafenni
    ID011 Indonesia      Purworejo
    IN011 India          Ballabgarh
    IN021 India          Vadu
    KE011 Kenya          Kilifi
    KE021 Kenya          Kisumu
    KE031 Kenya          Nairobi
    MW011 Malawi         Karonga
    SN011 Senegal        IRD : Bandafassi
    VN012 Vietnam        Hanoi Medical University : Filabavi
    ZA011 South Africa   Agincourt
    ZA031 South Africa   Africa Centre

    Analysis unit

    Death Cause

    Universe

    Surveillance population
    Deceased individuals
    Cause of death

    Kind of data

    Verbal autopsy-based cause of death data

    Frequency of data collection

    Rounds per year varies between sites from once to three times per year

    Sampling procedure

    No sampling, covers total population in demographic surveillance area

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The Verbal Autopsy Questionnaires used by the various sites differed, but in most cases they were a derivation from the original WHO Verbal Autopsy questionnaire.

    http://www.who.int/healthinfo/statistics/verbalautopsystandards/en/index1.html

    Cleaning operations

    One cause of death record was inserted for every death where a verbal autopsy was not conducted. The cause of death assigned in these cases is "XX VA not completed".

  17. ‘FIFA - Football World Cup Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘FIFA - Football World Cup Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-fifa-football-world-cup-dataset-2599/66e21fbf/?iid=018-912&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Analysis of ‘FIFA - Football World Cup Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/iamsouravbanerjee/fifa-football-world-cup-dataset on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    The FIFA World Cup, often simply called the World Cup, is an international association football competition contested by the senior men's national teams of the members of the Fédération Internationale de Football Association (FIFA), the sport's global governing body. The championship has been awarded every four years since the inaugural tournament in 1930, except in 1942 and 1946 when it was not held because of the Second World War. The current champion is France, which won its second title at the 2018 tournament in Russia.

    The current format involves a qualification phase, which takes place over the preceding three years, to determine which teams qualify for the tournament phase. In the tournament phase, 32 teams, including the automatically qualifying host nation(s), compete for the title at venues within the host nation(s) over about a month.

    The 21 World Cup tournaments have been won by eight national teams. Brazil have won five times, and they are the only team to have played in every tournament. The other World Cup winners are Germany and Italy, with four titles each; Argentina, France, and inaugural winner Uruguay, with two titles each; and England and Spain, with one title each.

    The World Cup is the most prestigious association football tournament in the world, as well as the most widely viewed and followed single sporting event in the world. The cumulative viewership of all matches of the 2006 World Cup was estimated to be 26.29 billion with an estimated 715.1 million people watching the final match, a ninth of the entire population of the planet.

    17 countries have hosted the World Cup. Brazil, France, Italy, Germany, and Mexico have each hosted twice, while Uruguay, Switzerland, Sweden, Chile, England, Argentina, Spain, the United States, Japan, and South Korea (jointly), South Africa, and Russia have each hosted once. Qatar will host the 2022 tournament, and 2026 will be jointly hosted by Canada, the United States, and Mexico, which will give Mexico the distinction of being the first country to host games in three World Cups.

    Content

    This Dataset consists of Records from all the previous Football World Cups (1930 to 2018)

    Acknowledgements

    For more, please visit - https://www.fifa.com/

    --- Original source retains full ownership of the source dataset ---

  18. The GDELT Project

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    The GDELT Project (2019). The GDELT Project [Dataset]. https://www.kaggle.com/datasets/gdelt/gdelt
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Feb 12, 2019
    Dataset authored and provided by
    The GDELT Project
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The GDELT Project is the largest, most comprehensive, and highest resolution open database of human society ever created. The 2015 data alone records nearly three quarters of a trillion emotional snapshots and more than 1.5 billion location references, while its total archives span more than 215 years, making it one of the largest open-access spatio-temporal datasets in existence and pushing the boundaries of "big data" study of global human society. Its Global Knowledge Graph connects the world's people, organizations, locations, themes, counts, images and emotions into a single holistic network over the entire planet. How can you query, explore, model, visualize, interact, and even forecast this vast archive of human society?

    Content

    GDELT 2.0 has a wealth of features in the event database which includes events reported in articles published in 65 live translated languages, measurements of 2,300 emotions and themes, high resolution views of the non-Western world, relevant imagery, videos, and social media embeds, quotes, names, amounts, and more.

    You may find these code books helpful:
    GDELT Global Knowledge Graph Codebook V2.1 (PDF)
    GDELT Event Codebook V2.0 (PDF)

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely manage analyzing large BigQuery datasets.
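    A minimal sketch of such a query with the official google-cloud-bigquery client. The table path below is a placeholder sample table, not a GDELT table; running the query itself requires Google Cloud credentials, so query construction is kept separate:

    ```python
    def build_query(table, limit=10):
        """Build a simple SELECT over a fully-qualified BigQuery table."""
        return f"SELECT * FROM `{table}` LIMIT {limit}"

    def run(table):
        # Requires: pip install google-cloud-bigquery, plus GCP credentials.
        from google.cloud import bigquery
        client = bigquery.Client()
        return client.query(build_query(table)).to_dataframe()

    # Placeholder public sample table; substitute the actual GDELT table name.
    query = build_query("bigquery-public-data.samples.shakespeare", limit=5)
    ```

    Keeping the SQL in a small helper also makes it easy to enforce a `LIMIT` everywhere, which matters when scanning tables of GDELT's size.
    
    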

    Acknowledgements

    You may redistribute, rehost, republish, and mirror any of the GDELT datasets in any form. However, any use or redistribution of the data must include a citation to the GDELT Project and a link to the website (https://www.gdeltproject.org/).

  19. Welsh Demographic Service Dataset (WDSD)

    • healthdatagateway.org
    unknown
    Updated Sep 16, 2024
    + more versions
    Cite
    Digital Health and Care Wales (DHCW) (2024). Welsh Demographic Service Dataset (WDSD) [Dataset]. https://healthdatagateway.org/en/dataset/359
    Explore at:
    unknownAvailable download formats
    Dataset updated
    Sep 16, 2024
    Dataset authored and provided by
    Digital Health and Care Wales (DHCW)
    License

    https://saildatabank.com/data/apply-to-work-with-the-data/https://saildatabank.com/data/apply-to-work-with-the-data/

    Description

    Administrative information about individuals in Wales that use NHS services; such as address and practice registration history. It replaced the NHS Wales Administrative Register (NHSAR) in 2009.

    Data drawn from GP practices via Exeter System.

    This dataset provides linkage from anonymous individuals to anonymous residences, thus enabling individuals to be grouped into households.

    The single views are now provisioned to new projects and described here; the metadata for the old three-view WDSD version can be found in a separate legacy metadata entry.

  20. Global Employer Dataset (Wikidata)

    • opendatabay.com
    .undefined
    Updated Jul 5, 2025
    + more versions
    Cite
    Datasimple (2025). Global Employer Dataset (Wikidata) [Dataset]. https://www.opendatabay.com/data/ai-ml/e31ecab8-d78b-4108-89df-7ea2d5d3e09e
    Explore at:
    .undefinedAvailable download formats
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    E-commerce & Online Transactions
    Description

    This dataset provides a curated and labeled subset of employer entries derived from Wikidata, with the goal of improving the quality and usability of employer data. While Wikidata is an invaluable open resource, direct use often necessitates cleaning. This dataset addresses that need by offering metadata, statistics, and labels to help users identify and utilise valid employer information. An employer is generally defined here as a company or entity that provides employment paying wages or a salary. The dataset specifically screens out entries that do not represent true employers, such as individuals or plurals. It is particularly useful for tasks involving data cleaning, entity recognition, and understanding employment nomenclature.

    Columns

    • item_id: The unique Wikidata item identifier (QCode without the 'Q' prefix).
    • employer_count: The number of Wikidata entries associated with this specific employer reference.
    • employer: The text label of the employer's name, sourced from Kensho's English labels.
    • description: The accompanying description of the Wikidata employer entry, also from Kensho.
    • in_google_news: A binary indicator (0 for no, 1 for yes) showing if the occupation exists within the GoogleNews embedding.
    • language_detected: A three-digit language code, identified using FastText language detection.
    • source: Indicates the origin of the information, such as Wikidata or Wikipedia.
    • label: A binary label (0 for invalid employer, 1 for valid employer) indicating the data's quality.
    • labeled_by: Specifies the method used for labeling, including human, classifier_gnew, classifier_bert, or cleanlab.
    • label_error_reason: Provides the specific reason if a label is deemed an error, such as 'domain' or 'plural'.
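    Using the label column to keep only validated employer entries can be sketched as follows. The two sample rows are invented; only the column names come from the list above:

    ```python
    import csv
    import io

    # Invented sample rows matching a subset of the columns described above.
    sample = io.StringIO(
        "item_id,employer,label,labeled_by,language_detected\n"
        "312,Acme Corporation,1,human,eng\n"
        "845,software engineers,0,classifier_bert,eng\n"
    )

    # Keep only rows labeled as valid employers (label == 1); note that
    # csv.DictReader yields all values as strings.
    valid = [row for row in csv.DictReader(sample) if row["label"] == "1"]
    ```
    
    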

    Distribution

    This dataset is provided as a single CSV file, named employers.wikidata.all.labeled.csv. Its current version is 1.0, with a file size of approximately 5.98 MB. The dataset contains a substantial number of entries, with item_id having 60656 values, employer having 60456 values, and description having 60640 values.

    Usage

    This dataset is ideal for various applications, including:

    • Detecting new trends in employers, occupations, and employment terminology.
    • Automatic error correction of employer entries.
    • Converting plural forms of entities to singular forms.
    • Training Named Entity Recognition (NER) models to identify employer names.
    • Building Question/Answer models that can understand and respond to queries about employers.
    • Improving the accuracy of FastText language detection models.
    • Assessing FastText accuracy with limited data.

    Coverage

    The dataset's coverage is global, drawing data from a Wikidata dump dated 2 February 2020. It includes employer entries from various linguistic contexts, as indicated by the language_detected column, showcasing multilingual employer names and descriptions. The content primarily focuses on entities and organisations that meet the definition of an employer, rather than specific demographic groups.

    License

    CC BY-SA

    Who Can Use It

    This dataset is suitable for:

    • Data scientists and machine learning engineers working on natural language processing tasks.
    • Researchers interested in data quality, entity resolution, and knowledge graph analysis.
    • Developers building applications that require accurate employer information.
    • Anyone needing to clean and validate employer data for various analytical or operational purposes.

    Dataset Name Suggestions

    • Wikidata Labeled Employers
    • ML-Ready Wikidata Employer Data
    • Cleaned Wikidata Employer References
    • Global Employer Dataset (Wikidata)
    • Validated Employer Entities

    Attributes

    Original Data Source: ML-You-Can-Use Wikidata Employers labeled

