56 datasets found

World Bank: GHNP Data
kaggle.com
zip
Updated Mar 20, 2019
Cite
World Bank (2019). World Bank: GHNP Data [Dataset]. https://www.kaggle.com/theworldbank/world-bank-health-population
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Mar 20, 2019
Dataset provided by
World Bank Group (http://www.worldbank.org/)
Authors
World Bank
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The World Bank is an international financial institution that provides loans to countries of the world for capital projects. The World Bank's stated goal is the reduction of poverty. Source: https://en.wikipedia.org/wiki/World_Bank

Content

This dataset combines key health statistics from a variety of sources to provide a look at global health and population trends. It includes information on nutrition, reproductive health, education, immunization, and diseases from over 200 countries.

Update Frequency: Biannual

For more information, see the World Bank website.

Fork this kernel to get started with this dataset.

Acknowledgements

https://datacatalog.worldbank.org/dataset/health-nutrition-and-population-statistics

https://cloud.google.com/bigquery/public-data/world-bank-hnp

Dataset Source: World Bank. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Citation: The World Bank: Health Nutrition and Population Statistics

Banner Photo by @till_indeman from Unsplash.

Inspiration

What’s the average age of first marriages for females around the world?
World Bank: Education Data
kaggle.com
zip
Updated Mar 20, 2019
Cite
World Bank (2019). World Bank: Education Data [Dataset]. https://www.kaggle.com/datasets/theworldbank/world-bank-intl-education
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Mar 20, 2019
Dataset provided by
World Bank Group (http://www.worldbank.org/)
Authors
World Bank
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The World Bank is an international financial institution that provides loans to countries of the world for capital projects. The World Bank's stated goal is the reduction of poverty. Source: https://en.wikipedia.org/wiki/World_Bank

Content

This dataset combines key education statistics from a variety of sources to provide a look at global literacy, spending, and access.

For more information, see the World Bank website.

Fork this kernel to get started with this dataset.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:world_bank_health_population

http://data.worldbank.org/data-catalog/ed-stats

https://cloud.google.com/bigquery/public-data/world-bank-education

Citation: The World Bank: Education Statistics

Dataset Source: World Bank. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Banner Photo by @till_indeman from Unsplash.

Inspiration

Of total government spending, what percentage is spent on education?
China CN: Internet Service: No of Domain: ORG
ceicdata.com
Updated Feb 15, 2025
Cite
CEICdata.com (2025). China CN: Internet Service: No of Domain: ORG [Dataset]. https://www.ceicdata.com/en/china/internet-number-of-domain-and-website/cn-internet-service-no-of-domain-org
Explore at:
Dataset updated
Feb 15, 2025
Dataset provided by
CEICdata.com
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 1, 2017 - Dec 1, 2024
Area covered
China
Variables measured
Internet Statistics
Description
China Internet Service: Number of Domain: ORG data was reported at 0.023 Unit mn in Dec 2024. This records a decrease from the previous number of 0.026 Unit mn for Jun 2024. China Internet Service: Number of Domain: ORG data is updated semiannually, averaging 0.128 Unit mn from Dec 2005 (Median) to Dec 2024, with 35 observations. The data reached an all-time high of 0.398 Unit mn in Dec 2015 and a record low of 0.023 Unit mn in Dec 2024. China Internet Service: Number of Domain: ORG data remains active status in CEIC and is reported by China Internet Network Information Center. The data is categorized under China Premium Database’s Information and Communication Sector – Table CN.ICE: Internet: Number of Domain and Website.
World Sites (TimeMap Sample Dataset)
ecaidata.org
Updated Oct 4, 2014
Cite
ECAI Clearinghouse (2014). World Sites (TimeMap Sample Dataset) [Dataset]. https://ecaidata.org/dataset/ecaiclearinghouse-id-12
Explore at:
Dataset updated
Oct 4, 2014
Dataset provided by
ECAI Clearinghouse
Area covered
World
Description
A database of cultural heritage sites assembled by volunteers at the Archaeological Computing Laboratory, University of Sydney. The initial data source was the UNESCO web site, supplemented by individual work on different countries/regions.
Data from: Exploring the Dominance of the English Language on the Websites...
zenodo.org
data.niaid.nih.gov
bin, xls
Updated Mar 5, 2020
Cite
Giannakoulopoulos Andreas; Pergantis Minas; Konstantinou Nikos; Lamprogeorgos Aristeidis; Limniati Laida; Varlamis Iraklis (2020). Exploring the Dominance of the English Language on the Websites of EU Countries [Dataset]. http://doi.org/10.5281/zenodo.3698008
Explore at:
Available download formats: xls, bin
Unique identifier
https://doi.org/10.5281/zenodo.3698008
Dataset updated
Mar 5, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Giannakoulopoulos Andreas; Pergantis Minas; Konstantinou Nikos; Lamprogeorgos Aristeidis; Limniati Laida; Varlamis Iraklis
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
European Union
Description
This Dataset, in 29 files of xlsx format, contains the data of all metrics and accumulated information as they are described in the methodology, results and discussion section of the research article "Exploring the Dominance of the English Language on the Websites of EU Countries".
A global database for the distributions of crop wild relatives
gbif.org
researchdata.edu.au
+1 more
Updated Feb 9, 2024
Cite
Crop Wild Relatives Occurrence data consortia (2024). A global database for the distributions of crop wild relatives [Dataset]. http://doi.org/10.15468/jyrthk
Explore at:
Unique identifier
https://doi.org/10.15468/jyrthk
Dataset updated
Feb 9, 2024
Dataset provided by
International Center for Tropical Agriculture (https://alliancebioversityciat.org/)
Global Biodiversity Information Facility (https://www.gbif.org/)
Authors
Crop Wild Relatives Occurrence data consortia
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
This dataset originally held 5 647 442 total records, where 34% of the records corresponded to germplasm accessions and 66% to herbarium samples. A total of 3 231 286 records had cross-checked coordinates (see Figure 2). 322 735 records were newly georeferenced using The Google Geocoding API and 15 713 new records were obtained after digitizing the information contained in herbaria specimens. Data was gathered from more than 100 data providers, including GBIF (a comprehensive list of institutions and individuals is available here: http://www.cwrdiversity.org/data-sources/ ).

The geographic coverage of the dataset includes 96% of the world countries and also includes records of cultivated plants (1/3 of the dataset). Records of the crop wild relatives of 80 crop gene pools can be queried and visualized in this interactive map: http://www.cwrdiversity.org/distribution-map/

This dataset was assembled as part of the project ‘Adapting Agriculture to Climate Change: Collecting, Protecting and Preparing Crop Wild Relatives’, which is supported by the Government of Norway. The project is managed by the Global Crop Diversity Trust and the Millennium Seed Bank of the Royal Botanic Gardens, Kew, and implemented in partnership with national and international genebanks and plant breeding institutes around the world. For further information, please refer to the project website: http://www.cwrdiversity.org/

For publication to GBIF, all records originally gathered from GBIF have been removed to avoid data duplication.

Citation: Crop Wild Relatives Occurrence data consortia ([year]). A global database for the distributions of crop wild relatives. Centro Internacional de Agricultura Tropical (CIAT). Occurrence dataset https://doi.org/10.15468/jyrthk accessed via GBIF.org on [date].
Data from: World Database on Protected Areas
fsm-data.sprep.org
pacificdata.org
+13 more
geojson, html, jpeg +3
Updated Feb 15, 2022
Cite
UN Environment World Conservation Monitoring Centre (UNEP-WCMC) (2022). World Database on Protected Areas [Dataset]. https://fsm-data.sprep.org/dataset/world-database-protected-areas
Explore at:
Available download formats: html, jpeg, pdf, zip, geojson, website
Dataset updated
Feb 15, 2022
Dataset provided by
The Nature Conservancy
Authors
UN Environment World Conservation Monitoring Centre (UNEP-WCMC)
License
Public Domain Mark 1.0 https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
Area covered
164.23324584961 4.7844689665794, 155.88363647461 0.043945308191358, 154.38949584961 0.39550467153202, 136.54769897461 7.3188817303668, 153.42269897461 9.9255659124055, 152.98324584961 3.995780512963, 139.71176147461 11.135287077054)), 162.91488647461 6.1842461612806, POLYGON ((136.54769897461 10.531020008465, 142.61215209961 5.5722498011139, Federated States of Micronesia
Description
The World Database on Protected Areas (WDPA) is the most comprehensive global database of marine and terrestrial protected areas, updated on a monthly basis, and is one of the key global biodiversity data sets being widely used by scientists, businesses, governments, International secretariats and others to inform planning, policy decisions and management. The WDPA is a joint project between UN Environment and the International Union for Conservation of Nature (IUCN). The compilation and management of the WDPA is carried out by UN Environment World Conservation Monitoring Centre (UNEP-WCMC), in collaboration with governments, non-governmental organisations, academia and industry. There are monthly updates of the data which are made available online through the Protected Planet website where the data is both viewable and downloadable. Data and information on the world's protected areas compiled in the WDPA are used for reporting to the Convention on Biological Diversity on progress towards reaching the Aichi Biodiversity Targets (particularly Target 11), to the UN to track progress towards the 2030 Sustainable Development Goals, to some of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) core indicators, and other international assessments and reports including the Global Biodiversity Outlook, as well as for the publication of the United Nations List of Protected Areas. Every two years, UNEP-WCMC releases the Protected Planet Report on the status of the world's protected areas and recommendations on how to meet international goals and targets. Many platforms are incorporating the WDPA to provide integrated information to diverse users, including businesses and governments, in a range of sectors including mining, oil and gas, and finance. 
For example, the WDPA is included in the Integrated Biodiversity Assessment Tool, an innovative decision support tool that gives users easy access to up-to-date information that allows them to identify biodiversity risks and opportunities within a project boundary. The reach of the WDPA is further enhanced in services developed by other parties, such as the Global Forest Watch and the Digital Observatory for Protected Areas, which provide decision makers with access to monitoring and alert systems that allow whole landscapes to be managed better. Together, these applications of the WDPA demonstrate the growing value and significance of the Protected Planet initiative.
Data from: Ramsar Sites
pacific-data.sprep.org
pacificdata.org
+1 more
pdf
Updated Apr 8, 2025
Cite
PNG Conservation and Environment Protection Authority (2025). Ramsar Sites [Dataset]. https://pacific-data.sprep.org/dataset/ramsar-sites
Explore at:
Available download formats: pdf (115614), pdf (15018951)
Dataset updated
Apr 8, 2025
Dataset provided by
PNG Conservation and Environment Protection Authority
License
Public Domain Mark 1.0 https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
Area covered
Papua New Guinea
Description
Ramsar and wetlands
Building a DGA Classifier: Part 1, Data Preparation
impactcybertrust.org
search.datacite.org
Updated Jan 28, 2019
Cite
External Data Source (2019). Building a DGA Classifier: Part 1, Data Preparation [Dataset]. http://doi.org/10.23721/100/1478811
Explore at:
Unique identifier
https://doi.org/10.23721/100/1478811
Dataset updated
Jan 28, 2019
Authors
External Data Source
Description
The purpose of building a DGA classifier isn't specifically for takedowns of botnets, but to discover and detect their use on our network or services. If you have a list of domains resolved and accessed at your organization, it is now possible to see which of those are potentially generated and used by malware.

The dataset consists of three sources (as described in the Data-Driven Security blog):

Alexa: For samples of legitimate domains, an obvious choice is to go to the Alexa list of top web sites. But it's not ready for our use as is. If you grab the top 1 Million Alexa domains and parse the list, you'll find just over 11 thousand are full URLs and not just domains, and there are thousands of domains with subdomains that don't help us (we are only classifying on domains here). So after I remove the URLs, de-duplicate the domains and clean it up, I end up with the Alexa top 965,843.

"Real World" Data from OpenDNS: After reading the post from Frank Denis at OpenDNS titled "Why Using Real World Data Matters For Building Effective Security Models", I grabbed their 10,000 Top Domains and their 10,000 Random samples. If we compare that to the top Alexa domains, 6,901 of the top ten thousand are in the Alexa data and 893 of the random domains are in the Alexa data. I will clean that up as I make the final training dataset.

DGA domains: The Click Security version wasn't very clear about where they got their bad domains, so I decided to collect my own, and this was rather fun. Because I work with some interesting characters (who know interesting characters), I was able to collect several data sets from recent botnets: "Cryptolocker", two separate "Game-Over Zeus" algorithms, and an anonymous collection of malicious (and algorithmically generated) domains. In the end, I was able to collect 73,598 algorithmically generated domains.
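
The cleanup described for the Alexa list (drop full URLs, de-duplicate, tidy up) could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `clean_domains` and the sample entries are hypothetical, and the original post's exact rules are not given here. Subdomain handling is deliberately left out, since doing it properly would need a public-suffix list.

```python
def clean_domains(entries):
    """Return unique, lower-cased bare domains, skipping anything that looks like a URL.

    Hypothetical helper for illustration; not the original author's code.
    """
    seen = set()
    cleaned = []
    for raw in entries:
        name = raw.strip().lower().rstrip(".")
        # Skip blanks and entries that carry a scheme or a path (full URLs).
        if not name or "://" in name or "/" in name:
            continue
        # Keep only the first occurrence of each domain, preserving order.
        if name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


# Made-up sample entries, not taken from the Alexa list.
sample = ["Example.com", "example.com", "example.org/some/page",
          "http://example.net", " example.net "]
print(clean_domains(sample))  # ['example.com', 'example.net']
```

The same function could be applied to the OpenDNS samples before comparing them against the cleaned Alexa set.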
China CN: Internet Service: No of Website: ORG
ceicdata.com
Updated Dec 15, 2024
Cite
CEICdata.com (2024). China CN: Internet Service: No of Website: ORG [Dataset]. https://www.ceicdata.com/en/china/internet-number-of-domain-and-website/cn-internet-service-no-of-website-org
Explore at:
Dataset updated
Dec 15, 2024
Dataset provided by
CEICdata.com
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 1, 2005 - Dec 1, 2008
Area covered
China
Variables measured
Internet Statistics
Description
China Internet Service: Number of Website: ORG data was reported at 0.021 Unit mn in Dec 2008. This records an increase from the previous number of 0.017 Unit mn for Jun 2008. China Internet Service: Number of Website: ORG data is updated semiannually, averaging 0.017 Unit mn from Dec 2005 (Median) to Dec 2008, with 7 observations. The data reached an all-time high of 0.021 Unit mn in Dec 2008 and a record low of 0.009 Unit mn in Dec 2007. China Internet Service: Number of Website: ORG data remains active status in CEIC and is reported by China Internet Network Information Center. The data is categorized under China Premium Database’s Information and Communication Sector – Table CN.ICE: Internet: Number of Domain and Website.
Share of global mobile website traffic 2015-2025
statista.com
tokrwards.com
+1 more
Updated Sep 11, 2025
Cite
Statista (2025). Share of global mobile website traffic 2015-2025 [Dataset]. https://www.statista.com/statistics/277125/share-of-website-traffic-coming-from-mobile-devices/
Explore at:
Dataset updated
Sep 11, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Worldwide
Description
In the second quarter of 2025, mobile devices (excluding tablets) accounted for 62.54 percent of global website traffic. Since consistently maintaining a share of around 50 percent beginning in 2017, mobile usage surpassed this threshold in 2020 and has demonstrated steady growth in its dominance of global web access.

Mobile traffic

Due to low infrastructure and financial restraints, many emerging digital markets skipped the desktop internet phase entirely and moved straight onto mobile internet via smartphone and tablet devices. India is a prime example of a market with a significant mobile-first online population. Other countries with a significant share of mobile internet traffic include Nigeria, Ghana and Kenya. In most African markets, mobile accounts for more than half of the web traffic. By contrast, mobile only makes up around 45.49 percent of online traffic in the United States.

Mobile usage

The most popular mobile internet activities worldwide include watching movies or videos online, e-mail usage and accessing social media. Apps are a very popular way to watch video on the go and the most-downloaded entertainment apps in the Apple App Store are Netflix, Tencent Video and Amazon Prime Video.
Internet users for the United States
fred.stlouisfed.org
json
Updated Oct 8, 2025
Cite
(2025). Internet users for the United States [Dataset]. https://fred.stlouisfed.org/series/ITNETUSERP2USA
Explore at:
Available download formats: json
Dataset updated
Oct 8, 2025
License
https://fred.stlouisfed.org/legal/#copyright-public-domain
Area covered
United States
Description
Graph and download economic data for Internet users for the United States (ITNETUSERP2USA) from 1990 to 2023 about internet, persons, and USA.
DCASE 2023 Challenge Task 2 Development Dataset
zenodo.org
data.niaid.nih.gov
zip
Updated May 3, 2023
Cite
Kota Dohi; Keisuke Imoto; Noboru Harada; Daisuke Niizumi; Yuma Koizumi; Tomoya Nishida; Harsh Purohit; Takashi Endo; Yohei Kawaguchi (2023). DCASE 2023 Challenge Task 2 Development Dataset [Dataset]. http://doi.org/10.5281/zenodo.7882613
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7882613
Dataset updated
May 3, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Kota Dohi; Keisuke Imoto; Noboru Harada; Daisuke Niizumi; Yuma Koizumi; Tomoya Nishida; Harsh Purohit; Takashi Endo; Yohei Kawaguchi
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is the "development dataset" for the DCASE 2023 Challenge Task 2 "First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring".

The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel 10-second audio that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:

ToyCar

ToyTrain

Fan

Gearbox

Bearing

Slide rail

Valve

Overview of the task

Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.

This task is the follow-up from DCASE 2020 Task 2 to DCASE 2022 Task 2. The task this year is to develop an ASD system that meets the following four requirements.

1. Train a model using only normal sound (unsupervised learning scenario)

Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data. This is the same requirement as in the previous tasks.

2. Detect anomalies regardless of domain shifts (domain generalization task)

In real-world cases, the operational states of a machine or the environmental noise can change to cause domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard-to-notice. In this task, the system is required to use domain-generalization techniques for handling these domain shifts. This requirement is the same as in DCASE 2022 Task 2.

3. Train a model for a completely new machine type

For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning.

4. Train a model using only one machine from its machine type

While sounds from multiple machines of the same machine type can be used to enhance detection performance, it is often the case that sound data from only one machine are available for a machine type. In such a case, the system should be able to train models using only one machine from a machine type.

The last two requirements are newly introduced in DCASE 2023 Task 2 as the "first-shot problem".

Definition

We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."

"Machine type" indicates the type of machine, which in the development dataset is one of seven: fan, gearbox, bearing, slide rail, valve, ToyCar, and ToyTrain.

A section is defined as a subset of the dataset for calculating performance metrics.

The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.

Attributes are parameters that define states of machines or types of noise.

Dataset

This dataset consists of seven machine types. For each machine type, one section is provided, and the section is a complete set of training and test data. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, and (iii) 100 clips each of normal and anomalous sounds for the test. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.
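
As a quick sanity check on the figures above, the clip counts can be tallied as follows. This is just arithmetic on the numbers stated in the description; the variable names are illustrative.

```python
# Figures taken from the dataset description above.
MACHINE_TYPES = ["fan", "gearbox", "bearing", "slide rail", "valve", "ToyCar", "ToyTrain"]
SECTIONS_PER_TYPE = 1

TRAIN_SOURCE_NORMAL = 990   # normal clips, source domain, training
TRAIN_TARGET_NORMAL = 10    # normal clips, target domain, training
TEST_NORMAL = 100           # normal clips, test
TEST_ANOMALOUS = 100        # anomalous clips, test

train_per_section = TRAIN_SOURCE_NORMAL + TRAIN_TARGET_NORMAL
test_per_section = TEST_NORMAL + TEST_ANOMALOUS
n_sections = len(MACHINE_TYPES) * SECTIONS_PER_TYPE

total_train = n_sections * train_per_section
total_test = n_sections * test_per_section

print(f"train clips: {total_train}")              # train clips: 7000
print(f"test clips:  {total_test}")               # test clips:  1400
print(f"all clips:   {total_train + total_test}") # all clips:   8400
```

Note that only 1% of each section's training clips (10 of 1000) come from the target domain, which is what makes the domain-generalization requirement demanding.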

File names and attribute csv files

File names and attribute csv files provide reference labels for each clip. The given reference labels for each training/test clip include machine type, section index, normal/anomaly information, and attributes regarding the condition other than normal/anomaly. The machine type is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by their respective file names. Attribute csv files are for easy access to attributes that cause domain shifts. In these files, the file names, name of parameters that cause domain shifts (domain shift parameter, dp), and the value or type of these parameters (domain shift value, dv) are listed. Each row takes the following format:

[filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
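
A minimal sketch of reading rows in this format with Python's standard `csv` module is shown below. The file names, parameter names, and values in the sample are made up for illustration, and the sketch assumes rows without a header line; the real attribute files may differ.

```python
import csv
import io


def coerce(value):
    """Convert a value to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_attribute_rows(lines):
    """Map each file name to a dict of {domain shift parameter: value}."""
    parsed = {}
    for row in csv.reader(lines):
        if not row:
            continue
        filename = row[0].strip()
        rest = [cell.strip() for cell in row[1:]]
        # Columns alternate parameter name (dNp) and value (dNv); skip empty padding.
        parsed[filename] = {p: coerce(v) for p, v in zip(rest[0::2], rest[1::2]) if p}
    return parsed


# Hypothetical rows, not copied from the dataset.
sample = io.StringIO("clip_a.wav,speed,6,noise,factory_1\nclip_b.wav,load,0.5\n")
attributes = parse_attribute_rows(sample)
print(attributes["clip_a.wav"])  # {'speed': 6, 'noise': 'factory_1'}
print(attributes["clip_b.wav"])  # {'load': 0.5}
```

Keeping the values typed (int, float, or string) mirrors the `int | float | string` annotation in the row format.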

Recording procedure

Normal/anomalous operating sounds of machines and their related equipment were recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings of a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.

Directory structure

- /dev_data

- /raw
- /fan
- /train (only normal clips)
- /section_00_source_train_normal_0000_
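
Because the reference labels are carried in the file names, the visible prefix of a clip name can be split into fields with a regular expression. The sketch below handles only the prefix shown in the listing above; the field names (`domain`, `split`, `label`, `clip`) are an assumed interpretation, and whatever follows the prefix (the listing is cut off there) is ignored.

```python
import re

# Matches only the visible prefix, e.g. "section_00_source_train_normal_0000_".
PREFIX = re.compile(
    r"^section_(?P<section>\d+)_(?P<domain>[a-z]+)_(?P<split>[a-z]+)"
    r"_(?P<label>[a-z]+)_(?P<clip>\d+)_"
)


def parse_clip_prefix(name):
    """Return the prefix fields as a dict, or None if the name does not match."""
    match = PREFIX.match(name)
    if match is None:
        return None
    info = match.groupdict()
    info["section"] = int(info["section"])
    info["clip"] = int(info["clip"])
    return info


print(parse_clip_prefix("section_00_source_train_normal_0000_"))
# {'section': 0, 'domain': 'source', 'split': 'train', 'label': 'normal', 'clip': 0}
```

The machine type is not part of the file name; per the description it comes from the directory name (here, `fan`).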

Baseline system

The baseline system is available on the GitHub repository dcase2023_task2_baseline_ae. The baseline system provides a simple entry-level approach that gives a reasonable performance on the dataset of Task 2. It is a good starting point, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.

Condition of use

This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Citation

If you use this dataset, please cite all the following papers. We will publish a paper on the description of the DCASE 2023 Task 2, so please make sure to cite that paper, too.

Noboru Harada, Daisuke Niizumi, Yasunori Ohishi, Daiki Takeuchi, and Masahiro Yasuda. First-shot anomaly detection for machine condition monitoring: A domain generalization baseline. In arXiv e-prints: 2303.00455, 2023. [URL]

Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi. MIMII DG: sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task. In Proceedings of the 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), 31-35. Nancy, France, November 2022. [URL]

Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito. ToyADMOS2: another dataset of miniature-machine operating sounds for
World Heritage Site List
ihp-wins.unesco.org
csv
Updated Jul 24, 2025
Cite
(2025). World Heritage Site List [Dataset]. https://ihp-wins.unesco.org/dataset/unesco-world-heritage-sites
Explore at:
Available download formats: csv
Dataset updated
Jul 24, 2025
License
http://www.opendefinition.org/licenses/cc-by-sa
Area covered
World
Description
The World Heritage List includes 1248 properties forming part of the cultural and natural heritage which the World Heritage Committee considers as having outstanding universal value.

These include 972 cultural, 235 natural and 41 mixed properties in 170 States Parties. As of October 2024, 196 States Parties have ratified the World Heritage Convention.
Demonstrating Data-to-Knowledge Pipelines for Connecting Production Sites in...
ieee-dataport.org
Updated Sep 9, 2025
Cite
Leon Gorissen (2025). Demonstrating Data-to-Knowledge Pipelines for Connecting Production Sites in the World Wide Lab: Trajectory Data and Benchmark Models [Dataset]. https://ieee-dataport.org/documents/demonstrating-data-knowledge-pipelines-connecting-production-sites-world-wide-lab
Explore at:
Dataset updated
Sep 9, 2025
Authors
Leon Gorissen
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
World
Description
Check the project website for the code repository link. The folder test contains the data used for model evaluation or testing; its metadata must be derived from metadata_dump_test.json. The folder train contains the data used for model training and cross validation.
World Bank: International Debt Data
kaggle.com
zip
Updated Mar 20, 2019
Cite
World Bank (2019). World Bank: International Debt Data [Dataset]. https://www.kaggle.com/datasets/theworldbank/world-bank-intl-debt
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Mar 20, 2019
Dataset provided by
World Bank Group (http://www.worldbank.org/)
Authors
World Bank
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The World Bank is an international financial institution that provides loans to countries of the world for capital projects. The World Bank's stated goal is the reduction of poverty. Source: https://en.wikipedia.org/wiki/World_Bank

Content

This dataset contains both national and regional debt statistics captured by over 200 economic indicators. Time series data is available for those indicators from 1970 to 2015 for reporting countries.

For more information, see the World Bank website.

Fork this kernel to get started with this dataset.

Acknowledgements

https://bigquery.cloud.google.com/dataset/bigquery-public-data:world_bank_intl_debt

https://cloud.google.com/bigquery/public-data/world-bank-international-debt

Citation: The World Bank: International Debt Statistics

Dataset Source: World Bank. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Banner Photo by @till_indeman from Unsplash.

Inspiration

What countries have the largest outstanding debt?

https://cloud.google.com/bigquery/images/outstanding-debt.png
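One way to approach the question above is to total a single debt indicator per country and rank the results. The following is a minimal sketch, not the query used by the dataset authors. It assumes each record carries `country_name`, `indicator_code`, and `value` fields; these field names, the indicator code `EXAMPLE.CODE`, and every row below are hypothetical placeholders, not real World Bank figures.

```python
from collections import defaultdict


def top_debtors(rows, indicator_code, n=3):
    """Sum `value` per country for one indicator and return the n largest totals."""
    totals = defaultdict(float)
    for row in rows:
        # Keep a single indicator so different debt measures are not added together.
        if row["indicator_code"] != indicator_code:
            continue
        # Skip missing observations.
        if row["value"] is None:
            continue
        totals[row["country_name"]] += row["value"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Placeholder rows (hypothetical), used only to show the call.
sample = [
    {"country_name": "Country A", "indicator_code": "EXAMPLE.CODE", "value": 120.0},
    {"country_name": "Country B", "indicator_code": "EXAMPLE.CODE", "value": 300.0},
    {"country_name": "Country A", "indicator_code": "EXAMPLE.CODE", "value": 80.0},
    {"country_name": "Country B", "indicator_code": "OTHER.CODE", "value": 999.0},
    {"country_name": "Country C", "indicator_code": "EXAMPLE.CODE", "value": None},
]

print(top_debtors(sample, "EXAMPLE.CODE", n=2))
```

Filtering on one indicator first matters because the dataset holds over 200 indicators, and summing across them would mix unrelated quantities.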
Data from: Harmonized chronologies of a global late Quaternary pollen...
doi.pangaea.de
service.tib.eu
html, tsv
Updated Jun 28, 2021
Chenzhi Li; Alexander Postl; Thomas Böhmer; Andrew M Dolman; Ulrike Herzschuh (2021). Harmonized chronologies of a global late Quaternary pollen dataset (LegacyAge 1.0) [Dataset]. http://doi.org/10.1594/PANGAEA.933132
Explore at:
Available download formats
tsv, html
Unique identifier
https://doi.org/10.1594/PANGAEA.933132
Dataset updated
Jun 28, 2021
Dataset provided by
PANGAEA
Authors
Chenzhi Li; Alexander Postl; Thomas Böhmer; Andrew M Dolman; Ulrike Herzschuh
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jun 26, 1938 - Mar 18, 2014
Variables measured
Site, Type, LATITUDE, Continent, ELEVATION, LONGITUDE, Replicates, Description, Event label, Location type, and 6 more
Description
This dataset presents revised global age models for taxonomically harmonized fossil pollen records. The age-depth models were established from mostly IntCal20-calibrated radiocarbon dates with a predefined parameter setting. 1032 sites are located in North America, 1075 sites in Europe, and 488 sites in Asia. In the Southern Hemisphere, there are 150 sites in South America, 54 in Africa, and 32 in the Indo-Pacific region. Dates, mostly C14, were retrieved from the Neotoma Paleoecology Database (https://www.neotomadb.org/), with additional data from Cao et al. (2020; https://doi.org/10.5194/essd-12-119-2020), Cao et al. (2013; https://doi.org/10.1016/j.revpalbo.2013.02.003) and our own collection. The related age records were revised by applying a similar approach, i.e., using the Bayesian age-depth modeling routine in R-BACON software. […]
Data cleaning using unstructured data
zenodo.org
zip
Updated Jul 30, 2024
Rihem Nasfi; Antoon Bronselaer (2024). Data cleaning using unstructured data [Dataset]. http://doi.org/10.5281/zenodo.13135983
Explore at:
Available download formats
zip
Unique identifier
https://doi.org/10.5281/zenodo.13135983
Dataset updated
Jul 30, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Rihem Nasfi; Antoon Bronselaer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this project, we work on repairing three datasets:

Trials design: This dataset was obtained from the European Union Drug Regulating Authorities Clinical Trials Database (EudraCT) register, and the ground truth was created from external registries. In the dataset, multiple countries, identified by the attribute country_protocol_code, conduct the same clinical trial, which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial.

Trials population: This dataset describes the demographics of participants in clinical trials conducted primarily across European countries. It includes structured attributes indicating whether the trial pertains to a specific gender, age group, or healthy volunteers. Each of these categories is labeled '1' if it is included in the trial and '0' if it is not. The population category should remain consistent across all countries conducting the same clinical trial, identified by a eudract_number. The ground truth samples were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. The dataset also comprises unstructured attributes that describe the inclusion criteria for trial participants, such as inclusion.

Allergens: This dataset contains information about products and their allergens. The data was collected from the German version of 'Alnatura' (access date: 24 November 2020), from 'Open Food Facts', a free database of food products from around the world, and from the websites 'Migipedia', 'Piccantino', and 'Das Ist Drin'. There may be overlapping products across these websites. Each product in the dataset is identified by a unique code. Samples with the same code represent the same product but are extracted from different sources. An allergen is marked '2' if it is present, '1' if there are traces of it, and '0' if it is absent from the product. The dataset also includes information on the ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.
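The consistency rule stated for the trials population data (every country running the same trial should report the same population category) can be checked mechanically. The sketch below is illustrative only and is not the authors' repair method. It uses the attribute names eudract_number and country_protocol_code from the description, while the column name `healthy_volunteers`, the trial identifiers, and the rows are hypothetical.

```python
from collections import defaultdict


def inconsistent_trials(rows, attribute):
    """Return the eudract_numbers whose `attribute` value differs between countries."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["eudract_number"]].add(row[attribute])
    # A trial is inconsistent when more than one distinct value was reported.
    return sorted(trial for trial, values in seen.items() if len(values) > 1)


# Hypothetical rows: T-001 is reported differently by two countries.
rows = [
    {"eudract_number": "T-001", "country_protocol_code": "DE", "healthy_volunteers": 0},
    {"eudract_number": "T-001", "country_protocol_code": "FR", "healthy_volunteers": 1},
    {"eudract_number": "T-002", "country_protocol_code": "BE", "healthy_volunteers": 0},
    {"eudract_number": "T-002", "country_protocol_code": "NL", "healthy_volunteers": 0},
]

print(inconsistent_trials(rows, "healthy_volunteers"))  # ['T-001']
```

A check like this only flags conflicts; deciding which value is correct still requires the ground truth or the unstructured text.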

N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets:

"{dataset_name}_train.csv": samples used for the ML-model training. (e.g "allergens_train.csv")

"{dataset_name}_test.csv": samples used to test the ML-model performance. (e.g. "allergens_test.csv")

"{dataset_name}_golden_standard.csv": samples represent the ground truth of the test samples. (e.g "allergens_golden_standard.csv")

"{dataset_name}_parker_train.csv": samples repaired using Parker Engine used for the ML-model training. (e.g "allergens_parker_train.csv")

"{dataset_name}_parker_test.csv": samples repaired using Parker Engine used to test the ML-model performance. (e.g. "allergens_parker_test.csv")
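The naming scheme above makes it easy to verify that a downloaded archive is complete. The following is a minimal sketch using only the Python standard library. It assumes the five suffixes follow the example file names in the list, and the helper names `expected_files` and `missing_files` are hypothetical. The demo builds a small in-memory archive instead of reading the real Zenodo files.

```python
import io
import zipfile

# Suffixes taken from the example file names listed above.
SUFFIXES = ["train", "test", "golden_standard", "parker_train", "parker_test"]


def expected_files(dataset_name):
    """Build the five CSV names expected for one dataset."""
    return [f"{dataset_name}_{suffix}.csv" for suffix in SUFFIXES]


def missing_files(archive, dataset_name):
    """Return expected CSV names that are absent from a .zip (path or file object)."""
    with zipfile.ZipFile(archive) as zf:
        # Compare base names only, so files inside a sub-folder still match.
        present = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return [name for name in expected_files(dataset_name) if name not in present]


# Demo: an in-memory archive that deliberately lacks the last file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for name in expected_files("allergens")[:4]:
        zf.writestr(name, "")
buffer.seek(0)

print(missing_files(buffer, "allergens"))  # ['allergens_parker_test.csv']
```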
civil_comments
tensorflow.org
huggingface.co
Updated Feb 28, 2023
(2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
Explore at:
Dataset updated
Feb 28, 2023
Description
This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers. The toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned these attributes to the comment text.
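To make the label definition concrete, the sketch below shows how a fractional label arises from individual annotator votes and how it might be turned into a binary class. The function names are hypothetical, and the 0.5 cut-off is an assumption chosen for illustration, not something stated in this description.

```python
def label_fraction(votes):
    """Fraction of annotators who assigned the attribute (votes are 0 or 1)."""
    votes = list(votes)
    if not votes:
        raise ValueError("at least one annotator vote is required")
    return sum(votes) / len(votes)


def is_positive(fraction, threshold=0.5):
    """Binarize a fractional label; the threshold is an assumed, adjustable cut-off."""
    return fraction >= threshold


# Four hypothetical annotators, two of whom marked the comment as toxic.
score = label_fraction([1, 0, 0, 1])
print(score, is_positive(score))  # 0.5 True
```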

The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.

To use this dataset:

import tensorflow_datasets as tfds

ds = tfds.load('civil_comments', split='train')
for ex in ds.take(4):
    print(ex)

See the guide for more information on tensorflow_datasets.
NFA 2018 Edition
data.world
csv, zip
Updated Feb 25, 2025
Global Footprint Network (2025). NFA 2018 Edition [Dataset]. https://data.world/footprint/nfa-2018-edition
Explore at:
Available download formats
zip, csv
Dataset updated
Feb 25, 2025
Authors
Global Footprint Network
Time period covered
1961 - 2014
Description

Our National Footprint Accounts (NFAs) measure the ecological resource use and resource capacity of nations from 1961 to 2014. The calculations in the National Footprint Accounts are primarily based on United Nations data sets, including those published by the Food and Agriculture Organization, United Nations Commodity Trade Statistics Database, and the UN Statistics Division, as well as the International Energy Agency. The 2018 edition of the NFA features some exciting updates from last year’s 2017 edition, including data for more countries and improved data sources and methodology.

Methodology changes:

Our conversion of carbon to CO2 increased in precision, which increased the world’s carbon footprint by approximately 1%.

We implemented a new data quality scoring system. This allowed us to publish data for more countries by omitting unreliable data for some years rather than the entire country’s Ecological Footprint timeline.

We used more precise data from the Global Carbon Project to calculate ocean carbon sequestration rates for 2014.

National Footprint Accounts 2018 Edition

To visualize our data in our data explorer click here. This dataset provides Ecological Footprint per capita data for the years 1961-2014 in global hectares (gha). Ecological Footprint is a measure of how much area of biologically productive land and water an individual, population, or activity requires to produce all the resources it consumes and to absorb the waste it generates, using prevailing technology and resource management practices. The Ecological Footprint is measured in global hectares. Since trade is global, an individual's or country's Footprint tracks area from all over the world. Without further specification, Ecological Footprint generally refers to the Ecological Footprint of consumption (rather than only production or export). Ecological Footprint is often referred to in short form as Footprint.

About this Dataset

This data includes total and per capita national biocapacity, ecological footprint of consumption, ecological footprint of production and total area in hectares. This dataset, however, does not include any of our yield factors (national or world) nor any equivalence factors. To view these click here.

Objectives

Revealing links between human consumption and other human behaviors, geographic characteristics, political landscapes,

Get involved

How can others contribute?
- [ ] Join this table on other data.world datasets (preferably country-level data)
- [ ] Write queries
- [ ] Create graphics
- [ ] Post and share discoveries

External resources

Data Explorer

Footprint Website

Calculate your own Ecological Footprint


