Description:
This dataset contains a collection of 15,150 images categorized into 12 distinct classes of common household waste: paper, cardboard, biological waste, metal, plastic, green glass, brown glass, white glass, clothing, shoes, batteries, and general trash. Each category represents a different type of material, supporting more effective recycling and waste-management strategies.
Objective
The purpose of this dataset is to aid in the development of machine learning models designed to automatically classify household waste into its appropriate categories, thus promoting more efficient recycling processes. Proper waste sorting is crucial for maximizing the amount of material that can be recycled, and this dataset is aimed at enhancing automation in this area. The classification of garbage into a broader range of categories, as opposed to the limited classes found in most available datasets (2-6 classes), allows for a more precise recycling process and could significantly improve recycling rates.
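As an illustration of the intended use, the sketch below loads such a folder-per-class image collection and prepares it for training a 12-way classifier. This is a minimal example rather than part of the dataset itself: the root path and folder layout are assumptions, and any torchvision model could be plugged in afterwards.

import torch
from torchvision import datasets, transforms

# Assumed layout: garbage_classification/<class_name>/<image>.jpg (one folder per class)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("garbage_classification/", transform=transform)
print(len(dataset), "images in", len(dataset.classes), "classes")  # expected: 15150 images, 12 classes
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)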
Dataset Composition and Collection Process
The dataset was primarily collected through web scraping, as simulating a real-world garbage collection scenario (such as placing a camera above a conveyor belt) was not feasible at the time of collection. The goal was to obtain images that closely resemble actual garbage. For example, images in the biological waste category include rotten fruits, vegetables, and food remnants. Similarly, categories such as glass and metal consist of images of bottles, cans, and containers typically found in household trash. While the images for some categories, like clothes or shoes, were harder to find specifically as garbage, they still represent the items that may end up in waste streams.
In an ideal setting, a conveyor system could be used to gather real-time data by capturing images of waste in a continuous flow. Such a setup would enhance the dataset by providing authentic waste images for all categories. However, until that setup is available, this dataset serves as a significant step toward automating garbage classification and improving recycling technologies.
Potential for Future Improvements
While this dataset provides a strong foundation for household waste classification, there is potential for further improvements. For example, real-time data collection using conveyor systems or garbage processing plants could provide higher accuracy and more contextual images. Additionally, future datasets could expand to include more specialized categories, such as electronic waste, hazardous materials, or specific types of plastic.
Conclusion
The Garbage Classification dataset offers a broad and diverse collection of household waste images, making it a valuable resource for researchers and developers working in environmental sustainability, machine learning, and recycling automation. By improving the accuracy of waste classification systems, we can contribute to a cleaner, more sustainable future.
This dataset is sourced from Kaggle.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Methods
The dataset is the output of a comprehensive literature-based search that aims to collate all the evidence on where ecosystem service (ES) relationships have been mentioned and addressed. We applied systematic mapping, based on the “Guidelines for Systematic Review in Environmental Management” developed by the Centre for Evidence-Based Conservation at Bangor University (Pullin and Stewart 2006).
The methodological framework followed the standard stages outlined for systematic mapping in environmental sciences (James et al. 2016). Briefly, we defined the scope and objectives:
· We comprehensively reviewed and further explored the global evidence of ES trade-offs and synergies, focusing on all systems: terrestrial, freshwater, and marine.
· We compiled the evidence on trade-offs and synergies among multiple ES interacting across various ecosystems.
· We performed a geographical and temporal trend analysis, exploring the distribution of studies across the world and examining how the focus on various ecosystem types and ES categories has evolved, to highlight gaps and biases.
Then we set the criteria for study inclusion (Table 1), searched the evidence, coded it, and produced the database. The extracted article information, including the specific criteria, is detailed in Table 1.
The first step was to search the ISI Web of Knowledge core collection (http://apps.webofknowledge.com) database, targeting the search on the ecosystem services literature and on studies dealing with trade-offs/synergies, win-win outcomes or bundles when managing different ecosystem services in the landscape/seascape. All peer-reviewed journal articles written in English or Spanish were considered for review.
The peer-reviewed literature from 2005 to 2021 was reviewed, identifying relevant studies according to specific search terms. The search terms and descriptive words were derived from Howe et al. (2014), adding “bundles” and “co-benefits”. The Boolean wildcard ‘*’, which allows any letters after it, was used on the root of words where several different endings applied (Figure 1). The search terms used were:
(“*ecosystem service*” OR “environment* service*” OR “ecosystem* approach*” OR “ecosystem good*” OR “environment* good*”)
AND
(“*trade-off*” OR “tradeoff*” OR “synerg*” OR “win-win*” OR “bundle*” OR “cost* and benefit*” OR “co-benefit*”) (n = 5194)
In the second step (Figure 1), papers were preliminarily coded with a semantic analysis using the R package Bibliometrix (http://www.bibliometrix.org). Papers were classified according to three systems: terrestrial, marine, and freshwater (Table 1). Papers with multiple systems, transitional habitats, or those that could not be classified were classified as “other” (Mazor et al. 2018). Articles were classified based on the occurrence of the most frequent system words in their title, keywords, and abstract (Mazor et al. 2018). The set of system-specific words was determined by extracting the 250 most frequently used keywords from all considered articles and assigning each word to one system (articles could fall into just one of the four categories). Using this technique, we managed to classify 100% of the papers. To further enrich the dataset and make it a useful repository for science and policy, an additional sub-classification was performed, categorizing papers into the following categories: Coastal, Urban, Wetlands, Forest, Mountain, Freshwater, Agroecosystems, and Others (mainly representing multiple ecosystems) (Table S1). This comprehensive classification approach enhances the dataset's utility for various scientific and policy-making applications.
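The keyword-based assignment can be illustrated with a minimal Python sketch; the keyword sets below are illustrative placeholders, not the 250 keywords actually used in the study.

from collections import Counter

# Illustrative keyword sets; the study derived its sets from the 250 most frequent keywords.
SYSTEM_KEYWORDS = {
    "terrestrial": {"forest", "soil", "grassland", "agriculture"},
    "marine": {"ocean", "sea", "coastal", "fishery"},
    "freshwater": {"river", "lake", "wetland", "stream"},
}

def classify_system(title, keywords, abstract):
    # Count occurrences of system-specific words in title, keywords and abstract
    tokens = f"{title} {' '.join(keywords)} {abstract}".lower().split()
    counts = Counter({system: sum(tokens.count(w) for w in words)
                      for system, words in SYSTEM_KEYWORDS.items()})
    best, n = counts.most_common(1)[0]
    return best if n > 0 else "other"  # each article gets exactly one of the four labels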
In the third step (Figure 1), applying the same technique, we classified the papers into four ES categories: habitat (supporting/biodiversity-related), provisioning, regulating, and cultural services (De Groot et al. 2010; MEA 2005; Sukhdev 2010; Wallace 2007). For the classification into ES categories, articles could fall into one or more of the four categories (see Table 1 for examples of the keywords used to classify ecosystems, ES categories, and countries). Applying this technique, we excluded 2149 papers that were not classified into any ES category, resulting in 3629 papers (see Figure 1).
In the fourth step (Figure 1), an initial screening was conducted to identify papers that did not align with the review objectives of assessing ecosystem service trade-offs and synergies to inform policy and management decisions. We manually reviewed the title of each paper in the dataset, excluding those from other fields or otherwise out of scope. In this initial assessment, we excluded 347 papers, leaving a total of 3286 papers for further review. A descriptive analysis of this 3286-article dataset was performed to examine the distribution of ES categories within each ecosystem type over the specified period. This analysis allowed us to assess the prevalence of each ES category in different ecosystem types, identifying temporal trends and patterns. The number of occurrences was calculated for each ES category within each ecosystem type, expressed as counts. This allowed for the comparison of ecosystem service distributions across the selected ecosystem types.
In the fifth step (Figure 1), we visually represented the geographical distribution and focus of ES studies across the world. Building on the classification of studies into ES categories and ecosystem types, the papers were coded according to the country where the study was performed. A specific country could be assigned to 2636 studies; 650 studies that did not specify a country of study were removed. Of these 2636 classified papers, 499 were global studies considering several countries.
We developed global maps (Figure 1), each offering a unique perspective on the ES research landscape. The first map presents the total number of ES trade-off studies conducted worldwide, illustrating the geographical spread and concentration of research efforts, to provide a clear overview of regions that have been extensively studied and those that may require more attention in future research. Additionally, we calculated two key metrics to assess research productivity more comprehensively: the number of research papers per capita and the number of research papers relative to Gross Domestic Product (GDP). For population and GDP, we used the most recent available data from the World Bank (https://data.worldbank.org). These alternative metrics normalize the data based on economic output and population size, providing a more balanced view of research activity across different countries (Figure S3).
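The normalization step can be sketched as follows; the country names and figures are made-up placeholders, and in practice the population and GDP columns would be filled from the World Bank download.

import pandas as pd

# Placeholder values; real population and GDP figures come from https://data.worldbank.org
df = pd.DataFrame({
    "country": ["CountryA", "CountryB"],
    "n_papers": [120, 45],
    "population": [50_000_000, 8_000_000],
    "gdp_usd": [1.2e12, 4.5e11],
})
df["papers_per_million_people"] = df["n_papers"] / (df["population"] / 1e6)
df["papers_per_billion_usd_gdp"] = df["n_papers"] / (df["gdp_usd"] / 1e9)
print(df)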
Detailed maps were created featuring pie charts that highlight the different categories of ES and ecosystem types addressed in each country. These charts offer an understanding of how various ES categories and ecosystems are represented in different parts of the world. Finally, we assigned ES trade-off studies to world regions (Africa, Antarctica, Asia, Australasia, Europe, Latin America, and North America), looking at the relationships between the categories of ES. We considered both papers that evaluated more than one category of ES and papers that considered only one category. This country-level analysis offers insights into regional research trends and priorities, contributing to a more localized understanding of ES studies.
In the sixth step (Figure 1), each publication in this review was critically appraised to evaluate its quality. The foundation for our critical appraisal is the comprehensive, multidimensional approach of Belcher et al. (2016), a framework for evaluating research quality that aligns well with the interdisciplinary nature of our study. Belcher et al. (2016) developed a robust framework that incorporates essential principles and criteria for assessing the quality of transdisciplinary research. This is particularly relevant for ecosystem services science and for our review, which contributes to advancing current knowledge by systematically synthesizing evidence on relationships among various ES across diverse systems.
The Belcher et al. (2016) framework emphasizes four main principles: relevance, credibility (which we have adapted as methodological transparency), legitimacy (generalizability in our context), and effectiveness (significance). A continuous scoring system (ranging from 0 to 1) was applied for the four main criteria to maintain simplicity and consistency across the large number of studies. In this system, a value closer to 0 indicates that a criterion is not met, while a value closer to 1 indicates that it is more closely met. This scoring method served as a useful indicator of the overall quality of each paper and of how well it met the review's goals.
Methodological Transparency was assessed based on the clarity and completeness of methodological descriptions, including data availability, the rigor of statistical analyses, methodological detail, and reproducibility of the findings. This criterion assesses the transparency and rigor of the study's methodology, including data collection, analysis, and reporting (Belcher et al. 2016). Relevance was evaluated by the study's alignment with the review's objectives, its importance to the field, and its practical applicability. This includes the extent to which the study addresses pertinent research
SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to the privacy concerns associated with real-world data, open datasets are rare (if not non-existent) in the research community. SynthPAI is a synthetic dataset that aims to fill this gap.
Dataset Details
Dataset Description
SynthPAI was created using 300 GPT-4 agents seeded with individual personalities interacting with each other in a simulated online forum, and consists of 103 threads and 7823 comments. For each profile, we further provide a set of personal attributes that a human could infer from the profile. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.
Curated by: SRILab at ETH Zurich. The dataset was not created on behalf of any outside entity.
Funded by: Two authors of this work are supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) (SERI-funded ERC Consolidator Grant). This project did not, however, receive explicit funding from SERI and was devised independently. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the SERI-funded ERC Consolidator Grant.
Shared by: SRILab at ETH Zurich
Language(s) (NLP): English
License: CC-BY-NC-SA-4.0
Dataset Sources
Repository: https://github.com/eth-sri/SynthPAI
Paper: https://arxiv.org/abs/2406.07217
Uses
The dataset is intended to be used as a privacy-preserving method of (i) evaluating PAI capabilities of language models and (ii) aiding the development of potential defenses against such automated inferences.
Direct Use
As in the associated paper, where we include an analysis of the personal attribute inference (PAI) capabilities of 18 state-of-the-art LLMs across different attributes and on anonymized texts.
Out-of-Scope Use
The dataset shall not be used as part of any system that performs attribute inferences on real natural persons without their consent or otherwise maliciously.
Dataset Structure
We provide the instance descriptions below. Each data point consists of a single comment (which can be a top-level post):
Comment
author str: unique identifier of the person writing
username str: corresponding username
parent_id str: unique identifier of the parent comment
thread_id str: unique identifier of the thread
children list[str]: unique identifiers of children comments
profile Profile: profile making the comment - described below
text str: text of the comment
guesses list[dict]: Dict containing model estimates of attributes based on the comment. Only contains attributes for which a prediction exists.
reviews dict: Dict containing human estimates of attributes based on the comment. Each guess has a corresponding hardness rating (and a certainty rating). Contains all attributes.
The associated profiles are structured as follows
Profile
username str: identifier
attributes: set of personal attributes that describe the user (directly listed below)
The corresponding attributes and values are
Attributes
Age continuous [18-99] The age of a user in years.
Place of Birth tuple [city, country] The place of birth of a user. We create tuples jointly for city and country in free-text format. (field name: birth_city_country)
Location tuple [city, country] The current location of a user. We create tuples jointly for city and country in free-text format. (field name: city_country)
Education free-text We use a free-text field to describe the user's education level. This includes additional details such as the degree and major. To ensure comparability with the evaluation of prior work, we later map these to a categorical scale: high school, college degree, master's degree, PhD.
Income Level free-text [low, medium, high, very high] The income level of a user. We first generate a continuous income level in the profile's local currency. In our code, we map this to a categorical value considering the distribution of income levels in the respective profile location. For this, we roughly follow the local equivalents of the following reference levels for the US: Low (<30k USD), Middle (30-60k USD), High (60-150k USD), Very High (>150k USD).
Occupation free-text The occupation of a user, described as a free-text field.
Relationship Status categorical [single, In a Relationship, married, divorced, widowed] The relationship status of a user as one of 5 categories.
Sex categorical [Male, Female] Biological Sex of a profile.
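Putting the pieces together, a minimal Python sketch of this data model might look as follows. Field names follow the descriptions above; the income helper encodes the US reference levels listed for the Income Level attribute and is illustrative rather than the project's actual mapping code.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Profile:
    username: str
    attributes: dict  # e.g. {"age": 34, "city_country": ("Zurich", "Switzerland"), ...}

@dataclass
class Comment:
    author: str               # unique identifier of the person writing
    username: str
    parent_id: Optional[str]  # None for top-level posts
    thread_id: str
    children: list            # unique identifiers of children comments
    profile: Profile
    text: str
    guesses: list             # model estimates, only for attributes with a prediction
    reviews: dict             # human estimates with hardness (and certainty) ratings

def income_level_us(income_usd: float) -> str:
    # US reference levels from the card; other locations use local equivalents
    if income_usd < 30_000:
        return "low"
    if income_usd < 60_000:
        return "middle"
    if income_usd < 150_000:
        return "high"
    return "very high"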
Dataset Creation
Curation Rationale
SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to the privacy concerns associated with real-world data, open datasets are rare (if not non-existent) in the research community. SynthPAI is a synthetic dataset that aims to fill this gap. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.
Source Data
The dataset is fully synthetic and was created using GPT-4 agents (version gpt-4-1106-preview) seeded with individual personalities interacting with each other in a simulated online forum.
Data Collection and Processing
The dataset was created by sampling comments from the agents in threads. A human then inferred a set of personal attributes from the sets of comments associated with each profile. The dataset was further manually reviewed to remove any offensive or inappropriate content. We give a detailed overview of our dataset-creation procedure in the corresponding paper.
Annotations
Annotations are provided by authors of the paper.
Personal and Sensitive Information
All contained personal information is purely synthetic and does not relate to any real individual.
Bias, Risks, and Limitations
All profiles are synthetic and do not correspond to any real subpopulations. We provide a distribution of the personal attributes of the profiles in the accompanying paper. As the dataset has been created synthetically, data points can inherit limitations (e.g., biases) from the underlying model, GPT-4. While we manually reviewed comments individually, we cannot provide respective guarantees.
Citation
BibTeX:
@misc{2406.07217,
  author = {Hanna Yukhymenko and Robin Staab and Mark Vero and Martin Vechev},
  title = {A Synthetic Dataset for Personal Attribute Inference},
  year = {2024},
  eprint = {arXiv:2406.07217},
}
APA:
Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev: “A Synthetic Dataset for Personal Attribute Inference”, 2024; arXiv:2406.07217.
Dataset Card Authors
Hanna Yukhymenko, Robin Staab, Mark Vero
Dataset used in World Bank Policy Research Working Paper #2876, published in World Bank Economic Review, No. 1, 2005, pp. 21-44.
The effects of globalization on income distribution in rich and poor countries are a matter of controversy. While international trade theory in its most abstract formulation implies that increased trade and foreign investment should make income distribution more equal in poor countries and less equal in rich countries, finding these effects has proved elusive. The author presents another attempt to discern the effects of globalization by using data from household budget surveys and looking at the impact of openness and foreign direct investment on relative income shares of low and high deciles. The author finds some evidence that at very low average income levels, it is the rich who benefit from openness. As income levels rise to those of countries such as Chile, Colombia, or the Czech Republic, the situation changes, and it is the relative income of the poor and the middle class that rises compared with the rich. It seems that openness makes income distribution worse before making it better; put differently, the effect of openness on a country's income distribution depends on the country's initial income level.
Aggregate data [agg]
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
Catholic_CO2_Footprint_Beta_Full_NULL_NAmerica2
Burhans, Molly A., Cheney, David M., Gerlt, R. “Catholic_CO2_Footprint_Beta_Full_NULL_NAmerica2”. Scale not given. Version 1.0. MO and CT, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2019.

Methodology
This is the first global carbon footprint of the Catholic population. We will continue to improve and develop these data with our research partners over the coming years. While it is helpful, it should also be viewed and used as a "beta" prototype that we and our research partners will build from and improve. The years of carbon data are 2010 and 2015 (shown). The year of Catholic data is 2018. The year of population data is 2016. Care should be taken during future developments to harmonize the years used for Catholic, population, and CO2 data.
1. Zonal statistics: Esri population data and dioceses --> population per diocese (non-Vatican-based numbers)
2. Zonal statistics: FFDAS, dioceses, and the population dataset --> mean CO2 per diocese
3. Field calculation: population per diocese and mean CO2 per diocese --> CO2 per capita
4. Field calculation: CO2 per capita * Catholic population --> Catholic carbon footprint

Assumptions
Per-capita CO2: Deriving per-capita CO2 from mean CO2 in a geography assumes that people's footprint accounts for their personal lifestyle and involvement in local businesses and industries that contribute CO2.
Catholic CO2: Assumes that Catholics and non-Catholics have similar CO2 footprints from their lifestyles.

Derived from: "A multiyear, global gridded fossil fuel CO2 emission data product: Evaluation and analysis of results" (http://ffdas.rc.nau.edu/About.html)
Rayner et al., JGR, 2010 - the first FFDAS paper, describing the version 1.0 methods and results, published in the Journal of Geophysical Research.
Asefi et al., 2014 - the paper describing the methods and results of FFDAS version 2.0, published in the Journal of Geophysical Research.
Readme version 2.2 - a simple readme file to assist in using the 10 km x 10 km, hourly gridded Vulcan version 2.2 results.
Liu et al., 2017 - a paper exploring the carbon cycle response to the 2015-2016 El Nino through the use of carbon cycle data assimilation with FFDAS as the boundary condition for FFCO2.
S. Asefi-Najafabady, P. J. Rayner, K. R. Gurney, A. McRobert, Y. Song, K. Coltin, J. Huang, C. Elvidge, K. Baugh. First published: 10 September 2014. https://doi.org/10.1002/2013JD021296. Cited by: 30.
Link to FFDAS data retrieval and visualization: http://hpcg.purdue.edu/FFDAS/index.php

Abstract
High-resolution, global quantification of fossil fuel CO2 emissions is emerging as a critical need in carbon cycle science and climate policy. We build upon a previously developed fossil fuel data assimilation system (FFDAS) for estimating global high-resolution fossil fuel CO2 emissions. We have improved the underlying observationally based data sources, expanded the approach through treatment of separate emitting sectors including a new pointwise database of global power plants, and extended the results to cover a 1997 to 2010 time series at a spatial resolution of 0.1°. Long-term trend analysis of the resulting global emissions shows subnational spatial structure in large active economies such as the United States, China, and India. These three countries, in particular, show different long-term trends, and exploration of the trends in nighttime lights and population reveals a decoupling of population and emissions at the subnational level. Analysis of shorter-term variations reveals the impact of the 2008-2009 global financial crisis with widespread negative emission anomalies across the U.S. and Europe. We have used a center of mass (CM) calculation as a compact metric to express the time evolution of spatial patterns in fossil fuel CO2 emissions. The global emission CM has moved toward the east and somewhat south between 1997 and 2010, driven by the increase in emissions in China and South Asia over this time period. Analysis at the level of individual countries reveals per capita CO2 emission migration in both Russia and India. The per capita emission CM holds potential as a way to succinctly analyze subnational shifts in carbon intensity over time. Uncertainties are generally lower than the previous version of FFDAS due mainly to an improved nightlight data set.

Global Diocesan Boundaries
Burhans, M., Bell, J., Burhans, D., Carmichael, R., Cheney, D., Deaton, M., Emge, T., Gerlt, B., Grayson, J., Herries, J., Keegan, H., Skinner, A., Smith, M., Sousa, C., Trubetskoy, S. “Diocesan Boundaries of the Catholic Church” [Feature Layer]. Scale not given. Version 1.2. Redlands, CA, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2016.
Using: ArcGIS 10.4. Redlands, CA: Environmental Systems Research Institute, Inc., 2016.

Boundary Provenance: Statistics and Leadership Data
Cheney, D.M. “Catholic Hierarchy of the World” [Database]. Date updated: August 2019. Catholic Hierarchy. Using: Paradox. Retrieved from original source.
Annuario Pontificio per l'Anno. Città del Vaticano: Tipografia Poliglotta Vaticana, multiple years.
The data for these maps was extracted from the gold standard of Church data, the Annuario Pontificio, published yearly by the Vatican. The collection and data development of the Vatican Statistics Office are unknown. GoodLands is not responsible for errors within this data. We encourage people to document and report errant information to us at data@good-lands.org or directly to the Vatican. Additional information about regular changes in bishops and sees comes from a variety of public diocesan and news announcements.
GoodLands' polygon data layers, version 2.0, for global ecclesiastical boundaries of the Roman Catholic Church: Although care has been taken to ensure the accuracy, completeness and reliability of the information provided, this being the first developed dataset of global ecclesiastical boundaries curated from many sources, it may have a higher margin of error than established geopolitical administrative boundary maps. Boundaries need to be verified with appropriate ecclesiastical leadership. The current information is subject to change without notice. No parties involved with the creation of this data are liable for indirect, special or incidental damage resulting from, arising out of or in connection with the use of the information. We referenced 1960 sources to build our global datasets of ecclesiastical jurisdictions. Often, they were isolated images of dioceses, historical documents and information about parishes that were cross-checked. These sources can be viewed here: https://docs.google.com/spreadsheets/d/11ANlH1S_aYJOyz4TtG0HHgz0OLxnOvXLHMt4FVOS85Q/edit#gid=0
To learn more or contact us, please visit: https://good-lands.org/

Esri Gridded Population Data 2016
Description
This layer is a global estimate of human population for 2016. Esri created this estimate by modeling a footprint of where people live as a dasymetric settlement likelihood surface, and then assigned 2016 population estimates stored on polygons of the finest level of geography available onto the settlement surface. "Where people live" means where their homes are, as in where people sleep most of the time; this is opposed to where they work. Another way to think of this estimate is a night-time estimate, as opposed to a day-time estimate. Knowledge of population distribution helps us understand how humans affect the natural world and how natural events such as storms, earthquakes, and other phenomena affect humans. This layer represents the footprint of where people live, and how many people live there.

Dataset Summary
Each cell in this layer has an integer value with the estimated number of people likely to live in the geographic region represented by that cell. Esri additionally produced several related layers:
World Population Estimate Confidence 2016: the confidence level (1-5) per cell for the probability of people being located and estimated correctly.
World Population Density Estimate 2016: this layer is represented as population density in units of persons per square kilometer.
World Settlement Score 2016: the dasymetric likelihood surface used to create this layer by apportioning population from census polygons to the settlement score raster.
To use this layer in analysis, there are several properties or geoprocessing environment settings that should be used:
Coordinate system: WGS_1984. This service and its underlying data are WGS_1984. We do this because projecting population count data actually changes the populations due to resampling and either collapsing or splitting cells to fit into another coordinate system.
Cell size: 0.0013474728 degrees (approximately 150 meters) at the equator.
No data: -1. Bit depth: 32-bit signed.
This layer has query, identify, pixel, and export image functions enabled, and is restricted to a maximum analysis size of 30,000 x 30,000 pixels - an area about the size of Africa.
Frye, C. et al. (2018). Using Classified and Unclassified Land Cover Data to Estimate the Footprint of Human Settlement. Data Science Journal, 17, p. 20. DOI: http://doi.org/10.5334/dsj-2018-020
What can you do with this layer? This layer is unsuitable for mapping or cartographic use, and thus does not include a convenient legend. Instead, it is useful for analysis, particularly for estimating counts of people living within watersheds, coastal areas, and other areas that do not have standard boundaries. Esri recommends using the Zonal Statistics tool or the Zonal Statistics to Table tool, where you provide input zones as either polygons or raster data, and the tool will summarize the count of population within those zones. https://www.esri.com/arcgis-blog/products/arcgis-living-atlas/data-management/2016-world-population-estimate-services-are-now-available/
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
Estimates of the markup for the Primary Foods industry (comprising agriculture, hunting, fishing, and logging) using the De Loecker et al. (2020) methodology. Instead of using micro-based, firm-level data, we calculate the markups using aggregate macro-data. The database contains information on 170 countries for the years 1995-2015.
Our sources of data are twofold: the first is the EORA input-output database, and the second is the UN FAOSTAT database. Our paper, Rodriguez del Valle and Fernández-Vázquez (2024), explains in more detail the estimation technique, based on Generalized Maximum Entropy, employed to derive these estimates.
The dataset can be used to explore and research a myriad of topics, including the impact globalization has on markups, the role of institutional quality, and even climate. We have found strong evidence that the percentage share of value added required for production originating from abroad is significantly connected to lower markups. We have also found compelling empirical evidence that institutional quality can impact the evolution of markups.
Please note that we make the strong assumption, consistent with input-output theory, that each industry is produced by one "firm". For this industry, we believe the assumption will not bias results too strongly, particularly when analyzing developing countries. Results may be biased in countries where large farms exist.
Industrial Organization, Macroeconomics, Microeconomics, Agriculture Industry
Adrian Rodriguez del Valle, Esteban Fernandez-Vazquez
Institutions
Universidad de Oviedo
UPDATED on October 15, 2020: After some mistakes in some of the data were found, we updated this data set. The changes to the data are detailed on Zenodo (http://doi.org/10.5281/zenodo.4061807), and an Erratum has been submitted.

This data set, under CC-BY license, contains time series of total abundance and/or biomass of assemblages of insects, arachnids and Entognatha (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 165 data sources, representing a total of 1668 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses. This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange - a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

The data set consists of four tables, linked by the columns 'DataSource_ID' and 'Plot_ID', representing information on the study level, the plot level, the sampling, and the measured assemblage sizes, plus a table with references to the original research.

In the table 'DataSources', descriptive data is provided at the dataset level: links are provided to online repositories where the original data can be found; it describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

In the table 'PlotData', more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and whether there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, and protection status.

The table 'SampleData' describes the exact source of the data (table X, figure X, etc.), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here; more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each data source may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching the tables 'DataSources' and 'PlotData', as well as the year of sampling, a descriptor of the period within the year of sampling (used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column 'Number', missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add. Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID', will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details are available in the data paper.

WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
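For example, the full analysis dataframe described above could be reassembled with pandas roughly as follows; this is a sketch, with column and file names taken from the description above.

import pandas as pd

abundance = pd.read_csv("InsectAbundanceBiomassData.csv")
plots = pd.read_csv("PlotData.csv")
sources = pd.read_csv("DataSources.csv")

# Link the tables as described: plots via 'Plot_ID', data sources via 'DataSource_ID'
full = abundance.merge(plots, on="Plot_ID").merge(sources, on="DataSource_ID")

# Years without samples are included as NA in 'Number'; drop them for trend fitting
full = full.dropna(subset=["Number"])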
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions table in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
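A small helper illustrates how a KernelVersions id maps onto this layout. This is a sketch; the exact folder-name padding and the file extension (which depends on the notebook's language) should be checked against the files themselves.

def kernel_version_dir(version_id: int) -> str:
    # Top-level folder groups ids by millions, subfolder by thousands,
    # e.g. id 123456789 -> "123/456" (which holds ids 123456000 to 123456999).
    top = version_id // 1_000_000
    sub = (version_id // 1_000) % 1_000
    return f"{top}/{sub}"

print(kernel_version_dir(123_456_789))  # -> 123/456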
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
The Assessing Responses and Impacts of Solar climate intervention on the Earth system with stratospheric aerosol injection (ARISE-SAI-1.5) simulations utilize a moderate emission scenario, introduce stratospheric aerosol injection at approximately 21 km altitude in year 2035, and keep global mean surface air temperature near 1.5 °C above the pre-industrial value. CESM2 (WACCM6) global output from all model components (atmosphere, ice, land, ocean) is provided at monthly, daily and sub-daily frequencies. All atmospheric data are on the original CESM2 (WACCM6) grid (0.9 by 1.25 degrees). All data are in time-series, NetCDF format. More details about this dataset (including information on reference simulations) can be found on the ARISE-SAI-1.5 CESM Community Project page linked below under Related Resources.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This file contains a summary of data and code used in the paper:
M. O. Cuthbert, G. C. Rau, M. Ekström, D. M. O'Carroll & A. J. Bates (2022). Global climate-driven trade-offs between the water retention and cooling benefits of urban greening. Nature Communications. https://doi.org/10.1038/s41467-022-28160-8
See the ReadMe file uploaded with the data and the Methods section of the paper for details of the derivation of each dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Please cite the following paper when using this dataset:
N. Thakur, “A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave,” Journal of Data, vol. 7, no. 8, p. 109, Aug. 2022, doi: 10.3390/data7080109
Abstract
The COVID-19 Omicron variant, reported to be the most immune-evasive variant of COVID-19, is resulting in a surge of COVID-19 cases globally. This has caused schools, colleges, and universities in different parts of the world to transition to online learning. As a result, social media platforms such as Twitter are seeing an increase in conversations related to online learning, centered around information seeking and sharing. Mining such conversations, such as Tweets, to develop a dataset can provide a data resource for interdisciplinary research on interest, views, opinions, perspectives, attitudes, and feedback towards online learning during the current surge of COVID-19 cases caused by the Omicron variant. Therefore, this work presents a large-scale public Twitter dataset of conversations about online learning since the first detected case of the COVID-19 Omicron variant in November 2021. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.
Data Description
The dataset comprises a total of 52,984 Tweet IDs (corresponding to the same number of Tweets) about online learning posted on Twitter from 9th November 2021 to 13th July 2022. The earliest date was selected as 9th November 2021 because the Omicron variant was first detected in a sample collected on that date. 13th July 2022 was the most recent date at the time of data collection and publication of this dataset.
The dataset consists of 9 .txt files. An overview of these dataset files along with the number of Tweet IDs and the date range of the associated tweets is as follows. Table 1 shows the list of all the synonyms or terms that were used for the dataset development.
Filename: TweetIDs_November_2021.txt (No. of Tweet IDs: 1283, Date Range of the associated Tweet IDs: November 1, 2021 to November 30, 2021)
Filename: TweetIDs_December_2021.txt (No. of Tweet IDs: 10545, Date Range of the associated Tweet IDs: December 1, 2021 to December 31, 2021)
Filename: TweetIDs_January_2022.txt (No. of Tweet IDs: 23078, Date Range of the associated Tweet IDs: January 1, 2022 to January 31, 2022)
Filename: TweetIDs_February_2022.txt (No. of Tweet IDs: 4751, Date Range of the associated Tweet IDs: February 1, 2022 to February 28, 2022)
Filename: TweetIDs_March_2022.txt (No. of Tweet IDs: 3434, Date Range of the associated Tweet IDs: March 1, 2022 to March 31, 2022)
Filename: TweetIDs_April_2022.txt (No. of Tweet IDs: 3355, Date Range of the associated Tweet IDs: April 1, 2022 to April 30, 2022)
Filename: TweetIDs_May_2022.txt (No. of Tweet IDs: 3120, Date Range of the associated Tweet IDs: May 1, 2022 to May 31, 2022)
Filename: TweetIDs_June_2022.txt (No. of Tweet IDs: 2361, Date Range of the associated Tweet IDs: June 1, 2022 to June 30, 2022)
Filename: TweetIDs_July_2022.txt (No. of Tweet IDs: 1057, Date Range of the associated Tweet IDs: July 1, 2022 to July 13, 2022)
The dataset contains only Tweet IDs, in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used. For hydrating this dataset, the Hydrator application may be used.
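Before hydration, the IDs can be collected from the monthly files with a few lines of Python; this sketch assumes the nine .txt files sit in the current directory.

import glob

tweet_ids = []
for path in sorted(glob.glob("TweetIDs_*.txt")):
    with open(path) as f:
        tweet_ids.extend(line.strip() for line in f if line.strip())
print(len(tweet_ids))  # expected: 52984 IDs across the 9 files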
Table 1. List of commonly used synonyms, terms, and phrases for online learning and COVID-19 that were used for the dataset development
Terminology: COVID-19
Synonyms and terms: Omicron, COVID, COVID19, coronavirus, coronaviruspandemic, COVID-19, corona, coronaoutbreak, omicron variant, SARS CoV-2, corona virus

Terminology: online learning
Synonyms and terms: online education, online learning, remote education, remote learning, e-learning, elearning, distance learning, distance education, virtual learning, virtual education, online teaching, remote teaching, virtual teaching, online class, online classes, remote class, remote classes, distance class, distance classes, virtual class, virtual classes, online course, online courses, remote course, remote courses, distance course, distance courses, virtual course, virtual courses, online school, virtual school, remote school, online college, online university, virtual college, virtual university, remote college, remote university, online lecture, virtual lecture, remote lecture, online lectures, virtual lectures, remote lectures
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification).
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
pip install zarr
import zarr
# Open the two arrays of a sample ("path/to/sample.zarr" is a placeholder)
raw = zarr.open("path/to/sample.zarr", mode="r", path="volumes/raw")
seg = zarr.open("path/to/sample.zarr", mode="r", path="volumes/gt_instances")
# optional: load into memory as numpy arrays
import numpy as np
raw_np = np.array(raw)
Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also explicitly be converted to numpy arrays.
We recommend using napari to view the image data.
pip install "napari[all]"
import zarr, sys, napari
# zarr.load reads the arrays fully into memory; no mode argument is needed
raw = zarr.load(sys.argv[1], path="volumes/raw")
gts = zarr.load(sys.argv[1], path="volumes/gt_instances")
viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(
        gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
python view_data.py path/to/sample.zarr
For more information on our selected metrics and formal definitions please see our paper.
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN), and a non-learnt, application-specific color clustering from Duan et al.
For detailed information on the methods and the quantitative results please see our paper.
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
title = {FISBe: A real-world benchmark dataset for instance
segmentation of long-range thin filamentous structures},
author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
year = 2024,
eprint = {2404.00130},
archivePrefix ={arXiv},
primaryClass = {cs.CV}
}
We thank Aljoscha Nern for providing unpublished MCFO images, as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.
There have been no changes to the dataset so far.
All future changes will be listed on the changelog page.
If you would like to contribute, have encountered any issues, or have any suggestions, please open an issue for the FISBe dataset in the accompanying GitHub repository.
All contributions are welcome!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Extreme sea levels, generated by storm surges and high tides, have the potential to cause coastal flooding and erosion. Global datasets are instrumental for mapping extreme sea levels and the associated societal risks. Harnessing the backward extension of the ERA5 reanalysis, we present a dataset containing the statistics of water levels based on a global hydrodynamic model (GTSMv3.0) covering the period 1950-2024. This is an extension of a previously published dataset for 1979-2018 (Muis et al. 2020). The timeseries (10-min, hourly mean and daily maxima) are available via the Climate Data Store of ECMWF at DOI: 10.24381/cds.a6d42d60. Using this extended ERA5 dataset, we calculate percentiles and estimate extreme water levels for various return periods globally. The percentiles dataset includes the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 99th percentiles. The extreme water levels include return values for 1, 2, 5, 10, 25, 50, 75 and 100 years; they are estimated using the POT-GPD method with a threshold at the 99th percentile of the timeseries, a 72-hour window for declustering peak events, and the MLE method for fitting the GPD parameters. The GPD parameters (shape, scale and location) are also supplied with this dataset.
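For readers who want to reproduce the flavor of this analysis on their own timeseries, the following is a minimal sketch of the POT-GPD step with scipy. It omits the 72-hour declustering and is not the code used to produce the dataset.

import numpy as np
from scipy.stats import genpareto

def gpd_return_levels(series, n_years, return_periods=(1, 2, 5, 10, 25, 50, 75, 100)):
    # series: 1-D numpy array of water levels; n_years: record length in years.
    # Peaks-over-threshold with the 99th percentile as threshold.
    u = np.percentile(series, 99)
    exceedances = series[series > u] - u
    # NOTE: a real analysis first declusters peaks (here: a 72-hour window).
    shape, _, scale = genpareto.fit(exceedances, floc=0)  # MLE, location fixed at 0
    rate = len(exceedances) / n_years                     # mean number of peaks per year
    rp = np.asarray(return_periods, dtype=float)
    return u + (scale / shape) * ((rate * rp) ** shape - 1.0)  # assumes shape != 0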
Validation of the underlying timeseries and the statistical values shows that there is a good agreement between observed and modelled sea levels, with the level of agreement being very similar to that of the previously published dataset. The extended 75-year dataset allows for a more robust estimation of extremes, often resulting in smaller uncertainties than its 40-year precursor. The present dataset can be used in global assessments of flood risk, climate variability and climate changes.
Global modelling of water levels and extreme value analysis are associated with a number of uncertainties and limitations that are particularly important to consider when conducting local assessments. Please refer to the Usage Notes in the corresponding manuscript (Aleksandrova et al. 2025, paper currently under review) for an overview of limitations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Due to this intrinsic challenge, no global, long-term, univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments that filled satellite-like gaps introduced into GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.
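To give an idea of the algorithm, below is a highly simplified 1-D sketch of DCT-based penalized least squares gap filling in the spirit of Garcia (2010); the operational product uses a more elaborate, calibrated implementation.

import numpy as np
from scipy.fft import dct, idct

def dct_pls_fill(y, s=1.0, n_iter=100):
    # y: 1-D series with gaps as NaN; s: smoothing parameter (larger = smoother).
    y = np.asarray(y, dtype=float)
    n = y.size
    w = (~np.isnan(y)).astype(float)       # 0 at gaps, 1 at observations
    y0 = np.nan_to_num(y)                  # gap values are masked out by w anyway
    lam = -2.0 + 2.0 * np.cos(np.arange(n) * np.pi / n)
    gamma = 1.0 / (1.0 + s * lam**2)       # low-pass filter in DCT space
    z = np.full(n, np.nanmean(y))          # crude initialization
    for _ in range(n_iter):                # smooth, then re-impose observed values
        z = idct(gamma * dct(w * y0 + (1.0 - w) * z, norm="ortho"), norm="ortho")
    return z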
You can use command-line tools such as wget or curl to download (and extract) data for multiple years. The following script downloads and extracts the complete dataset to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
    echo "Downloading $year.zip..."
    wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
    unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
    rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), with each subdirectory containing one netCDF image file per day (DD) and month (MM) on a 2-dimensional (longitude, latitude) grid (CRS: WGS84). File names follow the convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
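For example, a single daily file can be inspected with any netCDF-aware Python library; a short xarray sketch (the date in the file name is only an example):

import xarray as xr

# Open one daily image and list its variables and their CF metadata
ds = xr.open_dataset(
    "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc")
print(ds)  # coordinates, data variables, global attributes
for name, var in ds.data_vars.items():
    print(name, var.attrs.get("long_name", ""), var.attrs.get("units", ""))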
Changes in v9.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:
The following records are all part of the "Soil Moisture Climate Data Records from satellites" community:
ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Catholic Carbon Footprint Story Map
Map Data: Burhans, Molly A., Cheney, David M., Gerlt, R. "PerCapita_CO2_Footprint_InDioceses_FULL". Scale not given. Version 1.0. MO and CT, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2019.
Map Development: Molly Burhans
Methodology
This is the first global carbon footprint of the Catholic population. We will continue to improve and develop these data with our research partners over the coming years. While it is helpful, it should also be viewed and used as a "beta" prototype that we and our research partners will build from and improve. The carbon data cover the years 2010 and 2015 (2015 shown). The Catholic data are for 2018; the population data are for 2016. Care should be taken during future developments to harmonize the years used for Catholic, population, and CO2 data.
1. Zonal statistics: Esri population data and dioceses --> population per diocese (non-Vatican-based numbers)
2. Zonal statistics: FFDAS, dioceses, and the population dataset --> mean CO2 per diocese
3. Field calculation: population per diocese and mean CO2 per diocese --> CO2 per capita
4. Field calculation: CO2 per capita * Catholic population --> Catholic carbon footprint
Assumptions
Per capita CO2: Deriving per-capita CO2 from mean CO2 in a geography assumes that people's footprint accounts for their personal lifestyle and their involvement in local businesses and industries that contribute CO2.
Catholic CO2: Assumes that Catholics and non-Catholics have similar CO2 footprints from their lifestyles.
Derived from: A multiyear, global gridded fossil fuel CO2 emission data product: Evaluation and analysis of results. http://ffdas.rc.nau.edu/About.html
Rayner et al., JGR, 2010 - the first FFDAS paper, describing the version 1.0 methods and results, published in the Journal of Geophysical Research.
Asefi et al., 2014 - the paper describing the methods and results of FFDAS version 2.0, published in the Journal of Geophysical Research.
Readme version 2.2 - a simple readme file to assist in using the 10 km x 10 km, hourly gridded Vulcan version 2.2 results.
Liu et al., 2017 - a paper exploring the carbon cycle response to the 2015-2016 El Nino through carbon cycle data assimilation with FFDAS as the boundary condition for FFCO2.
"S. Asefi-Najafabady, P. J. Rayner, K. R. Gurney, A. McRobert, Y. Song, K. Coltin, J. Huang, C. Elvidge, K. Baugh. First published: 10 September 2014. https://doi.org/10.1002/2013JD021296. Cited by: 30. Link to FFDAS data retrieval and visualization: http://hpcg.purdue.edu/FFDAS/index.php
Abstract: High-resolution, global quantification of fossil fuel CO2 emissions is emerging as a critical need in carbon cycle science and climate policy. We build upon a previously developed fossil fuel data assimilation system (FFDAS) for estimating global high-resolution fossil fuel CO2 emissions. We have improved the underlying observationally based data sources, expanded the approach through treatment of separate emitting sectors including a new pointwise database of global power plants, and extended the results to cover a 1997 to 2010 time series at a spatial resolution of 0.1°. Long-term trend analysis of the resulting global emissions shows subnational spatial structure in large active economies such as the United States, China, and India. These three countries, in particular, show different long-term trends, and exploration of the trends in nighttime lights and population reveals a decoupling of population and emissions at the subnational level. Analysis of shorter-term variations reveals the impact of the 2008-2009 global financial crisis, with widespread negative emission anomalies across the U.S. and Europe. We have used a center of mass (CM) calculation as a compact metric to express the time evolution of spatial patterns in fossil fuel CO2 emissions. The global emission CM has moved toward the east and somewhat south between 1997 and 2010, driven by the increase in emissions in China and South Asia over this time period. Analysis at the level of individual countries reveals per capita CO2 emission migration in both Russia and India. The per capita emission CM holds potential as a way to succinctly analyze subnational shifts in carbon intensity over time. Uncertainties are generally lower than the previous version of FFDAS due mainly to an improved nightlight data set."
Global Diocesan Boundaries: Burhans, M., Bell, J., Burhans, D., Carmichael, R., Cheney, D., Deaton, M., Emge, T., Gerlt, B., Grayson, J., Herries, J., Keegan, H., Skinner, A., Smith, M., Sousa, C., Trubetskoy, S. "Diocesan Boundaries of the Catholic Church" [Feature Layer]. Scale not given. Version 1.2. Redlands, CA, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2016.
Using: ArcGIS 10.4. Version 10.0. Redlands, CA: Environmental Systems Research Institute, Inc., 2016.
Boundary Provenance, Statistics and Leadership Data
Cheney, D.M. "Catholic Hierarchy of the World" [Database]. Date updated: August 2019. Catholic Hierarchy. Using: Paradox. Retrieved from original source.
Annuario Pontificio per l'Anno. Città del Vaticano: Tipografia Poliglotta Vaticana, multiple years.
The data for these maps were extracted from the gold standard of Church data, the Annuario Pontificio, published yearly by the Vatican. The collection and data-development practices of the Vatican Statistics Office are unknown. GoodLands is not responsible for errors within this data. We encourage people to document and report errant information to us at data@good-lands.org or directly to the Vatican. Additional information about regular changes in bishops and sees comes from a variety of public diocesan and news announcements.
GoodLands' polygon data layers, version 2.0, for global ecclesiastical boundaries of the Roman Catholic Church: Although care has been taken to ensure the accuracy, completeness and reliability of the information provided, this is the first developed dataset of global ecclesiastical boundaries curated from many sources, so it may have a higher margin of error than established geopolitical administrative boundary maps. Boundaries need to be verified with the appropriate ecclesiastical leadership. The current information is subject to change without notice. No parties involved in the creation of this data are liable for indirect, special or incidental damage resulting from, arising out of or in connection with the use of the information. We referenced 1,960 sources to build our global datasets of ecclesiastical jurisdictions; often these were isolated images of dioceses, historical documents, and information about parishes that were cross-checked. These sources can be viewed here: https://docs.google.com/spreadsheets/d/11ANlH1S_aYJOyz4TtG0HHgz0OLxnOvXLHMt4FVOS85Q/edit#gid=0
To learn more or contact us, please visit: https://good-lands.org/
Esri Gridded Population Data 2016
Description
This layer is a global estimate of human population for 2016. Esri created this estimate by modeling a footprint of where people live as a dasymetric settlement likelihood surface, and then assigned 2016 population estimates stored on polygons of the finest level of geography available onto the settlement surface. "Where people live" means where their homes are, i.e., where people sleep most of the time, as opposed to where they work. Another way to think of this estimate is as a night-time estimate rather than a day-time estimate. Knowledge of population distribution helps us understand how humans affect the natural world and how natural events such as storms, earthquakes, and other phenomena affect humans. This layer represents the footprint of where people live, and how many people live there.
Dataset Summary
Each cell in this layer has an integer value with the estimated number of people likely to live in the geographic region represented by that cell. Esri additionally produced several related layers:
World Population Estimate Confidence 2016: the confidence level (1-5) per cell for the probability of people being located and estimated correctly.
World Population Density Estimate 2016: this layer is represented as population density in units of persons per square kilometer.
World Settlement Score 2016: the dasymetric likelihood surface used to create this layer by apportioning population from census polygons to the settlement score raster.
To use this layer in analysis, several properties or geoprocessing environment settings should be used:
Coordinate system: WGS_1984. This service and its underlying data are in WGS_1984. Projecting population count data will change the populations, because resampling either collapses or splits cells to fit another coordinate system.
Cell size: 0.0013474728 degrees (approximately 150 meters) at the equator.
No data: -1. Bit depth: 32-bit signed.
This layer has query, identify, pixel, and export image functions enabled, and is restricted to a maximum analysis size of 30,000 x 30,000 pixels - an area about the size of Africa.
Frye, C. et al. (2018). Using Classified and Unclassified Land Cover Data to Estimate the Footprint of Human Settlement. Data Science Journal, 17, p.20. DOI: http://doi.org/10.5334/dsj-2018-020
What can you do with this layer?
This layer is not suitable for mapping or cartographic use, and thus does not include a convenient legend. Instead, it is useful for analysis, particularly for estimating counts of people living within watersheds, coastal areas, and other areas that do not have standard boundaries. Esri recommends using the Zonal Statistics tool or the Zonal Statistics to Table tool, where you provide input zones as either polygons or raster data, and the tool summarizes the count of population within those zones. https://www.esri.com/arcgis-blog/products/arcgis-living-atlas/data-management/2016-world-population-estimate-services-are-now-available/
The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world surveillance videos, with 13 realistic anomalies including Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism. These anomalies are selected because they have a significant impact on public safety.
This dataset can be used for two tasks: first, general anomaly detection, treating all anomalies as one group and all normal activities as another; and second, recognizing each of the 13 anomalous activities.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Forest Proximate People" (FPP) dataset is one of the data layers contributing to the development of indicator #13, “number of forest-dependent people in extreme poverty,” of the Collaborative Partnership on Forests (CPF) Global Core Set of forest-related indicators (GCS). The FPP dataset provides an estimate of the number of people living in or within 5 kilometers of forests (forest-proximate people) for the year 2019 with a spatial resolution of 100 meters at a global level.
For more detail, such as the theory behind this indicator and the definition of parameters, and to cite this data, see: Newton, P., Castle, S.E., Kinzer, A.T., Miller, D.C., Oldekop, J.A., Linhares-Juvenal, T., Pina, L., Madrid, M., & de Lamo, J. 2022. The number of forest- and tree-proximate people: A new methodology and global estimates. Background Paper to The State of the World's Forests 2022 report. Rome, FAO.
Contact points:
Maintainer: Leticia Pina
Maintainer: Sarah E. Castle
Data lineage:
The FPP data are generated using Google Earth Engine. Forests are defined following the Copernicus Global Land Cover (CGLC) (Buchhorn et al. 2020) classification system's definition of forests: tree cover ranging from 15-100%, with or without an understory of shrubs and grassland, and including both open and closed forests. Any area classified as forest and sized ≥ 1 ha in 2019 was included in this definition. Population density was defined by the WorldPop global population data for 2019 (WorldPop 2018). High-density urban populations were excluded from the analysis. High-density urban areas were defined as any contiguous area with a total population (using 2019 WorldPop population data) of at least 50,000 people and comprised of pixels which each met at least one of two criteria: either the pixel a) had at least 1,500 people per square km, or b) was classified as "built-up" land use by the CGLC dataset (where "built-up" was defined as land covered by buildings and other manmade structures) (Dijkstra et al. 2020). Using these datasets, any rural people living in or within 5 kilometers of forests in 2019 were classified as forest-proximate people. Euclidean distance was used to create a 5-kilometer buffer zone around each forest cover pixel, as sketched below. The scripts for generating the forest-proximate people and rural-urban datasets using different parameters or for different years are published and available to users. For more detail, such as the theory behind this indicator and the definition of parameters, and to cite this data, see: Newton, P., Castle, S.E., Kinzer, A.T., Miller, D.C., Oldekop, J.A., Linhares-Juvenal, T., Pina, L., Madrid, M., & de Lamo, J. 2022. The number of forest- and tree-proximate people: A new methodology and global estimates. Background Paper to The State of the World's Forests 2022 report. Rome, FAO.
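As a conceptual illustration only (the published scripts run in Google Earth Engine; this Python sketch approximates the 5 km Euclidean buffer on a projected 100 m raster and ignores geodesic effects):

import numpy as np
from scipy.ndimage import distance_transform_edt

def forest_proximate_population(forest_mask, rural_population,
                                pixel_size_m=100.0, cutoff_m=5000.0):
    """Sum the rural population in or within cutoff_m of forest pixels."""
    # distance_transform_edt measures the distance to the nearest zero-valued
    # cell, so forest cells are encoded as 0 (False) by inverting the mask.
    dist_px = distance_transform_edt(~forest_mask.astype(bool))
    proximate = dist_px * pixel_size_m <= cutoff_m
    return float(rural_population[proximate].sum())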
References:
Buchhorn, M., Smets, B., Bertels, L., De Roo, B., Lesiv, M., Tsendbazar, N.E., Herold, M., Fritz, S., 2020. Copernicus Global Land Service: Land Cover 100m: collection 3 epoch 2019. Globe.
Dijkstra, L., Florczyk, A.J., Freire, S., Kemper, T., Melchiorri, M., Pesaresi, M. and Schiavina, M., 2020. Applying the degree of urbanisation to the globe: A new harmonised definition reveals a different picture of global urbanisation. Journal of Urban Economics, p.103312.
WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University, 2018. Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076). https://dx.doi.org/10.5258/SOTON/WP00645
Online resources:
GEE asset for "Forest proximate people - 5km cutoff distance"
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This dataset contains global spatially predicted sea-surface iodide concentrations at a monthly resolution for the year 1970. It was developed as part of the NERC project "Iodide in the ocean: distribution and impact on iodine flux and ozone loss" (NE/N009983/1), which aimed to quantify the dominant controls on the sea-surface iodide distribution and improve parameterisation of the sea-to-air iodine flux and of ozone deposition.
This dataset is the output used in the published paper 'A machine learning based global sea-surface iodide distribution' (https://doi.org/10.5194/essd-2019-40).
The main ensemble prediction ("Ensemble_Monthly_mean") is provided in a NetCDF file as a single variable (1). A second file (2) includes all of the individual ensemble-member predictions and the standard deviation of the prediction:
(1) predicted_iodide_0.125x0.125_Ns_Just_Ensemble.nc
(2) predicted_iodide_0.125x0.125_Ns_All_Ensemble_members.nc
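A minimal xarray sketch for reading the core file (the variable name follows the description above; the 'time' dimension name is an assumption):

import xarray as xr

ds = xr.open_dataset("predicted_iodide_0.125x0.125_Ns_Just_Ensemble.nc")
iodide = ds["Ensemble_Monthly_mean"]   # monthly-mean ensemble prediction
print(iodide.dims, iodide.shape)
january = iodide.isel(time=0)          # first monthly field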
For ease of use, this output has been re-gridded to various commonly used atmosphere and ocean model resolutions (see SI Table A5 in the paper). These re-gridded files are included in the folder titled "regridded_data".
Additionally, a further file (3) is provided that includes the prediction made with data from the Skagerrak dataset included. As stated in the paper referenced above, it is recommended to use the core files (1, 2) or their re-gridded equivalents.
(3) predicted_iodide_0.125x0.125_All_Ensemble_members.nc
As new observations are made, this global data product will be updated through a "living data" model. Dataset versions follow semantic versioning (https://semver.org/). This record contains the first publicly released version, v0.0.1, which supersedes the pre-review dataset v0.0.0. Please refer to the paper referenced above for the current version number and related information.
Updates for v0.0.1 vs. v0.0.0:
- Additional files included of the core data re-gridded to 0.5x0.5 degree and 0.25x0.25 degree horizontal resolution.
- Minor updates applied to all metadata in the NetCDF files.
- Updates made to the coordinate grids used for re-gridding files from 1x1 degree to 4x5 degree.
The NuiSI dataset contains skeleton-tracking trajectories of human interaction partners performing a variety of physically interactive behaviors (waving, handshaking, rocket fistbump, parachute fistbump) with each other. It is inspired by the dataset in Bütepage et al., "Imitating by generating: Deep generative models for imitation of interactive tasks", Frontiers in Robotics and AI (2020), in which a dataset was captured with Rokoko motion-capture suits. Instead, we track the skeletons of the interaction partners with Intel RealSense cameras using Nuitrack, for a more realistic scenario with noise coming from the depth sensor, the skeleton tracking, and partial occlusions. This makes it more representative of real-world interactions with a robot equipped with an RGB-D camera. This dataset is used in our papers for training interaction models for human-robot interaction with a humanoid social robot. If you find the dataset useful in your work, please cite our paper:
Prasad, V., Heitlinger, L., Koert, D., Stock-Homburg, R., Peters, J., & Chalvatzaki, G. (2023). Learning multimodal latent dynamics for human-robot interaction. arXiv preprint arXiv:2311.16380.
@article{prasad2023learning,
  title={Learning multimodal latent dynamics for human-robot interaction},
  author={Prasad, Vignesh and Heitlinger, Lea and Koert, Dorothea and Stock-Homburg, Ruth and Peters, Jan and Chalvatzaki, Georgia},
  journal={arXiv preprint arXiv:2311.16380},
  year={2023}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data were collected and disseminated according to this publication: https://www.nature.com/articles/s41597-019-0273-5
All descriptors below are taken from this publication and are copyright of the authors.
Adaptive interactions between building occupants and their surrounding environments affect both energy use and environmental quality, as demonstrated by a large body of modeling research that quantifies the impacts of occupant behavior on building operations. Yet, available occupant field data are insufficient to explore the mechanisms that drive this interaction. This paper introduces data from a one-year study of 24 U.S. office occupants that recorded a comprehensive set of possible exogenous and endogenous drivers of personal comfort and behavior over time. The longitudinal data collection protocol merges individual thermal comfort, preference, and behavior information from online daily surveys with datalogger readings of occupants' local thermal environments and control states, yielding 2503 survey responses alongside tens of thousands of concurrent behavior and environment measurements. These data have been used to uncover links between the built environment, personal variables, and adaptive actions, and the data contribute to international research collaborations focused on understanding the human-building interaction.
Humans interact with the built environment in a variety of ways that contribute to both building energy use and environmental quality, and these interactions thus warrant significant attention in the building design, operation, and retrofit processes. Occupants' thermally adaptive behaviours, such as adjusting thermostats and clothing, opening and closing windows and doors, and operating personal heating and cooling devices, are strongly tied to total site energy consumed in residential and commercial buildings in the United States (U.S.). This dataset introduces longitudinal data from a one-year study of occupant thermal comfort and several related behavioural adaptations in an air-conditioned U.S. office setting. The primary objective of the data collection approach was to record a comprehensive range of exogenous and endogenous factors that may drive personal comfort and behaviour outcomes over time.
Longitudinal data on building occupant behavior, comfort, and environmental conditions were collected between July 2012 and August 2013 at the Friends Center office building in Center City Philadelphia, Pennsylvania, United States. Data collection proceeded in three stages:
Semi-structured interviews
Semi-structured interviews identify aspects of behavior that are not yet well known or understood and provide rich qualitative context for developing and interpreting responses from structured survey instruments. 32 interviews about thermal comfort and related behaviours were first conducted with office occupants from 7 air-conditioned buildings around the Philadelphia region, ranging from aging to recently renovated.
Site selection and subject recruitment for the longitudinal study
Subject recruitment was initiated through an e-mail message sent to all employees in the Friends Center by its Executive Director. The following question areas were included: (a) demographic information, (b) office characteristics, (c) thermal comfort and preferences, (d) control options, (e) personal values, and (f) typical work schedule (arrival, lunch, and departure times).
Longitudinal survey and datalogger measurements
The final occupant sample participated in a series of subjective and objective measurements of thermal comfort, adaptive behavior, and related items. These measurements were carried out via longitudinal online surveys, as well as through parallel datalogger and BAS measurements of the local environment and behavioural actions.
Several measures were taken to ensure the validity of the collected data, following the data collection guidance included in the final report for International Energy Agency Annex 66: Definition and Simulation of Occupant Behavior in Buildings. These measures include a survey preparation phase, encouraging high response rates, pilot studies, quality control, and redundancy and comparison against expected conditions. Details of these measures can be found in the paper.