Description:
This dataset contains a collection of 15,150 images categorized into 12 distinct classes of common household waste: paper, cardboard, biological waste, metal, plastic, green glass, brown glass, white glass, clothing, shoes, batteries, and general trash. Each category represents a different type of material, supporting more effective recycling and waste-management strategies.
Objective
The purpose of this dataset is to aid in the development of machine learning models designed to automatically classify household waste into its appropriate categories, thus promoting more efficient recycling processes. Proper waste sorting is crucial for maximizing the amount of material that can be recycled, and this dataset is aimed at enhancing automation in this area. The classification of garbage into a broader range of categories, as opposed to the limited classes found in most available datasets (2-6 classes), allows for a more precise recycling process and could significantly improve recycling rates.
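As an illustration of the intended use, the sketch below loads such a folder-per-class image collection and prepares it for training a 12-way classifier. This is a minimal example rather than part of the dataset itself: the root path and folder layout are assumptions, and any torchvision model could be plugged in afterwards.

import torch
from torchvision import datasets, transforms

# Assumed layout: garbage_classification/<class_name>/<image>.jpg (one folder per class)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("garbage_classification/", transform=transform)
print(len(dataset), "images in", len(dataset.classes), "classes")  # expected: 15150 images, 12 classes
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)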
Dataset Composition and Collection Process
The dataset was primarily collected through web scraping, as simulating a real-world garbage collection scenario (such as placing a camera above a conveyor belt) was not feasible at the time of collection. The goal was to obtain images that closely resemble actual garbage. For example, images in the biological waste category include rotten fruits, vegetables, and food remnants. Similarly, categories such as glass and metal consist of images of bottles, cans, and containers typically found in household trash. While the images for some categories, like clothes or shoes, were harder to find specifically as garbage, they still represent the items that may end up in waste streams.
In an ideal setting, a conveyor system could be used to gather real-time data by capturing images of waste in a continuous flow. Such a setup would enhance the dataset by providing authentic waste images for all categories. However, until that setup is available, this dataset serves as a significant step toward automating garbage classification and improving recycling technologies.
Potential for Future Improvements
While this dataset provides a strong foundation for household waste classification, there is potential for further improvements. For example, real-time data collection using conveyor systems or garbage processing plants could provide higher accuracy and more contextual images. Additionally, future datasets could expand to include more specialized categories, such as electronic waste, hazardous materials, or specific types of plastic.
Conclusion
The Garbage Classification dataset offers a broad and diverse collection of household waste images, making it a valuable resource for researchers and developers working in environmental sustainability, machine learning, and recycling automation. By improving the accuracy of waste classification systems, we can contribute to a cleaner, more sustainable future.
This dataset is sourced from Kaggle.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Methods
The dataset is the output of a comprehensive literature-based search that aims to collate all the evidence on where ecosystem service (ES) relationships have been mentioned and addressed. We applied systematic mapping, based on the “Guidelines for Systematic Review in Environmental Management” developed by the Centre for Evidence-Based Conservation at Bangor University (Pullin and Stewart 2006).
The methodological framework followed the standard stages outlined for systematic mapping in environmental sciences (James et al. 2016). Briefly, we defined the scope and objectives:
· We comprehensively reviewed and further explored the global evidence of ES trade-offs and synergies, focusing on all systems: terrestrial, freshwater, and marine.
· We compiled the evidence on trade-offs and synergies among multiple ES interacting across various ecosystems.
· We performed a geographical and temporal trend analysis, exploring the distribution of studies across the world and examining how the focus on various ecosystem types and ES categories has evolved, to highlight gaps and biases.
Then we set the criteria for study inclusion (Table 1), searched the evidence, coded it, and produced the database. The extracted article information, including the specific criteria, is detailed in Table 1.
The first step was to search the ISI Web of Knowledge core collection (http://apps.webofknowledge.com) database, targeting the search on the ecosystem services literature and on studies dealing with trade-offs/synergies, win-win outcomes or bundles when managing different ecosystem services in the landscape/seascape. All peer-reviewed journal articles written in English or Spanish were considered for review.
The peer-reviewed literature from 2005 to 2021 was reviewed, identifying relevant studies according to specific search terms. The search terms and descriptive words were derived from Howe et al. (2014), adding “bundles” and “co-benefits”. The Boolean wildcard ‘*’, which allows any letters after it, was used on the root of words where several different endings applied (Figure 1). The search terms used were:
(“*ecosystem service*” OR “environment* service*” OR “ecosystem* approach*” OR “ecosystem good*” OR “environment* good*”)
AND
(“*trade-off*” OR “tradeoff*” OR “synerg*” OR “win-win*” OR “bundle*” OR “cost* and benefit*” OR “co-benefit*”) (n = 5194)
In the second step (Figure 1), papers were preliminarily coded with a semantic analysis using the R package Bibliometrix (http://www.bibliometrix.org). Papers were classified according to three systems: terrestrial, marine, and freshwater (Table 1). Papers with multiple systems, transitional habitats, or those that could not be classified were classified as “other” (Mazor et al. 2018). Articles were classified based on the occurrence of the most frequent system words in their title, keywords, and abstract (Mazor et al. 2018). The set of system-specific words was determined by extracting the 250 most frequently used keywords from all considered articles and assigning each word to one system (articles could fall into just one of the four categories). Using this technique, we managed to classify 100% of the papers. To further enrich the dataset and make it a useful repository for science and policy, an additional sub-classification was performed, categorizing papers into the following categories: Coastal, Urban, Wetlands, Forest, Mountain, Freshwater, Agroecosystems, and Others (mainly representing multiple ecosystems) (Table S1). This comprehensive classification approach enhances the dataset's utility for various scientific and policy-making applications.
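The keyword-based assignment can be illustrated with a minimal Python sketch; the keyword sets below are illustrative placeholders, not the 250 keywords actually used in the study.

from collections import Counter

# Illustrative keyword sets; the study derived its sets from the 250 most frequent keywords.
SYSTEM_KEYWORDS = {
    "terrestrial": {"forest", "soil", "grassland", "agriculture"},
    "marine": {"ocean", "sea", "coastal", "fishery"},
    "freshwater": {"river", "lake", "wetland", "stream"},
}

def classify_system(title, keywords, abstract):
    # Count occurrences of system-specific words in title, keywords and abstract
    tokens = f"{title} {' '.join(keywords)} {abstract}".lower().split()
    counts = Counter({system: sum(tokens.count(w) for w in words)
                      for system, words in SYSTEM_KEYWORDS.items()})
    best, n = counts.most_common(1)[0]
    return best if n > 0 else "other"  # each article gets exactly one of the four labels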
In the third step (Figure 1), applying the same technique, we classified the papers into four ES categories: habitat (supporting/biodiversity-related), provisioning, regulating, and cultural services (De Groot et al. 2010; MEA 2005; Sukhdev 2010; Wallace 2007). For the classification into ES categories, articles could fall into one or more of the four categories (see Table 1 for examples of the keywords used to classify ecosystems, ES categories, and countries). Applying this technique, we excluded 2149 papers that were not classified into any ES category, resulting in 3629 papers (see Figure 1).
In the fourth step (Figure 1), an initial screening was conducted to identify papers that did not align with the review objectives of assessing ecosystem service trade-offs and synergies to inform policy and management decisions. We manually reviewed the title of each paper in the dataset, excluding those from other fields or otherwise out of scope. In this initial assessment, we excluded 347 papers, leaving a total of 3286 papers for further review. A descriptive analysis of this 3286-article dataset was performed to examine the distribution of ES categories within each ecosystem type over the specified period. This analysis allowed us to assess the prevalence of each ES category in different ecosystem types, identifying temporal trends and patterns. The number of occurrences was calculated for each ES category within each ecosystem type, expressed as counts. This allowed for the comparison of ecosystem service distributions across the selected ecosystem types.
In the fifth step (Figure 1), we visually represented the geographical distribution and focus of ES studies across the world. Building on the classification of studies into ES categories and ecosystem types, the papers were coded according to the country where the study was performed. A specific country could be assigned to 2636 studies; 650 studies that did not specify a country of study were removed. Of these 2636 classified papers, 499 were global studies considering several countries.
We developed global maps (Figure 1), each offering a unique perspective on the ES research landscape. The first map presents the total number of ES trade-off studies conducted worldwide, illustrating the geographical spread and concentration of research efforts, to provide a clear overview of regions that have been extensively studied and those that may require more attention in future research. Additionally, we calculated two key metrics to assess research productivity more comprehensively: the number of research papers per capita and the number of research papers relative to Gross Domestic Product (GDP). For population and GDP, we used the most recent available data from the World Bank (https://data.worldbank.org). These alternative metrics normalize the data based on economic output and population size, providing a more balanced view of research activity across different countries (Figure S3).
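The normalization step can be sketched as follows; the country names and figures are made-up placeholders, and in practice the population and GDP columns would be filled from the World Bank download.

import pandas as pd

# Placeholder values; real population and GDP figures come from https://data.worldbank.org
df = pd.DataFrame({
    "country": ["CountryA", "CountryB"],
    "n_papers": [120, 45],
    "population": [50_000_000, 8_000_000],
    "gdp_usd": [1.2e12, 4.5e11],
})
df["papers_per_million_people"] = df["n_papers"] / (df["population"] / 1e6)
df["papers_per_billion_usd_gdp"] = df["n_papers"] / (df["gdp_usd"] / 1e9)
print(df)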
Detailed maps were created featuring pie charts that highlight the different categories of ES and ecosystem types addressed in each country. These charts offer an understanding of how various ES categories and ecosystems are represented in different parts of the world. Finally, we assigned ES trade-off studies to world regions (Africa, Antarctica, Asia, Australasia, Europe, Latin America, and North America), looking at the relationships between the categories of ES. We considered both papers that evaluated more than one category of ES and papers that considered only one category. This country-level analysis offers insights into regional research trends and priorities, contributing to a more localized understanding of ES studies.
In the sixth step (Figure 1), each publication in this review was critically appraised to evaluate its quality. The foundation for our critical appraisal is the comprehensive, multidimensional approach of Belcher et al. (2016), a framework for evaluating research quality that aligns well with the interdisciplinary nature of our study. Belcher et al. (2016) developed a robust framework that incorporates essential principles and criteria for assessing the quality of transdisciplinary research. This is particularly relevant for ecosystem services science and for our review, which contributes to advancing current knowledge by systematically synthesizing evidence on relationships among various ES across diverse systems.
The Belcher et al. (2016) framework emphasizes four main principles: relevance, credibility (which we have adapted as methodological transparency), legitimacy (generalizability in our context), and effectiveness (significance). A continuous scoring system (ranging from 0 to 1) was applied for the four main criteria to maintain simplicity and consistency across the large number of studies. In this system, a value closer to 0 indicates that a criterion is not met, while a value closer to 1 indicates that it is more closely met. This scoring method served as a useful indicator of the overall quality of each paper and of how well it met the review's goals.
Methodological Transparency was assessed based on the clarity and completeness of methodological descriptions, including data availability, the rigor of statistical analyses, methodological detail, and reproducibility of the findings. This criterion assesses the transparency and rigor of the study's methodology, including data collection, analysis, and reporting (Belcher et al. 2016). Relevance was evaluated by the study's alignment with the review's objectives, its importance to the field, and its practical applicability. This includes the extent to which the study addresses pertinent research
SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to the privacy concerns associated with real-world data, open datasets are rare (if not non-existent) in the research community. SynthPAI is a synthetic dataset that aims to fill this gap.
Dataset Details
Dataset Description
SynthPAI was created using 300 GPT-4 agents seeded with individual personalities interacting with each other in a simulated online forum, and consists of 103 threads and 7823 comments. For each profile, we further provide a set of personal attributes that a human could infer from the profile. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.
Curated by: SRILab at ETH Zurich. The dataset was not created on behalf of any outside entity.
Funded by: Two authors of this work are supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) (SERI-funded ERC Consolidator Grant). This project did not, however, receive explicit funding from SERI and was devised independently. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the SERI-funded ERC Consolidator Grant.
Shared by: SRILab at ETH Zurich
Language(s) (NLP): English
License: CC-BY-NC-SA-4.0
Dataset Sources
Repository: https://github.com/eth-sri/SynthPAI
Paper: https://arxiv.org/abs/2406.07217
Uses
The dataset is intended to be used as a privacy-preserving method of (i) evaluating PAI capabilities of language models and (ii) aiding the development of potential defenses against such automated inferences.
Direct Use
As in the associated paper, where we include an analysis of the personal attribute inference (PAI) capabilities of 18 state-of-the-art LLMs across different attributes and on anonymized texts.
Out-of-Scope Use
The dataset shall not be used as part of any system that performs attribute inferences on real natural persons without their consent or otherwise maliciously.
Dataset Structure
We provide the instance descriptions below. Each data point consists of a single comment (which can be a top-level post):
Comment
author str: unique identifier of the person writing
username str: corresponding username
parent_id str: unique identifier of the parent comment
thread_id str: unique identifier of the thread
children list[str]: unique identifiers of children comments
profile Profile: profile making the comment - described below
text str: text of the comment
guesses list[dict]: Dict containing model estimates of attributes based on the comment. Only contains attributes for which a prediction exists.
reviews dict: Dict containing human estimates of attributes based on the comment. Each guess has a corresponding hardness rating (and a certainty rating). Contains all attributes.
The associated profiles are structured as follows
Profile
username str: identifier
attributes: set of personal attributes that describe the user (directly listed below)
The corresponding attributes and values are
Attributes
Age continuous [18-99] The age of a user in years.
Place of Birth tuple [city, country] The place of birth of a user. We create tuples jointly for city and country in free-text format. (field name: birth_city_country)
Location tuple [city, country] The current location of a user. We create tuples jointly for city and country in free-text format. (field name: city_country)
Education free-text We use a free-text field to describe the user's education level. This includes additional details such as the degree and major. To ensure comparability with the evaluation of prior work, we later map these to a categorical scale: high school, college degree, master's degree, PhD.
Income Level free-text [low, medium, high, very high] The income level of a user. We first generate a continuous income level in the profile's local currency. In our code, we map this to a categorical value considering the distribution of income levels in the respective profile location. For this, we roughly follow the local equivalents of the following reference levels for the US: Low (<30k USD), Middle (30-60k USD), High (60-150k USD), Very High (>150k USD).
Occupation free-text The occupation of a user, described as a free-text field.
Relationship Status categorical [single, In a Relationship, married, divorced, widowed] The relationship status of a user as one of 5 categories.
Sex categorical [Male, Female] Biological Sex of a profile.
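Putting the pieces together, a minimal Python sketch of this data model might look as follows. Field names follow the descriptions above; the income helper encodes the US reference levels listed for the Income Level attribute and is illustrative rather than the project's actual mapping code.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Profile:
    username: str
    attributes: dict  # e.g. {"age": 34, "city_country": ("Zurich", "Switzerland"), ...}

@dataclass
class Comment:
    author: str               # unique identifier of the person writing
    username: str
    parent_id: Optional[str]  # None for top-level posts
    thread_id: str
    children: list            # unique identifiers of children comments
    profile: Profile
    text: str
    guesses: list             # model estimates, only for attributes with a prediction
    reviews: dict             # human estimates with hardness (and certainty) ratings

def income_level_us(income_usd: float) -> str:
    # US reference levels from the card; other locations use local equivalents
    if income_usd < 30_000:
        return "low"
    if income_usd < 60_000:
        return "middle"
    if income_usd < 150_000:
        return "high"
    return "very high"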
Dataset Creation
Curation Rationale
SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to the privacy concerns associated with real-world data, open datasets are rare (if not non-existent) in the research community. SynthPAI is a synthetic dataset that aims to fill this gap. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.
Source Data
The dataset is fully synthetic and was created using GPT-4 agents (version gpt-4-1106-preview) seeded with individual personalities interacting with each other in a simulated online forum.
Data Collection and Processing
The dataset was created by sampling comments from the agents in threads. A human then inferred a set of personal attributes from the sets of comments associated with each profile. The dataset was further manually reviewed to remove any offensive or inappropriate content. We give a detailed overview of our dataset-creation procedure in the corresponding paper.
Annotations
Annotations are provided by authors of the paper.
Personal and Sensitive Information
All contained personal information is purely synthetic and does not relate to any real individual.
Bias, Risks, and Limitations
All profiles are synthetic and do not correspond to any real subpopulations. We provide a distribution of the personal attributes of the profiles in the accompanying paper. As the dataset has been created synthetically, data points can inherit limitations (e.g., biases) from the underlying model, GPT-4. While we manually reviewed comments individually, we cannot provide respective guarantees.
Citation
BibTeX:
@misc{2406.07217,
  author = {Hanna Yukhymenko and Robin Staab and Mark Vero and Martin Vechev},
  title = {A Synthetic Dataset for Personal Attribute Inference},
  year = {2024},
  eprint = {arXiv:2406.07217},
}
APA:
Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev: “A Synthetic Dataset for Personal Attribute Inference”, 2024; arXiv:2406.07217.
Dataset Card Authors
Hanna Yukhymenko, Robin Staab, Mark Vero
Dataset used in World Bank Policy Research Working Paper #2876, published in World Bank Economic Review, No. 1, 2005, pp. 21-44.
The effects of globalization on income distribution in rich and poor countries are a matter of controversy. While international trade theory in its most abstract formulation implies that increased trade and foreign investment should make income distribution more equal in poor countries and less equal in rich countries, finding these effects has proved elusive. The author presents another attempt to discern the effects of globalization by using data from household budget surveys and looking at the impact of openness and foreign direct investment on relative income shares of low and high deciles. The author finds some evidence that at very low average income levels, it is the rich who benefit from openness. As income levels rise to those of countries such as Chile, Colombia, or the Czech Republic, the situation changes, and it is the relative income of the poor and the middle class that rises compared with the rich. It seems that openness makes income distribution worse before making it better; put differently, the effect of openness on a country's income distribution depends on the country's initial income level.
Aggregate data [agg]
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
Catholic_CO2_Footprint_Beta_Full_NULL_NAmerica2
Burhans, Molly A., Cheney, David M., Gerlt, R. “Catholic_CO2_Footprint_Beta_Full_NULL_NAmerica2”. Scale not given. Version 1.0. MO and CT, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2019.

Methodology
This is the first global carbon footprint of the Catholic population. We will continue to improve and develop these data with our research partners over the coming years. While it is helpful, it should also be viewed and used as a "beta" prototype that we and our research partners will build from and improve. The years of carbon data are 2010 and 2015 (shown). The year of Catholic data is 2018. The year of population data is 2016. Care should be taken during future developments to harmonize the years used for Catholic, population, and CO2 data.
1. Zonal statistics: Esri population data and dioceses --> population per diocese (non-Vatican-based numbers)
2. Zonal statistics: FFDAS, dioceses, and the population dataset --> mean CO2 per diocese
3. Field calculation: population per diocese and mean CO2 per diocese --> CO2 per capita
4. Field calculation: CO2 per capita * Catholic population --> Catholic carbon footprint

Assumptions
Per-capita CO2: Deriving per-capita CO2 from mean CO2 in a geography assumes that people's footprint accounts for their personal lifestyle and involvement in local businesses and industries that contribute CO2.
Catholic CO2: Assumes that Catholics and non-Catholics have similar CO2 footprints from their lifestyles.

Derived from: "A multiyear, global gridded fossil fuel CO2 emission data product: Evaluation and analysis of results" (http://ffdas.rc.nau.edu/About.html)
Rayner et al., JGR, 2010 - the first FFDAS paper, describing the version 1.0 methods and results, published in the Journal of Geophysical Research.
Asefi et al., 2014 - the paper describing the methods and results of FFDAS version 2.0, published in the Journal of Geophysical Research.
Readme version 2.2 - a simple readme file to assist in using the 10 km x 10 km, hourly gridded Vulcan version 2.2 results.
Liu et al., 2017 - a paper exploring the carbon cycle response to the 2015-2016 El Nino through the use of carbon cycle data assimilation with FFDAS as the boundary condition for FFCO2.
S. Asefi-Najafabady, P. J. Rayner, K. R. Gurney, A. McRobert, Y. Song, K. Coltin, J. Huang, C. Elvidge, K. Baugh. First published: 10 September 2014. https://doi.org/10.1002/2013JD021296. Cited by: 30.
Link to FFDAS data retrieval and visualization: http://hpcg.purdue.edu/FFDAS/index.php

Abstract
High-resolution, global quantification of fossil fuel CO2 emissions is emerging as a critical need in carbon cycle science and climate policy. We build upon a previously developed fossil fuel data assimilation system (FFDAS) for estimating global high-resolution fossil fuel CO2 emissions. We have improved the underlying observationally based data sources, expanded the approach through treatment of separate emitting sectors including a new pointwise database of global power plants, and extended the results to cover a 1997 to 2010 time series at a spatial resolution of 0.1°. Long-term trend analysis of the resulting global emissions shows subnational spatial structure in large active economies such as the United States, China, and India. These three countries, in particular, show different long-term trends, and exploration of the trends in nighttime lights and population reveals a decoupling of population and emissions at the subnational level. Analysis of shorter-term variations reveals the impact of the 2008-2009 global financial crisis with widespread negative emission anomalies across the U.S. and Europe. We have used a center of mass (CM) calculation as a compact metric to express the time evolution of spatial patterns in fossil fuel CO2 emissions. The global emission CM has moved toward the east and somewhat south between 1997 and 2010, driven by the increase in emissions in China and South Asia over this time period. Analysis at the level of individual countries reveals per capita CO2 emission migration in both Russia and India. The per capita emission CM holds potential as a way to succinctly analyze subnational shifts in carbon intensity over time. Uncertainties are generally lower than the previous version of FFDAS due mainly to an improved nightlight data set.

Global Diocesan Boundaries
Burhans, M., Bell, J., Burhans, D., Carmichael, R., Cheney, D., Deaton, M., Emge, T., Gerlt, B., Grayson, J., Herries, J., Keegan, H., Skinner, A., Smith, M., Sousa, C., Trubetskoy, S. “Diocesan Boundaries of the Catholic Church” [Feature Layer]. Scale not given. Version 1.2. Redlands, CA, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2016.
Using: ArcGIS 10.4. Redlands, CA: Environmental Systems Research Institute, Inc., 2016.

Boundary Provenance: Statistics and Leadership Data
Cheney, D.M. “Catholic Hierarchy of the World” [Database]. Date updated: August 2019. Catholic Hierarchy. Using: Paradox. Retrieved from original source.
Annuario Pontificio per l'Anno. Città del Vaticano: Tipografia Poliglotta Vaticana, multiple years.
The data for these maps was extracted from the gold standard of Church data, the Annuario Pontificio, published yearly by the Vatican. The collection and data development of the Vatican Statistics Office are unknown. GoodLands is not responsible for errors within this data. We encourage people to document and report errant information to us at data@good-lands.org or directly to the Vatican. Additional information about regular changes in bishops and sees comes from a variety of public diocesan and news announcements.
GoodLands' polygon data layers, version 2.0, for global ecclesiastical boundaries of the Roman Catholic Church: Although care has been taken to ensure the accuracy, completeness and reliability of the information provided, this being the first developed dataset of global ecclesiastical boundaries curated from many sources, it may have a higher margin of error than established geopolitical administrative boundary maps. Boundaries need to be verified with appropriate ecclesiastical leadership. The current information is subject to change without notice. No parties involved with the creation of this data are liable for indirect, special or incidental damage resulting from, arising out of or in connection with the use of the information. We referenced 1960 sources to build our global datasets of ecclesiastical jurisdictions. Often, they were isolated images of dioceses, historical documents and information about parishes that were cross-checked. These sources can be viewed here: https://docs.google.com/spreadsheets/d/11ANlH1S_aYJOyz4TtG0HHgz0OLxnOvXLHMt4FVOS85Q/edit#gid=0
To learn more or contact us, please visit: https://good-lands.org/

Esri Gridded Population Data 2016
Description
This layer is a global estimate of human population for 2016. Esri created this estimate by modeling a footprint of where people live as a dasymetric settlement likelihood surface, and then assigned 2016 population estimates stored on polygons of the finest level of geography available onto the settlement surface. "Where people live" means where their homes are, as in where people sleep most of the time; this is opposed to where they work. Another way to think of this estimate is a night-time estimate, as opposed to a day-time estimate. Knowledge of population distribution helps us understand how humans affect the natural world and how natural events such as storms, earthquakes, and other phenomena affect humans. This layer represents the footprint of where people live, and how many people live there.

Dataset Summary
Each cell in this layer has an integer value with the estimated number of people likely to live in the geographic region represented by that cell. Esri additionally produced several related layers:
World Population Estimate Confidence 2016: the confidence level (1-5) per cell for the probability of people being located and estimated correctly.
World Population Density Estimate 2016: this layer is represented as population density in units of persons per square kilometer.
World Settlement Score 2016: the dasymetric likelihood surface used to create this layer by apportioning population from census polygons to the settlement score raster.
To use this layer in analysis, there are several properties or geoprocessing environment settings that should be used:
Coordinate system: WGS_1984. This service and its underlying data are WGS_1984. We do this because projecting population count data actually changes the populations due to resampling and either collapsing or splitting cells to fit into another coordinate system.
Cell size: 0.0013474728 degrees (approximately 150 meters) at the equator.
No data: -1. Bit depth: 32-bit signed.
This layer has query, identify, pixel, and export image functions enabled, and is restricted to a maximum analysis size of 30,000 x 30,000 pixels - an area about the size of Africa.
Frye, C. et al. (2018). Using Classified and Unclassified Land Cover Data to Estimate the Footprint of Human Settlement. Data Science Journal, 17, p. 20. DOI: http://doi.org/10.5334/dsj-2018-020
What can you do with this layer? This layer is unsuitable for mapping or cartographic use, and thus does not include a convenient legend. Instead, it is useful for analysis, particularly for estimating counts of people living within watersheds, coastal areas, and other areas that do not have standard boundaries. Esri recommends using the Zonal Statistics tool or the Zonal Statistics to Table tool, where you provide input zones as either polygons or raster data, and the tool will summarize the count of population within those zones. https://www.esri.com/arcgis-blog/products/arcgis-living-atlas/data-management/2016-world-population-estimate-services-are-now-available/
Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
Estimates of the markup for the Primary Foods industry (comprising agriculture, hunting, fishing, and logging) using the De Loecker et al. (2020) methodology. Instead of using micro-based, firm-level data, we calculate the markups using aggregate macro-data. The database contains information on 170 countries for the years 1995-2015.
Our sources of data are twofold: the first is the EORA input-output database, and the second is the UN FAOSTAT database. Our paper, Rodriguez del Valle and Fernández-Vázquez (2024), explains in more detail the estimation technique, based on Generalized Maximum Entropy, employed to derive these estimates.
The dataset can be used to explore and research a myriad of topics, including the impact globalization has on markups, the role of institutional quality, and even climate. We have found strong evidence that the percentage share of value added required for production originating from abroad is significantly connected to lower markups. We have also found compelling empirical evidence that institutional quality can impact the evolution of markups.
Please note that we make the strong assumption, consistent with input-output theory, that each industry is produced by one "firm". For this industry, we believe the assumption will not bias results too strongly, particularly when analyzing developing countries. Results may be biased in countries where large farms exist.
Industrial Organization, Macroeconomics, Microeconomics, Agriculture Industry
Adrian Rodriguez del Valle, Esteban Fernandez-Vazquez
Institutions
Universidad de Oviedo
UPDATED on October 15, 2020: After some mistakes in some of the data were found, we updated this data set. The changes to the data are detailed on Zenodo (http://doi.org/10.5281/zenodo.4061807), and an Erratum has been submitted.

This data set, under CC-BY license, contains time series of total abundance and/or biomass of assemblages of insects, arachnids and Entognatha (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 165 data sources, representing a total of 1668 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses. This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange - a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

The data set consists of four tables, linked by the columns 'DataSource_ID' and 'Plot_ID', representing information on the study level, the plot level, the sampling, and the measured assemblage sizes, plus a table with references to the original research.

In the table 'DataSources', descriptive data is provided at the dataset level: links are provided to online repositories where the original data can be found; it describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

In the table 'PlotData', more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and whether there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, and protection status.

The table 'SampleData' describes the exact source of the data (table X, figure X, etc.), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here; more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each data source may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching the tables 'DataSources' and 'PlotData', as well as the year of sampling, a descriptor of the period within the year of sampling (used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column 'Number', missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add. Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID', will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details are available in the data paper.

WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
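For example, the full analysis dataframe described above could be reassembled with pandas roughly as follows; this is a sketch, with column and file names taken from the description above.

import pandas as pd

abundance = pd.read_csv("InsectAbundanceBiomassData.csv")
plots = pd.read_csv("PlotData.csv")
sources = pd.read_csv("DataSources.csv")

# Link the tables as described: plots via 'Plot_ID', data sources via 'DataSource_ID'
full = abundance.merge(plots, on="Plot_ID").merge(sources, on="DataSource_ID")

# Years without samples are included as NA in 'Number'; drop them for trend fitting
full = full.dropna(subset=["Number"])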
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions table in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
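A small helper illustrates how a KernelVersions id maps onto this layout. This is a sketch; the exact folder-name padding and the file extension (which depends on the notebook's language) should be checked against the files themselves.

def kernel_version_dir(version_id: int) -> str:
    # Top-level folder groups ids by millions, subfolder by thousands,
    # e.g. id 123456789 -> "123/456" (which holds ids 123456000 to 123456999).
    top = version_id // 1_000_000
    sub = (version_id // 1_000) % 1_000
    return f"{top}/{sub}"

print(kernel_version_dir(123_456_789))  # -> 123/456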
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
The Assessing Responses and Impacts of Solar climate intervention on the Earth system with stratospheric aerosol injection (ARISE-SAI-1.5) simulations utilize a moderate emission scenario, introduce stratospheric aerosol injection at approximately 21 km altitude in year 2035, and keep global mean surface air temperature near 1.5 °C above the pre-industrial value. CESM2 (WACCM6) global output from all model components (atmosphere, ice, land, ocean) is provided at monthly, daily and sub-daily frequencies. All atmospheric data are on the original CESM2 (WACCM6) grid (0.9 by 1.25 degrees). All data are in time-series, NetCDF format. More details about this dataset (including information on reference simulations) can be found on the ARISE-SAI-1.5 CESM Community Project page linked below under Related Resources.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This file contains a summary of data and code used in the paper:
M. O. Cuthbert, G. C. Rau, M. Ekström, D. M. O'Carroll & A. J. Bates (2022). Global climate-driven trade-offs between the water retention and cooling benefits of urban greening. Nature Communications. https://doi.org/10.1038/s41467-022-28160-8
See the ReadMe file uploaded with the data and the Methods section of the paper for details of the derivation of each dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Please cite the following paper when using this dataset:
N. Thakur, “A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave,” Journal of Data, vol. 7, no. 8, p. 109, Aug. 2022, doi: 10.3390/data7080109
Abstract
The COVID-19 Omicron variant, reported to be the most immune-evasive variant of COVID-19, is resulting in a surge of COVID-19 cases globally. This has caused schools, colleges, and universities in different parts of the world to transition to online learning. As a result, social media platforms such as Twitter are seeing an increase in conversations related to online learning, centered around information seeking and sharing. Mining such conversations, such as Tweets, to develop a dataset can provide a data resource for interdisciplinary research on interest, views, opinions, perspectives, attitudes, and feedback towards online learning during the current surge of COVID-19 cases caused by the Omicron variant. Therefore, this work presents a large-scale public Twitter dataset of conversations about online learning since the first detected case of the COVID-19 Omicron variant in November 2021. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.
Data Description
The dataset comprises a total of 52,984 Tweet IDs (corresponding to the same number of Tweets) about online learning posted on Twitter from 9th November 2021 to 13th July 2022. The earliest date was selected as 9th November 2021 because the Omicron variant was first detected in a sample collected on that date. 13th July 2022 was the most recent date at the time of data collection and publication of this dataset.
The dataset consists of 9 .txt files. An overview of these dataset files along with the number of Tweet IDs and the date range of the associated tweets is as follows. Table 1 shows the list of all the synonyms or terms that were used for the dataset development.
Filename: TweetIDs_November_2021.txt (No. of Tweet IDs: 1283, Date Range of the associated Tweet IDs: November 1, 2021 to November 30, 2021)
Filename: TweetIDs_December_2021.txt (No. of Tweet IDs: 10545, Date Range of the associated Tweet IDs: December 1, 2021 to December 31, 2021)
Filename: TweetIDs_January_2022.txt (No. of Tweet IDs: 23078, Date Range of the associated Tweet IDs: January 1, 2022 to January 31, 2022)
Filename: TweetIDs_February_2022.txt (No. of Tweet IDs: 4751, Date Range of the associated Tweet IDs: February 1, 2022 to February 28, 2022)
Filename: TweetIDs_March_2022.txt (No. of Tweet IDs: 3434, Date Range of the associated Tweet IDs: March 1, 2022 to March 31, 2022)
Filename: TweetIDs_April_2022.txt (No. of Tweet IDs: 3355, Date Range of the associated Tweet IDs: April 1, 2022 to April 30, 2022)
Filename: TweetIDs_May_2022.txt (No. of Tweet IDs: 3120, Date Range of the associated Tweet IDs: May 1, 2022 to May 31, 2022)
Filename: TweetIDs_June_2022.txt (No. of Tweet IDs: 2361, Date Range of the associated Tweet IDs: June 1, 2022 to June 30, 2022)
Filename: TweetIDs_July_2022.txt (No. of Tweet IDs: 1057, Date Range of the associated Tweet IDs: July 1, 2022 to July 13, 2022)
The dataset contains only Tweet IDs, in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used. For hydrating this dataset, the Hydrator application may be used.
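Before hydration, the IDs can be collected from the monthly files with a few lines of Python; this sketch assumes the nine .txt files sit in the current directory.

import glob

tweet_ids = []
for path in sorted(glob.glob("TweetIDs_*.txt")):
    with open(path) as f:
        tweet_ids.extend(line.strip() for line in f if line.strip())
print(len(tweet_ids))  # expected: 52984 IDs across the 9 files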
Table 1. List of commonly used synonyms, terms, and phrases for online learning and COVID-19 that were used for the dataset development
Terminology: COVID-19
Synonyms and terms: Omicron, COVID, COVID19, coronavirus, coronaviruspandemic, COVID-19, corona, coronaoutbreak, omicron variant, SARS CoV-2, corona virus

Terminology: online learning
Synonyms and terms: online education, online learning, remote education, remote learning, e-learning, elearning, distance learning, distance education, virtual learning, virtual education, online teaching, remote teaching, virtual teaching, online class, online classes, remote class, remote classes, distance class, distance classes, virtual class, virtual classes, online course, online courses, remote course, remote courses, distance course, distance courses, virtual course, virtual courses, online school, virtual school, remote school, online college, online university, virtual college, virtual university, remote college, remote university, online lecture, virtual lecture, remote lecture, online lectures, virtual lectures, remote lectures
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification).
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
pip install zarr
import zarr
# Open the two arrays of a sample ("path/to/sample.zarr" is a placeholder)
raw = zarr.open("path/to/sample.zarr", mode="r", path="volumes/raw")
seg = zarr.open("path/to/sample.zarr", mode="r", path="volumes/gt_instances")
# optional: load into memory as numpy arrays
import numpy as np
raw_np = np.array(raw)
Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also explicitly be converted to numpy arrays.
We recommend using napari to view the image data.
pip install "napari[all]"
import zarr, sys, napari
# zarr.load reads the arrays fully into memory; no mode argument is needed
raw = zarr.load(sys.argv[1], path="volumes/raw")
gts = zarr.load(sys.argv[1], path="volumes/gt_instances")
viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(
        gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
python view_data.py path/to/sample.zarr
For more information on our selected metrics and formal definitions please see our paper.
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN), and a non-learnt, application-specific color clustering from Duan et al.
For detailed information on the methods and the quantitative results please see our paper.
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
title = {FISBe: A real-world benchmark dataset for instance
segmentation of long-range thin filamentous structures},
author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
year = 2024,
eprint = {2404.00130},
archivePrefix ={arXiv},
primaryClass = {cs.CV}
}
We thank Aljoscha Nern for providing unpublished MCFO images, as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.
There have been no changes to the dataset so far.
All future changes will be listed on the changelog page.
If you would like to contribute, have encountered any issues, or have any suggestions, please open an issue for the FISBe dataset in the accompanying GitHub repository.
All contributions are welcome!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Extreme sea levels, generated by storm surges and high tides, have the potential to cause coastal flooding and erosion. Global datasets are instrumental for mapping extreme sea levels and the associated societal risks. Harnessing the backward extension of the ERA5 reanalysis, we present a dataset containing the statistics of water levels based on a global hydrodynamic model (GTSMv3.0) covering the period 1950-2024. This is an extension of a previously published dataset for 1979-2018 (Muis et al. 2020). The timeseries (10-min, hourly mean and daily maxima) are available via the Climate Data Store of ECMWF at DOI: 10.24381/cds.a6d42d60. Using this extended ERA5 dataset, we calculate percentiles and estimate extreme water levels for various return periods globally. The percentiles dataset includes the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 99th percentiles. The extreme water levels include return values for 1, 2, 5, 10, 25, 50, 75 and 100 years; they are estimated using the POT-GPD method with a threshold at the 99th percentile of the timeseries, a 72-hour window for declustering peak events, and the MLE method for fitting the GPD parameters. The GPD parameters (shape, scale and location) are also supplied with this dataset.
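For readers who want to reproduce the flavor of this analysis on their own timeseries, the following is a minimal sketch of the POT-GPD step with scipy. It omits the 72-hour declustering and is not the code used to produce the dataset.

import numpy as np
from scipy.stats import genpareto

def gpd_return_levels(series, n_years, return_periods=(1, 2, 5, 10, 25, 50, 75, 100)):
    # series: 1-D numpy array of water levels; n_years: record length in years.
    # Peaks-over-threshold with the 99th percentile as threshold.
    u = np.percentile(series, 99)
    exceedances = series[series > u] - u
    # NOTE: a real analysis first declusters peaks (here: a 72-hour window).
    shape, _, scale = genpareto.fit(exceedances, floc=0)  # MLE, location fixed at 0
    rate = len(exceedances) / n_years                     # mean number of peaks per year
    rp = np.asarray(return_periods, dtype=float)
    return u + (scale / shape) * ((rate * rp) ** shape - 1.0)  # assumes shape != 0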
Validation of the underlying timeseries and the statistical values shows that there is a good agreement between observed and modelled sea levels, with the level of agreement being very similar to that of the previously published dataset. The extended 75-year dataset allows for a more robust estimation of extremes, often resulting in smaller uncertainties than its 40-year precursor. The present dataset can be used in global assessments of flood risk, climate variability and climate changes.
Global modelling of water levels and extreme value analysis are associated with a number of uncertainties and limitations that are particularly important to consider when conducting local assessments. Please refer to the Usage Notes in the corresponding manuscript (Aleksandrova et al. 2025, paper currently under review) for an overview of limitations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Due to this intrinsic challenge, no global, long-term, univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments that filled satellite-like gaps introduced into GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.
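To give an idea of the algorithm, below is a highly simplified 1-D sketch of DCT-based penalized least squares gap filling in the spirit of Garcia (2010); the operational product uses a more elaborate, calibrated implementation.

import numpy as np
from scipy.fft import dct, idct

def dct_pls_fill(y, s=1.0, n_iter=100):
    # y: 1-D series with gaps as NaN; s: smoothing parameter (larger = smoother).
    y = np.asarray(y, dtype=float)
    n = y.size
    w = (~np.isnan(y)).astype(float)       # 0 at gaps, 1 at observations
    y0 = np.nan_to_num(y)                  # gap values are masked out by w anyway
    lam = -2.0 + 2.0 * np.cos(np.arange(n) * np.pi / n)
    gamma = 1.0 / (1.0 + s * lam**2)       # low-pass filter in DCT space
    z = np.full(n, np.nanmean(y))          # crude initialization
    for _ in range(n_iter):                # smooth, then re-impose observed values
        z = idct(gamma * dct(w * y0 + (1.0 - w) * z, norm="ortho"), norm="ortho")
    return z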
You can use command-line tools such as wget or curl to download (and extract) data for multiple years. The following script downloads and extracts the complete dataset to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
    echo "Downloading $year.zip..."
    wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
    unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
    rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), with each subdirectory containing one netCDF image file per day (DD) and month (MM) on a 2-dimensional (longitude, latitude) grid (CRS: WGS84). File names follow the convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
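For example, a single daily file can be inspected with any netCDF-aware Python library; a short xarray sketch (the date in the file name is only an example):

import xarray as xr

# Open one daily image and list its variables and their CF metadata
ds = xr.open_dataset(
    "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc")
print(ds)  # coordinates, data variables, global attributes
for name, var in ds.data_vars.items():
    print(name, var.attrs.get("long_name", ""), var.attrs.get("units", ""))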
Changes in v9.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:
The following records are all part of the "Soil Moisture Climate Data Records from satellites" community:
ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
Attribution-NoDerivs 4.0 (CC BY-ND 4.0) https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Catholic Carbon Footprint Story Map
Map Data: Burhans, Molly A., Cheney, David M., Gerlt, R. "PerCapita_CO2_Footprint_InDioceses_FULL". Scale not given. Version 1.0. MO and CT, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2019.
Map Development: Molly Burhans
Methodology
This is the first global carbon footprint of the Catholic population. We will continue to improve and develop these data with our research partners over the coming years. While it is helpful, it should also be viewed and used as a "beta" prototype that we and our research partners will build from and improve. The carbon data cover the years 2010 and 2015 (2015 shown). The Catholic data are for 2018; the population data are for 2016. Care should be taken during future developments to harmonize the years used for Catholic, population, and CO2 data.
1. Zonal statistics: Esri population data and dioceses --> population per diocese (non-Vatican-based numbers)
2. Zonal statistics: FFDAS, dioceses, and the population dataset --> mean CO2 per diocese
3. Field calculation: population per diocese and mean CO2 per diocese --> CO2 per capita
4. Field calculation: CO2 per capita * Catholic population --> Catholic carbon footprint
Assumptions
Per capita CO2: Deriving per-capita CO2 from mean CO2 in a geography assumes that people's footprint accounts for their personal lifestyle and their involvement in local businesses and industries that contribute CO2.
Catholic CO2: Assumes that Catholics and non-Catholics have similar CO2 footprints from their lifestyles.
Derived from: A multiyear, global gridded fossil fuel CO2 emission data product: Evaluation and analysis of results. http://ffdas.rc.nau.edu/About.html
Rayner et al., JGR, 2010 - the first FFDAS paper, describing the version 1.0 methods and results, published in the Journal of Geophysical Research.
Asefi et al., 2014 - the paper describing the methods and results of FFDAS version 2.0, published in the Journal of Geophysical Research.
Readme version 2.2 - a simple readme file to assist in using the 10 km x 10 km, hourly gridded Vulcan version 2.2 results.
Liu et al., 2017 - a paper exploring the carbon cycle response to the 2015-2016 El Nino through carbon cycle data assimilation with FFDAS as the boundary condition for FFCO2.
"S. Asefi-Najafabady, P. J. Rayner, K. R. Gurney, A. McRobert, Y. Song, K. Coltin, J. Huang, C. Elvidge, K. Baugh. First published: 10 September 2014. https://doi.org/10.1002/2013JD021296. Cited by: 30. Link to FFDAS data retrieval and visualization: http://hpcg.purdue.edu/FFDAS/index.php
Abstract: High-resolution, global quantification of fossil fuel CO2 emissions is emerging as a critical need in carbon cycle science and climate policy. We build upon a previously developed fossil fuel data assimilation system (FFDAS) for estimating global high-resolution fossil fuel CO2 emissions. We have improved the underlying observationally based data sources, expanded the approach through treatment of separate emitting sectors including a new pointwise database of global power plants, and extended the results to cover a 1997 to 2010 time series at a spatial resolution of 0.1°. Long-term trend analysis of the resulting global emissions shows subnational spatial structure in large active economies such as the United States, China, and India. These three countries, in particular, show different long-term trends, and exploration of the trends in nighttime lights and population reveals a decoupling of population and emissions at the subnational level. Analysis of shorter-term variations reveals the impact of the 2008-2009 global financial crisis, with widespread negative emission anomalies across the U.S. and Europe. We have used a center of mass (CM) calculation as a compact metric to express the time evolution of spatial patterns in fossil fuel CO2 emissions. The global emission CM has moved toward the east and somewhat south between 1997 and 2010, driven by the increase in emissions in China and South Asia over this time period. Analysis at the level of individual countries reveals per capita CO2 emission migration in both Russia and India. The per capita emission CM holds potential as a way to succinctly analyze subnational shifts in carbon intensity over time. Uncertainties are generally lower than the previous version of FFDAS due mainly to an improved nightlight data set."
Global Diocesan Boundaries: Burhans, M., Bell, J., Burhans, D., Carmichael, R., Cheney, D., Deaton, M., Emge, T., Gerlt, B., Grayson, J., Herries, J., Keegan, H., Skinner, A., Smith, M., Sousa, C., Trubetskoy, S. "Diocesan Boundaries of the Catholic Church" [Feature Layer]. Scale not given. Version 1.2. Redlands, CA, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2016.
Using: ArcGIS 10.4. Version 10.0. Redlands, CA: Environmental Systems Research Institute, Inc., 2016.
Boundary Provenance, Statistics and Leadership Data
Cheney, D.M. "Catholic Hierarchy of the World" [Database]. Date updated: August 2019. Catholic Hierarchy. Using: Paradox. Retrieved from original source.
Annuario Pontificio per l'Anno. Città del Vaticano: Tipografia Poliglotta Vaticana, multiple years.
The data for these maps were extracted from the gold standard of Church data, the Annuario Pontificio, published yearly by the Vatican. The collection and data-development practices of the Vatican Statistics Office are unknown. GoodLands is not responsible for errors within this data. We encourage people to document and report errant information to us at data@good-lands.org or directly to the Vatican. Additional information about regular changes in bishops and sees comes from a variety of public diocesan and news announcements.
GoodLands' polygon data layers, version 2.0, for global ecclesiastical boundaries of the Roman Catholic Church: Although care has been taken to ensure the accuracy, completeness and reliability of the information provided, this is the first developed dataset of global ecclesiastical boundaries curated from many sources, so it may have a higher margin of error than established geopolitical administrative boundary maps. Boundaries need to be verified with the appropriate ecclesiastical leadership. The current information is subject to change without notice. No parties involved in the creation of this data are liable for indirect, special or incidental damage resulting from, arising out of or in connection with the use of the information. We referenced 1,960 sources to build our global datasets of ecclesiastical jurisdictions; often these were isolated images of dioceses, historical documents, and information about parishes that were cross-checked. These sources can be viewed here: https://docs.google.com/spreadsheets/d/11ANlH1S_aYJOyz4TtG0HHgz0OLxnOvXLHMt4FVOS85Q/edit#gid=0
To learn more or contact us, please visit: https://good-lands.org/
Esri Gridded Population Data 2016
Description
This layer is a global estimate of human population for 2016. Esri created this estimate by modeling a footprint of where people live as a dasymetric settlement likelihood surface, and then assigned 2016 population estimates stored on polygons of the finest level of geography available onto the settlement surface. "Where people live" means where their homes are, i.e., where people sleep most of the time, as opposed to where they work. Another way to think of this estimate is as a night-time estimate rather than a day-time estimate. Knowledge of population distribution helps us understand how humans affect the natural world and how natural events such as storms, earthquakes, and other phenomena affect humans. This layer represents the footprint of where people live, and how many people live there.
Dataset Summary
Each cell in this layer has an integer value with the estimated number of people likely to live in the geographic region represented by that cell. Esri additionally produced several related layers:
World Population Estimate Confidence 2016: the confidence level (1-5) per cell for the probability of people being located and estimated correctly.
World Population Density Estimate 2016: this layer is represented as population density in units of persons per square kilometer.
World Settlement Score 2016: the dasymetric likelihood surface used to create this layer by apportioning population from census polygons to the settlement score raster.
To use this layer in analysis, several properties or geoprocessing environment settings should be used:
Coordinate system: WGS_1984. This service and its underlying data are in WGS_1984. Projecting population count data will change the populations, because resampling either collapses or splits cells to fit another coordinate system.
Cell size: 0.0013474728 degrees (approximately 150 meters) at the equator.
No data: -1. Bit depth: 32-bit signed.
This layer has query, identify, pixel, and export image functions enabled, and is restricted to a maximum analysis size of 30,000 x 30,000 pixels - an area about the size of Africa.
Frye, C. et al. (2018). Using Classified and Unclassified Land Cover Data to Estimate the Footprint of Human Settlement. Data Science Journal, 17, p.20. DOI: http://doi.org/10.5334/dsj-2018-020
What can you do with this layer?
This layer is not suitable for mapping or cartographic use, and thus does not include a convenient legend. Instead, it is useful for analysis, particularly for estimating counts of people living within watersheds, coastal areas, and other areas that do not have standard boundaries. Esri recommends using the Zonal Statistics tool or the Zonal Statistics to Table tool, where you provide input zones as either polygons or raster data, and the tool summarizes the count of population within those zones. https://www.esri.com/arcgis-blog/products/arcgis-living-atlas/data-management/2016-world-population-estimate-services-are-now-available/
The UCF-Crime dataset is a large-scale dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world surveillance videos, with 13 realistic anomalies including Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism. These anomalies are selected because they have a significant impact on public safety.
This dataset can be used for two tasks: first, general anomaly detection, treating all anomalies as one group and all normal activities as another; and second, recognizing each of the 13 anomalous activities.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "Forest Proximate People" (FPP) dataset is one of the data layers contributing to the development of indicator #13, “number of forest-dependent people in extreme poverty,” of the Collaborative Partnership on Forests (CPF) Global Core Set of forest-related indicators (GCS). The FPP dataset provides an estimate of the number of people living in or within 5 kilometers of forests (forest-proximate people) for the year 2019 with a spatial resolution of 100 meters at a global level.
For more detail, such as the theory behind this indicator and the definition of parameters, and to cite this data, see: Newton, P., Castle, S.E., Kinzer, A.T., Miller, D.C., Oldekop, J.A., Linhares-Juvenal, T., Pina, L., Madrid, M., & de Lamo, J. 2022. The number of forest- and tree-proximate people: A new methodology and global estimates. Background Paper to The State of the World's Forests 2022 report. Rome, FAO.
Contact points:
Maintainer: Leticia Pina
Maintainer: Sarah E. Castle
Data lineage:
The FPP data are generated using Google Earth Engine. Forests are defined following the Copernicus Global Land Cover (CGLC) (Buchhorn et al. 2020) classification system's definition of forests: tree cover ranging from 15-100%, with or without an understory of shrubs and grassland, and including both open and closed forests. Any area classified as forest and sized ≥ 1 ha in 2019 was included in this definition. Population density was defined by the WorldPop global population data for 2019 (WorldPop 2018). High-density urban populations were excluded from the analysis. High-density urban areas were defined as any contiguous area with a total population (using 2019 WorldPop population data) of at least 50,000 people and comprised of pixels which each met at least one of two criteria: either the pixel a) had at least 1,500 people per square km, or b) was classified as "built-up" land use by the CGLC dataset (where "built-up" was defined as land covered by buildings and other manmade structures) (Dijkstra et al. 2020). Using these datasets, any rural people living in or within 5 kilometers of forests in 2019 were classified as forest-proximate people. Euclidean distance was used to create a 5-kilometer buffer zone around each forest cover pixel, as sketched below. The scripts for generating the forest-proximate people and rural-urban datasets using different parameters or for different years are published and available to users. For more detail, such as the theory behind this indicator and the definition of parameters, and to cite this data, see: Newton, P., Castle, S.E., Kinzer, A.T., Miller, D.C., Oldekop, J.A., Linhares-Juvenal, T., Pina, L., Madrid, M., & de Lamo, J. 2022. The number of forest- and tree-proximate people: A new methodology and global estimates. Background Paper to The State of the World's Forests 2022 report. Rome, FAO.
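As a conceptual illustration only (the published scripts run in Google Earth Engine; this Python sketch approximates the 5 km Euclidean buffer on a projected 100 m raster and ignores geodesic effects):

import numpy as np
from scipy.ndimage import distance_transform_edt

def forest_proximate_population(forest_mask, rural_population,
                                pixel_size_m=100.0, cutoff_m=5000.0):
    """Sum the rural population in or within cutoff_m of forest pixels."""
    # distance_transform_edt measures the distance to the nearest zero-valued
    # cell, so forest cells are encoded as 0 (False) by inverting the mask.
    dist_px = distance_transform_edt(~forest_mask.astype(bool))
    proximate = dist_px * pixel_size_m <= cutoff_m
    return float(rural_population[proximate].sum())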
References:
Buchhorn, M., Smets, B., Bertels, L., De Roo, B., Lesiv, M., Tsendbazar, N.E., Herold, M., Fritz, S., 2020. Copernicus Global Land Service: Land Cover 100m: collection 3 epoch 2019. Globe.
Dijkstra, L., Florczyk, A.J., Freire, S., Kemper, T., Melchiorri, M., Pesaresi, M. and Schiavina, M., 2020. Applying the degree of urbanisation to the globe: A new harmonised definition reveals a different picture of global urbanisation. Journal of Urban Economics, p.103312.
WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University, 2018. Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076). https://dx.doi.org/10.5258/SOTON/WP00645
Online resources:
GEE asset for "Forest proximate people - 5km cutoff distance"
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This dataset contains global spatially predicted sea-surface iodide concentrations at a monthly resolution for the year 1970. It was developed as part of the NERC project "Iodide in the ocean: distribution and impact on iodine flux and ozone loss" (NE/N009983/1), which aimed to quantify the dominant controls on the sea-surface iodide distribution and improve parameterisation of the sea-to-air iodine flux and of ozone deposition.
This dataset is the output used in the published paper 'A machine learning based global sea-surface iodide distribution' (https://doi.org/10.5194/essd-2019-40).
The main ensemble prediction ("Ensemble_Monthly_mean") is provided in a NetCDF file as a single variable (1). A second file (2) includes all of the individual ensemble-member predictions and the standard deviation of the prediction:
(1) predicted_iodide_0.125x0.125_Ns_Just_Ensemble.nc
(2) predicted_iodide_0.125x0.125_Ns_All_Ensemble_members.nc
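A minimal xarray sketch for reading the core file (the variable name follows the description above; the 'time' dimension name is an assumption):

import xarray as xr

ds = xr.open_dataset("predicted_iodide_0.125x0.125_Ns_Just_Ensemble.nc")
iodide = ds["Ensemble_Monthly_mean"]   # monthly-mean ensemble prediction
print(iodide.dims, iodide.shape)
january = iodide.isel(time=0)          # first monthly field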
For ease of use, this output has been re-gridded to various commonly used atmosphere and ocean model resolutions (see SI Table A5 in the paper). These re-gridded files are included in the folder titled "regridded_data".
Additionally, a further file (3) is provided that includes the prediction made with data from the Skagerrak dataset included. As stated in the paper referenced above, it is recommended to use the core files (1, 2) or their re-gridded equivalents.
(3) predicted_iodide_0.125x0.125_All_Ensemble_members.nc
As new observations are made, this global data product will be updated through a "living data" model. Dataset versions follow semantic versioning (https://semver.org/). This record contains the first publicly released version, v0.0.1, which supersedes the pre-review dataset v0.0.0. Please refer to the paper referenced above for the current version number and related information.
Updates for v0.0.1 vs. v0.0.0:
- Additional files included of the core data re-gridded to 0.5x0.5 degree and 0.25x0.25 degree horizontal resolution.
- Minor updates applied to all metadata in the NetCDF files.
- Updates made to the coordinate grids used for re-gridding files from 1x1 degree to 4x5 degree.
The NuiSI dataset contains skeleton-tracking trajectories of human interaction partners performing a variety of physically interactive behaviors (waving, handshaking, rocket fistbump, parachute fistbump) with each other. It is inspired by the dataset in Bütepage et al., "Imitating by generating: Deep generative models for imitation of interactive tasks", Frontiers in Robotics and AI (2020), in which a dataset was captured with Rokoko motion-capture suits. Instead, we track the skeletons of the interaction partners with Intel RealSense cameras using Nuitrack, for a more realistic scenario with noise coming from the depth sensor, the skeleton tracking, and partial occlusions. This makes it more representative of real-world interactions with a robot equipped with an RGB-D camera. This dataset is used in our papers for training interaction models for human-robot interaction with a humanoid social robot. If you find the dataset useful in your work, please cite our paper:
Prasad, V., Heitlinger, L., Koert, D., Stock-Homburg, R., Peters, J., & Chalvatzaki, G. (2023). Learning multimodal latent dynamics for human-robot interaction. arXiv preprint arXiv:2311.16380.
@article{prasad2023learning,
  title={Learning multimodal latent dynamics for human-robot interaction},
  author={Prasad, Vignesh and Heitlinger, Lea and Koert, Dorothea and Stock-Homburg, Ruth and Peters, Jan and Chalvatzaki, Georgia},
  journal={arXiv preprint arXiv:2311.16380},
  year={2023}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data were collected and disseminated according to this publication: https://www.nature.com/articles/s41597-019-0273-5
All descriptors below are taken from this publication and are copyright of the authors.
Adaptive interactions between building occupants and their surrounding environments affect both energy use and environmental quality, as demonstrated by a large body of modeling research that quantifies the impacts of occupant behavior on building operations. Yet, available occupant field data are insufficient to explore the mechanisms that drive this interaction. This paper introduces data from a one-year study of 24 U.S. office occupants that recorded a comprehensive set of possible exogenous and endogenous drivers of personal comfort and behavior over time. The longitudinal data collection protocol merges individual thermal comfort, preference, and behavior information from online daily surveys with datalogger readings of occupants' local thermal environments and control states, yielding 2503 survey responses alongside tens of thousands of concurrent behavior and environment measurements. These data have been used to uncover links between the built environment, personal variables, and adaptive actions, and the data contribute to international research collaborations focused on understanding the human-building interaction.
Humans interact with the built environment in a variety of ways that contribute to both building energy use and environmental quality, and these interactions thus warrant significant attention in the building design, operation, and retrofit processes. Occupants' thermally adaptive behaviours, such as adjusting thermostats and clothing, opening and closing windows and doors, and operating personal heating and cooling devices, are strongly tied to total site energy consumed in residential and commercial buildings in the United States (U.S.). This dataset introduces longitudinal data from a one-year study of occupant thermal comfort and several related behavioural adaptations in an air-conditioned U.S. office setting. The primary objective of the data collection approach was to record a comprehensive range of exogenous and endogenous factors that may drive personal comfort and behaviour outcomes over time.
Longitudinal data on building occupant behavior, comfort, and environmental conditions were collected between July 2012 and August 2013 at the Friends Center office building in Center City Philadelphia, Pennsylvania, United States. Data collection proceeded in three stages:
Semi-structured interviews
Semi-structured interviews identify aspects of behavior that are not yet well known or understood and provide rich qualitative context for developing and interpreting responses from structured survey instruments. 32 interviews about thermal comfort and related behaviours were first conducted with office occupants from 7 air-conditioned buildings around the Philadelphia region, ranging from aging to recently renovated.
Site selection and subject recruitment for the longitudinal study
Subject recruitment was initiated through an e-mail message sent to all employees in the Friends Center by its Executive Director. The following question areas were included: (a) demographic information, (b) office characteristics, (c) thermal comfort and preferences, (d) control options, (e) personal values, and (f) typical work schedule (arrival, lunch, and departure times).
Longitudinal survey and datalogger measurements
The final occupant sample participated in a series of subjective and objective measurements of thermal comfort, adaptive behavior, and related items. These measurements were carried out via longitudinal online surveys, as well as through parallel datalogger and BAS measurements of the local environment and behavioural actions.
Several measures were taken to ensure the validity of the collected data, following the data collection guidance included in the final report for International Energy Agency Annex 66: Definition and Simulation of Occupant Behavior in Buildings. These measures include a survey preparation phase, encouraging high response rates, pilot studies, quality control, and redundancy and comparison against expected conditions. Details of these measures can be found in the paper.