This dataset is a compilation of address point data for the City of Tempe. The dataset contains a point location, the official address (as defined by The Building Safety Division of Community Development) for all occupiable units and any other official addresses in the City. There are several additional attributes that may be populated for an address, but they may not be populated for every address.
Contact: Lynn Flaaen-Hanna, Development Services Specialist
Contact E-mail Link: Map that Lets You Explore and Export Address Data
Data Source: The initial dataset was created by combining several datasets and then reviewing the information to remove duplicates and identify errors. This published dataset is the system of record for Tempe addresses going forward, with the address information being created and maintained by The Building Safety Division of Community Development.
Data Source Type: ESRI ArcGIS Enterprise Geodatabase
Preparation Method: N/A
Publish Frequency: Weekly
Publish Method: Automatic
Data Dictionary
This dataset lists the various data sources used within the Department of Community Resources & Services for internal and external reports. It allows individuals and organizations to identify the type of data they are looking for and the geographical level (e.g., national, state, county) at which it is available. This dataset will be updated every quarter and should be utilized for research purposes.
https://creativecommons.org/publicdomain/zero/1.0/
The World Bank is an international financial institution that provides loans to countries of the world for capital projects. The World Bank's stated goal is the reduction of poverty. Source: https://en.wikipedia.org/wiki/World_Bank
This dataset combines key education statistics from a variety of sources to provide a look at global literacy, spending, and access.
For more information, see the World Bank website.
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:world_bank_health_population
http://data.worldbank.org/data-catalog/ed-stats
https://cloud.google.com/bigquery/public-data/world-bank-education
Citation: The World Bank: Education Statistics
Dataset Source: World Bank. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @till_indeman from Unsplash.
Of total government spending, what percentage is spent on education?
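A question like this can be answered directly against the BigQuery tables. Here is a minimal sketch using the google-cloud-bigquery Python client; the table and column names, and the use of indicator SE.XPD.TOTL.GB.ZS for this question, are assumptions about the public dataset's schema that should be verified in the BigQuery console.

```python
# Minimal sketch: query the education statistics in BigQuery.
# Assumptions: the table bigquery-public-data.world_bank_intl_education
# .international_education exists with columns country_name, year,
# indicator_code and value, and SE.XPD.TOTL.GB.ZS is the indicator for
# education spending as a share of total government expenditure.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT country_name, year, value
    FROM `bigquery-public-data.world_bank_intl_education.international_education`
    WHERE indicator_code = 'SE.XPD.TOTL.GB.ZS'
    ORDER BY value DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.country_name, row.year, row.value)
```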
Classification of Mars Terrain Using Multiple Data Sources. Alan Kraut, David Wettergreen. ABSTRACT. Images of Mars are being collected faster than they can be analyzed by planetary scientists. Automatic analysis of images would enable more rapid and more consistent image interpretation and could draft geologic maps where none yet exist. In this work we develop a method for incorporating images from multiple instruments to classify Martian terrain into multiple types. Each image is segmented into contiguous groups of similar pixels, called superpixels, with an associated vector of discriminative features. We have developed and tested several classification algorithms to associate a best class to each superpixel. These classifiers are trained using three different manual classifications with between 2 and 6 classes. Automatic classification accuracies of 50 to 80% are achieved in leave-one-out cross-validation across 20 scenes using a multi-class boosting classifier.
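The superpixel-and-classifier pipeline described in the abstract can be sketched with off-the-shelf tools. This is illustrative only: SLIC segmentation, mean-colour features, and scikit-learn's gradient boosting stand in for the paper's unstated feature set and multi-class boosting algorithm.

```python
# Illustrative sketch of a superpixel classification pipeline.
import numpy as np
from skimage.segmentation import slic
from sklearn.ensemble import GradientBoostingClassifier

def superpixel_features(image, n_segments=200):
    """Segment an image into superpixels and return, for each one, a
    simple feature vector (mean colour plus centroid position)."""
    segments = slic(image, n_segments=n_segments, start_label=0)
    feats = []
    for label in np.unique(segments):
        mask = segments == label
        ys, xs = np.nonzero(mask)
        mean_colour = image[mask].mean(axis=0)
        feats.append(np.concatenate([mean_colour, [ys.mean(), xs.mean()]]))
    return segments, np.array(feats)

# Training on manually classified scenes: X stacks the superpixel
# feature vectors, y holds each superpixel's terrain class (the paper's
# manual labelings use between 2 and 6 classes).
# clf = GradientBoostingClassifier().fit(X, y)
# predictions = clf.predict(new_scene_features)
```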
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
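As a concrete illustration of the checksum-based referencing, here is a minimal sketch for looking up a local license file among the deduplicated blobs; the metadata CSV filename and the column names 'sha1' and 'spdx_license' are hypothetical, so consult the included README for the real layout.

```python
# Hypothetical sketch: find a local license file's blob via its
# cryptographic checksum. The CSV filename and the column names
# 'sha1' and 'spdx_license' are assumptions; see the README for the
# dataset's actual metadata layout.
import csv
import hashlib

def sha1_of(path):
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def detected_spdx(checksum, metadata_csv="license_metadata.csv"):
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            if row["sha1"] == checksum:
                return row["spdx_license"]  # as detected by ScanCode
    return None

print(detected_spdx(sha1_of("LICENSE")))
```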
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
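A minimal sketch of loading the file with pandas follows; the column names used below ('label') are assumptions about the CSV header, so inspect it first.

```python
# Minimal sketch: load NICHE.csv with pandas. The column name 'label'
# is an assumption about the header; print df.columns first to check.
import pandas as pd

df = pd.read_csv("NICHE.csv")
print(df.columns.tolist())

# The paper labels 441 projects as engineered and 131 as
# non-engineered; the label counts should reflect that split.
print(df["label"].value_counts())
```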
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Open Data 500, funded by the John S. and James L. Knight Foundation (http://www.knightfoundation.org/) and conducted by the GovLab, is the first comprehensive study of U.S. companies that use open government data to generate new business and develop new products and services. Its goals are to:
Provide a basis for assessing the economic value of government open data
Encourage the development of new open data companies
Foster a dialogue between government and business on how government data can be made more useful
The Open Data 500 study is conducted by the GovLab at New York University with funding from the John S. and James L. Knight Foundation. The GovLab works to improve people’s lives by changing how we govern, using technology-enabled solutions and a collaborative, networked approach. As part of its mission, the GovLab studies how institutions can publish the data they collect as open data so that businesses, organizations, and citizens can analyze and use this information.
The Open Data 500 team has compiled our list of companies through (1) outreach campaigns, (2) advice from experts and professional organizations, and (3) additional research.
Outreach Campaign
Mass email to over 3,000 contacts in the GovLab network
Mass email to over 2,000 contacts via OpenDataNow.com
Blog posts on TheGovLab.org and OpenDataNow.com
Social media recommendations
Media coverage of the Open Data 500
Attending presentations and conferences
Expert Advice
Recommendations from government and non-governmental organizations
Guidance and feedback from Open Data 500 advisors
Research
Companies identified for the book, Open Data Now
Companies using datasets from Data.gov
Directory of open data companies developed by Deloitte
Online Open Data Userbase created by Socrata
General research from publicly available sources
The Open Data 500 is not a rating or ranking of companies. It covers companies of different sizes and categories, using various kinds of data.
The Open Data 500 is not a competition, but an attempt to give a broad, inclusive view of the field.
The Open Data 500 study also does not provide a random sample for definitive statistical analysis. Since this is the first thorough scan of companies in the field, it is not yet possible to determine the exact landscape of open data companies.
https://creativecommons.org/publicdomain/zero/1.0/
More details about each file are in the individual file descriptions.
This is a dataset hosted by the State of New York. The state has an open data platform found here, and it updates its information according to the amount of data that is brought in. Explore New York State using Kaggle and all of the data sources available through the State of New York organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
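For programmatic access, Socrata-hosted datasets expose a SODA endpoint. Below is a minimal sketch using requests; the dataset identifier 'abcd-1234' is a placeholder for illustration (each dataset's page on the platform lists its real identifier).

```python
# Minimal sketch: fetch rows from a Socrata-hosted dataset through the
# SODA API. The identifier 'abcd-1234' is a placeholder.
import requests

url = "https://data.ny.gov/resource/abcd-1234.json"
resp = requests.get(url, params={"$limit": 100})  # SoQL paging parameter
resp.raise_for_status()
rows = resp.json()
print(f"fetched {len(rows)} rows")
```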
This dataset is distributed under the following licenses: Public Domain
This dataset lists all software in use by NASA.
This is a PDF document created by the Department of Information Technology (DoIT) and the Governor's Office of Performance Improvement to assist in training Maryland state employees in the use of the Open Data Portal, https://opendata.maryland.gov. This document covers direct data entry, uploading Excel spreadsheets, connecting source databases, and transposing data. Please note that this tutorial is intended for use by state employees, as non-state users cannot upload datasets to the Open Data Portal.
https://creativecommons.org/publicdomain/zero/1.0/
Description:
The "Daily Social Media Active Users" dataset provides a comprehensive and dynamic look into the digital presence and activity of global users across major social media platforms. The data was generated to simulate real-world usage patterns for 13 popular platforms, including Facebook, YouTube, WhatsApp, Instagram, WeChat, TikTok, Telegram, Snapchat, X (formerly Twitter), Pinterest, Reddit, Threads, LinkedIn, and Quora. This dataset contains 10,000 rows and includes several key fields that offer insights into user demographics, engagement, and usage habits.
Dataset Breakdown:
Platform: The name of the social media platform where the user activity is tracked. It includes globally recognized platforms, such as Facebook, YouTube, and TikTok, that are known for their large, active user bases.
Owner: The company or entity that owns and operates the platform. Examples include Meta for Facebook, Instagram, and WhatsApp, Google for YouTube, and ByteDance for TikTok.
Primary Usage: This category identifies the primary function of each platform. Social media platforms differ in their primary usage, whether it's for social networking, messaging, multimedia sharing, professional networking, or more.
Country: The geographical region where the user is located. The dataset simulates global coverage, showcasing users from diverse locations and regions. It helps in understanding how user behavior varies across different countries.
Daily Time Spent (min): This field tracks how much time a user spends on a given platform on a daily basis, expressed in minutes. Time spent data is critical for understanding user engagement levels and the popularity of specific platforms.
Verified Account: Indicates whether the user has a verified account. This feature mimics real-world patterns where verified users (often public figures, businesses, or influencers) have enhanced status on social media platforms.
Date Joined: The date when the user registered or started using the platform. This data simulates user account history and can provide insights into user retention trends or platform growth over time.
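A minimal sketch of exploring these fields with pandas is shown below; the filename and exact column spellings are assumptions based on the breakdown above, so adjust them to the actual CSV header.

```python
# Minimal sketch: explore the fields described above with pandas.
# The filename and column names are assumptions from this description.
import pandas as pd

df = pd.read_csv("daily_social_media_active_users.csv")

# Average daily minutes per platform, highest first.
print(
    df.groupby("Platform")["Daily Time Spent (min)"]
      .mean()
      .sort_values(ascending=False)
      .head(10)
)

# Share of verified accounts by country (assumes the column is
# boolean or 0/1; map "Yes"/"No" values first if necessary).
print(df.groupby("Country")["Verified Account"].mean().head(10))
```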
Context and Use Cases:
Researchers, data scientists, and developers can use this dataset to:
Model User Behavior: By analyzing patterns in daily time spent, verified status, and country of origin, users can model and predict social media engagement behavior.
Test Analytics Tools: Social media monitoring and analytics platforms can use this dataset to simulate user activity and optimize their tools for engagement tracking, reporting, and visualization.
Train Machine Learning Algorithms: The dataset can be used to train models for various tasks like user segmentation, recommendation systems, or churn prediction based on engagement metrics.
Create Dashboards: This dataset can serve as the foundation for creating user-friendly dashboards that visualize user trends, platform comparisons, and engagement patterns across the globe.
Conduct Market Research: Business intelligence teams can use the data to understand how various demographics use social media, offering valuable insights into the most engaged regions, platform preferences, and usage behaviors.
Sources of Inspiration: This dataset is inspired by public data from industry reports, such as those from Statista, DataReportal, and other market research platforms. These sources provide insights into the global user base and usage statistics of popular social media platforms. The synthetic nature of this dataset allows for the use of realistic engagement metrics without violating any privacy concerns, making it an ideal tool for educational, analytical, and research purposes.
The structure and design of the dataset are based on real-world usage patterns and aim to represent a variety of users from different backgrounds, countries, and activity levels. This diversity makes it an ideal candidate for testing data-driven solutions and exploring social media trends.
Future Considerations:
As the social media landscape continues to evolve, this dataset can be updated or extended to include new platforms, engagement metrics, or user behaviors. Future iterations may incorporate features like post frequency, follower counts, engagement rates (likes, comments, shares), or even sentiment analysis from user-generated content.
By leveraging this dataset, analysts and data scientists can create better, more effective strategies ...
This database table consists of a preliminary source list for the Einstein Observatory's High Resolution Imager (HRI). The source list, obtained from EINLINE, the Einstein On-line Service at the Smithsonian Astrophysical Observatory (SAO), contains basic information about the sources detected with the HRI. This is a service provided by NASA HEASARC.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Market research dataset covering growth of the global open-source software market, including benefits, adoption, and enterprise usage in 2025.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the accompanying dataset to the following paper: https://www.nature.com/articles/s41597-023-01975-w
Caravan is an open community dataset of meteorological forcing data, catchment attributes, and discharge data for catchments around the world. Additionally, Caravan provides code to derive meteorological forcing data and catchment attributes from the same data sources in the cloud, making it easy for anyone to extend Caravan to new catchments. The vision of Caravan is to provide the foundation for a truly global open source community resource that will grow over time.
If you use Caravan in your research, please cite not only Caravan itself but also the source datasets, in recognition of the work that went into creating them and that made Caravan possible in the first place.
All current development and additional community extensions can be found at https://github.com/kratzert/Caravan
Change Log:
23 May 2022: Version 0.2 - Resolved a bug when renaming the LamaH gauge ids from the LamaH ids to the official gauge ids provided as "govnr" in the LamaH dataset attribute files.
24 May 2022: Version 0.3 - Fixed gaps in forcing data in some "camels" (US) basins.
15 June 2022: Version 0.4 - Fixed replacing negative CAMELS US values with NaN (-999 in CAMELS indicates missing observation).
1 December 2022: Version 0.4 - Added 4298 basins in the US, Canada and Mexico (part of HYSETS), now totalling 6830 basins. Fixed a bug in the computation of catchment attributes that are defined as pour point properties, where sometimes the wrong HydroATLAS polygon was picked. Restructured the attribute files and added some more metadata (station name and country).
16 January 2023: Version 1.0 - Version of the official paper release. No changes in the data but added a static copy of the accompanying code of the paper. For the most up to date version, please check https://github.com/kratzert/Caravan
10 May 2023: Version 1.1 - No data change, just update data description.
17 May 2023: Version 1.2 - Updated a handful of attribute values that were affected by a bug in their derivation. See https://github.com/kratzert/Caravan/issues/22 for details.
16 April 2024: Version 1.4 - Added 9130 gauges from the original source dataset that were initially not included because of the area thresholds (i.e. basins smaller than 100 sq km or larger than 2000 sq km). Also extended the forcing period for all gauges (including the original ones) to 1950-2023. Added two different download options that include timeseries data only, as either csv files (Caravan-csv.tar.xz) or netcdf files (Caravan-nc.tar.xz). Including the large basins also required an update to the Earth Engine code.
16 Jan 2025: Version 1.5 - Added FAO Penman-Monteith PET (potential_evaporation_sum_FAO_PENMAN_MONTEITH) and renamed the ERA5-LAND potential_evaporation band to potential_evaporation_sum_ERA5_LAND. Also added all PET-related climate indices derived with the Penman-Monteith PET band (suffix "_FAO_PM") and renamed the old PET-related indices accordingly (suffix "_ERA5_LAND").
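For readers new to the dataset, here is a minimal sketch of reading one gauge's timeseries from the csv distribution (Caravan-csv.tar.xz); the directory layout, gauge id, and column names are assumptions for illustration, so see the dataset description and README for the real structure.

```python
# Minimal sketch: read one gauge's timeseries from the csv
# distribution. Path, gauge id and column names are assumptions.
import pandas as pd

path = "Caravan/timeseries/csv/camels/camels_01013500.csv"  # hypothetical
ts = pd.read_csv(path, parse_dates=["date"])

# e.g. inspect discharge alongside the FAO Penman-Monteith PET band
# introduced in version 1.5.
cols = ["date", "streamflow", "potential_evaporation_sum_FAO_PENMAN_MONTEITH"]
print(ts[[c for c in cols if c in ts.columns]].head())
```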
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
A source protection area (SPA) is an area of land and water governed by a Source Protection Authority, an agency, person or body. This dataset defines the geographic boundaries where each SPA's terms of reference, assessment reports and source protection plans must be developed.
By Throwback Thursday [source]
Here are some tips on how to make the most out of this dataset:
Data Exploration:
- Begin by understanding the structure and contents of the dataset. Evaluate the number of rows (sites) and columns (attributes) available.
- Check for missing values or inconsistencies in data entry that may impact your analysis.
- Assess column descriptions to understand what information is included in each attribute.
Geographical Analysis:
- Leverage geographical features such as latitude and longitude coordinates provided in this dataset.
- Plot these sites on a map using any mapping software or library like Google Maps or Folium for Python (see the sketch after these tips). Visualizing their distribution can provide insights into patterns based on location, climate, or cultural factors.
Analyzing Attributes:
- Familiarize yourself with different attributes available for analysis. Possible attributes include Name, Description, Category, Region, Country, etc.
- Understand each attribute's format and content type (categorical, numerical) for better utilization during data analysis.
Exploring Categories & Regions:
- Look at unique categories mentioned in the Category column (e.g., Cultural Site, Natural Site) to explore specific interests. This could help identify clusters within particular heritage types across countries/regions worldwide.
- Analyze regions with high concentrations of heritage sites using data visualizations like bar plots or word clouds based on frequency counts.
Identify Trends & Patterns:
- Discover recurring themes across various sites by analyzing descriptive text attributes such as names and descriptions.
- Identify patterns and correlations between attributes by performing statistical analysis or utilizing machine learning techniques.
Comparison:
- Compare different attributes to gain a deeper understanding of the sites.
- For example, analyze the number of heritage sites per country/region or compare the distribution between cultural and natural heritage sites.
Additional Data Sources:
- Use this dataset as a foundation to combine it with other datasets for in-depth analysis. There are several sources available that provide additional data on UNESCO World Heritage Sites, such as travel blogs, official tourism websites, or academic research databases.
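As referenced under Geographical Analysis above, here is a minimal Folium sketch for plotting the sites on a world map; the filename and the column names 'Name', 'Latitude' and 'Longitude' are assumptions about this dataset's header.

```python
# Minimal sketch: plot heritage sites on a world map with Folium.
# Filename and column names are assumptions about the CSV header.
import folium
import pandas as pd

df = pd.read_csv("unesco_world_heritage_sites.csv")

m = folium.Map(location=[20, 0], zoom_start=2)  # centred world view
for _, site in df.iterrows():
    folium.Marker(
        location=[site["Latitude"], site["Longitude"]],
        popup=site["Name"],
    ).add_to(m)
m.save("heritage_sites_map.html")
```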
Remember to cite this dataset appropriately if you use it in your research.
- Travel Planning: This dataset can be used to identify and plan visits to UNESCO World Heritage sites around the world. It provides information about the location, category, and date of inscription for each site, allowing users to prioritize their travel destinations based on personal interests or preferences.
- Cultural Preservation: Researchers or organizations interested in cultural preservation can use this dataset to analyze trends in UNESCO World Heritage site listings over time. By studying factors such as geographical distribution, types of sites listed, and inscription dates, they can gain insights into patterns of cultural heritage recognition and protection.
- Statistical Analysis: The dataset can be used for statistical analysis to explore various aspects related to UNESCO World Heritage sites. For example, it could be used to examine the correlation between a country's economic indicators (such as GDP per capita) and the number or type of World Heritage sites it possesses. This analysis could provide insights into the relationship between economic development and cultural preservation efforts at a global scale.
See the dataset description for more information. If you use this dataset in your research, please credit the original author, Throwback Thursday.
The Planck list of high-redshift source candidates (PHZ) is a list of 2151 sources located in the cleanest 26% of the sky and identified as point sources exhibiting an excess in the submillimeter compared to their environment. It was built using 48 months of Planck data at 857, 545, 353 and 217 GHz combined with the 3 THz IRAS data, as described in Planck-2015-XXXIX. These sources are considered high-z source candidates (z>1.5-2), given the very low contamination by Galactic cirrus and their typical colour-colour ratios. A subsample of the PHZ list has already been followed up with Herschel, and characterized as overdensities of red galaxies for more than 93% of the population, and as strongly lensed galaxies in 3% of the cases, as detailed in Planck-2014-XXVIII.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sources of data for Figure 3 (in the online article; Figure 2 in print) and Tables 1, 2a, 2b, 3a and 3b of the Geoscientist article Digging into data access: The need for reform. Files:
digging_into_data_access_sources.xlsx - spreadsheet listing all references for Figure 3 (in the online article; Figure 2 in print) and Tables 1, 2a, 2b, 3a and 3b. The spreadsheet includes references to page numbers on which quoted figures are given
Source files for Figure 3 (in the online article; Figure 2 in print):
digging_into_data_access_figure3online_figure2print_source_documentA.pdf - British Geological Survey Annual Report 2019–2020. Source of data for the column BGS in Figure 3
digging_into_data_access_figure3online_figure2print_source_documentB.pdf - OGA Annual Report and Accounts 2020–21. Source of data for the column OGA in Figure 3
digging_into_data_access_figure3online_figure2print_source_documentC.pdf - UK Onshore Geophysical Library Trustees' Report and Financial Statements 2020 for the Year Ended 31 December 2020. Source of data for the column UKOGL in Figure 3
digging_into_data_access_figure3online_figure2print_source_documentD.pdf - Environment Agency Annual Report and Accounts for the Financial Year 2020 to 2021. Source of data for the column EA in Figure 3
Source files for Tables 1, 2a, 2b, 3a and 3b:
digging_into_data_access_tables_source_documentX.pdf - X corresponds to the number in the column Source in Tables 1, 2a, 2b, 3a and 3b. See also the tab 'tables_sources' in the spreadsheet digging_into_data_access_sources.xlsx
The dataset contains the zone for new centralized heat production sources, one of the 2021 engineering infrastructure solutions from the Vilnius city general plan.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset and replication package of the study "A continuous open source data collection platform for architectural technical debt assessment".
Abstract
Architectural decisions are the most important source of technical debt. In recent years, researchers have spent an increasing amount of effort investigating this specific category of technical debt, with quantitative methods, and in particular static analysis, being the most common approach.
However, quantitative studies are susceptible, to varying degrees, to external validity threats, which hinder the generalisation of their findings.
In response to this concern, researchers strive to expand the scope of their studies by incorporating a larger number of projects into their analyses. This practice is typically executed on a case-by-case basis, necessitating substantial data collection efforts that have to be repeated for each new study.
To address this issue, this paper presents our initial attempt at tackling this problem, enabling researchers to study architectural smells, a well-known indicator of architectural technical debt, at large scale. Specifically, we introduce a novel data collection pipeline that leverages Apache Airflow to continuously generate up-to-date, large-scale datasets using Arcan, a tool for architectural smell detection (or any other tool).
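Conceptually, the pipeline schedules recurring analysis runs as Airflow tasks. The following is a hypothetical sketch of what such a DAG could look like; the DAG id, task structure, repository URL, and Arcan command line are all assumptions for illustration, not the authors' actual pipeline definition.

```python
# Hypothetical sketch of a continuous collection DAG in Apache Airflow
# (2.x API). Task names and the Arcan CLI invocation are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="arcan_atd_collection",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # re-run continuously to keep the dataset fresh
    catchup=False,
) as dag:
    # Check out the latest state of a sampled GitHub project.
    clone = BashOperator(
        task_id="clone_project",
        bash_command="git clone --depth 1 {{ params.repo_url }} /tmp/project",
        params={"repo_url": "https://github.com/example/project"},  # placeholder
    )
    # Run the architectural-smell analysis on the checkout.
    analyse = BashOperator(
        task_id="run_arcan",
        bash_command="arcan analyse /tmp/project --output /tmp/results",  # assumed CLI
    )
    clone >> analyse
```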
Finally, we present the publicly available dataset resulting from the first three months of execution of the pipeline, which includes over 30,000 analysed commits and releases from over 10,000 open source GitHub projects written in 5 different programming languages and amounting to over a billion lines of code analysed.