This dataset is a compilation of address point data for the City of Tempe. The dataset contains a point location and the official address (as defined by The Building Safety Division of Community Development) for all occupiable units and any other official addresses in the City. There are several additional attributes that may be populated for an address, but they may not be populated for every address.
Contact: Lynn Flaaen-Hanna, Development Services Specialist
Contact E-mail
Link: Map that Lets You Explore and Export Address Data
Data Source: The initial dataset was created by combining several datasets and then reviewing the information to remove duplicates and identify errors. This published dataset is the system of record for Tempe addresses going forward, with the address information being created and maintained by The Building Safety Division of Community Development.
Data Source Type: ESRI ArcGIS Enterprise Geodatabase
Preparation Method: N/A
Publish Frequency: Weekly
Publish Method: Automatic
Data Dictionary
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
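Because the metadata files reference blobs via cryptographic checksums, a row can be checked against the blob it points to. The following is a minimal sketch, assuming the checksum is a plain SHA-1 hex digest of the raw file bytes; the README should be consulted for the actual checksum type, and the function names here are hypothetical.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the hex SHA-1 digest of a blob's raw bytes."""
    return hashlib.sha1(data).hexdigest()


def blob_matches(data: bytes, expected_checksum: str) -> bool:
    """Compare a blob against the checksum recorded in a metadata row.

    Assumption: the metadata stores a plain SHA-1 hex digest. The comparison
    ignores surrounding whitespace and letter case in the recorded value.
    """
    return sha1_hex(data) == expected_checksum.strip().lower()
```

A check like this could be run after unpacking the archive to confirm that a license text was read intact before it is used for classification or text analysis.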
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open Data Sources and Resources
This dataset lists all software in use by NASA.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
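As a first sanity check when using a labelled file such as "NICHE.csv" as a benchmark, the label distribution can be tallied. This is a minimal sketch: the column name `label`, the label strings, and the sample rows are hypothetical placeholders, not the actual schema or contents of the file.

```python
import csv
import io
from collections import Counter


def count_labels(csv_text: str, label_column: str = "label") -> Counter:
    """Count how many rows carry each label value.

    `label_column` is an assumed column name; adjust it to the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column].strip() for row in reader)


# Hypothetical rows for illustration only (not taken from NICHE.csv).
SAMPLE = (
    "project,label\n"
    "owner-a/repo-a,engineered\n"
    "owner-b/repo-b,non-engineered\n"
    "owner-c/repo-c,engineered\n"
)

counts = count_labels(SAMPLE)
```

On the real file, the tally would be expected to agree with the 441 engineered and 131 non-engineered projects stated in the description.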
https://creativecommons.org/publicdomain/zero/1.0/
The World Bank is an international financial institution that provides loans to countries of the world for capital projects. The World Bank's stated goal is the reduction of poverty. Source: https://en.wikipedia.org/wiki/World_Bank
This dataset combines key education statistics from a variety of sources to provide a look at global literacy, spending, and access.
For more information, see the World Bank website.
Fork this kernel to get started with this dataset.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:world_bank_health_population
http://data.worldbank.org/data-catalog/ed-stats
https://cloud.google.com/bigquery/public-data/world-bank-education
Citation: The World Bank: Education Statistics
Dataset Source: World Bank. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @till_indeman from Unsplash.
Of total government spending, what percentage is spent on education?
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
DECD's listing of direct financial assistance to businesses from July 1, 2009 through June 30, 2024. New projects are usually added quarterly, but updates may be made on an ongoing basis.
Small Business Boost loan recipients can be found here: https://data.ct.gov/d/yk65-8y82
This is a PDF document created by the Department of Information Technology (DoIT) and the Governor's Office of Performance Improvement to assist in training Maryland state employees on use of the Open Data Portal, https://opendata.maryland.gov. This document covers direct data entry, uploading Excel spreadsheets, connecting source databases, and transposing data. Please note that this tutorial is intended for use by state employees, as non-state users cannot upload datasets to the Open Data Portal.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the accompanying dataset to the following paper: https://www.nature.com/articles/s41597-023-01975-w
Caravan is an open community dataset of meteorological forcing data, catchment attributes, and discharge data for catchments around the world. Additionally, Caravan provides code to derive meteorological forcing data and catchment attributes from the same data sources in the cloud, making it easy for anyone to extend Caravan to new catchments. The vision of Caravan is to provide the foundation for a truly global open source community resource that will grow over time.
If you use Caravan in your research, please cite not only Caravan itself but also the source datasets, out of respect for the work that went into creating those datasets and that made Caravan possible in the first place.
All current development and additional community extensions can be found at https://github.com/kratzert/Caravan
Change Log:
23 May 2022: Version 0.2 - Resolved a bug when renaming the LamaH gauge ids from the LamaH ids to the official gauge ids provided as "govnr" in the LamaH dataset attribute files.
24 May 2022: Version 0.3 - Fixed gaps in forcing data in some "camels" (US) basins.
15 June 2022: Version 0.4 - Fixed replacing negative CAMELS US values with NaN (-999 in CAMELS indicates missing observation).
1 December 2022: Version 0.4 - Added 4298 basins in the US, Canada and Mexico (part of HYSETS), now totalling to 6830 basins. Fixed a bug in the computation of catchment attributes that are defined as pour point properties, where sometimes the wrong HydroATLAS polygon was picked. Restructured the attribute files and added some more meta data (station name and country).
16 January 2023: Version 1.0 - Version of the official paper release. No changes in the data but added a static copy of the accompanying code of the paper. For the most up to date version, please check https://github.com/kratzert/Caravan
10 May 2023: Version 1.1 - No data change, only an updated data description.
17 May 2023: Version 1.2 - Updated a handful of attribute values that were affected by a bug in their derivation. See https://github.com/kratzert/Caravan/issues/22 for details.
16 April 2024: Version 1.4 - Added 9130 gauges from the original source dataset that were initially not included because of the area thresholds (i.e. basins smaller than 100 sq km or larger than 2000 sq km). Also extended the forcing period for all gauges (including the original ones) to 1950-2023. Added two different download options that include timeseries data only, as either CSV files (Caravan-csv.tar.xz) or NetCDF files (Caravan-nc.tar.xz). Including the large basins also required an update to the Earth Engine code.
16 Jan 2025: Version 1.5 - Added FAO Penman-Monteith PET (potential_evaporation_sum_FAO_PENMAN_MONTEITH) and renamed the ERA5-LAND potential_evaporation band to potential_evaporation_sum_ERA5_LAND. Also added all PET-related climate indices derived with the Penman-Monteith PET band (suffix "_FAO_PM") and renamed the old PET-related indices accordingly (suffix "_ERA5_LAND").
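The 15 June 2022 entry above notes that -999 in CAMELS indicates a missing observation. A minimal sketch of mapping such a sentinel to NaN before computing statistics follows; the function name and the sample values are hypothetical, and this is not the code used to build Caravan.

```python
import math

# Sentinel used in CAMELS for a missing observation, per the change log.
MISSING_SENTINEL = -999.0


def sentinel_to_nan(values, sentinel=MISSING_SENTINEL):
    """Return a new list where every occurrence of `sentinel` becomes NaN.

    Only exact sentinel matches are replaced; other values, including other
    negative numbers, are left unchanged.
    """
    return [math.nan if v == sentinel else v for v in values]


# Hypothetical values for illustration only.
cleaned = sentinel_to_nan([1.5, -999.0, 0.0])
```

Leaving the sentinel in place would distort any mean or sum over the series, which is why the replacement has to happen before aggregation.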
This database table consists of a preliminary source list for the Einstein Observatory's High Resolution Imager (HRI). The source list, obtained from EINLINE, the Einstein On-line Service at the Smithsonian Astrophysical Observatory (SAO), contains basic information about the sources detected with the HRI. This is a service provided by NASA HEASARC.
This is a collection of layers created by Tian Xie (Intern in DDP) in August 2018. This collection includes Detroit Parcel Data (Parcel_collector), InfoUSA business data (BIZ_INFOUSA), and building data (Building). The building and business data have been edited by Tian during field research and have attached images.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
SEPAL (https://sepal.io/) is a free and open source cloud computing platform for geo-spatial data access and processing. It empowers users to quickly process large amounts of data on their computer or mobile device. Users can create custom analysis ready data using freely available satellite imagery, generate and improve land use maps, analyze time series, run change detection and perform accuracy assessment and area estimation, among many other functionalities in the platform. Data can be created and analyzed for any place on Earth using SEPAL.
Figure 1: Best pixel mosaic of Landsat 8 data for 2020 over Cambodia (https://data.apps.fao.org/catalog/dataset/9c4d7c45-7620-44c4-b653-fbe13eb34b65/resource/63a3efa0-08ab-4ad6-9d4a-96af7b6a99ec/download/cambodia_mosaic_2020.png)
SEPAL reaches over 5000 users in 180 countries for the creation of custom data products from freely available satellite data. SEPAL was developed as a part of the Open Foris suite, a set of free and open source software platforms and tools that facilitate flexible and efficient data collection, analysis and reporting. SEPAL combines and integrates modern geospatial data infrastructures and supercomputing power available through Google Earth Engine and Amazon Web Services with powerful open-source data processing software, such as R, ORFEO, GDAL, Python and Jupyter Notebooks. Users can easily access the archive of satellite imagery from NASA and the European Space Agency (ESA), as well as high spatial and temporal resolution data from Planet Labs, and turn such images into data that can be used for reporting and better decision making.
National Forest Monitoring Systems in many countries have been strengthened by SEPAL, which provides technical government staff with computing resources and cutting-edge technology to accurately map and monitor their forests. The platform was originally developed for monitoring forest carbon stock and stock changes for reducing emissions from deforestation and forest degradation (REDD+). The application of the tools on the platform now reaches far beyond forest monitoring by providing different stakeholders access to cloud-based image processing tools, remote sensing and machine learning for any application. Presently, users work on SEPAL for various applications related to land monitoring, land cover/use, land productivity, ecological zoning, ecosystem restoration monitoring, forest monitoring, near real time alerts for forest disturbances and fire, flood mapping, mapping impact of disasters, peatland rewetting status, and many others.
The Hand-in-Hand initiative enables countries that generate data through SEPAL to disseminate their data widely through the platform and to combine their data with the numerous other datasets available through Hand-in-Hand.
Figure 2: Image classification module for land monitoring and mapping. Probability classification over Zambia (https://data.apps.fao.org/catalog/dataset/9c4d7c45-7620-44c4-b653-fbe13eb34b65/resource/868e59da-47b9-4736-93a9-f8d83f5731aa/download/probability_classification_over_zambia.png)
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for TinyDialogues
TinyDialogues dataset collected as part of the EMNLP 2024 paper "Is Child-Directed Speech Effective Training Data for Language Models?" by Steven Y. Feng, Noah D. Goodman, and Michael C. Frank. For more details, please see Appendices A-C in our paper.
Dataset Sources
Repository: https://github.com/styfeng/TinyDialogues Paper: https://aclanthology.org/2024.emnlp-main.1231/
Dataset Structure
Final training and validation data… See the full description on the dataset page: https://huggingface.co/datasets/styfeng/TinyDialogues.
Under the Freedom of Information Act 2000, I was wondering if you would be able to develop on top of the FOI Request FOI 24442 and FOI 27689. https://opendata.nhsbsa.net/dataset/foi-24442 https://opendata.nhsbsa.net/dataset/foi-27689 The data in this request relates to April 2020 to March 2022 and April 2022 to June 2022 from the data source ‘NHSBSA Information Services Data Warehouse’ with the Columns YEAR_MONTH, PRACTICE_CODE, DISPENSER_CODE, BNF_CODE, PRODUCT_ORDER_NUMBER, PACK_ORDER_NUMBER and NIC_GBP. Would it be possible to have the data in the same format from July 2022 to December 2022 or from July 2022 to the latest possible month please?
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘2019 NYC Open Data Plan: Removed Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/80355d19-52a3-435d-bc73-2dfb2770c3c4 on 13 November 2021.
--- Dataset description provided by original source is as follows ---
Datasets removed from the Open Data Plan, and an explanation why they were removed.
--- Original source retains full ownership of the source dataset ---
This is the authoritative public subset of the compiled Minnesota statewide parcel dataset. By authoritative, we mean this is the official source of statewide parcel data compiled from the counties that have opted in to be included. Counties are the authoritative source and owner of parcel data. Quarterly, MnGeo compiles and standardizes the county data using the Minnesota Geospatial Advisory Council's parcel data standard. In the compilation process, some data content is standardized or otherwise modified (capitalization and address parsing are the most common changes). The full opt-in compiled parcel metadata record can be found on the Minnesota Geospatial Commons.
To obtain the most current and authoritative data in its original form, users are referred back to the respective county. Links to each county's downloadable and/or web-viewable data, where known, are available in the accompanying spatial metadata dataset.
Known limitations:
Data provided by counties are often limited to a subset of fields and may not be the same fields across all counties. The fields provided by a given county may change by quarter.
The USECLASS and XUSECLASS fields, while often consistent within a county, are not standardized between counties.
The OWN_ADDR_# and TAX_ADDR_# fields are often populated in ways not consistent with the standard. In particular, an address number/street address may not be in Line 1, and city/state/zip cannot be relied on to be in Line 3. Even within a single county, the city/state/zip line may not be in a consistent field.
Parcels with addresses on fractional streets (5-1/2th Ave) cause issues for our address parser when parsing is needed for aggregation and may be missing some or all of the address data. Certain other oddly named streets can also cause this behavior.
A maximum record count has been set on the mapping service. This limits the number of features that can be returned in a single request. It is set to balance usability and response time.
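Since the city/state/zip line cannot be relied on to be in Line 3, a consumer of the OWN_ADDR_# or TAX_ADDR_# fields may prefer to scan every address line instead of assuming a position. The following is a minimal sketch under stated assumptions: the regular expression is a rough heuristic for US-style "CITY ST 12345" text, and the function name and sample values are hypothetical rather than part of the parcel data standard.

```python
import re

# Heuristic pattern: city words, optional comma, two-letter state, ZIP or ZIP+4.
CITY_STATE_ZIP = re.compile(
    r"^\s*(?P<city>[A-Za-z .'-]+?),?\s+"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def find_city_state_zip(lines):
    """Return (index, parts) for the first line that looks like city/state/zip.

    `lines` holds the address line values in field order; None or empty
    entries are skipped. Returns None when no line matches.
    """
    for index, line in enumerate(lines):
        if not line:
            continue
        match = CITY_STATE_ZIP.match(line)
        if match:
            return index, match.groupdict()
    return None


# Hypothetical address lines for illustration only.
result = find_city_state_zip(["123 MAIN ST", "SAINT PAUL MN 55101", ""])
```

A heuristic like this will still miss unusual formats, so unmatched records are better flagged for review than silently dropped.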
More details about each file are in the individual file descriptions.
This is a dataset hosted by the city of Oakland in California. The organization has an open data platform found here and they update their information according to the amount of data that is brought in. Explore Oakland's Data using Kaggle and all of the data sources available through the city of Oakland organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
Cover photo by Sarah Brink on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
This dataset is distributed under the following licenses: NA
The dataset contains the 2021 engineering infrastructure solutions from the Vilnius city general plan – zone for new centralized heat production sources.
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The Traits of Plants in Canada (TOPIC) open access empirical measurements module provides access to an open dataset of direct observations collected in the field, laboratory, greenhouse or garden. The Traits of Plants in Canada (TOPIC) database serves as a hub for centralizing knowledge on plant functional traits in Canada. Under the Canadian Trait Network, this database allows the integration of trait data from large, disconnected scientific sources to facilitate research on plant and forest ecology, community ecology and forest sustainability. Following international standards, the database ensures that the datasets are properly documented and archived, facilitating their re-use and discoverability. **Please cite TOPIC open as follows: ** Aubin, I., Boisvert-Marsh, L., Munson, A.D. 2021. Traits of plants in Canada (TOPIC) Open access - Traits des plantes au Canada (TOPIC) ouvert. doi: https://doi.org/10.23687/bb14c6bf-75f7-4ff2-b97e-689fa768905c
**And TOPIC as follows: ** Aubin, I, Cardou, F., Boisvert‐Marsh, L., Garnier, E., Strukelj, M, Munson, A.D. 2020. Managing data locally to answer questions globally: The role of collaborative science in ecology. Journal of Vegetation Science. 31: 509–517.
This asset is a derived view based on the system dataset 'Site Analytics: Asset Inventory' which is automatically generated by the data management platform and provides a comprehensive inventory of all assets on this site. This asset has been filtered to present an overview of the various types of data that are classified as public and have been published on the City of Austin Open Data Portal (data.austintexas.gov) by departmental data owners.
The columns of the Asset Inventory dataset contain information about every asset. These include metadata fields (e.g., Name, Description, and Category), as well as statistics, such as the number of visits, row count, column count, and downloads. This asset is updated at least once per day to sync any changes, additional assets, or removed assets.
Data provided by: Tyler Technologies Creation date of data source: November 1, 2022
*City of Austin Open Data Terms of Use – https://data.austintexas.gov/stories/s/ranj-cccq