Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
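As a toy illustration of the checksum-based referencing described above (a sketch only — the column names here are assumptions, not the dataset's actual CSV schema), a blob's checksum can be recomputed to locate its metadata row:

```python
import hashlib

def blob_checksum(data: bytes) -> str:
    """Hex digest used as a deduplicated blob's identifier
    (SHA1 here; the dataset may provide other hash functions too)."""
    return hashlib.sha1(data).hexdigest()

def find_metadata(rows, digest):
    """Return the metadata row whose checksum column matches `digest`, or None."""
    return next((r for r in rows if r["sha1"] == digest), None)

# Hypothetical metadata rows; the real CSV columns may differ.
blob = b"Permission is hereby granted, free of charge, ..."
rows = [{"sha1": blob_checksum(blob), "mime_type": "text/plain", "length": len(blob)}]

row = find_metadata(rows, blob_checksum(blob))
```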
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
This dataset is a compilation of address point data for the City of Tempe. The dataset contains a point location and the official address (as defined by the Building Safety Division of Community Development) for all occupiable units and any other official addresses in the City. There are several additional attributes that may be populated for an address, but they may not be populated for every address.
Contact: Lynn Flaaen-Hanna, Development Services Specialist
Contact E-mail Link: Map that Lets You Explore and Export Address Data
Data Source: The initial dataset was created by combining several datasets and then reviewing the information to remove duplicates and identify errors. This published dataset is the system of record for Tempe addresses going forward, with the address information being created and maintained by the Building Safety Division of Community Development.
Data Source Type: ESRI ArcGIS Enterprise Geodatabase
Preparation Method: N/A
Publish Frequency: Weekly
Publish Method: Automatic
Data Dictionary
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
A log of dataset alerts open, monitored or resolved on the open data portal. Alerts can include issues as well as deprecation or discontinuation notices.
A. SUMMARY This dataset is used to report on public dataset access and usage within the open data portal. Each row sums the number of users who access a dataset each day, grouped by access type (API Read, Download, Page View, etc.).
B. HOW THE DATASET IS CREATED This dataset is created by joining two internal analytics datasets generated by the SF Open Data Portal. We remove non-public information during the process.
C. UPDATE PROCESS This dataset is scheduled to update every 7 days via ETL.
D. HOW TO USE THIS DATASET This dataset can help you identify stale datasets, highlight the most popular datasets, and calculate other metrics around performance and usage of the open data portal.
Please note a special call-out for two fields:
- "derived": This field shows whether an asset is an original source (derived = "False") or is made from another asset through filtering (derived = "True").
- "provenance": This field shows whether an asset is "official" (created by someone in the City of San Francisco) or "community" (created by a member of the community, not official). All community assets are derived, as members of the community cannot add data to the open data portal.
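A minimal sketch of the usage metrics described above, assuming hypothetical column names ("dataset", "users") alongside the documented "derived" and "provenance" fields:

```python
def official_usage_totals(rows):
    """Sum daily user counts per dataset, keeping only official,
    non-derived (original-source) assets."""
    totals = {}
    for r in rows:
        if r["provenance"] != "official" or r["derived"] != "False":
            continue
        totals[r["dataset"]] = totals.get(r["dataset"], 0) + int(r["users"])
    return totals

# Toy rows mimicking the described export.
rows = [
    {"dataset": "permits", "provenance": "official", "derived": "False", "users": "3"},
    {"dataset": "permits", "provenance": "official", "derived": "False", "users": "2"},
    {"dataset": "permits-filtered", "provenance": "community", "derived": "True", "users": "9"},
]
```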
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Building a comprehensive data inventory is required by section 6.3 of the Directive on Open Government: "Establishing and maintaining comprehensive inventories of data and information resources of business value held by the department to determine their eligibility and priority, and to plan for their effective release." Creating a data inventory is among the first steps in identifying federal data that is eligible for release. Departmental data inventories have been published on the Open Government portal, Open.Canada.ca, so that Canadians can see what federal data is collected and have the opportunity to indicate what data is of most interest to them, helping departments to prioritize data releases based on both external demand and internal capacity. The objective of the inventory is to provide a landscape of all federal data. While it is recognized that not all data is eligible for release due to the nature of the content, departments are responsible for identifying and including all datasets of business value as part of the inventory exercise, with the exception of datasets whose title contains information that should not be released to the public due to security or privacy concerns. These titles have been excluded from the inventory. Departments were provided with an open data inventory template with standardized elements to populate and upload to the metadata catalogue, the Open Government Registry. These elements are described in the data dictionary file. Departments are responsible for maintaining up-to-date data inventories that reflect significant additions to their data holdings. For the purposes of this open data inventory exercise, a dataset is defined as: "An organized collection of data used to carry out the business of a department or agency, that can be understood alone or in conjunction with other datasets".
Please note that the Open Data Inventory is no longer being maintained by Government of Canada organizations and is therefore not being updated. However, we will continue to provide access to the dataset for review and analysis.
The Microsoft PowerPivot add-on for Excel can be used to handle larger data sets. The add-on is available using the link in the 'Related Links' section - https://www.microsoft.com/en-us/download/details.aspx?id=43348
Once PowerPivot has been installed, to load the large files, please follow the instructions below:
1. Start Excel as normal
2. Click on the PowerPivot tab
3. Click on the PowerPivot Window icon (top left)
4. In the PowerPivot Window, click on the "From Other Sources" icon
5. In the Table Import Wizard, scroll to the bottom and select Text File
6. Browse to the file you want to open and choose the file extension you require, e.g. CSV
Please read the notes below to ensure correct understanding of the data.
Fewer than 5 Items: Please be aware that I have decided not to release the exact number of items where the total number of items falls below 5, for certain drug/patient combinations. Where suppression has been applied, a * is shown in place of the number of items; please read this as 1-4 items.
Suppressions have been applied where items are lower than 5 (for items and NIC, and for quantity when quantity and items are both lower than 5) for the following drugs and identified genders, as per the sensitive drug list:
- BNF Paragraph Code 60401 (Female Sex Hormones & Their Modulators), where the gender identified on the prescription is Male
- BNF Paragraph Code 60402 (Male Sex Hormones And Antagonists), where the gender identified on the prescription is Female
- BNF Paragraph Code 70201 (Preparations For Vaginal/Vulval Changes), where the gender identified on the prescription is Male
- BNF Paragraph Code 70202 (Vaginal And Vulval Infections), where the gender identified on the prescription is Male
- BNF Paragraph Code 70301 (Combined Hormonal Contraceptives/Systems), where the gender identified on the prescription is Male
- BNF Paragraph Code 70302 (Progestogen-only Contraceptives), where the gender identified on the prescription is Male
- BNF Paragraph Code 80302 (Progestogens), where the gender identified on the prescription is Male
- BNF Paragraph Code 70405 (Drugs For Erectile Dysfunction), where the gender identified on the prescription is Female
- BNF Paragraph Code 70406 (Drugs For Premature Ejaculation), where the gender identified on the prescription is Female
This is because the patients could be identified when this data is combined with other information that may be in the public domain or reasonably available. This information falls under the exemption in section 40, subsections 2 and 3A(a), of the Freedom of Information Act. This is because it would breach the first data protection principle as: a. it is not fair to disclose patients' personal details to the world and is likely to cause damage or distress; b. these details are not of sufficient interest to the public to warrant an intrusion into the privacy of the patients. Please click the below web link to see the exemption in full.
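When analysing this data programmatically, suppressed cells need special handling: a `*` stands for an undisclosed count of 1-4 items. A small hypothetical helper (a sketch, assuming cells arrive as strings):

```python
def item_bounds(cell):
    """Return (lower, upper) bounds for an items cell:
    '*' means a suppressed count of 1-4; otherwise an exact value."""
    if cell.strip() == "*":
        return (1, 4)
    n = int(cell)
    return (n, n)

lo, hi = item_bounds("*")   # suppressed cell: somewhere between 1 and 4 items
```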
Quarterly updates of the number of maps, charts and datasets made available to the public.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This guide will introduce the open data resources available on the CA Nature website and familiarize you with key features and capabilities of the site. CA Nature is an online Geographic Information System (GIS) that collects a suite of publicly accessible interactive digital mapping tools and data.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The damien-johnston/open-data-project dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Traditionally, academic libraries have provided access to predominantly text-based materials. This project sought to identify the preferences of the ‘mobile’ SCU community in relation to accessing quality academic literature (in particular, journal articles). Varying learning styles provided the impetus for exploring audio-based alternatives to the academic literature.
41 participants, SCU students aged between 18 and 60+; a 14-question survey.
Data Processing: Excel and Word
Pt204_C_n_461 Pt204_C_n_472 Pt204_C_n_474 Pt204_C_n_490 Pt204_C_n_518 Pt204_C_n_519 Pt204_C_n_529 Pt204_C_n_535 Pt204_C_n_546 Pt204_C_n_547 Pt204_C_n_557 Pt227_C_n_2175 Pt227_C_n_2178 Pt227_C_n_2194 Pt227_C_n_2200 Pt227_C_n_2208 Pt227_C_n_2209 Pt227_C_n_2213 Pt227_C_n_2215 Pt227_C_n_2217 Pt227_C_n_2218 Pt227_C_n_2219 Pt227_C_n_2239 Pt227_C_n_2262 Pt227_C_n_2326 Pt227_C_n_2333 Pt227_C_n_2338 Pt227_C_n_2339 input_2D input_3D Pt227_C_n_2340 Pt227_C_n_2364 Pt227_C_n_2369 Pt227_C_n_2370 Pt227_C_n_2372 Pt227_C_n_2373 Pt227_C_n_2374 Pt227_C_n_2375 Pt227_C_n_2376 Pt227_C_n_2377 Pt227_C_n_2378 Pt227_C_n_2408 Pt227_C_n_2409 Pt227_C_n_2410 Pt227_C_n_2411 input_2D input_3D Pt227_C_n_2413 Pt227_C_n_2414 Pt227_C_n_2459 Pt227_C_n_2473 Pt227_C_n_2479 Pt227_C_n_2480 Pt230_C_n_0 Pt230_C_n_10 Pt230_C_n_101 Pt230_C_n_11 Pt230_C_n_123 Pt230_C_n_145 Pt230_C_n_181 Pt230_C_n_19 Pt230_C_n_20 Pt230_C_n_220 Pt230_C_n_25 Pt230_C_n_252 Pt230_C_n_255 Pt230_C_n_258 Pt230_C_n_259 input_2D input_3D Pt230_C_n_280 Pt230_C_n_281 Pt230_C_n_284 Pt230_C_n_290 Pt230_C_n_293 Pt230_C_n_300 Pt230_C_n_301 Pt230_C_n_306 Pt230_C_n_307 Pt230_C_n_308 Pt230_C_n_37 Pt230_C_n_43 Pt230_C_n_78 Pt230_C_n_80 Pt253_PD_n_3447 Pt253_PD_n_3450 Pt253_PD_n_3452 Pt253_PD_n_3482
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data was collected by the Geological Survey Ireland, the Department of Culture, Heritage and the Gaeltacht, the Discovery Programme, the Heritage Council, Transport Infrastructure Ireland, New York University, the Office of Public Works and Westmeath County Council. All data are provided as GeoTIFF rasters but are at different resolutions, depending on survey requirements. Resolutions for each organisation are as follows:
GSI – 1m
DCHG/DP/HC – 0.13m, 0.14m, 1m
NY – 1m
TII – 2m
OPW – 2m
WMCC – 0.25m
Both the DTM and DSM are raster data. Raster data is another name for gridded data: it stores information in pixels (grid cells), with each raster forming a matrix of cells (or pixels) organised into rows and columns. The grid cell size varies depending on the organisation that collected the data. GSI data has a grid cell size of 1 meter by 1 meter, meaning that each cell (pixel) represents an area of 1 meter squared.
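The relationship between grid cell size and ground coverage described above can be checked in a couple of lines (a sketch only, not tied to any particular GeoTIFF library):

```python
def cell_area_m2(resolution_m):
    """Ground area represented by one square raster cell."""
    return resolution_m * resolution_m

def raster_coverage_m2(n_rows, n_cols, resolution_m):
    """Total ground area represented by an n_rows x n_cols raster."""
    return n_rows * n_cols * cell_area_m2(resolution_m)

# A 1 m GSI cell covers 1 m^2; a 2 m TII/OPW cell covers 4 m^2.
```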
Experiments on a milling machine for different speeds, feeds, and depths of cut. Records the wear of the milling insert, VB. The data set was provided by the UC Berkeley Emergent Space Tensegrities (BEST) Lab.
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The Greenhouse Gas Reporting Program (GHGRP) collects information on greenhouse gas (GHG) emissions annually from facilities across Canada. It is a mandatory program for those who meet the requirements. Facilities that emit 10 kilotonnes or more of GHGs, in carbon dioxide (CO2) equivalent (eq.) units, per year must report their emissions to Environment and Climate Change Canada. The emissions data is available in two files, each presenting emissions by different breakdowns and offered in two convenient formats for downloads: .xlsx and .csv. The Emissions by Gas file, covering data from 2004 to present, contains emissions (in tonnes and tonnes of CO2 eq.) for each facility categorized by gas type, including carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), hydrofluorocarbons (HFC), perfluorocarbons (PFC), and sulphur hexafluoride (SF6). The Emissions by Source file, starting from 2022, includes emissions data (in tonnes and tonnes of CO2 eq.) broken down by source category, encompassing Stationary Fuel Combustion, Industrial Process, On-site Transportation, Waste, Wastewater, Venting, Flaring, and Leakage. For additional information and usage guidelines, please refer to the accompanying "Lisez Moi - Read Me" file. Additionally, our data search tool can assist you in efficiently navigating and extracting specific information from the GHGRP's data. 
Supplemental Information
Learn more about the GHGRP: https://www.canada.ca/en/environment-climate-change/services/climate-change/greenhouse-gas-emissions/facility-reporting.html
Overview of Reported Emissions - an annual summary report of the facility-reported emissions and trends: https://www.canada.ca/en/environment-climate-change/services/climate-change/greenhouse-gas-emissions/facility-reporting/data.html
Canada's Greenhouse Gas Emissions: https://www.canada.ca/en/environment-climate-change/services/climate-change/greenhouse-gas-emissions.html
Contact us: https://www.canada.ca/en/environment-climate-change/services/climate-change/greenhouse-gas-emissions/contact-team.html
NOTE: To review the latest plan, make sure to filter the "Report Year" column to the latest year.
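The year filtering suggested in the note can be sketched in a few lines (the "Report Year" field name follows the note above; the other columns here are hypothetical):

```python
def latest_year_rows(rows, year_field="Report Year"):
    """Keep only the rows from the most recent reporting year."""
    latest = max(int(r[year_field]) for r in rows)
    return [r for r in rows if int(r[year_field]) == latest]

# Toy rows standing in for the downloaded CSV.
rows = [
    {"Report Year": "2022", "Facility": "A", "CO2eq_tonnes": "12000"},
    {"Report Year": "2023", "Facility": "A", "CO2eq_tonnes": "11000"},
    {"Report Year": "2023", "Facility": "B", "CO2eq_tonnes": "15000"},
]
latest = latest_year_rows(rows)
```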
Data on public websites maintained by or on behalf of the city agencies.
This dataset arises from the READ project (Horizon 2020).
The dataset consists of a subset of documents from the Ratsprotokolle collection, composed of minutes of the council meetings held from 1470 to 1805 (about 30,000 pages), which will be used in the READ project. This dataset is written in Early Modern German. The number of writers is unknown. Handwriting in this collection is complex enough to challenge HTR software.
The training dataset is composed of 400 pages; most pages consist of a single text block that poses many difficulties for line detection and extraction. The ground truth for this set is provided in PAGE format, annotated at line level.
The previous dataset is the same as the one located at https://zenodo.org/record/218236#.WnLhaCHhBGF
The new file includes the test set corresponding to the HTR competition held at ICFHR 2016.
Toselli, A.H., Romero, V., Villegas, M., Vidal, E., & Sánchez, J.A. (2018). HTR Dataset ICFHR 2016 (Version 1.2.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1297399
Subscribers can find export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Shiga toxin-producing Escherichia coli (STEC) and Listeria monocytogenes are responsible for severe foodborne illnesses in the United States. Current identification methods require at least four days to identify STEC and six days for L. monocytogenes. Adoption of long-read, whole genome sequencing for testing could significantly reduce the time needed for identification, but method development costs are high. Therefore, the goal of this project was to use NanoSim-H software to simulate Oxford Nanopore sequencing reads to assess the feasibility of sequencing-based foodborne pathogen detection and guide experimental design. Sequencing reads were simulated for STEC, L. monocytogenes, and a 1:1 combination of STEC and Bos taurus genomes using NanoSim-H. This dataset includes all of the simulated reads generated by the project in fasta format. This dataset can be analyzed bioinformatically or used to test bioinformatic pipelines.
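The simulated reads are plain FASTA, so they can be loaded without special tooling before being fed to a bioinformatic pipeline; a minimal stdlib parser sketch (the read headers shown are hypothetical, not actual NanoSim-H output):

```python
def read_fasta(lines):
    """Parse FASTA text into (header, sequence) pairs,
    joining sequences that span multiple lines."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Toy simulated reads.
fasta = [">read_1", "ACGT", "TTGA", ">read_2", "GGCC"]
```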
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Following the April 7, 2014 Executive Order from Mayor Walsh, an Open and Protected Data Policy was drafted to guide the City in defining, protecting, and ultimately making Open Data available and useful to the public. The policy provides working definitions for Open Data, along with information on how it is to be published, reviewed, and licensed.
CC0 1.0https://spdx.org/licenses/CC0-1.0.html
Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.