Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
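As a toy illustration of the checksum-based referencing described above (a sketch only — the column names here are assumptions, not the dataset's actual CSV schema), a blob's checksum can be recomputed to locate its metadata row:

```python
import hashlib

def blob_checksum(data: bytes) -> str:
    """Hex digest used as a deduplicated blob's identifier
    (SHA1 here; the dataset may provide other hash functions too)."""
    return hashlib.sha1(data).hexdigest()

def find_metadata(rows, digest):
    """Return the metadata row whose checksum column matches `digest`, or None."""
    return next((r for r in rows if r["sha1"] == digest), None)

# Hypothetical metadata rows; the real CSV columns may differ.
blob = b"Permission is hereby granted, free of charge, ..."
rows = [{"sha1": blob_checksum(blob), "mime_type": "text/plain", "length": len(blob)}]

row = find_metadata(rows, blob_checksum(blob))
```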
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
This dataset is a compilation of address point data for the City of Tempe. The dataset contains a point location and the official address (as defined by the Building Safety Division of Community Development) for all occupiable units and any other official addresses in the City. There are several additional attributes that may be populated for an address, but they may not be populated for every address.
Contact: Lynn Flaaen-Hanna, Development Services Specialist
Contact E-mail Link: Map that Lets You Explore and Export Address Data
Data Source: The initial dataset was created by combining several datasets and then reviewing the information to remove duplicates and identify errors. This published dataset is the system of record for Tempe addresses going forward, with the address information being created and maintained by the Building Safety Division of Community Development.
Data Source Type: ESRI ArcGIS Enterprise Geodatabase
Preparation Method: N/A
Publish Frequency: Weekly
Publish Method: Automatic
Data Dictionary
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
A log of dataset alerts open, monitored or resolved on the open data portal. Alerts can include issues as well as deprecation or discontinuation notices.
A. SUMMARY This dataset is used to report on public dataset access and usage within the open data portal. Each row sums the number of users who access a dataset each day, grouped by access type (API Read, Download, Page View, etc.).
B. HOW THE DATASET IS CREATED This dataset is created by joining two internal analytics datasets generated by the SF Open Data Portal. We remove non-public information during the process.
C. UPDATE PROCESS This dataset is scheduled to update every 7 days via ETL.
D. HOW TO USE THIS DATASET This dataset can help you identify stale datasets, highlight the most popular datasets, and calculate other metrics around performance and usage of the open data portal.
Please note a special call-out for two fields:
- "derived": This field shows whether an asset is an original source (derived = "False") or is made from another asset through filtering (derived = "True").
- "provenance": This field shows whether an asset is "official" (created by someone in the City of San Francisco) or "community" (created by a member of the community, not official). All community assets are derived, as members of the community cannot add data to the open data portal.
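A minimal sketch of the usage metrics described above, assuming hypothetical column names ("dataset", "users") alongside the documented "derived" and "provenance" fields:

```python
def official_usage_totals(rows):
    """Sum daily user counts per dataset, keeping only official,
    non-derived (original-source) assets."""
    totals = {}
    for r in rows:
        if r["provenance"] != "official" or r["derived"] != "False":
            continue
        totals[r["dataset"]] = totals.get(r["dataset"], 0) + int(r["users"])
    return totals

# Toy rows mimicking the described export.
rows = [
    {"dataset": "permits", "provenance": "official", "derived": "False", "users": "3"},
    {"dataset": "permits", "provenance": "official", "derived": "False", "users": "2"},
    {"dataset": "permits-filtered", "provenance": "community", "derived": "True", "users": "9"},
]
```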
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Building a comprehensive data inventory is required by section 6.3 of the Directive on Open Government: "Establishing and maintaining comprehensive inventories of data and information resources of business value held by the department to determine their eligibility and priority, and to plan for their effective release." Creating a data inventory is among the first steps in identifying federal data that is eligible for release. Departmental data inventories have been published on the Open Government portal, Open.Canada.ca, so that Canadians can see what federal data is collected and have the opportunity to indicate what data is of most interest to them, helping departments to prioritize data releases based on both external demand and internal capacity. The objective of the inventory is to provide a landscape of all federal data. While it is recognized that not all data is eligible for release due to the nature of the content, departments are responsible for identifying and including all datasets of business value as part of the inventory exercise, with the exception of datasets whose title contains information that should not be released to the public due to security or privacy concerns. These titles have been excluded from the inventory. Departments were provided with an open data inventory template with standardized elements to populate and upload to the metadata catalogue, the Open Government Registry. These elements are described in the data dictionary file. Departments are responsible for maintaining up-to-date data inventories that reflect significant additions to their data holdings. For the purposes of this open data inventory exercise, a dataset is defined as: "An organized collection of data used to carry out the business of a department or agency, that can be understood alone or in conjunction with other datasets".
Please note that the Open Data Inventory is no longer being maintained by Government of Canada organizations and is therefore not being updated. However, we will continue to provide access to the dataset for review and analysis.
The Microsoft PowerPivot add-on for Excel can be used to handle larger data sets. The add-on is available using the link in the 'Related Links' section - https://www.microsoft.com/en-us/download/details.aspx?id=43348
Once PowerPivot has been installed, to load the large files, please follow the instructions below:
1. Start Excel as normal
2. Click on the PowerPivot tab
3. Click on the PowerPivot Window icon (top left)
4. In the PowerPivot Window, click on the "From Other Sources" icon
5. In the Table Import Wizard, scroll to the bottom and select Text File
6. Browse to the file you want to open and choose the file extension you require, e.g. CSV
Please read the notes below to ensure correct understanding of the data.
Fewer than 5 Items: Please be aware that I have decided not to release the exact number of items where the total number of items falls below 5, for certain drug/patient combinations. Where suppression has been applied, a * is shown in place of the number of items; please read this as 1-4 items.
Suppressions have been applied where items are lower than 5 (for items and NIC, and for quantity when quantity and items are both lower than 5) for the following drugs and identified genders, as per the sensitive drug list:
- BNF Paragraph Code 60401 (Female Sex Hormones & Their Modulators), where the gender identified on the prescription is Male
- BNF Paragraph Code 60402 (Male Sex Hormones And Antagonists), where the gender identified on the prescription is Female
- BNF Paragraph Code 70201 (Preparations For Vaginal/Vulval Changes), where the gender identified on the prescription is Male
- BNF Paragraph Code 70202 (Vaginal And Vulval Infections), where the gender identified on the prescription is Male
- BNF Paragraph Code 70301 (Combined Hormonal Contraceptives/Systems), where the gender identified on the prescription is Male
- BNF Paragraph Code 70302 (Progestogen-only Contraceptives), where the gender identified on the prescription is Male
- BNF Paragraph Code 80302 (Progestogens), where the gender identified on the prescription is Male
- BNF Paragraph Code 70405 (Drugs For Erectile Dysfunction), where the gender identified on the prescription is Female
- BNF Paragraph Code 70406 (Drugs For Premature Ejaculation), where the gender identified on the prescription is Female
This is because the patients could be identified when this data is combined with other information that may be in the public domain or reasonably available. This information falls under the exemption in section 40, subsections 2 and 3A(a), of the Freedom of Information Act. This is because it would breach the first data protection principle as: a. it is not fair to disclose patients' personal details to the world and is likely to cause damage or distress; b. these details are not of sufficient interest to the public to warrant an intrusion into the privacy of the patients. Please click the below web link to see the exemption in full.
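When analysing this data programmatically, suppressed cells need special handling: a `*` stands for an undisclosed count of 1-4 items. A small hypothetical helper (a sketch, assuming cells arrive as strings):

```python
def item_bounds(cell):
    """Return (lower, upper) bounds for an items cell:
    '*' means a suppressed count of 1-4; otherwise an exact value."""
    if cell.strip() == "*":
        return (1, 4)
    n = int(cell)
    return (n, n)

lo, hi = item_bounds("*")   # suppressed cell: somewhere between 1 and 4 items
```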
Quarterly updates of the number of maps, charts and datasets made available to the public.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This guide will introduce the open data resources available on the CA Nature website and familiarize you with key features and capabilities of the site. CA Nature is an online Geographic Information System (GIS) that collects a suite of publicly accessible interactive digital mapping tools and data.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The damien-johnston/open-data-project dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Traditionally, academic libraries have provided access to predominantly text-based materials. This project sought to identify the preferences of the ‘mobile’ SCU community in relation to accessing quality academic literature (in particular, journal articles). Varying learning styles provided the impetus for exploring audio-based alternatives to the academic literature.
41 participants, SCU students aged between 18 and 60+; a 14-question survey.
Data Processing: Excel and Word
Pt204_C_n_461 Pt204_C_n_472 Pt204_C_n_474 Pt204_C_n_490 Pt204_C_n_518 Pt204_C_n_519 Pt204_C_n_529 Pt204_C_n_535 Pt204_C_n_546 Pt204_C_n_547 Pt204_C_n_557 Pt227_C_n_2175 Pt227_C_n_2178 Pt227_C_n_2194 Pt227_C_n_2200 Pt227_C_n_2208 Pt227_C_n_2209 Pt227_C_n_2213 Pt227_C_n_2215 Pt227_C_n_2217 Pt227_C_n_2218 Pt227_C_n_2219 Pt227_C_n_2239 Pt227_C_n_2262 Pt227_C_n_2326 Pt227_C_n_2333 Pt227_C_n_2338 Pt227_C_n_2339 input_2D input_3D Pt227_C_n_2340 Pt227_C_n_2364 Pt227_C_n_2369 Pt227_C_n_2370 Pt227_C_n_2372 Pt227_C_n_2373 Pt227_C_n_2374 Pt227_C_n_2375 Pt227_C_n_2376 Pt227_C_n_2377 Pt227_C_n_2378 Pt227_C_n_2408 Pt227_C_n_2409 Pt227_C_n_2410 Pt227_C_n_2411 input_2D input_3D Pt227_C_n_2413 Pt227_C_n_2414 Pt227_C_n_2459 Pt227_C_n_2473 Pt227_C_n_2479 Pt227_C_n_2480 Pt230_C_n_0 Pt230_C_n_10 Pt230_C_n_101 Pt230_C_n_11 Pt230_C_n_123 Pt230_C_n_145 Pt230_C_n_181 Pt230_C_n_19 Pt230_C_n_20 Pt230_C_n_220 Pt230_C_n_25 Pt230_C_n_252 Pt230_C_n_255 Pt230_C_n_258 Pt230_C_n_259 input_2D input_3D Pt230_C_n_280 Pt230_C_n_281 Pt230_C_n_284 Pt230_C_n_290 Pt230_C_n_293 Pt230_C_n_300 Pt230_C_n_301 Pt230_C_n_306 Pt230_C_n_307 Pt230_C_n_308 Pt230_C_n_37 Pt230_C_n_43 Pt230_C_n_78 Pt230_C_n_80 Pt253_PD_n_3447 Pt253_PD_n_3450 Pt253_PD_n_3452 Pt253_PD_n_3482
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data was collected by the Geological Survey Ireland, the Department of Culture, Heritage and the Gaeltacht, the Discovery Programme, the Heritage Council, Transport Infrastructure Ireland, New York University, the Office of Public Works and Westmeath County Council. All data are provided as GeoTIFF rasters but are at different resolutions, depending on survey requirements. Resolutions for each organisation are as follows:
GSI – 1m
DCHG/DP/HC – 0.13m, 0.14m, 1m
NY – 1m
TII – 2m
OPW – 2m
WMCC – 0.25m
Both the DTM and DSM are raster data. Raster data is another name for gridded data: it stores information in pixels (grid cells), with each raster forming a matrix of cells (or pixels) organised into rows and columns. The grid cell size varies depending on the organisation that collected the data. GSI data has a grid cell size of 1 meter by 1 meter, meaning that each cell (pixel) represents an area of 1 meter squared.
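The relationship between grid cell size and ground coverage described above can be checked in a couple of lines (a sketch only, not tied to any particular GeoTIFF library):

```python
def cell_area_m2(resolution_m):
    """Ground area represented by one square raster cell."""
    return resolution_m * resolution_m

def raster_coverage_m2(n_rows, n_cols, resolution_m):
    """Total ground area represented by an n_rows x n_cols raster."""
    return n_rows * n_cols * cell_area_m2(resolution_m)

# A 1 m GSI cell covers 1 m^2; a 2 m TII/OPW cell covers 4 m^2.
```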
Experiments on a milling machine for different speeds, feeds, and depths of cut. Records the wear of the milling insert, VB. The data set was provided by the UC Berkeley Emergent Space Tensegrities (BEST) Lab.
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The Greenhouse Gas Reporting Program (GHGRP) collects information on greenhouse gas (GHG) emissions annually from facilities across Canada. It is a mandatory program for those who meet the requirements. Facilities that emit 10 kilotonnes or more of GHGs, in carbon dioxide (CO2) equivalent (eq.) units, per year must report their emissions to Environment and Climate Change Canada. The emissions data is available in two files, each presenting emissions by different breakdowns and offered in two convenient formats for downloads: .xlsx and .csv. The Emissions by Gas file, covering data from 2004 to present, contains emissions (in tonnes and tonnes of CO2 eq.) for each facility categorized by gas type, including carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), hydrofluorocarbons (HFC), perfluorocarbons (PFC), and sulphur hexafluoride (SF6). The Emissions by Source file, starting from 2022, includes emissions data (in tonnes and tonnes of CO2 eq.) broken down by source category, encompassing Stationary Fuel Combustion, Industrial Process, On-site Transportation, Waste, Wastewater, Venting, Flaring, and Leakage. For additional information and usage guidelines, please refer to the accompanying "Lisez Moi - Read Me" file. Additionally, our data search tool can assist you in efficiently navigating and extracting specific information from the GHGRP's data. 
Supplemental Information
Learn more about the GHGRP: https://www.canada.ca/en/environment-climate-change/services/climate-change/greenhouse-gas-emissions/facility-reporting.html
Overview of Reported Emissions - an annual summary report of the facility-reported emissions and trends: https://www.canada.ca/en/environment-climate-change/services/climate-change/greenhouse-gas-emissions/facility-reporting/data.html
Canada's Greenhouse Gas Emissions: https://www.canada.ca/en/environment-climate-change/services/climate-change/greenhouse-gas-emissions.html
Contact us: https://www.canada.ca/en/environment-climate-change/services/climate-change/greenhouse-gas-emissions/contact-team.html
NOTE: To review the latest plan, make sure to filter the "Report Year" column to the latest year.
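The year filtering suggested in the note can be sketched in a few lines (the "Report Year" field name follows the note above; the other columns here are hypothetical):

```python
def latest_year_rows(rows, year_field="Report Year"):
    """Keep only the rows from the most recent reporting year."""
    latest = max(int(r[year_field]) for r in rows)
    return [r for r in rows if int(r[year_field]) == latest]

# Toy rows standing in for the downloaded CSV.
rows = [
    {"Report Year": "2022", "Facility": "A", "CO2eq_tonnes": "12000"},
    {"Report Year": "2023", "Facility": "A", "CO2eq_tonnes": "11000"},
    {"Report Year": "2023", "Facility": "B", "CO2eq_tonnes": "15000"},
]
latest = latest_year_rows(rows)
```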
Data on public websites maintained by or on behalf of the city agencies.
This dataset arises from the READ project (Horizon 2020).
The dataset consists of a subset of documents from the Ratsprotokolle collection, composed of minutes of the council meetings held from 1470 to 1805 (about 30,000 pages), which will be used in the READ project. This dataset is written in Early Modern German. The number of writers is unknown. Handwriting in this collection is complex enough to challenge HTR software.
The training dataset is composed of 400 pages; most pages consist of a single text block that poses many difficulties for line detection and extraction. The ground truth for this set is provided in PAGE format, annotated at line level.
The previous dataset is the same as the one located at https://zenodo.org/record/218236#.WnLhaCHhBGF
The new file includes the test set corresponding to the HTR competition held at ICFHR 2016.
Toselli, A.H., Romero, V., Villegas, M., Vidal, E., & Sánchez, J.A. (2018). HTR Dataset ICFHR 2016 (Version 1.2.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1297399
Subscribers can find export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Shiga toxin-producing Escherichia coli (STEC) and Listeria monocytogenes are responsible for severe foodborne illnesses in the United States. Current identification methods require at least four days to identify STEC and six days for L. monocytogenes. Adoption of long-read, whole genome sequencing for testing could significantly reduce the time needed for identification, but method development costs are high. Therefore, the goal of this project was to use NanoSim-H software to simulate Oxford Nanopore sequencing reads to assess the feasibility of sequencing-based foodborne pathogen detection and guide experimental design. Sequencing reads were simulated for STEC, L. monocytogenes, and a 1:1 combination of STEC and Bos taurus genomes using NanoSim-H. This dataset includes all of the simulated reads generated by the project in fasta format. This dataset can be analyzed bioinformatically or used to test bioinformatic pipelines.
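The simulated reads are plain FASTA, so they can be loaded without special tooling before being fed to a bioinformatic pipeline; a minimal stdlib parser sketch (the read headers shown are hypothetical, not actual NanoSim-H output):

```python
def read_fasta(lines):
    """Parse FASTA text into (header, sequence) pairs,
    joining sequences that span multiple lines."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Toy simulated reads.
fasta = [">read_1", "ACGT", "TTGA", ">read_2", "GGCC"]
```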
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Following the April 7, 2014 Executive Order from Mayor Walsh, an Open and Protected Data Policy was drafted to guide the City in defining, protecting, and ultimately making Open Data available and useful to the public. The policy provides working definitions for Open Data, along with information on how it is to be published, reviewed, and licensed.
CC0 1.0https://spdx.org/licenses/CC0-1.0.html
Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.