Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Micro-climate sensors collect telemetry at set intervals throughout the day. Sensors are located at various locations in the City of Canning, Western Australia, and each sensor has a unique ID. Contact us at opendata@canning.wa.gov.au for a larger data set (the data supplied here covers 30 days of sensor readings). The sensor locations are: 18zua9muwbb at Wharf Street Basin - Pavilion; 2hq3byfebne at the City’s Civic and Administration Building; uu90853psl at Wharf Street Basin - Leila Street entrance; xd2su7w05m at Wharf Street Basin - Nature Play Area.
https://research.csiro.au/dap/licences/csiro-data-licence/
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
Three datasets are available, each consisting of 15 csv files. Each file contains the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range 0.2-0.25, simulated in the ATLAS detector. Two datasets contain photon events with different statistics; the larger sample has about 10 times the number of events of the other. The third dataset contains pions. The pion dataset and the lower-statistics photon dataset were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.
The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.
The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.
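As an illustration of how the csv files can be inspected in Python (the file name below is a placeholder, not one of the actual file names in this record), a minimal sketch:

import pandas as pd

# rows correspond to events, columns to voxels, values are the voxel energies
voxels = pd.read_csv("photon_sample_E65536.csv")   # placeholder file name
n_events, n_voxels = voxels.shape
total_energy_per_event = voxels.sum(axis=1)
print(f"{n_events} events, {n_voxels} voxels")
print(total_energy_per_event.describe())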
Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.
Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A large data set of go-arounds, also referred to as missed approaches. The data set supports the paper presented at the OpenSky Symposium on November 10th.
If you use this data for a scientific publication, please consider citing our paper.
The data set contains landings from 176 (mostly) large airports from 44 different countries. The landings are labelled as performing a go-around (GA) or not. In total, the data set contains almost 9 million landings with more than 33000 GAs. The data was collected from OpenSky Network's historical data base for the year 2019. The published data set contains multiple files:
go_arounds_minimal.csv.gz
Compressed CSV containing the minimal data set. It contains a row for each landing and a minimal amount of information about the landing, and if it was a GA. The data is structured in the following way:
Column name | Type | Description |
---|---|---|
time | date time | UTC time of landing or first GA attempt |
icao24 | string | Unique 24-bit (hexadecimal number) ICAO identifier of the aircraft concerned |
callsign | string | Aircraft identifier in air-ground communications |
airport | string | ICAO airport code where the aircraft is landing |
runway | string | Runway designator on which the aircraft landed |
has_ga | string | "True" if at least one GA was performed, otherwise "False" |
n_approaches | integer | Number of approaches identified for this flight |
n_rwy_approached | integer | Number of unique runways approached by this flight |
The last two columns, n_approaches and n_rwy_approached, are useful for filtering out training and calibration flights. These usually have a large number of approaches, so an easy way to exclude them is to filter by n_approaches > 2, as in the sketch below.
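A minimal pandas sketch of that filter (has_ga may be parsed as a boolean or as the strings "True"/"False", depending on your pandas version):

import pandas as pd

df = pd.read_csv("go_arounds_minimal.csv.gz", low_memory=False)
df = df[df["n_approaches"] <= 2]      # drop likely training/calibration flights
ga_only = df[df["has_ga"] == True]    # use == "True" if the column is read as strings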
go_arounds_augmented.csv.gz
Compressed CSV containing the augmented data set. It contains a row for each landing and additional information about the landing, and if it was a GA. The data is structured in the following way:
Column name | Type | Description |
---|---|---|
time | date time | UTC time of landing or first GA attempt |
icao24 | string | Unique 24-bit (hexadecimal number) ICAO identifier of the aircraft concerned |
callsign | string | Aircraft identifier in air-ground communications |
airport | string | ICAO airport code where the aircraft is landing |
runway | string | Runway designator on which the aircraft landed |
has_ga | string | "True" if at least one GA was performed, otherwise "False" |
n_approaches | integer | Number of approaches identified for this flight |
n_rwy_approached | integer | Number of unique runways approached by this flight |
registration | string | Aircraft registration |
typecode | string | Aircraft ICAO typecode |
icaoaircrafttype | string | ICAO aircraft type |
wtc | string | ICAO wake turbulence category |
glide_slope_angle | float | Angle of the ILS glide slope in degrees |
has_intersection | string | Boolean that is true if the runway has another runway intersecting it, otherwise false |
rwy_length | float | Length of the runway in kilometres |
airport_country | string | ISO Alpha-3 country code of the airport |
airport_region | string | Geographical region of the airport (either Europe, North America, South America, Asia, Africa, or Oceania) |
operator_country | string | ISO Alpha-3 country code of the operator |
operator_region | string | Geographical region of the operator of the aircraft (either Europe, North America, South America, Asia, Africa, or Oceania) |
wind_speed_knts | integer | METAR, surface wind speed in knots |
wind_dir_deg | integer | METAR, surface wind direction in degrees |
wind_gust_knts | integer | METAR, surface wind gust speed in knots |
visibility_m | float | METAR, visibility in m |
temperature_deg | integer | METAR, temperature in degrees Celsius |
press_sea_level_p | float | METAR, sea level pressure in hPa |
press_p | float | METAR, QNH in hPa |
weather_intensity | list | METAR, list of present weather codes: qualifier - intensity |
weather_precipitation | list | METAR, list of present weather codes: weather phenomena - precipitation |
weather_desc | list | METAR, list of present weather codes: qualifier - descriptor |
weather_obscuration | list | METAR, list of present weather codes: weather phenomena - obscuration |
weather_other | list | METAR, list of present weather codes: weather phenomena - other |
This data set is augmented with data from various public data sources. Aircraft-related data is mostly from the OpenSky Network's aircraft database, the METAR information is from Iowa State University, and the rest is mostly scraped from different websites. If you need help with the METAR information, you can consult the WMO's Aerodrome Reports and Forecasts handbook.
go_arounds_agg.csv.gz
Compressed CSV containing the aggregated data set. It contains a row for each airport-runway, i.e. every runway at every airport for which data is available. The data is structured in the following way:
Column name | Type | Description |
---|---|---|
airport | string | ICAO airport code where the aircraft is landing |
runway | string | Runway designator on which the aircraft landed |
n_landings | integer | Total number of landings observed on this runway in 2019 |
ga_rate | float | Go-around rate, per 1000 landings |
glide_slope_angle | float | Angle of the ILS glide slope in degrees |
has_intersection | string | Boolean that is true if the runway has another runway intersecting it, otherwise false |
rwy_length | float | Length of the runway in kilometres |
airport_country | string | ISO Alpha-3 country code of the airport |
airport_region | string | Geographical region of the airport (either Europe, North America, South America, Asia, Africa, or Oceania) |
This aggregated data set is used in the paper for the generalized linear regression model.
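The exact model specification is given in the paper; purely as an illustration of how such a model could be set up on this file (a sketch using statsmodels, with go-around counts recovered from ga_rate and n_landings used as exposure; the covariate choice is an assumption):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

agg = pd.read_csv("go_arounds_agg.csv.gz")
agg = agg.dropna(subset=["ga_rate", "n_landings", "glide_slope_angle", "rwy_length", "has_intersection"])
agg["n_ga"] = (agg["ga_rate"] * agg["n_landings"] / 1000.0).round()  # recover counts from the per-1000 rate

# Poisson GLM with the number of landings as exposure (illustrative only)
model = smf.glm(
    "n_ga ~ glide_slope_angle + rwy_length + has_intersection",
    data=agg,
    family=sm.families.Poisson(),
    exposure=agg["n_landings"],
)
print(model.fit().summary())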
Downloading the trajectories
Users of this data set with access to the OpenSky Network's Impala shell can download the historical trajectories from the historical database with a few lines of Python code. For example, suppose you want to get all the go-arounds on 4 January 2019 at London City Airport (EGLC). You can use the Traffic library for easy access to the database:
import datetime
from tqdm.auto import tqdm
import pandas as pd
from traffic.data import opensky
from traffic.core import Traffic
# load minimal data set
df = pd.read_csv("go_arounds_minimal.csv.gz", low_memory=False)
df["time"] = pd.to_datetime(df["time"])
# select London City Airport, go-arounds, and 2019-01-04
airport = "EGLC"
start = datetime.datetime(year=2019, month=1, day=4).replace(
tzinfo=datetime.timezone.utc
)
stop = datetime.datetime(year=2019, month=1, day=5).replace(
tzinfo=datetime.timezone.utc
)
df_selection = df.query("airport==@airport & has_ga & (@start <= time) & (time <= @stop)")
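# The remainder of the example is not included in this record; the following
# continuation is a sketch only (the 10-minute window and the opensky.history
# options are assumptions, not the authors' exact code).
flights = []
delta = pd.Timedelta(minutes=10)  # assumed window around each landing / go-around

for _, row in tqdm(df_selection.iterrows(), total=len(df_selection)):
    flight = opensky.history(
        start=(row["time"] - delta).strftime("%Y-%m-%d %H:%M:%S"),
        stop=(row["time"] + delta).strftime("%Y-%m-%d %H:%M:%S"),
        callsign=row["callsign"],
        return_flight=True,
    )
    if flight is not None:
        flights.append(flight)

# combine the downloaded flights into a single Traffic object
trajectories = Traffic.from_flights(flights)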
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
Warning: Large file size (over 1GB). Each monthly data set is large (over 4 million rows), but can be viewed in standard software such as Microsoft WordPad (save by right-clicking on the file name and selecting 'Save Target As', or equivalent on Mac OS X). It is then possible to select the required rows of data and copy and paste the information into another software application, such as a spreadsheet. Alternatively, add-ons to existing software that handle larger data sets, such as the Microsoft PowerPivot add-on for Excel, can be used. The Microsoft PowerPivot add-on for Excel is available from the Microsoft Download Center, using the link in the 'Related Links' section below. Once PowerPivot has been installed, follow the instructions below to load the large files. Note that it may take at least 20 to 30 minutes to load one monthly file.
1. Start Excel as normal.
2. Click on the PowerPivot tab.
3. Click on the PowerPivot Window icon (top left).
4. In the PowerPivot Window, click on the "From Other Sources" icon.
5. In the Table Import Wizard, scroll to the bottom and select Text File.
6. Browse to the file you want to open and choose the file extension you require, e.g. CSV.
Once the data has been imported you can view it in a spreadsheet.
What does the data cover? General practice prescribing data is a list of all medicines, dressings and appliances that are prescribed and dispensed each month. A record will only be produced when this has occurred; there is no record for a zero total. For each practice in England, the following information is presented at presentation level for each medicine, dressing and appliance (by presentation name): the total number of items prescribed and dispensed, the total net ingredient cost, the total actual cost, and the total quantity. The data covers NHS prescriptions written in England and dispensed in the community in the UK. Prescriptions written in England but dispensed outside England are included. The data includes prescriptions written by GPs and other non-medical prescribers (such as nurses and pharmacists) who are attached to GP practices. GP practices are identified only by their national code, so an additional data file - linked to the first by the practice code - provides further detail in relation to the practice. Presentations are identified only by their BNF code, so an additional data file - linked to the first by the BNF code - provides the chemical name for that presentation.
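Since practice details and chemical names sit in separate files keyed by practice code and BNF code, a typical first step is to join them onto the prescribing data. A hedged sketch (file and column names below are placeholders; use the headers of the files you actually download):

import pandas as pd

rx = pd.read_csv("prescribing_monthly.csv")       # presentation-level prescribing data
practices = pd.read_csv("practice_details.csv")   # practice details, keyed by practice code
chemicals = pd.read_csv("bnf_chemicals.csv")      # chemical names, keyed by BNF code

rx = rx.merge(practices, on="PRACTICE_CODE", how="left")
rx = rx.merge(chemicals, on="BNF_CODE", how="left")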
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health on the one hand, and behavioral patterns and psychological measurements on the other, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset containing a plethora of anthropological data, collected unobtrusively over the course of more than 4 months by n=71 participants under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types, from second-level to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data openly available to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
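For example (the file name below is a placeholder for one of the daily-granularity CSV files shipped with the dataset):

import pandas as pd

daily = pd.read_csv("fitbit_daily.csv")   # placeholder name; use a CSV from the release
print(daily.head())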
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
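Once restored, the collections can be queried from Python with pymongo, for example (add username/password arguments to MongoClient if access control is enabled):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["rais_anonymized"]

print(db.list_collection_names())   # expected: fitbit, sema, surveys
print(db["fitbit"].find_one())      # inspect one document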
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
A Digital Terrain Model (DTM) is a digital file consisting of a grid of regularly spaced points of known height which, when used with other digital data such as maps or orthophotographs, can provide a 3D image of the land surface. 10m and 50m DTMs are available. This is a large dataset and will take some time to download. Please be patient. By downloading or using this dataset you agree to abide by the LPS Open Government Data Licence.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset serves to estimate the status, in particular the size, of a crowd given the impact on radio frequency communication links within a wireless sensor network. To quantify this relation, signal strengths across sub-GHz communication links are collected at the premises of the Tomorrowland music festival. The communication links are formed between the network nodes of wireless sensor networks deployed in three of the festival's stage environments.
The table below lists the eighteen dataset files. They were collected at the music festival's 2017 and 2018 editions. There are three environments, labeled ‘Freedom Stage 2017’, ‘Freedom Stage 2018’, and ‘Main Comfort 2018’. Each environment has both 433 MHz and 868 MHz data. The measurements at each environment were collected over a period of three festival days. The dataset files are formatted as Comma-Separated Values (CSV).
Dataset file | Reference file | Number of messages |
---|---|---|
free17_433_fri.csv | None | 393 852 |
free17_868_fri.csv | None | 472 202 |
free17_433_sat.csv | free17_transactions.csv | 996 033 |
free17_868_sat.csv | free17_transactions.csv | 1 023 059 |
free17_433_sun.csv | free17_transactions.csv | 1 007 066 |
free17_868_sun.csv | free17_transactions.csv | 1 036 456 |
free18_433_fri.csv | None | 765 024 |
free18_868_fri.csv | None | 757 657 |
free18_433_sat.csv | free18_transactions.csv | 711 438 |
free18_868_sat.csv | free18_transactions.csv | 714 390 |
free18_433_sun.csv | free18_transactions.csv | 648 329 |
free18_868_sun.csv | free18_transactions.csv | 656 290 |
main18_433_fri.csv | None | 791 462 |
main18_868_fri.csv | None | 908 407 |
main18_433_sat.csv | main18_counts.csv | 863 666 |
main18_868_sat.csv | main18_counts.csv | 884 682 |
main18_433_sun.csv | main18_counts.csv | 903 862 |
main18_868_sun.csv | main18_counts.csv | 894 496 |
In addition to the datasets and reference files, a software example is provided to illustrate the use of the data and to visualise the initial findings on the relation between crowd size and its impact on network signal strength.
In order to use the software, please retain the following file structure:
.
├── data
├── data_reference
├── graphs
└── software
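As a rough illustration of the kind of analysis the software example performs (the column names "timestamp" and "rssi" below are placeholders; consult the data descriptor for the actual schema):

import pandas as pd

msgs = pd.read_csv("data/free17_433_sat.csv")
msgs["timestamp"] = pd.to_datetime(msgs["timestamp"])

# mean received signal strength per 10-minute window as a crude view of crowd impact
rssi_trend = msgs.set_index("timestamp")["rssi"].resample("10min").mean()
print(rssi_trend.head())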
The peer-reviewed data descriptor for this dataset has now been published in MDPI Data, an open access journal aimed at enhancing data transparency and reusability, and can be accessed here: https://doi.org/10.3390/data5020052. Please cite this article when using the dataset.
https://www.futurebeeai.com/data-license-agreement
The English Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the English language, advancing the field of artificial intelligence.
Dataset Content: This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in English. There is no context paragraph to choose an answer from; each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native English speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled English Open-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, and rich_text.
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination. Both the questions and answers in English are grammatically accurate, without word or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy English Open-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: Dataset of melt pool geometry variability data in Powder Bed Fusion - Laser Beam of Ti-6Al-4V. This work was conducted on an EOS M290.
Contents:
MTMeasurements.csv: A csv file with the multi track measurements including cap heights, remelt depths, and widths by orientation and velocity
STMeasurements.csv: A csv file with the single track measurements including cap heights, remelt depths, and widths by orientation and velocity
Note: These measurements were not used in the manuscript.
StWidths.csv: A csv file containing the widths as a function of lengths with the beginning and end of each track removed. These are labeled by location along the length, the measured width, velocity, and orientation.
WARNING: StWidths.csv is too large to open in Excel. Saving it in Excel will cause data loss.
figures.ipynb: A Jupyter notebook that will generate all of the figures published with the article.
Additionally, all of the individual figure files are labeled as they occur in the manuscript and are generated by the code.
Citation: Please use the following reference if you find this dataset useful.
@article{Miner2024,
  author = "Justin Miner and Sneha Prabha Narra",
  title  = "{Dataset of Melt Pool Variability Measurements for Powder Bed Fusion - Laser Beam of Ti-6Al-4V}",
  year   = "2024",
  month  = "5",
  url    = "https://kilthub.cmu.edu/articles/dataset/Dataset_of_Melt_Pool_Variability_Measurements_for_Powder_Bed_Fusion_-_Laser_Beam_of_Ti-6Al-4V/25696293",
  doi    = "10.1184/R1/25696293.v1"
}
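Because StWidths.csv is too large for Excel, a pandas sketch for loading it (the column names "width", "velocity", and "orientation" are assumptions based on the file description above):

import pandas as pd

widths = pd.read_csv("StWidths.csv")
print(widths.groupby(["velocity", "orientation"])["width"].agg(["mean", "std"]))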
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:
Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.
Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles
Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection
Definitions for individual data fields:
pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine
doi: Digital Object Identifier, if available
year: Year the article was published
title: Title of the article
authors: List of author names
journal: Journal name (ISO abbreviation)
is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with that of a primary research article
relative_citation_ratio: Relative Citation Ratio (RCR): OPA's metric of scientific influence. Field-adjusted, time-adjusted, and benchmarked against NIH-funded papers. The median RCR for NIH-funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year as the median NIH-funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.
provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not necessarily as stable as they are for older articles. Provisional RCRs are provided for papers published in the previous year if they have received 5 citations or more, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is considered to be the NIH Fiscal Year, which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 citations or more, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 citations or more, and all papers from 2018 receive provisional RCRs.
citation_count: Number of unique articles that have cited this one
citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year from the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.
field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.
expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.
nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.
human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)
animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)
molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)
x_coord: X coordinate of the article on the Triangle of Biomedicine
y_coord: Y Coordinate of the article on the Triangle of Biomedicine
is_clinical: Flag indicating that this paper meets the definition of a clinical article.
cited_by_clin: PMIDs of clinical articles that this article has been cited by.
apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.
cited_by: PMIDs of articles that have cited this one.
references: PMIDs of articles in this article's reference list.
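A minimal pandas sketch for working with the CSV snapshot, using the field names defined above (the snapshot file name is a placeholder, and the encoding of the is_research_article flag may differ, e.g. boolean vs. Yes/No):

import pandas as pd

icite = pd.read_csv("icite_metadata.csv", low_memory=False)   # placeholder file name

influential = icite[(icite["is_research_article"] == True) & (icite["relative_citation_ratio"] > 1.0)]
print(influential[["pmid", "year", "relative_citation_ratio", "nih_percentile"]].head())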
Large CSV files are zipped using zip version 4.5, which is more recent than the default unzip command line utility in some common Linux distributions. These files can be unzipped with tools that support version 4.5 or later such as 7zip.
Comments and questions can be addressed to iCite@mail.nih.gov
This dataset contains all current and active business licenses issued by the Department of Business Affairs and Consumer Protection. This dataset contains a large number of records/rows of data and may not be viewable in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Notepad or WordPad, to view and search.
Data fields requiring description are detailed below.
APPLICATION TYPE: 'ISSUE' is the record associated with the initial license application. 'RENEW' is a subsequent renewal record. All renewal records are created with a term start date and term expiration date. 'C_LOC' is a change of location record; it means the business moved. 'C_CAPA' is a change of capacity record; only a few license types may file this type of application. 'C_EXPA' only applies to businesses that have liquor licenses; it means the business location expanded.
LICENSE STATUS: 'AAI' means the license was issued.
Business license owners may be accessed at: http://data.cityofchicago.org/Community-Economic-Development/Business-Owners/ezma-pppn To identify the owner of a business, you will need the account number or legal name.
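A short pandas sketch using the fields described above (the exported file name is a placeholder), for example to keep only initial issuances of licenses in 'AAI' status:

import pandas as pd

licenses = pd.read_csv("Business_Licenses.csv", low_memory=False)   # placeholder file name
issued = licenses[(licenses["APPLICATION TYPE"] == "ISSUE") & (licenses["LICENSE STATUS"] == "AAI")]
print(len(issued))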
Data Owner: Business Affairs and Consumer Protection
Time Period: Current
Frequency: Data is updated daily
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A point-in-time ‘snapshot’ of all vehicles currently registered in New Zealand. The data relates to currently registered vehicles as recorded on the Motor Vehicle Register (MVR). We update it monthly, so it's accurate up to the end of the previous month. Motor Vehicle Register data can also be accessed via API. Registration is the process where we add a vehicle’s details to the MVR and issue its number plates. It is not the same thing as vehicle licensing, also called ‘rego’. To give you a quick overview of the data, see the charts in the ‘Attributes’ section below. These will give you information about each of the attributes (variables) in the dataset. Each chart is specific to a variable, and shows all data (without any filters applied). See the Motor Vehicle Register data field descriptions for details.
Data reuse caveats: as per licence. We’ve taken reasonable care in compiling this information, and provide it on an ‘as is, where is’ basis. We are not liable for any action taken on the basis of the information. For further information see the Waka Kotahi website, as well as the terms of the CC BY 4.0 International licence under which we publish this data. Variables in the dataset are formatted for analytical use. This can result in attribute charts that may not appear meaningful, and are not suitable for broader analysis or use. In addition, some variables are not mutually exclusive and should not be considered in isolation. As such, these charts should not be taken and used directly as analysis of the overall data.
Data quality statement: this data relates to vehicles, not people. We have included some information about where vehicle registered owners live. This is based on the most recent information we have about their physical address. To make sure it isn’t possible to identify a person in the data, we have provided this at Territorial Authority (TA) level. A TA is a broad geographical area defined under the Local Government Act 2002 as a city council or district council. There are 67 TAs, consisting of 12 city councils, 53 districts, Auckland Council and Chatham Islands Council. We haven’t included vehicles that belong to people with a confidential listing. We have restricted the Vehicle Identification Number (VIN) to the first 11 characters - these are generic and don’t identify specific vehicles.
Data quality caveats: many of the fields in the MVR are free-text fields, which means there may be spelling mistakes and other human errors. We have algorithmically cleaned the data to correct identified errors (particularly with respect to a vehicle’s make and model). However, due to the large number of vehicles on the Register we may not have corrected some information. Additionally, some variables may be subject to differences in how people have recorded details - for example, manufacturers release a variety of sub-models and these may not be referred to, or put into the system, in the same way. We have made our cleaning code open source: see the vehicle make and model cleansing code on GitHub.
https://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies"
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the Python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
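The original analysis was carried out in the RStudio notebooks named above; purely as an illustration of the first decompression step described for Sequence_Analysis.Rmd, a Python sketch (the archive location is an assumption):

import glob
import tarfile

# unpack every per-dataset consensus archive into a working directory
for archive in glob.glob("Pipeline_Outputs/consensus_*.tar.gz"):
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path="Pipeline_Outputs/extracted")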
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the raw experimental data and supplementary materials for the "Asymmetry Effects in Virtual Reality Rod and Frame Test". The materials included are:
• Raw Experimental Data: older.csv and young.csv
• Mathematica Notebooks: a collection of Mathematica notebooks used for data analysis and visualization. These notebooks provide scripts for processing the experimental data, performing statistical analyses, and generating the figures used in the project.
• Unity Package: a Unity package featuring a sample scene related to the project. The scene was built using Unity’s Universal Render Pipeline (URP). To utilize this package, ensure that URP is enabled in your Unity project. Instructions for enabling URP can be found in the Unity URP Documentation.
Requirements:
• For Data Files: software capable of opening CSV files (e.g., Microsoft Excel, Google Sheets, or any programming language that can read CSV formats).
• For Mathematica Notebooks: Wolfram Mathematica software to run and modify the notebooks.
• For Unity Package: Unity Editor version compatible with URP (2019.3 or later recommended). URP must be installed and enabled in your Unity project.
Usage Notes:
• The dataset facilitates comparative studies between different age groups based on the collected variables.
• Users can modify the Mathematica notebooks to perform additional analyses.
• The Unity scene serves as a reference to the project setup and can be expanded or integrated into larger projects.
Citation: Please cite this dataset when using it in your research or publications.
These datasets are a subset of the CMS Open Data with 2012 data-taking conditions, intended for education purposes. In this version, the data and simulation files are compressed into one big file for easy access. They are stored in two different formats (CSV and PKL) with the same content, so just use one of them. Once unzipped:
- Data files, starting with output_data_CMS_Run2012B, correspond to 4429.37 /pb of data collected by the CMS Experiment. They are a subset of the dataset in reference [1].
- Simulation files, starting with output_sim_CMS_MonteCarlo2012, are a subset of the dataset referenced in [2]. The number of generated events in this case is 30458871, and the cross section is 3503.71.
All the files were processed with a modified version of the AOD2NanoAODOutreachTool [3]. The small modifications are related to the number of triggers stored, and some objects like taus were removed.
[1] CMS collaboration (2017). DoubleMuParked primary dataset in AOD format from Run of 2012 (/DoubleMuParked/Run2012B-22Jan2013-v1/AOD). CERN Open Data Portal. DOI:10.7483/OPENDATA.CMS.YLIC.86ZZ
[2] Wunsch, Stefan (2019). DYJetsToLL dataset in reduced NanoAOD format for education and outreach. CERN Open Data Portal. DOI:10.7483/OPENDATA.CMS.SRRA.2GON
[3] https://github.com/cms-opendata-analyses/AOD2NanoAODOutreachTool
For the CSV files you might need to open them using pandas as: pandas.read_csv('output_data.csv', index_col=['entry','subentry']). For the pickle files, you might need to use Python 3.
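For example (file names are placeholders for the unzipped data and simulation files):

import pandas as pd

data = pd.read_csv("output_data.csv", index_col=["entry", "subentry"])
sim = pd.read_pickle("output_sim.pkl")   # the PKL variant loads directly into a DataFrame
print(data.head())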
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
pNEUMA is an open large-scale dataset of naturalistic trajectories of half a million vehicles, collected through a one-of-a-kind experiment by a swarm of drones over the congested downtown area of Athens, Greece. A unique observatory of traffic congestion, at a scale an order of magnitude higher than what was available until now, that researchers from different disciplines around the globe can use to develop and test their own models.
How are the .csv files organized?
For more details about the pNEUMA dataset, please check our website at https://open-traffic.epfl.ch
Abstract copyright UK Data Service and data collection copyright owner.
The heat pump monitoring datasets are a key output of the Electrification of Heat Demonstration (EoH) project, a government-funded heat pump trial assessing the feasibility of heat pumps across the UK’s diverse housing stock. These datasets are provided in both cleansed and raw form and allow analysis of the initial performance of the heat pumps installed in the trial. From the datasets, insights such as heat pump seasonal performance factor (a measure of the heat pump's efficiency), heat pump performance during the coldest day of the year, and half-hourly performance to inform peak demand can be gleaned.
For the second edition (December 2024), the data were updated to include performance data collected between November 2020 and September 2023. The only documentation currently available with the study is the Excel data dictionary. Reports and other contextual information can be found on the Energy Systems Catapult website.
The EoH project was funded by the Department for Business, Energy and Industrial Strategy (BEIS). From 2023, it is covered by the new Department for Energy Security and Net Zero.
Data availability
This study comprises the raw data from the EoH project, which is only available to registered UKDS users. Only the summary data file is available via standard UKDS EUL download, due to the large size of the full raw data files. To obtain the full set of raw data, registered UKDS users should:
When unzipped, the raw data available via FTP consists of 742 CSV files. Most of the individual CSV files are too large to open in Excel. Before requesting FTP, users should ensure they have sufficient computing facilities to analyse the data.
The UKDS also holds an accompanying open-access study, SN 9050 Electrification of Heat Demonstration Project: Heat Pump Performance Cleansed Data, 2020-2023. This contains the cleansed data from the EoH project, which does not require UKDS registration to access. However, since the data are similar in size to this study, only the summary dataset is available to download; an order must be placed for FTP delivery of the remaining cleansed data. Other studies in the set include SN 9209, which comprises 30-minute interval heat pump performance data, and SN 9210, which includes daily heat pump performance data.
The Python code used to cleanse the raw data and then perform the analysis is accessible via the Energy Systems Catapult GitHub.
Heat Pump Performance across the BEIS funded heat pump trial, The Electrification of Heat (EoH) Demonstration Project. See the documentation for data contents.
The U.S. Geological Survey (USGS) Water Resources Mission Area (WMA) is working to address a need to understand where the Nation is experiencing water shortages or surpluses relative to the demand by delivering routine assessments of water supply and demand. A key part of these national assessments is identifying long-term trends in water availability, including groundwater and surface water quantity, quality, and use. This data release contains Mann-Kendall monotonic trend analyses for annual groundwater metrics at 54,932 wells located in the conterminous United States, Alaska, Hawaii, and Puerto Rico. The groundwater metrics include annual mean, maximum, and minimum water level and the timing of the annual maximum and minimum groundwater level. These metrics are computed from groundwater water levels from publicly available data from the National Water Information System (NWIS), the National Groundwater Monitoring Network (NGWMN) and the California Open Data Portal. Trend analyses are computed using annual groundwater metrics through the water year, which is defined as the 12-month period from October 1 of any given year through September 30 of the following year (for example, October 2019 through September 2020). Trends at each well are available for up to four different periods: (i) the longest possible period that meets completeness criteria at each well, (ii) 1980-2020, (iii) 1990-2020, (iv) 2000-2020. Annual mean, maximum, and minimum water-level metrics for wells screened in unconfined aquifers were determined only when a well's water-level time series was at least 70 percent complete. Additionally, each of these time series must have at least 70 percent complete records in the first and last decade. All longest-possible-period time series for wells in unconfined aquifers must be at least 10 years long and have annual metric values calculated for at least 70 percent of the years of the record. Annual mean, maximum, and minimum water-level metrics for wells screened in confined aquifers were determined only when a well's water-level time series was at least 50 percent complete. Additionally, each of these time series must have at least 50 percent complete records in the first and last decade. All longest-possible-period time series for wells in confined aquifers must be at least 10 years long and have annual metric values calculated for at least 50 percent of the years in the last 10 years of the record. Caution must be exercised when utilizing monotonic trend analyses conducted over periods of up to several decades (and in some places longer ones) due to the potential for confounding deterministic gradual trends with multi-decadal climatic fluctuations.
This data release contains:
Six input files:
NGWMN_gwl_meta_v2.0.csv, the metadata from the National Groundwater Monitoring Network
NGWMN_gwl_data_v2.0.csv, the groundwater water level data from the National Groundwater Monitoring Network
NWIS_gwl_meta_v2.0.csv, the metadata from the National Water Information System
NWIS_gwl_data_v2.0.csv, the groundwater water level data from the National Water Information System
CA_measurements_v2.0.csv, the groundwater level data from the California Open Data Portal
CA_stations_v2.0.csv, the groundwater metadata from the California Open Data Portal
Two output files:
GW_trendsout_v2.0.csv, the groundwater water level trend data from both the National Groundwater Monitoring Network and the National Water Information System
GW_confband_out_v2.0.csv, the confidence bands associated with the groundwater water level trend data from both the National Groundwater Monitoring Network and the National Water Information System
A .zip file containing all of the code used to compute these trends, along with a README file with information on using the code.
First posted: Feb 27, 2024 (available from author). Revised: Jan 30, 2025 (version 2.0).
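Purely as an illustration of a Mann-Kendall-style check on one well's annual mean water level (the released trends were computed with the code in the accompanying .zip file, not with this sketch, and the column names below are assumptions):

import pandas as pd
from scipy.stats import kendalltau

levels = pd.read_csv("NWIS_gwl_data_v2.0.csv")
well = levels[levels["site_no"] == levels["site_no"].iloc[0]]      # pick one well (column name assumed)
annual_mean = well.groupby("water_year")["water_level"].mean()     # annual mean metric (column names assumed)

tau, p_value = kendalltau(annual_mean.index, annual_mean.values)
print(f"Kendall tau = {tau:.3f}, p = {p_value:.3f}")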
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: This is a large dataset. To download, go to the ArcGIS Open Data Set, click the download button, and under additional resources select the geodatabase option. Data layer depicting periodical cicada distribution and expected year of emergence by cicada brood and county. The periodical cicada emerges in massive groups once every 13 or 17 years and is completely unique to North America. There are 15 of these mass groups, called broods, of periodical cicadas in the United States. This county-based data, compiled by the USFS Northern Research Station, depicts where and when the different broods of periodical cicadas are likely to emerge in the US through 2037. The data was compiled for the 2011 publication entitled "Avian predators are less abundant during periodical cicada emergences, but why?" (Koenig et al., https://dx.doi.org/10.1890/10-1583.1) using data from the periodical cicada publications listed below. 1) Marlatt, C. L. 1907. "The periodical cicada". Bulletin of the USDA Bureau of Entomology 71:1-181. 2) Simon, C. 1988. "Evolution of 13- and 17-year periodical cicadas (Homoptera: Cicadidae)". Bulletin of the Entomological Society of America 34:163-176. 3) Liebhold, A. M., Bohne, M. J., and R. L. Lilja. 2013. "Active Periodical Cicada Broods of the United States". USDA Forest Service Northern Research Station, Northeastern Area State and Private Forestry.
Metadata and Downloads: This record was taken from the USDA Enterprise Data Inventory that feeds into the https://data.gov catalog. Data for this record includes the following resources: ISO-19139 metadata, ArcGIS Hub Dataset, ArcGIS GeoService, OGC WMS, CSV, Shapefile, GeoJSON, KML. For complete information, please visit https://data.gov.