Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The repository contains an extensive dataset of PV power measurements and a Python package (qcpv) for quality control of PV power measurements. The dataset features four years (2014-2017) of power measurements from 175 rooftop-mounted residential PV systems located in Utrecht, the Netherlands. The power measurements have a 1-min resolution.
PV power measurements
Three different versions of the power measurements are included as three data subsets in the repository. Unfiltered power measurements are enclosed in unfiltered_pv_power_measurements.csv. Filtered power measurements are included as filtered_pv_power_measurements_sc.csv and filtered_pv_power_measurements_ac.csv. The former contains the quality-controlled power measurements after running single-system filters only; the latter contains the output after running both single-system and across-system filters. The metadata of the PV systems is added in metadata.csv. For each PV system, this file holds a unique ID, start and end time of registered power measurements, estimated DC and AC capacity, tilt and azimuth angle, annual yield, and the mapped grid of the system location (north, south, west and east boundaries).
Quality control routine
An open-source quality control routine that can be applied to filter erroneous PV power measurements is added to the repository in the form of the Python package qcpv (qcpv.py). Sample code to call and run the functions in the qcpv package is available as example.py.
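A minimal sketch of how the published CSV files might be loaded before applying the qcpv routines is shown below. It is not part of the repository; the exact qcpv function calls are documented in example.py, and the column layout (one column per PV system ID) is an assumption to be checked against metadata.csv.

```python
# Hedged sketch: load the published CSV files for use with the qcpv routines.
# Column layout is assumed; see example.py in the repository for authoritative usage.
import pandas as pd

# Unfiltered 1-min power measurements; timestamps are UTC (see "Units" below).
power = pd.read_csv("unfiltered_pv_power_measurements.csv",
                    index_col=0, parse_dates=True)

# System metadata: ID, measurement period, DC/AC capacity, tilt, azimuth,
# annual yield and the mapped location grid.
metadata = pd.read_csv("metadata.csv")

print(power.index.min(), power.index.max())  # expected to span 2014-2017
print(metadata.head())
```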
Objective
By publishing the dataset we provide access to high-quality PV power measurements that can be used for research on several topics related to PV power and the integration of PV into the electricity grid.
By publishing the qcpv package we aim to take a next step toward a standardized routine for the quality control of PV power measurements. We hope to encourage others to adopt and improve the routine and to work towards a widely adopted standard.
Data usage
If you use the data and/or python package in a published work please cite: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2022. Open-source quality control routine and multi-year power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy.
Units
Timestamps are in UTC (YYYY-MM-DD HH:MM:SS+00:00).
Power measurements are in Watt.
Installed capacities (DC and AC) are in Watt-peak.
Additional information
A detailed discussion of the data and qcpv package is presented in: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2022. Open-source quality control routine and multi-year power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy. Corrections are discussed in: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2024. Erratum: Open-source quality control routine and multiyear power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy.
Acknowledgements
This work is part of the Energy Intranets (NEAT: ESI-BiDa 647.003.002) project, which is funded by the Dutch Research Council NWO in the framework of the Energy Systems Integration & Big Data programme. The authors would especially like to thank the PV owners who volunteered to take part in the measurement campaign.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Covid19Kerala.info-Data is a consolidated multi-source open dataset of metadata from the COVID-19 outbreak in the Indian state of Kerala. It is created and maintained by volunteers of the 'Collective for Open Data Distribution-Keralam' (CODD-K), a nonprofit consortium of individuals formed for the distribution and longevity of open datasets. Covid19Kerala.info-Data covers a set of correlated temporal and spatial metadata of SARS-CoV-2 infections and prevention measures in Kerala. Static snapshot releases of this dataset are manually produced from a live database maintained as a set of publicly accessible Google Sheets. This dataset is made available under the Open Data Commons Attribution License v1.0 (ODC-BY 1.0).
Schema and data package: a data package with the schema definition is accessible at https://codd-k.github.io/covid19kerala.info-data/datapackage.json. The provided data package and schema are based on the Frictionless Data Package specification.
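As a usage hint, the descriptor can be read with any Frictionless-compatible tool; the sketch below assumes the frictionless-py Python package, which is not prescribed by the dataset itself, and the available resource names depend on the live schema.

```python
# Hedged sketch: read the published datapackage descriptor with frictionless-py.
from frictionless import Package

pkg = Package("https://codd-k.github.io/covid19kerala.info-data/datapackage.json")
print(pkg.resource_names)               # data facets described by the schema

resource = pkg.get_resource(pkg.resource_names[0])
rows = resource.read_rows()             # rows typed and validated per the schema
print(rows[:5])
```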
Temporal and Spatial Coverage
This dataset covers the COVID-19 outbreak and related data from the state of Kerala, India, from January 31, 2020 until the date of publication of this snapshot. The dataset shall be maintained throughout the entirety of the COVID-19 outbreak.
The spatial coverage of the data lies within the geographical boundaries of the state of Kerala, including its 14 administrative subdivisions. The state is further divided into Local Self Governing (LSG) Bodies. References to this spatial information are included on the appropriate data facets. Spatial information on regions outside Kerala is mentioned where available, but only as a reference to the possible origins of infection clusters or the movement of individuals.
Longevity and Provenance
The dataset snapshot releases are published and maintained in a designated GitHub repository by the CODD-K team. Periodic snapshots of the live database will be released at regular intervals. The GitHub commit logs for the repository will be maintained as a record of provenance, and an archived repository will be kept at the end of the project lifecycle to ensure the longevity of the dataset.
Data Stewardship
CODD-K expects all administrators, managers, and users of its datasets to manage, access, and utilize them in a manner consistent with the consortium's need for security and confidentiality and with the relevant legal frameworks in all geographies, especially Kerala and India. As a responsible steward that maintains and makes this dataset accessible, CODD-K disclaims all liability for any damages caused by inaccuracies in the dataset.
License
This dataset is made available by the CODD-K consortium under ODC-BY 1.0 license. The Open Data Commons Attribution License (ODC-By) v1.0 ensures that users of this dataset are free to copy, distribute and use the dataset to produce works and even to modify, transform and build upon the database, as long as they attribute the public use of the database or works produced from the same, as mentioned in the citation below.
Disclaimer
Covid19Kerala.info-Data is provided under the ODC-BY 1.0 license as-is. Although every attempt is made to ensure that the data is error-free and up to date, the CODD-K consortium does not bear any responsibility for inaccuracies in the dataset or any losses, monetary or otherwise, that users of this dataset may incur.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The COKI Open Access Dataset measures open access performance for 225 countries and 50,000 institutions and is available in JSON Lines format. The data is visualised at the COKI Open Access Dashboard: https://open.coki.ac/.
The COKI Open Access Dataset is created with the COKI Academic Observatory data collection pipeline, which fetches data about research publications from multiple sources, synthesises the datasets and creates the open access calculations for each country and institution.
Each week a number of specialised research publication datasets are collected. The datasets that are used for the COKI Open Access Dataset release include Crossref Metadata, OpenAlex, Unpaywall and the Research Organization Registry.
After the datasets are fetched, they are synthesised to produce aggregate time series statistics for each country and institution in the dataset. These aggregate time series statistics include publication count, open access status and citation count.
See https://open.coki.ac/data/ for the dataset schema. A new version of the dataset is deposited every week.
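Since the releases are distributed as JSON Lines, each line of a release file is one standalone JSON record. The sketch below is illustrative only; the file name and field names are assumptions, and the authoritative schema is the one published at https://open.coki.ac/data/.

```python
# Hedged sketch: read a JSON Lines release file record by record.
import json

records = []
with open("country.jsonl", encoding="utf-8") as fh:   # file name illustrative
    for line in fh:
        records.append(json.loads(line))

# Each record holds the aggregate time series statistics for one country
# or institution (publication count, open access status, citation count).
print(len(records), "records loaded")
```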
Code
The COKI Academic Observatory data collection pipeline is used to create the dataset.
The COKI OA Website Github project contains the code for the web app that visualises the dataset at open.coki.ac. It can be found on Zenodo here.
License: COKI Open Access Dataset © 2022 by Curtin University is licensed under CC BY 4.0.
Attributions: This work contains information from:
OpenAlex which is made available under the CC0 license.
Crossref Metadata via the Metadata Plus program. Bibliographic metadata is made available without copyright restriction and Crossref generated data under a CC0 licence. See metadata licence information for more details.
Unpaywall. The Unpaywall Data Feed is used under license. Data is freely available from Unpaywall via the API, data dumps and as a data feed.
Research Organization Registry which is made available under a CC0 licence.
This data set is a collection of measurements of carbon dioxide (CO2) and non-CO2 greenhouse gases made across North America by nine independent atmospheric monitoring networks from 2000 to 2009. During this North American Carbon Program (NACP) sponsored activity, data were compiled from the following networks: AGAGE, COBRA, CSIRO, INTEX-A, INTEX-B, Irvine Latitude Network, NOAA CMDL, SCRIPPS, and Stanley Tyler-UC Irvine. The files presented here are the products of merging multiple original measurement results files for selected sites across North America from each monitoring network. The primary focus of this effort was the compilation of non-CO2 greenhouse gases over North America, but numerous CO2 observations are also included. The data files for each network are accompanied by detailed readme documentation files prepared by the respective network investigators. Project descriptions, objectives, references, sampling and analysis methods, and data file descriptions are included in these READMEs. Table 1 in the documentation displays the monitoring network sites, sample types, analytes, and links to the detailed network README files. Network- and laboratory-specific data citations are included in the README documentation and should be used to acknowledge the use of these data as appropriate. The data files for each monitoring network and each sampling type (continuous or flasks) have been combined into one compressed (*.zip) file along with the detailed README document. There are 17 compressed files that, when expanded, contain data files which each represent one year's data for that specific campaign and sampling method. The number of annual files that were compiled from a network into this collection varies.
The Land Processes Distributed Active Archive Center (LP DAAC) archives and distributes Global Forest Cover Change (GFCC) data products through the NASA Making Earth System Data Records for Use in Research Environments (MEaSUREs, https://earthdata.nasa.gov/about/competitive-programs/measures) Program. The GFCC Surface Reflectance Estimates Multi-Year Global dataset is derived from the enhanced Global Land Survey (GLS) datasets for epochs centered on the years 1990, 2000, 2005, and 2010. The GLS datasets are composed of Landsat 5 Thematic Mapper (TM) and Landsat 7 Enhanced Thematic Mapper Plus (ETM+) images at 30 meter resolution. Data available for this product represent the best available "leaf-on" date during the peak growing season. The original GLS datasets were enhanced with supplemental Landsat images when data were incomplete for the epoch or inadequate for analysis due to acquisition during "leaf-off" seasons. The enhanced GLS data were acquired June 1984 through August 2011. Atmospheric corrections were applied to seven visible bands to estimate surface reflectance by compensating for the scattering and absorption of radiance by atmospheric conditions. GFCC30SR is a multi-file data product. The surface reflectance data products are used as source data for other datasets in the GFCC collection. For each available date, data files are delivered in a zip folder that consists of six surface reflectance bands, a Top of Atmosphere temperature band, an Atmospheric Opacity layer, and the Landsat Surface Reflectance Quality layer. Data follow the Worldwide Reference System-2 tiling scheme. Additional details regarding the methodology used to create the data are available in the Algorithm Theoretical Basis Document (ATBD).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Technological advances in mass spectrometry (MS) toward more accurate and faster data acquisition result in highly informative but also more complex data sets. Especially the hyphenation of liquid chromatography (LC) and MS yields large data files containing a high amount of compound-specific information. Using electrospray ionization for compounds such as polymers enables highly sensitive detection, yet results in very complex spectra containing multiply charged ions and adducts. Recent years have seen the development of novel or updated data mining strategies to reduce MS spectra complexity and to ultimately simplify the data analysis workflow. Among other techniques, the Kendrick mass defect analysis, which graphically highlights compounds containing a given repeating unit, has been revitalized with applications in multiple fields of study, such as lipids and polymers. Especially for the latter, various data mining concepts have been developed which extend regular Kendrick mass defect analysis to multiply charged ion series. The aim of this work is to collect and subsequently implement these concepts in one of the most popular open-source MS data mining software packages, i.e., MZmine 2, to make them rapidly available for different MS-based measurement techniques and various vendor formats, with a special focus on hyphenated techniques such as LC-MS. In combination with already existing data mining modules, an example data set was processed and simplified, enabling even faster evaluation and polymer characterization.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This open-source package contains the relevant files to perform the checkerboard calibration. The provided data allows reproducing the calibration inside the large-scale ocean aquarium at the Rotterdam Zoo.
The included folders contain:
- Calibration: the calibration files, which can be reproduced.
- MatlabCode: the relevant Matlab files and functions to perform the calibration.
- TestImage: a test image to perform physical measurements inside the ocean aquarium.
The checkerboard calibration can be run via the file main_calibration, where notes and instructions are included.
DESCRIPTION
The TAU Spatial Room Impulse Response Database (TAU-SRIR DB) contains spatial room impulse responses (SRIRs) captured in various spaces of Tampere University (TAU), Finland, for a fixed receiver position and multiple source positions per room, along with separate recordings of spatial ambient noise captured at the same recording point. The dataset is intended for emulation of spatial multichannel recordings for evaluation and/or training of multichannel processing algorithms in realistic reverberant conditions and over multiple rooms. The major distinct properties of the database compared to other databases of room impulse responses are:
Capturing in a high resolution multichannel format (32 channels) from which multiple more limited application-specific formats can be derived (e.g. tetrahedral array, circular array, first-order Ambisonics, higher-order Ambisonics, binaural).
Extraction of densely spaced SRIRs along measurement trajectories, allowing emulation of moving source scenarios.
Multiple source distances, azimuths, and elevations from the receiver per room, allowing emulation of complex configurations for multi-source methods.
Multiple rooms, allowing evaluation of methods at various acoustic conditions, and training of methods with the aim of generalization on different rooms.
The RIRs were collected by staff of TAU between 12/2017 - 06/2018, and between 11/2019 - 1/2020. The data collection received funding from the European Research Council, grant agreement 637422 EVERYSOUND.
NOTE: This database is a work-in-progress. We intend to publish additional rooms, additional formats, and potentially higher-fidelity versions of the captured responses in the near future, as new versions of the database in this repository.
REPORT AND REFERENCE
A compact description of the dataset, recording setup, recording procedure, and extraction can be found in:
Politis, Archontis, Adavanne, Sharath, & Virtanen, Tuomas (2020). A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), Tokyo, Japan.
available here. A more detailed report specifically focusing on the dataset collection and properties will follow.
AIM
The dataset can be used for generating multichannel or monophonic mixtures for testing or training of methods under realistic reverberation conditions, related to e.g. multichannel speech enhancement, acoustic scene analysis, and machine listening, among others. It is especially suitable for the following application scenarios:
monophonic and multichannel reverberant single- or multi-source speech in multi-room reverberant conditions
monophonic and multichannel polyphonic sound events in multi-room reverberant conditions
single-source and multi-source localization in multi-room reverberant conditions, in static or dynamic scenarios
single-source and multi-source tracking in multi-room reverberant conditions, in static or dynamic scenarios
sound event localization and detection in multi-room reverberant conditions, in static or dynamic scenarios
SPECIFICATIONS
The SRIRs were captured using an Eigenmike spherical microphone array. A Genelec G Three loudspeaker was used to playback a maximum length sequence (MLS) around the Eigenmike. The SRIRs were obtained in the STFT domain using a least-squares regression between the known measurement signal (MLS) and far-field recording independently at each frequency. In this version of the dataset the SRIRs and ambient noise are downsampled to 24kHz for compactness.
The currently published SRIR set was recorded at nine different indoor locations inside the Tampere University campus at Hervanta, Finland. Additionally, 30 minutes of ambient noise recordings were collected at the same locations with the IR recording setup unchanged. SRIR directions and distances differ with the room. Possible azimuths span the whole range of $\phi\in[-180,180)$, while the elevations span approximately a range between $\theta\in[-45,45]$ degrees. The currently shared measured spaces are as follows:
Large open space in underground bomb shelter, with plastic-coated floor and rock walls. Ventilation noise. Circular source trajectory.
Large open gym space. Ambience of people using weights and gym equipment in adjacent rooms. Circular source trajectory.
Small classroom (PB132) with group work tables and carpet flooring. Ventilation noise. Circular source trajectory.
Meeting room (PC226) with hard floor and partially glass walls. Ventilation noise. Circular source trajectory.
Lecture hall (SA203) with inclined floor and rows of desks. Ventilation noise. Linear source trajectory.
Small classroom (SC203) with group work tables and carpet flooring. Ventilation noise. Linear source trajectory.
Large classroom (SE203) with hard floor and rows of desks. Ventilation noise. Linear source trajectory.
Lecture hall (TB103) with inclined floor and rows of desks. Ventilation noise. Linear source trajectory.
Meeting room (TC352) with hard floor and partially glass walls. Ventilation noise. Circular source trajectory.
The measurement trajectories were organised in groups, with each group being specified by a circular or linear trace on the floor at a certain distance from the z-axis of the microphone. For circular trajectories two ranges were measured, a close and a far one, except in room TC352, where the same range was measured twice, but with a different furniture configuration and open or closed doors. For linear trajectories, two ranges were also measured, close and far, but with linear paths at either side of the array, resulting in 4 unique trajectory groups, with the exception of room SA203, where 3 ranges were measured, resulting in 6 trajectory groups. Linear trajectory groups are always parallel to each other in the same room.
Each trajectory group had multiple measurement trajectories, following the same floor path, but with the source at different heights.
The SRIRs are extracted from the noise recordings of the slowly moving source across those trajectories, at an angular spacing of approximately 1 degree as seen from the microphone. This extraction scheme, rather than extracting SRIRs at equally spaced points along the path (e.g. every 20 cm), was found more practical for synthesis purposes, making it easier to emulate moving sources at an approximately constant angular speed.
More details on the trajectory geometries can be found in the README file and the measinfo.mat file.
RECORDING FORMATS
As with the DCASE2019-2021 datasets, currently the database is provided in two formats, first-order Ambisonics, and a tetrahedral microphone array - both derived from the Eigenmike 32-channel recordings. For more details on the format specifications, check the README.
We intend to add additional formats of the database, of both higher resolution (e.g. higher-order Ambisonics), or lower resolution (e.g. binaural).
REFERENCE DOAs
For each extracted RIR across a measurement trajectory there is an associated direction-of-arrival (DOA), which can be used as the reference direction for a sound source spatialized using this RIR, for training or evaluation purposes. The DOAs were determined acoustically from the extracted RIRs, by windowing the direct sound part and applying a broadband version of the MUSIC localization algorithm on the windowed multichannel signal.
The DOAs are provided as Cartesian components [x, y, z] of unit length vectors.
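For reference, the conversion from these unit vectors to azimuth and elevation angles is straightforward; the small helper below is not part of the database and assumes the usual convention of azimuth measured in the x-y plane and elevation measured from the horizontal plane, consistent with the angle ranges quoted above.

```python
# Hedged helper: convert unit-vector DOAs [x, y, z] to azimuth/elevation (degrees).
import numpy as np

def doa_to_angles(doa_xyz):
    x, y, z = np.asarray(doa_xyz, dtype=float).T
    azimuth = np.degrees(np.arctan2(y, x))                    # in [-180, 180)
    elevation = np.degrees(np.arcsin(np.clip(z, -1.0, 1.0)))  # from the horizontal plane
    return azimuth, elevation

az, el = doa_to_angles([[1.0, 0.0, 0.0], [0.0, 0.7071, 0.7071]])
print(az, el)   # approx. azimuths [0, 90], elevations [0, 45]
```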
SCENE GENERATOR
A set of routines is shared, here termed scene generator, that can spatialize a bank of sound samples using the SRIRs and noise recordings of this library, to emulate scenes for the two target formats. The code is similar to the one used to generate the TAU-NIGENS Spatial Sound Events 2021 dataset, and has been ported to Python from the original version written in Matlab.
The generator can be found here, along with more details on its use.
The generator at the moment is set to work with the NIGENS sound event sample database, and the FSD50K sound event database, but additional sample banks can be added with small modifications.
The dataset together with the generator has been used by the authors in the following public challenges:
DCASE 2019 Challenge Task 3, to generate the TAU Spatial Sound Events 2019 dataset (development/evaluation)
DCASE 2020 Challenge Task 3, to generate the TAU-NIGENS Spatial Sound Events 2020 dataset
DCASE2021 Challenge Task 3, to generate the TAU-NIGENS Spatial Sound Events 2021 dataset
DCASE2022 Challenge Task 3, to generate additional SELD synthetic mixtures for training the task baseline
NOTE: The current version of the generator is work-in-progress, with some code being quite "rough". If something does not work as intended or it is not clear what certain parts do, please contact us.
DATASET STRUCTURE
The dataset contains a folder of the SRIRs (TAU-SRIR_DB), with all the SRIRs per room in a single MAT file. The file rirdata.mat contains some general information such as the sample rate, format specifications, and most importantly the DOAs of every extracted SRIR. The file measinfo.mat contains measurement and recording information for each room. Finally, the dataset contains a folder of spatial ambient noise recordings (TAU-SNoise_DB), with one subfolder per room holding two audio recordings of the spatial ambience, one for each format (FOA or MIC). For more information on how the SRIRs and DOAs are organized, check the README.
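A quick way to inspect the MAT files from Python is sketched below; this is an assumption rather than part of the dataset documentation, it requires the files to be stored in a pre-v7.3 MAT format readable by SciPy, and the internal field names should be taken from the README.

```python
# Hedged sketch: inspect rirdata.mat and measinfo.mat with SciPy.
from scipy.io import loadmat

rirdata = loadmat("TAU-SRIR_DB/rirdata.mat", squeeze_me=True, struct_as_record=False)
measinfo = loadmat("TAU-SRIR_DB/measinfo.mat", squeeze_me=True, struct_as_record=False)

print(rirdata.keys())    # general info: sample rate, formats, DOAs per extracted SRIR
print(measinfo.keys())   # per-room measurement and recording information
```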
DOWNLOAD
The files TAU-SRIR_DB.z01, ..., TAU-SRIR_DB.zip contain the SRIRs and measurement info files.
The files TAU-SNoise_DB.z01, ..., TAU-SNoise_DB.zip contain the spatial ambient noise recordings.
This dataset provides RF data from software defined radio (SDR) measurement results of a few cases: A. one 4G LTE link, B. two LTE links, C. one LTE link and one Wi-Fi link. The LTE links were emulated by USRP B210 units and open-source software (srsRAN), and the Wi-Fi link was emulated by a pair of Wi-Fi commercial development boards. This dataset includes metadata and performance results (in a spreadsheet format) and I/Q baseband sample data (in a binary floating point format). Though specific trade names are mentioned, they should not be construed as an endorsement of that product. Other products may work as well or better. The spreadsheet files provide the mapping among some system parameters (such as the SDR received power and SINR) and key performance indicators (KPIs), such as throughput and packet drop rate. The I/Q data files provide the digital samples of the received signals at the receivers (LTE or Wi-Fi). This dataset can be used to support research topics such as multi-cell LTE system performance evaluation and optimization, spectrum sensing and signal classification, and AI and machine learning, besides others.
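As an orientation for working with the binary captures, the sketch below loads one file with NumPy; the file name is illustrative, and the sample layout (interleaved I/Q pairs stored as 32-bit floats) is an assumption that should be checked against the spreadsheet metadata.

```python
# Hedged sketch: load a binary I/Q capture file (layout assumed, see the metadata).
import numpy as np

raw = np.fromfile("lte_capture.bin", dtype=np.float32)   # illustrative file name
iq = raw[0::2] + 1j * raw[1::2]                           # interleaved I, Q pairs

print(iq.shape, iq.dtype)
```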
By US Open Data Portal, data.gov [source]
This dataset provides a list of all Home Health Agencies registered with Medicare. Contained within this dataset is information on each agency's address, phone number, type of ownership, quality measure ratings and other associated data points. With this valuable insight into the operations of each Home Health Care Agency, you can make informed decisions about your care needs. Learn more about the services offered at each agency and how they are rated according to their quality measure ratings. From dedicated nursing care services to speech pathology to medical social services - get all the information you need with this comprehensive look at U.S.-based Home Health Care Agencies!
Are you looking to learn more about Home Health Care Agencies registered with Medicare? This dataset can provide quality measure ratings, addresses, phone numbers, types of services offered and other information that may be helpful when researching Home Health Care Agencies.
This guide will explain how to use the data in this dataset to gain a better understanding of Home Health Care Agencies registered with Medicare.
First, you will need to become familiar with the columns in the dataset. A list of all columns and their associated descriptions is provided above for your reference. Once you understand each column's purpose, it will be easier for you to decide what metrics or variables are most important for your own research.
Next, use this data to compare various facets of different Home Health Care Agencies, such as type of ownership, services offered, and quality measure ratings like the star rating (from 0-5 stars) or the CMS certification number. Collecting information from multiple sources such as public reviews or customer feedback can help supplement these numerical metrics in order to paint a more accurate picture of each agency's performance and customer satisfaction level.
Finally, once you have collected enough data points on one particular agency, or for a comparison between multiple agencies, conduct further analysis using statistical methods like correlation matrices in order to determine any patterns in the data set that may reveal valuable insights into the research topic at hand.
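A minimal, hedged sketch of that correlation-matrix step is shown below; the file name csv-1.csv comes from the file listing further down, while the assumption that the numeric columns include ratings and episode counts must be checked against the actual headers.

```python
# Hedged sketch: pairwise correlations across the numeric columns of the file.
import pandas as pd

df = pd.read_csv("csv-1.csv")
numeric_cols = df.select_dtypes("number")   # e.g. star ratings, episode counts (assumed)
corr = numeric_cols.corr()                  # Pearson correlation matrix

print(corr.round(2))
```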
- Using the data to compare quality of care ratings between agencies, so people can make better informed decisions about which agency to hire for home health services.
- Analyzing the costs associated with different types of home health care services, such as nursing care and physical therapy, in order to determine where money could be saved in health care budgets.
- Evaluating the performance of certain agencies by analyzing the number of episodes billed to Medicare compared to their national averages, allowing agencies with lower numbers of billing episodes to be identified and monitored more closely if necessary
If you use this dataset in your research, please credit the original authors. Data Source
Unknown License - Please check the dataset description for more information.
File: csv-1.csv | Column name | Description | |:----------------------------------------...
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.3/customlicense?persistentId=doi:10.7910/DVN/PUCD2P
These materials were produced as part of: Champion, Kaylea and Benjamin Mako Hill. (2021) "Underproduction: An approach for measuring risk in open source software." 28th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). Preprint: https://arxiv.org/abs/2103.00352. DOI: 10.1109/SANER50967.2021.00043

In this archive, you'll find:
- inst_all_packages_full_results.tab: summary data on all packages as they appear in the paper. This is the place to look if you want to examine the underproduction factor associated with each package.
- inst_all_packages_full_results-DESCRIPTION.txt: a description of the fields in the inst_all_packages_full_results.tab file.
- R_Code.tar.gz: R code to reproduce figures and tables from fitted Bayesian hierarchical survival models:
  - dfPrep.R, used to create datasets_for_modeling.RData
  - models.R, a resource for model information
  - model_visualization.R, the core code for presenting fitted models and relationships
  - standalone_dsp.R, descriptive statistics
  - standalone_bayes.R, to produce tables for the paper
  - lib-00-utils.R, some utility functions
  - datasets_for_modeling.RData, the core dataset used for this analysis
- Stan.tar.gz: a directory of STAN model output; on our supercomputing node these took multiple days to run and converge.
- Figures.tar.gz: a directory of figures from the paper.
- Raw_Data_Parsers.tar.gz: a directory of both the raw data and the parsers used to obtain the raw data. The dir contains a HowTo file if you would like to reproduce the scraping/cloning part of the project; however, note that the original analysis included an rsync copy of the Debian bug database, so if you conduct an analysis from scratch, the data you obtain will have changed since our rsync.
- Appendix.tar.gz: figures and data associated with our appendix using an alternate measure of importance ("vote", which represents recent usage but omits packages where usage does not update atime; the paper used "inst"):
  - appendix_with_vote.R, the code
  - appendix_figures, a directory of figures similar to those in the paper but produced for the appendix
  - vote_all_packages_full_results.csv, summary data on all packages
  - vote_all_packages_full_results.csv.DESCRIPTION, a description of the fields in the vote_all_packages_full_results.csv file

For more information, please contact: Kaylea Champion (she/her) kaylea@uw.edu | khascall@gmail.com @kayleachampion

Abstract: The widespread adoption of Free/Libre and Open Source Software (FLOSS) means that the ongoing maintenance of many widely used software components relies on the collaborative effort of volunteers who set their own priorities and choose their own tasks. We argue that this has created a new form of risk that we call `underproduction' which occurs when the supply of software engineering labor becomes out of alignment with the demand of people who rely on the software produced. We present a conceptual framework for identifying relative underproduction in software as well as a statistical method for applying our framework to a comprehensive dataset from the Debian GNU/Linux distribution that includes 21,902 source packages and the full history of 461,656 bugs. We draw on this application to present two experiments: (1) a demonstration of how our technique can be used to identify at-risk software packages in a large FLOSS repository and (2) a validation of these results using an alternate indicator of package risk.
Our analysis demonstrates the utility of our approach and reveals the existence of widespread underproduction in a range of widely-installed software components in Debian.
https://spdx.org/licenses/CC0-1.0.html
We are publishing a walking activity dataset including inertial and positioning information from 19 volunteers, with reference distances measured using a trundle wheel. The dataset includes a total of 96.7 km walked by the volunteers, split into 203 separate tracks. Two types of trundle wheel were used: an analogue trundle wheel, which provides the total number of meters walked in a single track, and a sensorized trundle wheel, which measures every revolution of the wheel, thereby recording a continuously incrementing distance.
Each track has data from the accelerometer and gyroscope embedded in the phones, location information from the Global Navigation Satellite System (GNSS), and the step count obtained by the device. The dataset can be used to implement walking distance estimation algorithms and to explore data quality in the context of walking activity and physical capacity tests, fitness, and pedestrian navigation.
Methods
The proposed dataset is a collection of walks where participants used their own smartphones to capture inertial and positioning information. The participants involved in the data collection come from two sites. The first site is the Oxford University Hospitals NHS Foundation Trust, United Kingdom, where 10 participants (7 affected by cardiovascular diseases and 3 healthy individuals) performed unsupervised 6MWTs in an outdoor environment of their choice (ethical approval obtained by the UK National Health Service Health Research Authority, protocol reference number 17/WM/0355). All participants involved provided informed consent. The second site is Malmö University, in Sweden, where a group of 9 healthy researchers collected data. This dataset can be used by researchers to develop distance estimation algorithms and to study how data quality impacts the estimation.
All walks were performed by holding a smartphone in one hand, with an app collecting inertial data, the GNSS signal, and the step count. In the other, free hand, participants held a trundle wheel to obtain the ground truth distance. Two different trundle wheels were used: an analogue trundle wheel that allowed the registration of a single total value of walked distance, and a sensorized trundle wheel that collected timestamps and distance at every 1-meter revolution, resulting in continuous incremental distance information. The latter configuration is innovative and allows the use of temporal windows of the IMU data as input to machine learning algorithms to estimate walked distance. In the case of data collected by researchers, if the walks were done simultaneously and at a close distance from each other, only one person used the trundle wheel, and the reference distance was associated with all walks that were collected at the same time. The walked paths are of variable length, duration, and shape. Participants were instructed to walk paths of increasing curvature, from straight to rounded. Irregular paths are particularly useful in determining limitations in the accuracy of walked distance algorithms. Two smartphone applications were developed for collecting the information of interest from the participants' devices, both available for Android and iOS operating systems. The first is a web application that retrieves inertial data (acceleration, rotation rate, orientation) while connecting to the sensorized trundle wheel to record incremental reference distance [1]. The second app is the Timed Walk app [2], which guides the user in performing a walking test by signalling when to start and when to stop the walk while collecting both inertial and positioning data. All participants in the UK used the Timed Walk app.
The data collected during the walk is from the Inertial Measurement Unit (IMU) of the phone and, when available, the Global Navigation Satellite System (GNSS). In addition, the step count information is retrieved by the sensors embedded in each participant's smartphone. With the dataset, we provide a descriptive table with the characteristics of each recording, including the brand and model of the smartphone, duration, reference total distance, types of signals included, and additionally a score for some relevant parameters related to the quality of the various signals. The path curvature is one of the most relevant parameters. Previous literature from our team, in fact, confirmed the negative impact of curved-shaped paths with the use of multiple distance estimation algorithms [3]. We visually inspected the walked paths and clustered them in three groups: a) straight path, i.e. no turns wider than 90 degrees, b) gently curved path, i.e. between one and five turns wider than 90 degrees, and c) curved path, i.e. more than five turns wider than 90 degrees. Other features relevant to the quality of collected signals are the total amount of time above a threshold (0.05 s and 6 s) where, respectively, inertial and GNSS data were missing due to technical issues or due to the app going in the background and thus losing access to the sensors, the sampling frequency of different data streams, the average walking speed, and the smartphone position. The start of each walk is set as 0 ms, thus not reporting time-related information. Walk locations collected in the UK are anonymized using the following approach: the first position is fixed to a central location of the city of Oxford (latitude: 51.7520, longitude: -1.2577) and all other positions are reassigned by applying a translation along the longitudinal and latitudinal axes which maintains the original distance and angle between samples. This way, the exact geographical location is lost, but the path shape and distances between samples are maintained. The difference between consecutive points "as the crow flies" and the path curvature were numerically and visually inspected to obtain the same results as the original walks. Computations were made possible by using the Haversine Python library.
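A hedged sketch of that anonymisation idea is given below. It is not the authors' exact implementation: a single constant latitude/longitude offset is applied so that the first sample lands on the fixed Oxford location, which preserves inter-sample offsets and path shape and, for small shifts, approximately preserves distances; the haversine library is only used to check distances, as in the description above.

```python
# Hedged sketch of the anonymisation step: translate a track so its first point
# coincides with the fixed Oxford anchor, keeping relative offsets between samples.
from haversine import haversine

ANCHOR = (51.7520, -1.2577)   # central location of Oxford (lat, lon)

def anonymise(track):
    """track: list of (lat, lon) tuples; returns the translated track."""
    dlat = ANCHOR[0] - track[0][0]
    dlon = ANCHOR[1] - track[0][1]
    return [(lat + dlat, lon + dlon) for lat, lon in track]

original = [(51.7600, -1.2500), (51.7610, -1.2495), (51.7620, -1.2505)]  # illustrative
shifted = anonymise(original)

# Consecutive "as the crow flies" distances (km) should be essentially unchanged.
print(haversine(original[0], original[1]), haversine(shifted[0], shifted[1]))
```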
Multiple datasets are available regarding walking activity recognition among other daily living tasks. However, few studies are published with datasets that focus on the distance for both indoor and outdoor environments and that provide relevant ground truth information for it. Yan et al. [4] introduced an inertial walking dataset within indoor scenarios using a smartphone placed in 4 positions (on the leg, in a bag, in the hand, and on the body) by six healthy participants. The reference measurement used in this study is a Visual Odometry System embedded in a smartphone that has to be worn at chest level, using a strap to hold it. While interesting and detailed, this dataset lacks GNSS data, which is likely to be used in outdoor scenarios, and the reference used for localization also suffers from accuracy issues, especially outdoors. Vezovcnik et al. [5] analysed estimation models for step length and provided an open-source dataset for a total of 22 km of inertial-only walking data from 15 healthy adults. While relevant, their dataset focuses on steps rather than total distance and was acquired on a treadmill, which limits its validity in real-world scenarios. Kang et al. [6] proposed a way to estimate travelled distance by using an Android app that matches outdoor walking patterns to indoor contexts for each participant. They collect data outdoors, including both inertial and positioning information, and they use average speed values obtained from the GPS data as reference labels. Afterwards, they use deep learning models to estimate walked distance, obtaining high performance. They report that 3% to 11% of the data for each participant was discarded due to low quality. Unfortunately, the name of the app used is not reported and the paper does not mention whether the dataset can be made available.
This dataset is heterogeneous in multiple respects. It includes a majority of healthy participants; therefore, it is not possible to generalize the outcomes from this dataset to all walking styles or physical conditions. The dataset is heterogeneous also from a technical perspective, given the differences in devices, acquired data, and smartphone apps used (e.g. some tests lack IMU or GNSS data, and the sampling frequency on iPhones was particularly low). We suggest selecting the appropriate track based on the desired characteristics to obtain reliable and consistent outcomes.
This dataset allows researchers to develop algorithms to compute walked distance and to explore data quality and reliability in the context of the walking activity. This dataset was initiated to investigate the digitalization of the 6MWT, however, the collected information can also be useful for other physical capacity tests that involve walking (distance- or duration-based), or for other purposes such as fitness, and pedestrian navigation.
The article related to this dataset will be published in the proceedings of the IEEE MetroXRAINE 2024 conference, held in St. Albans, UK, 21-23 October.
This research is partially funded by the Swedish Knowledge Foundation and the Internet of Things and People research center through the Synergy project Intelligent and Trustworthy IoT Systems.
The Third EGRET Catalog of High-Energy Gamma-Ray Sources is based on data obtained by the Energetic Gamma-Ray Experiment Telescope (EGRET) on board the Compton Gamma-Ray Observatory (CGRO) during the period from 1991 April 22 to 1995 October 3, corresponding to GRO Cycles 1, 2, 3, and 4. EGRET is sensitive to photons in the energy range from about 30 MeV to over 20 GeV, the highest energies accessible by the CGRO instruments, and, like COMPTEL, is an imaging instrument. In addition to including more data than the Second EGRET Catalog (2EG, Thompson et al. 1995, ApJS, 101, 259) and its supplement (2EGS, Thompson et al. 1996, ApJS, 107, 227), this catalog uses completely reprocessed data so as to correct a number of mostly minimal errors and problems. The 271 sources (E > 100 MeV) in the catalog include the single 1991 solar flare that was bright enough to be detected as a source, the LMC, 5 pulsars, one probable radio galaxy detection (Cen A), and 66 high-confidence identifications of blazars (BL Lac objects, flat-spectrum radio quasars, or unidentified flat-spectrum radio sources). In addition, 27 lower-confidence potential blazar identifications are noted. Finally, the catalog contains 170 sources that are not yet firmly identified with known objects, although potential identifications have been suggested for a number of these. As already noted, there are 271 distinct sources in this catalog: since there are multiple measurements for these sources corresponding to the various viewing periods, there are 5246 entries in the HEASARC's version of the 3rd EGRET Catalog corresponding to the same number of lines in Table 4 of the published version. Thus, there are an average of about 20 entries for every distinct source. Notice that 14 sources reported in the 2nd EGRET Catalog or its supplement do not appear in this 3rd EGRET Catalog: 2EG J0403+3357, 2EG J0426+6618, 2EGS J0500+5902, 2EGS J0552-1026, 2EG J1136-0414, 2EGS J1236-0416, 2EG J1239+0441, 2EG J1314+5151, 2EG J1430+5356, 2EG J1443-6040, 2EG J1631-2845, 2EG J1709-0350, 2EG J1815+2950, and 2EG J2027+1054, due to the fact that the re-analysis of the EGRET data has dropped their statistical significance from just above the catalog threshold to just below it; additional information on these sources is provided in Table 5 of the published version of the 3rd EGRET Catalog. This database table was created by the HEASARC in June 1999, based on a machine-readable version of Table 4 of the 3rd EGRET Source Catalog that was provided by the CGRO Science Support Center (CGROSSC). Slight modifications to the Browse Object Classifications were later made in April 2001. This is a service provided by NASA HEASARC.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# New in Version 3.0
The dataset has been reorganized for improved accessibility and clarity:
This dataset contains multi-sensor measurements from an inverter-driven PMSM system under various fault conditions. It includes:
Keywords:
RF100-VL is a multi-domain benchmark for object detection. The benchmark is designed to measure the extent to which model architectures can generalise to different domains, from medical imagery to defect detection to document feature identification. RF100-VL was introduced by researchers from Roboflow and Carnegie Mellon University in the paper "Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models".
Roboflow 100 Vision Language (RF100-VL) is the first benchmark to ask, "How well does your VLM understand the real world?" In pursuit of this question, RF100-VL introduces 100 open source datasets containing object detection bounding boxes and multimodal few-shot instructions with visual examples and rich textual descriptions across novel image domains. The dataset is comprised of 164,149 images and 1,355,491 annotations across seven domains, including aerial, biological, and industrial imagery. 1693 labeling hours were spent labeling, reviewing, and preparing the dataset.
RF100-VL is a curated sample from Roboflow Universe, a repository of over 500,000 datasets that collectively demonstrate how computer vision is being leveraged in production problems today. Current state-of-the-art models trained on web-scale data, like QwenVL2.5 and GroundingDINO, achieve as low as 2% AP in some categories represented in RF100-VL.
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/insitu-gridded-observations-global-and-regional/insitu-gridded-observations-global-and-regional_15437b363f02bf5e6f41fc2995e3d19a590eb4daff5a7ce67d1ef6c269d81d68.pdf
This dataset provides high-resolution gridded temperature and precipitation observations from a selection of sources. Additionally the dataset contains daily global average near-surface temperature anomalies. All fields are defined on either daily or monthly frequency. The datasets are regularly updated to incorporate recent observations. The included data sources are commonly known as GISTEMP, Berkeley Earth, CPC and CPC-CONUS, CHIRPS, IMERG, CMORPH, GPCC and CRU, where the abbreviations are explained below. These data have been constructed from high-quality analyses of meteorological station series and rain gauges around the world, and as such provide a reliable source for the analysis of weather extremes and climate trends. The regular update cycle makes these data suitable for a rapid study of recently occurred phenomena or events. The NASA Goddard Institute for Space Studies temperature analysis dataset (GISTEMP-v4) combines station data of the Global Historical Climatology Network (GHCN) with the Extended Reconstructed Sea Surface Temperature (ERSST) to construct a global temperature change estimate. The Berkeley Earth Foundation dataset (BERKEARTH) merges temperature records from 16 archives into a single coherent dataset. The NOAA Climate Prediction Center datasets (CPC and CPC-CONUS) define a suite of unified precipitation products with consistent quantity and improved quality by combining all information sources available at CPC and by taking advantage of the optimal interpolation (OI) objective analysis technique. The Climate Hazards Group InfraRed Precipitation with Station dataset (CHIRPS-v2) incorporates 0.05° resolution satellite imagery and in-situ station data to create gridded rainfall time series over the African continent, suitable for trend analysis and seasonal drought monitoring. The Integrated Multi-satellitE Retrievals dataset (IMERG) by NASA uses an algorithm to intercalibrate, merge, and interpolate "all" satellite microwave precipitation estimates, together with microwave-calibrated infrared (IR) satellite estimates, precipitation gauge analyses, and potentially other precipitation estimators over the entire globe at fine time and space scales for the Tropical Rainfall Measuring Mission (TRMM) and its successor, Global Precipitation Measurement (GPM) satellite-based precipitation products. The Climate Prediction Center morphing technique dataset (CMORPH) by NOAA has been created using precipitation estimates that have been derived from low orbiter satellite microwave observations exclusively. Then, geostationary IR data are used as a means to transport the microwave-derived precipitation features during periods when microwave data are not available at a location. The Global Precipitation Climatology Centre dataset (GPCC) is a centennial product of monthly global land-surface precipitation based on the ~80,000 stations world-wide that feature record durations of 10 years or longer. The data coverage per month varies from ~6,000 (before 1900) to more than 50,000 stations. The Climatic Research Unit dataset (CRU v4) features an improved interpolation process, which delivers full traceability back to station measurements. The station measurements of temperature and precipitation are public, as well as the gridded dataset and national averages for each country. Cross-validation was performed at a station level, and the results have been published as a guide to the accuracy of the interpolation.
This catalogue entry complements the E-OBS record in many aspects, as it intends to provide high-resolution gridded meteorological observations at a global rather than continental scale. These data may be suitable as a baseline for model comparisons or extreme event analysis in the CMIP5 and CMIP6 datasets.
-> If you use Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset, please cite: https://academic.oup.com/comjnl/advance-article-abstract/doi/10.1093/comjnl/bxab179/6425234
@article{10.1093/comjnl/bxab179,
  author   = {Gözükara, Furkan and Özel, Selma Ayşe},
  title    = "{An Incremental Hierarchical Clustering Based System For Record Linkage In E-Commerce Domain}",
  journal  = {The Computer Journal},
  year     = {2021},
  month    = {11},
  abstract = "{In this study, a novel record linkage system for E-commerce products is presented. Our system aims to cluster the same products that are crawled from different E-commerce websites into the same cluster. The proposed system achieves a very high success rate by combining both semi-supervised and unsupervised approaches. Unlike the previously proposed systems in the literature, neither a training set nor structured corpora are necessary. The core of the system is based on Hierarchical Agglomerative Clustering (HAC); however, the HAC algorithm is modified to be dynamic such that it can efficiently cluster a stream of incoming new data. Since the proposed system does not depend on any prior data, it can cluster new products. The system uses bag-of-words representation of the product titles, employs a single distance metric, exploits multiple domain-based attributes and does not depend on the characteristics of the natural language used in the product records. To our knowledge, there is no commonly used tool or technique to measure the quality of a clustering task. Therefore in this study, we use ELKI (Environment for Developing KDD-Applications Supported by Index-Structures), an open-source data mining software, for performance measurement of the clustering methods; and show how to use ELKI for this purpose. To evaluate our system, we collect our own dataset and make it publicly available to researchers who study E-commerce product clustering. Our proposed system achieves 96.25\% F-Measure according to our experimental analysis. The other state-of-the-art clustering systems obtain the best 89.12\% F-Measure.}",
  issn     = {0010-4620},
  doi      = {10.1093/comjnl/bxab179},
  url      = {https://doi.org/10.1093/comjnl/bxab179},
  note     = {bxab179},
  eprint   = {https://academic.oup.com/comjnl/advance-article-pdf/doi/10.1093/comjnl/bxab179/41133297/bxab179.pdf},
}
-> elki-bundle-0.7.2-SNAPSHOT.jar is the ELKI bundle that we have compiled from the GitHub source code of ELKI. The date of the source code is 6 June 2016. The compile command is as below: ->-> mvn -DskipTests -Dmaven.javadoc.skip=true -P svg,bundle package ->-> GitHub repository of ELKI: https://github.com/elki-project/elki ->-> This bundle file is used for all of the experiments that are presented in the article
-> The Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset is composed as below: ->-> The top 50 E-commerce websites that operate in Turkey are crawled, and their attributes are extracted. ->-> The crawling was done between 2015-01-13 15:12:46 and 2015-01-17 19:07:53. ->-> Then 250 product offers from Vatanbilgisayar are randomly selected. ->-> Then the entire dataset is manually scanned to find which other products sold on different E-commerce websites are the same as the selected ones. ->-> Then each product is classified accordingly. ->-> This dataset contains these products along with their price (if available), title, categories (if available), free text description (if available), wrapped features (if available), and crawled URL (the URL might have expired) attributes
-> The dataset files are provided as used in the study. -> ARFF files are generated with the raw frequency of terms rather than the weighting schemes used, for All_Products and Only_Price_Having_Products. The reason is that we have tested these datasets only with our system, and since our system does incremental clustering, even if we provided TF-IDF weightings they would not be the same as those used in the article. More information is provided in the article. ->-> For Macro_Average_Datasets we provide both raw frequency and TF-IDF scheme weightings as used in the experiments
-> There are 3 main folders -> All_Products: This folder contains 1800 products. ->-> This is the entire collection that is manually labeled. ->-> They are from 250 different classes. -> Only_Price_Having_Products: This folder contains all of the products that have the price feature set. ->-> The collection has 1721 products from 250 classes. ->-> This is the dataset that we experimented on. -> Macro_Average_Datasets: This folder contains 100 datasets that we have used to conduct more reliable experiments. ->-> Each dataset is composed by selecting 1000 different products from the price-having products dataset and then randomly ordering them...
The COKI Open Access Dataset measures open access performance for 142 countries and 5117 institutions and is available in JSON Lines format. The data is visualised at the COKI Open Access Dashboard: https://open.coki.ac/. The COKI Open Access Dataset is created with the COKI Academic Observatory data collection pipeline, which fetches data about research publications from multiple sources, synthesises the datasets and creates the open access calculations for each country and institution. Each week a number of specialised research publication datasets are collected. The datasets that are used for the COKI Open Access Dataset release include Crossref Metadata, Microsoft Academic Graph, Unpaywall and the Research Organization Registry. After fetching the datasets, they are synthesised to produce aggregate time series statistics for each country and institution in the dataset. The aggregate time series statistics include publication count, open access status and citation count. See https://open.coki.ac/data/ for the dataset schema. A new version of the dataset is deposited every week.

Code: The COKI Academic Observatory data collection pipeline is used to create the dataset. The COKI OA Website Github project contains the code for the web app that visualises the dataset at open.coki.ac. It can be found on Zenodo here.

License: COKI Open Access Dataset © 2022 by Curtin University is licensed under CC BY 4.0.

Attributions: This work contains information from: Microsoft Academic Graph, which is made available under the ODC Attribution Licence; Crossref Metadata via the Metadata Plus program (bibliographic metadata is made available without copyright restriction and Crossref generated data under a CC0 licence; see metadata licence information for more details); Unpaywall (the Unpaywall Data Feed is used under license; data is freely available from Unpaywall via the API, data dumps and as a data feed); and the Research Organization Registry, which is made available under a CC0 licence.

The Curtin Open Knowledge Initiative (COKI) is a strategic initiative of the Research Office at Curtin, the Faculty of Humanities, School of Media, Creative Arts and Social Inquiry and the Curtin Institute for Computation, with additional support from the Andrew W. Mellon Foundation and the Arcadia Fund, a charitable fund of Lisbet Rausing and Peter Baldwin.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Major depressive disorder (MDD) is a leading cause of disability worldwide and is commonly treated with antidepressant drugs (AD). Although effective, many patients fail to respond to AD treatment, so identifying factors that can predict AD response would greatly improve treatment outcomes. In this study, we developed a machine learning tool to integrate multi-omic datasets (gene expression, DNA methylation, and genotyping) to identify biomarker profiles associated with AD response in a cohort of individuals with MDD.
Materials and methods
Individuals with MDD (N = 111) were treated for 8 weeks with antidepressants and were separated into responders and non-responders based on the Montgomery-Åsberg Depression Rating Scale (MADRS). Using peripheral blood samples, we performed RNA-sequencing, assessed DNA methylation using the Illumina EPIC array, and performed genotyping using the Illumina PsychArray. To address this rich multi-omic dataset with high-dimensional features, we developed integrative Geneset-Embedded non-negative Matrix factorization (iGEM), a non-negative matrix factorization (NMF) based model supplemented with auxiliary information regarding gene sets and gene-methylation relationships. In particular, we factorize the subjects-by-features matrices (i.e., gene expression or DNA methylation) into subjects-by-factors and factors-by-features matrices. We define the factors as meta-phenotypes, as they represent integrated composite scores of the molecular measurements for each subject.
Results
Using our model, we identified a number of meta-phenotypes that were related to AD response. By integrating gene set information into the model, we were able to relate these meta-phenotypes to biological processes, including a meta-phenotype related to immune and inflammatory functions as well as other genes related to depression or AD response. The meta-phenotype identified several genes, including the immune-related interleukin 1 receptor like 1 (IL1RL1) and interleukin 5 receptor subunit alpha (IL5RA), the AKT/PIK3 pathway related phosphoinositide-3-kinase regulatory subunit 6 (PIK3R6), and sphingomyelin phosphodiesterase 3 (SMPD3), which has been identified as a target of AD treatment.
Conclusions
The derived meta-phenotypes and associated biological functions represent both biomarkers to predict response and potential new treatment targets. Our method is applicable to other diseases with multi-omic data, and the software is open source and available on GitHub (https://github.com/li-lab-mcgill/iGEM).
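As a minimal sketch of the factorization structure described above (plain NMF only; the geneset-embedding and gene-methylation auxiliary terms of iGEM are not reproduced here, and the matrix sizes are illustrative):

```python
# Minimal sketch: factorize a subjects-by-features matrix into
# subjects-by-factors (W) and factors-by-features (H) with plain NMF.
# This is NOT the iGEM model; it omits the geneset-embedding and
# gene-methylation auxiliary terms. Sizes below are illustrative.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((111, 2000))          # 111 subjects x 2000 molecular features

model = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)           # 111 x 10: per-subject meta-phenotype scores
H = model.components_                # 10 x 2000: factor loadings on features

print(W.shape, H.shape)
```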
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The correction of Digital Elevation Models (DEMs) has always been a crucial aspect of remote sensing geoscience research. The burgeoning development of new machine learning methods in recent years has provided novel solutions for the correction of DEM elevation errors. Given the reliance of machine learning and other artificial intelligence methods on extensive training data, and considering the current lack of publicly available, unified, large-scale, and standardized multi-source DEM elevation error prediction datasets for large areas, the multi-source DEM Elevation Error Prediction Dataset (DEEP-Dataset) is introduced in this paper. This dataset comprises four sub-datasets, based on the TerraSAR-X add-on for Digital Elevation Measurements (TanDEM-X) DEM and Advanced Land Observing Satellite World 3D-30 m (AW3D30) DEM in the Guangdong Province study area of China, and the Shuttle Radar Topography Mission (SRTM) DEM and Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) DEM in the Northern Territory study area of Australia. The Guangdong Province sample comprises approximately 40,000 instances, while the Northern Territory sample includes about 1,600,000 instances. Each sample in the dataset consists of ten features, encompassing geographic spatial information, land cover types, and topographic attributes. The effectiveness of the DEEP-Dataset in actual model training and DEM correction has been validated through a series of comparative experiments, including machine learning model testing, DEM correction, and feature importance assessment. These experiments demonstrate the dataset's rationality, effectiveness, and comprehensiveness.
Citation: YU Cuilin, WANG Qingsong, ZHONG Zixuan, ZHANG Junhao, LAI Tao, HUANG Haifeng. Elevation Error Prediction Dataset Using Global Open-source Digital Elevation Model[J]. Journal of Electronics & Information Technology, 2024, 46(9): 3445-3455. doi: 10.11999/JEIT240062
Full text (in Chinese): https://jeit.ac.cn/cn/article/doi/10.11999/JEIT240062
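As a minimal, hypothetical sketch of how a sub-dataset of this kind might be used for elevation error regression and feature importance assessment (the file name, column names, and model choice below are assumptions for illustration, not the pipeline from the paper):

```python
# Minimal sketch: regress DEM elevation error on the per-sample features
# and inspect feature importances. File name, column names and the model
# are assumptions for illustration only, not the paper's actual pipeline.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("deep_dataset_guangdong.csv")                     # hypothetical file
feature_cols = [c for c in df.columns if c != "elevation_error"]   # hypothetical target name

X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["elevation_error"], test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("R^2 on held-out data:", model.score(X_test, y_test))
for name, importance in zip(feature_cols, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```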