Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All three datasets are in CSV file format and contain renewable energy and load demand data from August 1, 2024 to August 30, 2024. The data is sourced from California ISO and is accessed through the open-source gridstatus repository (https://github.com/gridstatus/gridstatus). Renewable energy sources include wind and solar energy, and the load demand is based on data from the Trinity Public Utility District (TIDC) and the Trook Irrigation District (TPWR). The key data in the dataset are: the "MW" column represents power measured in megawatts, "OPR_DT" represents the operation date in YYYY-MM-DD format, and "OPR_HR" represents the 24 hours within a day (1-24). The datasets provide high-resolution temporal information on renewable energy generation and variations in load demand, which is valuable for analyzing grid stability and demand forecasting in electrical grid systems.
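For example, a file can be loaded with pandas and the OPR_DT and OPR_HR columns combined into a single timestamp (a minimal sketch; the file name is illustrative, not part of the dataset):

import pandas as pd

# Load one of the three CSV files (the file name here is illustrative).
df = pd.read_csv("caiso_solar.csv")

# OPR_HR runs from 1 to 24, so shift it by one hour to build a
# conventional timestamp from the OPR_DT operating date.
df["timestamp"] = pd.to_datetime(df["OPR_DT"]) + pd.to_timedelta(df["OPR_HR"] - 1, unit="h")
df = df.set_index("timestamp").sort_index()

print(df["MW"].describe())  # summary statistics for the power column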
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets were used to validate and test the data pipeline deployment following the RADON approach. The dataset consists of a single CSV file containing around 32,000 Twitter tweets. From this file, 100 smaller CSV files were created, each containing 320 tweets. These 100 CSV files are used to validate and test (performance/load testing) the data pipeline components.
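A split of this kind can be reproduced with pandas (a minimal sketch; the input and output file names are assumptions, not part of the dataset):

import pandas as pd

# Split one CSV of ~32,000 tweets into 100 files of 320 rows each.
tweets = pd.read_csv("tweets.csv")  # assumed input file name
rows_per_file = 320
for i in range(100):
    chunk = tweets.iloc[i * rows_per_file:(i + 1) * rows_per_file]
    chunk.to_csv(f"tweets_part_{i:03d}.csv", index=False)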
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels, and one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments are also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python:
import pandas as pd

# Load a downsampled data file; the first column is used as the index.
data = pd.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility with RESSPyLab.
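For instance, the stress and strain histories can be pulled out of a loaded file by column name (a minimal sketch; data_file is a placeholder for the path to one of the downsampled CSV files):

import pandas as pd

data = pd.read_csv(data_file, index_col=0)

# True strain and true stress columns, named for RESSPyLab compatibility.
strain = data["e_true"]
stress = data["Sigma_true"]
print(stress.abs().max())  # peak true stress recorded in the test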
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
import pandas as pd

tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
                   index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
                   keep_default_na=False, na_values='')
Caveats
Company Datasets for valuable business insights!
Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.
These datasets are sourced from top industry providers, ensuring you have access to high-quality information:
We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:
You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.
Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.
With Oxylabs Datasets, you can count on:
Pricing Options:
Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
This dataset was created by Rawan Mostafa Rakha
Conversations Dataset
This dataset contains conversational data formatted as CSV for easy loading and processing.
Dataset Structure
The dataset is a CSV file with the following columns:
conversation_id: Unique identifier for each conversation
message_id: The position of the message in the conversation
role: The sender of the message (human, gpt, system)
content: The content of the message
Usage
from datasets import load_dataset

dataset = load_dataset("mugivara1/conversations-dataset")

See the full description on the dataset page: https://huggingface.co/datasets/mugivara1/conversations-dataset.
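Once loaded, individual conversations can be reconstructed by grouping on conversation_id (a minimal sketch, assuming a default "train" split and the column names listed above):

from datasets import load_dataset

dataset = load_dataset("mugivara1/conversations-dataset")
df = dataset["train"].to_pandas()  # assumes the default "train" split

# Reassemble each conversation in message order and print the first one.
for conv_id, conv in df.sort_values("message_id").groupby("conversation_id"):
    for _, msg in conv.iterrows():
        print(f"{msg['role']}: {msg['content']}")
    break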
https://data.gov.tw/license
Provide "Statistics of Import and Export Trade Volume of Each Park" to let the public understand the import and export and its growth trend of each park. In addition to updating this information every month, CSV file format is also provided for free download and use by the public.The dataset includes statistics on the import and export trade volume of parks such as Nanzih, Kaohsiung, Taichung, Zhonggang, Pingtung, and other parks (Lingguang, Chenggong, Gaoruan), with main fields including "Park, Import and Export (This Month, Year-to-Date)", "Export (This Month, Year-to-Date)", "Import (This Month, Year-to-Date)", and other important information.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This is a large-scale dataset with impedance and signal loss data recorded on volunteer test subjects using low-voltage, sine-shaped alternating-current signals. The signal frequencies range from 50 kHz to 20 MHz.
Applications: The intention of this dataset is to enable investigation of the human body as a signal propagation medium, and to capture how the properties of the human body (age, sex, composition, etc.), the measurement locations, and the signal frequencies impact the signal loss over the human body.
Overview statistics:
Measurement group statistics:
Included files:
CSV file columns:
CSV file columns, frequency-specific:
Acknowledgments: The dataset collection was funded by the Latvian Council of Science, project “Body-Coupled Communication for Body Area Networks”, project No. lzp-2020/1-0358.
References: For more detailed information, see this article: J. Ormanis, V. Medvedevs, A. Sevcenko, V. Aristovs, V. Abolins, and A. Elsts. Dataset on the Human Body as a Signal Propagation Medium for Body Coupled Communication. Submitted to Elsevier Data in Brief, 2023.
Contact information: info@edi.lv
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by animou123
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A renewable energy resource-based sustainable microgrid model for a residential area was designed in the HOMER Pro microgrid software. A small residential area of 20 buildings housing about 60 families, with an annual energy consumption of 219 MWh, together with an electric vehicle charging station serving 10 batteries daily with an annual energy consumption of 18.3 MWh, located in the Padma residential area, Rajshahi (24°22.6'N, 88°37.2'E), was selected as our case study. Solar panels, a natural gas generator, an inverter, and Li-ion batteries are required for our proposed model. The HOMER Pro software was used to optimize the designed microgrid model. Data were collected from HOMER Pro for the year 2007. We compared our daily load demand of 650 kW with the results of varying the load by 2.5%, 5%, and 10% more and less to find the best case for our demand. We have a total of 7 different datasets for different load conditions, where each dataset contains a total of 8760 sets of data with 6 different parameters per set.
Data file contents:
Data 1:: original_load.csv: This file contains data for the 650 kW load demand. The dataset contains a total of 8760 sets of data with 6 different parameters per set. Data arrangement is given below:
Column 1: Date and time of data recording in the format MM-DD-YYYY [hh]:[mm]. Time is in 24-hour format.
Column 2: Solar power output in kW.
Column 3: Generator power output in kW.
Column 4: Total electrical load served in kW.
Column 5: Excess electrical production in kW.
Column 6: Li-ion battery energy content in kWh.
Column 7: Li-ion battery state of charge in %.
Data 2:: 2.5%_more_load.csv: This file contains data for a 677 kW load demand. The dataset contains a total of 8760 sets of data with 6 different parameters per set. Column information is the same for every dataset.
Data 3:: 2.5%_less_load.csv: This file contains data for a 622 kW load demand. The dataset contains a total of 8760 sets of data with 6 different parameters per set. Column information is the same for every dataset.
Data 4:: 5%_more_load.csv: This file contains data for a 705 kW load demand. The dataset contains a total of 8760 sets of data with 6 different parameters per set. Column information is the same for every dataset.
Data 5:: 5%_less_load.csv: This file contains data for a 595 kW load demand. The dataset contains a total of 8760 sets of data with 6 different parameters per set. Column information is the same for every dataset.
Data 6:: 10%_more_load.csv: This file contains data for a 760 kW load demand. The dataset contains a total of 8760 sets of data with 6 different parameters per set. Column information is the same for every dataset.
Data 7:: 10%_less_load.csv: This file contains data for a 540 kW load demand. The dataset contains a total of 8760 sets of data with 6 different parameters per set. Column information is the same for every dataset.
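Any of the seven files can be loaded with pandas by parsing the timestamp in the first column (a minimal sketch; the short column names passed to names= are stand-ins for the seven columns described above, and a single header row is assumed):

import pandas as pd

# Load the 650 kW base case; column 1 holds MM-DD-YYYY hh:mm timestamps.
cols = ["datetime", "solar_kw", "generator_kw", "load_served_kw",
        "excess_kw", "battery_kwh", "battery_soc_pct"]
df = pd.read_csv("original_load.csv", header=0, names=cols)
df["datetime"] = pd.to_datetime(df["datetime"], format="%m-%d-%Y %H:%M")
df = df.set_index("datetime")

print(len(df))  # expected: 8760 hourly records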
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
INTRODUCTION
The goal of these LPWAN datasets is to provide the global research community with a benchmark tool to evaluate fingerprint localization algorithms in large outdoor environments with various properties. An identical collection methodology was used for all datasets: during a period of three months, numerous devices containing a GPS receiver periodically obtained new location data, which was sent to a local data server via a Sigfox or LoRaWAN message. Together with network information such as the receiving time of the message, the base station IDs of all receiving base stations, and the Received Signal Strength Indicator (RSSI) per base station, this location data was stored in one of the three LPWAN datasets:
lorawan_dataset_antwerp.csv
130 430 LoRaWAN messages, obtained in the city center of Antwerp
sigfox_dataset_antwerp.csv
14 378 Sigfox messages, obtained in the city center of Antwerp
sigfox_dataset_rural.csv
25 638 Sigfox messages, obtained in a rural area between Antwerp and Ghent
As the rural and urban Sigfox datasets were recorded in adjacent areas, many base stations that are located at the border of these areas can be found in both datasets. However, they do not necessarily share the same identifier: e.g. ‘BS 1’ in the urban Sigfox dataset could be the same base station as ‘BS 36’ in the rural Sigfox dataset. If the user intends to combine both Sigfox datasets, the mapping of the IDs of these base stations can be found in the file:
sigfox_bs_mapping.csv
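A combination could look like the following (a sketch only: the column names urban_id and rural_id for the mapping file, and the ‘BS n’ naming of the RSSI columns, are assumptions about the file layout):

import pandas as pd

urban = pd.read_csv("sigfox_dataset_antwerp.csv")
rural = pd.read_csv("sigfox_dataset_rural.csv")
mapping = pd.read_csv("sigfox_bs_mapping.csv")  # assumed columns: urban_id, rural_id

# Rename the rural base station columns to their urban counterparts so
# that shared base stations line up, then stack the two datasets.
rename = {f"BS {r}": f"BS {u}"
          for u, r in zip(mapping["urban_id"], mapping["rural_id"])}
combined = pd.concat([urban, rural.rename(columns=rename)], ignore_index=True)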
The collection methodology of the datasets, and the first results of a basic fingerprinting implementation are documented in the following journal paper: http://www.mdpi.com/2306-5729/3/2/13
UPDATES IN VERSION 1.2
In this version of the LPWAN dataset, only the LoRaWAN set has been updated. The Sigfox datasets remain identical to version 1.0 and 1.1. The main updates in the LoRaWAN set are the following:
New data: the LoRaWAN messages in the new set were collected 1 year after the previous dataset version. To be consistent with the previous versions, the new LoRaWAN set is uploaded in the same .CSV format as before. This upload can still be found in this repository as ‘lorawan_dataset_antwerp.csv’.
More gateways: Compared to the previous dataset, 4 gateways were added to the LoRaWAN network. The RSSI values of these gateways are shown in columns ‘BS 69’, ‘BS 70’, ‘BS 71’ and ‘BS 72’. All other ‘BS’ columns are in the same order as in previous dataset versions.
More metadata: In the previous LoRaWAN dataset, metadata was limited to 3 receiving gateways per message. In the new dataset version, metadata from all receiving gateways is included in every message. Moreover, some gateways provide a timestamp with nanosecond precision, which can be used to evaluate Time Difference of Arrival localization methods with LoRaWAN.
2 file formats: As more metadata becomes available, we find it important to share the dataset with a clearer overview. This also allows researchers to evaluate the performance of LoRaWAN in an urban environment. Therefore, we publish the new LoRaWAN dataset as a .CSV file as described above, but also as a .JSON file (lorawan_antwerp_2019_dataset.json.txt; the .txt file type had to be appended, otherwise the file could not be uploaded to Zenodo). An example of one message in this JSON format can be seen below:
JSON format description:
HDOP: Horizontal Dilution of Precision
dev_addr: LoRaWAN device address
dev_eui: LoRaWAN device EUI
sf: Spreading factor
channel: TX channel (EU region)
payload: application payload
adr: Adaptive Data Rate (1 = enabled, 0 = disabled)
counter: device uplink message counter
latitude: Groundtruth TX location latitude
longitude: Groundtruth TX location longitude
airtime: signal airtime (seconds)
gateways:
rssi: Received Signal Strength
esp: Estimated Signal Power
snr: Signal-to-Noise Ratio
ts_type: Timestamp type. If this says "GPS_RADIO", a nanosecond precise timestamp is available
time: time of arrival at the gateway
id: gateway ID
JSON example
{ "hdop": 0.7, "dev_addr": "07000EFE", "payload": "008d000392d54c4284d18c403333333f04682aa9410500e8fd4106cabdbc420f00db0d470ce32ac93f0d582be93f0bfa3f8d3f", "adr": 1, "latitude": 51.20856475830078, "counter": 31952, "longitude": 4.400575637817383, "airtime": 0.112896, "gateways": [ { "rssi": -115, "esp": -115.832695, "snr": 6.75, "rx_time": { "ts_type": "None", "time": "2019-01-04T08:59:53.079+01:00" }, "id": "08060716" }, { "rssi": -116, "esp": -125.51497, "snr": -9.0, "rx_time": { "ts_type": "GPS_RADIO", "time": "2019-01-04T08:59:53.962029179+01:00" }, "id": "FF0178DF" } ], "dev_eui": "3432333853376B18", "sf": 7, "channel": 8 }
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
You can also access an API version of this dataset: TMS (traffic monitoring system) daily-updated traffic counts API.
Important note: due to the size of this dataset, you won't be able to open it fully in Excel. Use notepad / R / any software package which can open more than a million rows.
Data reuse caveats: as per license.
Data quality statement: please read the accompanying user manual, explaining:
how this data is collected
identification of count stations
traffic monitoring technology
monitoring hierarchy and conventions
typical survey specification
data calculation
TMS operation.
Traffic monitoring for state highways: user manual [PDF 465 KB]
The data is at daily granularity. However, the actual update frequency of the data depends on the contract the site falls within. For telemetry sites it's once a week on a Wednesday. Some regional sites are fortnightly, and some monthly or quarterly. Some are only 4 weeks a year, with timing depending on contractors’ programme of work.
Data quality caveats: you must use this data in conjunction with the user manual and the following caveats.
The road sensors used in data collection are subject to both technical errors and environmental interference.
Data is compiled from a variety of sources. Accuracy may vary and the data should only be used as a guide.
As not all road sections are monitored, a direct calculation of Vehicle Kilometres Travelled (VKT) for a region is not possible.
Data is sourced from Waka Kotahi New Zealand Transport Agency TMS data.
For sites that use dual loops, classification is by length. Vehicles with a length of less than 5.5 m are classed as light vehicles. Vehicles over 11 m long are classed as heavy vehicles. Vehicles between 5.5 m and 11 m are split 50:50 into light and heavy (see the sketch after this list).
In September 2022, the National Telemetry contract was handed to a new contractor. During the handover process, due to some missing documents and aged technology, 40 of the 96 national telemetry traffic count sites went offline. The current contractor has continued to upload data from all active sites and has gradually worked to bring most offline sites back online. Please note and account for possible gaps in data from National Telemetry sites.
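The dual-loop length rule in the caveats above can be written down directly (a minimal sketch returning expected light/heavy counts for a single vehicle):

def classify_by_length(length_m: float) -> dict:
    """Expected light/heavy counts for one vehicle of the given length."""
    if length_m < 5.5:
        return {"light": 1.0, "heavy": 0.0}
    if length_m > 11.0:
        return {"light": 0.0, "heavy": 1.0}
    # Vehicles between 5.5 m and 11 m are split 50:50 into light and heavy.
    return {"light": 0.5, "heavy": 0.5}

print(classify_by_length(7.2))  # {'light': 0.5, 'heavy': 0.5}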
The NZTA Vehicle Classification Relationships diagram below shows the length classification (typically dual loops) and axle classification (typically pneumatic tube counts), and how these map to the Monetised benefits and costs manual, table A37, page 254.
Monetised benefits and costs manual [PDF 9 MB]
For the full TMS classification schema see Appendix A of the traffic counting manual vehicle classification scheme (NZTA 2011), below.
Traffic monitoring for state highways: user manual [PDF 465 KB]
State highway traffic monitoring (map)
State highway traffic monitoring sites
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an open-source, publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/. We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset. We applied 10-fold cross-validation on the train set and created ten folds. We also created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as a feature extractor. On the features extracted by ResNet50 we applied PCA and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above; those folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The compression was performed to speed up uploads and downloads, and mostly for the sake of convenience. If anyone has any questions about how the datasets are organized, please feel free to ask me at shiblygnr@gmail.com. I will get back to you as soon as possible.
The State Contract and Procurement Registration System (SCPRS) was established in 2003 as a centralized database of information on State contracts and purchases over $5,000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015.
Data Limitations:
Some purchase orders have multiple UNSPSC numbers; however, only the first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event; however, this affects the formatting of the file. The source system, Bidsync, is being deprecated, and these issues will be resolved in the future as state systems transition to Fi$cal.
Data Collection Methodology:
The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department, and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete, the data is uploaded into the SQL Server database.
Secondary/Related Resources:
https://spdx.org/licenses/CC0-1.0.html
Version update: The originally uploaded versions of the CSV files in this dataset included an extra column, "Unnamed: 0," which is not RAMP data and was an artifact of the process used to export the data to CSV format. This column has been removed from the revised dataset. The data are otherwise the same as in the first version.
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2019. For a description of the data collection, processing, and output methods, please see the "methods" section below.
Methods
Data Collection
RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.
The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.
Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).
For any specified date range, the steps to calculate CCD are:
Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
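The same two steps can be reproduced offline against a published page-clicks CSV (a minimal sketch; the file name follows the naming convention described under "Output to CSV" below):

import pandas as pd

df = pd.read_csv("2019-01_RAMP_all_page-clicks.csv")

# Step 1: keep only rows that point to citable content.
citable = df[df["citableContent"] == "Yes"]

# Step 2: CCD is the sum of clicks on those rows.
ccd = citable["clicks"].sum()
print(ccd)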
Output to CSV
Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URLs of items that appear in SERP. The second dataset is aggregated by combination of the country from which a search was conducted and the device used.
As a result, two CSV datasets are provided for each month of published data:
page-clicks:
The data in these CSV files correspond to the page-level data, and include the following fields:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data end with “page-clicks”. For example, the file named 2019-01_RAMP_all_page-clicks.csv contains page level click data for all RAMP participating IR for the month of January, 2019.
country-device-info:
The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
index: The Elasticsearch index corresponding to country and device access information data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data end with “country-device-info”. For example, the file named 2019-01_RAMP_all_country-device-info.csv contains country and device data for all participating IR for the month of January, 2019.
References
Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset containing a plethora of anthropological data, collected unobtrusively over a total course of more than 4 months by n=71 participants under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types, from second-level to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to making these data openly available to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
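For example (a sketch; <user> and <password> are placeholders for your own credentials):

mongorestore --host localhost:27017 --username <user> --password <password> -d rais_anonymized -c fitbit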
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BuildingsBench datasets consist of:
Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.
Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).
BuildingsBench also provides an evaluation benchmark that is a collection of various open-source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1 GB, and they are listed below:
ElectricityLoadDiagrams20112014
Building Data Genome Project-2
Individual household electric power consumption (Sceaux)
Dataset Description
This is a VQA dataset based on charts from the ChartQA dataset.
Load the dataset
from datasets import load_dataset
import csv

def load_beir_qrels(qrels_file):
    qrels = {}
    with open(qrels_file) as f:
        tsvreader = csv.DictReader(f, delimiter="\t")
        for row in tsvreader:
            qid = row["query-id"]
            pid = row["corpus-id"]
            rel = int(row["score"])
            if qid in qrels:
                qrels[qid][pid] = rel
            else:
                qrels[qid] = {pid: rel}
    return qrels

See the full description on the dataset page: https://huggingface.co/datasets/openbmb/VisRAG-Ret-Test-ChartQA.
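For example, the helper returns a nested dict mapping query IDs to per-document relevance scores (the qrels file path here is hypothetical):

qrels = load_beir_qrels("qrels/test.tsv")  # hypothetical path
print(len(qrels), "queries with relevance judgments")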
This digital dataset was created as part of a U.S. Geological Survey study, done in cooperation with the Monterey County Water Resource Agency, to conduct a hydrologic resource assessment and develop an integrated numerical hydrologic model of the hydrologic system of Salinas Valley, CA. As part of this larger study, the USGS developed this digital dataset of geologic data and three-dimensional hydrogeologic framework models, referred to here as the Salinas Valley Geological Framework (SVGF), that define the elevation, thickness, extent, and lithology-based texture variations of nine hydrogeologic units in Salinas Valley, CA. The digital dataset includes a geospatial database that contains two main elements as GIS feature datasets: (1) input data to the 3D framework and textural models, within a feature dataset called "ModelInput"; and (2) interpolated elevation, thicknesses, and textural variability of the hydrogeologic units stored as arrays of polygonal cells, within a feature dataset called "ModelGrids".
The model input data in this data release include stratigraphic and lithologic information from water, monitoring, and oil and gas wells, as well as data from selected published cross sections, point data derived from geologic maps and geophysical data, and data sampled from parts of previous framework models. Input surface and subsurface data have been reduced to points that define the elevation of the top of each hydrogeologic unit at x,y locations; these point data, stored in a GIS feature class named "ModelInputData", serve as digital input to the framework models. The locations of wells used as sources of subsurface stratigraphic and lithologic information are stored within the GIS feature class "ModelInputData", but are also provided as separate point feature classes in the geospatial database. Faults that offset hydrogeologic units are provided as a separate line feature class. Borehole data are also released as a set of tables, each of which may be joined or related to well location through a unique well identifier present in each table. Tables are in Excel and ASCII comma-separated value (CSV) format and include separate but related tables for well location, stratigraphic information on the depths to the top and base of hydrogeologic units intercepted downhole, downhole lithologic information reported at 10-foot intervals, and information on how lithologic descriptors were classed as sediment texture.
Two types of geologic frameworks were constructed and released within a GIS feature dataset called "ModelGrids": (1) a hydrostratigraphic framework where the elevation, thickness, and spatial extent of the nine hydrogeologic units were defined based on interpolation of the input data, and (2) a textural model for each hydrogeologic unit based on interpolation of classed downhole lithologic data. Each framework is stored as an array of polygonal cells: essentially a "flattened", two-dimensional representation of a digital 3D geologic framework. The elevation and thickness of the hydrogeologic units are contained within a single polygon feature class, SVGF_3DHFM, which contains a mesh of polygons representing model cells with multiple attributes, including XY location and the elevation and thickness of each hydrogeologic unit. Textural information for each hydrogeologic unit is stored in a second array of polygonal cells called SVGF_TextureModel.
The spatial data are accompanied by non-spatial tables that describe the sources of geologic information, a glossary of terms, and a description of the nine hydrogeologic units modeled in this study. A data dictionary defines the structure of the dataset, defines all fields in all spatial data attribute tables and all columns in all non-spatial tables, and duplicates the Entity and Attribute information contained in the metadata file. Spatial data are also presented as shapefiles. Downhole data from boreholes are released as a set of tables related by a unique well identifier; tables are in Excel and ASCII comma-separated value (CSV) format.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides data for new prescription drugs introduced to market in California with a Wholesale Acquisition Cost (WAC) that exceeds the Medicare Part D specialty drug cost threshold. Prescription drug manufacturers submit information to HCAI within a specified time period after a drug is introduced to market. Key data elements include the National Drug Code (NDC) administered by the FDA, a narrative description of marketing and pricing plans, and WAC, among other information. Manufacturers may withhold information that is not in the public domain. Note that prescription drug manufacturers are able to submit new drug reports for a prior quarter at any time. Therefore, the data set may include additional new drug report(s) from previous quarter(s).
There are two types of New Drug data sets: Monthly and Annual. The Monthly data sets include the data in completed reports submitted by manufacturers for calendar year 2025, as of March 12, 2025. The Annual data sets include data in completed reports submitted by manufacturers for the specified year. The data sets may include reports that do not meet the specified minimum thresholds for reporting.
The program regulations are available here: https://hcai.ca.gov/wp-content/uploads/2024/03/CTRx-Regulations-Text.pdf
The data format and file specifications are available here: https://hcai.ca.gov/wp-content/uploads/2024/03/Format-and-File-Specifications-version-2.0-ada.pdf
DATA NOTES: Due to recent changes in Excel capabilities, it is not recommended that you save these files in .csv format. If you do, the leading zeros in the NDC number column will be dropped when the file is imported back into Excel. If you need to save the data in a format other than .xlsx, use .txt.
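If the data must pass through CSV anyway, the leading zeros can be preserved by forcing the NDC column to be read as text (a minimal sketch using pandas; the file and column names are assumptions):

import pandas as pd

# Read the NDC column as a string so leading zeros are kept.
df = pd.read_csv("new_drug_report.csv", dtype={"NDC": str})
print(df["NDC"].head())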