25 datasets found
  1. Gulf Shrimp Control Data Tables

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated Jul 2, 2024
    + more versions
    Cite
    Southeast Fisheries Science Center (Resource Provider) (2024). Gulf Shrimp Control Data Tables [Dataset]. https://catalog.data.gov/dataset/gulf-shrimp-control-data-tables
    Explore at:
    Dataset updated
    Jul 2, 2024
    Dataset provided by
    Southeast Fisheries Science Center (Resource Provider)
    Description

    These are tables used to process the loads of Gulf shrimp data. They contain pre-validation tables, error tables and information about statistics on data loads; they contain no data tables and no code tables, and this information need not be published. The data set contains catch (landed catch) and effort for fishing trips made by the larger vessels that fish near and offshore for the various species of shrimp in the Gulf of Mexico. The data set also contains landings by the smaller boats that fish in the bays, lakes, bayous, and rivers for saltwater shrimp species; however, these landings data may be aggregated for multiple trips and may not provide effort data similar to the data for the larger vessels. The landings statistics in this data set consist of the quantity and value for the individual species of shrimp by size category, type and quantity of gear, fishing duration, and fishing area. The data collection procedures for the catch/effort data for the large vessels consist of two parts: the landings statistics are collected from the seafood dealers after the trips are unloaded, whereas the data on fishing effort and area are collected by interviews with the captain or crew while the trip is being unloaded.

  2. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate the feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded/imported into a REDCap-based national registry, enabling real-world data analysis and interoperability.

    Methods: eLAB Development and Source Code (R statistical software)

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
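
    The remapping step described above is essentially a key-value lookup from raw EHR lab names to registry codes. eLAB itself implements this in R with a lookup table of roughly 300 lab subtypes; the Python sketch below only illustrates the idea, reusing the potassium subtypes listed above with hypothetical DD codes and units:

    # Illustrative only: eLAB's real lookup table is shipped with the R package.
    # The DD code ("potassium") and allowed unit below are hypothetical examples.
    LAB_LOOKUP = {
        "Potassium": "potassium",
        "Potassium-External": "potassium",
        "Potassium(POC)": "potassium",
        "Potassium,whole-bld": "potassium",
        "Potassium-Level-External": "potassium",
        "Potassium,venous": "potassium",
        "Potassium-whole-bld/plasma": "potassium",
    }
    ALLOWED_UNITS = {"potassium": "mmol/L"}

    def remap_lab(raw_name: str, unit: str):
        """Return the DD code for a raw EHR lab name, or None if the lab/unit is not DD-defined."""
        code = LAB_LOOKUP.get(raw_name)
        if code is None or ALLOWED_UNITS.get(code) != unit:
            return None  # filtered out: only DD-defined labs and units are accepted
        return code

    print(remap_lab("Potassium,venous", "mmol/L"))  # -> potassium
    print(remap_lab("Sodium", "mmol/L"))            # -> None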

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  3. Musical Scale Classification Dataset using Chroma

    • kaggle.com
    zip
    Updated Apr 8, 2025
    Cite
    Om Avashia (2025). Musical Scale Classification Dataset using Chroma [Dataset]. https://www.kaggle.com/datasets/omavashia/synthetic-scale-chromagraph-tensor-dataset
    Explore at:
    Available download formats: zip (392580911 bytes)
    Dataset updated
    Apr 8, 2025
    Authors
    Om Avashia
    License

    https://cdla.io/sharing-1-0/

    Description

    Dataset Description

    Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale

    This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.

    What’s Inside

    • chroma_tensor: A JSON-safe serialization of a PyTorch tensor with shape [1, 12, T], where:
      • 12 = the 12 pitch classes (C, C#, D, ... B)
      • T = time steps
    • scale_index: An integer label from 0–23 identifying the scale the sample belongs to

    Use Cases

    This dataset is ideal for:

    • Training deep learning models (CNNs, MLPs) to classify musical scales
    • Exploring pitch-class distributions in Western tonal music
    • Prototyping models for music key detection, chord prediction, or tonal analysis
    • Teaching or demonstrating chromagram-based ML workflows

    Labels

    Index   Scale
    0       C major
    1       C# major
    ...     ...
    11      B major
    12      C minor
    ...     ...
    23      B minor

    Quick Load Example (PyTorch)

    Chroma tensors are of shape [1, 12, T], where 1 is the channel dimension (for CNN input), 12 represents the 12 pitch classes (C through B), and T is the number of time frames.

    import torch
    import pandas as pd
    from tqdm import tqdm
    
    df = pd.read_csv("/content/scale_dataset.csv")
    
    # Reconstruct chroma tensors
    X = [torch.tensor(eval(row)).reshape(1, 12, -1) for row in tqdm(df['chroma_tensor'])]
    y = df['scale_index'].tolist()
    

    Alternatively, you could directly load the chroma tensors and target scale indices using the .pt file.

    import torch
    import pandas as pd
    
    data = torch.load("chroma_tensors.pt")
    X_pt = data['X'] # list of [1, 12, 302] tensors
    y_pt = data['y'] # list of scale indices
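
    Once the tensors are loaded by either route above, they can be stacked into a batch and fed to a small CNN for scale classification. A minimal sketch, assuming the .pt file from the block above and its fixed-length [1, 12, 302] tensors; the model architecture and batch size are illustrative, not part of the dataset:

    import torch
    import torch.nn as nn

    data = torch.load("chroma_tensors.pt")        # as in the block above
    X = torch.stack(data['X'])                    # [N, 1, 12, 302]
    y = torch.tensor(data['y'], dtype=torch.long) # [N], values 0-23

    # Tiny illustrative CNN over the chroma "image" (12 pitch classes x T frames)
    model = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=(3, 7), padding=(1, 3)), nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, 1)),             # pool over pitch and time
        nn.Flatten(),
        nn.Linear(16, 24),                        # 24 scales (12 major + 12 minor)
    )

    logits = model(X[:8])                         # forward pass on a small batch
    loss = nn.CrossEntropyLoss()(logits, y[:8])
    print(logits.shape, loss.item())              # torch.Size([8, 24])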
    

    How It Was Built

    • Notes generated from random melodies using music21
    • MIDI converted to WAV via FluidSynth
    • Chromagrams extracted with librosa.feature.chroma_stft
    • Tensors flattened and saved alongside scale index labels

    File Format

    Column          Type    Description
    chroma_tensor   str     Flattened 1D chroma tensor [1×12×T]
    scale_index     int     Label from 0 to 23

    Notes

    • Data is synthetic but musically valid and well-balanced
    • Each of the 24 scales appears 300 times
    • All tensors have fixed length (T) for easy batching
  4. Classicmodels

    • kaggle.com
    zip
    Updated Dec 15, 2024
    Cite
    Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
    Explore at:
    Available download formats: zip (65751 bytes)
    Dataset updated
    Dec 15, 2024
    Authors
    Javier Landaeta
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Abstract

    This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

    The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

    Methodology

    1. Data Extraction:

    • A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.
    • A reusable function is created to read each table and load it into a Pandas DataFrame (a minimal sketch follows below).
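
    The extraction step above can be sketched in a few lines of Python, assuming a local PostgreSQL instance holding the classicmodels schema; the connection string, user and password are placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string: adjust user, password, host and database name.
    engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/classicmodels")

    def read_table(table_name: str) -> pd.DataFrame:
        """Reusable helper: load one classicmodels table into a Pandas DataFrame."""
        return pd.read_sql_table(table_name, con=engine)

    tables = ["orders", "orderdetails", "customers", "products", "employees"]
    dataframes = {name: read_table(name) for name in tables}
    print({name: df.shape for name, df in dataframes.items()})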

    2. Data Cleansing and Transformation:

    • An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.
    • New variables are calculated, such as the total value of each sale, cost, and profit.
    • Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

    3. Exploratory Data Analysis (EDA):

    • Key metrics such as total sales, number of unique customers, and average order value are calculated.
    • Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.
    • Results are visualized using relevant graphics (histograms, bar charts, etc.).

    4. Modeling and Prediction:

    • Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

    5. Report Generation:

    • Detailed reports are created in Pandas DataFrames format that answer specific business questions.
    • These reports are stored in new PostgreSQL tables for further analysis and visualization.

    Results

    • Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
    • Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
    • Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

    Conclusions

    This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

    Technologies Used

    • Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
    • Database: PostgreSQL
    • Tools: Jupyter Notebook
    • Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence

  5. Code for EchoTables (IKILeUS)

    • darus.uni-stuttgart.de
    Updated Feb 28, 2025
    Cite
    Nadeen Fathallah; Steffen Staab (2025). Code for EchoTables (IKILeUS) [Dataset]. http://doi.org/10.18419/DARUS-4774
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2025
    Dataset provided by
    DaRUS
    Authors
    Nadeen Fathallah; Steffen Staab
    License

    https://spdx.org/licenses/MIT.html

    Time period covered
    Aug 1, 2022 - Nov 30, 2024
    Dataset funded by
    German Federal Ministry of Education and Research (BMBF)
    Description

    EchoTables is an innovative accessibility tool developed as part of the IKILeUS project at the University of Stuttgart. It is designed to improve the usability of tabular data for visually impaired users by converting structured tables into concise, auditory-friendly textual summaries. Traditional screen readers navigate tables linearly, which imposes a high cognitive load on users. EchoTables alleviates this issue by summarizing tables, facilitating quicker comprehension and more efficient information retrieval. Initially utilizing RUCAIBox (LLM), EchoTables transitioned to Mistral-7B, a more powerful open-source model, to enhance processing efficiency and scalability. The tool has been tested with widely used screen readers such as VoiceOver to ensure accessibility. EchoTables has been adapted to process diverse data sources, including lecture materials, assignments, and WikiTables, making it a valuable resource for students navigating complex datasets.

  6. PSP Solar Wind Electrons Alphas and Protons (SWEAP) SPAN-B Electron Energy...

    • catalog.data.gov
    • heliophysicsdata.gsfc.nasa.gov
    Updated Sep 19, 2025
    + more versions
    Cite
    NASA Space Physics Data Facility (SPDF) Data Services (2025). PSP Solar Wind Electrons Alphas and Protons (SWEAP) SPAN-B Electron Energy Spectra, Level 2 (L2), 1.74 s Data [Dataset]. https://catalog.data.gov/dataset/psp-solar-wind-electrons-alphas-and-protons-sweap-span-b-electron-energy-spectra-level-2-l
    Explore at:
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    SPAN-E Level 2 Electron Energy Spectra Data

    File Naming Format: psp_swp_spb_sf1_L2_32E_YYYYMMDD_v01.cdf

    The SF1 product is an energy spectrum produced on the spacecraft by summing over the Theta and Phi directions. The units are differential energy flux and eV. The sample filename above includes 32 energies. The larger Theta angles (deflection angles) are artificially enhanced in the "sf1" energy spectra data products due to the method of spectra production on the SPAN-E instrument (straight summing). Thus, SF1 energy spectra are not recommended for rigid statistical analysis.

    Parker Solar Probe SWEAP Solar Probe Analyzer (SPAN) Electron Data Release Notes: November 19, 2019 Initial Data Release

    Overview of Measurements

    The SWEAP team is pleased to release the data from Encounter 1 and Encounter 2. The files contain data from the time range October 31, 2018 - June 18, 2019. The prime mission of Parker Solar Probe is to take data when within 0.25 AU of the Sun during its orbit. However, there have been some extended campaign measurements outside of this distance. The data are available for those days that are within 0.25 AU as well as those days when the instruments were operational outside of 0.25 AU. Each SWEAP data file includes a set of a particular type of measurements over a single observing day. Measurements are provided in Common Data Format (CDF), a self-documenting data framework for which convenient open source tools exist across most scientific computing platforms. Users are strongly encouraged to consult the global metadata in each file, and the metadata that are linked to each variable. The metadata include comprehensive listings of relevant information, including units, coordinate systems, qualitative descriptions, measurement uncertainties, methodologies, links to further documentation, and so forth.

    SPAN-E Level 2 Version 01 Release Notes

    The SPAN-Ae and SPAN-B instruments together have fields of view covering >90% of the sky; major obstructions to the FOV include the spacecraft heat shield and other intrusions by spacecraft components. Each individual SPAN-E has an FOV of ±60° in Theta and 240° in Phi. The rotation matrices to convert into the spacecraft frame can be found in the individual CDF files, or in the instrument paper. This data set covers all periods for which the instrument was turned on and taking data in the solar wind in ion mode. This includes maneuvers affecting the spacecraft attitude and orientation. Measurements taken by SPAN-B when the spacecraft is pointed away from the sun are taken in sunlight. The data quality flags for the SPAN data can be found in the CDF files as QUALITY_FLAG (0=good, 1=bad).

    General Remarks for Version 01 Data

    Users interested in field-aligned electrons should take care regarding potential blockages from the heat shield when B is near radial, especially in SPAN-Ae. Artificial reductions in strahl width can result. Due to the relatively high electron temperature in the inner heliosphere, many secondary electrons are generated from spacecraft and instrument surfaces. As a result, electron measurements in this release below 30 eV are not advised for scientific analysis. The fields of view in SPAN-Ae and SPAN-B have many intrusions by the spacecraft, and erroneous pixels discovered in analysis, in particular near the edges of the FOV, should be viewed with skepticism. Details on FOV intrusion are found in the instrument paper, forthcoming, or by contacting the SPAN-E instrument scientist. The instrument mechanical attenuators are engaged during the eight days around perihelia 1 and 2, which results in a factor of about 10 reduction of the total electron flux into the instrument. During these eight days, halo electron measurements are artificially enhanced in the L2 products as a result of the reduced instrument geometric factor and subsequent ground corrections. A general note for Encounter 1 and Encounter 2 data: a miscalculation in the deflection tables loaded to both SPAN-Ae and SPAN-B resulted in over-deflection of the outermost Theta angles during these encounters. As such, pixels at large Thetas should be ignored. This error was corrected by a table upload prior to Encounter 3. Lastly, when viewing time gaps in the SPAN-E measurements, be advised that the first data point produced by the instrument after a power-on is the maximum value permitted by internal instrument counters. Therefore, the first data point after powerup is erroneous and should be discarded, as indicated by quality flags.

    SPAN-E Encounter 1 Remarks

    SPAN-E operated nominally for the majority of the first encounter. Exceptions to this include a few instances of corrupted, higher-energy sweep tables, and an instrument commanding error for the two hours surrounding perihelion 1. These and other instrument diagnostic tests are indicated with the QUALITY_FLAG variable in the CDFs. The mechanical attenuator was engaged for the 8 days around perihelion 1: as a result the microchannel plate (MCP) noise due to thermal effects and cosmic rays is artificially enhanced and is particularly obvious at higher energies. Exercise caution with this data release if looking for halo electrons when the mechanical attenuator is engaged.

    SPAN-E Cruise Phase Remarks

    The cruise mode rates of SPAN-E are greatly reduced compared to the encounter mode rates. When the PSP spacecraft is in a communications slew, the SPAN-B instrument occasionally reaches its maximum allowable operating temperature and is powered off by SWEM. Timing for the SF1 products in cruise phase is not corrected in v01, and thus it is not advised to use the data at this time for scientific analysis. The typical return of SF0 products is one spectrum out of every 32 survey spectra, returned every 15 minutes or so. One out of every four 27.75 s SF1 spectra is produced every 111 s.

    SPAN-E Encounter 2 Remarks

    SPAN-E operated nominally for the majority of the second encounter. Exceptions include instrument diagnostic and health checks and a few instances of corrupted high-energy sweep tables. These tests and corrupted table loads are indicated with the QUALITY_FLAG parameter. The mechanical attenuator was engaged for the 8 days around perihelion 2: as a result the MCP noise due to thermal effects and cosmic rays is artificially enhanced and is particularly obvious at higher energies. Exercise caution in this data release if looking for halo electrons when the mechanical attenuator is engaged.

    Parker Solar Probe SWEAP Rules of the Road

    As part of the development of collaboration with the broader Heliophysics community, the mission has drafted a "Rules of the Road" to govern how PSP instrument data are to be used. 1) Users should consult with the PI to discuss the appropriate use of instrument data or model results and to ensure that the users are accessing the most recently available versions of the data and of the analysis routines. Instrument team Science Operations Centers (SOCs) and/or Virtual Observatories (VOs) should facilitate this process, serving as the contact point between PI and users in most cases. 2) Users should heed the caveats of investigators to the interpretations and limitations of data or model results. Investigators supplying data or models may insist that such caveats be published. Data and model version numbers should also be specified. 3) Browse products, Quicklook, and Planning data are not intended for science analysis or publication and should not be used for those purposes without consent of the PI. 4) Users should acknowledge the sources of data used in all publications, presentations, and reports: "We acknowledge the NASA Parker Solar Probe Mission and the SWEAP team led by J. Kasper for use of data." 5) Users are encouraged to provide the PI a copy of each manuscript that uses the PI data prior to submission of that manuscript for consideration of publication. On publication, the citation should be transmitted to the PI and any other providers of data.
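
    For working with the daily CDF files from Python, a minimal sketch using the cdflib package; the date in the file name is a placeholder following the naming format above, QUALITY_FLAG is the flag variable named in the release notes, and the spectra variable name is a placeholder to be checked against the file's own metadata:

    import cdflib
    import numpy as np

    # Daily L2 SF1 file named per the format above; the date is a placeholder.
    cdf = cdflib.CDF("psp_swp_spb_sf1_L2_32E_20190404_v01.cdf")

    # Inspect the variables and their metadata before using them, as recommended above.
    print(cdf.cdf_info())

    # QUALITY_FLAG is defined in the release notes (0 = good, 1 = bad).
    quality = np.asarray(cdf.varget("QUALITY_FLAG"))
    spectra = np.asarray(cdf.varget("EFLUX"))  # placeholder variable name; confirm via cdf_info()
    good_spectra = spectra[quality == 0]
    print(good_spectra.shape)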

  7. YouTube Video Analytics Dataset

    • kaggle.com
    zip
    Updated Nov 19, 2025
    Cite
    Kundan Sagar Bedmutha (2025). YouTube Video Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/kundanbedmutha/youtube-video-analytics-dataset
    Explore at:
    Available download formats: zip (1130623 bytes)
    Dataset updated
    Nov 19, 2025
    Authors
    Kundan Sagar Bedmutha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    This dataset contains 30,000 YouTube video analytics records, created to simulate realistic YouTube Studio performance data from the last 12 months. It provides per-video metrics such as impressions, click-through rate (CTR), average view duration, watch time, likes, comments, and traffic sources.

    This dataset is useful for:

    • YouTube trend analysis
    • Predictive modeling
    • Engagement analysis
    • Audience retention studies
    • Recommender systems
    • Machine learning and EDA
    • Content performance optimization

    All upload dates fall within the previous 365 days, making the dataset aligned with recent YouTube trends.

    COLUMN DESCRIPTIONS

    • Post_ID – Unique video ID used to join with other tables.
    • Upload_Date – Video upload date within the last 1 year.
    • Video_Duration_Min – Total length of the video in minutes.
    • Avg_View_Duration_Sec – Average watch time per viewer.
    • Avg_View_Percentage – Percentage of the video that users watched.
    • Subscribers_Gained – Number of subscribers gained from this video.
    • Traffic_Source – How viewers discovered the video (Search, Suggested, Browse, External, etc.).
    • CTR_Percentage – Click-through rate of the thumbnail impressions.
    • Impressions – How many users saw the video thumbnail across YouTube surfaces.
    • Likes – Total number of likes received.
    • Comments – Number of comments posted.
    • Shares – Number of times the video was shared.
    • Total_Watch_Time_Hours – Total accumulated watch time in hours (critical YouTube ranking signal).
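
    As a quick illustration of how these columns combine, a minimal pandas sketch that ranks traffic sources by total watch time and average CTR; the CSV file name is a placeholder for whatever the download is saved as:

    import pandas as pd

    # Placeholder file name; use the CSV from the dataset download.
    df = pd.read_csv("youtube_video_analytics.csv", parse_dates=["Upload_Date"])

    # Aggregate per traffic source: total watch time, mean thumbnail CTR, video count.
    by_source = (
        df.groupby("Traffic_Source")
          .agg(total_watch_time_hours=("Total_Watch_Time_Hours", "sum"),
               mean_ctr_pct=("CTR_Percentage", "mean"),
               videos=("Post_ID", "count"))
          .sort_values("total_watch_time_hours", ascending=False)
    )
    print(by_source.head())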

    WHY THIS DATASET MATTERS

    YouTube’s recommendation system prioritizes:

    • high watch time
    • high CTR
    • strong audience retention
    • strong engagement (likes, comments, shares)

    This dataset includes all of these metrics, allowing deep analysis of:

    • what makes videos perform well
    • which traffic sources are strongest
    • how video length affects watch time
    • how engagement influences discoverability
    • seasonal or monthly patterns in video performance

  8. Data for: Inadequate sampling of the soundscape leads to overoptimistic...

    • search.dataone.org
    • datadryad.org
    • +1more
    Updated Jul 21, 2025
    Cite
    Thomas Lewis (2025). Data for: Inadequate sampling of the soundscape leads to overoptimistic estimates of recogniser performance: A case study of two sympatric macaw species [Dataset]. http://doi.org/10.5061/dryad.5x69p8d7j
    Explore at:
    Dataset updated
    Jul 21, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Thomas Lewis
    Time period covered
    Jan 1, 2023
    Description

    Passive acoustic monitoring (PAM) offers the potential to dramatically increase the scale and robustness of species monitoring in rainforest ecosystems. PAM generates large volumes of data that require automated methods of target species detection. Species-specific recognisers, which often use supervised machine learning, can achieve this goal. However, they require a large training dataset of both target and non-target signals, which is time-consuming and challenging to create. Unfortunately, very little information about creating training datasets for supervised machine learning recognisers is available, especially for tropical ecosystems. Here we show an iterative approach to creating a training dataset that improved recogniser precision from 0.12 to 0.55. By sampling background noise using an initial small recogniser, we addressed one of the significant challenges of training dataset creation in acoustically diverse environments. Our work demonstrates that recognisers will likely f...

    Raw data used to create this dataset was collected from autonomous recording units in northern Costa Rica. A template-matching process was used to identify candidate signals, then a one-second window was put around each candidate signal. We extracted a total of 113 acoustic features using the warbleR package in R (R Core Team, 2020): 20 measurements of frequency, time, and amplitude parameters, and 93 Mel-frequency cepstral coefficients (MFCCs) (Araya-Salas and Smith-Vidaurre, 2017). This dataset also includes the results of manually checking detections that were the output of a trained random forest. These were initially output as selection tables; individual sound files were loaded in Raven Lite, selection tables were loaded, and each detection was manually checked and labelled. There is also the random forest model, which is a .rds format model created using tidymodels in R.

    Following the code associated with this data will require R; the outputs from the machine learning require Raven Lite to open. The raw recordings are not included in this dataset.

  9. Point-DeepONet dataset

    • kaggle.com
    zip
    Updated Dec 20, 2024
    Cite
    Jangseop Park (2024). Point-DeepONet dataset [Dataset]. https://www.kaggle.com/datasets/jangseop/point-deeponet-dataset
    Explore at:
    Available download formats: zip (55198205686 bytes)
    Dataset updated
    Dec 20, 2024
    Authors
    Jangseop Park
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    For more details about the dataset and its applications, please refer to our GitHub repository.

    The Point-DeepONet Dataset is meticulously curated to support advanced research in nonlinear structural analysis and operator learning, specifically tailored for applications in structural mechanics of jet engine brackets. This dataset encompasses a diverse collection of non-parametric three-dimensional (3D) geometries subjected to varying load conditions—vertical, horizontal, and diagonal. It includes high-fidelity simulation results, such as displacement fields and von Mises stress distributions, derived from nonlinear finite element analyses (FEA).

    Key Features:

    • Point Clouds: Detailed point cloud representations of complex 3D geometries without mesh parameterization.
    • Load Conditions: Diverse, directionally varying load scenarios to simulate real-world engineering applications.
    • Simulation Results: High-resolution displacement and von Mises stress fields, enabling accurate surrogate modeling.
    • Material Properties: Based on Ti–6Al–4V with realistic material properties for accurate structural response simulations.
    • Nonlinear Analysis: Incorporates elastic–plastic material models with isotropic hardening to capture complex structural behaviors.

    This dataset was utilized to develop and train the Point-DeepONet model, which integrates PointNet within the DeepONet framework to achieve rapid and accurate predictions for structural analyses. By leveraging this dataset, researchers can explore operator-learning techniques, optimize design processes, and enhance decision-making in complex engineering workflows.

    Dataset Generation

    We utilize the DeepJEB dataset [1], a synthetic dataset specifically designed for 3D deep learning applications in structural mechanics, focusing on jet engine brackets. This dataset includes various bracket geometries subjected to different load cases—vertical, horizontal, and diagonal—providing a diverse range of scenarios to train and evaluate deep learning models for predicting field values. While the original DeepJEB dataset offers solutions from linear static analyses, in this study we extend its applicability by performing our own nonlinear static finite element analyses to predict displacement fields ($u_x$, $u_y$, $u_z$) and von Mises stress under varying geometric and loading conditions.

    Finite element analyses (FEA) are conducted using Altair OptiStruct [2] to simulate the structural response under nonlinear static conditions. Each bracket geometry is discretized using second-order tetrahedral elements with an average element size of 2 mm, enhancing the precision of the displacement and stress predictions. The material properties for the brackets are based on Ti–6Al–4V, specified with a density of $4.47 \times 10^{-3}$ g/mm³, a Young's modulus ($E$) of 113.8 GPa, and a Poisson’s ratio ($\nu$) of 0.342, representing realistic behavior under the applied loads.

    An elastic–plastic material model with linear isotropic hardening is employed to capture the nonlinear response, characterized by a yield stress of 227.6 MPa and a hardening modulus of 355.56 MPa. The nonlinear analysis settings include a maximum iteration limit of 10 and a convergence tolerance of 1%, ensuring accurate simulation of the structural response to complex loading conditions.

    [Figures: (a) bracket geometry and load direction, (b) bolted and loaded interfaces, (c) boundary conditions and constraints]

    Data Preprocessing

    The dataset comprises a range of jet engine bracket geometries with varying structural properties and masses. The node counts range from 127,634 to 380,714, and the mass spans from 0.56 kg to 2.41 kg, ensuring a diverse set of structural complexities and weights.

    Metric            Minimum    Maximum      Average
    Number of nodes   127,634    380,714      209,974
    Number of edges   468,708    1,453,872    787,658
    Number of cells   78,118     242,312      131,276
    Mass (kg)         0.56       2.41         1.23

    To facilitate effective model training and evaluation, the dataset was divided into training and validation subsets, with 80% allocated for training...

  10. Scalable Predictive Analysis in Critically Ill Patients Using a Visual Open...

    • plos.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Sven Van Poucke; Zhongheng Zhang; Martin Schmitz; Milan Vukicevic; Margot Vander Laenen; Leo Anthony Celi; Cathy De Deyne (2023). Scalable Predictive Analysis in Critically Ill Patients Using a Visual Open Data Analysis Platform [Dataset]. http://doi.org/10.1371/journal.pone.0145791
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sven Van Poucke; Zhongheng Zhang; Martin Schmitz; Milan Vukicevic; Margot Vander Laenen; Leo Anthony Celi; Cathy De Deyne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the accumulation of large amounts of health related data, predictive analytics could stimulate the transformation of reactive medicine towards Predictive, Preventive and Personalized (PPPM) Medicine, ultimately affecting both cost and quality of care. However, high-dimensionality and high-complexity of the data involved, prevents data-driven methods from easy translation into clinically relevant models. Additionally, the application of cutting edge predictive methods and data manipulation require substantial programming skills, limiting its direct exploitation by medical domain experts. This leaves a gap between potential and actual data usage. In this study, the authors address this problem by focusing on open, visual environments, suited to be applied by the medical community. Moreover, we review code free applications of big data technologies. As a showcase, a framework was developed for the meaningful use of data from critical care patients by integrating the MIMIC-II database in a data mining environment (RapidMiner) supporting scalable predictive analytics using visual tools (RapidMiner’s Radoop extension). Guided by the CRoss-Industry Standard Process for Data Mining (CRISP-DM), the ETL process (Extract, Transform, Load) was initiated by retrieving data from the MIMIC-II tables of interest. As use case, correlation of platelet count and ICU survival was quantitatively assessed. Using visual tools for ETL on Hadoop and predictive modeling in RapidMiner, we developed robust processes for automatic building, parameter optimization and evaluation of various predictive models, under different feature selection schemes. Because these processes can be easily adopted in other projects, this environment is attractive for scalable predictive analytics in health research.

  11. Data from: Fishing intensity in the Atlantic Ocean (from Global Fishing...

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    + more versions
    Cite
    Zenodo (2025). Fishing intensity in the Atlantic Ocean (from Global Fishing Watch) [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-13791296?locale=lv
    Explore at:
    Available download formats: unknown (156942336)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. MISSION ATLANTIC

    The MISSION ATLANTIC project is an EU-funded initiative that focuses on understanding the impacts of climate change and human activities on these ecosystems. The project aims to map and assess the current and future status of Atlantic marine ecosystems, develop tools for sustainable management, and support ecosystem-based governance to ensure the resilience and sustainable use of ocean resources. The project brings together experts from 33 partner organizations across 14 countries, including Europe, Africa, and North and South America. MISSION ATLANTIC includes ten work packages. The present published dataset is included in WP3, which focuses on mapping the pelagic ecosystems, resources, and pressures in the Atlantic Ocean. This WP aims to collect extensive spatial and temporal data to create 3D maps of the water column, identify key vertical ecosystem domains, and assess the pressures from climate change and human activities. More specifically, the dataset corresponds to the fishing intensity presented in Deliverable 3.2, which integrates data from various sources to map the distribution and dynamics of present ecosystem pressures over time, providing crucial insights for sustainable management strategies.

    2. Data description

    2.1. Data Source

    Fishing intensity estimates from the Global Fishing Watch initiative (GFW) (Kroodsma et al. 2018), which applies machine learning algorithms to data from Automatic Identification Systems (AIS), Vessel Monitoring Systems (VMS), and vessel registries, have been used for the year 2020. This machine learning approach has been able to distinguish between fishing and routing activity of individual vessels, while using pattern recognition to differentiate seven main fishing gear types at the Atlantic Ocean scale (Taconet et al., 2019). The seven main fishing vessel types considered are: trawlers, purse seiners, drifting longliners, set gillnets, squid jiggers, pots and traps, and other. In this work we have aggregated these into pelagic, seabed and passive fishing activities to align with our grouping of ecosystem components.

    The GFW data has some limitations:

    • AIS is only required for large vessels. The International Maritime Organization requires AIS use for all vessels of 300 gross tonnage and upward, although some jurisdictions mandate its use in smaller vessels. For example, within the European Union it is required for fishing vessels at least 15 m in length. This means that in some areas the fishing intensity estimates will not include the activity of small vessels operating near shore.
    • AIS can be intentionally turned off, for example, when vessels carry out illegal fishing activities (Kurekin et al. 2019).
    • In the GFW dataset, vessels classified as trawlers include both pelagic and bottom trawlers. As trawlers are included in the bottom fishing category, it is highly likely that the data overestimates the effort on the seafloor and underestimates it on the water column.

    2.2. Data Processing

    1. Data download from the GFW portal.

    2. Using R, add daily files and aggregate fishing hours by fishing gear and coordinates:

    library(data.table)
    ## Load data
    fileIdx = list.files(".../fleet-daily-csvs-100-v2-2020/", full.names = T)
    ## Loop
    colsIdx = c("geartype", "hours", "fishing_hours", "x", "y")
    lapply(fileIdx, function(xx) {
      out = data.table(x = NA_real_, y = NA_real_, geartype = NA_character_)
      tmp = fread(xx)
      tmp[, ":=" (y = floor(cell_ll_lat * 10L) / 10L, x = floor(cell_ll_lon * 10L) / 10L)]
      tmp = tmp[, ..colsIdx]
      h = tmp[, c(.N, lapply(.SD, sum, na.rm = T)), by = .(x, y, geartype)]
      outh = data.table::merge.data.table(out, h, by = c("x", "y", "geartype"), all = TRUE)
      fwrite(outh, ".../GFW_2020_0.1_degrees_and_gear_all.csv", nThread = 14, append = T)
    })

    Group fishing gears into main fishing groups:

    library(dplyr)
    library(tidyr)
    ## Load data
    fishing <- read.csv(".../GFW_2020_0.1_degrees_and_gear_all.csv", sep=",", dec=".", header=T, stringsAsFactors = FALSE)
    ## Grouping fishing gears (fishing, pelagic, bottom, passive)
    # unique(fishing$geartype)
    fishing$group <- NA
    fishing$group[which(fishing$geartype == "fishing")] = "fishing" # Unknown
    fishing$group[fishing$geartype %in% c("trollers", "squid_jigger", "pole_and_line", "purse_seines", "tuna_purse_seines", "seiners", "other_purse_seines", "other_seines", "set_longlines", "drifting_longlines")] <- "pelagic"
    fishing$group[fishing$geartype %in% c("trawlers", "dredge_fishing")] <- "bottom"
    fishing$group[fishing$geartype %in% c("set_gillnets", "fixed_gear", "pots_and_traps")] <- "passive"
    ## Total fishing hours (by fishing and position)
    fish_gr <- fishing %>%
      group_by(x, y, group) %>%
      summarise(gfishing_hours = sum(fishing_hours))

    Pivot the table in order to have fishing groups in columns. Each row corresponds to the coordinates of the left corner of the grid cell (0.1 decimal degrees):

    ## Pivoting table (fishing groups in columns)
    fish_gr3 <- fish_gr %>%
      pivot_wider(names_from = "group", values_from = "gfishing_hours", va
  12. PUDL Data Release v1.0.0

    • zenodo.org
    application/gzip, bin +1
    Updated Aug 28, 2023
    Cite
    Zane A. Selvans; Zane A. Selvans; Christina M. Gosnell; Christina M. Gosnell (2023). PUDL Data Release v1.0.0 [Dataset]. http://doi.org/10.5281/zenodo.3653159
    Explore at:
    Available download formats: application/gzip, bin, sh
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zane A. Selvans; Zane A. Selvans; Christina M. Gosnell; Christina M. Gosnell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the first data release from the Public Utility Data Liberation (PUDL) project. It can be referenced & cited using https://doi.org/10.5281/zenodo.3653159

    For more information about the free and open source software used to generate this data release, see Catalyst Cooperative's PUDL repository on Github, and the associated documentation on Read The Docs. This data release was generated using v0.3.1 of the catalystcoop.pudl python package.

    Included Data Packages

    This release consists of three tabular data packages, conforming to the standards published by Frictionless Data and the Open Knowledge Foundation. The data are stored in CSV files (some of which are compressed using gzip), and the associated metadata is stored as JSON. These tabular data can be used to populate a relational database.

    • pudl-eia860-eia923:
      Data originally collected and published by the US Energy Information Administration (US EIA). The data from EIA Form 860 covers the years 2011-2018. The Form 923 data covers 2009-2018. A large majority of the data published in the original data sources has been included, but some parts, like fuel stocks on hand, and EIA 923 schedules 6, 7, & 8 have not yet been integrated.
    • pudl-eia860-eia923-epacems:
      This data package contains all of the same data as the pudl-eia860-eia923 package above, as well as the Hourly Emissions data from the US Environmental Protection Agency's (EPA's) Continuous Emissions Monitoring System (CEMS) from 1995-2018. The EPA CEMS data covers thousands of power plants at hourly resolution for decades, and contains close to a billion records.
    • pudl-ferc1:
      Seven data tables from FERC Form 1 are included, primarily relating to individual power plants, and covering the years 1994-2018 (the entire span of time for which FERC provides this data). These tables are the only ones which have been subjected to any cleaning or organization for programmatic use within PUDL. The complete, raw FERC Form 1 database contains 116 different tables with many thousands of columns of mostly financial data. We will archive a complete copy of the multi-year FERC Form 1 Database as a file-based SQLite database at Zenodo, independent of this data release. It can also be re-generated using the catalystcoop.pudl Python package and the original source data files archived as part of this data release.

    Contact Us

    If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. You can also:

    Using the Data

    The data packages are just CSVs (data) and JSON (metadata) files. They can be used with a variety of tools on many platforms. However, the data is organized primarily with the idea that it will be loaded into a relational database, and the PUDL Python package that was used to generate this data release can facilitate that process. Once the data is loaded into a database, you can access that DB however you like.

    Make sure conda is installed

    None of these commands will work without the conda Python package manager installed, either via Anaconda or miniconda:

    Download the data

    First download the files from the Zenodo archive into a new empty directory. A couple of them are very large (5-10 GB), and depending on what you're trying to do you may not need them.

    • If you don't want to recreate the data release from scratch by re-running the entire ETL process yourself, and you don't want to create a full clone of the original FERC Form 1 database, including all of the data that has not yet been integrated into PUDL, then you don't need to download pudl-input-data.tgz.
    • If you don't need the EPA CEMS Hourly Emissions data, you do not need to download pudl-eia860-eia923-epacems.tgz.

    Load All of PUDL in a Single Line

    Use cd to get into your new directory at the terminal (in Linux or Mac OS), or open up an Anaconda terminal in that directory if you're on Windows.

    If you have downloaded all of the files from the archive, and you want it all to be accessible locally, you can run a single shell script, called load-pudl.sh:

    bash load-pudl.sh
    

    This will do the following:

    • Load the FERC Form 1, EIA Form 860, and EIA Form 923 data packages into an SQLite database which can be found at sqlite/pudl.sqlite.
    • Convert the EPA CEMS data package into an Apache Parquet dataset which can be found at parquet/epacems.
    • Clone all of the FERC Form 1 annual databases into a single SQLite database which can be found at sqlite/ferc1.sqlite.

    Selectively Load PUDL Data

    If you don't want to download and load all of the PUDL data, you can load each of the above datasets separately.

    Create the PUDL conda Environment

    This installs the PUDL software locally, and a couple of other useful packages:

    conda create --yes --name pudl --channel conda-forge \
      --strict-channel-priority \
      python=3.7 catalystcoop.pudl=0.3.1 dask jupyter jupyterlab seaborn pip
    conda activate pudl
    

    Create a PUDL data management workspace

    Use the PUDL setup script to create a new data management environment inside this directory. After you run this command you'll see some other directories show up, like parquet, sqlite, data etc.

    pudl_setup ./
    

    Extract and load the FERC Form 1 and EIA 860/923 data

    If you just want the FERC Form 1 and EIA 860/923 data that has been integrated into PUDL, you only need to download pudl-ferc1.tgz and pudl-eia860-eia923.tgz. Then extract them in the same directory where you ran pudl_setup:

    tar -xzf pudl-ferc1.tgz
    tar -xzf pudl-eia860-eia923.tgz
    

    To make use of the FERC Form 1 and EIA 860/923 data, you'll probably want to load them into a local database. The datapkg_to_sqlite script that comes with PUDL will do that for you:

    datapkg_to_sqlite \
      datapkg/pudl-data-release/pudl-ferc1/datapackage.json \
      datapkg/pudl-data-release/pudl-eia860-eia923/datapackage.json \
      -o datapkg/pudl-data-release/pudl-merged/
    

    Now you should be able to connect to the database (~300 MB) which is stored in sqlite/pudl.sqlite.
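
    As a quick check, a minimal Python sketch that connects to that file and lists the tables it contains (assuming it was created by the steps above):

    import sqlite3

    # Path created by pudl_setup / datapkg_to_sqlite in the workspace above.
    conn = sqlite3.connect("sqlite/pudl.sqlite")

    # List the tables loaded from the FERC 1 and EIA 860/923 data packages.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
    ).fetchall()
    print([name for (name,) in tables])

    conn.close()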

    Extract EPA CEMS and convert to Apache Parquet

    If you want to work with the EPA CEMS data, which is much larger, we recommend converting it to an Apache Parquet dataset with the included epacems_to_parquet script. Then you can read those files into dataframes directly. In Python you can use the pandas.read_parquet() function. If you need to work with more data than can fit in memory at one time, we recommend using Dask dataframes. Converting the entire dataset from datapackages into Apache Parquet may take an hour or more:

    tar -xzf pudl-eia860-eia923-epacems.tgz
    epacems_to_parquet datapkg/pudl-data-release/pudl-eia860-eia923-epacems/datapackage.json
    

    You should find the Parquet dataset (~5 GB) under parquet/epacems, partitioned by year and state for easier querying.
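
    A minimal Python sketch of reading a single year/state slice of that partitioned dataset with pandas; the partition values below are examples, and a parquet engine such as pyarrow must be installed:

    import pandas as pd

    # Read only one partition of the EPA CEMS output; adjust year/state as needed.
    epacems = pd.read_parquet(
        "parquet/epacems",
        filters=[("year", "=", 2018), ("state", "=", "CO")],
    )
    print(epacems.shape)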

    Clone the raw FERC Form 1 Databases

    If you want to access the entire set of original, raw FERC Form 1 data (of which only a small subset has been cleaned and integrated into PUDL) you can extract the original input data that's part of the Zenodo archive and run the ferc1_to_sqlite script using the same settings file that was used to generate the data release:

    tar -xzf pudl-input-data.tgz
    ferc1_to_sqlite data-release-settings.yml
    

    You'll find the FERC Form 1 database (~820 MB) in sqlite/ferc1.sqlite.

    Data Quality Control

    We have performed basic sanity checks on much but not all of the data compiled in PUDL to ensure that we identify any major issues we might have introduced through our processing

  13. Google Ads Transparency Center

    • console.cloud.google.com
    Updated Sep 6, 2023
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&hl=de (2023). Google Ads Transparency Center [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/google-ads-transparency-center?hl=de
    Explore at:
    Dataset updated
    Sep 6, 2023
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    Description

    This dataset contains two tables: creative_stats and removed_creative_stats.

    The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad-specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first shown and last shown dates, which criteria were used in audience selection, the format of the ad, the ad topic and whether the ad is funded by the Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided.

    The removed_creative_stats table contains information about ads that served in the European Economic Area that Google removed: where and why they were removed and per-region information on when they served. The removed_creative_stats table also contains a link to the Google Ads Transparency Center for the removed ad. Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website.

    About BigQuery

    This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1 TB/mo of free tier processing: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.

    Download Dataset

    This public dataset is also hosted in Google Cloud Storage and is available free to use. Use the quick start guide to learn how to access public datasets on Google Cloud Storage. We provide the raw data in JSON format, sharded across multiple files to support easier download of the large dataset. A README file which describes the data structure and our Terms of Service (also listed below) is included with the dataset. You can also download the results from a custom query; see the documentation for options and instructions.

    Signed-out users can download the full dataset by using the gcloud CLI. Follow the instructions to download and install the gcloud CLI. To remove the login requirement, run "$ gcloud config set auth/disable_credentials True". To download the dataset, run "$ gcloud storage cp gs://ads-transparency-center/* . -R".
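
    For users working from Python rather than the console, a minimal sketch with the google-cloud-bigquery client; the fully qualified table name is an assumption based on the public-data listing (verify it in the BigQuery console), and a GCP project with default credentials is required:

    from google.cloud import bigquery

    # Requires a GCP project and application-default credentials;
    # to_dataframe() additionally needs pandas and db-dtypes installed.
    client = bigquery.Client()

    # Assumed table path; check the exact dataset/table names in the BigQuery console.
    query = """
        SELECT *
        FROM `bigquery-public-data.google_ads_transparency_center.creative_stats`
        LIMIT 10
    """
    df = client.query(query).to_dataframe()
    print(df.columns.tolist())
    print(df.head())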

  14. Bat Recording Manager 7.2

    • figshare.com
    Updated Mar 23, 2019
    + more versions
    Cite
    Barbastellus barbastellus; Justin Halls (2019). Bat Recording Manager 7.2 [Dataset]. http://doi.org/10.6084/m9.figshare.5972296.v34
    Explore at:
    application/x-dosexec (available download formats)
    Dataset updated
    Mar 23, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Barbastellus barbastellus; Justin Halls
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    A program for managing collections of full spectrum recordings of bats.

    v6.2.6660 incorporates the import and export of collections of pictures in the image compare window.
    v6.2.6661 fixes some bugs and speed issues in 6660.
    v6.2.6680 tries to fix some database updating problems and adds additional debugging in this area.
    v7.0.6760 - Major improvements and changes. First define the additional shortcut key in Audacity - CTRL-SHIFT-M = Open menu in focussed track. New item in 'View' menu - Analyse and Import, will open a folder of .wav files and sequentially open them in Audacity. When annotated, the label file saved and Audacity closed, the next file will be opened. If the label file is not saved then the process stops and will resume on the next invocation of Analyse and Import on that folder. As each file is opened the label track will be automatically created and named, and the view will zoom to the first 5 seconds of the .wav track.
    v7.0.6764 also includes a new report format which (for one or more sessions) gives the number of minutes in each ten-minute window throughout the day in which a species of bat was detected. Rows are given for each species in the recordings. In Excel this looks good as a bar chart or a radar chart.
    v7.0.6789 hopefully fixes the problems when trying to update a database that caused the program to crash on startup if the database did not contain the more recent Version table.
    v7.0.6799 cosmetic changes to use the normal file selection dialog instead of the folder browser dialog; also, when using Analyse and Import, you no longer need to pick a file when selecting the .wav file folder.
    v7.0.6820 Adds session data to all report formats, including pass statistics for all species found in that session.
    v7.0.6844 Adds the ability to add, save, adjust and include in exported images, Fiducial lines. Lines can be added, deleted or adjusted in the image comparison window and are saved to the database when the window is closed. For exported images the lines are permanently overlaid on the image and are no longer adjustable.
    v7.0.6847 Makes slight improvements to the aspect ratio of images in the comparison window; when images are exported the fiducial lines are only included if the FIDS button is depressed.
    v7.0.6850 Fixes an occasional bug when saving images through Analyse and Import - using filenames in the caption has priority over bats' names. Also improvements in file handling when changing databases - now attempts to recognise if a db is the right type.
    v7.0.6858 Makes some improvements to image handling, including a modification to the database structure to allow long descriptions for images (previously description+caption had to be less than 250 chars) and the ability to copy images within the application (but not to external applications). A single image may now be used simultaneously as a bat image, a call image or a segment image. Changes to it in one location will be reflected in all the other locations. On deletion the link is removed, and if there are no remaining links for the image then the image itself will be removed from the database.
    v7.0.6859 has some improvements to the image handling system. In the batReference view the COMP button now adds all bat and call images for all selected bats to the comparison window. Double-clicking on a bat adds all bat, call and segment images for all the bats selected to the comparison window.
    v7.0.6860 removed the COMP button from the bat reference view. Double-clicking in this view transfers all images of bats, calls and recordings to the comparison window. Double-clicking in the ListByBats view transfers all recording images but not the bat and call images to the comparison window. Exported images for recordings use the recording filename plus the start offset of the segment as a filename, or alternatively the image caption.
    v7.0.6866 Improvements to the grids and to grid scaling and movement, especially for the sonagram grids.
    v7.0.6876 Added the ability to right-click on a labelled segment in the recordings detail list control, to open that recording in Audacity and scroll to the location of that labelled segment. Only one instance of Audacity may be opened at a time or the scrolling does not work. Also made some improvements to the scrolling behaviour of the recording detail window.
    Version 7.1 makes significant changes to the way in which the RecordingSessions list is displayed. Because this list can get quite large and therefore takes a long time to load, it now loads the data in discrete pages. At the top of the RecordingSessions list is a new navigation bar with a set of buttons and two combo-boxes. The rightmost combo-box is used to set the number of items that will be loaded and displayed on a page. The selections are currently 10, 25, 50 and 100. Slower machines may find it advantageous to use smaller page sizes in order to speed up load times and reduce the demand for memory and CPU time. The other combo-box allows the selection of a sort field for the session list. Sessions are displayed in columns in a DataGrid which allows columns to be re-sized, moved and sorted. These functions all now only apply to the subset of data that has been loaded as a page. The combo-box allows you to sort the full set of data in the database before loading the page. Thus if the combo-box is set to sort on DATE with a page size of 10, then only the 10 earliest (or the 10 latest, depending on the direction of sorting) sessions in the database will be loaded. The displayed set of sessions can be sorted on the screen by clicking the column headers, but this only changes the order on the screen; it does not load any other sessions from the database. The four buttons can be used to load the next or previous pages or to move to the start or end of the complete database collection. The Next or Previous buttons move the selection by 2/3 of the page size so that there will always be some visual overlap between pages. The sort combo-box has two entries for each field, one with a suffix of ^ and one with a suffix of v. These sort the database in ascending or descending order. Selecting a sort field will update the display and sort the display entries on the same field, but the sort direction of the displayed items will be whatever was last used. Clicking the column header will change the direction of sort for the displayed items.
    v7.1.6885 Updates the database to DB version 6.2 by the addition of two link tables, between bats and recordings and between bats and sessions. These tables enable much faster access to bat-specific data. Also various improvements to the speed of loading data when switching to the ListByBats view, especially with very large databases.
    v7.1.6891 Further performance improvements in loading ListByBats and in loading images.
    v7.1.6901 Has the ability to perform screen grabs of images without needing an external screen-grabber program. Shift-click on the 'PASTE' button and drag and resize the semi-transparent window to select a screen area; right-click in the window to capture that portion of the screen. For details refer to Import/Import Pictures.
    v7.1.6913 Fixed some scaling issues on fiducial lines in the comparison window.
    v7.1.6915 Bugfix for adjusting fiducial lines - 7.1.6913 removed.
    v7.1.6941 Improvements and adjustments to grid and fiducial line handling.
    v7.1.6951 Fixes some problems with the Search dialog.
    v7.2.6970 Introduces the ability to replay segments at reduced speed or in heterodyne 'bat detector' mode.
    v7.2.6971 When opening a recording or segment in Audacity the corresponding .txt file will be opened as a label track. NB this only works if there is only a single copy of Audacity open - subsequent calls with Audacity still open do not open the label track.
    v7.2.6978 Improvements to heterodyne playback to use a pure sine wave.
    v7.2.6984 Bug fixes and mods to image handling - image captions can now have a region appended in seconds after the file name.
    BRM-Aud-Setup_v7_2_7000.exe: this version includes its own private copy of Audacity 2.3.0 portable, which will be placed in the same folder as BRM and has its own pre-configured configuration file appropriate for use with BRM. This will not interfere with any existing installation of Audacity but provides all the Audacity features required by BRM with no further action by the user. BRM will use this version to display .wav files.
    v7.2.7000 also includes a new report format which is tailored to provide data for the Hertfordshire Mammals, Amphibians and Reptiles survey. It also displays the GPS co-ordinates for the Recording Session as an OS Grid Reference as well as latitude and longitude.
    v7.2.7010 Speed improvements and bug fixes to opening and running Audacity through BRM. Audacity portable is now located in C:\audacity-win-portable instead of under the BRM program folder.
    v7.2.7012 Fixed some bugs in report generation when producing the Frequency Table. Enabled the AddTag button in the BatReference pane.
    v7.2.7021 Upgrades the Audacity component to version 2.3.1 and a few minor bug fixes.

  15. Romanian Institute of Statistics

    • hosted-metadata.bgs.ac.uk
    Updated Dec 13, 2012
    Cite
    Romanian Institute of Statistics, Bd Libertatii nr.16 sector 5, +4021 3181824; +4021 3181842, romstat@insse.ro (2012). Romanian Institute of Statistics [Dataset]. https://hosted-metadata.bgs.ac.uk/geonetwork/srv/api/records/70c8d9f8-a012-4a19-8663-b66d2e4c8e61
    Explore at:
    Dataset updated
    Dec 13, 2012
    Dataset provided by
    NSI Romania (http://www.insse.ro/cms/ro)
    British Geological Survey (https://www.bgs.ac.uk/)
    Area covered
    Description

    TEMPO-Online provides the following functions and services: free access to statistical information, and export of tables in .csv and .xls formats as well as printing.

    What is the content of TEMPO-Online? The National Institute of Statistics offers a statistical database, TEMPO-Online, that gives access to a large range of information. The content of the database consists of: approximately 1100 statistical indicators, divided into socio-economic fields and sub-fields; metadata associated with the statistical indicators (definition, starting and ending year of the time series, the last period of data loading, statistical methodology, the last update); detailed indicators at statistical characteristic group and/or sub-group level (e.g. the total number of employees at the end of the year by employee category, activities of the national economy - sections, sexes, areas and counties); and time series starting with 1990 until today, with monthly, quarterly, semi-annual and annual frequency, at national, development region, county and commune level.

    Search by key words: the key-word search allows various objects (tables with statistical variables divided on time series) to be found. The search returns results based on the matrix code and on the key words in the title or in the definition of a matrix. The results are shown as a list of matching objects. The search section is available from the menu bar on the left.

    Tables: as a whole, the tables that result from an interrogation have a flexible structure. For instance, the user may select the variables and attributes with the help of the interrogation interface, according to his needs. The user can save the table that results from an interrogation in .csv and .xls formats, or print it. Note: in order to access tables at place (locality) level, which are very large, the user has to select each county with its respective places, so that access is faster and technical blocks are avoided.

    Website: http://statistici.insse.ro/shop/?lang=en
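
    As a minimal sketch of working with a table exported from TEMPO-Online (the portal exports .csv and .xls, as described above), the snippet below reads a saved export with pandas; the filename is hypothetical and the separator/encoding should be adjusted to match the actual export.

        # Minimal sketch: load a table exported from TEMPO-Online.
        # "tempo_export.csv" is a hypothetical filename for whatever table you saved.
        import pandas as pd

        df = pd.read_csv("tempo_export.csv", sep=",", encoding="utf-8")
        print(df.head())             # first rows of the exported indicator table
        print(df.columns.tolist())   # the characteristics included in the export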

  16. Data set of the effects of endogenous potassium and calcium metal ions on the characteristics of pyrolysis gas, solid and liquid products of corn stover

    • scidb.cn
    Updated Oct 24, 2025
    Cite
    cai han le (2025). Data set of the effects of endogenous potassium and calcium metal ions on the characteristics of pyrolysis gas, solid and liquid products of corn stover [Dataset]. http://doi.org/10.57760/sciencedb.j00124.00271
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 24, 2025
    Dataset provided by
    Science Data Bank
    Authors
    cai han le
    Description

    This dataset originates from the experimental study titled "Effects of Endogenous Potassium and Calcium Metal Ions in Corn Stalk on the Characteristics of Its Pyrolysis Gaseous, Solid, and Liquid Products," aiming to obtain correlated data between metal ion concentration, pyrolysis temperature, and the yield and characteristics of pyrolysis products (biochar, bio-oil, and syngas) through systematic experiments, with all data generated via experimental determination and standardized processing. The data generation process is as follows: first, experimental materials were prepared—corn stalks used as raw material were collected from farms in Guannan County, Lianyungang City, Jiangsu Province, crushed into powder with a particle size of 80–120 meshes using a crusher, dried to absolute dryness in an oven at 105°C, and then sealed for later use; the corn stalk powder (CS-Raw) was subjected to acid washing for ash removal by immersing it in 1 mol/L hydrochloric acid at a solid-to-liquid ratio of 1:10, stirring at room temperature for 12 hours, and then undergoing filtration, rinsing, and drying to obtain the ash-removed sample (CS-AW); subsequently, using KCl and CaCl₂ as metal sources, CS-AW was immersed in deionized water containing the corresponding metal salts at a solid-to-liquid ratio of 1:10, with metal ion concentrations (mass ratio relative to the raw material) set at 2%, 5%, and 7%, and after drying, metal-loaded samples (CS-K-2%/5%/7%, CS-Ca-2%/5%/7%) were obtained. Pyrolysis experiments were conducted using a self-made fixed-bed device (comprising a gas supply system with high-purity nitrogen cylinders, high-purity oxygen cylinders, and gas flow controllers; a reaction system with a temperature controller and a heating reactor; a liquid collection system with a low-temperature bath and a condenser; and a gas collection system with desiccants and gas collection bags); for each experiment, 3 g of sample was weighed and placed in a quartz tube, purged with N₂ for 10 minutes, then heated to 400°C, 500°C, and 600°C at a heating rate of 20°C/min, held at the target temperature for 20 minutes, and after cooling, the solid residual char was weighed (M1); the mass of the liquid product (M2) was obtained from the mass difference of the condenser, and the gas yield was calculated as 100% minus the solid yield minus the liquid yield. During data processing, the net organic char yield of biochar was calculated using the formula: "(mass of residual char − mass of metal salt) / (mass of raw material − mass of metal salt) × 100%"; the oxygen (O) content on a dry basis was derived by "100% − C − H − N − S − ash content"; the components of bio-oil were qualitatively analyzed via the peak area normalization method using a gas chromatography-mass spectrometry (GC/MS) instrument combined with the NIST spectral library; and the gas components were determined using a gas chromatography (GC) instrument equipped with a thermal conductivity detector (TCD) and a flame ionization detector (FID). 
Experimental characterization relied on various instruments: elemental analysis was performed using an Elementary Vario EL III Automatic Elemental Analyzer (Elementar, Germany); higher heating value was measured using a ZDHW-300A Microcomputer-Automatic Calorimeter (Keda Instrument Co., Ltd., Hebi City); proximate analysis was conducted in accordance with the GB/T 28731—2012 standard; the content of alkali and alkaline earth metals (AAEMs) was determined using an iCAP 7000 Inductively Coupled Plasma Optical Emission Spectrometer (ICP-OES, Thermo Fisher Scientific, USA); thermogravimetric analysis was carried out using a TG209 F1 Libra Thermogravimetric Analyzer (Netzsch, Germany) with an N₂ flow rate of 40 mL/min and a heating rate of 20°C/min up to 800°C; gas component analysis was performed using a GC9890B Gas Chromatograph (Renhua Chromatography Technology Co., Ltd., Nanjing) equipped with a Porapak Q column and a 13X molecular sieve; and liquid component analysis was conducted using an ISQ7000 Gas Chromatography-Mass Spectrometry (GC/MS) Instrument (Thermo Fisher Scientific, USA) with an HP-5MS capillary column and high-purity helium as the carrier gas. In terms of spatiotemporal information, there is no continuous time-series data in the time dimension, only instantaneous experimental data at three pyrolysis temperatures (400°C, 500°C, and 600°C), and all experimental operations were completed within the same time period to ensure consistent conditions; in the spatial dimension, the raw material was collected from a specific farm in Guannan County, Lianyungang City, Jiangsu Province (single-point sampling, no spatial gradient distribution), all experiments were conducted in the laboratories of the Bamboo Industry Institute and the College of Environment and Resources, Zhejiang A & F University, and the spatial resolution focuses on the laboratory experimental equipment and the raw material sampling point, with no large-scale spatial extension data. 
The table data includes 5 structured data tables (Table 1 to Table 5): Table 1, titled "Elemental and Proximate Analysis of Corn Stalk Before and After Acid Washing for Ash Removal and Metal Ion Loading," contains 8 records (covering CS-Raw, CS-AW, and 6 metal-loaded samples), with column labels including elemental analysis (C, H, O, N, S, unit: wt%, on a dry and ash-free basis (daf)), proximate analysis (volatiles, fixed carbon, ash, unit: wt%, on a dry basis (db)), and higher heating value (unit: MJ/kg); Table 2, titled "AAEM Contents in Corn Stalk Before and After Acid Washing for Ash Removal," includes 2 records (CS-Raw, CS-AW), with column labels including AAEM contents (K, Na, Ca, Mg, unit: μg/g) and removal rates (unit: %); Table 3, titled "Residual Char Rate of Corn Stalk Pyrolysis Under Different Concentrations of Potassium Ions and Calcium Ions," has 8 records (the same samples as in Table 1), with column labels being total residual char rate (unit: %) and net organic residual char rate (unit: %); Table 4, titled "Effects of Different Concentrations of Potassium Ions and Calcium Ions on the Basic Characteristics of Corn Stalk Pyrolysis Char," contains 8 records (the same samples as in Table 1), with column labels consistent with those of Table 1 (O is marked as O*, indicating a calculated value on a dry basis); Table 5, titled "Effects of Pyrolysis Temperature on the Elemental and Proximate Analysis of Corn Stalk Pyrolysis Biochar," includes 6 records (400-7% K, 500-7% K, 600-7% K, 400-7% Ca, 500-7% Ca, 600-7% Ca), with column labels consistent with those of Table 1. In terms of data integrity, there is no obvious data missing, and all samples designed in the experiment (8 basic samples and 6 temperature-metal combination samples) have undergone testing for key indicators such as elemental analysis, proximate analysis, yield determination, and component analysis; the sources of errors mainly include mass errors caused by balance precision during sample weighing (e.g., weighing of M1 and M2), leakage errors that may be caused by the tightness of gas collection bags during gas collection, relative analysis errors from the GC/MS peak area normalization method, and weight loss rate errors caused by sample uniformity in thermogravimetric analysis; the experiment reduced errors by controlling conditions such as N₂ purging time (≥10 minutes), heating rate stability (20°C/min), metal salt weighing precision, and consistency of the solid-to-liquid ratio for acid washing, and although no specific error range is clearly given, the data meet the precision requirements of conventional laboratory experiments (e.g., mass weighing error ≤ 0.001 g, temperature control error ≤ ±5°C). The types of data files include structured table data (Excel format), experimental graph data (thermogravimetric curves, product yield and component distribution diagrams, with source files in Origin format), device schematic diagrams (PPT format), and original instrument data (e.g., .raw format of GC/MS, which can be opened using Excel); there are no files in niche formats, and all files are compatible with conventional scientific research software.
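
    The yield calculations quoted above can be written out directly. The sketch below restates them in Python; all numeric inputs are placeholders for illustration, not values taken from the data tables.

        # Sketch of the yield formulas described above (inputs are placeholders).
        def net_organic_char_yield(residual_char_g, metal_salt_g, raw_material_g):
            # (mass of residual char - mass of metal salt) /
            # (mass of raw material - mass of metal salt) * 100%
            return (residual_char_g - metal_salt_g) / (raw_material_g - metal_salt_g) * 100.0

        def oxygen_dry_basis(c, h, n, s, ash):
            # O (dry basis, wt%) = 100% - C - H - N - S - ash
            return 100.0 - c - h - n - s - ash

        def gas_yield(solid_yield_pct, liquid_yield_pct):
            # Gas yield = 100% - solid yield - liquid yield
            return 100.0 - solid_yield_pct - liquid_yield_pct

        # Example with made-up numbers (not from the published tables):
        print(net_organic_char_yield(1.05, 0.15, 3.0))     # net organic char yield, %
        print(oxygen_dry_basis(45.2, 5.9, 0.8, 0.1, 4.5))  # O content, wt% (db)
        print(gas_yield(30.0, 45.0))                       # gas yield, %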

  17. Long Term Development Statement (LTDS) Table 2b Transformer Data - 3W

    • ukpowernetworks.opendatasoft.com
    Updated Nov 28, 2025
    + more versions
    Cite
    (2025). Long Term Development Statement (LTDS) Table 2b Transformer Data - 3W [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ltds-table-2b-transformer-data-3w/
    Explore at:
    Dataset updated
    Nov 28, 2025
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Introduction

    The Long Term Development Statements (LTDS) report on a 0-5 year period, describing a forecast of load on the network and envisioned network developments. The LTDS is published at the end of May and November each year. This is Table 2b from our current LTDS report (published 28 November 2025), showing the transformer information for three-winding (1x High Voltage, 2x Low Voltage) transformers associated with each Grid and Primary substation where applicable. More information and full reports are available from the landing page below: Long Term Development Statement and Network Development Plan Landing Page

    Methodological Approach

    Site Functional Locations (FLOCs) are used to associate each transformer with the substation where it is located (see Key characteristics of active Grid and Primary sites — UK Power Networks). An ID field is added to identify the row number for reference purposes.

    Quality Control Statement

    Quality control measures include:

    Verification steps to match features only with confirmed functional locations. Manual review and correction of data inconsistencies. Use of additional verification steps to ensure accuracy in the methodology.

    Assurance Statement

    The Open Data Team and Network Insights Team worked together to ensure data accuracy and consistency.

    Other

    Download dataset information: Metadata (JSON). Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/ To view this data, please register and log in.
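
    For programmatic access, a sketch along the following lines may work against the portal's Opendatasoft export API; the endpoint path and the API key parameter are assumptions about this portal (registration and login may still be required, as noted above), and the dataset id comes from the dataset URL.

        # Sketch: download LTDS Table 2b as CSV via the Opendatasoft Explore API.
        # The /api/explore/v2.1/.../exports/csv path and the "apikey" parameter are
        # assumptions; an account/API key may be needed on this portal.
        import requests

        BASE = "https://ukpowernetworks.opendatasoft.com"
        DATASET = "ltds-table-2b-transformer-data-3w"
        url = f"{BASE}/api/explore/v2.1/catalog/datasets/{DATASET}/exports/csv"

        resp = requests.get(url, params={"apikey": "YOUR_API_KEY"}, timeout=60)
        resp.raise_for_status()

        with open(f"{DATASET}.csv", "wb") as fh:
            fh.write(resp.content)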

  18. National Survey on Drug Use and Health (NSDUH)

    • dataverse.harvard.edu
    • search.dataone.org
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). National Survey on Drug Use and Health (NSDUH) [Dataset]. http://doi.org/10.7910/DVN/ZIGNUL
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    analyze the national survey on drug use and health (nsduh) with r

    the national survey on drug use and health (nsduh) monitors illicit drug, alcohol, and tobacco use with more detail than any other survey out there. if you wanna know the average age at first chewing tobacco dip, the prevalence of needle-sharing, the family structure of households with someone abusing pain relievers, even the health insurance coverage of peyote users, you are in the right place. the substance abuse and mental health services administration (samhsa) contracts with the north carolinians over at research triangle institute to run the survey, but the university of michigan's substance abuse and mental health data archive (samhda) holds the keys to this data castle. nsduh in its current form only goes back about a decade, when samhsa re-designed the methodology and started paying respondents thirty bucks a pop. before that, look for its predecessor - the national household survey on drug abuse (nhsda) - with public use files available back to 1979 (included in these scripts). be sure to read those changes in methodology carefully before you start trying to trend smokers' virginia slims brand loyalty back to 1999. although (to my knowledge) only the national health interview survey contains r syntax examples in its documentation, the friendly folks at samhsa have shown promise. since their published data tables were run on a restricted-access data set, i requested that they run the same sudaan analysis code on the public use files to confirm that this new r syntax does what it should. they delivered, i matched, pats on the back all around. if you need a one-off data point, samhda is overflowing with options to analyze the data online. you even might find some restricted statistics that won't appear in the public use files. still, that's no substitute for getting your hands dirty. when you tire of menu-driven online query tools and you're ready to bark with the big data dogs, give these puppies a whirl. the national survey on drug use and health targets the civilian, noninstitutionalized population of the united states aged twelve and older.

    this new github repository contains three scripts:

    1979-2011 - download all microdata.R: authenticate the university of michigan's "i agree with these terms" page; download, import, save each available year of data (with documentation) back to 1979; convert each pre-packaged stata do-file (.do) into r, run the damn thing, get NAs where they belong.

    2010 single-year - analysis examples.R: load a single year of data; limit the table to the variables needed for an example analysis; construct the complex sample survey object; run enough example analyses to make a kitchen sink jealous.

    replicate samhsa puf.R: load a single year of data; limit the table to the variables needed for an example analysis; construct the complex sample survey object; print statistics and standard errors matching the target replication table.

    click here to view these three scripts. for more detail about the national survey on drug use and health, visit: the substance abuse and mental health services administration's nsduh homepage; research triangle institute's nsduh homepage; the university of michigan's nsduh homepage.

    notes: the 'download all microdata' program intentionally breaks unless you complete the clearly-defined, one-step instruction to authenticate that you have read and agree with the download terms. the script will download the entire public use file archive, but only after this step has been completed. if you contact me for help without reading those instructions, i reserve the right to tease you mercilessly. also: thanks to the great hadley wickham for figuring out how to authenticate in the first place. confidential to sas, spss, stata, and sudaan users: did you know that you don't have to stop reading just because you've run out of candlewax? maybe it's time to switch to r. :D

  19. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    csv (available download formats)
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored in .csv format.

    Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

    The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.

    Marked-up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12,000 labeled snippets (markup_data_20220415.csv).

    As the marked-up code block data contains the numeric id of each code block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
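
    A minimal sketch of loading the corpus with pandas is shown below. The file names come from the description above; the commented-out join key is an assumption and should be checked against the actual column headers.

        # Sketch: load the Code4ML tables named in the description (columns not verified).
        import pandas as pd

        competitions = pd.read_csv("competitions.csv")
        code_blocks = pd.concat(
            [pd.read_csv("code_blocks_upto_20.csv"), pd.read_csv("code_blocks_21.csv")],
            ignore_index=True,
        )
        markup = pd.read_csv("markup_data_20220415.csv")
        graph = pd.read_csv("actual_graph_2022-06-01.csv")  # semantic type id -> class/subclass

        print(len(code_blocks), "code blocks;", len(markup), "labelled snippets")
        # Hypothetical join: attach semantic class names to the labelled snippets,
        # assuming both tables share a semantic-type id column (name unverified):
        # markup = markup.merge(graph, on="semantic_type_id", how="left")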

  20. Data from: PSP Solar Wind Electrons Alphas and Protons (SWEAP) SPAN-A Full...

    • catalog.data.gov
    • heliophysicsdata.gsfc.nasa.gov
    Updated Sep 19, 2025
    + more versions
    Cite
    NASA Space Physics Data Facility (SPDF) Data Services (2025). PSP Solar Wind Electrons Alphas and Protons (SWEAP) SPAN-A Full 3D Electron Spectra, Level 2 (L2), 14 s Data [Dataset]. https://catalog.data.gov/dataset/psp-solar-wind-electrons-alphas-and-protons-sweap-span-a-full-3d-electron-spectra-level-2-
    Explore at:
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    SPAN-E Level 2 Electron Full 3D Spectra Data

    File Naming Format: psp_swp_spa_sf0_L2_16Ax8Dx32E_YYYYMMDD_v01.cdf

    The SF0 products are the Full 3D Electron spectra from each individual SPAN-E instrument, SPAN-Ae and SPAN-B. Units are in differential energy flux, degrees, and eV. One spectrum comprises decreasing steps in Energy specified by the number in the filename, alternating sweeps in Theta/Deflection, also specified by the number in the filename, and a number of Phi/Anode directions, also specified by the number in the filename. The sample filename above includes 16 Anodes, 8 Deflections, and 32 Energies. This data set covers all periods for which the instrument was turned on and taking data in the solar wind in "Full Sweep", normal cadence survey mode. This includes maneuvers affecting the spacecraft attitude and orientation. Measurements taken by SPAN-B during cruise phase periods when the spacecraft is pointed away from the sun are taken in sunlight.

    Parker Solar Probe SWEAP Solar Probe Analyzer (SPAN) Electron Data Release Notes: November 19, 2019 Initial Data Release

    Overview of Measurements

    The SWEAP team is pleased to release the data from Encounter 1 and Encounter 2. The files contain data from the time range October 31, 2018 - June 18, 2019. The prime mission of Parker Solar Probe is to take data when within 0.25 AU of the Sun during its orbit; however, there have been some extended campaign measurements outside of this distance. The data are available for those days that are within 0.25 AU as well as those days when the instruments were operational outside of 0.25 AU. Each SWEAP data file includes a set of a particular type of measurements over a single observing day. Measurements are provided in Common Data Format (CDF), a self-documenting data framework for which convenient open-source tools exist across most scientific computing platforms. Users are strongly encouraged to consult the global metadata in each file, and the metadata that are linked to each variable. The metadata include comprehensive listings of relevant information, including units, coordinate systems, qualitative descriptions, measurement uncertainties, methodologies, links to further documentation, and so forth.

    SPAN-E Level 2 Version 01 Release Notes

    The SPAN-Ae and SPAN-B instruments together have fields of view covering >90% of the sky; major obstructions to the FOV include the spacecraft heat shield and other intrusions by spacecraft components. Each individual SPAN-E has an FOV of ±60° in Theta and 240° in Phi. The rotation matrices to convert into the spacecraft frame can be found in the individual CDF files, or in the instrument paper. This data set covers all periods for which the instrument was turned on and taking data in the solar wind in ion mode. This includes maneuvers affecting the spacecraft attitude and orientation. Measurements taken by SPAN-B when the spacecraft is pointed away from the sun are taken in sunlight. The data quality flags for the SPAN data can be found in the CDF files as QUALITY_FLAG (0=good, 1=bad).

    General Remarks for Version 01 Data

    Users interested in field-aligned electrons should take care regarding potential blockages from the heat shield when B is near radial, especially in SPAN-Ae; artificial reductions in strahl width can result. Due to the relatively high electron temperature in the inner heliosphere, many secondary electrons are generated from spacecraft and instrument surfaces. As a result, electron measurements in this release below 30 eV are not advised for scientific analysis. The fields of view in SPAN-Ae and SPAN-B have many intrusions by the spacecraft, and erroneous pixels discovered in analysis, in particular near the edges of the FOV, should be viewed with skepticism. Details on FOV intrusion are found in the instrument paper, forthcoming, or by contacting the SPAN-E instrument scientist. The instrument mechanical attenuators are engaged during the eight days around perihelia 1 and 2, which results in a factor of about 10 reduction of the total electron flux into the instrument. During these eight days, halo electron measurements are artificially enhanced in the L2 products as a result of the reduced instrument geometric factor and subsequent ground corrections. A general note for Encounter 1 and Encounter 2 data: a miscalculation in the deflection tables loaded to both SPAN-Ae and SPAN-B resulted in over-deflection of the outermost Theta angles during these encounters. As such, pixels at large Thetas should be ignored. This error was corrected by a table upload prior to Encounter 3. Lastly, when viewing time gaps in the SPAN-E measurements, be advised that the first data point produced by the instrument after a power-on is the maximum value permitted by internal instrument counters. Therefore, the first data point after powerup is erroneous and should be discarded, as indicated by quality flags.

    SPAN-E Encounter 1 Remarks

    SPAN-E operated nominally for the majority of the first encounter. Exceptions to this include a few instances of corrupted, higher-energy sweep tables, and an instrument commanding error for the two hours surrounding perihelion 1. These and other instrument diagnostic tests are indicated with the QUALITY_FLAG variable in the CDFs. The mechanical attenuator was engaged for the 8 days around perihelion 1: as a result the microchannel plate (MCP) noise due to thermal effects and cosmic rays is artificially enhanced and is particularly obvious at higher energies. Exercise caution with this data release if looking for halo electrons when the mechanical attenuator is engaged.

    SPAN-E Cruise Phase Remarks

    The cruise mode rates of SPAN-E are greatly reduced compared to the encounter mode rates. When the PSP spacecraft is in a communications slew, the SPAN-B instrument occasionally reaches its maximum allowable operating temperature and is powered off by SWEM. Timing for the SF1 products in cruise phase is not corrected in v01, and thus it is not advised to use that data at this time for scientific analysis. The typical return of SF0 products is one spectrum out of every 32 survey spectra, returned every 15 minutes or so. One out of every four 27.75 s SF1 spectra is produced every 111 s.

    SPAN-E Encounter 2 Remarks

    SPAN-E operated nominally for the majority of the second encounter. Exceptions include instrument diagnostic and health checks and a few instances of corrupted high-energy sweep tables. These tests and corrupted table loads are indicated with the QUALITY_FLAG parameter. The mechanical attenuator was engaged for the 8 days around perihelion 2: as a result the MCP noise due to thermal effects and cosmic rays is artificially enhanced and is particularly obvious at higher energies. Exercise caution in this data release if looking for halo electrons when the mechanical attenuator is engaged.

    Parker Solar Probe SWEAP Rules of the Road

    As part of the development of collaboration with the broader Heliophysics community, the mission has drafted a "Rules of the Road" to govern how PSP instrument data are to be used. 1) Users should consult with the PI to discuss the appropriate use of instrument data or model results and to ensure that the users are accessing the most recently available versions of the data and of the analysis routines. Instrument team Science Operations Centers (SOCs) and/or Virtual Observatories (VOs) should facilitate this process, serving as the contact point between PI and users in most cases. 2) Users should heed the caveats of investigators as to the interpretations and limitations of data or model results. Investigators supplying data or models may insist that such caveats be published. Data and model version numbers should also be specified. 3) Browse products, Quicklook, and Planning data are not intended for science analysis or publication and should not be used for those purposes without consent of the PI. 4) Users should acknowledge the sources of data used in all publications, presentations, and reports: "We acknowledge the NASA Parker Solar Probe Mission and the SWEAP team led by J. Kasper for use of data." 5) Users are encouraged to provide the PI a copy of each manuscript that uses the PI data prior to submission of that manuscript for consideration of publication. On publication, the citation should be transmitted to the PI and any other providers of data.
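
    As a sketch of how one of these daily CDF files might be read in Python, the snippet below uses the cdflib package. The example filename date is hypothetical, and apart from QUALITY_FLAG (documented above) any variable names are assumptions that should be checked against each file's own metadata.

        # Sketch: open one SPAN-Ae SF0 daily file and keep only good-quality records.
        # cdflib is one common CDF reader; variable names other than QUALITY_FLAG are
        # not assumed here -- list what the file actually contains and consult its metadata.
        import cdflib
        import numpy as np

        cdf = cdflib.CDF("psp_swp_spa_sf0_L2_16Ax8Dx32E_20190404_v01.cdf")  # hypothetical date

        print(cdf.cdf_info())  # lists the variables actually present in this file

        quality = np.asarray(cdf.varget("QUALITY_FLAG"))  # 0 = good, 1 = bad (per the notes above)
        good = quality == 0
        print(f"{good.sum()} of {good.size} records flagged good")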
