100+ datasets found
  1. Metadata for Pavlovic et al. - Machine Learning Critical Loads

    • catalog.data.gov
    Updated Feb 8, 2024
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). Metadata for Pavlovic et al. - Machine Learning Critical Loads [Dataset]. https://catalog.data.gov/dataset/metadata-for-pavlovic-et-al-machine-learning-critical-loads
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This is the metadata associated with Pavlovic et al. (2023) entitled "Empirical nitrogen and sulfur critical loads of U.S. tree species and their uncertainties with machine learning" (https://www.sciencedirect.com/science/article/pii/S0048969722063513). It is not EPA data, and the data and associated metadata are already publicly available on the journal website. This dataset is associated with the following publication: Pavlovic, N., S. Chang, J. Huang, K. Craig, C. Clark, K. Horn, and C. Driscoll. Empirical nitrogen and sulfur critical loads of U.S. tree species and their uncertainties with machine learning. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 857: 1-10, (2022).

  2. Data from: Metadata Classification Machine Learning Data

    • osti.gov
    Updated Sep 18, 2024
    Cite
    Collier, Hannah; Enright, Eric (2024). Metadata Classification Machine Learning Data [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/2446583
    Dataset updated
    Sep 18, 2024
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Atmospheric Radiation Measurement (ARM) Archive; Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Atmospheric Radiation Measurement (ARM) Data Center
    Authors
    Collier, Hannah; Enright, Eric
    Description

    This GitLab project contains the training data that was used for the metadata machine learning classification project.

  3. Indexed NLP Article Metadata Dataset

    • search.dataone.org
    Updated Dec 16, 2023
    Cite
    Canchila, Santiago; Meneses-Eraso, Carlos; Casanoves-Boix, Javier; Cortés-Pellicer, Pascual; Castelló-Sirvent, Fernando (2023). Indexed NLP Article Metadata Dataset [Dataset]. http://doi.org/10.7910/DVN/5YIGNG
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Canchila, Santiago; Meneses-Eraso, Carlos; Casanoves-Boix, Javier; Cortés-Pellicer, Pascual; Castelló-Sirvent, Fernando
    Time period covered
    Jan 1, 1987 - Apr 1, 2023
    Description

    This dataset consists of a curated collection of published, indexed articles (N=75527) related to Natural Language Processing (NLP) collected from Web Of Science, along with a classification into one of five categories depending on the approach to NLP used.

    • Category 4: The abstract does not mention a particular model or technique. Papers analyzing frameworks, surveys, papers centered on the computer vision component of NLP, and dataset proposals, among others, fall into this category.
    • Category 0 (Rule-Based): A model based on rules or symbolic analysis is used.
    • Category 1 (Statistical Methods): An approach using statistical methods is used. This includes BoWs, N-Grams, TF-IDF, along with other machine learning techniques like SVMs, Logistic Regression, LDA and others. Shallow neural network models like word2vec also belong in this category.
    • Category 2 (Deep Learning): Approaches that use Deep Learning and other Deep Neural Network architectures such as RNNs, CNNs and LSTM are included in this category.
    • Category 3 (Transformer Models): The approach proposed uses transformer-based models, like BERT, GPT, T5 and others.

    Note that the classification could be imprecise, is not strictly defined, and should be used only as a starting point.

    Fields: 'Authors', 'Article Title', 'Volume', 'Issue', 'Special Issue', 'Start Page', 'End Page', 'DOI', 'Book DOI', 'Publication Date', 'Times Cited', 'ISSN', 'eISSN', 'Author Full Names', 'Book Author Full Names', 'Language', 'Author Keywords', 'Keywords', 'Funding Orgs', 'Funding Text', 'Cited References', 'DOI Link', 'Number of Pages', 'Categories', 'Research Areas', 'bert_preds', 'setfit_preds', 'knn_preds', 'abstract_hash'.

    The dataset is provided in different formats. To address potential copyright, licensing, and data privacy concerns, we have replaced the original abstracts with SHA-256 hashes, cryptographic representations of the abstracts' content.

    Please note that the copyright and licensing status of the original articles may vary, and users should respect any applicable terms and restrictions associated with the source publications.
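
    Because the abstracts are published only as SHA-256 hashes, a reader who holds an abstract from another source could check it against the abstract_hash field. The following is a minimal sketch, assuming the hashes are hex-encoded SHA-256 digests of the UTF-8 abstract text with no extra normalization; the description does not state the exact encoding or preprocessing, and the function names are hypothetical.

```python
import hashlib


def abstract_sha256(abstract: str) -> str:
    """Hex SHA-256 digest of an abstract (assumes UTF-8, no normalization)."""
    return hashlib.sha256(abstract.encode("utf-8")).hexdigest()


def matches_hash(abstract: str, abstract_hash: str) -> bool:
    """Compare a locally held abstract with a stored abstract_hash value."""
    # Hex digests are case-insensitive, so normalize the stored value.
    return abstract_sha256(abstract) == abstract_hash.strip().lower()
```

    If the dataset authors trimmed or otherwise normalized the text before hashing, the same step would have to be applied before calling abstract_sha256.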

  4. Statistics and Evaluation Data for Publication "Using Supervised Learning to...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated May 24, 2020
    + more versions
    Cite
    Tobias Weber; Michael Fromm; Nelson Tavares de Sousa (2020). Statistics and Evaluation Data for Publication "Using Supervised Learning to Classify Metadata of Research Data by Discipline of Research" [Dataset]. http://doi.org/10.5281/zenodo.3490468
    Available download formats: application/gzip
    Dataset updated
    May 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tobias Weber; Michael Fromm; Nelson Tavares de Sousa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Automated classification of metadata of research data by their discipline(s) of research can be used in scientometric research, by repository service providers, and in the context of research data aggregation services. Openly available metadata of the DataCite index for research data were used to compile a large training and evaluation set comprising 609,524 records. This publication contains aggregated data for the paper. It also contains the evaluation data of all model/hyper-parameter training and test runs.

  5. Data from: A metadata-based approach for research discipline prediction...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 27, 2023
    Cite
    ALI-ELDIN, Amr (2023). A metadata-based approach for research discipline prediction using machine learning techniques and distance metrics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944966
    Dataset updated
    Jul 27, 2023
    Dataset provided by
    ALI-ELDIN, Amr
    PHAM, Hoang-Son
    POELMANS, Hanne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is based on the paper:

    Hoang-Son Pham, Hanne Poelmans and Amr Ali-Eldin, “A metadata-based approach for research discipline prediction using machine learning techniques and distance metrics”, IEEE Access (2023).

    The dataset includes:

    1. a list of project metadata extracted from FRIS portal

    2. a list of VODS disciplines

    3. a distance matrix

    • Kindly refer to our paper for more details on the dataset.

    https://ieeexplore.ieee.org/document/10156853

  6. Table 1_Artificial intelligence in breast cancer survival prediction: a...

    • frontiersin.figshare.com
    docx
    Updated Jan 7, 2025
    + more versions
    Cite
    Zohreh Javanmard; Saba Zarean Shahraki; Kosar Safari; Abbas Omidi; Sadaf Raoufi; Mahsa Rajabi; Mohammad Esmaeil Akbari; Mehrad Aria (2025). Table 1_Artificial intelligence in breast cancer survival prediction: a comprehensive systematic review and meta-analysis.docx [Dataset]. http://doi.org/10.3389/fonc.2024.1420328.s001
    Available download formats: docx
    Dataset updated
    Jan 7, 2025
    Dataset provided by
    Frontiers
    Authors
    Zohreh Javanmard; Saba Zarean Shahraki; Kosar Safari; Abbas Omidi; Sadaf Raoufi; Mahsa Rajabi; Mohammad Esmaeil Akbari; Mehrad Aria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Breast cancer (BC), as a leading cause of cancer mortality in women, demands robust prediction models for early diagnosis and personalized treatment. Artificial Intelligence (AI) and Machine Learning (ML) algorithms offer promising solutions for automated survival prediction, driving this study’s systematic review and meta-analysis.

    Methods: Three online databases (Web of Science, PubMed, and Scopus) were comprehensively searched (January 2016-August 2023) using key terms (“Breast Cancer”, “Survival Prediction”, and “Machine Learning”) and their synonyms. Original articles applying ML algorithms for BC survival prediction using clinical data were included. The quality of studies was assessed via the Qiao Quality Assessment tool.

    Results: Amongst 140 identified articles, 32 met the eligibility criteria. Analyzed ML methods achieved a mean validation accuracy of 89.73%. Hybrid models, combining traditional and modern ML techniques, were mostly considered to predict survival rates (40.62%). Supervised learning was the dominant ML paradigm (75%). Common ML methodologies included pre-processing, feature extraction, dimensionality reduction, and classification. Deep Learning (DL), particularly Convolutional Neural Networks (CNNs), emerged as the preferred modern algorithm within these methodologies. Notably, 81.25% of studies relied on internal validation, primarily using K-fold cross-validation and train/test split strategies.

    Conclusion: The findings underscore the significant potential of AI-based algorithms in enhancing the accuracy of BC survival predictions. However, to ensure the robustness and generalizability of these predictive models, future research should emphasize the importance of rigorous external validation. Such endeavors will not only validate the efficacy of these models across diverse populations but also pave the way for their integration into clinical practice, ultimately contributing to personalized patient care and improved survival outcomes.

    Systematic Review Registration: https://www.crd.york.ac.uk/prospero/, identifier CRD42024513350.

  7. Zenodo Open Metadata snapshot - Training dataset for records and communities...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Dec 14, 2022
    Cite
    Zenodo team (2022). Zenodo Open Metadata snapshot - Training dataset for records and communities classifier building [Dataset]. http://doi.org/10.5281/zenodo.7438358
    Dataset updated
    Dec 14, 2022
    Authors
    Zenodo team
    Description

    This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted. The datasets are gzip-compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community.

    Records dataset

    Filename: zenodo_open_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date, which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json. In addition, some terms have been altered:

    • The term files contains a list of dictionaries containing filetype, size, and filename only.
    • The term license contains a short Zenodo ID of the license (e.g. "cc-by").

    Communities dataset

    Filename: zenodo_community_metadata_{ date of export }.jsonl.gz

    Each object contains the terms: id, title, description, curation_policy, page, which correspond to the fields with the same name available in Zenodo's community creation form.

    Notes for all datasets

    • For each object, the term spam contains a boolean value indicating whether a given record/community was marked as spam content by Zenodo staff.
    • Top-level terms that were missing in the metadata may contain a null value.
    • A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
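
    The gzip-compressed JSON-lines layout described here can be streamed one object at a time without decompressing the whole file first. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the only field it relies on is the boolean spam term mentioned in the description.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def spam_fraction(path: str) -> float:
    """Fraction of objects whose 'spam' term is true (0.0 for an empty file)."""
    total = 0
    spam = 0
    for record in iter_records(path):
        total += 1
        # Missing or null values are treated as "not spam" in this sketch.
        if record.get("spam") is True:
            spam += 1
    return spam / total if total else 0.0
```

    For the uncompressed 200-line samples, the same loop would work with the built-in open() in place of gzip.open().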

  8. Understanding machine learning dataset search behaviors: A survey

    • zenodo.org
    csv, pdf, txt
    Updated May 7, 2025
    Cite
    Joe Edgerton (2025). Understanding machine learning dataset search behaviors: A survey [Dataset]. http://doi.org/10.5281/zenodo.15359924
    Available download formats: pdf, txt, csv
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joe Edgerton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 7, 2025
    Description

    These files represent the data and accompanying documents of an independent research study by a student researcher examining the searchability and usability of machine learning dataset metadata.

    The purpose of this exploratory study was to understand how machine learning (ML) practitioners are searching for and evaluating datasets for use in their work. This research will help inform development of the ML dataset metadata standard Croissant, which is actively being developed by the Croissant MLCommons working group, so it can aid ML practitioners' workflows and promote best practices like Responsible Artificial Intelligence (RAI).

    The study consisted of a pre-interview Qualtrics survey ("Survey_questions_pre_interview.pdf") that focused on ranking various metadata elements on a Likert importance scale.

    The interview consisted of open questions ("Interview_script_and_questions.pdf") on a range of topics from search of datasets to interoperability to AI used in dataset search. Additionally, participants were asked to share their screen at one point and recall a recent dataset search they had performed.

    The resulting survey dataset ("Survey_p1.csv") and interview ("Interview_p1.txt") of participants are presented in open standard formats for accessibility. Identifying data has been removed from the files, so some columns and rows referenced in the files may be missing.

  9. Waveform Data and Metadata used to National Earthquake Information Center...

    • catalog.data.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Waveform Data and Metadata used to National Earthquake Information Center Deep-Learning Models [Dataset]. https://catalog.data.gov/dataset/waveform-data-and-metadata-used-to-national-earthquake-information-center-deep-learning-mo
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This is the supporting data used to train machine learning models used by the National Earthquake Information Center to improve pick times and classify source characteristics.

  10. Metadata record for: Compendiums of cancer transcriptomes for machine...

    • springernature.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Su Bin Lim; Swee Jin Tan; Wan-Teck Lim; Chwee Teck Lim (2023). Metadata record for: Compendiums of cancer transcriptomes for machine learning applications [Dataset]. http://doi.org/10.6084/m9.figshare.9901763.v1
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Su Bin Lim; Swee Jin Tan; Wan-Teck Lim; Chwee Teck Lim
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains key characteristics about the data described in the Data Descriptor Compendiums of cancer transcriptomes for machine learning applications. Contents:

        1. human readable metadata summary table in CSV format
        2. machine readable metadata file in JSON format
        3. machine readable metadata file in ISA-Tab format (zipped folder)

    Versioning Note: A revised version was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
    
  11. Metadata record for: Global soil moisture data derived through machine...

    • springernature.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Scientific Data Curation Team (2023). Metadata record for: Global soil moisture data derived through machine learning trained with in-situ measurements [Dataset]. http://doi.org/10.6084/m9.figshare.14790510.v1
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Scientific Data Curation Team
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains key characteristics about the data described in the Data Descriptor Global soil moisture data derived through machine learning trained with in-situ measurements. Contents:

        1. human readable metadata summary table in CSV format
        2. machine readable metadata file in JSON format
    
  12. Active Metadata Management Solution Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Cite
    Market Report Analytics (2025). Active Metadata Management Solution Report [Dataset]. https://www.marketreportanalytics.com/reports/active-metadata-management-solution-53710
    Available download formats: ppt, pdf, doc
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Active Metadata Management Solution market is experiencing robust growth, driven by the increasing need for efficient data governance, improved data quality, and enhanced data discoverability across diverse industries. The market's expansion is fueled by the rising volume and velocity of data generated by organizations, necessitating sophisticated solutions to manage and leverage this information effectively. Key trends include the adoption of cloud-based solutions, the integration of AI and machine learning for automated metadata management, and a growing focus on data security and compliance. While the initial investment in implementing these solutions can be substantial, the long-term benefits in terms of reduced operational costs, improved data-driven decision-making, and minimized regulatory risks outweigh these initial expenses. We estimate the current market size (2025) to be around $5 billion, projecting a Compound Annual Growth Rate (CAGR) of 15% over the forecast period (2025-2033). This growth is largely attributed to the increasing adoption across various sectors, including finance, healthcare, and manufacturing, where data-driven insights are critical for operational efficiency and competitive advantage. The segmentation within the market reflects the diversity of applications and solution types, with cloud-based solutions gaining significant traction due to their scalability and cost-effectiveness. North America and Europe currently dominate the market share, but the Asia-Pacific region is poised for significant growth in the coming years driven by increasing digitalization and technological advancements. Market restraints include the complexity of implementing and integrating these solutions with existing IT infrastructure, a potential skills gap in managing these systems effectively, and concerns about data privacy and security. 
However, the ongoing technological advancements and increasing awareness about the importance of data governance are expected to mitigate these challenges. The competitive landscape is marked by a mix of established players and emerging technology providers, constantly innovating to meet the evolving needs of businesses. The market is expected to witness strategic partnerships, mergers and acquisitions, and product enhancements throughout the forecast period, driving further consolidation and innovation.
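
    The growth figures quoted in this description (about $5 billion in 2025 and a 15% CAGR through 2033) imply a year-by-year trajectory that can be worked out with the standard compound-growth formula. The following is a rough sketch, assuming annual compounding over the eight years from 2025 to 2033; the report itself does not state an end-of-period value, and the function name is hypothetical.

```python
def project_market_size(base: float, cagr: float, years: int) -> float:
    """Compound a base value at a constant annual growth rate."""
    return base * (1.0 + cagr) ** years


BASE_2025_USD_BN = 5.0  # "around $5 billion" per the description
CAGR = 0.15             # 15% per the description

# Assumption: one compounding step per calendar year, 2025 through 2033.
for year in range(2025, 2034):
    size = project_market_size(BASE_2025_USD_BN, CAGR, year - 2025)
    print(f"{year}: ~${size:.1f}B")
```

    Under these assumptions the 2033 figure comes out at roughly three times the 2025 base, which is simply what eight years of 15% compounding produces.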

  13. A Dataset for Machine Learning Algorithm Development

    • fisheries.noaa.gov
    • catalog.data.gov
    Updated Jan 1, 2021
    Cite
    Alaska Fisheries Science Center (AFSC) (2021). A Dataset for Machine Learning Algorithm Development [Dataset]. https://www.fisheries.noaa.gov/inport/item/63322
    Dataset updated
    Jan 1, 2021
    Dataset provided by
    Alaska Fisheries Science Center
    Authors
    Alaska Fisheries Science Center (AFSC)
    Area covered
    Kotzebue Sound, Chukchi Sea, Beaufort Sea, Alaska
    Description

    This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.

  14. Interior Design Images & Metadata

    • kaggle.com
    Updated Feb 26, 2025
    Cite
    GalinaKG (2025). Interior Design Images & Metadata [Dataset]. https://www.kaggle.com/datasets/galinakg/interior-design-images-and-metadata
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    GalinaKG
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains a curated collection of interior design images categorized by room type and design style. The images are sourced from Pinterest and labeled with relevant metadata for machine learning applications, including image classification, style prediction, and aesthetic analysis.

    Dataset Structure

    The dataset is organized into directories based on room types:

    • bathroom/
    • bedroom/
    • kitchen/
    • living_room/

    Each room type further contains subdirectories for different design styles, such as:

    • boho
    • industrial
    • minimalist
    • modern
    • scandinavian

    Files Included

    • metadata.csv → Contains file paths and labels for room type and design style.
    • train_data.csv → Training split of the dataset.
    • val_data.csv → Validation split of the dataset.
    • test_data.csv → Test split for evaluation.

    Metadata Format

    Each row in metadata.csv contains:

    • image_path: Relative path to the image.
    • room_type: The category of the room (e.g., bathroom, bedroom).
    • style: The interior design style (e.g., boho, modern).
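
    The three columns listed for metadata.csv are enough to tally how many images fall under each room type and style. The following is a minimal sketch using the standard csv module, assuming the file has a header row with exactly the column names given above; the function name is hypothetical.

```python
import csv
from collections import Counter


def count_labels(csv_path: str) -> Counter:
    """Count images per (room_type, style) pair in a metadata.csv-style file."""
    counts: Counter = Counter()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        # DictReader takes the column names from the header row.
        for row in csv.DictReader(fh):
            counts[(row["room_type"], row["style"])] += 1
    return counts
```

    The same function could be pointed at train_data.csv, val_data.csv, or test_data.csv to compare class balance across splits, provided those files share the same columns, which the description does not confirm.
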
  15. Metadata of the "Alter Realkatalog" (ARK) of Berlin State Library (SBB)

    • zenodo.org
    bin
    Updated Jul 23, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jörg Lehmann; Sophie Schneider (2024). Metadata of the "Alter Realkatalog" (ARK) of Berlin State Library (SBB) [Dataset]. http://doi.org/10.5281/zenodo.12783814
    Available download formats: bin
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jörg Lehmann; Sophie Schneider
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 22, 2024
    Area covered
    Berlin
    Description

    This dataset was created with the intent to provide a single larger set of metadata from Berlin State Library for research purposes and the development of AI applications.

    The dataset comprises descriptive metadata for 2,619,397 titles, which together form the "Alte Realkatalog" of Berlin State Library, which may be translated as "Old Subject Catalogue". The data are stored in columnar format, containing 375 columns. They were downloaded in December 2023 from the German central library system (CBS). Example tasks that this dataset can support include studies on the history of books between 1500 and 1955, on the paratextual formatting of scientific books between 1800 and 1955, and on pattern recognition on the basis of bibliographical metadata.

  16. Metadata record for: A shell dataset, for shell features extraction and...

    • springernature.figshare.com
    txt
    Updated Mar 1, 2024
    Cite
    Qi Zhang; Jianhang Zhou; Jing He; Xiaodong Cun; Shaoning Zeng; Bob Zhang (2024). Metadata record for: A shell dataset, for shell features extraction and recognition [Dataset]. http://doi.org/10.6084/m9.figshare.9939353.v2
    Available download formats: txt
    Dataset updated
    Mar 1, 2024
    Dataset provided by
    figshare
    Authors
    Qi Zhang; Jianhang Zhou; Jing He; Xiaodong Cun; Shaoning Zeng; Bob Zhang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains key characteristics about the data described in the Data Descriptor A shell dataset, for shell features extraction and recognition. Contents:

        1. human readable metadata summary table in CSV format
        2. machine readable metadata file in JSON format

    Versioning Note: Version 2 was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
    
  17. Meta-Album Dataset

    • paperswithcode.com
    Cite
    Ihsan Ullah; Dustin Carrión-Ojeda; Sergio Escalera; Isabelle Guyon; Mike Huisman; Felix Mohr; Jan N van Rijn; Haozhe Sun; Joaquin Vanschoren; Phan Anh Vu, Meta-Album Dataset [Dataset]. https://paperswithcode.com/dataset/meta-album
    Authors
    Ihsan Ullah; Dustin Carrión-Ojeda; Sergio Escalera; Isabelle Guyon; Mike Huisman; Felix Mohr; Jan N van Rijn; Haozhe Sun; Joaquin Vanschoren; Phan Anh Vu
    Description

    Meta Album is a meta-dataset created for few-shot learning, meta-learning, continual learning and so on. Meta Album consists of 40 datasets from 10 unique domains. Datasets are arranged in sets (10 datasets, one dataset from each domain). It is a continuously growing meta-dataset.

    We repurposed datasets that were generously made available by original creators. All datasets are free for use for academic purposes, provided that proper credits are given. For your convenience, you may cite our paper, which references all original creators.

    Meta-Album is released under a CC BY-NC 4.0 license permitting non-commercial use for research purposes, provided that you cite us. Additionally, redistributed datasets have their own license.

    The recommended use of Meta-Album is to conduct fundamental research on machine learning algorithms and conduct benchmarks, particularly in: few-shot learning, meta-learning, continual learning, transfer learning, and image classification.

  18. Metadata record for: All urban areas’ energy use data across 640 districts...

    • springernature.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Scientific Data Curation Team (2023). Metadata record for: All urban areas’ energy use data across 640 districts in India [Dataset]. http://doi.org/10.6084/m9.figshare.13516925.v1
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Scientific Data Curation Team
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    India
    Description

    This dataset contains key characteristics about the data described in the Data Descriptor All urban areas’ energy use data across 640 districts in India. Contents:

        1. human readable metadata summary table in CSV format
        2. machine readable metadata file in JSON format
    
  19. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Available download formats: csv
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored in .csv format.

    Each competition has a text description and metadata reflecting the characteristics of the competition and of the dataset used, as well as the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.

    The code blocks and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (сode_blocks_upto_20.csv) and those from 2021 (сode_blocks_21.csv), with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

    Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

    Because the marked-up code block data contains only the numeric id of each code block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).
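
    The id-to-type lookup described above amounts to a join between the markup table and the mapping table. The following is a minimal sketch of that join using only Python's standard library. The column names (code_block_id, semantic_type_id, relevance, semantic_type, subclass) and all sample values are assumptions invented for illustration; they are not taken from the dataset and should be replaced with the real headers of markup_data_20220415.csv and actual_graph_2022-06-01.csv.

```python
import csv
import io

# Hypothetical stand-ins for markup_data_20220415.csv and
# actual_graph_2022-06-01.csv. Column names and values are invented
# for illustration and are NOT the dataset's real schema.
MARKUP_CSV = """code_block_id,semantic_type_id,relevance
101,7,5
102,3,4
103,7,2
"""

MAPPING_CSV = """semantic_type_id,semantic_type,subclass
3,Data_Transform,normalization
7,Visualization,heatmap
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def attach_semantic_types(markup_rows, mapping_rows):
    """Join each labeled snippet to its semantic type and subclass by numeric id."""
    by_id = {row["semantic_type_id"]: row for row in mapping_rows}
    joined = []
    for row in markup_rows:
        mapping = by_id.get(row["semantic_type_id"])
        joined.append({
            **row,
            # Ids missing from the mapping are kept, with empty type fields.
            "semantic_type": mapping["semantic_type"] if mapping else None,
            "subclass": mapping["subclass"] if mapping else None,
        })
    return joined


labeled = attach_semantic_types(load_rows(MARKUP_CSV), load_rows(MAPPING_CSV))
for row in labeled:
    print(row["code_block_id"], row["semantic_type"], row["subclass"])
```

    To apply the same idea to the downloaded files, one would presumably pass an opened file to csv.DictReader instead of the in-memory strings and adjust the key names to match the actual headers.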

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

  20. Data from: Contextualized, Metadata-Empowered, Coarse-to-Fine...

    • explore.openaire.eu
    Updated Jan 1, 2021
    Dheeraj Mekala (2021). Contextualized, Metadata-Empowered, Coarse-to-Fine Weakly-Supervised Text Classification [Dataset]. https://explore.openaire.eu/search/other?orpId=od_325::9c725f6a4398ba70deb7e866a782f5fd
    Dataset updated
    Jan 1, 2021
    Authors
    Dheeraj Mekala
    Description

    Text classification plays a fundamental role in transforming unstructured text data into structured knowledge. State-of-the-art text classification techniques rely on heavy domain-specific annotations to build massive machine (deep) learning models. Although these deep learning models exhibit superior performance, the lack of training data and the expensive human effort of manual annotation are a key bottleneck that prevents them from being adopted in many practical scenarios. To address this bottleneck, our research exploits the data and develops a family of data-driven text classification frameworks with minimal supervision, e.g., class names or a few label-indicative seed words per class.

    The massive volume of text data and the complexity of natural language pose significant challenges to categorizing a text corpus without human annotations. For instance, user-provided seed words can have multiple interpretations depending on the context, and their respective user-intended interpretation has to be identified for accurate classification. Moreover, metadata information like author, year, and location is widely available in addition to the text data, and it could serve as a strong, complementary source of supervision. However, leveraging metadata is challenging because (1) metadata is multi-typed, so it requires systematic modeling of different types and their combinations, and (2) metadata is noisy: some metadata entities (e.g., authors, venues) are more compelling label indicators than others. In addition, the label set is typically assumed to be fixed in traditional text classification problems. However, in many real-world applications, new classes, especially more fine-grained ones, will be introduced as the data volume increases.

    The goal of our research is to create general data-driven methods that transform real-world text data into structured categories of human knowledge with minimal human effort. This thesis outlines a family of weakly supervised text classification approaches which, when combined, can automatically categorize a huge text corpus into coarse and fine-grained classes, with just a label hierarchy and a few label-indicative seed words as supervision. Specifically, it first leverages contextualized representations of word occurrences and seed word information to automatically differentiate multiple interpretations of a seed word, thus resulting in contextualized weak supervision. Then, to leverage metadata, it organizes the text data and metadata together into a text-rich network and adopts network motifs to capture appropriate combinations of metadata. Finally, we introduce a new problem called coarse-to-fine grained classification, which aims to perform fine-grained classification on coarsely annotated data. Instead of asking for new fine-grained human annotations, we opt to leverage label surface names as the only human guidance and weave rich pre-trained generative language models into the iterative weak supervision strategy. We have performed extensive experiments on real-world datasets from different domains. The results demonstrate significant advantages of using contextualized weak supervision and leveraging metadata, and superior performance over baselines.
