This is the metadata associated with Pavlovic et al. (2023), entitled "Empirical nitrogen and sulfur critical loads of U.S. tree species and their uncertainties with machine learning" (https://www.sciencedirect.com/science/article/pii/S0048969722063513). These are not EPA data, and the data and associated metadata are already publicly available on the journal website. This dataset is associated with the following publication: Pavlovic, N., S. Chang, J. Huang, K. Craig, C. Clark, K. Horn, and C. Driscoll. Empirical nitrogen and sulfur critical loads of U.S. tree species and their uncertainties with machine learning. Science of the Total Environment, Elsevier BV, Amsterdam, Netherlands, 857: 1-10 (2022).
This GitLab project contains the training data that was used for the metadata machine learning classification project.
This dataset consists of a curated collection of published, indexed articles (N=75,527) related to Natural Language Processing (NLP), collected from Web of Science, along with a classification into one of five categories depending on the approach to NLP used:

Category 0 (Rule-Based): A model based on rules or symbolic analysis is used.

Category 1 (Statistical Methods): An approach using statistical methods is used. This includes BoWs, N-Grams, and TF-IDF, along with other machine learning techniques such as SVMs, Logistic Regression, LDA, and others. Shallow neural network models like word2vec also belong in this category.

Category 2 (Deep Learning): Approaches that use Deep Learning and other Deep Neural Network architectures, such as RNNs, CNNs, and LSTMs, are included in this category.

Category 3 (Transformer Models): The proposed approach uses transformer-based models, such as BERT, GPT, T5, and others.

Category 4: The abstract does not mention a particular model or technique. Papers analyzing frameworks, surveys, papers centered on the computer vision component of NLP, and dataset proposals, among others, fall into this category.

Note that the classification may be imprecise, is not strictly defined, and should be used only as a starting point.

Fields: 'Authors', 'Article Title', 'Volume', 'Issue', 'Special Issue', 'Start Page', 'End Page', 'DOI', 'Book DOI', 'Publication Date', 'Times Cited', 'ISSN', 'eISSN', 'Author Full Names', 'Book Author Full Names', 'Language', 'Author Keywords', 'Keywords', 'Funding Orgs', 'Funding Text', 'Cited References', 'DOI Link', 'Number of Pages', 'Categories', 'Research Areas', 'bert_preds', 'setfit_preds', 'knn_preds', 'abstract_hash'.

The dataset is provided in different formats. To address potential copyright, licensing, and data privacy concerns, we have replaced the original abstracts with SHA-256 hashes, cryptographic representations of the abstracts' content. Please note that the copyright and licensing status of the original articles may vary, and users should respect any applicable terms and restrictions associated with the source publications.
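Given the hashing scheme above, a record can still be matched to an abstract obtained from the publisher by hashing that text and comparing it to abstract_hash. A minimal sketch in Python, assuming the hashes were computed over the UTF-8 encoded abstract text and that the table is loaded from a CSV export (both are assumptions; the exact normalization and file layout may differ):

    import hashlib
    import pandas as pd

    def abstract_sha256(text: str) -> str:
        # Assumption: plain UTF-8 encoding, no whitespace normalization.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    # Hypothetical filename; the dataset is provided in several formats.
    df = pd.read_csv("nlp_articles.csv")

    candidate = "An abstract retrieved from the source publication..."
    matches = df[df["abstract_hash"] == abstract_sha256(candidate)]
    print(matches[["Article Title", "DOI"]])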
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Automated classification of research data metadata by discipline(s) of research can be used in scientometric research, by repository service providers, and in the context of research data aggregation services. Openly available metadata from the DataCite index for research data were used to compile a large training and evaluation set comprising 609,524 records. This publication contains the aggregated data for the paper. It also contains the evaluation data of all model/hyper-parameter training and test runs.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is based on the paper:
Hoang-Son Pham, Hanne Poelmans, and Amr Ali-Eldin, "A metadata-based approach for research discipline prediction using machine learning techniques and distance metrics", IEEE Access (2023).
The dataset includes:
a list of project metadata extracted from the FRIS portal
a list of VODS disciplines
a distance matrix
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Breast cancer (BC), as a leading cause of cancer mortality in women, demands robust prediction models for early diagnosis and personalized treatment. Artificial Intelligence (AI) and Machine Learning (ML) algorithms offer promising solutions for automated survival prediction, driving this study's systematic review and meta-analysis.

Methods: Three online databases (Web of Science, PubMed, and Scopus) were comprehensively searched (January 2016 to August 2023) using the key terms "Breast Cancer", "Survival Prediction", and "Machine Learning", and their synonyms. Original articles applying ML algorithms for BC survival prediction using clinical data were included. The quality of studies was assessed via the Qiao Quality Assessment tool.

Results: Amongst 140 identified articles, 32 met the eligibility criteria. The analyzed ML methods achieved a mean validation accuracy of 89.73%. Hybrid models, combining traditional and modern ML techniques, were the most commonly used for predicting survival rates (40.62%). Supervised learning was the dominant ML paradigm (75%). Common ML methodologies included pre-processing, feature extraction, dimensionality reduction, and classification. Deep Learning (DL), particularly Convolutional Neural Networks (CNNs), emerged as the preferred modern algorithm within these methodologies. Notably, 81.25% of studies relied on internal validation, primarily using K-fold cross-validation and train/test split strategies.

Conclusion: The findings underscore the significant potential of AI-based algorithms in enhancing the accuracy of BC survival predictions. However, to ensure the robustness and generalizability of these predictive models, future research should emphasize rigorous external validation. Such endeavors will not only validate the efficacy of these models across diverse populations but also pave the way for their integration into clinical practice, ultimately contributing to personalized patient care and improved survival outcomes.

Systematic Review Registration: https://www.crd.york.ac.uk/prospero/, identifier CRD42024513350.
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted. The datasets are gzipped compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community. Records dataset Filename: zenodo_open_metadata_{ date of export }.jsonl.gz Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json. In addition, some terms have been altered: The term files contains a list of dictionaries containing filetype, size, and filename only. The term license contains a short Zenodo ID of the license (e.g. "cc-by"). Communities dataset Filename: zenodo_community_metadata_{ date of export }.jsonl.gz Each object contains the terms: id, title, description, curation_policy, page which correspond to the fields with the same name available in Zenodo's community creation form. Notes for all datasets For each object the term spam contains a boolean value, determining whether a given record/community was marked as spam content by Zenodo staff. Some values for the top-level terms, which were missing in the metadata may contain a null value. A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
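A minimal sketch of consuming the records dump, streaming it line by line without decompressing to disk (the date in the filename is illustrative):

    import gzip
    import json

    # Stream the records dump; each line is one JSON object.
    # Filename follows the documented pattern (the date is illustrative).
    with gzip.open("zenodo_open_metadata_2024-01-01.jsonl.gz", "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("spam"):
                continue  # skip entries Zenodo staff flagged as spam
            print(record.get("doi"), record.get("title"))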
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These files represent the data and accompanying documents of an independent research study by a student researcher examining the searchability and usability of machine learning dataset metadata.
The purpose of this exploratory study was to understand how machine learning (ML) practitioners are searching for and evaluating datasets for use in their work. This research will help inform development of the ML dataset metadata standard Croissant, which is actively being developed by the Croissant MLCommons working group, so it can aid ML practitioners' workflows and promote best practices like Responsible Artificial Intelligence (RAI).
The study consisted of a pre-interview Qualtrics survey ("Survey_questions_pre_interview.pdf") that focused on ranking various metadata elements on a Likert importance scale.
The interview consisted of open-ended questions ("Interview_script_and_questions.pdf") on a range of topics, from dataset search to interoperability to the use of AI in dataset search. Additionally, participants were asked to share their screen at one point and recall a recent dataset search they had performed.
The resulting survey data ("Survey_p1.csv") and interview transcript ("Interview_p1.txt") of participants are presented in open standard formats for accessibility. Identifying data has been removed from the files, so some columns and rows referenced within the files may be missing.
This is the supporting data used to train machine learning models used by the National Earthquake Information Center to improve pick times and classify source characteristics.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor "Compendiums of cancer transcriptomes for machine learning applications". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
3. machine readable metadata file in ISA-Tab format (zipped folder)
Versioning Note: A revised version was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor "Global soil moisture data derived through machine learning trained with in-situ measurements". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
https://www.marketreportanalytics.com/privacy-policy
The Active Metadata Management Solution market is experiencing robust growth, driven by the increasing need for efficient data governance, improved data quality, and enhanced data discoverability across diverse industries. The market's expansion is fueled by the rising volume and velocity of data generated by organizations, necessitating sophisticated solutions to manage and leverage this information effectively. Key trends include the adoption of cloud-based solutions, the integration of AI and machine learning for automated metadata management, and a growing focus on data security and compliance. While the initial investment in implementing these solutions can be substantial, the long-term benefits in terms of reduced operational costs, improved data-driven decision-making, and minimized regulatory risks outweigh these initial expenses. We estimate the current market size (2025) to be around $5 billion, projecting a Compound Annual Growth Rate (CAGR) of 15% over the forecast period (2025-2033). This growth is largely attributed to the increasing adoption across various sectors, including finance, healthcare, and manufacturing, where data-driven insights are critical for operational efficiency and competitive advantage. The segmentation within the market reflects the diversity of applications and solution types, with cloud-based solutions gaining significant traction due to their scalability and cost-effectiveness. North America and Europe currently dominate the market share, but the Asia-Pacific region is poised for significant growth in the coming years, driven by increasing digitalization and technological advancements.

Market restraints include the complexity of implementing and integrating these solutions with existing IT infrastructure, a potential skills gap in managing these systems effectively, and concerns about data privacy and security. However, ongoing technological advancements and increasing awareness of the importance of data governance are expected to mitigate these challenges. The competitive landscape is marked by a mix of established players and emerging technology providers, constantly innovating to meet the evolving needs of businesses. The market is expected to witness strategic partnerships, mergers and acquisitions, and product enhancements throughout the forecast period, driving further consolidation and innovation.
This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains a curated collection of interior design images categorized by room type and design style. The images are sourced from Pinterest and labeled with relevant metadata for machine learning applications, including image classification, style prediction, and aesthetic analysis.
The dataset is organized into directories based on room types:
Each room type further contains subdirectories for different design styles, such as:
Each row in metadata.csv contains:
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created with the intent of providing a single, larger set of metadata from the Berlin State Library for research purposes and the development of AI applications.
The dataset comprises descriptive metadata of 2,619,397 titles, which together form the "Alte Realkatalog" of the Berlin State Library, which may be translated as "Old Subject Catalogue". The data are stored in a columnar format containing 375 columns and were downloaded in December 2023 from the German central library system (CBS). Exemplary tasks that this dataset can serve include studies on the history of books between 1500 and 1955, on the paratextual formatting of scientific books between 1800 and 1955, and on pattern recognition on the basis of bibliographic metadata.
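A minimal loading sketch, assuming the table is distributed as a single CSV file (the filename, and CSV as the on-disk format, are assumptions; the description states only a columnar layout with 375 columns):

    import pandas as pd

    # Hypothetical filename; the record describes a 375-column table of 2,619,397 titles.
    arc = pd.read_csv("alte_realkatalog_metadata.csv", low_memory=False)
    print(arc.shape)               # expected: (2619397, 375)
    print(list(arc.columns[:10]))  # inspect the first few of the 375 columns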
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor "A shell dataset, for shell features extraction and recognition". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
Versioning Note: Version 2 was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
Meta Album is a meta-dataset created for few-shot learning, meta-learning, continual learning, and related tasks. It consists of 40 datasets from 10 unique domains, arranged in sets (10 datasets per set, one from each domain), and is continuously growing.
We repurposed datasets that were generously made available by original creators. All datasets are free for use for academic purposes, provided that proper credits are given. For your convenience, you may cite our paper, which references all original creators.
Meta-Album is released under a CC BY-NC 4.0 license, permitting non-commercial use for research purposes, provided that you cite us. Additionally, the redistributed datasets carry their own licenses.
The recommended use of Meta-Album is to conduct fundamental research on machine learning algorithms and conduct benchmarks, particularly in: few-shot learning, meta-learning, continual learning, transfer learning, and image classification.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor "All urban areas' energy use data across 640 districts in India". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored in CSV format.
Each competition has a text description and metadata reflecting the characteristics of the competition and the dataset used, as well as the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.
The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.
Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12,000 labeled snippets (markup_data_20220415.csv).
Since the marked-up code block data contains the numeric id of each code block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
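As a sketch of how the tables above might be combined, the following loads the labeled snippets and attaches semantic-type names via the provided mapping. The join key semantic_type_id is an assumed column name; consult the corpus documentation for the actual schema:

    import pandas as pd

    # Load the markup table and the semantic-type mapping (filenames as listed above).
    snippets = pd.read_csv("markup_data_20220415.csv")
    type_map = pd.read_csv("actual_graph_2022-06-01.csv")

    # Hypothetical join key: the numeric semantic-type id described above.
    labeled = snippets.merge(type_map, on="semantic_type_id", how="left")
    print(labeled.head())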
Text classification plays a fundamental role in transforming unstructured text data into structured knowledge. State-of-the-art text classification techniques rely on heavy domain-specific annotation to build massive machine (deep) learning models. Although these deep learning models exhibit superior performance, the lack of training data and the expensive human effort of manual annotation are key bottlenecks that prevent them from being adopted in many practical scenarios. To address this bottleneck, our research exploits the data and develops a family of data-driven text classification frameworks with minimal supervision, e.g., class names and a few label-indicative seed words per class.

The massive volume of text data and the complexity of natural language pose significant challenges to categorizing a text corpus without human annotations. For instance, user-provided seed words can have multiple interpretations depending on the context, and their respective user-intended interpretations have to be identified for accurate classification. Moreover, metadata information such as author, year, and location is widely available in addition to the text data, and it can serve as a strong, complementary source of supervision. However, leveraging metadata is challenging because (1) metadata is multi-typed, so it requires systematic modeling of different types and their combinations, and (2) metadata is noisy: some metadata entities (e.g., authors, venues) are more compelling label indicators than others. Also, the label set is typically assumed to be fixed in traditional text classification problems; however, in many real-world applications, new classes, especially more fine-grained ones, are introduced as the data volume increases. The goal of our research is to create general data-driven methods that transform real-world text data into structured categories of human knowledge with minimal human effort.

This thesis outlines a family of weakly supervised text classification approaches which, when combined, can automatically categorize a huge text corpus into coarse- and fine-grained classes with just a label hierarchy and a few label-indicative seed words as supervision. Specifically, it first leverages contextualized representations of word occurrences and seed word information to automatically differentiate multiple interpretations of a seed word, resulting in contextualized weak supervision. Then, to leverage metadata, it organizes the text data and metadata together into a text-rich network and adopts network motifs to capture appropriate combinations of metadata. Finally, we introduce a new problem called coarse-to-fine-grained classification, which aims to perform fine-grained classification on coarsely annotated data. Instead of asking for new fine-grained human annotations, we opt to leverage label surface names as the only human guidance and weave rich pre-trained generative language models into the iterative weak supervision strategy. We have performed extensive experiments on real-world datasets from different domains. The results demonstrate significant advantages of using contextualized weak supervision and leveraging metadata, and superior performance over baselines.
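The seed-word supervision described above can be illustrated with a deliberately simplified sketch: hard seed-word matching as a source of weak labels. This is a toy illustration of the general idea, not the thesis's contextualized method, which additionally disambiguates seed-word senses using contextualized representations:

    from collections import Counter
    from typing import Optional

    # Toy seed-word lists per class (illustrative only; real seed sets are user-provided).
    SEEDS = {
        "sports": {"game", "season", "team"},
        "politics": {"election", "senate", "policy"},
    }

    def weak_label(document: str) -> Optional[str]:
        """Assign the class whose seed words occur most often; None if no seed matches."""
        tokens = document.lower().split()
        counts = Counter({label: sum(t in seeds for t in tokens)
                          for label, seeds in SEEDS.items()})
        label, hits = counts.most_common(1)[0]
        return label if hits > 0 else None

    print(weak_label("The team won the final game of the season"))  # -> sports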