This is the metadata associated with Pavlovic et al. (2023), entitled "Empirical nitrogen and sulfur critical loads of U.S. tree species and their uncertainties with machine learning" (https://www.sciencedirect.com/science/article/pii/S0048969722063513). These are not EPA data, and the data and associated metadata are already publicly available on the journal website. This dataset is associated with the following publication: Pavlovic, N., S. Chang, J. Huang, K. Craig, C. Clark, K. Horn, and C. Driscoll. Empirical nitrogen and sulfur critical loads of U.S. tree species and their uncertainties with machine learning. Science of the Total Environment, Elsevier BV, Amsterdam, Netherlands, 857: 1-10 (2022).
This GitLab project contains the training data that was used for the metadata machine learning classification project.
This dataset consists of a curated collection of published, indexed articles (N=75,527) related to Natural Language Processing (NLP), collected from Web of Science, along with a classification into one of five categories depending on the approach to NLP used:

Category 0 (Rule-Based): A model based on rules or symbolic analysis is used.

Category 1 (Statistical Methods): An approach using statistical methods is used. This includes BoWs, N-Grams, and TF-IDF, along with other machine learning techniques such as SVMs, Logistic Regression, LDA, and others. Shallow neural network models like word2vec also belong in this category.

Category 2 (Deep Learning): Approaches that use Deep Learning and other Deep Neural Network architectures, such as RNNs, CNNs, and LSTMs, are included in this category.

Category 3 (Transformer Models): The proposed approach uses transformer-based models, such as BERT, GPT, T5, and others.

Category 4: The abstract does not mention a particular model or technique. Papers analyzing frameworks, surveys, papers centered on the computer vision component of NLP, and dataset proposals, among others, fall into this category.

Note that the classification may be imprecise, is not strictly defined, and should be used only as a starting point.

Fields: 'Authors', 'Article Title', 'Volume', 'Issue', 'Special Issue', 'Start Page', 'End Page', 'DOI', 'Book DOI', 'Publication Date', 'Times Cited', 'ISSN', 'eISSN', 'Author Full Names', 'Book Author Full Names', 'Language', 'Author Keywords', 'Keywords', 'Funding Orgs', 'Funding Text', 'Cited References', 'DOI Link', 'Number of Pages', 'Categories', 'Research Areas', 'bert_preds', 'setfit_preds', 'knn_preds', 'abstract_hash'.

The dataset is provided in different formats. To address potential copyright, licensing, and data privacy concerns, we have replaced the original abstracts with SHA-256 hashes, cryptographic representations of the abstracts' content. Please note that the copyright and licensing status of the original articles may vary, and users should respect any applicable terms and restrictions associated with the source publications.
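Given the hashing scheme above, a record can still be matched to an abstract obtained from the publisher by hashing that text and comparing it to abstract_hash. A minimal sketch in Python, assuming the hashes were computed over the UTF-8 encoded abstract text and that the table is loaded from a CSV export (both are assumptions; the exact normalization and file layout may differ):

    import hashlib
    import pandas as pd

    def abstract_sha256(text: str) -> str:
        # Assumption: plain UTF-8 encoding, no whitespace normalization.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    # Hypothetical filename; the dataset is provided in several formats.
    df = pd.read_csv("nlp_articles.csv")

    candidate = "An abstract retrieved from the source publication..."
    matches = df[df["abstract_hash"] == abstract_sha256(candidate)]
    print(matches[["Article Title", "DOI"]])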
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Automated classification of research data metadata by discipline(s) of research can be used in scientometric research, by repository service providers, and in the context of research data aggregation services. Openly available metadata from the DataCite index for research data were used to compile a large training and evaluation set comprising 609,524 records. This publication contains the aggregated data for the paper. It also contains the evaluation data of all model/hyper-parameter training and test runs.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is based on the paper:
Hoang-Son Pham, Hanne Poelmans, and Amr Ali-Eldin, "A metadata-based approach for research discipline prediction using machine learning techniques and distance metrics", IEEE Access (2023).
The dataset includes:
a list of project metadata extracted from the FRIS portal
a list of VODS disciplines
a distance matrix
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Breast cancer (BC), as a leading cause of cancer mortality in women, demands robust prediction models for early diagnosis and personalized treatment. Artificial Intelligence (AI) and Machine Learning (ML) algorithms offer promising solutions for automated survival prediction, driving this study's systematic review and meta-analysis.

Methods: Three online databases (Web of Science, PubMed, and Scopus) were comprehensively searched (January 2016 to August 2023) using the key terms "Breast Cancer", "Survival Prediction", and "Machine Learning", and their synonyms. Original articles applying ML algorithms for BC survival prediction using clinical data were included. The quality of studies was assessed via the Qiao Quality Assessment tool.

Results: Amongst 140 identified articles, 32 met the eligibility criteria. The analyzed ML methods achieved a mean validation accuracy of 89.73%. Hybrid models, combining traditional and modern ML techniques, were the most commonly used for predicting survival rates (40.62%). Supervised learning was the dominant ML paradigm (75%). Common ML methodologies included pre-processing, feature extraction, dimensionality reduction, and classification. Deep Learning (DL), particularly Convolutional Neural Networks (CNNs), emerged as the preferred modern algorithm within these methodologies. Notably, 81.25% of studies relied on internal validation, primarily using K-fold cross-validation and train/test split strategies.

Conclusion: The findings underscore the significant potential of AI-based algorithms in enhancing the accuracy of BC survival predictions. However, to ensure the robustness and generalizability of these predictive models, future research should emphasize rigorous external validation. Such endeavors will not only validate the efficacy of these models across diverse populations but also pave the way for their integration into clinical practice, ultimately contributing to personalized patient care and improved survival outcomes.

Systematic Review Registration: https://www.crd.york.ac.uk/prospero/, identifier CRD42024513350.
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted. The datasets are gzipped compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community. Records dataset Filename: zenodo_open_metadata_{ date of export }.jsonl.gz Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json. In addition, some terms have been altered: The term files contains a list of dictionaries containing filetype, size, and filename only. The term license contains a short Zenodo ID of the license (e.g. "cc-by"). Communities dataset Filename: zenodo_community_metadata_{ date of export }.jsonl.gz Each object contains the terms: id, title, description, curation_policy, page which correspond to the fields with the same name available in Zenodo's community creation form. Notes for all datasets For each object the term spam contains a boolean value, determining whether a given record/community was marked as spam content by Zenodo staff. Some values for the top-level terms, which were missing in the metadata may contain a null value. A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
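A minimal sketch of consuming the records dump, streaming it line by line without decompressing to disk (the date in the filename is illustrative):

    import gzip
    import json

    # Stream the records dump; each line is one JSON object.
    # Filename follows the documented pattern (the date is illustrative).
    with gzip.open("zenodo_open_metadata_2024-01-01.jsonl.gz", "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("spam"):
                continue  # skip entries Zenodo staff flagged as spam
            print(record.get("doi"), record.get("title"))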
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These files represent the data and accompanying documents of an independent research study by a student researcher examining the searchability and usability of machine learning dataset metadata.
The purpose of this exploratory study was to understand how machine learning (ML) practitioners are searching for and evaluating datasets for use in their work. This research will help inform development of the ML dataset metadata standard Croissant, which is actively being developed by the Croissant MLCommons working group, so it can aid ML practitioners' workflows and promote best practices like Responsible Artificial Intelligence (RAI).
The study consisted of a pre-interview Qualtrics survey ("Survey_questions_pre_interview.pdf") that focused on ranking various metadata elements on a Likert importance scale.
The interview consisted of open-ended questions ("Interview_script_and_questions.pdf") on a range of topics, from dataset search to interoperability to the use of AI in dataset search. Additionally, participants were asked to share their screen at one point and recall a recent dataset search they had performed.
The resulting survey data ("Survey_p1.csv") and interview transcript ("Interview_p1.txt") of participants are presented in open standard formats for accessibility. Identifying data has been removed from the files, so some columns and rows referenced within the files may be missing.
This is the supporting data used to train machine learning models used by the National Earthquake Information Center to improve pick times and classify source characteristics.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor "Compendiums of cancer transcriptomes for machine learning applications". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
3. machine readable metadata file in ISA-Tab format (zipped folder)
Versioning Note: A revised version was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor "Global soil moisture data derived through machine learning trained with in-situ measurements". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
https://www.marketreportanalytics.com/privacy-policy
The Active Metadata Management Solution market is experiencing robust growth, driven by the increasing need for efficient data governance, improved data quality, and enhanced data discoverability across diverse industries. The market's expansion is fueled by the rising volume and velocity of data generated by organizations, necessitating sophisticated solutions to manage and leverage this information effectively. Key trends include the adoption of cloud-based solutions, the integration of AI and machine learning for automated metadata management, and a growing focus on data security and compliance. While the initial investment in implementing these solutions can be substantial, the long-term benefits in terms of reduced operational costs, improved data-driven decision-making, and minimized regulatory risks outweigh these initial expenses. We estimate the current market size (2025) to be around $5 billion, projecting a Compound Annual Growth Rate (CAGR) of 15% over the forecast period (2025-2033). This growth is largely attributed to the increasing adoption across various sectors, including finance, healthcare, and manufacturing, where data-driven insights are critical for operational efficiency and competitive advantage. The segmentation within the market reflects the diversity of applications and solution types, with cloud-based solutions gaining significant traction due to their scalability and cost-effectiveness. North America and Europe currently dominate the market share, but the Asia-Pacific region is poised for significant growth in the coming years, driven by increasing digitalization and technological advancements.

Market restraints include the complexity of implementing and integrating these solutions with existing IT infrastructure, a potential skills gap in managing these systems effectively, and concerns about data privacy and security. However, ongoing technological advancements and increasing awareness of the importance of data governance are expected to mitigate these challenges. The competitive landscape is marked by a mix of established players and emerging technology providers, constantly innovating to meet the evolving needs of businesses. The market is expected to witness strategic partnerships, mergers and acquisitions, and product enhancements throughout the forecast period, driving further consolidation and innovation.
This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains a curated collection of interior design images categorized by room type and design style. The images are sourced from Pinterest and labeled with relevant metadata for machine learning applications, including image classification, style prediction, and aesthetic analysis.
The dataset is organized into directories based on room types:
Each room type further contains subdirectories for different design styles, such as:
Each row in metadata.csv contains:
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created with the intent of providing a single, larger set of metadata from the Berlin State Library for research purposes and the development of AI applications.
The dataset comprises descriptive metadata of 2,619,397 titles, which together form the "Alte Realkatalog" of the Berlin State Library, which may be translated as "Old Subject Catalogue". The data are stored in a columnar format containing 375 columns and were downloaded in December 2023 from the German central library system (CBS). Exemplary tasks that this dataset can serve include studies on the history of books between 1500 and 1955, on the paratextual formatting of scientific books between 1800 and 1955, and on pattern recognition on the basis of bibliographic metadata.
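A minimal loading sketch, assuming the table is distributed as a single CSV file (the filename, and CSV as the on-disk format, are assumptions; the description states only a columnar layout with 375 columns):

    import pandas as pd

    # Hypothetical filename; the record describes a 375-column table of 2,619,397 titles.
    arc = pd.read_csv("alte_realkatalog_metadata.csv", low_memory=False)
    print(arc.shape)               # expected: (2619397, 375)
    print(list(arc.columns[:10]))  # inspect the first few of the 375 columns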
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor "A shell dataset, for shell features extraction and recognition". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
Versioning Note: Version 2 was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
Meta Album is a meta-dataset created for few-shot learning, meta-learning, continual learning, and related tasks. It consists of 40 datasets from 10 unique domains, arranged in sets (10 datasets per set, one from each domain), and is continuously growing.
We repurposed datasets that were generously made available by original creators. All datasets are free for use for academic purposes, provided that proper credits are given. For your convenience, you may cite our paper, which references all original creators.
Meta-Album is released under a CC BY-NC 4.0 license, permitting non-commercial use for research purposes, provided that you cite us. Additionally, the redistributed datasets carry their own licenses.
The recommended use of Meta-Album is to conduct fundamental research on machine learning algorithms and conduct benchmarks, particularly in: few-shot learning, meta-learning, continual learning, transfer learning, and image classification.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor "All urban areas' energy use data across 640 districts in India". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored in CSV format.
Each competition has a text description and metadata reflecting the characteristics of the competition and the dataset used, as well as the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.
The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.
Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12,000 labeled snippets (markup_data_20220415.csv).
Since the marked-up code block data contains the numeric id of each code block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
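As a sketch of how the tables above might be combined, the following loads the labeled snippets and attaches semantic-type names via the provided mapping. The join key semantic_type_id is an assumed column name; consult the corpus documentation for the actual schema:

    import pandas as pd

    # Load the markup table and the semantic-type mapping (filenames as listed above).
    snippets = pd.read_csv("markup_data_20220415.csv")
    type_map = pd.read_csv("actual_graph_2022-06-01.csv")

    # Hypothetical join key: the numeric semantic-type id described above.
    labeled = snippets.merge(type_map, on="semantic_type_id", how="left")
    print(labeled.head())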
Text classification plays a fundamental role in transforming unstructured text data into structured knowledge. State-of-the-art text classification techniques rely on heavy domain-specific annotation to build massive machine (deep) learning models. Although these deep learning models exhibit superior performance, the lack of training data and the expensive human effort of manual annotation are key bottlenecks that prevent them from being adopted in many practical scenarios. To address this bottleneck, our research exploits the data and develops a family of data-driven text classification frameworks with minimal supervision, e.g., class names and a few label-indicative seed words per class.

The massive volume of text data and the complexity of natural language pose significant challenges to categorizing a text corpus without human annotations. For instance, user-provided seed words can have multiple interpretations depending on the context, and their respective user-intended interpretations have to be identified for accurate classification. Moreover, metadata information such as author, year, and location is widely available in addition to the text data, and it can serve as a strong, complementary source of supervision. However, leveraging metadata is challenging because (1) metadata is multi-typed, so it requires systematic modeling of different types and their combinations, and (2) metadata is noisy: some metadata entities (e.g., authors, venues) are more compelling label indicators than others. Also, the label set is typically assumed to be fixed in traditional text classification problems; however, in many real-world applications, new classes, especially more fine-grained ones, are introduced as the data volume increases. The goal of our research is to create general data-driven methods that transform real-world text data into structured categories of human knowledge with minimal human effort.

This thesis outlines a family of weakly supervised text classification approaches which, when combined, can automatically categorize a huge text corpus into coarse- and fine-grained classes with just a label hierarchy and a few label-indicative seed words as supervision. Specifically, it first leverages contextualized representations of word occurrences and seed word information to automatically differentiate multiple interpretations of a seed word, resulting in contextualized weak supervision. Then, to leverage metadata, it organizes the text data and metadata together into a text-rich network and adopts network motifs to capture appropriate combinations of metadata. Finally, we introduce a new problem called coarse-to-fine-grained classification, which aims to perform fine-grained classification on coarsely annotated data. Instead of asking for new fine-grained human annotations, we opt to leverage label surface names as the only human guidance and weave rich pre-trained generative language models into the iterative weak supervision strategy. We have performed extensive experiments on real-world datasets from different domains. The results demonstrate significant advantages of using contextualized weak supervision and leveraging metadata, and superior performance over baselines.
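The seed-word supervision described above can be illustrated with a deliberately simplified sketch: hard seed-word matching as a source of weak labels. This is a toy illustration of the general idea, not the thesis's contextualized method, which additionally disambiguates seed-word senses using contextualized representations:

    from collections import Counter
    from typing import Optional

    # Toy seed-word lists per class (illustrative only; real seed sets are user-provided).
    SEEDS = {
        "sports": {"game", "season", "team"},
        "politics": {"election", "senate", "policy"},
    }

    def weak_label(document: str) -> Optional[str]:
        """Assign the class whose seed words occur most often; None if no seed matches."""
        tokens = document.lower().split()
        counts = Counter({label: sum(t in seeds for t in tokens)
                          for label, seeds in SEEDS.items()})
        label, hits = counts.most_common(1)[0]
        return label if hits > 0 else None

    print(weak_label("The team won the final game of the season"))  # -> sports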