49 datasets found
  1. Dataset: Gold standard dataset for explainability need detection in app...

    • zenodo.org
    zip
    Updated May 20, 2025
    Cite
    Martin Obaidi (2025). Dataset: Gold standard dataset for explainability need detection in app reviews. [Dataset]. http://doi.org/10.5281/zenodo.13273192
    Explore at:
    Available download formats: zip
    Dataset updated
    May 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Obaidi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We crawled 90,000 app reviews from both Google Play Store and Apple App Store, including reviews from both free and paid apps. These reviews were filtered for explainability needs, and after this process, 4,495 reviews remained. Among them, 2,185 reviews indicated an explanation need, while 2,310 did not. This resulting gold standard dataset was used to train and evaluate several machine learning models and rule-based approaches for detecting explanation needs in app reviews.
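
    The class counts quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, and the majority-class baseline is a reference point added here, not a figure reported by the dataset authors.

```python
# Counts taken from the dataset description above.
total_reviews = 4495
with_need = 2185
without_need = 2310

# The two classes should account for every review in the gold standard.
assert with_need + without_need == total_reviews

# Share of reviews that express an explanation need.
share_with_need = with_need / total_reviews

# Accuracy of a trivial classifier that always predicts the larger class
# (an assumption-free reference point, not a result from the authors).
majority_baseline = max(with_need, without_need) / total_reviews

print(f"explanation need: {share_with_need:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

    The two classes are close to balanced, so a model's accuracy should be read against a baseline only slightly above one half.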

    The dataset includes both balanced and unbalanced evaluation sets, as well as the original crawled data from October 2023. In addition to machine learning approaches, rule-based methods optimized for F1 score, precision, and recall are also included.

    We provide several pre-trained machine learning models (including BERT, SetFit, AdaBoost, K-Nearest Neighbor, Logistic Regression, Naive Bayes, Random Forest, and SVM) along with training scripts and evaluation notebooks. These models can be applied directly or retrained using the included datasets.

    For further details on the structure and usage of the dataset, please refer to the README.md file within the provided ZIP archive.

  2. Gold Price Prediction using Machine Learning

    • kaggle.com
    Updated Sep 11, 2024
    Cite
    Subho117 (2024). Gold Price Prediction using Machine Learning [Dataset]. https://www.kaggle.com/datasets/subho117/gold-price-prediction-using-machine-learning/suggestions
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Subho117
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Subho117

    Released under MIT

    Contents

  3. GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 20, 2024
    Cite
    Moncla, Ludovic (2024). GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10530177
    Explore at:
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Vigier, Denis
    Moncla, Ludovic
    McDonough, Katherine
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.

    The dataset is available in the following formats:

    JSONL format provided by Prodigy

    binary spaCy format (ready to use with the spaCy train pipeline)

    The Gold Standard dataset is composed of 2,200 paragraphs randomly selected from 2,001 Encyclopédie entries. All paragraphs were written in 18th-century French.

    The spans/entities were labeled by the project team, with pre-labelling by early machine learning models used to speed up the labelling process. A train/val/test split was used. The validation and test sets are composed of 200 paragraphs each: 100 classified under 'Géographie' and 100 from another knowledge domain. The datasets have the following breakdown of tokens and spans/entities.

    Tagset

    NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity) including natural features, e.g. ville, la rivière, royaume.

    NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.

    ENE-Spatial: nested spatial entity, e.g. ville de France, royaume de Naples, la mer Baltique.

    Relation: spatial relation, e.g. dans, sur, à 10 lieues de.

    Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.

    NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.

    NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.

    ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.

    NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique

    ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.

    Head: entry name

    Domain-Mark: words indicating the knowledge domain (usually after the head and between parenthesis), e.g. Géographie, Geog., en Anatomie.
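
    To illustrate how span annotations using this tagset might be consumed, the sketch below counts labels in JSONL records. It assumes the common Prodigy span layout ("text" plus a "spans" list with "start", "end" and "label"); the two sample records and the helper names are hypothetical and are not taken from the dataset.

```python
import json
from collections import Counter

# Hypothetical records in a Prodigy-style span layout; NOT real dataset rows.
SAMPLE_JSONL = """\
{"text": "Paris, ville de France.", "spans": [{"start": 0, "end": 5, "label": "NP-Spatial"}, {"start": 7, "end": 22, "label": "ENE-Spatial"}, {"start": 7, "end": 12, "label": "NC-Spatial"}, {"start": 16, "end": 22, "label": "NP-Spatial"}]}
{"text": "Louis XIV, roi de France.", "spans": [{"start": 0, "end": 9, "label": "NP-Person"}]}
"""


def iter_records(jsonl_text):
    """Yield one parsed JSON object per non-empty line."""
    for line in jsonl_text.splitlines():
        if line.strip():
            yield json.loads(line)


def count_labels(jsonl_text):
    """Count how often each span label occurs across all records."""
    counts = Counter()
    for record in iter_records(jsonl_text):
        for span in record.get("spans", []):
            counts[span["label"]] += 1
    return counts


def span_texts(jsonl_text, label):
    """Return the surface strings of all spans carrying the given label."""
    found = []
    for record in iter_records(jsonl_text):
        for span in record.get("spans", []):
            if span["label"] == label:
                found.append(record["text"][span["start"]:span["end"]])
    return found


label_counts = count_labels(SAMPLE_JSONL)
np_spatial = span_texts(SAMPLE_JSONL, "NP-Spatial")
print(label_counts)
print(np_spatial)
```

    Spans may overlap, as with the nested ENE-Spatial entity in the first sample record, which is why they are stored as character offsets rather than as one tag per token.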

    HuggingFace

    The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA

    spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries

    This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.

    Acknowledgement

    The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy the ARTFL Encyclopédie Project, University of Chicago.

  4. Machine Learning Models for Gold Price Prediction (Forecast)

    • kappasignal.com
    Updated Dec 19, 2023
    Cite
    KappaSignal (2023). Machine Learning Models for Gold Price Prediction (Forecast) [Dataset]. https://www.kappasignal.com/2023/12/machine-learning-models-for-gold-price.html
    Explore at:
    Dataset updated
    Dec 19, 2023
    Dataset authored and provided by
    KappaSignal
    License

    https://www.kappasignal.com/p/legal-disclaimer.html

    Description

    This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

    Machine Learning Models for Gold Price Prediction

    Financial data:

    • Historical daily stock prices (open, high, low, close, volume)

    • Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

    • Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

    Machine learning features:

    • Feature engineering based on financial data and technical indicators

    • Sentiment analysis data from social media and news articles

    • Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

    Potential Applications:

    • Stock price prediction

    • Portfolio optimization

    • Algorithmic trading

    • Market sentiment analysis

    • Risk management

    Use Cases:

    • Researchers investigating the effectiveness of machine learning in stock market prediction

    • Analysts developing quantitative trading Buy/Sell strategies

    • Individuals interested in building their own stock market prediction models

    • Students learning about machine learning and financial applications

    Additional Notes:

    • The dataset may include different levels of granularity (e.g., daily, hourly)

    • Data cleaning and preprocessing are essential before model training

    • Regular updates are recommended to maintain the accuracy and relevance of the data

  5. Fruits-360 dataset

    • kaggle.com
    • paperswithcode.com
    • +1 more
    Updated Jun 7, 2025
    Cite
    Mihai Oltean (2025). Fruits-360 dataset [Dataset]. https://www.kaggle.com/datasets/moltean/fruits
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mihai Oltean
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Fruits-360 dataset: A dataset of images containing fruits, vegetables, nuts and seeds

    Version: 2025.06.07.0

    Content

    The following fruits, vegetables and nuts are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Goosberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).

    Branches

    The dataset has 5 major branches:

    - The 100x100 branch, where all images have 100x100 pixels. See the fruits-360_100x100 folder.

    - The original-size branch, where all images are at their original (captured) size. See the fruits-360_original-size folder.

    - The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See the fruits-360_dataset_meta folder.

    - The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See the fruits-360_multi folder.

    - The 3_body_problem branch, where the Training and Test folders contain different (varieties of) the 3 fruits and vegetables (Apples, Cherries and Tomatoes). See the fruits-360_3-body-problem folder.

    How to cite

    Mihai Oltean, Fruits-360 dataset, 2017-

    Dataset properties

    For the 100x100 branch

    Total number of images: 138704.

    Training set size: 103993 images.

    Test set size: 34711 images.

    Number of classes: 206 (fruits, vegetables, nuts and seeds).

    Image size: 100x100 pixels.

    For the original-size branch

    Total number of images: 58363.

    Training set size: 29222 images.

    Validation set size: 14614 images

    Test set size: 14527 images.

    Number of classes: 90 (fruits, vegetables, nuts and seeds).

    Image size: various (original, captured, size) pixels.

    For the 3-body-problem branch

    Total number of images: 47033.

    Training set size: 34800 images.

    Test set size: 12233 images.

    Number of classes: 3 (Apples, Cherries, Tomatoes).

    Number of varieties: Apples = 29; Cherries = 12; Tomatoes = 19.

    Image size: 100x100 pixels.

    For the meta branch

    Number of classes: 26 (fruits, vegetables, nuts and seeds).

    For the multi branch

    Number of images: 150.
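
    As a quick consistency check on the figures listed above, the following sketch verifies that each branch's split sizes add up to its stated total and reports the resulting split fractions. The dictionary layout and function name are hypothetical; only the numbers come from the listing.

```python
# Sizes copied from the "Dataset properties" listing above.
branches = {
    "100x100": {"total": 138704, "train": 103993, "test": 34711},
    "original-size": {
        "total": 58363,
        "train": 29222,
        "validation": 14614,
        "test": 14527,
    },
    "3-body-problem": {"total": 47033, "train": 34800, "test": 12233},
}


def split_fractions(sizes):
    """Return each split's share of the total, after checking the sum."""
    total = sizes["total"]
    parts = {name: n for name, n in sizes.items() if name != "total"}
    if sum(parts.values()) != total:
        raise ValueError("split sizes do not add up to the stated total")
    return {name: n / total for name, n in parts.items()}


for branch_name, sizes in branches.items():
    fractions = split_fractions(sizes)
    print(branch_name, {k: round(v, 3) for k, v in fractions.items()})
```

    All three branches pass the check; the 100x100 and 3-body-problem branches use roughly a three-to-one train/test split, while the original-size branch reserves about half of its images for validation and test.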

    Filename format:

    For the 100x100 branch

    image_index_100.jpg (e.g. 31_100.jpg) or

    r_image_index_100.jpg (e.g. r_31_100.jpg) or

    r?_image_index_100.jpg (e.g. r2_31_100.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).

    Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.
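
    The 100x100 filename patterns described above can be parsed with a regular expression. This is a minimal sketch: the pattern and function name are assumptions made for illustration, not code shipped with the dataset.

```python
import re

# Optional rotation prefix ("r" or "r" plus digits), then the image index,
# then the fixed "_100" size suffix used in the 100x100 branch.
PATTERN_100 = re.compile(r"^(?:(?P<rotation>r\d*)_)?(?P<index>\d+)_100\.jpg$")


def parse_100x100_name(filename):
    """Return the rotation prefix and image index, or None if no match."""
    match = PATTERN_100.match(filename)
    if match is None:
        return None
    return {
        "rotation": match.group("rotation"),  # None, "r", "r2", ...
        "index": int(match.group("index")),
    }


for name in ["31_100.jpg", "r_31_100.jpg", "r2_31_100.jpg", "r2_31.jpg"]:
    print(name, parse_100x100_name(name))
```

    A name without the "_100" suffix, such as r2_31.jpg, is rejected by this pattern, which matches the description's point that the suffix distinguishes the 100x100 branch from the original-size branch.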

    For the original-size branch

    r?_image_index.jpg (e.g. r2_31.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis.

    The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.

    For the multi branch

    The file's name is the concatenation of the names of the fruits inside that picture.

    Alternate download

    The Fruits-360 dataset can be downloaded from:

    Kaggle https://www.kaggle.com/moltean/fruits

    GitHub https://github.com/fruits-360

    How fruits were filmed

    Fruits and vegetables were planted in the shaft of a low-speed motor (3 rpm) and a short movie of 20 seconds was recorded.

    A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.

    Behind the fruits, we placed a white sheet of paper as a background.

    Here i...

  6. Data from: Mineral prospectivity mapping using machine learning techniques...

    • hub.arcgis.com
    Updated Sep 21, 2023
    Cite
    MetalEarth (2023). Mineral prospectivity mapping using machine learning techniques for gold exploration in the Larder Lake area, Ontario, Canada [Dataset]. https://hub.arcgis.com/documents/1be05051de7c498c97e7cb267076b435
    Explore at:
    Dataset updated
    Sep 21, 2023
    Dataset authored and provided by
    MetalEarth
    Area covered
    Canada, Ontario, Larder Lake
    Description

    A mineral prospectivity map (MPM) focusing on gold mineralization in the Larder Lake region of Northern Ontario, Canada, has been produced in this study. We applied the Random Forest (RF) algorithm to 32 predictor maps integrating geophysical, geochemical, and geological datasets from various sources that represent vectors to gold mineralization. The efficiency of classification curves shows that the MPMs generated are robust. The unsupervised algorithms K-means and principal component analysis (PCA) were used to investigate and visualize the clustering nature of large geochemical and geophysical datasets. We used RQ-mode PCA to compute variable and object loadings simultaneously, which allows observations and variables to be displayed at the same scale. PCA biplots of the Larder Lake geochemical data show that Au is strongly correlated with W, S, Pb and K, but inversely correlated with Fe, Mn, Co, Mg, Ca, and Ni. The known gold mineralization locations were well classified by RF, with an accuracy of 95.63%. Furthermore, a partial least squares-discriminant analysis (PLS-DA) model combining 3D geophysical clusters and geochemical compositions indicates that the Au-rich areas are characterized by low to mid resistivity and low susceptibility. We conclude that the Larder Lake-Cadillac deformation zone (LLCDZ) is relatively more fertile than the Lincoln-Nipissing shear zone (LNSZ) with respect to gold mineralization due to deeper penetrating faults. The intersection of the LLCDZ and the network of high-angle NE-trending cross faults acts as a key conduit for gold endowment in the Larder Lake area. This study innovatively combined multivariate geological, geochemical, and geophysical datasets via machine learning algorithms, which improves identification of geochemical anomalies and interpretation of spatial features associated with gold mineralization.

  7. Relationship and Entity Extraction Evaluation Dataset (Entities)

    • data.europa.eu
    • data.wu.ac.at
    json
    Updated Oct 30, 2021
    + more versions
    Cite
    Defence Science and Technology Laboratory (2021). Relationship and Entity Extraction Evaluation Dataset (Entities) [Dataset]. https://data.europa.eu/data/datasets/relationship-and-entity-extraction-evaluation-dataset-entities
    Explore at:
    Available download formats: json
    Dataset updated
    Oct 30, 2021
    Dataset authored and provided by
    Defence Science and Technology Laboratory
    Description

    This entities dataset was the output of a project that aimed to create a 'gold standard' dataset that could be used to train and validate machine learning approaches to natural language processing (NLP). The project was carried out by Aleph Insights and Committed Software on behalf of the Defence Science and Technology Laboratory (Dstl). The dataset focuses specifically on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst. The dataset was therefore constructed using documents and structured schemas that were relevant to the defence and security analysis domain. A number of data subsets were produced (this is the BBC Online data subset). Further information about this data subset (BBC Online) and the others produced (together with licence conditions, attribution and schemas) may be found at the main project GitHub repository webpage (https://github.com/dstl/re3d). Note that the 'entities.json' file is to be used together with the 'documents.json' and 'relations.json' files (also found on this data.gov.uk webpage), and that their structures and relationship are described on the given GitHub webpage.

  8. Gold VTKEL 300 documents dataset

    • figshare.com
    txt
    Updated Sep 9, 2020
    Cite
    Shahi Dost (2020). Gold VTKEL 300 documents dataset [Dataset]. http://doi.org/10.6084/m9.figshare.9815387.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 9, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Shahi Dost
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    (Updated version that fixes some bugs from the previous version.) Manually corrected gold standard dataset for 300 documents of VTKEL.

  9. daily-historical-stock-price-data-for-alamos-gold-inc-20032025

    • huggingface.co
    Updated Mar 20, 2025
    Cite
    Khaled Ben Ali (2025). daily-historical-stock-price-data-for-alamos-gold-inc-20032025 [Dataset]. https://huggingface.co/datasets/khaledxbenali/daily-historical-stock-price-data-for-alamos-gold-inc-20032025
    Explore at:
    Dataset updated
    Mar 20, 2025
    Authors
    Khaled Ben Ali
    Description

    📈 Daily Historical Stock Price Data for Alamos Gold Inc. (2003–2025)

    A clean, ready-to-use dataset containing daily stock prices for Alamos Gold Inc. from 2003-05-02 to 2025-05-28. This dataset is ideal for use in financial analysis, algorithmic trading, machine learning, and academic research.

      🗂️ Dataset Overview
    

    Company: Alamos Gold Inc.
    Ticker Symbol: AGI
    Date Range: 2003-05-02 to 2025-05-28
    Frequency: Daily
    Total Records: 5554 rows (one per trading day)…

    See the full description on the dataset page: https://huggingface.co/datasets/khaledxbenali/daily-historical-stock-price-data-for-alamos-gold-inc-20032025.

  10. BUTTER-E - Energy Consumption Data for the BUTTER Empirical Deep Learning...

    • osti.gov
    • data.openei.org
    • +1 more
    Updated Dec 30, 2022
    + more versions
    Cite
    DOE Open Energy Data Initiative (OEDI) (2022). BUTTER-E - Energy Consumption Data for the BUTTER Empirical Deep Learning Dataset [Dataset]. http://doi.org/10.25984/2329316
    Explore at:
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Office of Science (http://www.er.doe.gov/)
    National Renewable Energy Laboratory (NREL), Golden, CO (United States)
    DOE Open Energy Data Initiative (OEDI)
    Description

    The BUTTER-E - Energy Consumption Data for the BUTTER Empirical Deep Learning Dataset adds node-level energy consumption data from watt-meters to the primary sweep of the BUTTER - Empirical Deep Learning Dataset. This dataset contains energy consumption and performance data from 63,527 individual experimental runs spanning 30,582 distinct configurations: 13 datasets, 20 sizes (number of trainable parameters), 8 network "shapes", and 14 depths on both CPU and GPU hardware collected using node-level watt-meters. This dataset reveals the complex relationship between dataset size, network structure, and energy use, and highlights the impact of cache effects. BUTTER-E is intended to be joined with the BUTTER dataset (see "BUTTER - Empirical Deep Learning Dataset on OEDI" resource below) which characterizes the performance of 483k distinct fully connected neural networks but does not include energy measurements.

  11. ABC Gold standard dataset for 300 documents

    • figshare.com
    txt
    Updated Sep 27, 2019
    Cite
    Abc Abc (2019). ABC Gold standard dataset for 300 documents [Dataset]. http://doi.org/10.6084/m9.figshare.9913886.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 27, 2019
    Dataset provided by
    figshare
    Authors
    Abc Abc
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Manually corrected gold standard dataset for 300 documents of ABC.

  12. Data from: Dataset of Au atomic structures for training Machine Learning...

    • zenodo.org
    • repository.uantwerpen.be
    bz2
    Updated May 12, 2025
    Cite
    Vlahovic Jovana; Cem Sevik; Milorad Milosevic (2025). Dataset of Au atomic structures for training Machine Learning Interatomic Potentials [Dataset]. http://doi.org/10.5281/zenodo.15366677
    Explore at:
    Available download formats: bz2
    Dataset updated
    May 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Vlahovic Jovana; Cem Sevik; Milorad Milosevic
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains atomic structures of gold (Au) generated using density functional theory (DFT) calculations performed with the VASP package [1, 2]. The calculations were carried out using the projector-augmented wave (PAW) [3, 4] method and Perdew–Burke–Ernzerhof (PBE) pseudopotentials for gold, within the generalised gradient approximation (GGA) [5] for the exchange-correlation functional.

    Molecular dynamics simulations were performed for at least 500 steps per structure. For bulk systems, the temperature range spans from 100 K to 1500 K, while for nanoparticles and slab structures, it extends from 100 K to 1000 K. For training our machine learning model (GAP [6]), the first 200 steps of each molecular dynamics trajectory were discarded to allow the thermostat to equilibrate the system to the target temperature. Using this dataset, we trained a GAP model for Au nanoparticles, whose applicability extends beyond the nanoparticle sizes included in the training set.
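
    The equilibration cut described above (discarding the first 200 steps of each trajectory) might look like the sketch below. The function name and the toy trajectory are hypothetical; this is not the authors' preprocessing script, and real frames would hold coordinates, energies and forces rather than step indices.

```python
def drop_equilibration(frames, n_discard=200):
    """Return the frames that remain after discarding the first n_discard steps."""
    if n_discard < 0:
        raise ValueError("n_discard must be non-negative")
    if len(frames) <= n_discard:
        raise ValueError("trajectory is shorter than the equilibration window")
    return frames[n_discard:]


# Hypothetical 500-step trajectory; each frame is just its step index here.
trajectory = list(range(500))
production = drop_equilibration(trajectory)
print(len(production), production[0])
```

    With the minimum trajectory length of 500 steps stated in the description, this leaves 300 frames per structure for training.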

    The dataset includes these starting configurations:

    • Small, low-energy Au nanoparticles (3 to 55 atoms)

    • Bulk Au in fcc, bcc, hcp, and simple cubic (sc) crystal structures

    • Slab models of fcc surfaces

    The initial low-energy nanoparticle structures were adopted from a dataset reported in the literature [7].

    For each structure, we provide atomic coordinates along with corresponding total energies and per-atom forces. This dataset is suitable for training and validating machine learning interatomic potentials for gold.

    References

    [1] Kresse and Furthmüller. "Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set." Physical Review B 54.16 (1996): 11169.
    [2] https://www.vasp.at/
    [3] Blöchl. "Projector augmented-wave method." Physical Review B 50.24 (1994): 17953.
    [4] Kresse and Joubert. "From ultrasoft pseudopotentials to the projector augmented-wave method." Physical Review B 59.3 (1999): 1758.
    [5] Perdew, John P., Kieron Burke, and Matthias Ernzerhof. "Generalized gradient approximation made simple." Physical Review Letters 77.18 (1996): 3865
    [6] Bartók, Payne, et al. "Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons." Physical Review Letters 104.13 (2010): 136403.
    [7] Manna, Sukriti, et al. "A database of low-energy atomically precise nanoclusters." Scientific Data 10.1 (2023): 308.

  13. ‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 27, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Sentiment Analysis of Commodity News (Gold)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-sentiment-analysis-of-commodity-news-gold-732f/e3232de2/?iid=002-045&v=presentation
    Explore at:
    Dataset updated
    Sep 27, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Sentiment Analysis of Commodity News (Gold)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ankurzing/sentiment-analysis-in-commodity-market-gold on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This is a news dataset for the commodity market where we have manually annotated 11,412 news headlines across multiple dimensions into various classes. The dataset has been sampled from a period of 20+ years (2000-2021).

    Content

    The dataset has been collected from various news sources and annotated by three human annotators who were subject experts. Each news headline was evaluated on various dimensions, for instance - if a headline is a price related news then what is the direction of price movements it is talking about; whether the news headline is talking about the past or future; whether the news item is talking about asset comparison; etc.

    Acknowledgements

    Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." In Future of Information and Communication Conference, pp. 589-601. Springer, Cham, 2021.

    https://arxiv.org/abs/2009.04202 Sinha, Ankur, and Tanmay Khandait. "Impact of News on the Commodity Market: Dataset and Results." arXiv preprint arXiv:2009.04202 (2020)

    We would like to acknowledge the financial support provided by the India Gold Policy Centre (IGPC).

    Inspiration

    Commodity prices are known to be quite volatile. Machine learning models that understand the commodity news well will be able to provide an additional input to the short-term and long-term price forecasting models. The dataset will also be useful in creating news-based indicators for commodities.

    Apart from researchers and practitioners working in the area of news analytics for commodities, the dataset will also be useful for researchers looking to evaluate their models on classification problems in the context of text-analytics. Some of the classes in the dataset are highly imbalanced and may pose challenges to the machine learning algorithms.

    --- Original source retains full ownership of the source dataset ---

  14. Dataset of automatically extracted sizes and morphologies of AuNPs from...

    • figshare.com
    txt
    Updated Nov 18, 2021
    Cite
    Akshay Subramanian; Kevin Cruse; Amalie Trewartha; Xingzhi Wang; Paul Alivisatos; Gerbrand Ceder (2021). Dataset of automatically extracted sizes and morphologies of AuNPs from literature-mined SEM/TEM images [Dataset]. http://doi.org/10.6084/m9.figshare.17019836.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 18, 2021
    Dataset provided by
    figshare
    Authors
    Akshay Subramanian; Kevin Cruse; Amalie Trewartha; Xingzhi Wang; Paul Alivisatos; Gerbrand Ceder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset of automatically extracted sizes and morphologies of Gold nanoparticles, which have been obtained from the analysis of 4365 literature-mined SEM/TEM images. The dataset contains 4365 records, each of which contains extracted size/morphology information and metadata corresponding to a single microscopy image.

  15. Gold Standard Snapshot Serengeti Bounding Box Coordinates

    • borealisdata.ca
    Updated Jan 7, 2025
    Cite
    Stefan Schneider; Stefan Kremer; Graham Taylor (2025). Gold Standard Snapshot Serengeti Bounding Box Coordinates [Dataset]. http://doi.org/10.5683/SP/TPB5ID
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 7, 2025
    Dataset provided by
    Borealis
    Authors
    Stefan Schneider; Stefan Kremer; Graham Taylor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2010 - 2016
    Area covered
    Serengeti, Africa
    Description

    This dataset contributes to the terrific work done by the Snapshot Serengeti community by providing bounding box coordinates for the Gold Standard Snapshot Serengeti dataset, for the purpose of training deep learning object detectors to detect, localize, and classify species in camera trap images.

  16. Additional file 3 of OffsampleAI: artificial intelligence approach to...

    • springernature.figshare.com
    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Katja Ovchinnikova; Vitaly Kovalev; Lachlan Stuart; Theodore Alexandrov (2023). Additional file 3 of OffsampleAI: artificial intelligence approach to recognize off-sample mass spectrometry images [Dataset]. http://doi.org/10.6084/m9.figshare.12082305.v1
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Katja Ovchinnikova; Vitaly Kovalev; Lachlan Stuart; Theodore Alexandrov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 3: Supplementary Data D1-D5. D1: “Supplementary methods and results.pdf”. D2: “Interactive tagging of ion images using web app.mov”, video of a tagger using the TagOff web app. D3: “Gold standard datasets.csv”, metadata of 87 public datasets from METASPACE selected for the gold standard. D4: “DHB matrix clusters frequencies.csv”, results of annotation of 31 gold standard datasets acquired using the MALDI DHB matrix and positive ion mode and off-sample recognition for DHB matrix clusters generated according to a combinatorial model. D5: “DESI offsample ions frequencies.csv”, a file showing for each molecular formula the number of DESI imaging datasets from the gold standard where ions with such molecular formula were classified as off-sample.

  17. daily-historical-stock-price-data-for-americas-gold-and-silver-corporation-20032025...

    • huggingface.co
    Updated Mar 20, 2025
    Cite
    Khaled Ben Ali (2025). daily-historical-stock-price-data-for-americas-gold-and-silver-corporation-20032025 [Dataset]. https://huggingface.co/datasets/khaledxbenali/daily-historical-stock-price-data-for-americas-gold-and-silver-corporation-20032025
    Dataset updated
    Mar 20, 2025
    Authors
    Khaled Ben Ali
    Description

    📈 Daily Historical Stock Price Data for Americas Gold and Silver Corporation (2003–2025)

    A clean, ready-to-use dataset containing daily stock prices for Americas Gold and Silver Corporation from 2003-10-27 to 2025-05-28. This dataset is ideal for use in financial analysis, algorithmic trading, machine learning, and academic research.

    🗂️ Dataset Overview

    Company: Americas Gold and Silver Corporation
    Ticker Symbol: USAS
    Date Range: 2003-10-27 to 2025-05-28
    Frequency:…

    See the full description on the dataset page: https://huggingface.co/datasets/khaledxbenali/daily-historical-stock-price-data-for-americas-gold-and-silver-corporation-20032025.

  18. Data from: DECM Machine Learning Training Corpus

    • figshare.com
    • produccioncientifica.ucm.es
    • +1 more
    bin
    Updated May 30, 2023
    Cite
    Patricia Murrieta-Flores; Mariana Favila-Vázquez; Raquel Liceras-Garrido (2023). DECM Machine Learning Training Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.12366734.v3
    Available download formats: bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Patricia Murrieta-Flores; Mariana Favila-Vázquez; Raquel Liceras-Garrido
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The DECM Corpus is a digital corpus of the texts of the Relaciones Geográficas de Nueva España (the Geographic Reports of New Spain) with different versions, including a machine-ready version, a gold standard annotated dataset, and an automatically annotated version ready for text mining and machine learning experiments. This version contains a sample of the RGs manually annotated by multiple researchers with the software of our industry partner, Tagtog. This corpus has been used to carry out the NLP and ML experiments, and the files are available in JSON and TSV format. These files are composed of texts and annotations. The corpus is also accompanied by the DECM ontology, which provides an explanation of the entities and labels produced. This corpus can be used for further experimentation with Artificial Intelligence methods.

  19. Data from: Global Wheat Head Detection (GWHD) Dataset: A Large and Diverse...

    • ckan.grassroots.tools
    pdf
    Updated Sep 15, 2022
    Cite
    Rothamsted Research (2022). Global Wheat Head Detection (GWHD) Dataset: A Large and Diverse Dataset of High-Resolution RGB-Labelled Images to Develop and Benchmark Wheat Head Detection Methods [Dataset]. https://ckan.grassroots.tools/ar/dataset/fc628bb2-24cb-46ca-8a04-466c605c72d4
    Available download formats: pdf
    Dataset updated
    Sep 15, 2022
    Dataset provided by
    Rothamsted Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The detection of wheat heads in plant images is an important task for estimating pertinent wheat traits including head population density and head characteristics such as health, size, maturity stage, and the presence of awns. Several studies have developed methods for wheat head detection from high-resolution RGB imagery based on machine learning algorithms. However, these methods have generally been calibrated and validated on limited datasets. High variability in observational conditions, genotypic differences, development stages, and head orientation makes wheat head detection a challenge for computer vision. Further, possible blurring due to motion or wind and overlap between heads for dense populations make this task even more complex. Through a joint international collaborative effort, we have built a large, diverse, and well-labelled dataset of wheat images, called the Global Wheat Head Detection (GWHD) dataset. It contains 4700 high-resolution RGB images and 190000 labelled wheat heads collected from several countries around the world at different growth stages with a wide range of genotypes. Guidelines for image acquisition, associating minimum metadata to respect FAIR principles, and consistent head labelling methods are proposed when developing new head detection datasets. The GWHD dataset is publicly available at

  20. Data from: CS4984/CS5984: Big Data Text Summarization Team 17 ETDs

    • explore.openaire.eu
    Updated Dec 15, 2018
    Cite
    Farnaz Khaghani; Ashin Marin Thomas; Chinmaya Patnayak; Dhruv Sharma; John Aromando (2018). CS4984/CS5984: Big Data Text Summarization Team 17 ETDs [Dataset]. https://explore.openaire.eu/search/dataset?pid=10919%2F86420
    Dataset updated
    Dec 15, 2018
    Authors
    Farnaz Khaghani; Ashin Marin Thomas; Chinmaya Patnayak; Dhruv Sharma; John Aromando
    Description

    Given the current explosion of information across media such as electronic and physical texts, concise and relevant data has become key to understanding. Summarization, the process of reducing a text to convey only its salient aspects, has emerged as a challenging task in the field of Natural Language Processing. Academia has been generating voluminous amounts of data in the form of theses and dissertations. Obtaining a chapter-wise summary of an electronic thesis or dissertation can be computationally expensive, particularly because of its length and the subject to which it pertains. Through this course, various summarization techniques, primarily extractive and abstractive summarization, were researched and analyzed. There have been various developments in the field of deep learning to tackle problems related to summarization and produce coherent and meaningful summaries for news articles. In this project, tools that could be used to generate coherent and concise summaries of long electronic theses and dissertations (ETDs) were developed as well.

    The major initial concern was extracting the text from the PDF file of an ETD. GROBID and Scienceparse were used as pre-processing tools for this task; they presented the text from a PDF in a structured format such as an XML or JSON file. The outputs of the tools were compared qualitatively as well as quantitatively. After this, a transfer learning approach was adopted, wherein a pre-trained model was tweaked to fit the task of summarizing each ETD. Making the model learn the nuances of an ETD proved a challenge. An iterative approach was used to explore various networks, each trying to address the shortcomings of the previous one in its own way. Existing deep learning models, including Sequence-2-Sequence, Pointer Generator Networks, and a Hybrid Extractive-Abstractive Reinforce-Selecting Sentence Rewriting Network, were used to generate and test summaries. Further tweaks were made to these deep neural networks to account for much longer and more varied datasets than those they were designed for, in this case ETDs.

    The generated summaries were also evaluated thoroughly with respect to golden standards for five dissertations and theses created during the span of the course. ROUGE-1, ROUGE-2, and ROUGE-SU4 were used to compare the generated summaries with the golden standards. The average ROUGE scores were 0.1387, 0.1224, and 0.0480, respectively. These low ROUGE scores could be attributed to the varying summary length and to the complexity of the task of summarizing an ETD. The scope for improvement and the underlying reasons for the performance have also been analyzed. The conclusion that can be drawn from the project is that any machine learning task is highly biased by the patterns inherently present in the data on which it is trained. In the context of summarization, an article can be summarized from different perspectives, and thus the quantitative evaluation measures can vary drastically even when the summary is a coherent one.

    NSF: IIS-1619028

    The submission contains multiple files:
    - CS5984_Final_Presentation.pdf: The PDF version of the presentation.
    - CS5984_Final_Presentation.ppt: The PowerPoint for the presentation.
    - CS5984_Final_Report.pdf: The PDF version of the report.
    - CS5984_Final_Report.zip: The LaTeX source code for the report.
    - ArXiv finished file: processed and tokenized arXiv data for Pointer Generator Network
    - text-summarization-tensorflow: seq2seq model code in TensorFlow, adapted to the arXiv dataset
