100+ datasets found
  1. Table_1_Applying machine-learning to rapidly analyze large qualitative text...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Oct 31, 2023
    + more versions
    Cite
    Bondaronek, Paulina; Yardley, Lucy; Papakonstantinou, Trisevgeni; Towler, Lauren; Amlôt, Richard; Ainsworth, Ben; Chadborn, Tim (2023). Table_1_Applying machine-learning to rapidly analyze large qualitative text datasets to inform the COVID-19 pandemic response: comparing human and machine-assisted topic analysis techniques.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001054549
    Explore at:
    Dataset updated
    Oct 31, 2023
    Authors
    Bondaronek, Paulina; Yardley, Lucy; Papakonstantinou, Trisevgeni; Towler, Lauren; Amlôt, Richard; Ainsworth, Ben; Chadborn, Tim
    Description

    Introduction: Machine-assisted topic analysis (MATA) uses artificial intelligence methods to help qualitative researchers analyze large datasets. This is useful for researchers who need to rapidly update healthcare interventions in changing healthcare contexts, such as a pandemic. We examined the potential to support healthcare interventions by comparing MATA with “human-only” thematic analysis techniques on the same dataset (1,472 user responses from a COVID-19 behavioral intervention).
    Methods: In MATA, an unsupervised topic-modeling approach identified latent topics in the text, from which researchers identified broad themes. In the human-only codebook analysis, researchers developed an initial codebook based on previous research, which the team applied to the dataset, meeting regularly to discuss and refine the codes. Formal triangulation using a “convergence coding matrix” compared findings between methods, categorizing them as “agreement”, “complementary”, “dissonant”, or “silent”.
    Results: Human analysis took much longer than MATA (147.5 vs. 40 h). Both methods identified key themes about what users found helpful and unhelpful, and formal triangulation showed that the two sets of findings were highly similar. All MATA codes were classified as in agreement with, or complementary to, the human themes. Where findings differed slightly, this was due to human researcher interpretation or nuance captured only in the human-only analysis.
    Discussion: Results produced by MATA were similar to human-only thematic analysis, with substantial time savings. For simple analyses that do not require an in-depth or subtle understanding of the data, MATA is a useful tool that can help qualitative researchers interpret and analyze large datasets quickly. This approach can support intervention development and implementation, for example by enabling rapid optimization during public health emergencies.
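
    The authors' MATA pipeline is not distributed with this record, but a minimal unsupervised topic-modeling sketch along the same lines (here using scikit-learn's NMF on TF-IDF features of free-text responses; the CSV path and "response" column name are hypothetical) looks like this:

    ```python
    # Minimal sketch of unsupervised topic modeling on free-text survey responses.
    # Illustration in the spirit of MATA, not the authors' pipeline; the file path
    # and the "response" column name are hypothetical.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    responses = pd.read_csv("user_responses.csv")["response"].dropna()

    vectorizer = TfidfVectorizer(max_df=0.95, min_df=5, stop_words="english")
    X = vectorizer.fit_transform(responses)

    nmf = NMF(n_components=10, random_state=0)   # 10 latent topics
    doc_topics = nmf.fit_transform(X)            # document-topic weights

    # Print the top words per topic so researchers can label broad themes.
    terms = vectorizer.get_feature_names_out()
    for k, topic in enumerate(nmf.components_):
        top = ", ".join(terms[i] for i in topic.argsort()[-8:][::-1])
        print(f"Topic {k}: {top}")
    ```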

  2. Making Predictions using Large Scale Gaussian Processes - Dataset - NASA...

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Making Predictions using Large Scale Gaussian Processes - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/making-predictions-using-large-scale-gaussian-processes
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    One of the key problems that arises in many areas is to estimate a potentially nonlinear function G(x, θ) from input-output samples (x, y) such that y ≈ G(x, θ). There are many approaches to this regression problem: neural networks, regression trees, and many other methods have been developed to estimate G from the input-output pairs. One method that I have worked with is called Gaussian process regression. There are many good texts and papers on the subject; for more technical information on the method and its applications see http://www.gaussianprocess.org/. A key problem that arises in developing these models on very large datasets is that training requires an O(N^3) computation, where N is the number of data points in the training sample. Obviously this becomes very problematic when N is large. I discussed this problem with Leslie Foster, a mathematics professor at San Jose State University. He, along with some of his students, developed a method to address this problem based on Cholesky decomposition with pivoting, and he shows that it leads to a numerically stable result. If you're interested in some light reading, I'd suggest you take a look at his recent paper (accepted in the Journal of Machine Learning Research) posted on dashlink. We've also posted code for you to try it out. Let us know how it goes. If you are interested in applications of this method in the area of prognostics, check out our new paper on the subject, published in IEEE Transactions on Systems, Man, and Cybernetics.
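
    As a point of reference for the O(N^3) bottleneck mentioned above, a minimal exact Gaussian process regression in NumPy looks like the sketch below; the pivoted-Cholesky approach replaces the full factorization here with a partial, low-rank one. This is an illustrative baseline only, not the code posted on dashlink.

    ```python
    import numpy as np

    def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
        # Squared-exponential kernel k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2))
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

    def gp_predict(X, y, X_star, noise=1e-2):
        # Exact GP regression: the Cholesky factorization of the N x N kernel matrix
        # is the O(N^3) step that becomes prohibitive for large N.
        K = rbf_kernel(X, X) + noise * np.eye(len(X))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        K_star = rbf_kernel(X_star, X)
        mean = K_star @ alpha
        v = np.linalg.solve(L, K_star.T)
        var = rbf_kernel(X_star, X_star).diagonal() - (v ** 2).sum(axis=0)
        return mean, var

    # Toy usage on synthetic 1-D data.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
    mu, var = gp_predict(X, y, np.linspace(-3, 3, 9)[:, None])
    ```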

  3. Data from: Sparse Machine Learning Methods for Understanding Large Text...

    • catalog.data.gov
    • s.cnmilf.com
    • +3more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Sparse Machine Learning Methods for Understanding Large Text Corpora [Dataset]. https://catalog.data.gov/dataset/sparse-machine-learning-methods-for-understanding-large-text-corpora
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Sparse machine learning has recently emerged as a powerful tool for obtaining models of high-dimensional data with a high degree of interpretability at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; and (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors of runway incursions and other drivers of aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
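
    A minimal example of the sparse-modeling idea for comparative analysis of two corpora is L1-regularized logistic regression on TF-IDF features, where the surviving nonzero coefficients act as a short, interpretable list of discriminative terms. This is a generic sketch, not the authors' code; the file names and "text" column are hypothetical.

    ```python
    # Sparse (L1-regularized) classification to find terms that discriminate two corpora.
    # Generic sketch; the file names and the "text" column are hypothetical.
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    corpus_a = pd.read_csv("reports_runway_incursion.csv")["text"]
    corpus_b = pd.read_csv("reports_other.csv")["text"]
    texts = pd.concat([corpus_a, corpus_b])
    labels = np.r_[np.ones(len(corpus_a)), np.zeros(len(corpus_b))]

    vec = TfidfVectorizer(min_df=5, stop_words="english")
    X = vec.fit_transform(texts)

    # The L1 penalty drives most coefficients to exactly zero, which is what
    # makes the resulting term list interpretable.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, labels)

    terms = vec.get_feature_names_out()
    nonzero = clf.coef_[0].nonzero()[0]
    top = sorted(nonzero, key=lambda i: -abs(clf.coef_[0][i]))[:20]
    print([(terms[i], round(clf.coef_[0][i], 3)) for i in top])
    ```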

  4. TMDB movies clean dataset

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Bharat Kumar0925 (2024). TMDB movies clean dataset [Dataset]. https://www.kaggle.com/datasets/bharatkumar0925/tmdb-movies-clean-dataset
    Explore at:
    Available download formats: zip (266877093 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Bharat Kumar0925
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description

    This dataset contains two files: Large_movies_data.csv and large_movies_clean.csv. The data is taken from the TMDB dataset. Originally, it contained around 900,000 movies, but some movies were dropped for recommendation purposes. Specifically, movies missing an overview were removed since the overview is one of the most important columns for analysis.

    Column Description:

    Large_movies_data.csv:

    • Id: Unique identifier for each movie.
    • Title: The title of the movie.
    • Overview: A brief description of the movie.
    • Genres: The genres associated with the movie.
    • Cast: The main actors in the movie.
    • Director: The director of the movie.
    • Writers: The screenwriters of the movie.
    • Production_companies: Companies involved in producing the movie.
    • Producers: Producers of the movie.
    • Original_language: The original language of the movie.
    • Vote_count: Number of votes the movie has received.
    • Vote_average: Average rating based on user votes.
    • Popularity: Popularity score of the movie.
    • Runtime: Duration of the movie in minutes.
    • Release_date: The release date of the movie.

    Total movies in Large_movies_data.csv: 663,828.

    Large_movies_clean.csv:

    This file is a cleaned version with unnecessary columns removed, text converted to lowercase, and many symbols removed (though some may still remain). If you find that certain features are missing, you can use the original Large_movies_data.csv.

    Columns in large_movies_clean.csv:

    • Id: Unique identifier for each movie.
    • Title: The title of the movie.
    • Tags: Combined information from the overview, genres, and other textual columns.
    • Original_language: The original language of the movie.
    • Vote_count: Number of votes the movie has received.
    • Vote_average: Average rating based on user votes.
    • Year: Year extracted from the release date.
    • Month: Month extracted from the release date.

    Possible Use Cases:

    1. Recommendation System: A robust recommendation system can be built using this large dataset.
    2. Analysis: Analyze various aspects, such as identifying actors who starred in the most popular movies, the impact of having the same writer, director, and producer on a movie, and whether independent producers create better movies.
    3. Rating Prediction: Predict the average rating of a movie based on factors such as overview, genres, and cast.
    4. Other Analysis: Perform other types of analysis to discover patterns in the movie industry.

    If you find this dataset useful, please upvote it!
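
    For use case 1, a minimal content-based recommender over the cleaned file could look like the sketch below (TF-IDF on the Tags column plus cosine similarity). The column names follow the description above, but the exact file layout is an assumption.

    ```python
    # Minimal content-based recommendation sketch using large_movies_clean.csv.
    # Column names (Id, Title, Tags, Vote_average) follow the dataset description;
    # details of the file layout are assumed.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    movies = pd.read_csv("large_movies_clean.csv").dropna(subset=["Tags"]).reset_index(drop=True)

    vec = TfidfVectorizer(stop_words="english", max_features=50_000)
    X = vec.fit_transform(movies["Tags"])

    def recommend(title, k=10):
        # Find the row for the query title and rank all other movies by cosine similarity.
        idx = movies.index[movies["Title"].str.lower() == title.lower()][0]
        sims = cosine_similarity(X[idx], X).ravel()
        best = sims.argsort()[::-1][1:k + 1]          # skip the movie itself
        return movies.loc[best, ["Title", "Vote_average"]]

    print(recommend("inception"))
    ```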

  5. Replication data for: Machine Learning Methods for Demand Estimation

    • openicpsr.org
    Updated May 1, 2015
    Cite
    Patrick Bajari; Denis Nekipelov; Stephen P. Ryan; Miaoyu Yang (2015). Replication data for: Machine Learning Methods for Demand Estimation [Dataset]. http://doi.org/10.3886/E113366V1
    Explore at:
    Dataset updated
    May 1, 2015
    Dataset provided by
    American Economic Association
    Authors
    Patrick Bajari; Denis Nekipelov; Stephen P. Ryan; Miaoyu Yang
    Description

    We survey and apply several techniques from the statistical and computer science literature to the problem of demand estimation. To improve out-of-sample prediction accuracy, we propose a method of combining the underlying models via linear regression. Our method is robust to a large number of regressors; scales easily to very large data sets; combines model selection and estimation; and can flexibly approximate arbitrary non-linear functions. We illustrate our method using a standard scanner panel data set and find that our estimates are considerably more accurate in out-of-sample predictions of demand than some commonly used alternatives.
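
    The combining step described above is essentially linear stacking of heterogeneous learners. A minimal, generic sketch of that idea (synthetic stand-in features and target, not the authors' replication code) is:

    ```python
    # Minimal sketch of combining several demand models via linear regression (stacking).
    # Generic illustration with synthetic stand-in data, not the replication code.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
    from sklearn.linear_model import LinearRegression, LassoCV
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 20))          # stand-in for price/promotion/store covariates
    y = -2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.5 * rng.normal(size=5000)  # stand-in for log demand

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Base learners are combined by a linear meta-model fit on out-of-fold predictions.
    stack = StackingRegressor(
        estimators=[
            ("forest", RandomForestRegressor(n_estimators=200, random_state=0)),
            ("gbm", GradientBoostingRegressor(random_state=0)),
            ("lasso", LassoCV()),
        ],
        final_estimator=LinearRegression(),
    )
    stack.fit(X_tr, y_tr)
    print("out-of-sample R^2:", stack.score(X_te, y_te))
    ```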

  6. (HS 1) Toward Seamless Environmental Modeling: Integration of HydroShare...

    • hydroshare.org
    • search.dataone.org
    zip
    Updated Oct 15, 2024
    + more versions
    Cite
    Young-Don Choi; Jonathan Goodall; Lawrence Band; Iman Maghami; Laurence Lin; Linnea Saby; Zhiyu/Drew Li; Shaowen Wang; Chris Calloway; Martin Seul; Dan Ames; David Tarboton; Hong Yi (2024). (HS 1) Toward Seamless Environmental Modeling: Integration of HydroShare with Server-side Methods for Exposing Large Datasets to Models [Dataset]. http://doi.org/10.4211/hs.afcc703d884e4f73b598c9e4b8f8a15e
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Oct 15, 2024
    Dataset provided by
    HydroShare
    Authors
    Young-Don Choi; Jonathan Goodall; Lawrence Band; Iman Maghami; Laurence Lin; Linnea Saby; Zhiyu/Drew Li; Shaowen Wang; Chris Calloway; Martin Seul; Dan Ames; David Tarboton; Hong Yi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This HydroShare resource was created to support the study presented in Choi et al. (2024), titled "Toward Reproducible and Interoperable Environmental Modeling: Integration of HydroShare with Server-side Methods for Exposing Large-Extent Spatial Datasets to Models." Ensuring the reproducibility of scientific studies is crucial for advancing research, with effective data management serving as a cornerstone for achieving this goal. In hydrologic and environmental modeling, spatial data is used as model input, and sharing this spatial data is a main step in the data management process. However, by focusing only on sharing data at the file level through small files rather than providing the ability to Find, Access, Interoperate with, and directly Reuse subsets of larger datasets, online data repositories have missed an opportunity to foster more reproducible science. This has led to challenges when accommodating large files that benefit from consistent data quality and seamless geographic extent.

    To utilize the benefits of large datasets, the objective of the Choi et al. (2024) study was to create and test an approach for exposing large extent spatial (LES) datasets to support catchment-scale hydrologic modeling needs. GeoServer and THREDDS Data Server connected to HydroShare were used to provide seamless access to LES datasets. The approach was demonstrated using the Regional Hydro-Ecologic Simulation System (RHESSys) for three different-sized watersheds in the US. Data consistency was assessed across three different data acquisition approaches: the 'conventional' approach, which involved sharing data at the file level through small files, as well as GeoServer and THREDDS Data Server. This assessment was conducted using RHESSys to evaluate differences in model streamflow output. This approach provided an opportunity to serve datasets needed to create catchment models in a consistent way that could be accessed and processed to serve individual modeling needs. For full details on the methods and approach, please refer to Choi et al. (2024). This HydroShare resource is essential for accessing the data and workflows that were integral to the study.

    This collection resource (HS 1) comprises 7 individual HydroShare resources (HS 2-8), each containing different datasets or workflows. These 7 HydroShare resources consist of the following: three resources for three state-scale LES datasets (HS 2-4), one resource with Jupyter notebooks for three different approaches and three different watersheds (HS 5), one resource for RHESSys model instances (i.e., input) of the conventional approach and observation data for all data access approaches in three different watersheds (HS 6), one resource with Jupyter notebooks for automated workflows to create LES datasets (HS 7), and finally one resource with Jupyter notebooks for the evaluation of data consistency (HS 8). More information on each resource is provided within it.
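
    As an illustration of the server-side access pattern the study compares, a spatial subset of a large gridded dataset can be pulled from a THREDDS Data Server through OPeNDAP without downloading whole files. The endpoint URL and variable names below are placeholders, not the actual HydroShare-linked services described in the resources.

    ```python
    # Sketch of subsetting a large gridded dataset served via THREDDS/OPeNDAP.
    # The URL and variable names are placeholders; the real service endpoints are
    # described in the linked HydroShare resources.
    import xarray as xr

    url = "https://example-thredds-server/thredds/dodsC/LES/land_cover.nc"  # placeholder
    ds = xr.open_dataset(url)                 # lazy open; no full download

    # Pull only the watershed bounding box and the variable of interest.
    subset = ds["land_cover"].sel(lat=slice(38.0, 38.2), lon=slice(-78.6, -78.4))
    subset.to_netcdf("watershed_land_cover.nc")  # materialize just the subset locally
    ```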

  7. ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and...

    • data-staging.niaid.nih.gov
    • dataone.org
    • +1more
    zip
    Updated Jul 5, 2023
    Cite
    Siavash Mirarab; John Yin; Chao Zhang (2023). ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and parallelization [Dataset]. http://doi.org/10.6076/D16W2H
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    University of California, San Diego
    Authors
    Siavash Mirarab; John Yin; Chao Zhang
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Motivation: Evolutionary histories can change from one part of the genome to another. The potential for discordance between gene trees has motivated the development of summary methods that reconstruct a species tree from an input collection of gene trees. ASTRAL is a widely used summary method and has been able to scale to relatively large datasets. However, the size of genomic datasets is quickly growing. Despite its relative efficiency, the current single-threaded implementation of ASTRAL is falling behind the data growth trends and is not able to analyze the largest available datasets in a reasonable time.
    Results: ASTRAL uses dynamic programming and is not trivially parallel. In this paper, we introduce ASTRAL-MP, the first version of ASTRAL that can exploit parallelism and that also uses randomization techniques to speed up some of its steps. Importantly, ASTRAL-MP can take advantage of not just multiple CPU cores but also one or several graphics processing units (GPUs). The ASTRAL-MP code scales very well with increasing CPU cores, and its GPU version, implemented in OpenCL, achieves up to 158× speedups compared to ASTRAL-III. Using GPUs and multiple cores, ASTRAL-MP is able to analyze datasets with 10,000 species or datasets with more than 100,000 genes in under 2 days.
    Availability and implementation: ASTRAL-MP is available at https://github.com/smirarab/ASTRAL/tree/MP.
    Methods: In testing the efficiency of ASTRAL-MP, we use several simulated and real datasets (see the table below). The datasets range in the number of species (n) between 48 and 1,000 and have between 1,000 and 14,446 gene trees (k).

    Name      Original publication        Species (n)           Genes (k)      Type       Generations             Contraction threshold    Reps.
    SV        Mirarab and Warnow (2015)   100, 200, 500, 1000   1000           Simulated  2×10^6                  Fully resolved           10
    Avian     Mirarab et al. (2014a)      48                    14,446; 1000   Real       Unknown (order: 10^7)   Full, 0, 33, 50, 75%     1, 10
    Insects   Sayyari et al. (2017)       144                   1478           Real       Unknown                 Fully resolved           1

    Note: For SV, some outlier replicates have fewer than 1,000 genes because poorly resolved gene trees are removed. For avian, the full dataset is subsampled randomly to create 10 inputs with 1,000 gene trees. In addition, to test the limits of n, we used an existing simulated dataset (20 replicates) with 10^4 species and 1,000 gene trees, similarly to the SV1000 dataset. To test the limits of k, we used an insect transcriptomic dataset (Misof et al., 2014; Sayyari et al., 2017) with 144 taxa and 1,478 genes, each with 100 bootstrapped gene trees.

  8. Big Data and Society Abstract & Indexing - ResearchHelpDesk

    • researchhelpdesk.org
    Updated Jun 23, 2022
    + more versions
    Cite
    Research Help Desk (2022). Big Data and Society Abstract & Indexing - ResearchHelpDesk [Dataset]. https://www.researchhelpdesk.org/journal/abstract-and-indexing/477/big-data-and-society
    Explore at:
    Dataset updated
    Jun 23, 2022
    Dataset authored and provided by
    Research Help Desk
    Description

    Big Data and Society Abstract & Indexing - ResearchHelpDesk - Big Data & Society (BD&S) is an open access, peer-reviewed scholarly journal that publishes interdisciplinary work, principally in the social sciences, humanities, and computing and their intersections with the arts and natural sciences, on the implications of Big Data for societies. The Journal's key purpose is to provide a space for connecting debates about the emerging field of Big Data practices and how they are reconfiguring academic, social, industry, business, and government relations, expertise, methods, concepts, and knowledge. BD&S moves beyond usual notions of Big Data and treats it as an emerging field of practice that is not defined by, but generative of, (sometimes) novel data qualities such as high volume and granularity, and complex analytics such as data linking and mining. It thus attends to digital content generated through online and offline practices in social, commercial, scientific, and government domains. This includes, for instance, content generated on the Internet through social media and search engines, but also content generated in closed networks (commercial or government transactions) and in open networks such as digital archives, open government, and crowdsourced data. Critically, rather than settling on a definition, the Journal makes this an object of interdisciplinary inquiry and debate, explored through studies of a variety of topics and themes. BD&S seeks contributions that analyze Big Data practices and/or involve empirical engagements and experiments with innovative methods, while also reflecting on the consequences for how societies are represented (epistemologies), realized (ontologies), and governed (politics).

    Article processing charge (APC): The APC for this journal is currently 1,500 USD. Authors who do not have funding for open access publishing can request a waiver from the publisher, SAGE, once their Original Research Article is accepted after peer review. For all other content (Commentaries, Editorials, Demos) and for Original Research Articles commissioned by the Editor, the APC will be waived.

    Abstract & Indexing: Clarivate Analytics: Social Sciences Citation Index (SSCI); Directory of Open Access Journals (DOAJ); Google Scholar; Scopus

  9. Large Customer Churn Analysis Dataset

    • kaggle.com
    zip
    Updated Dec 18, 2024
    Cite
    Hajra Amir (2024). Large Customer Churn Analysis Dataset [Dataset]. https://www.kaggle.com/datasets/hajraamir21/large-customer-churn-analysis-dataset
    Explore at:
    Available download formats: zip (17387 bytes)
    Dataset updated
    Dec 18, 2024
    Authors
    Hajra Amir
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains synthetic data generated for customer churn analysis. It includes 1,000 entries representing customer information, such as demographics, account details, subscription types, and churn status. The data is ideal for predictive modeling, machine learning algorithms, and exploratory data analysis (EDA).

    Features:

    • CustomerID: A unique identifier for each customer.
    • Gender: Male or Female.
    • Age: Customer's age in years.
    • Geography: Country or region of the customer (e.g., Germany, France, UK).
    • Tenure: Number of months the customer has been with the company.
    • Contract: Type of subscription (Month-to-month, One-year, Two-year).
    • MonthlyCharges: The amount billed monthly.
    • TotalCharges: The total amount billed to date.
    • PaymentMethod: Method used for payments (e.g., Credit card, Direct debit).
    • IsActiveMember: Whether the customer is an active member (1 = Active, 0 = Inactive).
    • Churn: Indicates whether the customer has churned (Yes/No).
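
    A minimal sketch for training a churn classifier on these columns follows; the CSV file name is assumed, while the column names are taken from the feature list above.

    ```python
    # Minimal churn-prediction sketch using the columns listed above.
    # The CSV file name is an assumption.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("customer_churn.csv")
    y = (df["Churn"] == "Yes").astype(int)
    X = df.drop(columns=["CustomerID", "Churn"])

    categorical = ["Gender", "Geography", "Contract", "PaymentMethod"]
    numeric = ["Age", "Tenure", "MonthlyCharges", "TotalCharges", "IsActiveMember"]

    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", "passthrough", numeric),
    ])
    model = Pipeline([("pre", pre), ("clf", GradientBoostingClassifier(random_state=0))])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    model.fit(X_tr, y_tr)
    print("held-out accuracy:", model.score(X_te, y_te))
    ```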

  10. Additional file 1 of A generative model for evaluating missing data methods...

    • springernature.figshare.com
    zip
    Updated Feb 9, 2025
    Cite
    Lav Radosavljević; Stephen M. Smith; Thomas E. Nichols (2025). Additional file 1 of A generative model for evaluating missing data methods in large epidemiological cohorts [Dataset]. http://doi.org/10.6084/m9.figshare.28377358.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 9, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lav Radosavljević; Stephen M. Smith; Thomas E. Nichols
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Material 1.

  11. Data from: Evolution of large males is associated with female-skewed adult...

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    • +3more
    zip
    Updated May 19, 2021
    Cite
    András Liker; Veronika Bókony; Ivett Pipoly; Jean-François Lemaître; Jean-Michel Gaillard; Tamas Szekely; Robert P. Freckleton (2021). Evolution of large males is associated with female-skewed adult sex ratios in amniotes [Dataset]. http://doi.org/10.5061/dryad.5qfttdz56
    Explore at:
    Available download formats: zip
    Dataset updated
    May 19, 2021
    Dataset provided by
    University of Sheffield
    Université Claude Bernard Lyon 1
    University of Pannonia
    Hungarian Research Network
    University of Bath
    Authors
    András Liker; Veronika Bókony; Ivett Pipoly; Jean-François Lemaître; Jean-Michel Gaillard; Tamas Szekely; Robert P. Freckleton
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Body size often differs between the sexes (leading to sexual size dimorphism, SSD) as a consequence of differential responses by males and females to selection pressures. The adult sex ratio (the proportion of males in the adult population, ASR) should influence SSD because ASR relates to both the number of competitors and the number of available mates, which shape the intensity of mating competition and thereby promote SSD evolution. However, whether ASR correlates with SSD variation among species has not yet been tested across a broad range of taxa. Using phylogenetic comparative analyses of 462 amniotes (i.e. reptiles, birds and mammals), we fill this knowledge gap by showing that male bias in SSD increases with increasingly female-biased ASRs in both mammals and birds. This relationship is not explained by the higher mortality of the larger sex, because SSD is not associated with sex differences in either juvenile or adult mortality. Phylogenetic path analysis indicates that higher mortality in one sex leads to a skewed ASR, which in turn may generate selection for SSD biased towards the rare sex. Taken together, our findings provide evidence that skewed ASRs in amniote populations can result in the rarer sex evolving large size to capitalize on enhanced mating opportunities.

    Methods Comparative dataset containing raw data used in the study. Data were collected from published sources (see Methods in the paper), references are provided for all records.

  12. Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 23, 2023
    Cite
    Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
    Explore at:
    Dataset updated
    Nov 23, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lall, Ranjit; Robinson, Thomas
    Description

    Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
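
    The core idea of MIDAS, treating missing cells as extra corruption for a denoising autoencoder and training only against the originally observed cells, can be sketched in a few lines of PyTorch. This is a toy illustration, not the authors' released software, and the architecture and hyperparameters are arbitrary.

    ```python
    # Toy denoising-autoencoder imputer in the spirit of MIDAS (not the authors' software).
    # X is an (n, d) float tensor with NaNs marking missing entries.
    import torch
    import torch.nn as nn

    def dae_impute(X, hidden=64, epochs=200, corrupt_p=0.5, lr=1e-3):
        mask_obs = ~torch.isnan(X)                            # observed-value mask
        X0 = torch.where(mask_obs, X, torch.zeros_like(X))    # zero-fill missing cells
        d = X.shape[1]
        net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                            nn.Linear(hidden, hidden), nn.ReLU(),
                            nn.Linear(hidden, d))
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(epochs):
            # Corrupt a random subset of the *observed* cells, as in denoising autoencoders.
            drop = (torch.rand_like(X0) < corrupt_p) & mask_obs
            inp = torch.where(drop, torch.zeros_like(X0), X0)
            out = net(inp)
            # Reconstruction loss only on originally observed cells.
            loss = ((out - X0)[mask_obs] ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            X_hat = net(X0)
        return torch.where(mask_obs, X, X_hat)                # keep observed, fill missing

    # Toy usage: a 200 x 5 matrix with roughly 20% of entries missing at random.
    X = torch.randn(200, 5)
    X[torch.rand_like(X) < 0.2] = float("nan")
    X_imputed = dae_impute(X)
    ```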

  13. Data from: Advanced Computational Methods for Large-Scale Optimization...

    • curate.nd.edu
    Updated May 12, 2025
    Cite
    Zhihao Xu (2025). Advanced Computational Methods for Large-Scale Optimization Problems [Dataset]. http://doi.org/10.7274/28786112.v1
    Explore at:
    Dataset updated
    May 12, 2025
    Dataset provided by
    University of Notre Dame
    Authors
    Zhihao Xu
    License

    https://www.law.cornell.edu/uscode/text/17/106

    Description

    With the development of science and technology, large-scale optimization tasks have become integral to cutting-edge engineering. The challenges of solving these problems arise from ever-growing system sizes, intricate physical spaces, and the computational cost required to accurately model and optimize target objectives. Taking the design of advanced functional materials as an example, the high-dimensional parameter space and high-fidelity physical simulations can demand immense computational resources for searching and iteration. Although emerging machine learning techniques have been combined with conventional experimental and simulation approaches to explore the design space and identify high-performance solutions, these methods are still limited to a small part of the design space, around materials that have already been well investigated.

    Over the past several decades, continuous development of both hardware and algorithms has addressed some of these challenges. High-performance computing (HPC) architectures and heterogeneous systems have greatly expanded the capacity to perform large-scale calculations and optimizations. At the same time, the emergence of machine learning frameworks and algorithms has dramatically facilitated the development of advanced models and enabled the integration of AI-driven techniques into traditional experiments and simulations more seamlessly. In recent years, quantum computing (QC) has received widespread attention due to its strong performance in finding global optima and is regarded as a promising approach to large-scale, non-linear optimization problems; in the meantime, quantum computing principles also expand the capacity of classical algorithms to explore high-dimensional combinatorial spaces. In this dissertation, we show the power of integrating machine learning algorithms, quantum algorithms, and HPC architectures to tackle the challenges of solving large-scale optimization problems.

    In the first part of this dissertation, we introduce an optimization algorithm based on a quantum-inspired genetic algorithm (QGA) to design planar multilayers (PML) for transparent radiative cooler (TRC) applications. Numerical experiments showed that our QGA-facilitated optimization algorithm converges to solutions comparable to those from quantum annealing (QA), and that the QGA outperformed a classical genetic algorithm (CGA) in both convergence speed and global search capacity. Our work shows that quantum heuristic algorithms can become powerful tools for addressing the challenges traditional optimization algorithms face when solving large-scale optimization problems with complex search spaces.
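
    For readers unfamiliar with the QGA family, the sketch below illustrates the general quantum-inspired idea (bits represented by probability amplitudes that are "observed" to sample candidates and then rotated toward the best solution found so far) on a trivial objective. It is a generic toy, not the dissertation's PML/TRC design code, and the rotation rule is a simplification.

    ```python
    # Toy quantum-inspired genetic algorithm (QGA): each bit is encoded by an amplitude
    # angle theta, candidates are sampled from sin^2(theta), and theta is rotated toward
    # the best bitstring observed so far. Generic illustration, not the dissertation code.
    import numpy as np

    def qga(fitness, n_bits, pop_size=20, generations=100, delta=0.05 * np.pi):
        theta = np.full((pop_size, n_bits), np.pi / 4)       # equal superposition: P(1) = 0.5
        best_x, best_f = None, -np.inf
        rng = np.random.default_rng(0)
        for _ in range(generations):
            p_one = np.sin(theta) ** 2                       # probability each bit is 1
            pop = (rng.random((pop_size, n_bits)) < p_one).astype(int)
            fits = np.array([fitness(x) for x in pop])
            if fits.max() > best_f:
                best_f, best_x = fits.max(), pop[fits.argmax()].copy()
            # Rotate each amplitude toward the corresponding bit of the best solution.
            direction = np.where(best_x == 1, 1.0, -1.0)
            theta = np.clip(theta + delta * direction, 0.01, np.pi / 2 - 0.01)
        return best_x, best_f

    # Toy usage: maximize the number of ones in a 30-bit string.
    x, f = qga(lambda bits: bits.sum(), n_bits=30)
    print(f, x)
    ```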

    In the second part of the dissertation, we propose a quantum annealing-assisted lattice optimization (QALO) algorithm for high-entropy alloy (HEA) systems. The algorithm is built on an active learning framework that integrates a field-aware factorization machine (FFM), quantum annealing (QA), and a machine learning potential (MLP). When applied to optimizing the bulk grain configuration of the NbMoTaW alloy system, our algorithm quickly obtains low-energy microstructures, and the results successfully reproduce the Nb segregation and W enrichment in the bulk phase driven by thermodynamic forces, as usually observed in experiments and MC/MD simulations. This work highlights the potential of quantum computing for exploring the large design space of HEA systems.

    In the third part of the dissertation, we employed the Distributed Quantum Approximate Optimization Algorithm (DQAOA) to address large-scale combinatorial optimization problems that exceed the limits of conventional computational resources. This was achieved through a divide-and-conquer strategy, in which the original problem is decomposed into smaller sub-tasks that are solved in parallel on a high-performance computing (HPC) system. To further enhance convergence efficiency, we introduced an Impact Factor Directed (IFD) decomposition method. By calculating impact factors and leveraging a targeted traversal strategy, IFD captures local structural features of the problem, making it effective for both dense and sparse instances. Finally, we explored the integration of DQAOA with the Quantum Framework (QFw) on the Frontier HPC system, demonstrating the potential for efficient management of large-scale circuit execution workloads across CPUs and GPUs.

  14. Large-scale Docking Datasets for Machine Learning

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 22, 2023
    Cite
    Andreas Luttens; Israel Cabeza de Vaca; Leonard Sparring; Ulf Norinder; Jens Carlsson (2023). Large-scale Docking Datasets for Machine Learning [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7903160
    Explore at:
    Dataset updated
    May 22, 2023
    Dataset provided by
    Science for Life Laboratory, Uppsala University
    Uppsala University, Stockholm University, Örebro University
    Authors
    Andreas Luttens; Israel Cabeza de Vaca; Leonard Sparring; Ulf Norinder; Jens Carlsson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large-scale virtual screening has become a valuable tool for early-phase drug discovery. Recent expansions of commercial chemical space have made it computationally intractable to evaluate all compounds in the libraries. Machine learning is one of the methods that aim to prioritize specific subsets of these vast libraries. In order to put these methods to the test, access to large-scale datasets is beneficial. To help the community benchmark their work, we share the docking scores of several ultralarge virtual screening campaigns.

    The datasets we provide contain canonical SMILES, compound identifiers, and docking scores. We docked two different chemical libraries against eight different biological targets with therapeutic relevance. The first dataset contains approximately 15.5 million molecules adhering to the "Rule-of-Four", whereas the second consists of approximately 235 million "lead-like" molecules. The biological targets represent different classes of proteins and binding sites.

    More details on the datasets and our methods can be found on (https://github.com/carlssonlab/conformalpredictor) and our pre-print (https://doi.org/10.26434/chemrxiv-2023-w3x36).

    Please feel free to download and use these datasets for your own research purposes. We only ask that you cite our pre-print and the datasets appropriately if you use them in your work. Thank you for your interest in our research!
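
    A typical way to use such a table of SMILES and docking scores for machine-learning prioritization is to featurize the molecules (here with RDKit Morgan fingerprints) and regress on the docking score. The sketch below is generic; the file name and column names ("smiles", "docking_score") are assumptions about the released files.

    ```python
    # Sketch: train a surrogate model on docking scores to prioritize unscored compounds.
    # File and column names ("smiles", "docking_score") are assumptions about the released files.
    import numpy as np
    import pandas as pd
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("target1_docking_scores.csv").dropna(subset=["smiles", "docking_score"])

    def fingerprint(smiles, n_bits=2048):
        # Morgan (ECFP-like) fingerprint; returns None for unparsable SMILES.
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))

    fps = df["smiles"].apply(fingerprint)
    keep = fps.map(lambda f: f is not None)
    X = np.stack(fps[keep].to_list())
    y = df.loc[keep, "docking_score"].to_numpy()

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0).fit(X_tr, y_tr)
    print("held-out R^2:", model.score(X_te, y_te))
    ```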

  15. Dataset Large Scale

    • kaggle.com
    zip
    Updated Aug 30, 2024
    Cite
    Edmilson Silva (2024). Dataset Large Scale [Dataset]. https://www.kaggle.com/datasets/sc0v1n0/large-scale-01-knapsack-problems
    Explore at:
    Available download formats: zip (12207 bytes)
    Dataset updated
    Aug 30, 2024
    Authors
    Edmilson Silva
    Description

    This dataset was created to facilitate the implementation of the study on knapsack problems.

    Uncorrelated Data Instances

    File                     Optimum
    knapPI_1_100_1000_1      9147
    knapPI_1_200_1000_1      11238
    knapPI_1_500_1000_1      28857
    knapPI_1_1000_1000_1     54503
    knapPI_1_2000_1000_1     110625
    knapPI_1_5000_1000_1     276457
    knapPI_1_10000_1000_1    563647

    Weakly Correlated Instances

    File                     Optimum
    knapPI_2_100_1000_1      1514
    knapPI_2_200_1000_1      1634
    knapPI_2_500_1000_1      4566
    knapPI_2_1000_1000_1     9052
    knapPI_2_2000_1000_1     18051
    knapPI_2_5000_1000_1     44356
    knapPI_2_10000_1000_1    90204

    Strongly Correlated Instances

    File                     Optimum
    knapPI_3_100_1000_1      2397
    knapPI_3_200_1000_1      2697
    knapPI_3_500_1000_1      7117
    knapPI_3_1000_1000_1     14390
    knapPI_3_2000_1000_1     28919
    knapPI_3_5000_1000_1     72505
    knapPI_3_10000_1000_1    146919
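
    Each instance in these tables can be checked against its listed optimum with a standard 0/1 knapsack dynamic program, sketched below. The instance-file format assumed by the parser (first line "n capacity", then one "value weight" pair per line) is a guess and may need adjusting to the actual files.

    ```python
    # Standard 0/1 knapsack dynamic program for verifying an instance's optimum.
    # The file format assumed here (first line "n capacity", then "value weight" per item)
    # is an assumption; adjust the parser to the actual instance files.
    def solve_knapsack(path):
        with open(path) as f:
            n, capacity = map(int, f.readline().split())
            items = [tuple(map(int, f.readline().split())) for _ in range(n)]  # (value, weight)

        best = [0] * (capacity + 1)              # best[c] = max value achievable with capacity c
        for value, weight in items:
            # Iterate capacities downward so each item is used at most once.
            for c in range(capacity, weight - 1, -1):
                best[c] = max(best[c], best[c - weight] + value)
        return best[capacity]

    print(solve_knapsack("knapPI_1_100_1000_1"))  # expected optimum per the table: 9147
    ```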

    Resources:

  16. Big Data and Society CiteScore 2024-2025 - ResearchHelpDesk

    • researchhelpdesk.org
    Updated Apr 9, 2022
    + more versions
    Cite
    Research Help Desk (2022). Big Data and Society CiteScore 2024-2025 - ResearchHelpDesk [Dataset]. https://www.researchhelpdesk.org/journal/sjr/477/big-data-and-society
    Explore at:
    Dataset updated
    Apr 9, 2022
    Dataset authored and provided by
    Research Help Desk
    Description

    Big Data and Society CiteScore 2024-2025 - ResearchHelpDesk - Big Data & Society (BD&S) is an open access, peer-reviewed scholarly journal that publishes interdisciplinary work, principally in the social sciences, humanities, and computing and their intersections with the arts and natural sciences, on the implications of Big Data for societies. The Journal's key purpose is to provide a space for connecting debates about the emerging field of Big Data practices and how they are reconfiguring academic, social, industry, business, and government relations, expertise, methods, concepts, and knowledge. BD&S moves beyond usual notions of Big Data and treats it as an emerging field of practice that is not defined by, but generative of, (sometimes) novel data qualities such as high volume and granularity, and complex analytics such as data linking and mining. It thus attends to digital content generated through online and offline practices in social, commercial, scientific, and government domains. This includes, for instance, content generated on the Internet through social media and search engines, but also content generated in closed networks (commercial or government transactions) and in open networks such as digital archives, open government, and crowdsourced data. Critically, rather than settling on a definition, the Journal makes this an object of interdisciplinary inquiry and debate, explored through studies of a variety of topics and themes. BD&S seeks contributions that analyze Big Data practices and/or involve empirical engagements and experiments with innovative methods, while also reflecting on the consequences for how societies are represented (epistemologies), realized (ontologies), and governed (politics).

    Article processing charge (APC): The APC for this journal is currently 1,500 USD. Authors who do not have funding for open access publishing can request a waiver from the publisher, SAGE, once their Original Research Article is accepted after peer review. For all other content (Commentaries, Editorials, Demos) and for Original Research Articles commissioned by the Editor, the APC will be waived.

    Abstract & Indexing: Clarivate Analytics: Social Sciences Citation Index (SSCI); Directory of Open Access Journals (DOAJ); Google Scholar; Scopus

  17. Data from: Fully automated sequence alignment methods are comparable to, and...

    • datadryad.org
    • researchdiscovery.drexel.edu
    zip
    Updated Jan 30, 2019
    Cite
    Therese A. Catanach; Andrew D. Sweet; Nam-phuong D. Nguyen; Rhiannon M. Peery; Andrew H. Debevec; Andrea K. Thomer; Amanda C. Owings; Bret M. Boyd; Aron D. Katz; Felipe N. Soto-Adames; Julie M. Allen (2019). Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus [Dataset]. http://doi.org/10.5061/dryad.nc220
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 30, 2019
    Dataset provided by
    Dryad
    Authors
    Therese A. Catanach; Andrew D. Sweet; Nam-phuong D. Nguyen; Rhiannon M. Peery; Andrew H. Debevec; Andrea K. Thomer; Amanda C. Owings; Bret M. Boyd; Aron D. Katz; Felipe N. Soto-Adames; Julie M. Allen
    Time period covered
    Oct 19, 2017
    Description

    Cleaned_GenBank_Files.zip (cleaned_genbank.zip): Hepatitis B virus GenBank files after initial data filtering steps.
    Genome_alignments.zip: Sequence alignments of hepatitis B virus genomes and the S-region. Files include the manual genome alignment, de-gapped manual alignments, the MUSCLE genome alignment, linearized and unlinearized PASTA alignments, and the S-region alignment.
    Genome_trees.zip: Tree files estimated from sequence alignments of hepatitis B virus genomes. Trees are best maximum likelihood (ML) trees with bootstrap support values. Includes trees based on the MUSCLE, manual, and PASTA genome alignments.
    Genome_consensus_sequence.fasta (GenomeConsensus.fasta): Consensus sequence of hepatitis B virus genomes. This sequence was used as a reference for HBV manual alignments.
    Genotype_trees.zip: Tree files used for genotype occupancy tests in hepatitis B viruses. Trees estimated from manual or PASTA genome alignments. Files include .tre and .xml formats.
    GI_Clustering.zip: Initial files of hepatitis B virus s...

  18. Data from: A new method for handling missing species in diversification...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 6, 2012
    Cite
    Natalie Cusimano; Tanja Stadler; Susanne S. Renner (2012). A new method for handling missing species in diversification analysis applicable to randomly or non-randomly sampled phylogenies [Dataset]. http://doi.org/10.5061/dryad.r8f04fk2
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 6, 2012
    Dataset provided by
    Ludwig-Maximilians-Universität München
    Authors
    Natalie Cusimano; Tanja Stadler; Susanne S. Renner
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Chronograms from molecular dating are increasingly being used to infer rates of diversification and their change over time. A major limitation in such analyses is incomplete species sampling that moreover is usually non-random. While the widely used γ statistic with the MCCR test or the birth-death likelihood analysis with the ∆AICrc test statistic are appropriate for comparing the fit of different diversification models in phylogenies with random species sampling, no objective, automated method has been developed for fitting diversification models to non-randomly sampled phylogenies. Here we introduce a novel approach, CorSiM, which involves simulating missing splits under a constant-rate birth-death model and allows the user to specify whether species sampling in the phylogeny being analyzed is random or non-random. The completed trees can be used in subsequent model-fitting analyses. This is fundamentally different from previous diversification rate estimation methods, which were based on null distributions derived from the incomplete trees. CorSiM is automated in an R package and can easily be applied to large data sets. We illustrate the approach in two Araceae clades, one with a random species sampling of 52% and one with a non-random sampling of 55%. In the latter clade, the CorSiM approach detects and quantifies an increase in diversification rate while classic approaches prefer a constant rate model, whereas in the former clade, results do not differ among methods (as indeed expected since the classic approaches are valid only for randomly sampled phylogenies). The CorSiM method greatly reduces the type I error in diversification analysis, but type II error remains a methodological problem.

  19. Data from: Data and code from: A high throughput approach for measuring soil...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Sep 2, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: A high throughput approach for measuring soil slaking index [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-a-high-throughput-approach-for-measuring-soil-slaking-index
    Explore at:
    Dataset updated
    Sep 2, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset includes soil wet aggregate stability measurements from the Upper Mississippi River Basin LTAR site in Ames, Iowa. Samples were collected in 2021 from this long-term tillage and cover crop trial in a corn-based agroecosystem. We measured wet aggregate stability using digital photography to quantify disintegration (slaking) of submerged aggregates over time, similar to the technique described by Fajardo et al. (2016) and Rieke et al. (2021). However, we adapted the technique to larger sample numbers by using a multi-well tray to submerge 20-36 aggregates simultaneously. We used this approach to measure the slaking index of 160 soil samples (2,120 aggregates). This dataset includes the slaking index calculated for each aggregate, and also summarized by sample. There were usually 10-12 aggregates measured per sample. We focused primarily on methodological issues, assessing the statistical power of the slaking index, the needed replication, sensitivity to cultural practices, and sensitivity to sample collection date. We found that small numbers of highly unstable aggregates lead to skewed distributions of the slaking index. We concluded that at least 20 aggregates per sample are preferred to provide confidence in measurement precision. However, the experiment had high statistical power with only 10-12 replicates per sample. The slaking index was not sensitive to the initial size of dry aggregates (3 to 10 mm diameter); therefore, pre-sieving soils was not necessary. The field trial showed greater aggregate stability under no-till than chisel plow practice, and changing stability over a growing season. These results will be useful to researchers and agricultural practitioners who want a simple, fast, low-cost method for measuring wet aggregate stability on many samples.

  20. Replication Data for: Large Language Models as a Substitute for Human...

    • search.dataone.org
    Updated Mar 6, 2024
    Cite
    Heseltine, Michael (2024). Replication Data for: Large Language Models as a Substitute for Human Experts in Annotating Political Text [Dataset]. http://doi.org/10.7910/DVN/V2P6YL
    Explore at:
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Heseltine, Michael
    Description

    Large-scale text analysis has grown rapidly as a method in political science and beyond. To date, text-as-data methods rely on large volumes of human-annotated training examples, which places a premium on researcher resources. However, advances in large language models (LLMs) may make automated annotation increasingly viable. This paper tests the performance of GPT-4 across a range of scenarios relevant to the analysis of political text. We compare GPT-4 coding with human expert coding of tweets and news articles across four variables (whether text is political, its negativity, its sentiment, and its ideology) and across four countries (the United States, Chile, Germany, and Italy). GPT-4 coding is highly accurate, especially for shorter texts such as tweets, correctly classifying texts up to 95% of the time. Performance drops for longer news articles, and very slightly for non-English text. We introduce a "hybrid" coding approach, in which disagreements among multiple GPT-4 runs are adjudicated by a human expert, which boosts accuracy. Finally, we explore downstream effects, finding that transformer models trained on hand-coded or GPT-4-coded data yield almost identical outcomes. Our results suggest that LLM-assisted coding is a viable and cost-efficient approach, although consideration should be given to task complexity.
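
    The "hybrid" coding step described above amounts to accepting labels on which repeated model runs agree and routing disagreements to a human expert. A schematic, model-agnostic sketch follows; the get_llm_label function is a placeholder for whatever annotation backend is used, not a real API binding.

    ```python
    # Schematic sketch of the hybrid coding idea: accept unanimous machine labels,
    # route disagreements to a human coder. `get_llm_label` is a placeholder for
    # whatever LLM annotation call is used; it is not a real API binding.
    from collections import Counter

    def get_llm_label(text: str, run: int) -> str:
        raise NotImplementedError("plug in your LLM annotation backend here")

    def hybrid_code(texts, n_runs=3):
        auto_labels, needs_human = {}, []
        for text in texts:
            votes = Counter(get_llm_label(text, run) for run in range(n_runs))
            label, count = votes.most_common(1)[0]
            if count == n_runs:                  # unanimous across runs: keep machine label
                auto_labels[text] = label
            else:                                # any disagreement: adjudicate by a human expert
                needs_human.append((text, dict(votes)))
        return auto_labels, needs_human
    ```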
