100+ datasets found
  1. A Large Scale Fish Dataset

    • gts.ai
    json
    Updated Mar 20, 2024
    Cite
    GTS (2024). A Large Scale Fish Dataset [Dataset]. https://gts.ai/dataset-download/a-large-scale-fish-dataset/
    Explore at:
    Available download formats: json
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset was collected in order to carry out segmentation, feature extraction, and classification tasks and to compare common segmentation, feature extraction, and classification methods.

  2. Data from: MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile...

    • zenodo.org
    • produccioncientifica.ucm.es
    pdf, zip
    Updated Jul 6, 2024
    Cite
    Oresti Banos; Miguel Damas; Carmen Goicoechea; Pandelis Perakakis; Hector Pomares; Ciro Rodriguez-Leon; Daniel Sanabria; Claudia Villalonga (2024). MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile Dataset for Investigating Individual and Collective Well-Being [Dataset]. http://doi.org/10.5281/zenodo.11060596
    Explore at:
    Available download formats: pdf, zip
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Oresti Banos; Miguel Damas; Carmen Goicoechea; Pandelis Perakakis; Hector Pomares; Ciro Rodriguez-Leon; Daniel Sanabria; Claudia Villalonga
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This study engaged 409 participants over a period spanning from July 10 to August 8, 2023, ensuring representation across various demographic factors: 221 females, 186 males, 2 non-binary, year of birth between 1951 and 2005, with varied annual incomes and from 15 Spanish regions. The MobileWell400+ dataset, openly accessible, encompasses a wide array of data collected via the participants' mobile phone, including demographic, emotional, social, behavioral, and well-being data. Methodologically, the project presents a promising avenue for uncovering new social, behavioral, and emotional indicators, supplementing existing literature. Notably, artificial intelligence is considered to be instrumental in analysing these data, discerning patterns, and forecasting trends, thereby advancing our comprehension of individual and population well-being. Ethical standards were upheld, with participants providing informed consent.

    The following is a non-exhaustive list of collected data:

    • Data continuously collected through the participants' smartphone sensors: physical activity (resting, walking, driving, cycling, etc.), name of detected WiFi networks, connectivity type (WiFi, mobile, none), ambient light, ambient noise, and status of the device screen (on, off, locked, unlocked).
    • Data corresponding to an initial survey prompted via the smartphone, with information related to demographic data, effects and COVID vaccination, average hours of physical activity, and answers to a series of questions to measure mental health, many of them taken from internationally recognised psychological and well-being scales (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception.
    • Data corresponding to daily surveys prompted via the smartphone, where variables related to mood (valence, activation, energy and emotional events) and social interaction (quantity and quality) are measured.
    • Data corresponding to weekly surveys prompted via the smartphone, where information on overall health, hours of physical activity per week, loneliness, and questions related to well-being are asked.
    • Data corresponding to a final survey prompted via the smartphone, consisting of similar questions to the ones asked in the initial survey, namely psychological and well-being items (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception questions.

    For a more detailed description of the study please refer to MobileWell400+StudyDescription.pdf.

    For a more detailed description of the collected data, variables and data files please refer to MobileWell400+FilesDescription.pdf.
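
    The exact file layout and variable names are documented in the PDFs above. Purely as an illustration, with a hypothetical daily-survey export and hypothetical column names, the mood variables could be aggregated per participant like this:

    import pandas as pd

    # Hypothetical file and column names; see MobileWell400+FilesDescription.pdf
    # for the actual files and variables.
    daily = pd.read_csv("daily_survey.csv", parse_dates=["date"])

    # Average self-reported mood per participant across the study period.
    mood = (
        daily.groupby("participant_id")[["valence", "activation", "energy"]]
        .mean()
        .rename(columns=lambda c: f"mean_{c}")
    )
    print(mood.head())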

  3. TREC 2022 Deep Learning test collection

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
    Explore at:
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

    Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. Lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large-scale datasets to TREC, and at creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

    Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

    The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
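
    TREC relevance judgments are distributed in the standard four-column qrels format (query id, iteration, document id, relevance). As a minimal illustration (the file name below is hypothetical), the judgments can be parsed as follows:

    from collections import defaultdict

    def load_qrels(path):
        """Parse a TREC qrels file into {query_id: {doc_id: relevance}}."""
        qrels = defaultdict(dict)
        with open(path) as fh:
            for line in fh:
                if not line.strip():
                    continue
                qid, _iteration, docid, rel = line.split()
                qrels[qid][docid] = int(rel)
        return qrels

    qrels = load_qrels("2022.qrels.pass.txt")  # hypothetical file name
    print(len(qrels), "judged queries")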

  4. Large Scale Topo Tower (Polygon) (LGATE-164) - Datasets - data.wa.gov.au

    • catalogue.data.wa.gov.au
    Updated Jul 10, 2019
    + more versions
    Cite
    (2019). Large Scale Topo Tower (Polygon) (LGATE-164) - Datasets - data.wa.gov.au [Dataset]. https://catalogue.data.wa.gov.au/dataset/large-scale-topo-tower-polygon-lgate-164
    Explore at:
    Dataset updated
    Jul 10, 2019
    Area covered
    Western Australia
    Description

    A tall framework or structure, the elevation of which is functional. Multiple points that describe the feature’s perimeter. NOTE: Landgate no longer maintains large scale topographic features. The large scale topographic data capture programme ceased in 2016. Please consider carefully the suitability of the data within this service for your purpose. © Western Australian Land Information Authority (Landgate). Use of Landgate data is subject to Personal Use License terms and conditions unless otherwise authorised under approved License terms and conditions.

  5. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    • explore.openaire.eu
    bz2
    Updated Mar 15, 2021
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Explore at:
    Available download formats: bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: initially empty; the notebook analyses/N12.To.Paper.ipynb moves data into it.

    In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
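
    The analysis notebooks use this variable to open their database connection. As a rough illustration only (not the repository's own code; the table name below is an assumption), the connection could be opened like this:

    import os
    from sqlalchemy import create_engine, text

    # Build the engine from the JUP_DB_CONNECTION variable set above.
    engine = create_engine(os.environ["JUP_DB_CONNECTION"])
    with engine.connect() as conn:
        # Count the rows of one of the restored tables ("notebooks" is assumed here).
        print(conn.execute(text("SELECT count(*) FROM notebooks")).scalar())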

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies listed in requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    To reproduce the analyses, run Jupyter in this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    Github account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # whether to execute the Python notebooks
    export JUP_WITH_DEPENDENCY="0"; # whether to run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the notebook extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should umount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install 5 conda environments and 5 Anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

  6. lotsa_data

    • huggingface.co
    Updated Jul 21, 2025
    + more versions
    Cite
    Salesforce (2025). lotsa_data [Dataset]. https://huggingface.co/datasets/Salesforce/lotsa_data
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 21, 2025
    Dataset provided by
    Salesforce Inc (http://salesforce.com/)
    Authors
    Salesforce
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    LOTSA Data

    The Large-scale Open Time Series Archive (LOTSA) is a collection of open time series datasets for time series forecasting. It was collected for the purpose of pre-training Large Time Series Models. See the paper and codebase for more information.
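
    The archive is distributed as a Hugging Face dataset with one configuration per sub-dataset. A minimal loading sketch, assuming the `datasets` library and an illustrative configuration name, could look like this:

    from datasets import load_dataset

    # "traffic_hourly" is an illustrative subset name; see the dataset page for
    # the actual configurations available in LOTSA.
    ds = load_dataset("Salesforce/lotsa_data", "traffic_hourly", split="train")
    print(ds)
    # Each record is one time series; "target" is assumed here to hold the values.
    print(ds[0]["target"][:10])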

      Citation
    

    If you're using LOTSA data in your research or applications, please cite it using this BibTeX: @article{woo2024unified, title={Unified Training of Universal Time Series Forecasting Transformers}… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/lotsa_data.

  7. HPC-ODA Dataset Collection

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Apr 9, 2021
    + more versions
    Cite
    Netti, Alessio (2021). HPC-ODA Dataset Collection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3701439
    Explore at:
    Dataset updated
    Apr 9, 2021
    Dataset authored and provided by
    Netti, Alessio
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    HPC-ODA is a collection of datasets acquired on production HPC systems, which are representative of several real-world use cases in the field of Operational Data Analytics (ODA) for the improvement of reliability and energy efficiency. The datasets are composed of monitoring sensor data, acquired from the components of different HPC systems depending on the specific use case. Two tools, whose overhead is proven to be very light, were used to acquire data in HPC-ODA: these are the DCDB and LDMS monitoring frameworks.

    The aim of HPC-ODA is to provide several vertical slices (here named segments) of the monitoring data available in a large-scale HPC installation. The segments all have different granularities, in terms of data sources and time scale, and provide several use cases on which models and approaches to data processing can be evaluated. While having a production dataset from a whole HPC system - from the infrastructure down to the CPU core level - at a fine time granularity would be ideal, this is often not feasible due to the confidentiality of the data, as well as the sheer amount of storage space required. HPC-ODA includes 6 different segments:

    Power Consumption Prediction: a fine-granularity dataset that was collected from a single compute node in a HPC system. It contains both node-level data as well as per-CPU core metrics, and can be used to perform regression tasks such as power consumption prediction.

    Fault Detection: a medium-granularity dataset that was collected from a single compute node while it was subjected to fault injection. It contains only node-level data, as well as the labels for both the applications and faults being executed on the HPC node in time. This dataset can be used to perform fault classification.

    Application Classification: a medium-granularity dataset that was collected from 16 compute nodes in a HPC system while running different parallel MPI applications. Data is at the compute node level, separated for each of them, and is paired with the labels of the applications being executed. This dataset can be used for tasks such as application classification.

    Infrastructure Management: a coarse-granularity dataset containing cluster-wide data from a HPC system, about its warm water cooling system as well as power consumption. The data is at the rack level, and can be used for regression tasks such as outlet water temperature or removed heat prediction.

    Cross-architecture: a medium-granularity dataset that is a variant of the Application Classification one, and shares the same ODA use case. Here, however, single-node configurations of the applications were executed on three different compute node types with different CPU architectures. This dataset can be used to perform cross-architecture application classification, or performance comparison studies.

    DEEP-EST Dataset: this medium-granularity dataset was collected on the modular DEEP-EST HPC system and consists of three parts. These were collected on 16 compute nodes each, while running several MPI applications under different warm-water cooling configurations. This dataset can be used for CPU and GPU temperature prediction, or for thermal characterization.

    The HPC-ODA dataset collection includes a readme document containing all necessary usage information, as well as a lightweight Python framework to carry out the ODA tasks described for each dataset.
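
    As a rough sketch of the power consumption prediction use case, assuming the segment's sensor readings have been exported to a CSV with numeric columns and a power_consumption target (the actual layout is documented in the bundled readme and Python framework):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Hypothetical export of the fine-granularity power consumption segment.
    df = pd.read_csv("power_consumption_segment.csv")
    X = df.drop(columns=["power_consumption"])
    y = df["power_consumption"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
    print("held-out R^2:", model.score(X_test, y_test))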

  8. Index To The BGS Collection Of Large Scale Mine Plans & Land Survey Plans.

    • cloud.csiss.gmu.edu
    • hosted-metadata.bgs.ac.uk
    • +4more
    html
    Updated Dec 18, 2019
    Cite
    United Kingdom (2019). Index To The BGS Collection Of Large Scale Mine Plans & Land Survey Plans. [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/index-to-the-bgs-collection-of-large-scale-mine-plans-land-survey-plans
    Explore at:
    Available download formats: html
    Dataset updated
    Dec 18, 2019
    Dataset provided by
    United Kingdom
    Description

    Index to the BGS collection of large scale or large format plans of all types including those relating to mining activity, including abandonment plans and site investigations. The Plans Database Index was set up c.1983 as a digital index to the collections of Land Survey Plans and Plans of Abandoned Mines. There are entries for all registered plans but not all the index fields are complete, as this depends on the nature of the original plan. The index covers the whole of Great Britain.

  9. Software for solving large-scale generalized eigenvalue problems on...

    • catalog.data.gov
    • cloud.csiss.gmu.edu
    • +1more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Software for solving large-scale generalized eigenvalue problems on distributed computers. [Dataset]. https://catalog.data.gov/dataset/software-for-solving-large-scale-generalized-eigenvalue-problems-on-distributed-computers-34f79
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Open source software for solving large-scale generalized eigenvalue problems on distributed computers. Suitable for large (80,000 by 80,000 or greater) dense matrices. Written in Fortran90+. Includes a test program and sample output. In particular, this distribution, MPI_GEVP_Package.tar.gz, consists of documentation (HowTo_MPI_GEVP_inviter.pdf), a collection of output files, and the software distribution itself. To use the software, download MPI_GEVP_package.tar.gz, unwrap it, and follow the instructions in the HowTo to compile the solver and another program for generating test matrix elements. Then run various tests and compare the results with the output found in the various Output files.

  10. Data from: A large-scale COVID-19 Twitter chatter dataset for open...

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Aug 9, 2020
    Cite
    Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Katya Artemova; Elena Tutubalina; Gerardo Chowell (2020). A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration [Dataset]. http://doi.org/10.5281/zenodo.3977558
    Explore at:
    Dataset updated
    Aug 9, 2020
    Authors
    Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Katya Artemova; Elena Tutubalina; Gerardo Chowell
    Description

    In version 22 of the dataset, we have refactored the full_dataset.tsv and full_dataset_clean.tsv files (since version 20) to include two additional columns: language and place country code (when available). This change now includes language and country code for ALL the tweets in the dataset, not only clean tweets. With this change we have removed the clean_place_country.tar.gz and clean_languages.tar.gz files. With our refactoring of the dataset-generating code we also found a small bug that caused some retweets not to be counted properly, hence the extra increase in available tweets.

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started from March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th, to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (602,921,788 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (142,360,288 unique tweets). There are several practical reasons for us to leave the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files.

    For more statistics and some visualizations visit: http://www.panacealab.org/covid19/ More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, and in our pre-print about the dataset (https://arxiv.org/abs/2004.03688).

    As always, the tweets distributed here are only tweet identifiers (with date and time added), due to the terms and conditions of Twitter, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used. This dataset will be updated at least bi-weekly with additional tweets; look at the GitHub repo for these updates.

    Release: We have standardized the name of the resource to match our pre-print manuscript and to not have to update it every week.
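
    Since the files distribute identifiers only, any analysis starts from the TSV and a subsequent hydration step. A minimal reading sketch, with column names assumed from the description above (tweet id, language, country code), might look like this:

    import pandas as pd

    # Stream the large TSV in chunks; "lang" and "tweet_id" are assumed column
    # names based on the description (language is included for all tweets since
    # version 20).
    for chunk in pd.read_csv("full_dataset_clean.tsv", sep="\t", chunksize=1_000_000):
        english_ids = chunk.loc[chunk["lang"] == "en", "tweet_id"]
        print(len(english_ids), "English tweet ids in this chunk")
        break  # illustrative: only the first chunk; hydrate the ids before analysis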

  11. THINGS-data: Behavioral odd-one-out data and code

    • plus.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Martin Hebart; Oliver Contier; Lina Teichmann; Adam Rockter; Charles Zheng; Alexis Kidder; Anna Corriveau; Maryam Vaziri-Pashkam; Chris Baker (2023). THINGS-data: Behavioral odd-one-out data and code [Dataset]. http://doi.org/10.25452/figshare.plus.20552784.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare+
    Authors
    Martin Hebart; Oliver Contier; Lina Teichmann; Adam Rockter; Charles Zheng; Alexis Kidder; Anna Corriveau; Maryam Vaziri-Pashkam; Chris Baker
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    4.7 million object odd-one-out judgements from human participants on Amazon Mechanical Turk.

    Part of THINGS-data: A multimodal collection of large-scale datasets for investigating object representations in brain and behavior.

    See related materials in Collection at: https://doi.org/10.25452/figshare.plus.c.6161151

  12. Data from: BuildingsBench: A Large-Scale Dataset of 900K Buildings and...

    • catalog.data.gov
    Updated Jan 11, 2024
    + more versions
    Cite
    National Renewable Energy Laboratory (2024). BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting [Dataset]. https://catalog.data.gov/dataset/buildingsbench-a-large-scale-dataset-of-900k-buildings-and-benchmark-for-short-term-load-f
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset provided by
    National Renewable Energy Laboratory
    Description

    The BuildingsBench datasets consist of:

    • Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
    • 7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.

    Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see the link to this database further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).

    BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1 GB, and they are listed below:

    • ElectricityLoadDiagrams20112014
    • Building Data Genome Project-2
    • Individual household electric power consumption (Sceaux)
    • Borealis
    • SMART
    • IDEAL
    • Low Carbon London

    A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
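
    Since Buildings-900K is distributed as Parquet files, a single building's time series can be inspected directly with pandas; the path below is illustrative, and the per-version README documents the actual layout:

    import pandas as pd

    # Illustrative path; the README in each data lake version documents the
    # actual Parquet organization (which follows the EULP closely).
    df = pd.read_parquet("buildings-900k/building_timeseries.parquet")
    print(df.head())    # timestamped energy consumption for one building
    print(df.dtypes)    # inspect the schema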

  13. TOGA COARE Large Scale Atmospheric Data

    • data.ucar.edu
    • rda-web-prod.ucar.edu
    • +3more
    binary
    Updated Aug 4, 2024
    Cite
    TOGA COARE Data Information System, NESDIS, NOAA, U.S. Department of Commerce (2024). TOGA COARE Large Scale Atmospheric Data [Dataset]. http://doi.org/10.5065/QM9W-XP57
    Explore at:
    Available download formats: binary
    Dataset updated
    Aug 4, 2024
    Dataset provided by
    Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory
    Authors
    TOGA COARE Data Information System, NESDIS, NOAA, U.S. Department of Commerce
    Time period covered
    Feb 15, 1993
    Description

    This dataset contains synoptic surface and upper air data and satellite images obtained during TOGA COARE. For other TOGA COARE data archives, see the UCAR/EOL TOGA COARE Project Page, which contains a link to other archives.

  14. Data from: A large-scale study on the effects of sex on gray matter...

    • neurovault.org
    zip
    Updated Mar 27, 2025
    Cite
    (2025). A large-scale study on the effects of sex on gray matter asymmetry [Dataset]. http://identifiers.org/neurovault.collection:2825
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 27, 2025
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A collection of 5 brain maps. Each brain map is a 3D array of values representing properties of the brain at different locations.

    Collection description

    Statistical maps presented in the manuscript "A large-scale study on the effects of sex on gray matter asymmetry", published in Brain Structure and Function.
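
    NeuroVault maps are typically distributed as NIfTI volumes; a minimal loading sketch with nibabel (file name illustrative) could be:

    import nibabel as nib

    # Illustrative file name; each downloaded map is a 3D statistical volume.
    img = nib.load("sex_asymmetry_map.nii.gz")
    data = img.get_fdata()
    print(data.shape, data.dtype)   # voxel grid dimensions and value type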

  15. Data from: A Review of International Large-Scale Assessments in Education...

    • datasets.ai
    • catalog.data.gov
    33
    Updated Nov 14, 2015
    Cite
    Department of State (2015). A Review of International Large-Scale Assessments in Education Assessing Component Skills and Collecting Contextual Data [Dataset]. https://datasets.ai/datasets/a-review-of-international-large-scale-assessments-in-education-assessing-component-skills-
    Explore at:
    Available download formats: 33
    Dataset updated
    Nov 14, 2015
    Dataset authored and provided by
    Department of State
    Description

    The OECD has initiated PISA for Development (PISA-D) in response to the rising need of developing countries to collect data about their education systems and the capacity of their student bodies. This report aims to compare and contrast approaches regarding the instruments that are used to collect data on (a) component skills and cognitive instruments, (b) contextual frameworks, and (c) the implementation of the different international assessments, as well as approaches to include children who are not at school, and the ways in which data are used. It then seeks to identify assessment practices in these three areas that will be useful for developing countries. This report reviews the major international and regional large-scale educational assessments: large-scale international surveys, school-based surveys and household-based surveys. For each of the issues discussed, there is a description of the prevailing international situation, followed by a consideration of the issue for developing countries and then a description of the relevance of the issue to PISA for Development.

  16. Mass Data Migration Service Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 12, 2025
    Cite
    Archive Market Research (2025). Mass Data Migration Service Report [Dataset]. https://www.archivemarketresearch.com/reports/mass-data-migration-service-56309
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Mar 12, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Mass Data Migration Service market is experiencing robust growth, driven by the increasing volume of data generated across various industries and the rising need for efficient data management solutions. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 18% from 2025 to 2033, reaching an estimated value of $50 billion by 2033. This significant expansion is fueled by several key factors. Firstly, the proliferation of cloud computing and the associated need to migrate legacy on-premise systems to cloud environments is a major catalyst. Secondly, the growing adoption of data analytics and business intelligence initiatives necessitates efficient and reliable data migration capabilities. Thirdly, stringent data privacy regulations and compliance requirements are pushing organizations to adopt robust data migration solutions for better control and security. Finally, the rising demand for data-driven decision making across diverse sectors like healthcare, finance, and manufacturing is further bolstering market growth.

    Segment-wise, the cloud-based Mass Data Migration Service is expected to dominate the market due to its scalability, cost-effectiveness, and enhanced security features. Among application segments, healthcare & life sciences, manufacturing, and BFSI are leading the adoption, reflecting their substantial data volumes and the critical need for secure and efficient data handling. Geographically, North America and Europe currently hold significant market share, but the Asia-Pacific region is anticipated to experience substantial growth driven by increasing digitalization and investment in technological infrastructure.

    However, challenges such as data security concerns, integration complexities, and the lack of skilled professionals capable of handling large-scale data migrations represent potential restraints to market growth. Despite these challenges, the overall outlook for the Mass Data Migration Service market remains highly positive, promising substantial growth and opportunities for market players in the coming years.

  17. Data from: A Large-Scale Dataset of Twitter Chatter about Online Learning...

    • ieee-dataport.org
    Updated Aug 10, 2022
    + more versions
    Cite
    Nirmalya Thakur (2022). A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave [Dataset]. https://ieee-dataport.org/documents/large-scale-dataset-twitter-chatter-about-online-learning-during-current-covid-19-omicron
    Explore at:
    Dataset updated
    Aug 10, 2022
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    no. 8

  18. New Visions for Large Scale Networks: Research and Applications

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated May 14, 2025
    + more versions
    Cite
    NCO NITRD (2025). New Visions for Large Scale Networks: Research and Applications [Dataset]. https://catalog.data.gov/dataset/new-visions-for-large-scale-networks-research-and-applications
    Explore at:
    Dataset updated
    May 14, 2025
    Dataset provided by
    NCO NITRD
    Description

    This paper documents the findings of the March 12-14, 2001 Workshop on New Visions for Large-Scale Networks: Research and Applications. The workshop's objectives were to develop a vision for the future of networking 10 to 20 years out and to identify needed Federal networking research to enable that vision...

  19. Community Detection to Split Large-scale Assemblies in Subassemblies

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 19, 2023
    Cite
    Münker, Sören (2023). Community Detection to Split Large-scale Assemblies in Subassemblies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8260584
    Explore at:
    Dataset updated
    Aug 19, 2023
    Dataset authored and provided by
    Münker, Sören
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is motivated by the need to preprocess large-scale CAD models for assembly-by-disassembly approaches. Assembly-by-disassembly is only suitable for assemblies with a small number of parts (n_parts < 22). However, when dealing with large-scale products with high complexity, the CAD models may not contain feasible subassemblies (e.g. with connected and interference-free parts) and have too many parts to be processed with assembly-by-disassembly. Product designers' preferences during the design phase might not be ideal for assembly-by-disassembly processing because they do not explicitly consider subassembly feasibility and the number of parts per subassembly. An automated preprocessing approach is proposed to address this issue by splitting the model into manageable partitions using community detection. This allows for parallelised, efficient and accurate assembly-by-disassembly of large-scale CAD models. However, applying community detection methods to automatically split CAD models into smaller subassemblies is a new concept, and research on its suitability for assembly sequence planning (ASP) needs to be conducted. Therefore, the following underlying research question will be answered in these experiments:

    Underlying research question 2: Can automated preprocessing increase the suitability of CAD-based assembly-by-disassembly for large-scale products?

    A hypothesis is formulated to answer this research question, which will be utilised to design experiments for hypothesis testing.

    Hypothesis 2: Community detection algorithms can be applied to automatically split large-scale assemblies into suitable candidates for CAD-based AND/OR graph generation.
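
    As an illustration of the general idea (not the authors' implementation), a part-connection graph can be split into candidate subassemblies with modularity-based community detection, for example using networkx:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Toy part-connection graph: nodes are parts, edges are contacts or joints
    # extracted from the CAD model.
    assembly = nx.Graph()
    assembly.add_edges_from([
        ("base", "bracket"), ("bracket", "bolt_1"), ("bracket", "bolt_2"),
        ("base", "housing"), ("housing", "cover"), ("cover", "screw_1"),
    ])

    for i, parts in enumerate(greedy_modularity_communities(assembly)):
        print(f"candidate subassembly {i}: {sorted(parts)}")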

  20. Accompanying data - Papyrus - A large scale curated dataset aimed at...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 8, 2024
    + more versions
    Cite
    Jespers, Willem (2024). Accompanying data - Papyrus - A large scale curated dataset aimed at bioactivity predictions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7019873
    Explore at:
    Dataset updated
    Apr 8, 2024
    Dataset provided by
    Jespers, Willem
    IJzerman, Ad P.
    Béquignon, Olivier J. M.
    Bongers, Brandon J.
    van Westen, Gerard J. P.
    van de Water, Bob
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Addition of supporting files:

    • LICENSE.txt
    • data_types.json
    • data_size.json

    Fixed version of Papyrus++ 05.5:

    • In the previous 05.5 version, data was incorrectly duplicated based on assay type. This resulted in unintended data augmentation.
    • In this fixed 05.5 version, the duplicates have been eliminated, now reporting the correct amount of data per assay type.

    This repository contains the version 05.5 of the Papyrus dataset, an aggregated dataset of small molecule bioactivities, as described in the article "Papyrus - A large scale curated dataset aimed at bioactivity predictions" http://doi.org/10.1186/s13321-022-00672-x.

    With the ongoing rapid growth of publicly available ligand-protein bioactivity data, there is a trove of valuable data that can be used to train a plethora of machine learning algorithms. However, not all data is equal in terms of size and quality, and a significant portion of researchers' time is needed to adapt the data to their needs. On top of that, finding the right data for a research question can often be a challenge on its own.

    To meet these challenges we have constructed the Papyrus dataset. Papyrus comprises around 60 million datapoints. This dataset contains multiple large publicly available datasets such as ChEMBL and ExCAPE-DB combined with several smaller datasets containing high-quality data. The aggregated data has been standardised and normalised in a manner that is suitable for machine learning. We show how data can be filtered in a variety of ways and also perform some example quantitative structure-activity relationship analyses and proteochemometric modelling. Our ambition is that this pruned data collection constitutes a benchmark set that can be used for constructing predictive models, while also providing a solid baseline for related research.
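
    As a rough sketch of the kind of filtering described above, assuming the bioactivity table has been saved as a tab-separated file and using assumed column names (the actual names are documented with the dataset):

    import pandas as pd

    # Assumed file and column names; consult the Papyrus documentation for the
    # actual schema of the bioactivity table.
    papyrus = pd.read_csv("papyrus_bioactivities.tsv", sep="\t", low_memory=False)

    # Keep activities for one protein target above an activity threshold.
    subset = papyrus[
        (papyrus["accession"] == "P29274")        # illustrative UniProt accession
        & (papyrus["pchembl_value_Mean"] >= 6.5)  # illustrative threshold
    ]
    print(len(subset), "datapoints retained for modelling")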
