100+ datasets found
  1. A Large Scale Fish Dataset

    • gts.ai
    json
    Updated Mar 20, 2024
    Cite
    GTS (2024). A Large Scale Fish Dataset [Dataset]. https://gts.ai/dataset-download/a-large-scale-fish-dataset/
    Explore at:
    Available download formats: json
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset was collected in order to carry out segmentation, feature extraction, and classification tasks and to compare common segmentation, feature extraction, and classification methods.

  2. Data from: MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile...

    • zenodo.org
    • produccioncientifica.ucm.es
    pdf, zip
    Updated Jul 6, 2024
    Cite
    Oresti Banos; Miguel Damas; Carmen Goicoechea; Pandelis Perakakis; Hector Pomares; Ciro Rodriguez-Leon; Daniel Sanabria; Claudia Villalonga (2024). MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile Dataset for Investigating Individual and Collective Well-Being [Dataset]. http://doi.org/10.5281/zenodo.11060596
    Explore at:
    Available download formats: pdf, zip
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Oresti Banos; Miguel Damas; Carmen Goicoechea; Pandelis Perakakis; Hector Pomares; Ciro Rodriguez-Leon; Daniel Sanabria; Claudia Villalonga
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This study engaged 409 participants over a period spanning from July 10 to August 8, 2023, ensuring representation across various demographic factors: 221 females, 186 males, 2 non-binary, year of birth between 1951 and 2005, with varied annual incomes and from 15 Spanish regions. The MobileWell400+ dataset, openly accessible, encompasses a wide array of data collected via the participants' mobile phone, including demographic, emotional, social, behavioral, and well-being data. Methodologically, the project presents a promising avenue for uncovering new social, behavioral, and emotional indicators, supplementing existing literature. Notably, artificial intelligence is considered to be instrumental in analysing these data, discerning patterns, and forecasting trends, thereby advancing our comprehension of individual and population well-being. Ethical standards were upheld, with participants providing informed consent.

    The following is a non-exhaustive list of collected data:

    • Data continuously collected through the participants' smartphone sensors: physical activity (resting, walking, driving, cycling, etc.), name of detected WiFi networks, connectivity type (WiFi, mobile, none), ambient light, ambient noise, and status of the device screen (on, off, locked, unlocked).
    • Data corresponding to an initial survey prompted via the smartphone, with information related to demographic data, effects and COVID vaccination, average hours of physical activity, and answers to a series of questions to measure mental health, many of them taken from internationally recognised psychological and well-being scales (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception.
    • Data corresponding to daily surveys prompted via the smartphone, where variables related to mood (valence, activation, energy and emotional events) and social interaction (quantity and quality) are measured.
    • Data corresponding to weekly surveys prompted via the smartphone, where information on overall health, hours of physical activity per week, loneliness, and questions related to well-being are asked.
    • Data corresponding to a final survey prompted via the smartphone, consisting of similar questions to the ones asked in the initial survey, namely psychological and well-being items (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception questions.

    For a more detailed description of the study please refer to MobileWell400+StudyDescription.pdf.

    For a more detailed description of the collected data, variables and data files please refer to MobileWell400+FilesDescription.pdf.
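
    The exact file layout and variable names are documented in the PDFs above. Purely as an illustration, with a hypothetical daily-survey export and hypothetical column names, the mood variables could be aggregated per participant like this:

    import pandas as pd

    # Hypothetical file and column names; see MobileWell400+FilesDescription.pdf
    # for the actual files and variables.
    daily = pd.read_csv("daily_survey.csv", parse_dates=["date"])

    # Average self-reported mood per participant across the study period.
    mood = (
        daily.groupby("participant_id")[["valence", "activation", "energy"]]
        .mean()
        .rename(columns=lambda c: f"mean_{c}")
    )
    print(mood.head())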

  3. TREC 2022 Deep Learning test collection

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
    Explore at:
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

    Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. Lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large-scale datasets to TREC, and at creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

    Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

    The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
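
    TREC relevance judgments are distributed in the standard four-column qrels format (query id, iteration, document id, relevance). As a minimal illustration (the file name below is hypothetical), the judgments can be parsed as follows:

    from collections import defaultdict

    def load_qrels(path):
        """Parse a TREC qrels file into {query_id: {doc_id: relevance}}."""
        qrels = defaultdict(dict)
        with open(path) as fh:
            for line in fh:
                if not line.strip():
                    continue
                qid, _iteration, docid, rel = line.split()
                qrels[qid][docid] = int(rel)
        return qrels

    qrels = load_qrels("2022.qrels.pass.txt")  # hypothetical file name
    print(len(qrels), "judged queries")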

  4. Large Scale Topo Tower (Polygon) (LGATE-164) - Datasets - data.wa.gov.au

    • catalogue.data.wa.gov.au
    Updated Jul 10, 2019
    + more versions
    Cite
    (2019). Large Scale Topo Tower (Polygon) (LGATE-164) - Datasets - data.wa.gov.au [Dataset]. https://catalogue.data.wa.gov.au/dataset/large-scale-topo-tower-polygon-lgate-164
    Explore at:
    Dataset updated
    Jul 10, 2019
    Area covered
    Western Australia
    Description

    A tall framework or structure, the elevation of which is functional. Multiple points that describe the feature’s perimeter. NOTE: Landgate no longer maintains large scale topographic features. The large scale topographic data capture programme ceased in 2016. Please consider carefully the suitability of the data within this service for your purpose. © Western Australian Land Information Authority (Landgate). Use of Landgate data is subject to Personal Use License terms and conditions unless otherwise authorised under approved License terms and conditions.

  5. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    • explore.openaire.eu
    bz2
    Updated Mar 15, 2021
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Explore at:
    Available download formats: bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: initially empty; the notebook analyses/N12.To.Paper.ipynb moves data into it.

    In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
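
    The analysis notebooks use this variable to open their database connection. As a rough illustration only (not the repository's own code; the table name below is an assumption), the connection could be opened like this:

    import os
    from sqlalchemy import create_engine, text

    # Build the engine from the JUP_DB_CONNECTION variable set above.
    engine = create_engine(os.environ["JUP_DB_CONNECTION"])
    with engine.connect() as conn:
        # Count the rows of one of the restored tables ("notebooks" is assumed here).
        print(conn.execute(text("SELECT count(*) FROM notebooks")).scalar())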

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies listed in requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    To reproduce the analyses, run Jupyter in this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    Github account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # whether to execute the Python notebooks
    export JUP_WITH_DEPENDENCY="0"; # whether to run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the notebook extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should umount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install 5 conda environments and 5 Anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

  6. lotsa_data

    • huggingface.co
    Updated Jul 21, 2025
    + more versions
    Cite
    Salesforce (2025). lotsa_data [Dataset]. https://huggingface.co/datasets/Salesforce/lotsa_data
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 21, 2025
    Dataset provided by
    Salesforce Inc (http://salesforce.com/)
    Authors
    Salesforce
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    LOTSA Data

    The Large-scale Open Time Series Archive (LOTSA) is a collection of open time series datasets for time series forecasting. It was collected for the purpose of pre-training Large Time Series Models. See the paper and codebase for more information.
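
    The archive is distributed as a Hugging Face dataset with one configuration per sub-dataset. A minimal loading sketch, assuming the `datasets` library and an illustrative configuration name, could look like this:

    from datasets import load_dataset

    # "traffic_hourly" is an illustrative subset name; see the dataset page for
    # the actual configurations available in LOTSA.
    ds = load_dataset("Salesforce/lotsa_data", "traffic_hourly", split="train")
    print(ds)
    # Each record is one time series; "target" is assumed here to hold the values.
    print(ds[0]["target"][:10])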

      Citation
    

    If you're using LOTSA data in your research or applications, please cite it using this BibTeX: @article{woo2024unified, title={Unified Training of Universal Time Series Forecasting Transformers}… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/lotsa_data.

  7. HPC-ODA Dataset Collection

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Apr 9, 2021
    + more versions
    Cite
    Netti, Alessio (2021). HPC-ODA Dataset Collection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3701439
    Explore at:
    Dataset updated
    Apr 9, 2021
    Dataset authored and provided by
    Netti, Alessio
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    HPC-ODA is a collection of datasets acquired on production HPC systems, which are representative of several real-world use cases in the field of Operational Data Analytics (ODA) for the improvement of reliability and energy efficiency. The datasets are composed of monitoring sensor data, acquired from the components of different HPC systems depending on the specific use case. Two tools, whose overhead is proven to be very light, were used to acquire data in HPC-ODA: these are the DCDB and LDMS monitoring frameworks.

    The aim of HPC-ODA is to provide several vertical slices (here named segments) of the monitoring data available in a large-scale HPC installation. The segments all have different granularities, in terms of data sources and time scale, and provide several use cases on which models and approaches to data processing can be evaluated. While having a production dataset from a whole HPC system - from the infrastructure down to the CPU core level - at a fine time granularity would be ideal, this is often not feasible due to the confidentiality of the data, as well as the sheer amount of storage space required. HPC-ODA includes 6 different segments:

    Power Consumption Prediction: a fine-granularity dataset that was collected from a single compute node in a HPC system. It contains both node-level data as well as per-CPU core metrics, and can be used to perform regression tasks such as power consumption prediction.

    Fault Detection: a medium-granularity dataset that was collected from a single compute node while it was subjected to fault injection. It contains only node-level data, as well as the labels for both the applications and faults being executed on the HPC node in time. This dataset can be used to perform fault classification.

    Application Classification: a medium-granularity dataset that was collected from 16 compute nodes in a HPC system while running different parallel MPI applications. Data is at the compute node level, separated for each of them, and is paired with the labels of the applications being executed. This dataset can be used for tasks such as application classification.

    Infrastructure Management: a coarse-granularity dataset containing cluster-wide data from a HPC system, about its warm water cooling system as well as power consumption. The data is at the rack level, and can be used for regression tasks such as outlet water temperature or removed heat prediction.

    Cross-architecture: a medium-granularity dataset that is a variant of the Application Classification one, and shares the same ODA use case. Here, however, single-node configurations of the applications were executed on three different compute node types with different CPU architectures. This dataset can be used to perform cross-architecture application classification, or performance comparison studies.

    DEEP-EST Dataset: this medium-granularity dataset was collected on the modular DEEP-EST HPC system and consists of three parts. These were collected on 16 compute nodes each, while running several MPI applications under different warm-water cooling configurations. This dataset can be used for CPU and GPU temperature prediction, or for thermal characterization.

    The HPC-ODA dataset collection includes a readme document containing all necessary usage information, as well as a lightweight Python framework to carry out the ODA tasks described for each dataset.
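
    As a rough sketch of the power consumption prediction use case, assuming the segment's sensor readings have been exported to a CSV with numeric columns and a power_consumption target (the actual layout is documented in the bundled readme and Python framework):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Hypothetical export of the fine-granularity power consumption segment.
    df = pd.read_csv("power_consumption_segment.csv")
    X = df.drop(columns=["power_consumption"])
    y = df["power_consumption"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
    print("held-out R^2:", model.score(X_test, y_test))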

  8. Index To The BGS Collection Of Large Scale Mine Plans & Land Survey Plans.

    • cloud.csiss.gmu.edu
    • hosted-metadata.bgs.ac.uk
    • +4more
    html
    Updated Dec 18, 2019
    Cite
    United Kingdom (2019). Index To The BGS Collection Of Large Scale Mine Plans & Land Survey Plans. [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/index-to-the-bgs-collection-of-large-scale-mine-plans-land-survey-plans
    Explore at:
    Available download formats: html
    Dataset updated
    Dec 18, 2019
    Dataset provided by
    United Kingdom
    Description

    Index to the BGS collection of large scale or large format plans of all types including those relating to mining activity, including abandonment plans and site investigations. The Plans Database Index was set up c.1983 as a digital index to the collections of Land Survey Plans and Plans of Abandoned Mines. There are entries for all registered plans but not all the index fields are complete, as this depends on the nature of the original plan. The index covers the whole of Great Britain.

  9. Software for solving large-scale generalized eigenvalue problems on...

    • catalog.data.gov
    • cloud.csiss.gmu.edu
    • +1more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). Software for solving large-scale generalized eigenvalue problems on distributed computers. [Dataset]. https://catalog.data.gov/dataset/software-for-solving-large-scale-generalized-eigenvalue-problems-on-distributed-computers-34f79
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Open source software for solving large-scale generalized eigenvalue problems on distributed computers. Suitable for large (80,000 by 80,000 or greater) dense matrices. Written in Fortran90+. Includes a test program and sample output. In particular, this distribution, MPI_GEVP_Package.tar.gz, consists of documentation (HowTo_MPI_GEVP_inviter.pdf), a collection of output files, and the software distribution itself. To use the software, download MPI_GEVP_package.tar.gz, unwrap it, and follow the instructions in the HowTo to compile the solver and another program for generating test matrix elements. Then run various tests and compare the results with the output found in the various Output files.

  10. Data from: A large-scale COVID-19 Twitter chatter dataset for open...

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Aug 9, 2020
    Cite
    Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Katya Artemova; Elena Tutubalina; Gerardo Chowell (2020). A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration [Dataset]. http://doi.org/10.5281/zenodo.3977558
    Explore at:
    Dataset updated
    Aug 9, 2020
    Authors
    Juan M. Banda; Ramya Tekumalla; Guanyu Wang; Jingyuan Yu; Tuo Liu; Yuning Ding; Katya Artemova; Elena Tutubalina; Gerardo Chowell
    Description

    In version 22 of the dataset, we have refactored the full_dataset.tsv and full_dataset_clean.tsv files (since version 20) to include two additional columns: language and place country code (when available). This change now includes language and country code for ALL the tweets in the dataset, not only clean tweets. With this change we have removed the clean_place_country.tar.gz and clean_languages.tar.gz files. With our refactoring of the dataset-generating code we also found a small bug that caused some retweets not to be counted properly, hence the extra increase in available tweets.

    Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started from March 11th, yielding over 4 million tweets a day. We have added additional data provided by our new collaborators from January 27th to March 27th, to provide extra longitudinal coverage. Version 10 added ~1.5 million tweets in the Russian language collected between January 1st and May 8th, graciously provided to us by Katya Artemova (NRU HSE) and Elena Tutubalina (KFU). From version 12 we have included daily hashtags, mentions and emojis and their frequencies in the respective zip files. From version 14 we have included the tweet identifiers and their respective language for the clean version of the dataset. Since version 20 we have included language and place location for all tweets.

    The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (602,921,788 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (142,360,288 unique tweets). There are several practical reasons for us to leave the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general statistics per day are included for both datasets in the full_dataset-statistics.tsv and full_dataset-clean-statistics.tsv files.

    For more statistics and some visualizations visit: http://www.panacealab.org/covid19/ More details can be found (and will be updated faster) at https://github.com/thepanacealab/covid19_twitter, and in our pre-print about the dataset (https://arxiv.org/abs/2004.03688).

    As always, the tweets distributed here are only tweet identifiers (with date and time added), due to the terms and conditions of Twitter, which allow re-distribution of Twitter data ONLY for research purposes. They need to be hydrated to be used. This dataset will be updated at least bi-weekly with additional tweets; look at the GitHub repo for these updates.

    Release: We have standardized the name of the resource to match our pre-print manuscript and to not have to update it every week.
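
    Since the files distribute identifiers only, any analysis starts from the TSV and a subsequent hydration step. A minimal reading sketch, with column names assumed from the description above (tweet id, language, country code), might look like this:

    import pandas as pd

    # Stream the large TSV in chunks; "lang" and "tweet_id" are assumed column
    # names based on the description (language is included for all tweets since
    # version 20).
    for chunk in pd.read_csv("full_dataset_clean.tsv", sep="\t", chunksize=1_000_000):
        english_ids = chunk.loc[chunk["lang"] == "en", "tweet_id"]
        print(len(english_ids), "English tweet ids in this chunk")
        break  # illustrative: only the first chunk; hydrate the ids before analysis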

  11. THINGS-data: Behavioral odd-one-out data and code

    • plus.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Martin Hebart; Oliver Contier; Lina Teichmann; Adam Rockter; Charles Zheng; Alexis Kidder; Anna Corriveau; Maryam Vaziri-Pashkam; Chris Baker (2023). THINGS-data: Behavioral odd-one-out data and code [Dataset]. http://doi.org/10.25452/figshare.plus.20552784.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare+
    Authors
    Martin Hebart; Oliver Contier; Lina Teichmann; Adam Rockter; Charles Zheng; Alexis Kidder; Anna Corriveau; Maryam Vaziri-Pashkam; Chris Baker
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    4.7 million object odd-one-out judgements from human participants on Amazon Mechanical Turk.

    Part of THINGS-data: A multimodal collection of large-scale datasets for investigating object representations in brain and behavior.

    See related materials in Collection at: https://doi.org/10.25452/figshare.plus.c.6161151

  12. Data from: BuildingsBench: A Large-Scale Dataset of 900K Buildings and...

    • catalog.data.gov
    Updated Jan 11, 2024
    + more versions
    Cite
    National Renewable Energy Laboratory (2024). BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting [Dataset]. https://catalog.data.gov/dataset/buildingsbench-a-large-scale-dataset-of-900k-buildings-and-benchmark-for-short-term-load-f
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset provided by
    National Renewable Energy Laboratory
    Description

    The BuildingsBench datasets consist of:

    • Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
    • 7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.

    Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see the link to this database further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).

    BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1 GB, and they are listed below:

    • ElectricityLoadDiagrams20112014
    • Building Data Genome Project-2
    • Individual household electric power consumption (Sceaux)
    • Borealis
    • SMART
    • IDEAL
    • Low Carbon London

    A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
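
    Since Buildings-900K is distributed as Parquet files, a single building's time series can be inspected directly with pandas; the path below is illustrative, and the per-version README documents the actual layout:

    import pandas as pd

    # Illustrative path; the README in each data lake version documents the
    # actual Parquet organization (which follows the EULP closely).
    df = pd.read_parquet("buildings-900k/building_timeseries.parquet")
    print(df.head())    # timestamped energy consumption for one building
    print(df.dtypes)    # inspect the schema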

  13. TOGA COARE Large Scale Atmospheric Data

    • data.ucar.edu
    • rda-web-prod.ucar.edu
    • +3more
    binary
    Updated Aug 4, 2024
    Cite
    TOGA COARE Data Information System, NESDIS, NOAA, U.S. Department of Commerce (2024). TOGA COARE Large Scale Atmospheric Data [Dataset]. http://doi.org/10.5065/QM9W-XP57
    Explore at:
    Available download formats: binary
    Dataset updated
    Aug 4, 2024
    Dataset provided by
    Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory
    Authors
    TOGA COARE Data Information System, NESDIS, NOAA, U.S. Department of Commerce
    Time period covered
    Feb 15, 1993
    Description

    This dataset contains synoptic surface and upper air data and satellite images obtained during TOGA COARE. For other TOGA COARE data archives, see the UCAR/EOL TOGA COARE Project Page, which contains a link to other archives.

  14. Data from: A large-scale study on the effects of sex on gray matter...

    • neurovault.org
    zip
    Updated Mar 27, 2025
    Cite
    (2025). A large-scale study on the effects of sex on gray matter asymmetry [Dataset]. http://identifiers.org/neurovault.collection:2825
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 27, 2025
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A collection of 5 brain maps. Each brain map is a 3D array of values representing properties of the brain at different locations.

    Collection description

    Statistical maps presented in the manuscript "A large-scale study on the effects of sex on gray matter asymmetry", published in Brain Structure and Function.
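
    NeuroVault maps are typically distributed as NIfTI volumes; a minimal loading sketch with nibabel (file name illustrative) could be:

    import nibabel as nib

    # Illustrative file name; each downloaded map is a 3D statistical volume.
    img = nib.load("sex_asymmetry_map.nii.gz")
    data = img.get_fdata()
    print(data.shape, data.dtype)   # voxel grid dimensions and value type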

  15. Data from: A Review of International Large-Scale Assessments in Education...

    • datasets.ai
    • catalog.data.gov
    33
    Updated Nov 14, 2015
    Cite
    Department of State (2015). A Review of International Large-Scale Assessments in Education Assessing Component Skills and Collecting Contextual Data [Dataset]. https://datasets.ai/datasets/a-review-of-international-large-scale-assessments-in-education-assessing-component-skills-
    Explore at:
    Available download formats: 33
    Dataset updated
    Nov 14, 2015
    Dataset authored and provided by
    Department of State
    Description

    The OECD has initiated PISA for Development (PISA-D) in response to the rising need of developing countries to collect data about their education systems and the capacity of their student bodies. This report aims to compare and contrast approaches regarding the instruments that are used to collect data on (a) component skills and cognitive instruments, (b) contextual frameworks, and (c) the implementation of the different international assessments, as well as approaches to include children who are not at school, and the ways in which data are used. It then seeks to identify assessment practices in these three areas that will be useful for developing countries. This report reviews the major international and regional large-scale educational assessments: large-scale international surveys, school-based surveys and household-based surveys. For each of the issues discussed, there is a description of the prevailing international situation, followed by a consideration of the issue for developing countries and then a description of the relevance of the issue to PISA for Development.

  16. Mass Data Migration Service Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 12, 2025
    Cite
    Archive Market Research (2025). Mass Data Migration Service Report [Dataset]. https://www.archivemarketresearch.com/reports/mass-data-migration-service-56309
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Mar 12, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Mass Data Migration Service market is experiencing robust growth, driven by the increasing volume of data generated across various industries and the rising need for efficient data management solutions. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 18% from 2025 to 2033, reaching an estimated value of $50 billion by 2033. This significant expansion is fueled by several key factors. Firstly, the proliferation of cloud computing and the associated need to migrate legacy on-premise systems to cloud environments is a major catalyst. Secondly, the growing adoption of data analytics and business intelligence initiatives necessitates efficient and reliable data migration capabilities. Thirdly, stringent data privacy regulations and compliance requirements are pushing organizations to adopt robust data migration solutions for better control and security. Finally, the rising demand for data-driven decision making across diverse sectors like healthcare, finance, and manufacturing is further bolstering market growth.

    Segment-wise, the cloud-based Mass Data Migration Service is expected to dominate the market due to its scalability, cost-effectiveness, and enhanced security features. Among application segments, healthcare & life sciences, manufacturing, and BFSI are leading the adoption, reflecting their substantial data volumes and the critical need for secure and efficient data handling. Geographically, North America and Europe currently hold significant market share, but the Asia-Pacific region is anticipated to experience substantial growth driven by increasing digitalization and investment in technological infrastructure.

    However, challenges such as data security concerns, integration complexities, and the lack of skilled professionals capable of handling large-scale data migrations represent potential restraints to market growth. Despite these challenges, the overall outlook for the Mass Data Migration Service market remains highly positive, promising substantial growth and opportunities for market players in the coming years.

  17. Data from: A Large-Scale Dataset of Twitter Chatter about Online Learning...

    • ieee-dataport.org
    Updated Aug 10, 2022
    + more versions
    Cite
    Nirmalya Thakur (2022). A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave [Dataset]. https://ieee-dataport.org/documents/large-scale-dataset-twitter-chatter-about-online-learning-during-current-covid-19-omicron
    Explore at:
    Dataset updated
    Aug 10, 2022
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    no. 8

  18. New Visions for Large Scale Networks: Research and Applications

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated May 14, 2025
    + more versions
    Cite
    NCO NITRD (2025). New Visions for Large Scale Networks: Research and Applications [Dataset]. https://catalog.data.gov/dataset/new-visions-for-large-scale-networks-research-and-applications
    Explore at:
    Dataset updated
    May 14, 2025
    Dataset provided by
    NCO NITRD
    Description

    This paper documents the findings of the March 12-14, 2001 Workshop on New Visions for Large-Scale Networks: Research and Applications. The workshop's objectives were to develop a vision for the future of networking 10 to 20 years out and to identify needed Federal networking research to enable that vision...

  19. Community Detection to Split Large-scale Assemblies in Subassemblies

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 19, 2023
    Cite
    Münker, Sören (2023). Community Detection to Split Large-scale Assemblies in Subassemblies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8260584
    Explore at:
    Dataset updated
    Aug 19, 2023
    Dataset authored and provided by
    Münker, Sören
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is motivated by the need to preprocess large-scale CAD models for assembly-by-disassembly approaches. Assembly-by-disassembly is only suitable for assemblies with a small number of parts (n_parts < 22). However, when dealing with large-scale products with high complexity, the CAD models may not contain feasible subassemblies (e.g. with connected and interference-free parts) and have too many parts to be processed with assembly-by-disassembly. Product designers' preferences during the design phase might not be ideal for assembly-by-disassembly processing because they do not explicitly consider subassembly feasibility and the number of parts per subassembly. An automated preprocessing approach is proposed to address this issue by splitting the model into manageable partitions using community detection. This allows for parallelised, efficient and accurate assembly-by-disassembly of large-scale CAD models. However, applying community detection methods to automatically split CAD models into smaller subassemblies is a new concept, and research on its suitability for assembly sequence planning (ASP) needs to be conducted. Therefore, the following underlying research question will be answered in these experiments:

    Underlying research question 2: Can automated preprocessing increase the suitability of CAD-based assembly-by-disassembly for large-scale products?

    A hypothesis is formulated to answer this research question, which will be utilised to design experiments for hypothesis testing.

    Hypothesis 2: Community detection algorithms can be applied to automatically split large-scale assemblies into suitable candidates for CAD-based AND/OR graph generation.
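
    As an illustration of the general idea (not the authors' implementation), a part-connection graph can be split into candidate subassemblies with modularity-based community detection, for example using networkx:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Toy part-connection graph: nodes are parts, edges are contacts or joints
    # extracted from the CAD model.
    assembly = nx.Graph()
    assembly.add_edges_from([
        ("base", "bracket"), ("bracket", "bolt_1"), ("bracket", "bolt_2"),
        ("base", "housing"), ("housing", "cover"), ("cover", "screw_1"),
    ])

    for i, parts in enumerate(greedy_modularity_communities(assembly)):
        print(f"candidate subassembly {i}: {sorted(parts)}")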

  20. Accompanying data - Papyrus - A large scale curated dataset aimed at...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 8, 2024
    + more versions
    Cite
    Jespers, Willem (2024). Accompanying data - Papyrus - A large scale curated dataset aimed at bioactivity predictions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7019873
    Explore at:
    Dataset updated
    Apr 8, 2024
    Dataset provided by
    Jespers, Willem
    IJzerman, Ad P.
    Béquignon, Olivier J. M.
    Bongers, Brandon J.
    van Westen, Gerard J. P.
    van de Water, Bob
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Addition of supporting files:

    • LICENSE.txt
    • data_types.json
    • data_size.json

    Fixed version of Papyrus++ 05.5:

    • In the previous 05.5 version, data was incorrectly duplicated based on assay type. This resulted in unintended data augmentation.
    • In this fixed 05.5 version, the duplicates have been eliminated, now reporting the correct amount of data per assay type.

    This repository contains the version 05.5 of the Papyrus dataset, an aggregated dataset of small molecule bioactivities, as described in the article "Papyrus - A large scale curated dataset aimed at bioactivity predictions" http://doi.org/10.1186/s13321-022-00672-x.

    With the ongoing rapid growth of publicly available ligand-protein bioactivity data, there is a trove of valuable data that can be used to train a plethora of machine learning algorithms. However, not all data is equal in terms of size and quality, and a significant portion of researchers' time is needed to adapt the data to their needs. On top of that, finding the right data for a research question can often be a challenge on its own.

    To meet these challenges we have constructed the Papyrus dataset. Papyrus comprises around 60 million datapoints. This dataset contains multiple large publicly available datasets such as ChEMBL and ExCAPE-DB combined with several smaller datasets containing high-quality data. The aggregated data has been standardised and normalised in a manner that is suitable for machine learning. We show how data can be filtered in a variety of ways and also perform some example quantitative structure-activity relationship analyses and proteochemometric modelling. Our ambition is that this pruned data collection constitutes a benchmark set that can be used for constructing predictive models, while also providing a solid baseline for related research.
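
    As a rough sketch of the kind of filtering described above, assuming the bioactivity table has been saved as a tab-separated file and using assumed column names (the actual names are documented with the dataset):

    import pandas as pd

    # Assumed file and column names; consult the Papyrus documentation for the
    # actual schema of the bioactivity table.
    papyrus = pd.read_csv("papyrus_bioactivities.tsv", sep="\t", low_memory=False)

    # Keep activities for one protein target above an activity threshold.
    subset = papyrus[
        (papyrus["accession"] == "P29274")        # illustrative UniProt accession
        & (papyrus["pchembl_value_Mean"] >= 6.5)  # illustrative threshold
    ]
    print(len(subset), "datapoints retained for modelling")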
