7 datasets found
  1. Model Zoo: A Dataset of Diverse Populations of Neural Network Models - MNIST

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 13, 2022
    Cite
    Schürholt, Konstantin (2022). Model Zoo: A Dataset of Diverse Populations of Neural Network Models - MNIST [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6632086
    Dataset provided by
    Schürholt, Konstantin
    Giró-i-Nieto, Xavier
    Taskiran, Diyar
    Borth, Damian
    Knyazev, Boris
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    In recent years, neural networks have evolved from laboratory environments to the state of the art for many real-world problems. Our hypothesis is that neural network models (i.e., their weights and biases) evolve on unique, smooth trajectories in weight space during training. Consequently, a population of such neural network models (referred to as a “model zoo”) would form topological structures in weight space. We think that the geometry, curvature and smoothness of these structures contain information about the state of training and can reveal latent properties of individual models. With such zoos, one could investigate novel approaches to (i) model analysis, (ii) discovering unknown learning dynamics, (iii) learning rich representations of such populations, or (iv) exploiting the model zoos for generative modelling of neural network weights and biases. Unfortunately, the lack of standardized model zoos and available benchmarks significantly increases the friction for further research on populations of neural networks. With this work, we publish a novel dataset of model zoos containing systematically generated and diverse populations of neural network models for further research. In total, the proposed model zoo dataset is based on six image datasets, consists of 24 model zoos generated with varying hyperparameter combinations, and includes 47’360 unique neural network models, resulting in over 2’415’360 collected model states. In addition to the model zoo data, we provide an in-depth analysis of the zoos and benchmarks for multiple downstream tasks, as mentioned above.

    Dataset

    This dataset is part of a larger collection of model zoos and contains the zoos trained on the labelled samples from MNIST. All zoos with extensive information and code can be found at www.modelzoos.cc.

    This repository contains two types of files: the raw model zoos as collections of models (file names beginning with "mnist_"), and preprocessed model zoos wrapped in a custom PyTorch dataset class (file names beginning with "dataset"). Zoos are trained in three configurations: varying the seed only (seed), varying hyperparameters with fixed seeds (hyp_fix), or varying hyperparameters with random seeds (hyp_rand). The index_dict.json files contain information on how to read the vectorized models.
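
    As an illustration, a minimal loading sketch is given below. The file names and the assumption that the files can be opened with torch.load are illustrative only; the authoritative loading code and the dataset class definition are available at www.modelzoos.cc.

    import json
    import torch

    # Hypothetical file names for illustration; check the actual archive contents.
    # Loading the preprocessed zoos requires the custom dataset class from the
    # code at www.modelzoos.cc to be importable, since the files are pickled
    # PyTorch objects.
    zoo = torch.load("dataset.pt", map_location="cpu")

    # index_dict.json describes how to map positions in the vectorized models
    # back to layers and parameters.
    with open("index_dict.json") as f:
        index_dict = json.load(f)

    print(type(zoo), list(index_dict.keys())[:5])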

    For more information on the zoos and code to access and use the zoos, please see www.modelzoos.cc.

  2. IMDB 50K Movie Reviews (TEST your BERT)

    • kaggle.com
    Updated Dec 19, 2019
    Cite
    Atul Anand {Jha} (2019). IMDB 50K Movie Reviews (TEST your BERT) [Dataset]. https://www.kaggle.com/atulanandjha/imdb-50k-movie-reviews-test-your-bert/discussion
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Atul Anand {Jha}
    License

    GNU LGPL 3.0: http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Context

    Large Movie Review Dataset v1.0 😃

    (Image: IMDB header, https://static.amazon.jobs/teams/53/images/IMDb_Header_Page.jpg?1501027252)

    This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and an already-processed bag-of-words format are provided.

    In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance can be obtained by memorising movie-unique terms and their association with observed labels. In the labelled train/test sets, a negative review has a score <= 4 out of 10 and a positive review has a score >= 7 out of 10; reviews with more neutral ratings are not included. In the unsupervised set, reviews of any rating are included, with an equal number of reviews rated > 5 and <= 5.

    Reference: http://ai.stanford.edu/~amaas/data/sentiment/

    NOTE

    A starter kernel is here: https://www.kaggle.com/atulanandjha/bert-testing-on-imdb-dataset-starter-kernel

    A kernel to expose the dataset collection:

    Content

    Now let’s understand the task at hand: given a movie review, predict whether it’s positive or negative.

    The dataset we use is 50,000 IMDB reviews (25K for train and 25K for test) from the PyTorch-NLP library.

    Each review is tagged pos or neg.

    There are 50% positive reviews and 50% negative reviews both in train and test sets.

    Columns:

    text: Reviews from people.

    Sentiment: Negative or Positive tag on the review/feedback (Boolean).
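
    For a quick sanity check, here is a minimal sketch of loading the reviews through the PyTorch-NLP loader referenced below; treat the exact imdb_dataset signature and the 'text'/'sentiment' keys as assumptions to verify against the torchnlp documentation.

    from torchnlp.datasets import imdb_dataset  # pip install pytorch-nlp

    # Load the 25K/25K train/test split; each example is expected to be a dict
    # with 'text' and 'sentiment' ('pos'/'neg') fields, matching the columns above.
    train, test = imdb_dataset(train=True, test=True)

    print(len(train), len(test))                     # expected: 25000 25000
    print(train[0]['sentiment'], train[0]['text'][:80])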

    Acknowledgements

    When using this dataset, please cite the following ACL paper:

    @InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
      title     = {Learning Word Vectors for Sentiment Analysis},
      booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
      month     = {June},
      year      = {2011},
      address   = {Portland, Oregon, USA},
      publisher = {Association for Computational Linguistics},
      pages     = {142--150},
      url       = {http://www.aclweb.org/anthology/P11-1015}
    }

    Link to the reference dataset: https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/datasets/imdb.html

    https://www.samyzaf.com/ML/imdb/imdb.html

    Inspiration

    BERT and other Transformer-architecture models have received a lot of attention recently thanks to the breakthrough of transfer learning in NLP. So let's use this simple yet effective dataset to test these models and compare our results with theirs. I also invite fellow researchers to try out their state-of-the-art algorithms on this dataset.

  3. Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Oct 29, 2021
    Cite
    Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. http://doi.org/10.5281/zenodo.5612316
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

    There are two files:

    sentence_pairs_for_pretrain_no_tokenization.tar.gz -> contains only sentences as evidence (Text-only)

    table_pairs_for_pretrain_no_tokenization.tar.gz -> at least one piece of evidence is a table (Hybrid)

    The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

    For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

    Below is a sample code snippet to load the data

    import webdataset as wds
    
    # path to the uncompressed files, should be a directory with a set of tar files
    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
    dataset = (
      wds.Dataset(url) # note: recent webdataset releases name this class wds.WebDataset
      .shuffle(1000) # cache 1000 samples and shuffle
      .decode()
      .to_tuple("json")
      .batched(20) # group every 20 examples into a batch
    )
    
    # Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch
    # You can also iterate through all examples and dump them with your preferred data format

    Below we show how the data is organized with two examples.

    Text-only

    {'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence
     's1_all_links': {
      'Sils,_Girona': [[0, 4]],
      'municipality': [[10, 22]],
      'Comarques_of_Catalonia': [[30, 37]],
      'Selva': [[41, 46]],
      'Catalonia': [[51, 60]]
     }, # list of entities and their mentions in the sentence (start, end location)
     'pairs': [ # other sentences that share common entity pair with the query, group by shared entity pairs
      {
        'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair
        's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query
        's2s': [ # list of other sentences that contain the common entity pair, or evidence
         {
           'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
           'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
           's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
           'pair_locs': [ # mentions of the entity pair in the evidence
            [[19, 27]], # mentions of entity 1
            [[0, 5], [288, 293]] # mentions of entity 2
           ],
           'all_links': {
            'Selva': [[0, 5], [288, 293]],
            'Comarques_of_Catalonia': [[19, 27]],
            'Catalonia': [[40, 49]]
           }
          }
        ,...] # there are multiple evidence sentences
       },
     ,...] # there are multiple entity pairs in the query
    }

    Hybrid

    {'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
     's1_all_links': {...}, # same as text-only
     'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only
     'table_pairs': [{
      'tid': 'Major_League_Baseball-1',
      'text':[
        ['World Series Records', 'World Series Records', ...],
        ['Team', 'Number of Series won', ...],
        ['St. Louis Cardinals (NL)', '11', ...],
      ...] # table content, list of rows
      'index':[
        [[0, 0], [0, 1], ...],
        [[1, 0], [1, 1], ...],
      ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table.
      'value_ranks':[
        [0, 0, ...],
        [0, 0, ...],
        [0, 10, ...],
      ...] # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS
      'value_inv_ranks': [], # inverse rank
      'all_links':{
        'St._Louis_Cardinals': {
         '2': [
          [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]]
         ] # list of mentions in the second row, the key is row_id
        },
        'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
      }
      'name': '', # table name, if exists
      'pairs': {
        'pair': ['American_League', 'National_League'],
        's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query
        'table_pair_locs': {
         '17': [ # mention of entity pair in row 17
           [
            [[17, 0], [3, 18]],
            [[17, 1], [3, 18]],
            [[17, 2], [3, 18]],
            [[17, 3], [3, 18]]
           ], # mention of the first entity
           [
            [[17, 0], [21, 36]],
            [[17, 1], [21, 36]],
           ] # mention of the second entity
         ]
        }
       }
     }]
    }
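
    Putting the pieces together, a short sketch of iterating the pipeline defined above and touching a few of the fields shown in the examples could look like the following; the exact batch layout depends on the webdataset version, so the unpacking below is an assumption to verify.

    # Continues from the `dataset` object built in the earlier snippet.
    for batch in dataset:
        # Assumption: .to_tuple("json") + .batched(20) yields a tuple holding
        # a list of decoded JSON records.
        (examples,) = batch
        for ex in examples:
            print(ex['s1_text'])                      # the query sentence
            for pair in ex.get('pairs', []):          # Text-only records
                print('  entities:', pair['pair'], '| evidence:', len(pair['s2s']))
            for tp in ex.get('table_pairs', []):      # Hybrid records
                print('  table:', tp.get('tid'), '| rows:', len(tp.get('text', [])))
        break  # only inspect the first batch in this demo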

  4. cars_wagonr_swift

    • kaggle.com
    zip
    Updated Sep 11, 2019
    Cite
    Ajay (2019). cars_wagonr_swift [Dataset]. https://www.kaggle.com/ajaykgp12/cars-wagonr-swift
    Available download formats: zip (44486490 bytes)
    Authors
    Ajay
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data science beginners start with curated sets of data, but it is a well-known fact that in a real data science project, most of the time is spent on collecting, cleaning and organizing data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up the challenge of collecting images of two popular car models from a used-car website, where users upload pictures of the car they want to sell, and then training a deep neural network to identify the model of a car from its images. In my search for images I found that approximately 10 percent of the pictures did not correctly represent the intended car, and those pictures had to be deleted from the final data.

    Content

    There are 4000 images of two popular Maruti Suzuki cars in India (Swift and WagonR), with 2000 pictures belonging to each model. The data is divided into a training set with 2400 images, a validation set with 800 images and a test set with 800 images. The data was randomized before splitting into training, test and validation sets.

    A starter kernel is provided for Keras with a CNN. I have also created a GitHub project documenting advanced techniques in PyTorch and Keras for image classification, such as data augmentation, dropout, batch normalization and transfer learning; a minimal loading sketch is shown below.
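
    For readers starting from the raw folders, here is a minimal PyTorch loading sketch with the basic augmentation mentioned above; the directory layout (train/ and valid/ sub-folders, one folder per class) is an assumption about how you might organize the images, not a documented structure of this dataset.

    import torch
    from torchvision import datasets, transforms

    # Basic augmentation for training, plain resizing for validation.
    train_tf = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    valid_tf = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    # Hypothetical layout: cars_wagonr_swift/train/{swift,wagonr}/ etc.
    train_ds = datasets.ImageFolder("cars_wagonr_swift/train", transform=train_tf)
    valid_ds = datasets.ImageFolder("cars_wagonr_swift/valid", transform=valid_tf)

    train_loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)
    valid_loader = torch.utils.data.DataLoader(valid_ds, batch_size=32)

    print(train_ds.classes, len(train_ds), len(valid_ds))  # expect 2400 / 800 if split as described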

    Inspiration

    1. With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline model trained in Keras achieves 88% accuracy on the validation set; can we achieve even better performance, and by how much?

    2. Is the data collected for the two car models representative of all such cars from all over the country, or is there sample bias?

    3. I would also like someone to extend the concept to build a use case so that if a user uploads an incorrect car picture, the ML model can automatically flag it, for example a user uploading the wrong model or an image that is not a car.

  5. What you see is what you get: Delineating the urban jobs-housing spatial distribution at a parcel scale by using street view imagery

    • figshare.com
    zip
    Updated Feb 12, 2021
    Cite
    Yao Yao; Jiaqi Zhang; Chen Qian; Yu Wang; Shuliang Ren; Zehao Yuan; Qingfeng Guan (2021). What you see is what you get: Delineating the urban jobs-housing spatial distribution at a parcel scale by using street view imagery [Dataset]. http://doi.org/10.6084/m9.figshare.12960212.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yao Yao; Jiaqi Zhang; Chen Qian; Yu Wang; Shuliang Ren; Zehao Yuan; Qingfeng Guan
    License

    GNU GPL: https://www.gnu.org/copyleft/gpl.html

    Description

    The compressed package (Study_code.zip) contains the code files implementing the paper under review ("What you see is what you get: Delineating urban jobs-housing spatial distribution at a parcel scale by using street view imagery based on deep learning technique").

    The compressed package (input_land_parcel_with_attributes.zip) is the sampled mixed "jobs-housing" attribute data of the study area, with multiple probability attributes (Only working, Only living, Working and living) at the land-parcel scale.

    The compressed package (input_street_view_images.zip) is the surrounding street view data near the sampled land parcels (input_land_parcel_with_attributes.zip), with a pixel size of 240*160, obtained from Tencent Map (https://map.qq.com/).

    The compressed package (output_results.zip) contains the result vector files (jobs-housing pattern distribution and error distribution) and a file description (Readme.txt).

    This project uses several open-source Python libraries (NumPy, Pandas, Selenium, GDAL, PyTorch and scikit-learn) and complies with the GPL license.

    NumPy (https://numpy.org/) is an open-source numerical calculation tool developed by Travis Oliphant, used in this project for matrix operations. This library complies with the BSD license.

    Pandas (https://pandas.pydata.org/) is an open-source library providing high-performance, easy-to-use data structures and data analysis tools. This library complies with the BSD license.

    Selenium (https://www.selenium.dev/) is a suite of tools for automating web browsers, used in this project for getting street view images. This library complies with the BSD license.

    GDAL (https://gdal.org/) is a translator library for raster and vector geospatial data formats, used in this project for processing geospatial data. This library complies with the BSD license.

    PyTorch (https://pytorch.org/) is an open-source machine learning framework that accelerates the path from research prototyping to production deployment, used in this project for deep learning. This library complies with the BSD license.

    scikit-learn (https://scikit-learn.org/) is an open-source machine learning tool for Python, used in this project for comparing precision metrics. This library complies with the BSD license.

  6. Data from: WildfireSpreadTS: A dataset of multi-modal time series for wildfire spread prediction

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Gerard, Sebastian (2024). WildfireSpreadTS: A dataset of multi-modal time series for wildfire spread prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8006176
    Dataset provided by
    Zhao, Yu
    Gerard, Sebastian
    Sullivan, Josephine
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a multi-temporal, multi-modal remote-sensing dataset for predicting how active wildfires will spread at a resolution of 24 hours. The dataset consists of 13,607 images across 607 fire events in the United States from January 2018 to October 2021. For each fire event, the dataset contains a full time series of daily observations, containing detected active fires and variables related to fuel, topography and weather conditions.

    Documentation: WildfireSpreadTS_Documentation.pdf includes further details about the dataset, following Gebru et al.'s "Datasheets for Datasets" framework. This documentation is similar to the supplementary material of the associated NeurIPS paper, excluding only information about the experimental setup and results. For full details, please refer to the associated paper.

    Code: Get started working with the dataset at https://github.com/SebastianGer/WildfireSpreadTS. The code includes a PyTorch Dataset and Lightning DataModule to allow for easy access. We recommend converting the GeoTIFF files provided here to HDF5 files (bigger files, but much faster); the necessary code is also available in the repository.
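
    The repository linked above contains the official conversion script; purely as an illustration of the GeoTIFF-to-HDF5 idea, a hedged sketch using rasterio and h5py (file paths hypothetical, and not necessarily the libraries the authors use) might look like this:

    import glob

    import h5py
    import rasterio

    # Hypothetical paths; see https://github.com/SebastianGer/WildfireSpreadTS
    # for the real layout and the provided conversion code.
    tif_files = sorted(glob.glob("fire_event_0001/*.tif"))

    with h5py.File("fire_event_0001.hdf5", "w") as out:
        for i, path in enumerate(tif_files):
            with rasterio.open(path) as src:
                arr = src.read()  # (bands, height, width) array for one daily observation
            out.create_dataset(f"day_{i:03d}", data=arr, compression="gzip")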

    This work is funded by Digital Futures in the project EO-AI4GlobalChange. The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at C3SE partially funded by the Swedish Research Council through grant agreement no. 2022-06725.

  7. Data from: CEML - Counterfactuals for Explaining Machine Learning models - A Python toolbox

    • pub.uni-bielefeld.de
    Updated May 28, 2024
    Cite
    André Artelt (2024). CEML - Counterfactuals for Explaining Machine Learning models - A Python toolbox [Dataset]. https://pub.uni-bielefeld.de/record/2936468
    Authors
    André Artelt
    Description

    CEML - Counterfactuals for Explaining Machine Learning models - A Python toolbox

    ceml is a Python toolbox for computing counterfactuals. Counterfactuals can be used to explain the predictions of machine learning models.

    It supports many common machine learning frameworks:

      - scikit-learn
      - PyTorch
      - Keras
      - Tensorflow
    

    Furthermore, ceml is easy to use and can be extended very easily. See the documentation and user guide for more information on how to use and extend ceml.
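
    As a quick illustration, a minimal scikit-learn sketch is shown below. The generate_counterfactual call follows the usage pattern in ceml's documentation, but the exact signature and return value should be checked against the docs linked above; treat it as an assumption rather than a verified example.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    # Assumed entry point, based on ceml's documented scikit-learn interface.
    from ceml.sklearn import generate_counterfactual

    X, y = load_iris(return_X_y=True)
    model = DecisionTreeClassifier(max_depth=3).fit(X, y)

    x = X[0, :]  # an instance the model currently classifies as class 0
    # Ask for a small change to x that makes the model predict class 1 instead.
    cf = generate_counterfactual(model, x, y_target=1)
    print(cf)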

