Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
it is crucial to examine them from an empirical perspective.
https://academictorrents.com/nolicensespecified
The sample inefficiency of standard deep reinforcement learning methods precludes their application to many real-world problems. Methods which leverage human demonstrations require fewer samples but have been researched less. As demonstrated in the computer vision and natural language processing communities, large-scale datasets have the capacity to facilitate research by serving as an experimental and benchmarking platform for new methods. However, existing datasets compatible with reinforcement learning simulators do not have sufficient scale, structure, and quality to enable the further development and evaluation of methods focused on using human examples. Therefore, we introduce a comprehensive, large-scale, simulator-paired dataset of human demonstrations: MineRL. The dataset consists of over 60 million automatically annotated state-action pairs across a variety of related tasks in Minecraft, a dynamic, 3D, open-world environment. We present a novel data collection scheme which al
A dataset of 3D animal models used for training and testing 3D shape reconstruction models.
LRS3-TED: a large-scale dataset for visual speech recognition.
The National Hydrography Dataset (NHD) is a feature-based database that interconnects and uniquely identifies the stream segments or reaches that make up the nation's surface water drainage system. NHD data was originally developed at 1:100,000 scale and exists at that scale for the whole country. High resolution NHD adds detail to the original 1:100,000-scale NHD. (Data for Alaska, Puerto Rico and the Virgin Islands was developed at high-resolution, not 1:100,000 scale.) Like the 1:100,000-scale NHD, high resolution NHD contains reach codes for networked features and isolated lakes, flow direction, names, stream level, and centerline representations for areal water bodies. Reaches are also defined to represent waterbodies and the approximate shorelines of the Great Lakes, the Atlantic and Pacific Oceans and the Gulf of Mexico. The NHD also incorporates the National Spatial Data Infrastructure framework criteria set out by the Federal Geographic Data Committee.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Details: The INCLUDE dataset has 4,292 videos (the paper mentions 4,287 videos, but 5 videos were added later). The videos used for training are listed in train.csv (3,475 files), while those used for testing are listed in test.csv (817 files). Each video is a recording of 1 ISL sign, signed by deaf students from St. Louis School for the Deaf, Adyar, Chennai.
INCLUDE50 has 766 train videos and 192 test videos.
Train-Test Split: Please download the train-test split for INCLUDE and INCLUDE50 from here: Train-Test Split
Publication Link: https://dl.acm.org/doi/10.1145/3394171.3413528
AI4Bharat website: https://sign-language.ai4bharat.org/
Download Instructions
For ease of access, we have prepared a Shell Script to download all the parts of the dataset and extract them to form the complete INCLUDE dataset.
You can find the script here: http://bit.ly/include_dl
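Once the dataset parts are downloaded and extracted, the train.csv and test.csv files mentioned above define the split. A minimal Python sketch of loading them is shown below; the assumption that each CSV lists one video per row is illustrative, and the actual column names may differ.

```python
# Minimal sketch: count the videos listed in each split file.
# Assumes train.csv and test.csv are in the working directory and list
# one video per row; column names are not taken from the dataset docs.
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(len(train_df), "training videos")  # expected: 3475
print(len(test_df), "test videos")       # expected: 817
```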
Paper Abstract: Indian Sign Language (ISL) is a complete language with its own grammar, syntax, vocabulary and several unique linguistic attributes. It is used by over 5 million deaf people in India. Currently, there is no publicly available dataset on ISL to evaluate Sign Language Recognition (SLR) approaches. In this work, we present the Indian Lexicon Sign Language Dataset - INCLUDE - an ISL dataset that contains 0.27 million frames across 4,287 videos over 263 word signs from 15 different word categories. INCLUDE is recorded with the help of experienced signers to provide close resemblance to natural conditions. A subset of 50 word signs is chosen across word categories to define INCLUDE-50 for rapid evaluation of SLR methods with hyperparameter tuning. The best performing model achieves an accuracy of 94.5% on the INCLUDE-50 dataset and 85.6% on the INCLUDE dataset.
RGBD1K: a large-scale dataset and benchmark for RGB-D object tracking.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This tar.gz file contains the dataset for UniTSyn.
This paper documents the findings of the March 12-14, 2001 Workshop on New Visions for Large-Scale Networks: Research and Applications. The workshop's objectives were to develop a vision for the future of networking 10 to 20 years out and to identify needed Federal networking research to enable that vision...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
2022
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The dataset published in the LREC 2022 paper "Large-Scale Hate Speech Detection with Cross-Domain Transfer".
This is Dataset v2:
The modified dataset includes 68,597 tweets in English. Annotations with more than 80% agreement are included.
TweetID: Tweet ID from the Twitter API
LangID: 1 (English)
TopicID: domain of the topic (0-Religion, 1-Gender, 2-Race, 3-Politics, 4-Sports)
HateLabel: final hate label decision (0-Normal, 1-Offensive, 2-Hate)
GitHub Repo:
NOTE: … See the full description on the dataset page: https://huggingface.co/datasets/ctoraman/large-scale-hate-speech-v2.
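Since the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the datasets library. The sketch below is a minimal example; the existence of a "train" split and the exact field names are assumptions based on the schema listed above and should be checked against the dataset page.

```python
from datasets import load_dataset

# Load the Hugging Face dataset referenced above; split and field names
# are assumptions based on the schema description, not verified here.
ds = load_dataset("ctoraman/large-scale-hate-speech-v2")
print(ds)

example = ds["train"][0]  # assumes a "train" split exists
print(example["TweetID"], example["TopicID"], example["HateLabel"])
```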
S3DIS comprises 6 colored 3D point clouds from 6 large-scale indoor areas, along with semantic instance annotations for 12 object categories (wall, floor, ceiling, beam, column, window, door, sofa, desk, chair, bookcase, and board).
The Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset is composed of the colored 3D point clouds of six large-scale indoor areas from three different buildings, covering approximately 935, 965, 450, 1700, 870, and 1100 square meters (6020 square meters in total). These areas show diverse properties in architectural style and appearance and include mainly office areas, educational and exhibition spaces, and conference rooms; personal offices, restrooms, open spaces, lobbies, stairways, and hallways are commonly found therein. The point clouds are generated automatically, without any manual intervention, using the Matterport scanner. The dataset also includes semantic instance annotations on the point clouds for 12 semantic elements, which are structural elements (ceiling, floor, wall, beam, column, window, and door) and commonly found items and furniture (table, chair, sofa, bookcase, and board).
COIN dataset for comprehensive instructional video analysis.
https://academictorrents.com/nolicensespecified
Aesthetic Visual Analysis (AVA) contains over 250,000 images along with a rich variety of metadata, including a large number of aesthetic scores for each image, semantic labels for over 60 categories, and labels related to photographic style for high-level image quality categorization.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Please cite the following paper when using this dataset: N. Thakur, "A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave," Journal of Data, vol. 7, no. 8, p. 109, Aug. 2022, doi: 10.3390/data7080109.

Abstract
The COVID-19 Omicron variant, reported to be the most immune-evasive variant of COVID-19, is resulting in a surge of COVID-19 cases globally. This has caused schools, colleges, and universities in different parts of the world to transition to online learning. As a result, social media platforms such as Twitter are seeing an increase in conversations related to online learning, centered around information seeking and sharing. Mining such conversations (i.e., Tweets) to develop a dataset can serve as a data resource for interdisciplinary research related to the analysis of interest, views, opinions, perspectives, attitudes, and feedback towards online learning during the current surge of COVID-19 cases caused by the Omicron variant. Therefore, this work presents a large-scale public Twitter dataset of conversations about online learning since the first detected case of the COVID-19 Omicron variant in November 2021. The dataset files contain the raw version, which comprises 52,868 Tweet IDs (corresponding to the same number of Tweets), and the cleaned and preprocessed version, which contains 46,208 unique Tweet IDs. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.

Data Description
The dataset comprises 7 .txt files. The raw version of this dataset comprises 6 .txt files (TweetIDs_Corona Virus.txt, TweetIDs_Corona.txt, TweetIDs_Coronavirus.txt, TweetIDs_Covid.txt, TweetIDs_Omicron.txt, and TweetIDs_SARS CoV2.txt) that contain Tweet IDs grouped together based on the synonyms or terms that were used to refer to online learning and the Omicron variant of COVID-19 in the respective Tweets. The cleaned and preprocessed version of this dataset is provided in the file TweetIDs_Duplicates_Removed.txt. The dataset contains only Tweet IDs, in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated before use. For hydrating this dataset, the Hydrator application may be used (download: https://github.com/DocNow/hydrator/releases; step-by-step tutorial: https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e#:~:text=Hydrating%20Tweetsr).
The list of all the synonyms or terms that were used for the dataset development is as follows:
COVID-19: Omicron, COVID, COVID19, coronavirus, coronaviruspandemic, COVID-19, corona, coronaoutbreak, omicron variant, SARS CoV-2, corona virus
Online learning: online education, online learning, remote education, remote learning, e-learning, elearning, distance learning, distance education, virtual learning, virtual education, online teaching, remote teaching, virtual teaching, online class, online classes, remote class, remote classes, distance class, distance classes, virtual class, virtual classes, online course, online courses, remote course, remote courses, distance course, distance courses, virtual course, virtual courses, online school, virtual school, remote school, online college, online university, virtual college, virtual university, remote college, remote university, online lecture, virtual lecture, remote lecture, online lectures, virtual lectures, remote lectures

A description of the dataset files is provided below:
TweetIDs_Corona Virus.txt: contains 321 Tweet IDs corresponding to Tweets that comprise the keyword "corona virus" and one or more keywords/terms that refer to online learning.
TweetIDs_Corona.txt: contains 1,819 Tweet IDs corresponding to Tweets that comprise the keyword "corona" or "coronaoutbreak" and one or more keywords/terms that refer to online learning.
TweetIDs_Coronavirus.txt: contains 1,429 Tweet IDs corresponding to Tweets that comprise the keyword "coronavirus" or "coronaviruspandemic" and one or more keywords/terms that refer to online learning.
TweetIDs_Covid.txt: contains 41,088 Tweet IDs corresponding to Tweets that comprise the keyword "COVID", "COVID19", or "COVID-19" and one or more keywords/terms that refer to online learning.
TweetIDs_Omicron.txt: contains 8,198 Tweet IDs corresponding to Tweets that comprise the keyword "omicron" or "omicron variant" and one or more keywords/terms that refer to online learning.
TweetIDs_SARS CoV2.txt: contains 13 Tweet IDs corresponding to Tweets that comprise the keyword "SARS-CoV-2" and one or more keywords/terms that refer to online learning.
TweetIDs_Duplicates_Removed.txt: a collection of 46,208 unique Tweet IDs from all the 6 .txt files mentioned above after...
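As a small illustration of how the raw files relate to the deduplicated file, the sketch below merges the six raw Tweet-ID lists and removes duplicates, mirroring the role of TweetIDs_Duplicates_Removed.txt. It assumes each .txt file contains one Tweet ID per line, which is an assumption not stated explicitly above.

```python
from pathlib import Path

# Raw Tweet-ID files named in the data description above.
raw_files = [
    "TweetIDs_Corona Virus.txt", "TweetIDs_Corona.txt",
    "TweetIDs_Coronavirus.txt", "TweetIDs_Covid.txt",
    "TweetIDs_Omicron.txt", "TweetIDs_SARS CoV2.txt",
]

# Merge all IDs and drop duplicates (assumes one Tweet ID per line).
unique_ids = set()
for name in raw_files:
    for line in Path(name).read_text().splitlines():
        tweet_id = line.strip()
        if tweet_id:
            unique_ids.add(tweet_id)

Path("my_deduplicated_ids.txt").write_text("\n".join(sorted(unique_ids)))
print(len(unique_ids), "unique Tweet IDs")  # the published file has 46,208
```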
The XMM Large Scale Structure survey (XMM-LSS) is an X-ray survey aimed at studying the large scale structure of the Universe. The XMM-LSS field (centered at RA (J2000) = 02h 24m 00.27s, Dec (J2000) = -04o 09' 47.6") is currently being followed up using observations across a wide range of wavelengths, and in their paper the authors present the observational results of a low frequency radio survey of the XMM-LSS field using the Very Large Array at 74 and 325 MHz. This survey will map out the locations of the extragalactic radio sources relative to the large scale structure as traced by the X-ray emission. This is of particular interest because radio galaxies and radio-loud AGN show strong and complex interactions with their small and larger scale environment, and different classes of radio galaxies are suggested to lie at different places with respect to the large scale structure. For the phase calibration of the radio data, the authors used standard self-calibration at 325 MHz and field-based calibration at 74 MHz. Polyhedron-based imaging as well as mosaicking methods were used at both frequencies. At 74 MHz, the resolution was 30 arcseconds, the median 5-sigma sensitivity was ~ 162 mJy/beam and 666 sources were detected over an area of 132 square degrees. At 325 MHz, the resolution was 6.7 arcseconds, the median 5-sigma sensitivity was 4 mJy/beam, and 847 sources were detected over an area of 15.3 square degrees. At 325 MHz, a region of diffuse radio emission which is a cluster halo or relic candidate was detected. The observations were conducted using the VLA in July 2003 in the A-configuration (most extended) and in June 2002 in the B-configuration. This table contains the VLA 325-MHz source list, comprising 605 single sources and 615 components of 237 multiple sources, for a total of 1220 entries. (Notice that, in Section 4.1 of the reference paper, somewhat different numbers are given, i.e., the authors quote 621 single sources and 226 multiple sources). For the multiple sources, each component (A, B, etc.) is listed separately, in order of decreasing brightness. This table was created by the HEASARC in March 2012 based on CDS Catalog J/A+A/456/791 file tablea1.dat. This is a service provided by NASA HEASARC.
In this paper, we investigate the use of Bayesian networks to construct large-scale diagnostic systems. In particular, we consider the development of large-scale Bayesian networks by composition. This compositional approach reflects how (often redundant) subsystems are architected to form systems such as electrical power systems. We develop high-level specifications, Bayesian networks, clique trees, and arithmetic circuits representing 24 different electrical power systems. The largest among these 24 Bayesian networks contains over 1,000 random variables. Another BN represents the real-world electrical power system ADAPT, which is representative of electrical power systems deployed in aerospace vehicles. In addition to demonstrating the scalability of the compositional approach, we briefly report on experimental results from the diagnostic competition DXC, where the ProADAPT team, using techniques discussed here, obtained the highest scores in both Tier 1 (among 9 international competitors) and Tier 2 (among 6 international competitors) of the industrial track. While we consider diagnosis of power systems specifically, we believe this work is relevant to other system health management problems, in particular in dependable systems such as aircraft and spacecraft.
Reference: O. J. Mengshoel, S. Poll, and T. Kurtoglu. "Developing Large-Scale Bayesian Networks by Composition: Fault Diagnosis of Electrical Power Systems in Aircraft and Spacecraft." Proc. of the IJCAI-09 Workshop on Self-* and Autonomous Systems (SAS): Reasoning and Integration Challenges, 2009.
BibTeX reference:
@inproceedings{mengshoel09developing,
  title = {Developing Large-Scale {Bayesian} Networks by Composition: Fault Diagnosis of Electrical Power Systems in Aircraft and Spacecraft},
  author = {Mengshoel, O. J. and Poll, S. and Kurtoglu, T.},
  booktitle = {Proc. of the IJCAI-09 Workshop on Self-$\star$ and Autonomous Systems (SAS): Reasoning and Integration Challenges},
  year = {2009}
}
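To make the diagnostic use of a Bayesian network concrete, here is a toy sketch, not taken from the paper: the network structure, variable names, and probabilities are invented, and it uses the pgmpy library with variable elimination rather than the clique-tree and arithmetic-circuit machinery the paper describes.

```python
# Toy fault-diagnosis Bayesian network (illustrative only).
# pgmpy's BayesianNetwork is named DiscreteBayesianNetwork in newer releases.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("BatteryFault", "VoltageLow"),
                         ("SensorFault", "ReadingLow"),
                         ("VoltageLow", "ReadingLow")])

cpd_batt = TabularCPD("BatteryFault", 2, [[0.99], [0.01]])
cpd_sens = TabularCPD("SensorFault", 2, [[0.98], [0.02]])
cpd_volt = TabularCPD("VoltageLow", 2,
                      [[0.95, 0.10],   # P(VoltageLow=0 | BatteryFault=0,1)
                       [0.05, 0.90]],
                      evidence=["BatteryFault"], evidence_card=[2])
cpd_read = TabularCPD("ReadingLow", 2,
                      # columns: (SensorFault, VoltageLow) = (0,0) (0,1) (1,0) (1,1)
                      [[0.97, 0.05, 0.30, 0.05],
                       [0.03, 0.95, 0.70, 0.95]],
                      evidence=["SensorFault", "VoltageLow"], evidence_card=[2, 2])

model.add_cpds(cpd_batt, cpd_sens, cpd_volt, cpd_read)
assert model.check_model()

# Diagnosis: given a low sensor reading, which fault is more likely?
posterior = VariableElimination(model).query(
    variables=["BatteryFault", "SensorFault"], evidence={"ReadingLow": 1})
print(posterior)
```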
One of the key problems that arises in many areas is to estimate a potentially nonlinear function [tex]G(x, \theta)[/tex] given input and output samples [tex](x, y)[/tex] so that [tex]y \approx G(x, \theta)[/tex]. There are many approaches to addressing this regression problem. Neural networks, regression trees, and many other methods have been developed to estimate [tex]G[/tex] given the input-output pairs [tex](x, y)[/tex]. One method that I have worked with is called Gaussian process regression. There are many good texts and papers on the subject. For more technical information on the method and its applications, see: http://www.gaussianprocess.org/ A key problem that arises in developing these models on very large data sets is that they end up requiring an [tex]O(N^3)[/tex] computation, where N is the number of data points in the training sample. Obviously this becomes very problematic when N is large. I discussed this problem with Leslie Foster, a mathematics professor at San Jose State University. He, along with some of his students, developed a method to address this problem based on Cholesky decomposition and pivoting. He also shows that this leads to a numerically stable result. If you're interested in some light reading, I'd suggest you take a look at his recent paper (which was accepted in the Journal of Machine Learning Research) posted on dashlink. We've also posted code for you to try it out. Let us know how it goes. If you are interested in applications of this method in the area of prognostics, check out our new paper on the subject, which was published in IEEE Transactions on Systems, Man, and Cybernetics.
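For readers new to Gaussian process regression, the numpy sketch below shows the exact (non-approximate) computation; the Cholesky factorization of the N x N kernel matrix is the [tex]O(N^3)[/tex] step discussed above that pivoted-Cholesky approximations aim to avoid. The kernel choice, hyperparameters, and toy data are illustrative assumptions, not taken from the posted code.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    # Squared-exponential kernel k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2)).
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise_var=1e-2):
    # Exact GP regression: factorizing the N x N kernel matrix is O(N^3).
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # K^{-1} y
    K_s = rbf_kernel(X_train, X_test)
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf_kernel(X_test, X_test)) - np.sum(v**2, axis=0)
    return mean, var

# Toy data: noisy samples of y = sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(50, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(50)
X_new = np.linspace(0, 6, 100)[:, None]
mu, var = gp_predict(X, y, X_new)
```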
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CheXmask Database presents a comprehensive, uniformly annotated collection of chest radiographs, constructed from five public databases: ChestX-ray8, CheXpert, MIMIC-CXR-JPG, PadChest, and VinDr-CXR. The database aggregates 657,566 anatomical segmentation masks derived from images which have been processed using the HybridGNet model to ensure consistent, high-quality segmentation. To confirm the quality of the segmentations, we include in this database individual Reverse Classification Accuracy (RCA) scores for each of the segmentation masks. This dataset is intended to catalyze further innovation and refinement in the field of semantic chest X-ray analysis, offering a significant resource for researchers in the medical imaging domain.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
Website | arXiv | Code | Datasets
If you like our project or are interested in its updates, please star us :) Thank you!
Summary
TLDR: CoTA is a large-scale dataset of synthetic Chains-of-Thought-and-Action (CoTA) generated by programs.
Load data
from datasets import load_dataset
dataset = load_dataset("Salesforce/program-cota-llava"…
See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/program-cota-llava.
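The call above is truncated on the source page; a minimal sketch of the presumably intended load is shown below (the default configuration is an assumption, and the dataset card may require a specific config or split).

```python
from datasets import load_dataset

# Assumes the default configuration; see the dataset card for configs/splits.
dataset = load_dataset("Salesforce/program-cota-llava")
print(dataset)
```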