Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LSDB Archive metadata sheet
This dataset tracks the updates made on the dataset "Meta-analysis, Simpson's paradox, and the number needed to treat" as a repository for previous versions of the data and metadata.
This dataset tracks the updates made on the dataset "Pooling, meta-analysis, and the evaluation of drug safety" as a repository for previous versions of the data and metadata.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Pedestrians using mobility aids are the most vulnerable group of traffic participants. While there has been significant progress in the robustness and reliability of camera-based general pedestrian detection systems, pedestrians reliant on mobility aids are highly underrepresented in common datasets for object detection and classification.
To bridge this gap and enable research towards robust and reliable detection systems that may be employed in traffic monitoring, scheduling, and planning, we present this dataset of a pedestrian crossing scenario captured from an elevated traffic monitoring perspective, together with ground truth annotations (Yolo format [1]). Classes present in the dataset are pedestrians (without mobility aids) as well as pedestrians using wheelchairs, rollators/wheeled walkers, crutches, and walking canes. The dataset comes with official training, validation, and test splits.
An in-depth description of the dataset can be found in [2]. If you make use of this dataset in your work, research or publication, please cite this work as:
@inproceedings{mohr2023mau,
author = {Mohr, Ludwig and Kirillova, Nadezda and Possegger, Horst and Bischof, Horst},
title = {{A Comprehensive Crossroad Camera Dataset of Mobility Aid Users}},
booktitle = {Proceedings of the 34th British Machine Vision Conference ({BMVC}2023)},
year = {2023}
}
Archive mobility.zip contains the full detection dataset in Yolo format with images, ground truth labels, and metadata; archive mobility_class_hierarchy.zip contains labels and meta files (Yolo format) for training with a class hierarchy, using e.g. the modified version of Yolo v5/v8 available under [3].
To use this dataset with Yolo, download and extract the zip archive and change the path entry in dataset.yaml to the directory where you extracted the archive.
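For illustration, a minimal training sketch using the ultralytics package [1]; the extraction path and the choice of checkpoint are assumptions, not part of the dataset:

# A minimal sketch, assuming the archive was extracted to /data/mobility
# (hypothetical path) and that the ultralytics package [1] is installed.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # any pretrained YOLOv8 checkpoint works here
model.train(data="/data/mobility/dataset.yaml", epochs=100, imgsz=640)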
[1] https://github.com/ultralytics/ultralytics
[2] coming soon
[3] coming soon
http://www.opendefinition.org/licenses/cc-by-sa
Europeana (http://www.europeana.eu/) is built from other archives and aggregates their information, providing an API so that remote apps can access it. This dataset describes the archives that contributed to Europeana. The data is from 2011.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A biodiversity dataset graph: UCSB-IZC
The intended use of this archive is to facilitate (meta-)analysis of the UC Santa Barbara Invertebrate Zoology Collection (UCSB-IZC). UCSB-IZC is a natural history collection of invertebrate zoology at the Cheadle Center for Biodiversity and Ecological Restoration, University of California Santa Barbara.
This dataset provides versioned snapshots of the UCSB-IZC network as tracked by Preston [2,3] between 2021-10-08 and 2021-11-04 using [preston track "https://api.gbif.org/v1/occurrence/search/?datasetKey=d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0"].
This archive contains 14349 images related to 32533 occurrence/specimen records. See the included sample-image.jpg and its associated meta-data sample-image.json [4].
The images were counted using:
$ preston cat hash://sha256/80c0f5fc598be1446d23c95141e87880c9e53773cb2e0b5b54cb57a8ea00b20c
| grep -o -P ".*depict"
| sort
| uniq
| wc -l
And the occurrences were counted using:
$ preston cat hash://sha256/80c0f5fc598be1446d23c95141e87880c9e53773cb2e0b5b54cb57a8ea00b20c
| grep -o -P "occurrence/([0-9])+"
| sort
| uniq
| wc -l
The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files, and data files. Two index/provenance files have been individually included in this dataset publication. Index files provide a way to link provenance files in time, establishing a versioning mechanism.
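For illustration only, a sketch of fetching all 256 parts sequentially; the base URL is an assumption taken from the mirror list in the clone command below:

# Sketch: download preston-00.tar.gz ... preston-ff.tar.gz (hex-numbered parts).
# The base URL is an assumption; see the mirrors in the clone command below.
import urllib.request

base = "https://zenodo.org/record/5660088/files"
for i in range(256):
    name = f"preston-{i:02x}.tar.gz"
    urllib.request.urlretrieve(f"{base}/{name}", name)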
To retrieve and verify the downloaded UCSB-IZC biodiversity dataset graph, first download preston-*.tar.gz. Then, extract the archives into a "data" folder. Alternatively, you can use the Preston [2,3] command-line tool to "clone" this dataset using:
$ java -jar preston.jar clone --remote https://archive.org/download/preston-ucsb-izc/data.zip/,https://zenodo.org/record/5557670/files,https://zenodo.org/record/5660088/files/
After that, verify the index of the archive by reproducing the following provenance log history:
$ java -jar preston.jar history . .
To check the integrity of the extracted archive, confirm that each line produced by the command "preston verify" includes "CONTENT_PRESENT_VALID_HASH", as shown below. Depending on hardware capacity, this may take a while.
$ java -jar preston.jar verify
hash://sha256/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c file:/home/jhpoelen/ucsb-izc/data/ce/1d/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c OK CONTENT_PRESENT_VALID_HASH 66438 hash://sha256/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c
hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844 file:/home/jhpoelen/ucsb-izc/data/f6/8d/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844 OK CONTENT_PRESENT_VALID_HASH 4093 hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844
hash://sha256/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef file:/home/jhpoelen/ucsb-izc/data/3e/70/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef OK CONTENT_PRESENT_VALID_HASH 5746 hash://sha256/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef
hash://sha256/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b file:/home/jhpoelen/ucsb-izc/data/99/58/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b OK CONTENT_PRESENT_VALID_HASH 6147 hash://sha256/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b
Note that a copy of the Java program "preston", preston.jar, is included in this publication. The program runs on a Java 8+ virtual machine using "java -jar preston.jar", or in short "preston".
Files in this data publication:
--- start of file descriptions ---
README -- description of archive and its contents (this file)
preston.jar -- executable Java jar containing preston [2,3] v0.3.1
preston-[00-ff].tar.gz -- preston archive containing UCSB-IZC (meta-)data/image files, associated provenance logs and a provenance index
2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a -- individual provenance index file
sample-image.jpg (hash://sha256/916ba5dc6ad37a3c16634e1a0e3d2a09969f2527bb207220e3dbdbcf4d6b810c) and sample-image.json (hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844) -- example image and meta-data
--- end of file descriptions ---
References
[1] Cheadle Center for Biodiversity and Ecological Restoration (2021). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv accessed via GBIF.org on 2021-11-04 as indexed by the Global Biodiversity Information Facility (GBIF) with provenance hash://sha256/d5eb492d3e0304afadcc85f968de1e23042479ad670a5819cee00f2c2c277f36 hash://sha256/80c0f5fc598be1446d23c95141e87880c9e53773cb2e0b5b54cb57a8ea00b20c.
[2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543
[3] MJ Elliott, JH Poelen, JAB Fortes (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2020.101132
[4] Cheadle Center for Biodiversity and Ecological Restoration (2021). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv accessed via GBIF.org on 2021-10-08. https://www.gbif.org/occurrence/3323647301 . hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844 hash://sha256/916ba5dc6ad37a3c16634e1a0e3d2a09969f2527bb207220e3dbdbcf4d6b810c
Meta learning with LLM: supplemental code for reproducibility of computational results for MLT and MLT-plus-TM. Related research paper: "Meta Learning with Language Models: Challenges and Opportunities in the Classification of Imbalanced Text", A. Vassilev, H. Jin, M. Hasan, 2023 (to appear on arXiv). All code and data are contained in the zip archive arxiv2023.zip, subject to the licensing terms shown below. See the Readme.txt contained there for a detailed explanation of how to unpack and run the code. See also requirements.txt for the necessary dependencies (libraries needed). This is not a dataset, but only Python source code.
In June of 1994 and August and September of 1995, the U.S. Geological Survey, in cooperation with the University of Texas Bureau of Economic Geology, conducted geophysical surveys of the Sabine and Calcasieu Lake areas and the Gulf of Mexico offshore of eastern Texas and western Louisiana. This report serves as an archive of unprocessed digital boomer seismic reflection data, trackline maps, navigation files, observers' logbooks, GIS information, and formal FGDC metadata. In addition, a filtered and gained GIF image of each seismic profile is provided. The archived trace data are in standard Society of Exploration Geophysicists (SEG) SEG-Y format (Barry and others, 1975) and may be downloaded and processed with commercial or public domain software such as Seismic Unix (SU). Examples of SU processing scripts and in-house (USGS) software for viewing SEG-Y files (Zihlman, 1992) are also provided. Processed profile images, trackline maps, navigation files, and formal metadata may be viewed with a web browser. Scanned handwritten logbooks and Field Activity Collection System (FACS) logs may be viewed with Adobe Reader. For more information on the seismic surveys see http://walrus.wr.usgs.gov/infobank/g/g194gm/html/g-1-94-gm.meta.html and http://walrus.wr.usgs.gov/infobank/g/g195gm/html/g-1-95-gm.meta.html. These data are also available via GeoMapApp (http://www.geomapapp.org/) and Virtual Ocean (http://www.virtualocean.org/) earth science exploration and visualization applications.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset of midori (Blue Archive)
This is the dataset of midori (Blue Archive), containing 200 images and their tags. Images are crawled from many sites (e.g. danbooru, pixiv, zerochan, ...); the auto-crawling system is powered by the DeepGHS Team (a Hugging Face organization) using LittleAppleWebUI.
| Name | Images | Download | Description |
|---|---|---|---|
| raw | 200 | Download | Raw data with meta information. |
| raw-stage3 | 556 | Download | 3-stage cropped raw data with meta information. |
| raw-stage3-eyes | 676 | Download | … |

See the full description on the dataset page: https://huggingface.co/datasets/AppleHarem/midori_bluearchive.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset of hoshino (Blue Archive)
This is the dataset of hoshino (Blue Archive), containing 150 images and their tags. Images are crawled from many sites (e.g. danbooru, pixiv, zerochan, ...); the auto-crawling system is powered by the DeepGHS Team (a Hugging Face organization) using LittleAppleWebUI.
| Name | Images | Download | Description |
|---|---|---|---|
| raw | 150 | Download | Raw data with meta information. |
| raw-stage3 | 420 | Download | 3-stage cropped raw data with meta information. |
| raw-stage3-eyes | 477 | Download | … |

See the full description on the dataset page: https://huggingface.co/datasets/AppleHarem/hoshino_bluearchive.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset used in a meta-analysis examining the effects of educational technology on mathematics outcomes. Includes effects from 40 studies with codes for study and methodological features.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset of tsurugi (Blue Archive)
This is the dataset of tsurugi (Blue Archive), containing 200 images and their tags. Images are crawled from many sites (e.g. danbooru, pixiv, zerochan, ...); the auto-crawling system is powered by the DeepGHS Team (a Hugging Face organization) using LittleAppleWebUI.
| Name | Images | Download | Description |
|---|---|---|---|
| raw | 200 | Download | Raw data with meta information. |
| raw-stage3 | 531 | Download | 3-stage cropped raw data with meta information. |
| raw-stage3-eyes | 667 | Download | … |

See the full description on the dataset page: https://huggingface.co/datasets/AppleHarem/tsurugi_bluearchive.
From May 13 to May 14 of 2008, the U.S. Geological Survey conducted geophysical surveys in Lake Panasoffkee, Florida. This report serves as an archive of unprocessed digital boomer and CHIRP seismic reflection data, trackline maps, navigation files, GIS information, FACS logs, and formal FGDC metadata. Filtered and (or) gained digital images of the seismic profiles are also provided. The archived trace data are in standard Society of Exploration Geophysicists (SEG) SEG-Y format (Barry and others, 1975) and may be downloaded and processed with commercial or public domain software such as Seismic Unix (SU). Example SU processing scripts and USGS software for viewing the SEG-Y files (Zihlman, 1992) are also provided. For more information on the seismic surveys see http://walrus.wr.usgs.gov/infobank/j/j308fl/html/j-3-08-fl.meta.html. These data are also available via GeoMapApp (http://www.geomapapp.org/) and Virtual Ocean (http://www.virtualocean.org/) earth science exploration and visualization applications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.
The dataset is mainly collected from existing datasets. We used data from:
The dataset currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.
We use the regular expression tech(nical)?[\s\-_]*?debt to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag technical-debt.
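As a quick illustration of the pattern (a sketch, not the authors' matching code; case-insensitive matching is an assumption):

import re

# The mention pattern described above; IGNORECASE is an assumption.
pattern = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)

for text in ["technical debt", "tech-debt", "tech_debt", "TD"]:
    print(text, bool(pattern.search(text)))
# The first three variants match; the short form "TD" deliberately does not.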
The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.
id: the id used in the original source. We use the URL path to identify Medium posts.
body: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
created_utc: the time the item was posted in seconds since epoch in UTC.
author: the author of the item. We use the username or userid from the source.
source: where the item was posted. Valid sources are:
meta: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., score and num_comments, for keys that have the same meaning/information across multiple sources.

This is a sample item from Reddit:
{
"id": "ab8auf",
"body": "Technical Debt Explained (x-post r/Eve)",
"created_utc": 1546271789,
"author": "totally_100_human",
"source": "Reddit Submission",
"meta": {
"title": "Technical Debt Explained (x-post r/Eve)",
"score": 1,
"num_comments": 0,
"url": "http://jestertrek.com/eve/technical-debt-2.png",
"subreddit": "RCBRedditBot"
}
}
We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use jq to process the JSON.
Count the number of items per source:
lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c

Count Reddit submissions per month:
lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | gmtime | strftime("%Y-%m")' | sort | uniq -c

Which items link (meta.url) to PDF documents?
lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'

Export id, body, and author as CSV:
lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.
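The file can also be streamed directly from Python using only the standard library; a minimal sketch:

import bz2
import json

# Stream the compressed JSON-lines file one mention at a time.
with bz2.open("postscomments.json.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        print(item["source"], item["body"][:60])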
Please see https://github.com/sse-lnu/tdmentions for more analyses.
The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results
You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results
Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website.
Dataset layout
Python / Matlab versions
I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:
def unpickle(file):
import cPickle
with open(file, 'rb') as fo:
dict = cPickle.load(fo)
return dict
And a python3 version:
def unpickle(file):
import pickle
with open(file, 'rb') as fo:
dict = pickle.load(fo, encoding='bytes')
return dict
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
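Following this layout, a sketch of turning a loaded batch into image arrays (assumes numpy and the python3 unpickle above):

import numpy as np

batch = unpickle("data_batch_1")
data = batch[b"data"]        # uint8 array of shape (10000, 3072)
labels = batch[b"labels"]    # list of 10000 labels in the range 0-9

# Each row is R (1024) + G (1024) + B (1024), each channel row-major 32x32.
images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)  # (10000, 32, 32, 3)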
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:
label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.
Binary version
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:
<1 x label><3072 x pixel> ... <1 x label><3072 x pixel>
In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
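A sketch of parsing one binary batch according to this layout (assumes numpy):

import numpy as np

def read_cifar10_bin(path):
    # Each record is 1 label byte followed by 3072 pixel bytes (3073 bytes total).
    raw = np.fromfile(path, dtype=np.uint8).reshape(-1, 3073)
    labels = raw[:, 0]
    images = raw[:, 1:].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, labels

images, labels = read_cifar10_bin("data_batch_1.bin")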
The CIFAR-100 dataset
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
~10,000 professional retail scene photographs from UK grocery stores for computer vision research
| Attribute | Details |
|---|---|
| Total Images | ~10,000 high-resolution photos |
| Markets | United Kingdom |
| Collections | 2014 archive, Full store surveys, Halloween 2024 |
| Privacy | All faces automatically blurred |
| License | Evaluation & Research Only |
| Format | JPEG with comprehensive metadata |
This dataset is perfect for computer vision research; see the applications listed below.
Request access: https://huggingface.co/datasets/dresserman/kanops-open-access-imagery
This dataset is gated - request access on HuggingFace. By requesting access, you agree to the evaluation-only license terms.
from datasets import load_dataset
# Load the dataset (after getting HuggingFace access)
ds = load_dataset(
"imagefolder",
data_dir="hf://datasets/dresserman/kanops-open-access-imagery/train",
split="train",
)
# Access first image
img = ds[0]["image"] # PIL.Image
img.show()
import pandas as pd
meta = pd.read_csv(
"hf://datasets/dresserman/kanops-open-access-imagery/metadata.csv"
)
print(meta.head())
train/
├── 2014/
│   ├── Aldi/
│   ├── Tesco/
│   ├── Sainsburys/
│   └── ... (22 UK retailers)
├── FullStores/
│   ├── Tesco_Lincoln_2014/
│   ├── Tesco_Express_2015/
│   └── Asda_Leeds_2016/
└── Halloween2024/
    └── Various_Retailers/

Root files:
├── MANIFEST.csv       # File listing + basic attributes
├── metadata.csv       # Enriched metadata (retailer, dims, collection)
├── checksums.sha256   # Integrity verification
├── blur_log.csv       # Face-blur verification log
└── LICENSE            # Evaluation-only terms
Each image includes comprehensive metadata in metadata.csv:
| Field | Description |
|---|---|
| file_name | Path relative to dataset root |
| bytes | File size in bytes |
| width, height | Image dimensions |
| sha256 | Content hash for integrity verification |
| collection | One of: 2014, FullStores, Halloween2024 |
| retailer | Inferred from file path |
| year | Inferred from file path |
License: Evaluation & Research Only
For commercial licensing: Contact happytohelp@groceryinsight.com
This free sample is part of Kanops Archive - a much larger commercial dataset used by AI companies and research institutions.
Applications:
- Training production computer vision models
- Autonomous checkout systems
- Retail robotics and automation
- Seasonal demand forecasting
- Market research and competitive intelligence
Learn more: [groceryinsight.com/retail-image-dataset](...
The present dataset contains (meta)information extracted from the materials preserved in the archival funds of the International Institute of Intellectual Cooperation (IIIC), which were recently digitized [available at https://atom.archives.unesco.org/iiic ]. More precisely, the dataset focuses on subseries A and F from the Series Correspondence. Using machine learning and natural language processing (NLP) techniques, we have parsed scanned documents and extracted from them meta-information such as: people and location mentions, language (e.g., French), nature of material (e.g., letter vs. attached document), formal aspects (e.g., handwritten vs. typewritten), and, if possible, year of publication. Moreover, we have associated these entities (e.g., a given person) and information with the specific document(s) where they appear. We have divided the dataset into three files: one focused on people and two on locations (one for countries and another for cities). This dataset has been generated within the ERC-StG project named "Social Networks of the Past: Mapping Hispanic and Lusophone Literary Modernity, 1898-1959".
https://creativecommons.org/publicdomain/zero/1.0/
Dataset of meta-data from articles posted to the arXiv (arxiv.org, a preprint server where scientists post their publications for the public to view). You'll find article titles, abstracts, subjects, and publish dates from condensed matter physics articles (arXiv archive) in JSON format.
train.csv and test.csv are CSV files created from original JSON files with fields:
date: the date the article was posted to the arXiv. In train.csv the dates are all before May 2014; in test.csv the dates are all greater than or equal to May 1, 2014.
abstract: the abstract of the article.
title: the title of the article.
subject: the subject which the authors attribute to the article. This field is only present in train.csv. There are 30 unique subjects represented in this dataset.
Predict the number of articles in each subject.
Predict the number of articles in each subject, in every month of test.csv. As a check: in April 2014, there were 58 articles labeled "quantum physics" and 88 articles labeled "superconductivity". The predictions will be scored using the mean squared error (https://en.wikipedia.org/wiki/Mean_squared_error).
The final submission should be a CSV with the predictions. A sample submission with the correct format has been provided in submission.csv. Your submission should be IDENTICAL to this file, except that your predictions should not all be 0. Note that missing values will be filled in with 0. Please examine this file to make sure your submission is in the proper format.
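As a starting point, a sketch of computing the historical monthly counts per subject from train.csv (assumes pandas; column names follow the field description above):

import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["date"])

# Articles per subject per month: the quantity the task asks you to predict.
counts = (
    train.groupby([train["date"].dt.to_period("M"), "subject"])
    .size()
    .rename("n_articles")
    .reset_index()
)
print(counts.tail())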
Clotho-AQA is an audio question-answering dataset consisting of 1991 audio samples taken from the Clotho dataset [1]. Each audio sample has 6 associated questions collected through crowdsourcing. For each question, the answers are provided by three different annotators, making a total of 35,838 question-answer pairs. For each audio sample, 4 questions are designed to be answered with 'yes' or 'no', while the remaining two questions are designed to be answered in a single word. More details about the data collection and data splitting processes can be found in the following paper.
S. Lipping, P. Sudarsanam, K. Drossos, T. Virtanen, "Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering." The paper is available online at https://arxiv.org/abs/2204.09634.
If you use the Clotho-AQA dataset, please cite the paper mentioned above. A sample baseline model using the Clotho-AQA dataset can be found at https://github.com/partha2409/AquaNet.
To use the dataset:
• Download and extract "audio_files.zip". This contains all the 1991 audio samples in the dataset.
• Download "clotho_aqa_train.csv", "clotho_aqa_val.csv", and "clotho_aqa_test.csv". These files contain the train, validation, and test splits, respectively. They contain the audio file name, questions, answers, and confidence scores provided by the annotators.
License:
The audio files in the archive "audio_files.zip" are under the corresponding licenses (mostly CreativeCommons with attribution) of the Freesound [2] platform, mentioned explicitly in the CSV file "clotho_aqa_metadata.csv" for each of the audio files. That is, each audio file in the archive is listed in the CSV file with meta-data. The meta-data for each file are:
⢠File name
⢠Keywords
⢠URL for the original audio file
⢠Start and ending samples for the excerpt that is used in the Clotho dataset
⢠Uploader/user in the Freesound platform (manufacturer)
⢠Link to the license of the file.
The questions and answers in the files:
⢠clotho_aqa_train.csv
⢠clotho_aqa_val.csv
⢠clotho_aqa_test.csv
are under the MIT license, described in the LICENSE file.
References:
[1] K. Drossos, S. Lipping and T. Virtanen, "Clotho: An Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736- 740, doi: 10.1109/ICASSP40776.2020.9052990.
[2] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
http://meta.icos-cp.eu/ontologies/cpmeta/icosLicence
Quality checked and gap-filled daily MODIS observations of surface reflectance and land surface temperature at global eddy co-variance sites for the time period 2000-2020. Two product versions: one features all MODIS pixels within a 2 km radius around a given site, and a second version consists of an average time series that represents the area within 1 km² around a site. All data layers have a complementary layer with gap-fill information. MODIS data comprise all sites in the Fluxnet La Thuile, Berkeley and ICOS Drought 2018 data releases.

FluxnetEO MODIS reflectance products: enhanced vegetation index (EVI), normalized difference vegetation index (NDVI), generalized NDVI (kNDVI), near infra-red reflectance of vegetation (NIRv), normalized difference water index (NDWI) with band 5, 6, or 7 as reference, the scaled wide dynamic range vegetation index (sWDRVI), and surface reflectance in MODIS bands 1-7. Based on the NASA MCD43A4 and MCD43A2 collection 6 products with a pixel size of 500 m.

FluxnetEO MODIS land surface temperature: Terra and Aqua, day and night, at native viewing zenith angle as well as corrected to viewing zenith angles of 0 and 40 degrees (Ermida et al., 2018, RSE). Based on the NASA MOD11A1 and MYD11A1 collection 6 products at a pixel size of 1 km.

Supplementary data to Walther, S., Besnard, S., Nelson, J.A., El-Madany, T. S., Migliavacca, M., Weber, U., Ermida, S. L., Brümmer, C., Schrader, F., Prokushkin, A., Panov, A., Jung, M., 2021. A view from space on global flux towers by MODIS and Landsat: The FluxnetEO dataset, in preparation for Biogeosciences Discussions.

ZIP archive of netcdf files for the stations in the Americas: US-AR1, US-AR2, US-ARM, US-ARb, US-ARc, US-Atq, US-Aud, US-Bar, US-Bkg, US-Blo, US-Bn2, US-Bn3, US-Bo1, US-Bo2, US-Brw, US-CRT, US-CaV, US-Cop, US-Dk3, US-FPe. Besnard, S., Nelson, J., Walther, S., Weber, U. (2021). The FluxnetEO dataset (MODIS) for American stations located in United States (0), 2000-01-01 to 2020-12-30, https://hdl.handle.net/11676/wssMbn0QhNm4gQzxR44pcoIs
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LSDB Archive metadata sheet