Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LSDB Archive metadata sheet
This dataset tracks the updates made on the dataset "Meta-analysis, Simpson's paradox, and the number needed to treat" as a repository for previous versions of the data and metadata.
This dataset tracks the updates made on the dataset "Pooling, meta-analysis, and the evaluation of drug safety" as a repository for previous versions of the data and metadata.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Pedestrians using mobility aids are the most vulnerable group of traffic participants. While there has been significant progress in the robustness and reliability of camera-based general pedestrian detection systems, pedestrians reliant on mobility aids are highly underrepresented in common datasets for object detection and classification.
To bridge this gap and enable research towards robust and reliable detection systems that may be employed in traffic monitoring, scheduling, and planning, we present this dataset of a pedestrian crossing scenario captured from an elevated traffic monitoring perspective, together with ground truth annotations (Yolo format [1]). Classes present in the dataset are pedestrians (without mobility aids) as well as pedestrians using wheelchairs, rollators/wheeled walkers, crutches, and walking canes. The dataset comes with official training, validation, and test splits.
An in-depth description of the dataset can be found in [2]. If you make use of this dataset in your work, research or publication, please cite this work as:
@inproceedings{mohr2023mau,
author = {Mohr, Ludwig and Kirillova, Nadezda and Possegger, Horst and Bischof, Horst},
title = {{A Comprehensive Crossroad Camera Dataset of Mobility Aid Users}},
booktitle = {Proceedings of the 34th British Machine Vision Conference ({BMVC}2023)},
year = {2023}
}
Archive mobility.zip contains the full detection dataset in Yolo format with images, ground truth labels, and metadata; archive mobility_class_hierarchy.zip contains labels and meta files (Yolo format) for training with a class hierarchy, using e.g. the modified version of Yolo v5/v8 available under [3].
To use this dataset with Yolo, download and extract the zip archive and change the path entry in dataset.yaml to the directory where you extracted the archive.
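For illustration, a minimal training sketch using the ultralytics package [1]; the extraction path and the choice of checkpoint are assumptions, not part of the dataset:

# A minimal sketch, assuming the archive was extracted to /data/mobility
# (hypothetical path) and that the ultralytics package [1] is installed.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # any pretrained YOLOv8 checkpoint works here
model.train(data="/data/mobility/dataset.yaml", epochs=100, imgsz=640)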
[1] https://github.com/ultralytics/ultralytics
[2] coming soon
[3] coming soon
http://www.opendefinition.org/licenses/cc-by-sa
Europeana (http://www.europeana.eu/) is built from other archives and aggregates their information, providing an API so that remote apps can access it. This dataset describes the archives that contributed to Europeana. The data is from 2011.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A biodiversity dataset graph: UCSB-IZC
The intended use of this archive is to facilitate (meta-)analysis of the UC Santa Barbara Invertebrate Zoology Collection (UCSB-IZC). UCSB-IZC is a natural history collection of invertebrate zoology at the Cheadle Center for Biodiversity and Ecological Restoration, University of California Santa Barbara.
This dataset provides versioned snapshots of the UCSB-IZC network as tracked by Preston [2,3] between 2021-10-08 and 2021-11-04 using [preston track "https://api.gbif.org/v1/occurrence/search/?datasetKey=d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0"].
This archive contains 14349 images related to 32533 occurrence/specimen records. See the included sample-image.jpg and its associated meta-data sample-image.json [4].
The images were counted using:
$ preston cat hash://sha256/80c0f5fc598be1446d23c95141e87880c9e53773cb2e0b5b54cb57a8ea00b20c
| grep -o -P ".*depict"
| sort
| uniq
| wc -l
And the occurrences were counted using:
$ preston cat hash://sha256/80c0f5fc598be1446d23c95141e87880c9e53773cb2e0b5b54cb57a8ea00b20c
| grep -o -P "occurrence/([0-9])+"
| sort
| uniq
| wc -l
The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files, and data files. Two index/provenance files have been individually included in this dataset publication. Index files provide a way to link provenance files in time, establishing a versioning mechanism.
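For illustration only, a sketch of fetching all 256 parts sequentially; the base URL is an assumption taken from the mirror list in the clone command below:

# Sketch: download preston-00.tar.gz ... preston-ff.tar.gz (hex-numbered parts).
# The base URL is an assumption; see the mirrors in the clone command below.
import urllib.request

base = "https://zenodo.org/record/5660088/files"
for i in range(256):
    name = f"preston-{i:02x}.tar.gz"
    urllib.request.urlretrieve(f"{base}/{name}", name)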
To retrieve and verify the downloaded UCSB-IZC biodiversity dataset graph, first download preston-*.tar.gz. Then, extract the archives into a "data" folder. Alternatively, you can use the Preston [2,3] command-line tool to "clone" this dataset using:
$ java -jar preston.jar clone --remote https://archive.org/download/preston-ucsb-izc/data.zip/,https://zenodo.org/record/5557670/files,https://zenodo.org/record/5660088/files/
After that, verify the index of the archive by reproducing the following provenance log history:
$ java -jar preston.jar history . .
To check the integrity of the extracted archive, confirm that each line produced by the command "preston verify" includes "CONTENT_PRESENT_VALID_HASH", as shown below. Depending on hardware capacity, this may take a while.
$ java -jar preston.jar verify
hash://sha256/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c file:/home/jhpoelen/ucsb-izc/data/ce/1d/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c OK CONTENT_PRESENT_VALID_HASH 66438 hash://sha256/ce1dc2468dfb1706a6f972f11b5489dc635bdcf9c9fd62a942af14898c488b2c
hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844 file:/home/jhpoelen/ucsb-izc/data/f6/8d/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844 OK CONTENT_PRESENT_VALID_HASH 4093 hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844
hash://sha256/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef file:/home/jhpoelen/ucsb-izc/data/3e/70/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef OK CONTENT_PRESENT_VALID_HASH 5746 hash://sha256/3e70b7adc1a342e5551b598d732c20b96a0102bb1e7f42cfc2ae8a2c4227edef
hash://sha256/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b file:/home/jhpoelen/ucsb-izc/data/99/58/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b OK CONTENT_PRESENT_VALID_HASH 6147 hash://sha256/995806159ae2fdffdc35eef2a7eccf362cb663522c308aa6aa52e2faca8bb25b
Note that a copy of the Java program "preston", preston.jar, is included in this publication. The program runs on a Java 8+ virtual machine using "java -jar preston.jar", or in short "preston".
Files in this data publication:
--- start of file descriptions ---
README -- description of archive and its contents (this file)
preston.jar -- executable Java jar containing preston [2,3] v0.3.1
preston-[00-ff].tar.gz -- preston archive containing UCSB-IZC (meta-)data/image files, associated provenance logs and a provenance index
2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a -- individual provenance index file
sample-image.jpg (hash://sha256/916ba5dc6ad37a3c16634e1a0e3d2a09969f2527bb207220e3dbdbcf4d6b810c) and sample-image.json (hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844) -- example image and meta-data
--- end of file descriptions ---
References
[1] Cheadle Center for Biodiversity and Ecological Restoration (2021). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv accessed via GBIF.org on 2021-11-04 as indexed by the Global Biodiversity Information Facility (GBIF) with provenance hash://sha256/d5eb492d3e0304afadcc85f968de1e23042479ad670a5819cee00f2c2c277f36 hash://sha256/80c0f5fc598be1446d23c95141e87880c9e53773cb2e0b5b54cb57a8ea00b20c.
[2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543
[3] MJ Elliott, JH Poelen, JAB Fortes (2020). Toward Reliable Biodiversity Dataset References. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2020.101132
[4] Cheadle Center for Biodiversity and Ecological Restoration (2021). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv accessed via GBIF.org on 2021-10-08. https://www.gbif.org/occurrence/3323647301 . hash://sha256/f68d489a9275cb9d1249767244b594c09ab23fd00b82374cb5877cabaa4d0844 hash://sha256/916ba5dc6ad37a3c16634e1a0e3d2a09969f2527bb207220e3dbdbcf4d6b810c
Meta learning with LLM: supplemental code for reproducibility of computational results for MLT and MLT-plus-TM. Related research paper: "Meta Learning with Language Models: Challenges and Opportunities in the Classification of Imbalanced Text", A. Vassilev, H. Jin, M. Hasan, 2023 (to appear on arXiv). All code and data are contained in the zip archive arxiv2023.zip, subject to the licensing terms shown below. See the Readme.txt contained there for a detailed explanation of how to unpack and run the code. See also requirements.txt for the necessary dependencies (libraries needed). This is not a dataset, but only Python source code.
In June of 1994 and August and September of 1995, the U.S. Geological Survey, in cooperation with the University of Texas Bureau of Economic Geology, conducted geophysical surveys of the Sabine and Calcasieu Lake areas and the Gulf of Mexico offshore of eastern Texas and western Louisiana. This report serves as an archive of unprocessed digital boomer seismic reflection data, trackline maps, navigation files, observers' logbooks, GIS information, and formal FGDC metadata. In addition, a filtered and gained GIF image of each seismic profile is provided. The archived trace data are in standard Society of Exploration Geophysicists (SEG) SEG-Y format (Barry and others, 1975) and may be downloaded and processed with commercial or public domain software such as Seismic Unix (SU). Examples of SU processing scripts and in-house (USGS) software for viewing SEG-Y files (Zihlman, 1992) are also provided. Processed profile images, trackline maps, navigation files, and formal metadata may be viewed with a web browser. Scanned handwritten logbooks and Field Activity Collection System (FACS) logs may be viewed with Adobe Reader. For more information on the seismic surveys see http://walrus.wr.usgs.gov/infobank/g/g194gm/html/g-1-94-gm.meta.html and http://walrus.wr.usgs.gov/infobank/g/g195gm/html/g-1-95-gm.meta.html. These data are also available via GeoMapApp (http://www.geomapapp.org/) and Virtual Ocean (http://www.virtualocean.org/) earth science exploration and visualization applications.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset of midori (Blue Archive)
This is the dataset of midori (Blue Archive), containing 200 images and their tags. Images are crawled from many sites (e.g. danbooru, pixiv, zerochan, ...); the auto-crawling system is powered by the DeepGHS Team (a Hugging Face organization) using LittleAppleWebUI.
| Name | Images | Download | Description |
|---|---|---|---|
| raw | 200 | Download | Raw data with meta information. |
| raw-stage3 | 556 | Download | 3-stage cropped raw data with meta information. |
| raw-stage3-eyes | 676 | Download | … |

See the full description on the dataset page: https://huggingface.co/datasets/AppleHarem/midori_bluearchive.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset of hoshino (Blue Archive)
This is the dataset of hoshino (Blue Archive), containing 150 images and their tags. Images are crawled from many sites (e.g. danbooru, pixiv, zerochan, ...); the auto-crawling system is powered by the DeepGHS Team (a Hugging Face organization) using LittleAppleWebUI.
| Name | Images | Download | Description |
|---|---|---|---|
| raw | 150 | Download | Raw data with meta information. |
| raw-stage3 | 420 | Download | 3-stage cropped raw data with meta information. |
| raw-stage3-eyes | 477 | Download | … |

See the full description on the dataset page: https://huggingface.co/datasets/AppleHarem/hoshino_bluearchive.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset used in a meta-analysis examining the effects of educational technology on mathematics outcomes. Includes effects from 40 studies with codes for study and methodological features.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset of tsurugi (Blue Archive)
This is the dataset of tsurugi (Blue Archive), containing 200 images and their tags. Images are crawled from many sites (e.g. danbooru, pixiv, zerochan, ...); the auto-crawling system is powered by the DeepGHS Team (a Hugging Face organization) using LittleAppleWebUI.
| Name | Images | Download | Description |
|---|---|---|---|
| raw | 200 | Download | Raw data with meta information. |
| raw-stage3 | 531 | Download | 3-stage cropped raw data with meta information. |
| raw-stage3-eyes | 667 | Download | … |

See the full description on the dataset page: https://huggingface.co/datasets/AppleHarem/tsurugi_bluearchive.
From May 13 to May 14 of 2008, the U.S. Geological Survey conducted geophysical surveys in Lake Panasoffkee, Florida. This report serves as an archive of unprocessed digital boomer and CHIRP seismic reflection data, trackline maps, navigation files, GIS information, FACS logs, and formal FGDC metadata. Filtered and (or) gained digital images of the seismic profiles are also provided. The archived trace data are in standard Society of Exploration Geophysicists (SEG) SEG-Y format (Barry and others, 1975) and may be downloaded and processed with commercial or public domain software such as Seismic Unix (SU). Example SU processing scripts and USGS software for viewing the SEG-Y files (Zihlman, 1992) are also provided. For more information on the seismic surveys see http://walrus.wr.usgs.gov/infobank/j/j308fl/html/j-3-08-fl.meta.html. These data are also available via GeoMapApp (http://www.geomapapp.org/) and Virtual Ocean (http://www.virtualocean.org/) earth science exploration and visualization applications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.
The dataset is mainly collected from existing datasets. We used data from:
The dataset currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.
We use the regular expression tech(nical)?[\s\-_]*?debt to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag technical-debt.
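As a quick illustration of the pattern (a sketch, not the authors' matching code; case-insensitive matching is an assumption):

import re

# The mention pattern described above; IGNORECASE is an assumption.
pattern = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)

for text in ["technical debt", "tech-debt", "tech_debt", "TD"]:
    print(text, bool(pattern.search(text)))
# The first three variants match; the short form "TD" deliberately does not.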
The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.
id: the id used in the original source. We use the URL path to identify Medium posts.
body: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
created_utc: the time the item was posted in seconds since epoch in UTC.
author: the author of the item. We use the username or userid from the source.
source: where the item was posted. Valid sources are:
meta: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., score and num_comments, for keys that have the same meaning/information across multiple sources.

This is a sample item from Reddit:
{
"id": "ab8auf",
"body": "Technical Debt Explained (x-post r/Eve)",
"created_utc": 1546271789,
"author": "totally_100_human",
"source": "Reddit Submission",
"meta": {
"title": "Technical Debt Explained (x-post r/Eve)",
"score": 1,
"num_comments": 0,
"url": "http://jestertrek.com/eve/technical-debt-2.png",
"subreddit": "RCBRedditBot"
}
}
We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use jq to process the JSON.
Count the number of items per source:
lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c

Count Reddit submissions per month:
lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | gmtime | strftime("%Y-%m")' | sort | uniq -c

Which items link (meta.url) to PDF documents?
lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'

Export id, body, and author as CSV:
lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.
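The file can also be streamed directly from Python using only the standard library; a minimal sketch:

import bz2
import json

# Stream the compressed JSON-lines file one mention at a time.
with bz2.open("postscomments.json.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        print(item["source"], item["body"][:60])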
Please see https://github.com/sse-lnu/tdmentions for more analyses.
The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results
You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results
Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website.
Dataset layout
Python / Matlab versions
I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:
def unpickle(file):
import cPickle
with open(file, 'rb') as fo:
dict = cPickle.load(fo)
return dict
And a python3 version:
def unpickle(file):
import pickle
with open(file, 'rb') as fo:
dict = pickle.load(fo, encoding='bytes')
return dict
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
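Following this layout, a sketch of turning a loaded batch into image arrays (assumes numpy and the python3 unpickle above):

import numpy as np

batch = unpickle("data_batch_1")
data = batch[b"data"]        # uint8 array of shape (10000, 3072)
labels = batch[b"labels"]    # list of 10000 labels in the range 0-9

# Each row is R (1024) + G (1024) + B (1024), each channel row-major 32x32.
images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)  # (10000, 32, 32, 3)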
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:
label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.
Binary version
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:
<1 x label><3072 x pixel> ... <1 x label><3072 x pixel>
In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
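A sketch of parsing one binary batch according to this layout (assumes numpy):

import numpy as np

def read_cifar10_bin(path):
    # Each record is 1 label byte followed by 3072 pixel bytes (3073 bytes total).
    raw = np.fromfile(path, dtype=np.uint8).reshape(-1, 3073)
    labels = raw[:, 0]
    images = raw[:, 1:].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, labels

images, labels = read_cifar10_bin("data_batch_1.bin")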
The CIFAR-100 dataset
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
~10,000 professional retail scene photographs from UK grocery stores for computer vision research
| Attribute | Details |
|---|---|
| Total Images | ~10,000 high-resolution photos |
| Markets | United Kingdom |
| Collections | 2014 archive, Full store surveys, Halloween 2024 |
| Privacy | All faces automatically blurred |
| License | Evaluation & Research Only |
| Format | JPEG with comprehensive metadata |
This dataset is perfect for computer vision research; see the applications listed below.
Request access: https://huggingface.co/datasets/dresserman/kanops-open-access-imagery
This dataset is gated - request access on HuggingFace. By requesting access, you agree to the evaluation-only license terms.
from datasets import load_dataset
# Load the dataset (after getting HuggingFace access)
ds = load_dataset(
"imagefolder",
data_dir="hf://datasets/dresserman/kanops-open-access-imagery/train",
split="train",
)
# Access first image
img = ds[0]["image"] # PIL.Image
img.show()
import pandas as pd
meta = pd.read_csv(
"hf://datasets/dresserman/kanops-open-access-imagery/metadata.csv"
)
print(meta.head())
train/
├── 2014/
│   ├── Aldi/
│   ├── Tesco/
│   ├── Sainsburys/
│   └── ... (22 UK retailers)
├── FullStores/
│   ├── Tesco_Lincoln_2014/
│   ├── Tesco_Express_2015/
│   └── Asda_Leeds_2016/
└── Halloween2024/
    └── Various_Retailers/

Root files:
├── MANIFEST.csv       # File listing + basic attributes
├── metadata.csv       # Enriched metadata (retailer, dims, collection)
├── checksums.sha256   # Integrity verification
├── blur_log.csv       # Face-blur verification log
└── LICENSE            # Evaluation-only terms
Each image includes comprehensive metadata in metadata.csv:
| Field | Description |
|---|---|
| file_name | Path relative to dataset root |
| bytes | File size in bytes |
| width, height | Image dimensions |
| sha256 | Content hash for integrity verification |
| collection | One of: 2014, FullStores, Halloween2024 |
| retailer | Inferred from file path |
| year | Inferred from file path |
License: Evaluation & Research Only
For commercial licensing: Contact happytohelp@groceryinsight.com
This free sample is part of Kanops Archive - a much larger commercial dataset used by AI companies and research institutions.
Applications:
- Training production computer vision models
- Autonomous checkout systems
- Retail robotics and automation
- Seasonal demand forecasting
- Market research and competitive intelligence
Learn more: [groceryinsight.com/retail-image-dataset](...
The present dataset contains (meta)information extracted from the materials preserved in the archival funds of the International Institute of Intellectual Cooperation (IIIC), which were recently digitized [available at https://atom.archives.unesco.org/iiic ]. More precisely, the dataset focuses on subseries A and F from the Series Correspondence. Using machine learning and natural language processing (NLP) techniques, we have parsed scanned documents and extracted from them meta-information such as: people and location mentions, language (e.g., French), nature of material (e.g., letter vs. attached document), formal aspects (e.g., handwritten vs. typewritten), and, if possible, year of publication. Moreover, we have associated these entities (e.g., a given person) and information with the specific document(s) where they appear. We have divided the dataset into three files: one focused on people and two on locations (one for countries and another for cities). This dataset has been generated within the ERC-StG project named "Social Networks of the Past: Mapping Hispanic and Lusophone Literary Modernity, 1898-1959".
https://creativecommons.org/publicdomain/zero/1.0/
Dataset of meta-data from articles posted to the arXiv (arxiv.org, a preprint server where scientists post their publications for the public to view). You'll find article titles, abstracts, subjects, and publish dates from condensed matter physics articles (arXiv archive) in JSON format.
train.csv and test.csv are CSV files created from original JSON files with fields:
date: the date the article was posted to the arXiv. In train.csv the dates are all before May 2014; in test.csv the dates are all greater than or equal to May 1, 2014.
abstract: the abstract of the article.
title: the title of the article.
subject: the subject which the authors attribute to the article. This field is only present in train.csv. There are 30 unique subjects represented in this dataset.
Predict the number of articles in each subject.
Predict the number of articles in each subject, in every month of test.csv. As a check: in April 2014, there were 58 articles labeled "quantum physics" and 88 articles labeled "superconductivity". The predictions will be scored using the mean squared error (https://en.wikipedia.org/wiki/Mean_squared_error).
The final submission should be a CSV with the predictions. A sample submission with the correct format has been provided in submission.csv. Your submission should be IDENTICAL to this file, except that your predictions should not all be 0. Note that missing values will be filled in with 0. Please examine this file to make sure your submission is in the proper format.
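As a starting point, a sketch of computing the historical monthly counts per subject from train.csv (assumes pandas; column names follow the field description above):

import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["date"])

# Articles per subject per month: the quantity the task asks you to predict.
counts = (
    train.groupby([train["date"].dt.to_period("M"), "subject"])
    .size()
    .rename("n_articles")
    .reset_index()
)
print(counts.tail())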
Clotho-AQA is an audio question-answering dataset consisting of 1991 audio samples taken from the Clotho dataset [1]. Each audio sample has 6 associated questions collected through crowdsourcing. For each question, the answers are provided by three different annotators, making a total of 35,838 question-answer pairs. For each audio sample, 4 questions are designed to be answered with 'yes' or 'no', while the remaining two questions are designed to be answered in a single word. More details about the data collection and data splitting processes can be found in the following paper.
S. Lipping, P. Sudarsanam, K. Drossos, T. Virtanen, "Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering." The paper is available online at https://arxiv.org/abs/2204.09634.
If you use the Clotho-AQA dataset, please cite the paper mentioned above. A sample baseline model using the Clotho-AQA dataset can be found at https://github.com/partha2409/AquaNet.
To use the dataset:
• Download and extract "audio_files.zip". This contains all the 1991 audio samples in the dataset.
• Download "clotho_aqa_train.csv", "clotho_aqa_val.csv", and "clotho_aqa_test.csv". These files contain the train, validation, and test splits, respectively. They contain the audio file name, questions, answers, and confidence scores provided by the annotators.
License:
The audio files in the archive "audio_files.zip" are under the corresponding licenses (mostly CreativeCommons with attribution) of the Freesound [2] platform, mentioned explicitly in the CSV file "clotho_aqa_metadata.csv" for each of the audio files. That is, each audio file in the archive is listed in the CSV file with meta-data. The meta-data for each file are:
⢠File name
⢠Keywords
⢠URL for the original audio file
⢠Start and ending samples for the excerpt that is used in the Clotho dataset
⢠Uploader/user in the Freesound platform (manufacturer)
⢠Link to the license of the file.
The questions and answers in the files:
⢠clotho_aqa_train.csv
⢠clotho_aqa_val.csv
⢠clotho_aqa_test.csv
are under the MIT license, described in the LICENSE file.
References:
[1] K. Drossos, S. Lipping and T. Virtanen, "Clotho: An Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736- 740, doi: 10.1109/ICASSP40776.2020.9052990.
[2] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
http://meta.icos-cp.eu/ontologies/cpmeta/icosLicence
Quality checked and gap-filled daily MODIS observations of surface reflectance and land surface temperature at global eddy co-variance sites for the time period 2000-2020. Two product versions: one features all MODIS pixels within a 2 km radius around a given site, and a second version consists of an average time series that represents the area within 1 km² around a site. All data layers have a complementary layer with gap-fill information. MODIS data comprise all sites in the Fluxnet La Thuile, Berkeley and ICOS Drought 2018 data releases.

FluxnetEO MODIS reflectance products: enhanced vegetation index (EVI), normalized difference vegetation index (NDVI), generalized NDVI (kNDVI), near infra-red reflectance of vegetation (NIRv), normalized difference water index (NDWI) with band 5, 6, or 7 as reference, the scaled wide dynamic range vegetation index (sWDRVI), and surface reflectance in MODIS bands 1-7. Based on the NASA MCD43A4 and MCD43A2 collection 6 products with a pixel size of 500 m.

FluxnetEO MODIS land surface temperature: Terra and Aqua, day and night, at native viewing zenith angle as well as corrected to viewing zenith angles of 0 and 40 degrees (Ermida et al., 2018, RSE). Based on the NASA MOD11A1 and MYD11A1 collection 6 products at a pixel size of 1 km.

Supplementary data to Walther, S., Besnard, S., Nelson, J.A., El-Madany, T. S., Migliavacca, M., Weber, U., Ermida, S. L., Brümmer, C., Schrader, F., Prokushkin, A., Panov, A., Jung, M., 2021. A view from space on global flux towers by MODIS and Landsat: The FluxnetEO dataset, in preparation for Biogeosciences Discussions.

ZIP archive of netcdf files for the stations in the Americas: US-AR1, US-AR2, US-ARM, US-ARb, US-ARc, US-Atq, US-Aud, US-Bar, US-Bkg, US-Blo, US-Bn2, US-Bn3, US-Bo1, US-Bo2, US-Brw, US-CRT, US-CaV, US-Cop, US-Dk3, US-FPe. Besnard, S., Nelson, J., Walther, S., Weber, U. (2021). The FluxnetEO dataset (MODIS) for American stations located in United States (0), 2000-01-01 to 2020-12-30, https://hdl.handle.net/11676/wssMbn0QhNm4gQzxR44pcoIs
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LSDB Archive metadata sheet