This is a dataset of richly tagged and labeled artwork depicting characters from Japanese anime. The data comes from two image boards, Danbooru and moeimouto. This data can be used in a variety of interesting ways, from classification to generative modeling. Please note that while all of the images in this dataset have been tagged as SFW (non-explicit), the source websites do not ban explicit or pornographic images, so mislabeled images may still be present in the dataset.
The first set of data comes from the imageboard Danbooru. The entire corpus of Danbooru images was scraped from the site with permission and was collected into a dataset. The zip files included here have the full metadata for these images as well as a subset of 300,000 of the images in normalized 512px x 512px form. Full information about this dataset is available here:
https://www.gwern.net/Danbooru2017
From the article:
Deep learning for computer vision relies on large annotated datasets. Classification/categorization has benefited from the creation of ImageNet, which classifies 1m photos into 1000 categories. But classification/categorization is a coarse description of an image which limits application of classifiers, and there is no comparably large dataset of images with many tags or labels which would allow learning and detecting much richer information about images. Such a dataset would ideally be >1m images with at least 10 descriptive tags each which can be publicly distributed to all interested researchers, hobbyists, and organizations. There are currently no such public datasets, as ImageNet, Birds, Flowers, and MS COCO fall short either on image or tag count or restricted distribution. I suggest that the image boorus be used. The image boorus are longstanding web databases which host large numbers of images which can be tagged or labeled with an arbitrary number of textual descriptions; they were developed for and are most popular among fans of anime, who provide detailed annotations.
The best known booru, with a focus on quality, is Danbooru. We create & provide a torrent which contains ~1.9tb of 2.94m images with 77.5m tag instances (of 333k defined tags, ~26.3/image) covering Danbooru from 24 May 2005 through 31 December 2017 (final ID: #2,973,532), providing the image files & a JSON export of the metadata. We also provide a smaller torrent of SFW images downscaled to 512x512px JPG (241GB; 2,232,462 images) for convenience.
Our hope is that a Danbooru2017 dataset can be used for rich large-scale classification/tagging & learned embeddings, test out the transferability of existing computer vision techniques (primarily developed using photographs) to illustration/anime-style images, provide an archival backup for the Danbooru community, feed back metadata improvements & corrections, and serve as a testbed for advanced techniques such as conditional image generation or style transfer.
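As a quick illustration, the JSON metadata export can be scanned to pull out images carrying a particular tag. This is only a sketch under assumptions: the file name metadata.json and the field names id and tags are placeholders and may not match the actual export layout.

    import json

    # Hypothetical sketch: scan a JSON-lines metadata export and collect the IDs
    # of images carrying the "solo" tag. Field names ("id", "tags") are assumptions.
    solo_ids = []
    with open("metadata.json", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            tag_names = {t["name"] for t in record.get("tags", [])}
            if "solo" in tag_names:
                solo_ids.append(record["id"])
    print(len(solo_ids), "images tagged 'solo'")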
The second set of data included in this dataset is a little more manageable than the first; it includes a number of cropped illustrated faces from the now-defunct site moeimouto. This dataset has been used in GAN work in the past. The data comes from:
http://www.nurs.or.jp/~nagadomi/animeface-character-dataset/
More information:
http://www.nurs.or.jp/~nagadomi/animeface-character-dataset/README.html
If you are interested in creating more face data (potentially from the Danbooru data), here is a helpful resource: https://github.com/nagadomi/lbpcascade_animeface
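A minimal OpenCV sketch of how that cascade can be applied to crop faces (the cascade XML must be downloaded from the repository above; the image path is a placeholder):

    import cv2

    # Load the anime-face LBP cascade (lbpcascade_animeface.xml from the repository above).
    cascade = cv2.CascadeClassifier("lbpcascade_animeface.xml")

    image = cv2.imread("example.jpg")  # placeholder path
    gray = cv2.equalizeHist(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY))

    # Detect faces and write each crop to disk.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(24, 24))
    for i, (x, y, w, h) in enumerate(faces):
        cv2.imwrite(f"face_{i}.png", image[y:y + h, x:x + w])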
If you are looking for something a little easier to crack into, check out this other great anime image booru dataset: https://www.kaggle.com/alamson/safebooru
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about artists. It has 1 row and is filtered where the artwork is Large Drawing #5. It features 9 columns including birth date, death date, country, and gender.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game "Quick, Draw!". The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.
Example drawings: https://raw.githubusercontent.com/googlecreativelab/quickdraw-dataset/master/preview.jpg
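A small sketch of reading the vector drawings, assuming the ndjson export in which each line is one drawing with word, countrycode, and drawing fields (the drawing being a list of strokes, each a pair of x and y point lists); the file name is a placeholder:

    import json

    # Hypothetical sketch: read one category file and report strokes per drawing.
    with open("cat.ndjson", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            strokes = record["drawing"]  # list of [x_points, y_points] per stroke
            print(record["word"], record["countrycode"], len(strokes), "strokes")
            break  # first record only, for illustration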
FloorPlanCAD is a large-scale real-world CAD drawing dataset containing over 15,000 floor plans, ranging from residential to commercial buildings.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Plasmode is a term coined several years ago to describe data sets that are derived from real data but for which some truth is known. Omic techniques, most especially microarray and genomewide association studies, have catalyzed a new zeitgeist of data sharing that is making data and data sets publicly available on an unprecedented scale. Coupling such data resources with a science of plasmode use would allow statistical methodologists to vet proposed techniques empirically (as opposed to only theoretically) and with data that are by definition realistic and representative. We illustrate the technique of empirical statistics by consideration of a common task when analyzing high dimensional data: the simultaneous testing of hundreds or thousands of hypotheses to determine which, if any, show statistical significance warranting follow-on research. The now-common practice of multiple testing in high dimensional experiment (HDE) settings has generated new methods for detecting statistically significant results. Although such methods have heretofore been subject to comparative performance analysis using simulated data, simulating data that realistically reflect data from an actual HDE remains a challenge. We describe a simulation procedure using actual data from an HDE where some truth regarding parameters of interest is known. We use the procedure to compare estimates for the proportion of true null hypotheses, the false discovery rate (FDR), and a local version of FDR obtained from 15 different statistical methods.
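For readers unfamiliar with FDR control, here is a minimal sketch of the standard Benjamini-Hochberg procedure applied to a vector of p-values; it illustrates the kind of method being compared and is not the plasmode simulation procedure itself.

    import numpy as np

    def benjamini_hochberg(pvalues, alpha=0.05):
        """Return a boolean mask of hypotheses rejected at FDR level alpha."""
        p = np.asarray(pvalues, dtype=float)
        m = len(p)
        order = np.argsort(p)
        thresholds = alpha * np.arange(1, m + 1) / m
        below = p[order] <= thresholds
        rejected = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()  # largest rank whose p-value meets its threshold
            rejected[order[:k + 1]] = True
        return rejected

    # Example: apply BH at FDR level 0.05 to five p-values.
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))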
WikiArt contains paintings from 195 different artists. The dataset has 42129 images for training and 10628 images for testing.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each cartoon face in these sets is composed of 16 components that vary in 10 artwork attributes, 4 color attributes, and 4 proportion attributes. Colors are chosen from a discrete set of RGB values. The number of options per attribute category ranges from 3, for short/medium/long chin length, to 111, for the largest category, hairstyle. See below for a description of each attribute.
Each of these components and their variations were drawn by the same artist, Shiraz Fuman, resulting in approximately 250 cartoon component artworks and ~10^13 possible combinations. The artwork components are divided into a fixed set of layers that define a Z-ordering for rendering. For instance, the face shape is defined on a layer below the eyes and glasses, so that the artworks are rendered in the correct order. Hair style is a more complex case and needs to be defined on two layers, one behind the face and one in front. There are 8 total layers: hair back, face, hair front, eyes, eyebrows, mouth, facial hair, and glasses.
The mapping from attribute to artwork is also defined by the artist such that any random selection of attributes produces a visually appealing cartoon without any misaligned artwork; this sometimes involves handling interaction between attributes. For example, the proper way to display a “short beard” changes for different face shapes, which required the artist to create a “short beard” artwork for each face shape.
Artwork
chin_length 3 Length of chin (below mouth region)
eye_angle 3 Tilt of the eye inwards or outwards
eye_lashes 2 Whether or not eyelashes are visible
eye_lid 2 Appearance of the eyelids
eyebrow_shape 14 Shape of eyebrows
eyebrow_weight 2 Line weight of eyebrows
face_shape 7 Overall shape of the face
facial_hair 15 Type of facial hair (type 14 is no hair)
glasses 12 Type of glasses (type 11 is no glasses)
hair 111 Type of head hair
Colors
eye_color 5 Color of the eye irises
face_color 11 Color of the face skin
glasses_color 7 Color of the glasses, if present
hair_color 10 Color of the hair, facial hair, and eyebrows
Proportions
eye_eyebrow_distance 3 Distance between the eye and eyebrows
eye_slant 3 Similar to eye_angle, but rotates the eye and does not change artwork
eyebrow_thickness 4 Vertical scaling of the eyebrows
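As an illustration of how this attribute space can be sampled, a small sketch that draws one random option index per attribute (the option counts are copied from the table above; rendering and layer compositing are not shown):

    import random

    # Option counts per attribute, as listed above.
    ATTRIBUTE_OPTIONS = {
        "chin_length": 3, "eye_angle": 3, "eye_lashes": 2, "eye_lid": 2,
        "eyebrow_shape": 14, "eyebrow_weight": 2, "face_shape": 7,
        "facial_hair": 15, "glasses": 12, "hair": 111,
        "eye_color": 5, "face_color": 11, "glasses_color": 7, "hair_color": 10,
        "eye_eyebrow_distance": 3, "eye_slant": 3, "eyebrow_thickness": 4,
    }

    def sample_cartoon():
        """Draw a random option index for every attribute; rendering happens elsewhere."""
        return {name: random.randrange(count) for name, count in ATTRIBUTE_OPTIONS.items()}

    print(sample_cartoon())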
Dataset has been copied from https://google.github.io/cartoonset/ with no modifications (other than tgz being replaced with zip for compression)
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Danbooru2023: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset
Danbooru2023 is a large-scale anime image dataset with over 5 million images contributed and annotated in detail by an enthusiast community. Image tags cover aspects like characters, scenes, copyrights, artists, etc., with an average of 30 tags per image. Danbooru is a veteran anime image board with high-quality images and extensive tag metadata. The dataset can be used to train image classification… See the full description on the dataset page: https://huggingface.co/datasets/jpft/danbooru2023.
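A minimal sketch of pulling a few records through the Hugging Face datasets library, assuming the dataset is available there in streamable form; the split name and field names depend on the actual dataset schema:

    from datasets import load_dataset

    # Stream instead of downloading the multi-terabyte dataset in full.
    ds = load_dataset("jpft/danbooru2023", split="train", streaming=True)
    for i, example in enumerate(ds):
        print(example.keys())  # inspect the available fields
        if i >= 2:
            break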
The Met dataset is a large-scale dataset for Instance-Level Recognition (ILR) in the artwork domain. It relies on the open access collection from the Metropolitan Museum of Art (The Met) in New York to form the training set, which consists of about 400k images from more than 224k classes, with artworks of world-level geographic coverage and chronological periods dating back to the Paleolithic period. Each museum exhibit corresponds to a unique artwork, and defines its own class. The training set exhibits a long-tail distribution with more than half of the classes represented by a single image, making it a special case of few-shot learning.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 3 rows and is filtered where the book is The larger illustrated guide to birds of Southern Africa. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 2 rows and is filtered where the book is The illustrated encyclopedia of major aircraft of World War II. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Data sets used to prepare illustrative figures for the overview article “Multiscale Modeling of Background Ozone”.

Overview: The CMAQ model output datasets used to create illustrative figures for this overview article were generated by scientists in EPA/ORD/CEMM and EPA/OAR/OAQPS. The EPA/ORD/CEMM-generated dataset consisted of hourly CMAQ output from two simulations. The first simulation was performed for July 1 – 31 over a 12 km modeling domain covering the Western U.S. The simulation was configured with the Integrated Source Apportionment Method (ISAM) to estimate the contributions from 9 source categories to modeled ozone. ISAM source contributions for July 17 – 31 averaged over all grid cells located in Colorado were used to generate the illustrative pie chart in the overview article. The second simulation was performed for October 1, 2013 – August 31, 2014 over a 108 km modeling domain covering the northern hemisphere. This simulation was also configured with ISAM to estimate the contributions from non-US anthropogenic sources, natural sources, stratospheric ozone, and other sources to ozone concentrations. Ozone ISAM results from this simulation were extracted along a boundary curtain of the 12 km modeling domain specified over the Western U.S. for the time period January 1, 2014 – July 31, 2014 and used to generate the illustrative time-height cross-sections in the overview article. The EPA/OAR/OAQPS-generated dataset consisted of hourly gridded CMAQ output for surface ozone concentrations for the year 2016. The CMAQ simulations were performed over the northern hemisphere at a horizontal resolution of 108 km. NO2 and O3 data for July 2016 were extracted from these simulations to generate the vertically-integrated column densities shown in the illustrative comparison to satellite-derived column densities.

CMAQ Model Data: The data from the CMAQ model simulations used in this research effort are very large (several terabytes) and cannot be uploaded to ScienceHub due to size restrictions. The model simulations are stored on the /asm archival system accessible through the atmos high-performance computing (HPC) system. Due to data management policies, files on /asm are subject to expiry depending on the template of the project. Files not requested for extension after the expiry date are deleted permanently from the system. The format of the files used in this analysis and listed below is ioapi/netcdf. Documentation of this format, including definitions of the geographical projection attributes contained in the file headers, is available at https://www.cmascenter.org/ioapi/. Documentation on the CMAQ model, including a description of the output file format and output model species, can be found on the CMAQ GitHub site at https://github.com/USEPA/CMAQ. This dataset is associated with the following publication: Hogrefe, C., B. Henderson, G. Tonnesen, R. Mathur, and R. Matichuk. Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management. EM Magazine. Air and Waste Management Association, Pittsburgh, PA, USA, 1-6, (2020).
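A small sketch of opening one of those ioapi/netcdf files with the netCDF4 Python library; the file name and the variable name "O3" are assumptions and should be checked against the actual file header:

    from netCDF4 import Dataset

    # Open an hourly CMAQ output file (hypothetical name) and inspect ozone.
    nc = Dataset("CCTM_CONC_20160701.nc")
    print(list(nc.variables))            # list available model species
    o3 = nc.variables["O3"][:]           # assumed dimensions: (time, layer, row, col)
    print("surface-layer O3 shape:", o3[:, 0, :, :].shape)
    nc.close()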
PASS is a large-scale image dataset, containing 1.4 million images, that does not include any humans and which can be used for high-quality pretraining while significantly reducing privacy concerns.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Danbooru2017 is a large-scale anime image database with 2.9m+ images annotated with 77.5m+ tags; it can be useful for machine learning purposes such as image recognition and generation
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Personal Protective Equipment Dataset (PPED)
This dataset serves as a benchmark for PPE detection in chemical plants. We provide the dataset and experimental results.
We produced a data set based on the actual needs and relevant regulations in chemical plants. The standard GB 39800.1-2020 formulated by the Ministry of Emergency Management of the People’s Republic of China defines the protective requirements for plants and chemical laboratories. The complete dataset is contained in the folder PPED/data.
1.1. Image collection
We took more than 3,300 pictures, varying the following characteristics: environment, distance, lighting conditions, angle, and the number of people photographed.
Backgrounds: There are 4 backgrounds, including office, near machines, factory and regular outdoor scenes.
Scale: By taking pictures from different distances, the captured PPE items are classified into small, medium, and large scales.
Light: Good lighting conditions and poor lighting conditions were studied.
Diversity: Some images contain a single person, and some contain multiple people.
Angle: The pictures we took can be divided into front and side.
In total, more than 3,300 raw photos were taken under all conditions. All images are located in the folder “PPED/data/JPEGImages”.
1.2. Label
We used LabelImg as the labeling tool and annotated in the PASCAL-VOC format. YOLO uses the txt format; trans_voc2yolo.py can be used to convert the PASCAL-VOC XML files to txt files. Annotations are stored in the folder PPED/data/Annotations.
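trans_voc2yolo.py ships with the dataset; as an illustration of what such a conversion does, here is a minimal sketch (the class names and paths are placeholders, not the dataset's actual label set) that turns one PASCAL-VOC XML annotation into YOLO-format lines:

    import xml.etree.ElementTree as ET

    CLASSES = ["helmet", "goggles", "gloves"]  # placeholder class list

    def voc_to_yolo(xml_path):
        """Convert one PASCAL-VOC XML file into YOLO lines: class cx cy w h (normalized)."""
        root = ET.parse(xml_path).getroot()
        width = float(root.find("size/width").text)
        height = float(root.find("size/height").text)
        lines = []
        for obj in root.iter("object"):
            cls = CLASSES.index(obj.find("name").text)
            box = obj.find("bndbox")
            xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
            xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
            cx, cy = (xmin + xmax) / 2 / width, (ymin + ymax) / 2 / height
            w, h = (xmax - xmin) / width, (ymax - ymin) / height
            lines.append(f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
        return lines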
1.3. Dataset Features
The pictures were taken by us according to the different conditions mentioned above. The file PPED/data/feature.csv is a CSV file that records the characteristics of every image, including lighting conditions, angles, backgrounds, number of people, and scale.
1.4. Dataset Division
The dataset is split into a training set and a test set at a 9:1 ratio.
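A minimal sketch of such a 9:1 random split over the image list (the seed and paths are placeholders):

    import random
    from pathlib import Path

    images = sorted(Path("PPED/data/JPEGImages").glob("*.jpg"))
    random.seed(0)  # reproducible split
    random.shuffle(images)
    cut = int(0.9 * len(images))
    train, test = images[:cut], images[cut:]
    print(len(train), "training images,", len(test), "test images")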
We provide baseline results with five models, namely Faster R-CNN (R), Faster R-CNN (M), SSD, YOLOv3-spp, and YOLOv5. All code and results are given in the folder PPED/experiment.
2.1. Environment and Configuration:
Intel Core i7-8700 CPU
NVIDIA GTX1060 GPU
16 GB of RAM
Python: 3.8.10
pytorch: 1.9.0
pycocotools: pycocotools-win
Windows 10
2.2. Applied Models
The source code and results of the applied models are given in the folder PPED/experiment, with sub-folders corresponding to the model names.
2.2.1. Faster R-CNN
Faster R-CNN
backbone: resnet50+fpn
We downloaded the pre-training weights from https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth.
We modified the dataset path, training classes and training parameters including batch size.
We run train_res50_fpn.py to start training.
Then, the weights are trained by the training set.
Finally, we validate the results on the test set.
backbone: mobilenetv2
The same training method as for resnet50+fpn was used, but the results were not as good as with resnet50+fpn, so this backbone was not pursued further.
The Faster R-CNN source code used in our experiment is given in the folder PPED/experiment/Faster R-CNN. The weights of the fully-trained Faster R-CNN (R) and Faster R-CNN (M) models are stored in the files PPED/experiment/trained_models/resNetFpn-model-19.pth and mobile-model.pth. The performance measurements of Faster R-CNN (R) and Faster R-CNN (M) are stored in the folders PPED/experiment/results/Faster RCNN(R) and Faster RCNN(M).
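For orientation, a condensed sketch of the resnet50+fpn fine-tuning setup described above, using the torchvision detection API (the dataset provides its own script, train_res50_fpn.py; the class count, hyperparameters, and data loading here are placeholders):

    import torch
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    # COCO-pretrained Faster R-CNN with a resnet50+fpn backbone.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

    # Replace the box predictor head for our own classes (+1 for background).
    num_classes = 1 + 6  # placeholder PPE class count
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)
    # Build a VOC-style dataset/dataloader, then per batch:
    #   loss_dict = model(images, targets); sum(loss_dict.values()).backward(); optimizer.step()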
2.2.2. SSD
backbone: resnet50
We downloaded pre-training weights from https://download.pytorch.org/models/resnet50-19c8e357.pth.
The same training method as Faster R-CNN is applied.
The SSD source code used in our experiment is given in folder PPED/experiment/ssd. The weights of the fully-trained SSD model are stored in file PPED/experiment/trained_models/SSD_19.pth. The performance measurements of SSD are stored in folder PPED/experiment/results/SSD.
2.2.3. YOLOv3-spp
backbone: DarkNet53
We modified the type information of the XML file to match our application.
We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.
The weights used are: yolov3-spp-ultralytics-608.pt.
The YOLOv3-spp source code used in our experiment is given in folder PPED/experiment/YOLOv3-spp. The weights of the fully-trained YOLOv3-spp model are stored in file PPED/experiment/trained_models/YOLOvspp-19.pt. The performance measurements of YOLOv3-spp are stored in folder PPED/experiment/results/YOLOv3-spp.
2.2.4. YOLOv5
backbone: CSP_DarkNet
We modified the type information of the XML file to match our application.
We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.
The weights used are: yolov5s.
The YOLOv5 source code used in our experiment is given in folder PPED/experiment/yolov5. The weights of the fully-trained YOLOv5 model are stored in file PPED/experiment/trained_models/YOLOv5.pt. The performance measurements of YOLOv5 are stored in folder PPED/experiment/results/YOLOv5.
2.3. Evaluation
The computed evaluation metrics as well as the code needed to compute them from our dataset are provided in the folder PPED/experiment/eval.
Faster R-CNN (R and M)
official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/faster_rcnn.py
SSD
official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/ssd.py
YOLOv3-spp
YOLOv5
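For reference, detection metrics of this kind (AP/AR) can be computed with pycocotools along the following lines, assuming COCO-format ground-truth and detection JSON files (the file names are placeholders, not the dataset's own evaluation script):

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO("instances_test.json")         # ground-truth annotations (placeholder)
    coco_dt = coco_gt.loadRes("detections.json")  # model detections (placeholder)

    evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()  # prints the AP / AR table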
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Large ornithopod footprint classification validated by Díaz-Martínez et al. [3].
By Gove Allen [source]
The Paintings Dataset is a rich and diverse collection of various paintings from different artists spanning across multiple time periods. It includes a wide range of art styles, techniques, and subjects, providing an extensive resource for art enthusiasts, historians, researchers, and anyone interested in exploring the world of visual arts.
This dataset aims to capture the essence of artistic expression through its vast array of paintings. From classical masterpieces to contemporary works, it offers a comprehensive perspective on the evolution of artistic creativity throughout history.
Each record in this dataset represents an individual painting with detailed information such as artist's name, artwork title (if applicable), genre/style classification (e.g., landscape, portrait), medium (e.g., oil on canvas), dimensions (height and width), and provenance details if available. Additionally, some records may include additional metadata like the year or era in which the artwork was created.
By providing such comprehensive data about each painting included within this dataset, it enables users to study various aspects of art history. Researchers can analyze trends across different time periods or explore specific artistic movements by filtering the dataset based on genre or style categories. Art enthusiasts can also use this dataset to discover new artists or artworks that align with their interests.
This valuable collection appeals not only to those seeking knowledge or inspiration from renowned artworks but also encourages exploration into lesser-known pieces that may have been overlooked in mainstream discourse. It fosters engagement with cultural heritage while promoting diversity and inclusivity within the realm of visual arts.
Whether you are interested in studying classical works by universally acclaimed painters like Leonardo da Vinci or exploring modern expressions by emerging contemporary artists, this Paintings Dataset has something for everyone who appreciates aesthetics and enjoys unraveling stories through brushstrokes on canvas.
How to Use the Paintings Dataset
Welcome to the Paintings Dataset! This dataset is a comprehensive collection of various paintings from different artists and time periods. It contains information about the artist, title, genre, style, and medium of each painting. Whether you are an art enthusiast, researcher, or just curious about paintings, this guide will help you navigate through this dataset easily.
1. Understanding the Columns
This dataset consists of several columns that provide detailed information about each painting. Here is a brief description of each column:
- Artist: The name of the artist who created the painting.
- Title: The title or name given to the artwork by the artist.
- Genre: The artistic category or subject matter depicted in the painting.
- Style: The specific artistic style or movement associated with the painting.
- Medium: The materials and techniques used by the artist to create the artwork.
2. Exploring Artists and Their Paintings
One interesting way to use this dataset is to explore individual artists and their artworks. You can filter by a specific artist's name in order to retrieve all their paintings included in this collection.
For example: If you are interested in exploring all paintings by Leonardo da Vinci, simply filter on Leonardo da Vinci in the Artist column using your preferred data analysis tool.
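With pandas, for instance (assuming the dataset is provided as a CSV file; the file name is a placeholder):

    import pandas as pd

    paintings = pd.read_csv("paintings.csv")  # placeholder file name
    da_vinci = paintings[paintings["Artist"] == "Leonardo da Vinci"]
    print(da_vinci[["Title", "Genre", "Style", "Medium"]])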
3. Analyzing Painting Genres
The genre column allows you to analyze different categories within this collection of paintings. You can examine popular genres or compare them across different eras.
To analyze genres:
- Get unique values for the Genre column.
- Count the frequency of each genre value.
- Visualize the results using bar charts or other graphical representations.
You might discover which genres were more predominant during certain periods or which artists were known for specific subjects!
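Those steps translate directly into pandas, for example (column names as described above; the CSV file name is a placeholder):

    import matplotlib.pyplot as plt
    import pandas as pd

    paintings = pd.read_csv("paintings.csv")  # placeholder file name
    genre_counts = paintings["Genre"].value_counts()
    print(genre_counts.head(10))  # most common genres
    genre_counts.head(10).plot(kind="bar", title="Top painting genres")
    plt.show()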
4. Investigating Artistic Styles
Similar to genres, artistic styles also play an essential role in the world of painting. This dataset includes various styles like Impressionism, Cubism, Realism, etc. By analyzing the artistic styles column, you can explore trends and shifts in artistic movements.
To investigate styles:
- Get unique values for the Style column.
- Count the frequency of each style value.
- Visualize the results using bar charts or other graphical...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geographic distribution of the large ornithopod dinosaur track data used in this study.
We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10 has several advantages over previous artwork datasets. Firstly, it is class-balanced, while most previous artwork datasets suffer from long-tail class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures. We provide three versions of the dataset with different resolutions (32×32, 256×256, and original image size), formatted in a way that is easy to incorporate into popular machine learning frameworks.
We introduce the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. We describe its creation process, its content, and its possible uses. Attractive features of the Million Song Database include the range of existing resources to which it is linked, and the fact that it is the largest current research dataset in our field. As an illustration, we present year prediction as an example application, a task that has, until now, been difficult to study owing to the absence of a large set of suitable data. We show positive results on year prediction, and discuss more generally the future development of the dataset.