22 datasets found
  1. Anime Character Traits Dataset

    • kaggle.com
    Updated Feb 15, 2023
    Cite
    Michael Roberts (2023). Anime Character Traits Dataset [Dataset]. https://www.kaggle.com/datasets/mjrone/anime-character-traits-dataset
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 15, 2023
    Dataset provided by
    Kaggle
    Authors
    Michael Roberts
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

This is a dataset I scraped from the 'AnimeCharacterDatabase' website, a place where anime fans contribute data about characters in a large variety of shows. It contains data about the physical and temperamental qualities of many popular anime characters. The traits are stored in the 'tags' column and can either be read in as a Python list or parsed using regular expressions to get individual traits. Also worth noting is the index, which is the popularity ranking of each character in the database and can be used to rank popular traits, among other uses.
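As a quick illustration, a minimal Python sketch of both approaches (the CSV filename is hypothetical; the 'tags' column and popularity-ranking index are as described above):

```python
import ast

import pandas as pd

# Hypothetical filename; the index column is each character's popularity ranking.
df = pd.read_csv("anime-character-traits.csv", index_col=0)

# Option 1: read each 'tags' cell as a Python list literal.
tags = df["tags"].apply(ast.literal_eval)

# Option 2: pull traits out with a regular expression
# (assumes single-quoted items inside the tags string).
# tags = df["tags"].str.findall(r"'([^']+)'")

# Example use of the popularity index: common traits among the top 100 characters.
print(tags.head(100).explode().value_counts().head(10))
```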

  2. PHCD - Polish Handwritten Characters Database

    • kaggle.com
    Updated Dec 30, 2023
    Cite
    Wiktor Flis (2023). PHCD - Polish Handwritten Characters Database [Dataset]. https://www.kaggle.com/datasets/westedcrean/phcd-polish-handwritten-characters-database
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2023
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Wiktor Flis
    Description

The process for collecting this dataset was documented in the paper "Development of Extensive Polish Handwritten Characters Database for Text Recognition Research" (https://doi.org/10.12913/22998624/122567) by Mikhail Tokovarov, Dr. Monika Kaczorowska, and Dr. Marek Miłosz. Link to download the original dataset: https://cs.pollub.pl/phcd/. The source fileset also contains a dataset of raw images of whole sentences written in Polish.

    Context

PHCD (Polish Handwritten Characters Database) is a collection of handwritten texts in Polish. It was created by researchers at the Lublin University of Technology for the purpose of offline handwritten text recognition. The database contains more than 530,000 images of handwritten characters. Each image is a 32x32 pixel grayscale image representing one of 89 classes (10 digits, 26 lowercase Latin letters, 26 uppercase Latin letters, 9 lowercase Polish letters, 9 uppercase Polish letters, and 9 special characters), with around 6,000 examples per class.

    How to use

This notebook contains a PyTorch example of how to load the dataset from .npz files and train a CNN model. You can also use the dataset with other frameworks, such as TensorFlow, Keras, etc.

For .npz files, use the numpy.load method (see the loading sketch under Contents below).

    Contents

    The dataset contains the following:

    • dataset.npz - a file with two compressed numpy arrays:
      • "signs" - with all the images, sized 32 x 32 (grayscale)
      • "labels" - with all the labels (0-88) for examples from signs
    • label_mapping.csv - a csv file with columns label and char, mapping from ids to characters from dataset
• images - folder with the original 530,000 PNG images, sized 32 x 32, to use with other loading techniques
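Putting the pieces above together, a minimal loading sketch (assuming the files sit in the working directory):

```python
import csv

import numpy as np

# Load the two compressed arrays from dataset.npz.
data = np.load("dataset.npz")
signs, labels = data["signs"], data["labels"]  # 32x32 grayscale images, labels 0-88

# Map numeric labels to characters via label_mapping.csv (columns: label, char).
with open("label_mapping.csv", newline="", encoding="utf-8") as f:
    label_to_char = {int(row["label"]): row["char"] for row in csv.DictReader(f)}

print(signs.shape, labels.shape)
print(label_to_char[int(labels[0])])  # character of the first example
```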

    Acknowledgements

I want to express my gratitude to the following people: Dr. Edyta Łukasik, for introducing me to this dataset, and the authors of the dataset, Mikhail Tokovarov, Dr. Monika Kaczorowska, and Dr. Marek Miłosz, from the Lublin University of Technology in Poland.

    Inspiration

You can use this data the same way you used MNIST, KMNIST, or Fashion MNIST: refine your image classification skills and use GPUs & TPUs to implement CNN architectures for such multiclass classification tasks.

  3. 5,162 Images – Traditional Chinese Handwriting OCR Dataset

    • nexdata.ai
    Updated Oct 24, 2023
    Cite
    Nexdata (2023). 5,162 Images – Traditional Chinese Handwriting OCR Dataset [Dataset]. https://www.nexdata.ai/datasets/ocr/1190
    Explore at:
    Dataset updated
    Oct 24, 2023
    Dataset provided by
    nexdata technology inc
    Nexdata
    Authors
    Nexdata
    Variables measured
    Device, Accuracy, Data size, Data format, Data content, Annotation content, Photographic angle, Collecting environment
    Description

This dataset contains 5,162 handwriting images from 262 individuals, covering Traditional Chinese characters used in Taiwan. Each text instance in the data was annotated with a quadrilateral bounding box. The handwriting OCR data can be used for training and evaluating OCR models, Traditional Chinese character recognition systems, and AI-based handwriting applications. The accuracy of line-level annotation and transcription is >= 97%.

  4. Data from: Public Health Departments

    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    • nconemap.gov
    Updated Jan 17, 2018
    Cite
    CA Governor's Office of Emergency Services (2018). Public Health Departments [Dataset]. https://arc-gis-hub-home-arcgishub.hub.arcgis.com/maps/29c3979a34ba4d509582a0e2adf82fd3
    Explore at:
    Dataset updated
    Jan 17, 2018
    Dataset authored and provided by
    CA Governor's Office of Emergency Services
    Area covered
    Description

State and Local Public Health Departments in the United States. Governmental public health departments are responsible for creating and maintaining conditions that keep people healthy. A local health department may be locally governed, part of a region or district, be an office or an administrative unit of the state health department, or a hybrid of these. Furthermore, each community has a unique "public health system" comprising individuals and public and private entities that are engaged in activities that affect the public's health. (Excerpted from the Operational Definition of a functional local health department, National Association of County and City Health Officials, November 2005.) Please reference http://www.naccho.org/topics/infrastructure/accreditation/upload/OperationalDefinitionBrochure-2.pdf for more information.

Facilities involved in direct patient care are intended to be excluded from this dataset; however, some of the entities represented in this dataset serve as both administrative and clinical locations. This dataset only includes the headquarters of Public Health Departments, not their satellite offices. Some health departments encompass multiple counties; therefore, not every county will be represented by an individual record. Also, some areas will appear to be over-represented depending on the structure of the health departments in that particular region. Town health officers are included in Vermont and boards of health are included in Massachusetts. Both of these types of entities are elected or appointed to a term of office during which they make and enforce policies and regulations related to the protection of public health. Visiting nurses are represented in this dataset if they are contracted through the local government to fulfill the duties and responsibilities of the local health organization. Since many town health officers in Vermont work out of their personal homes, TechniGraphics represented these entities at the town hall. This is denoted in the [DIRECTIONS] field. Effort was made by TechniGraphics to verify whether or not each health department tracks statistics on communicable diseases.

Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard HSIP fields populated by TechniGraphics. Double spaces were replaced by single spaces in these same fields. At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on this field, the oldest record dates from 11/18/2009 and the newest record dates from 01/08/2010.

  5. HSIP Correctional Institutions in New Mexico

    • catalog.data.gov
    • gstore.unm.edu
    Updated Dec 2, 2020
    Cite
    (Point of Contact) (2020). HSIP Correctional Institutions in New Mexico [Dataset]. https://catalog.data.gov/dataset/hsip-correctional-institutions-in-new-mexico
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    (Point of Contact)
    Area covered
    New Mexico
    Description

Jails and Prisons (Correctional Institutions). The Jails and Prisons sub-layer is part of the Emergency Law Enforcement Sector and the Critical Infrastructure Category. A Jail or Prison consists of any facility or location where individuals are regularly and lawfully detained against their will. This includes Federal and State prisons, local jails, and juvenile detention facilities, as well as law enforcement temporary holding facilities. Work camps, including camps operated seasonally, are included if they otherwise meet the definition. A Federal Prison is a facility operated by the Federal Bureau of Prisons for the incarceration of individuals. A State Prison is a facility operated by a state, commonwealth, or territory of the US for the incarceration of individuals for a term usually longer than 1 year. A Juvenile Detention Facility is a facility for the incarceration of those who have not yet reached the age of majority (usually 18 years). A Local Jail is a locally administered facility that holds inmates beyond arraignment (usually 72 hours) and is staffed by municipal or county employees. A temporary holding facility, sometimes referred to as a "police lock up" or "drunk tank", is a facility used to detain people prior to arraignment. Locations that are administrative offices only are excluded from the dataset. This definition of Jails is consistent with that used by the Department of Justice (DOJ) in their "National Jail Census", with the exception of "temporary holding facilities", which the DOJ excludes. Locations which function primarily as law enforcement offices are included in this dataset if they have holding cells.

If the facility is enclosed with a fence, wall, or structure with a gate around the buildings only, the locations were depicted as "on entity" at the center of the facility. If the facility's buildings are not enclosed, the locations were depicted as "on entity" on the main building or "block face" on the correct street segment. Personal homes, administrative offices, and temporary locations are intended to be excluded from this dataset. TGS has made a concerted effort to include all correctional institutions. This dataset includes non-license-restricted data from the following federal agencies: Bureau of Indian Affairs; Bureau of Reclamation; U.S. Park Police; Federal Bureau of Prisons; Bureau of Alcohol, Tobacco, Firearms and Explosives; U.S. Marshals Service; U.S. Fish and Wildlife Service; National Park Service; U.S. Immigration and Customs Enforcement; and U.S. Customs and Border Protection. This dataset is comprised completely of license-free data.

The Law Enforcement dataset and the Correctional Institutions dataset were merged into one working file. TGS processed them as one file and then separated them for delivery purposes. With the merge of the Law Enforcement and the Correctional Institutions datasets, NAICS Codes & Descriptions were assigned based on the facility's main function, which was determined by the entity's name, facility type, web research, and state-supplied data. In instances where the entity's primary function is both law enforcement and corrections, the NAICS Codes and Descriptions are assigned based on the dataset in which the record is located (i.e., a facility that serves as both a Sheriff's Office and a jail is designated as [NAICSDESCR]="SHERIFFS' OFFICES (EXCEPT COURT FUNCTIONS ONLY)" in the Law Enforcement layer and as [NAICSDESCR]="JAILS (EXCEPT PRIVATE OPERATION OF)" in the Correctional Institutions layer).

Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard fields that TGS populated. Double spaces were replaced by single spaces in these same fields. Text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. All diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on the values in this field, the oldest record dates from 12/27/2004 and the newest record dates from 09/08/2009.

  6. Landscape Character Type - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Mar 29, 2016
    Cite
    ckan.publishing.service.gov.uk (2016). Landscape Character Type - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/landscape-character-type
    Explore at:
    Dataset updated
    Mar 29, 2016
    Dataset provided by
CKAN (https://ckan.org/)
    Description

The Peak District National Park contains an amazing variety of landscapes, including broad open moorlands, more intimate enclosed farmlands, and wooded valleys. The landscapes have been shaped by variations in geology and landform and by the long settlement and use of these landscapes by people. Today's landscapes have a rich diversity of natural and cultural heritage, and this diversity is enjoyed by local communities and visitors.

Landscape Character Assessment is a tool for identifying what makes one place different from another. It identifies what makes a place distinctive and does not assign value to particular landscapes. Landscape Character Assessment provides a framework for describing an area systematically, ensuring that judgments about future landscape change can be made based on knowledge of what is distinctive. This study has gathered information from published maps and documents, completed a full field survey of the National Park, and held a series of consultation workshops to gather the views of local communities. Formal consultation was carried out on the draft report, and amendments were made to the maps and text documents.

This report shows how the landscapes of the National Park and its surrounding area have been divided into a series of Regional Character Areas representing broad tracts of landscape which share common characteristics. Within each Regional Character Area, a number of Landscape Character Types have been defined based upon the pattern of natural and cultural characteristics. This document is the first stage of an ongoing project. The coming year will see the development of a landscape strategy and action plan for the Peak District National Park. The landscape strategy will build on an analysis of condition and forces for change in the landscape and on further consultation with stakeholders. The Landscape Character Assessment establishes a baseline audit of the current character of the landscape and provides a framework for the measurement of future landscape change. The assessment will also help to promote appreciation and understanding of the landscape of the National Park.

  7. A Benchmark Dataset for Manipuri Meetei-Mayek Handwritten Character...

    • data.mendeley.com
    Updated Sep 26, 2019
    Cite
    Pangambam Singh (2019). A Benchmark Dataset for Manipuri Meetei-Mayek Handwritten Character Recognition [Dataset]. http://doi.org/10.17632/3337bdvx3v.6
    Explore at:
    Dataset updated
    Sep 26, 2019
    Authors
    Pangambam Singh
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Manipur
    Description

A benchmark dataset is always required for any classification or recognition system. To the best of our knowledge, no benchmark dataset has existed so far for handwritten character recognition of the Manipuri Meetei-Mayek script in the public domain.

Manipuri, also referred to as Meeteilon or sometimes Meiteilon, is a Sino-Tibetan language and one of the languages in the Eighth Schedule of the Indian Constitution. It is the official language and lingua franca of the southeastern Himalayan state of Manipur, in northeastern India. The language is also used as a language of communication by a significant number of people across north-east India and in some parts of Bangladesh and Myanmar. It is the most widely spoken language in Northeast India after Bengali and Assamese.

In this work, we introduce a handwritten Manipuri Meetei-Mayek character dataset consisting of more than 5000 data samples collected from a diverse population spanning different age groups (from 4 to 60 years), genders, educational backgrounds, occupations, and communities from three districts of Manipur, India (Imphal East District, Thoubal District, and Kangpokpi District) during March and April 2019. Each individual was asked to write down all the Manipuri characters on one A4-size paper. The responses were scanned and each character was manually segmented from the scanned images.

The whole dataset consists of five categories: Mapi Mayek, Lonsum Mayek, Cheitap Mayek, Cheising Mayek, and Khutam Mayek. It comprises segmented scanned images of handwritten Manipuri Meetei-Mayek characters of size 128x128 pixels, in .JPG format as well as in .MAT format.
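If you work from the .MAT files, the variable names inside them are not documented here, so a safe first step is to inspect the keys (a hedged sketch; the filename is hypothetical):

```python
from scipy.io import loadmat

mat = loadmat("mapi_mayek.mat")  # hypothetical filename
# Keys starting with "__" are MATLAB file metadata, not data variables.
print([key for key in mat if not key.startswith("__")])
```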

  8. Truck Driving Schools

    • impactmap-smudallas.hub.arcgis.com
    Updated Jan 21, 2024
    Cite
    SMU (2024). Truck Driving Schools [Dataset]. https://impactmap-smudallas.hub.arcgis.com/datasets/truck-driving-schools
    Explore at:
    Dataset updated
    Jan 21, 2024
    Dataset authored and provided by
    SMU
    Area covered
    Description

Homeland Security Use Cases: Use cases describe how the data may be used and help to define and clarify requirements. 1) In the event of a threat against truck driving schools, this dataset could be used to locate truck driving schools in need of protection. 2) Identification of large groups of people that may need to be evacuated in the event of an emergency.

This dataset is composed of any type of Post Secondary Education facility, such as colleges, universities, technical schools, or trade schools, that provides training and certification in the field of professional truck driving. This dataset does not include Administration Only locations. Text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. All diacritics (e.g. the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] attribute.

  9. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Explore at:
Available download formats: json
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    figshare
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News, and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can easily be confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

Methods

Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with a name similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who passed a preliminary trial task. The only accepted tags are the tags assigned in agreement by no fewer than 5 annotators, which were then passed through reconciliation with an experienced reconciliator.

    The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to leave only mentions in good-quality text; each text was cut 1000 characters after the last mention.

Usage Notes

Entities:

File: Namesakes_entities.jsonl. The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity) or "Other" (meaning that the mention is of some other entity that merely has the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:

• 'pagename': page name of the Wikipedia page.
• 'pageid': page id of the Wikipedia page.
• 'title': title of the Wikipedia page.
• 'url': URL of the Wikipedia page.
• 'text': the text chunk from the Wikipedia page.
• 'entities': list of the mentions in the page text. Each mention is represented by a dictionary with the keys:
  • 'text': the mention as a string from the page text.
  • 'start': start character position of the mention in the text.
  • 'end': end (one-past-last) character position of the mention in the text.
  • 'tag': annotation tag given as a string, either 'Same' or 'Other'.
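For orientation, a minimal Python sketch of reading this file (assuming one JSON object per line, as is standard for jsonl):

```python
import json

# Collect, for each Wikipedia page, the mention strings tagged "Same".
same_mentions = {}
with open("Namesakes_entities.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        same_mentions[item["pagename"]] = [
            item["text"][e["start"]:e["end"]]
            for e in item["entities"]
            if e["tag"] == "Same"
        ]
```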

News:

File: Namesakes_news.jsonl. The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity) or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:

• 'id_text': id of the sample.
• 'text': the text chunk.
• 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
• 'entity': a dictionary describing the annotated entity mention in the text:
  • 'text': the mention as a string found by an NER model in the text.
  • 'start': start character position of the mention in the text.
  • 'end': end (one-past-last) character position of the mention in the text.
  • 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
    • 'pageid': Wikipedia page id.
    • 'pagetitle': page title.
    • 'url': page URL.

Backlinks:

The Backlinks dataset consists of two parts: a dictionary (Entity-to-Backlinks) and Backlinks documents. The dictionary points to the backlinks for each entity of the Entities dataset (if any backlinks exist for the entity). The Backlinks documents are the backlink Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is identified by surrounded double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
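A small sketch of parsing this bracket convention with a regular expression (an illustration, not part of the dataset's own tooling):

```python
import re

# Matches [[Exact Name]] and [[Exact Name | mention string]].
MENTION = re.compile(r"\[\[([^\]|]+?)(?:\|([^\]]+?))?\]\]")

def extract_mentions(content):
    """Yield (entity_name, surface_form) pairs from a Backlinks text chunk."""
    for m in MENTION.finditer(content):
        name = m.group(1).strip()
        yield name, (m.group(2) or name).strip()

text = ("Muir also spent time with photographer "
        "[[Carleton E. Watkins | Carleton Watkins]] and studied his photographs.")
print(list(extract_mentions(text)))  # [('Carleton E. Watkins', 'Carleton Watkins')]
```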

The Entity-to-Backlinks is a jsonl with 1527 items (file: Namesakes_backlinks_entities.jsonl). Each item is a tuple:

• Entity name.
• Entity Wikipedia page id.
• Backlink ids: a list of pageids of backlink documents.

The Backlinks documents is a jsonl with 26903 items (file: Namesakes_backlinks_texts.jsonl). Each item is a dictionary:

• 'pageid': id of the Wikipedia page.
• 'title': title of the Wikipedia page.
• 'content': text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
• 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple:
  • Entity name.
  • Entity Wikipedia page id.
  • Sorted list of all character indexes at which the mention occurrences start in the text.

  10. BISE Dataset-Balinese Script for Imaginary Spelling using...

    • data.mendeley.com
    Updated Nov 15, 2024
    Cite
    I Made Agus Wirawan (2024). BISE Dataset-Balinese Script for Imaginary Spelling using Electroencephalogram Signals [Dataset]. http://doi.org/10.17632/c3m4s2dtcr.2
    Explore at:
    Dataset updated
    Nov 15, 2024
    Authors
    I Made Agus Wirawan
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Balinese Script for Imaginary Spelling using Electroencephalogram (BISE dataset) is a collection of data related to the pronunciation/spelling and imagination of Balinese script, based on electroencephalogram (EEG) signals. The dataset consists of character spelling (CS) and character imagination (CI) datasets. Providing both CS and CI data is important to verify that the EEG signal pattern for a spoken script matches that of the imagined script; in addition, no previous researchers have collected a Balinese script imagination dataset. The participants were 31 healthy people, 8 males and 23 females, all students of the Balinese language education study program at Universitas Pendidikan Ganesha. The participants' EEG signals were recorded using a Contec KT 88 with 16 channels. This dataset consists of 7 types of data: (1) raw data from the 1st experiment, (2) raw data from the 2nd experiment, (3) data analysis of character spelling (CS) in the 1st experiment, (4) data analysis of character imagination (CI) in the 1st experiment, (5) data analysis of character spelling (CS) in the 2nd experiment, (6) data analysis of character imagination (CI) in the 2nd experiment, and (7) raw data from calm conditions. The first experiment's raw data contains EEG signals from participants pronouncing and imagining 18 Balinese scripts, sequentially and randomly. The second experiment's raw data contains EEG signals from participants spelling (CS) and imagining (CI) 6 Balinese vowel scripts, sequentially and randomly. From the raw data of the 1st and 2nd experiments, a data analysis process was then carried out: the first experiment yielded two analysis datasets, for the 18 Balinese scripts spelled and the 18 imagined; the second experiment yielded two analysis datasets, for the 6 Balinese vowel scripts spelled and the 6 imagined. Finally, the calm-condition raw data contains EEG signals from participants in a quiet state before starting the experiment.

  11. A BENCHMARK DATASET FOR MANIPURI MEETEI-MAYEK HANDWRITTEN CHARACTER...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Singh, Pangambam (2023). A BENCHMARK DATASET FOR MANIPURI MEETEI-MAYEK HANDWRITTEN CHARACTER RECOGNITION [Dataset]. http://doi.org/10.7910/DVN/OMU2DV
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Singh, Pangambam
    Area covered
    Manipur
    Description

A benchmark dataset is always required for any classification or recognition system. To the best of our knowledge, no benchmark dataset has existed so far for handwritten character recognition of the Manipuri Meetei-Mayek script in the public domain. Manipuri, also referred to as Meeteilon or sometimes Meiteilon, is a Sino-Tibetan language and one of the languages in the Eighth Schedule of the Indian Constitution. It is the official language and lingua franca of the southeastern Himalayan state of Manipur, in northeastern India. The language is also used as a language of communication by a significant number of people across north-east India and in some parts of Bangladesh and Myanmar. It is the most widely spoken language in Northeast India after Bengali and Assamese. In this work, we introduce a handwritten Manipuri Meetei-Mayek character dataset consisting of more than 5000 data samples collected from a diverse population spanning different age groups (from 4 to 60 years), genders, educational backgrounds, occupations, and communities from three districts of Manipur, India (Imphal East District, Thoubal District, and Kangpokpi District) during March and April 2019. Each individual was asked to write down all the Manipuri characters on one A4-size paper. The responses were scanned and each character was manually segmented from the scanned images. This dataset consists of segmented scanned images of handwritten Manipuri Meetei-Mayek characters (Mapi Mayek, Lonsum Mayek, Cheitap Mayek, Cheising Mayek, Khutam Mayek) of size 128x128 pixels, in .JPG format as well as in .MAT format.

  12. Cityscapes Image Pairs

    • kaggle.com
    Updated Apr 20, 2018
    Cite
    DanB (2018). Cityscapes Image Pairs [Dataset]. https://www.kaggle.com/datasets/dansbecker/cityscapes-image-pairs/discussion?sort=undefined
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 20, 2018
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    DanB
    Description

    Context

Cityscapes data (dataset home page) contains labeled videos taken from vehicles driven in Germany. This version is a processed subsample created as part of the Pix2Pix paper. The dataset has still images extracted from the original videos, and the semantic segmentation labels are shown alongside the original images. This is one of the best datasets around for semantic segmentation tasks.

    Content

This dataset has 2975 training image files and 500 validation image files. Each image file is 256x512 pixels, and each file is a composite with the original photo on the left half of the image, alongside the labeled image (output of semantic segmentation) on the right half.
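A minimal sketch of separating the two halves with Pillow (the file path is hypothetical; the 256x512 composite layout is as described above):

```python
from PIL import Image

pair = Image.open("train/1.jpg")      # hypothetical path to one composite file
w, h = pair.size                      # 512, 256
photo = pair.crop((0, 0, w // 2, h))  # left half: original photo
label = pair.crop((w // 2, 0, w, h))  # right half: semantic segmentation map
```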

    Acknowledgements

    This dataset is the same as what is available here from the Berkeley AI Research group.

    License

    The Cityscapes data available from cityscapes-dataset.com has the following license:

    This dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:

    • That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (Daimler AG, MPI Informatics, TU Darmstadt) do not accept any responsibility for errors or omissions.
    • That you include a reference to the Cityscapes Dataset in any work that makes use of the dataset. For research papers, cite our preferred publication as listed on our website; for other media cite our preferred publication as listed on our website or link to the Cityscapes website.
    • That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works in as far as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow to recover the dataset or something similar in character.
    • That you may not use the dataset or any derivative work for commercial purposes as, for example, licensing or selling the data, or using the data with a purpose to procure a commercial gain.
    • That all rights not expressly granted to you are reserved by (Daimler AG, MPI Informatics, TU Darmstadt).

    Inspiration

Can you identify what objects are where in these images taken from a vehicle?

  13. Pixar Movies

    • kaggle.com
    Updated Oct 26, 2024
    Cite
    Rummage Labs (2024). Pixar Movies [Dataset]. https://www.kaggle.com/datasets/rummagelabs/pixar-movies
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 26, 2024
    Dataset provided by
    Kaggle
    Authors
    Rummage Labs
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Pixar Movies Dataset

    A comprehensive dataset of Pixar movies, including details on their release dates, directors, writers, cast, box office performance, and ratings. This dataset is gathered from official sources, including Pixar, Rotten Tomatoes, and IMDb, to provide accurate and relevant information for anyone interested in analyzing Pixar's films.

    About Pixar Movies

    Pixar Animation Studios, known for its quality animation and storytelling, has produced a series of animated movies that have captivated audiences around the world. This dataset captures key details from Pixar’s filmography, including box office earnings, critical ratings, and character information, making it a valuable resource for those analyzing trends in animation, its movie plot lines and beloved characters, and movie ratings. For more information, visit Pixar, Rotten Tomatoes, and IMDb.

    Dataset Information

    • Source: Data is compiled from public sources, including official information from Pixar, Rotten Tomatoes, IMDb, and Wikipedia. Cells are each derived from one or more sources and then selected/verified.
    • Purpose: The dataset is intended for research, educational, and analytical purposes.
    • Accuracy: Efforts have been made to ensure accuracy, though users are encouraged to verify individual data points for critical use.
    • Updates: This dataset captures information available up to the latest Pixar releases.

    Data Structure

    Dataset Columns

• movie: The title of the Pixar movie
• date_released: The exact release date of the movie (e.g., YYYY-MM-DD)
• year_released: The year the movie was released (e.g., YYYY)
• length_min: Duration of the movie in minutes
• plot_summary: A brief summary of the movie's plot
• director: The name(s) of the director(s) of the movie
• writer: The name(s) of the writer(s) of the movie
• main_characters: List of main characters featured in the movie
• type_of_characters: Description of the types of characters (e.g., human, toys, animals, vehicles)
• main_voice_actors: List of actors who voiced the main characters
• opening_weekend_box_office_sales: Gross box office earnings on the opening weekend in USD
• total_worldwide_gross_sales: Total gross box office earnings worldwide in USD
• rotten_tomatoes_rating: Rotten Tomatoes rating, typically out of 100
• imdb_rating: IMDb rating, typically out of 10
• movie_genre: Primary genre(s) of the movie (e.g., Animation, Adventure, Comedy)
• movie_rating: The movie's rating (e.g., G, PG, PG-13)
    This data was compiled, enriched, reviewed, and curated using Research by Rummage Labs. Research by Rummage Labs enables you to curate verified datasets to power your enterprise. Read more here: https://rummagelabs.com/.

  14. Game of thrones character classification

    • kaggle.com
    Updated Sep 12, 2020
    Cite
    Kushagra (2020). Game of thrones character classification [Dataset]. https://www.kaggle.com/kushagrakinjawadekar/game-of-thrones-character-classification/code
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 12, 2020
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Kushagra
    Description

    Context

Many people, including me, get confused by the many characters when they start watching this famous TV series, hence I decided to classify and recognize them.

    Content

The dataset contains images of each of the following characters, to be recognized and identified:

1. Jon Snow
2. Daenerys Targaryen
3. Arya Stark
4. Sansa Stark
5. Jaime Lannister
6. Tyrion Lannister

If you like the content, please give an upvote.

  15. US Household Income Statistics

    • kaggle.com
    zip
    Updated Apr 16, 2018
    Cite
    Golden Oak Research Group (2018). US Household Income Statistics [Dataset]. https://www.kaggle.com/goldenoakresearch/us-household-income-stats-geo-locations
    Explore at:
Available download formats: zip (2344717 bytes)
    Dataset updated
    Apr 16, 2018
    Dataset authored and provided by
    Golden Oak Research Group
    Description

    New Upload:

Added 32,000 more locations. For information on data calculations, please refer to the methodology PDF document. Information on how to calculate the data yourself is also provided, as well as how to buy data for $1.29.

    What you get:

The database contains 32,000 records on US Household Income Statistics & Geo Locations. The field description of the database is documented in the attached PDF file. To access all 348,893 records, on a scale roughly equivalent to a neighborhood (census tract), see the link below and please upvote. Enjoy!

    Household & Geographic Statistics:

    • Mean Household Income (double)
    • Median Household Income (double)
    • Standard Deviation of Household Income (double)
    • Number of Households (double)
    • Square area of land at location (double)
    • Square area of water at location (double)

    Geographic Location:

    • Longitude (double)
    • Latitude (double)
    • State Name (character)
    • State abbreviated (character)
    • State_Code (character)
    • County Name (character)
    • City Name (character)
    • Name of city, town, village or CPD (character)
• Primary: defines whether the location is a tract or a block group
    • Zip Code (character)
    • Area Code (character)
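As one illustration, a pandas sketch over these fields (the filename and column names below are hypothetical stand-ins; the actual field names are documented in the attached PDF):

```python
import pandas as pd

df = pd.read_csv("us_household_income.csv")  # hypothetical filename

# Hypothetical column names: median of the median household income per state.
by_state = df.groupby("State_Name")["Median"].median()
print(by_state.sort_values(ascending=False).head())
```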

    Abstract

The dataset was originally developed for real estate and business investment research. Income is a vital element when determining both the quality and socioeconomic features of a given geographic location. The following data was derived from over 36,000 files and covers 348,893 location records.

    License

Only proper citation is required; please see the documentation for details. Have Fun!!!

Golden Oak Research Group, LLC. "U.S. Income Database Kaggle". Publication: 5 August 2017. Accessed: day, month, year.

Sources: don't have 2 dollars? Get the full information yourself!

    2011-2015 ACS 5-Year Documentation was provided by the U.S. Census Reports. Retrieved August 2, 2017, from https://www2.census.gov/programs-surveys/acs/summary_file/2015/data/5_year_by_state/

    Found Errors?

    Please tell us so we may provide you the most accurate data possible. You may reach us at: research_development@goldenoakresearch.com

For any questions, you can reach me at 585-626-2965.

Please note: it is my personal number, and email is preferred.

    Check our data's accuracy: Census Fact Checker

    Access all 348,893 location records and more:

Don't settle. Go big and win big. Optimize your potential. Overcome limitation and outperform expectation. Access all household income records on a scale roughly equivalent to a neighborhood; see the link below:

Website: Golden Oak Research Kaggle Deals. All databases $1.29, limited time only.

A small startup with big dreams, giving the everyday, up-and-coming data scientist professional-grade data at affordable prices. It's what we do.

  16. Tamil Product Image OCR Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Tamil Product Image OCR Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/tamil-product-image-ocr-dataset
    Explore at:
Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Tamil Product Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Tamil language.

Dataset Content & Diversity:

Containing a total of 2000 images, this Tamil OCR dataset offers a diverse distribution across different types of product front images. In this dataset, you'll find a variety of text that includes product names, taglines, logos, company names, addresses, product content, etc. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

To ensure the diversity of the dataset and to build a robust text recognition model, we allow a limited number (fewer than five) of unique images from a single resource. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Tamil text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, to build a balanced OCR dataset. The collection features images in portrait and landscape modes.

All these images were captured by native Tamil people to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.

    Metadata:

Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes metadata like image orientation, country, language, and device information. Each image is named to correspond with its metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Tamil text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Tamil crowd community.

    If you require a custom product image OCR dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific project requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this product image OCR dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Tamil language. Your journey to enhanced language understanding and processing starts here.

  17. Arabic Product Image OCR Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Arabic Product Image OCR Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/arabic-product-image-ocr-dataset
    Explore at:
Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Arabic Product Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Arabic language.

Dataset Content & Diversity:

Containing a total of 2000 images, this Arabic OCR dataset offers a diverse distribution across different types of product front images. In this dataset, you'll find a variety of text that includes product names, taglines, logos, company names, addresses, product content, etc. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

To ensure the diversity of the dataset and to build a robust text recognition model, we allow a limited number (fewer than five) of unique images from a single resource. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Arabic text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, to build a balanced OCR dataset. The collection features images in portrait and landscape modes.

All these images were captured by native Arabic people to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.

    Metadata:

Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes metadata like image orientation, country, language, and device information. Each image is named to correspond with its metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Arabic text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Arabic crowd community.

    If you require a custom product image OCR dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific project requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this product image OCR dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Arabic language. Your journey to enhanced language understanding and processing starts here.

  18. Korean Product Image OCR Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Korean Product Image OCR Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/korean-product-image-ocr-dataset
    Explore at:
Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Korean Product Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Korean language.

Dataset Content & Diversity:

Containing a total of 2000 images, this Korean OCR dataset offers a diverse distribution across different types of product front images. In this dataset, you'll find a variety of text that includes product names, taglines, logos, company names, addresses, product content, etc. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

To ensure the diversity of the dataset and to build a robust text recognition model, we allow a limited number (fewer than five) of unique images from a single resource. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Korean text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, to build a balanced OCR dataset. The collection features images in portrait and landscape modes.

All these images were captured by native Korean people to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to capture all these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.

    Metadata:

Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes metadata like image orientation, country, language, and device information. Each image is named to correspond with its metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Korean text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Korean crowd community.

    If you require a custom product image OCR dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding box or transcribe the text in the image to align with your specific project requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this product image OCR dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Korean language. Your journey to enhanced language understanding and processing starts here.

  19. ASL Sign Language Alphabet Pictures [Minus J, Z]

    • kaggle.com
    Updated Sep 26, 2020
    Cite
    The citation is currently not available for this dataset.
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 26, 2020
    Dataset provided by
    Kaggle
    Authors
    SigNN Team
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

[Reference image: ASL alphabet chart (Asl_alphabet_gallaudet.svg).]

    Images were taken with the chart above as a reference. There are many different dialects of ASL and that should be noted before proceeding.

    Context

    This dataset was created as part of the SigNN project, to create a free and open-source real-time translation app for the American Sign Language (ASL) alphabet. The link to the GitHub can be found here: https://github.com/AriAlavi/SigNN

    SigNN was built on top of Mediapipe: a free and open-source library that allows for easy implementation of real-time neural networks in many different environments, such as Android and Ubuntu. The link to the GitHub can be found here: https://github.com/google/mediapipe

    We were able to reach a theoretical accuracy of 95% with this dataset. In practice it does not appear to be correct 95% of the time, but we believe the accuracy is sufficient. The completed app is downloadable at: https://play.google.com/store/apps/details?id=com.signn.mediapipe.apps.handtrackinggpu

    Content

    Inside each folder, you will find hundreds of pictures captured by our volunteers. Each picture differs slightly from the others in angle and position. For our use case, we expect the user to face the camera head-on when translating, so angle variations of the sign are less than 45 degrees. Each image also displays no more than one hand.
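    Because the images are organized into per-class folders, they can be loaded with a standard folder-based loader; the sketch below assumes torchvision, one folder per letter, and a root directory named asl_alphabet.

    ```python
    # Sketch: load the per-letter folders with torchvision's ImageFolder.
    # "asl_alphabet" is a placeholder for the extracted dataset root.
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    dataset = datasets.ImageFolder("asl_alphabet", transform=transform)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    print(dataset.classes)  # expect 24 classes: J and Z require motion
    ```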

    Acknowledgements

    We would like to thank all members of the SigNN team for collecting the data and writing the infrastructure required to capture, process, and train the neural network. It was not an easy task.

    For this dataset, we would especially like to thank our data collectors who spent hours of their free time taking videos of their hands:

    Kenny Yip, Albert Yuk Zhuang, Daniel Lohn, Rafael Trinidad, Gokul Deep

    Inspiration

    We have learned through our year-long journey creating SigNN that the primary reason for the lack of other ASL alphabet translators was the lack of publicly available data. That is why we unanimously voted to make the data publicly available for others to create similar software. While we were the first to the app store, we hope to inspire better apps and invite people to learn from our mistakes and take note of where we made breakthroughs. If you are interested in more of our story, you can learn more at: signn.live

  20. French Handwritten Sticky Notes OCR Image Dataset

    • futurebeeai.com
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). French Handwritten Sticky Notes OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/french-sticky-notes-ocr-image-dataset
    Explore at:
    Available download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    French
    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the French Sticky Notes Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the French language.

    Dataset Content & Diversity:

    Containing more than 2000 images, this French OCR dataset offers a wide distribution of different types of sticky note images. Within this dataset, you'll discover a variety of handwritten text, including quotes, sentences, and individual words on sticky notes. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.

    To ensure diversity and robustness in training your OCR model, we limit each unique handwriting style to fewer than three images, which gives the dataset a wide spread of handwriting to train on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible French text.

    The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

    All these sticky notes were written and images were captured by native French people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.

    This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of French text recognition models.
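    One concrete use of this metadata is a stratified train/validation split, so that devices and orientations stay balanced across splits; the file and column names below are assumptions.

    ```python
    # Sketch: stratified split using assumed metadata columns.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    meta = pd.read_csv("metadata.csv")  # hypothetical file name
    strata = meta["device"].astype(str) + "_" + meta["orientation"].astype(str)
    train, val = train_test_split(
        meta, test_size=0.2, stratify=strata, random_state=42
    )
    print(len(train), len(val))
    ```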

    Update & Custom Collection:

    We are committed to continually expanding this dataset by adding more images with the help of our native French crowd community.

    If you require a customized OCR dataset containing sticky note images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.

    Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
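    If you opt for annotation, a word- or line-level record could look something like the following; this layout is purely hypothetical, since the delivery format is agreed per project.

    ```python
    # Hypothetical annotation record; the actual delivery format may differ.
    annotation = {
        "image": "fr_sticky_note_00042.jpg",  # placeholder file name
        "regions": [
            {
                "bbox": [120, 56, 310, 98],   # x_min, y_min, x_max, y_max (pixels)
                "transcription": "Acheter du pain",
            },
        ],
    }
    ```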

    License:

    This image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage this sticky notes image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the French language. Your journey to improved language understanding and processing begins here.
