https://creativecommons.org/publicdomain/zero/1.0/
This is a dataset I scraped from the 'AnimeCharacterDatabase' website, a place where anime fans contribute data about characters in a wide variety of shows. It contains data about the physical and temperamental qualities of many popular anime characters. The traits are stored in the 'tags' column and can either be read in as a Python list or parsed with regular expressions to extract individual traits. The index is the popularity ranking of each character in the database, and can be used, for example, to rank popular traits.
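A minimal pandas sketch of reading the tags as Python lists and ranking traits by popularity (the file name is hypothetical; adjust it to the actual CSV in this dataset):

```python
import ast
import pandas as pd

# Hypothetical file name; adjust to the CSV shipped with this dataset.
df = pd.read_csv("anime_characters.csv")

# If 'tags' is stored as a Python-list literal like "['brown hair', 'tsundere']",
# ast.literal_eval parses it safely into a real list.
df["tags"] = df["tags"].apply(ast.literal_eval)

# The index is the popularity ranking, so counting tags over the top rows
# surfaces the most common traits among popular characters.
top_traits = df.head(1000).explode("tags")["tags"].value_counts().head(20)
print(top_traits)
```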
The process for collecting this dataset is documented in the paper "Development of Extensive Polish Handwritten Characters Database for Text Recognition Research" by Mikhail Tokovarov, Dr. Monika Kaczorowska, and Dr. Marek Miłosz (https://doi.org/10.12913/22998624/122567). Link to download the original dataset: https://cs.pollub.pl/phcd/. The source fileset also contains a dataset of raw images of whole sentences written in Polish.
PHCD (Polish Handwritten Characters Database) is a collection of handwritten texts in Polish. It was created by researchers at Lublin University of Technology for the purpose of offline handwritten text recognition. The database contains more than 530,000 images of handwritten characters. Each image is a 32x32-pixel grayscale image representing one of 89 classes (10 digits, 26 lowercase Latin letters, 26 uppercase Latin letters, 9 lowercase Polish letters, 9 uppercase Polish letters, and 9 special characters), with around 6,000 examples per class.
This notebook contains a PyTorch example of how to load the dataset from .npz files and train a CNN model. You can also use the dataset with other frameworks, such as TensorFlow or Keras. For the .npz files, use the numpy.load method.
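As a rough sketch of that loading step (the file name and array keys below are assumptions; check data.files for the actual keys in the archive):

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical file and key names; inspect data.files for the real ones.
data = np.load("phcd_train.npz")
print(data.files)  # lists the arrays stored in the archive

images = data["images"].astype(np.float32) / 255.0  # 32x32 grayscale, scaled to [0, 1]
labels = data["labels"].astype(np.int64)            # integer class ids, 0..88

# Add a channel dimension for PyTorch CNNs: (N, 1, 32, 32).
x = torch.from_numpy(images).unsqueeze(1)
y = torch.from_numpy(labels)

loader = DataLoader(TensorDataset(x, y), batch_size=128, shuffle=True)
```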
The dataset contains the following:
I want to express my gratitude to the following people: Dr. Edyta Łukasik, for introducing me to this dataset, and the authors of the dataset - Mikhail Tokovarov, Dr. Monika Kaczorowska, and Dr. Marek Miłosz of Lublin University of Technology in Poland.
You can use this data the same way you used MNIST, KMNIST, or Fashion MNIST: refine your image classification skills, and use GPUs & TPUs to implement CNN architectures for such multiclass classification tasks.
This dataset contains 5,162 handwriting images from 262 individuals, covering Traditional Chinese characters used in Taiwan. Each text instance was annotated with a quadrilateral bounding box. The handwriting OCR data can be used for training and evaluating OCR models, Traditional Chinese character recognition systems, and AI-based handwriting applications. The accuracy of line-level annotation and transcription is >= 97%.
State and Local Public Health Departments in the United States. Governmental public health departments are responsible for creating and maintaining conditions that keep people healthy. A local health department may be locally governed, part of a region or district, be an office or an administrative unit of the state health department, or a hybrid of these. Furthermore, each community has a unique "public health system" comprising individuals and public and private entities that are engaged in activities that affect the public's health. (Excerpted from the Operational Definition of a functional local health department, National Association of County and City Health Officials, November 2005.) Please reference http://www.naccho.org/topics/infrastructure/accreditation/upload/OperationalDefinitionBrochure-2.pdf for more information.

Facilities involved in direct patient care are intended to be excluded from this dataset; however, some of the entities represented in this dataset serve as both administrative and clinical locations. This dataset only includes the headquarters of public health departments, not their satellite offices. Some health departments encompass multiple counties; therefore, not every county will be represented by an individual record. Also, some areas will appear to be over-represented depending on the structure of the health departments in that particular region. Town health officers are included in Vermont and boards of health are included in Massachusetts; both of these types of entities are elected or appointed to a term of office during which they make and enforce policies and regulations related to the protection of public health. Visiting nurses are represented in this dataset if they are contracted through the local government to fulfill the duties and responsibilities of the local health organization. Since many town health officers in Vermont work out of their personal homes, TechniGraphics represented these entities at the town hall; this is denoted in the [DIRECTIONS] field. Effort was made by TechniGraphics to verify whether or not each health department tracks statistics on communicable diseases.

Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard HSIP fields populated by TechniGraphics. Double spaces were replaced by single spaces in these same fields. At the request of NGA, text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. At the request of NGA, all diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on this field, the oldest record dates from 11/18/2009 and the newest record dates from 01/08/2010.
Jails and Prisons (Correctional Institutions). The Jails and Prisons sub-layer is part of the Emergency Law Enforcement Sector and the Critical Infrastructure Category. A jail or prison consists of any facility or location where individuals are regularly and lawfully detained against their will. This includes Federal and State prisons, local jails, and juvenile detention facilities, as well as law enforcement temporary holding facilities. Work camps, including camps operated seasonally, are included if they otherwise meet the definition. A Federal Prison is a facility operated by the Federal Bureau of Prisons for the incarceration of individuals. A State Prison is a facility operated by a state, commonwealth, or territory of the US for the incarceration of individuals for a term usually longer than 1 year. A Juvenile Detention Facility is a facility for the incarceration of those who have not yet reached the age of majority (usually 18 years). A Local Jail is a locally administered facility that holds inmates beyond arraignment (usually 72 hours) and is staffed by municipal or county employees. A temporary holding facility, sometimes referred to as a "police lock up" or "drunk tank", is a facility used to detain people prior to arraignment. Locations that are administrative offices only are excluded from the dataset. This definition of jails is consistent with that used by the Department of Justice (DOJ) in their "National Jail Census", with the exception of "temporary holding facilities", which the DOJ excludes.

Locations which function primarily as law enforcement offices are included in this dataset if they have holding cells. If the facility is enclosed with a fence, wall, or structure with a gate around the buildings only, the locations were depicted as "on entity" at the center of the facility. If the facility's buildings are not enclosed, the locations were depicted as "on entity" on the main building or "block face" on the correct street segment. Personal homes, administrative offices, and temporary locations are intended to be excluded from this dataset. TGS has made a concerted effort to include all correctional institutions.

This dataset includes non-license-restricted data from the following federal agencies: Bureau of Indian Affairs; Bureau of Reclamation; U.S. Park Police; Federal Bureau of Prisons; Bureau of Alcohol, Tobacco, Firearms and Explosives; U.S. Marshals Service; U.S. Fish and Wildlife Service; National Park Service; U.S. Immigration and Customs Enforcement; and U.S. Customs and Border Protection. This dataset consists entirely of license-free data. The Law Enforcement dataset and the Correctional Institutions dataset were merged into one working file; TGS processed them as one file and then separated them for delivery purposes. With the merge of the Law Enforcement and Correctional Institutions datasets, NAICS Codes & Descriptions were assigned based on the facility's main function, which was determined by the entity's name, facility type, web research, and state-supplied data. In instances where the entity's primary function is both law enforcement and corrections, the NAICS Codes and Descriptions are assigned based on the dataset in which the record is located (i.e., a facility that serves as both a Sheriff's Office and a jail is designated as [NAICSDESCR]="SHERIFFS' OFFICES (EXCEPT COURT FUNCTIONS ONLY)" in the Law Enforcement layer and as [NAICSDESCR]="JAILS (EXCEPT PRIVATE OPERATION OF)" in the Correctional Institutions layer).
Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. "#" and "*" characters were automatically removed from standard fields that TGS populated. Double spaces were replaced by single spaces in these same fields. Text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. All diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on the values in this field, the oldest record dates from 12/27/2004 and the newest record dates from 09/08/2009.
The Peak District National Park contains an amazing variety of landscapes, including broad open moorlands, more intimate enclosed farmlands, and wooded valleys. The landscapes have been shaped by variations in geology and landform and by the long settlement and use of these landscapes by people. Today's landscapes have a rich diversity of natural and cultural heritage, and this diversity is enjoyed by local communities and visitors.

Landscape Character Assessment is a tool for identifying what makes one place different from another. It identifies what makes a place distinctive and does not assign value to particular landscapes. Landscape Character Assessment provides a framework for describing an area systematically, ensuring that judgments about future landscape change can be made based on knowledge of what is distinctive. This study has gathered information from published maps and documents, completed a full field survey of the National Park, and held a series of consultation workshops to gather the views of local communities. Formal consultation was carried out on the draft report, and amendments were made to the maps and text documents.

This report shows how the landscapes of the National Park and its surrounding area have been divided into a series of Regional Character Areas representing broad tracts of landscape which share common characteristics. Within each Regional Character Area, a number of Landscape Character Types have been defined based upon the pattern of natural and cultural characteristics. This document is the first stage of an ongoing project. The coming year will see the development of a landscape strategy and action plan for the Peak District National Park. The landscape strategy will build on an analysis of condition and forces for change in the landscape and on further consultation with stakeholders. The Landscape Character Assessment establishes a baseline audit of the current character of the landscape and provides a framework for the measurement of future landscape change. The assessment will also help to promote appreciation and understanding of the landscape of the National Park.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A benchmark dataset is always required for any classification or recognition system. To the best of our knowledge, no benchmark dataset for handwritten character recognition of the Manipuri Meetei-Mayek script exists in the public domain so far.
Manipuri, also referred to as Meeteilon or sometimes Meiteilon, is a Sino-Tibetan language and one of the eight scheduled languages of the Indian Constitution. It is the official language and lingua franca of the southeastern Himalayan state of Manipur, in northeastern India. The language is also used for communication by a significant number of people across north-east India and in parts of Bangladesh and Myanmar. It is the most widely spoken language in Northeast India after Bengali and Assamese.
In this work, we introduce a handwritten Manipuri Meetei-Mayek character dataset consisting of more than 5000 samples. The samples were collected during March and April 2019 from a diverse population group spanning different age groups (4 to 60 years), genders, educational backgrounds, occupations, and communities across three districts of Manipur, India (Imphal East District, Thoubal District, and Kangpokpi District). Each individual was asked to write all the Manipuri characters on one A4-size sheet of paper. The responses were scanned, and each character was then manually segmented from the scanned images.
The whole dataset consists of five categories:
This dataset consists of segmented scanned images of handwritten Manipuri Meetei-Mayek characters (Mapi Mayek, Lonsum Mayek, Cheitap Mayek, Cheising Mayek, Khutam Mayek) of size 128x128 pixels, in both .JPG and .MAT formats.
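A small sketch of reading both formats (paths and variable names below are hypothetical; inspect the archive and the .MAT keys for the real ones):

```python
from pathlib import Path

import numpy as np
from PIL import Image
from scipy.io import loadmat

# Reading the 128x128 .JPG images (the directory name is hypothetical).
jpg_images = [
    np.array(Image.open(p).convert("L"))  # grayscale array of shape (128, 128)
    for p in Path("meetei_mayek/Mapi_Mayek").glob("*.jpg")
]

# Reading the .MAT version; print the keys to find the actual variable name.
mat = loadmat("meetei_mayek/mapi_mayek.mat")
print([k for k in mat.keys() if not k.startswith("__")])
```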
Homeland Security Use Cases: Use cases describe how the data may be used and help to define and clarify requirements. 1) In the event of a threat against truck driving schools, this dataset could be used to locate truck driving schools in need of protection. 2) Identification of large groups of people that may need to be evacuated in the event of an emergency. This dataset is composed of any type of post-secondary education facility, such as colleges, universities, technical schools, or trade schools, that provides training and certification in the field of professional truck driving. This dataset does not include Administration Only locations. Text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. All diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] attribute.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News, and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can easily be confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

Methods
Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities whose name is similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were presented to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. Only tags assigned in agreement by at least 5 annotators were accepted, and these then passed through reconciliation with an experienced reconciliator.
The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).
Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to leave only mentions in good-quality text; each text was cut 1000 characters after the last mention.
Usage Notes

Entities:
File: Namesakes_entities.jsonl

The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity) or "Other" (meaning that the mention is of some other entity that merely has the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:
- 'pagename': page name of the Wikipedia page.
- 'pageid': page id of the Wikipedia page.
- 'title': title of the Wikipedia page.
- 'url': URL of the Wikipedia page.
- 'text': the text chunk from the Wikipedia page.
- 'entities': list of the mentions in the page text; each entity is represented by a dictionary with the keys:
  - 'text': the mention as a string from the page text.
  - 'start': start character position of the mention in the text.
  - 'end': end (one-past-last) character position of the mention in the text.
  - 'tag': annotation tag given as a string, either 'Same' or 'Other'.
News:

File: Namesakes_news.jsonl

The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity) or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:
- 'id_text': id of the sample.
- 'text': the text chunk.
- 'urls': list of URLs of the Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
- 'entity': a dictionary describing the annotated entity mention in the text:
  - 'text': the mention as a string found by an NER model in the text.
  - 'start': start character position of the mention in the text.
  - 'end': end (one-past-last) character position of the mention in the text.
  - 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
    - 'pageid': Wikipedia page id.
    - 'pagetitle': page title.
    - 'url': page URL.
Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.
Each mention is enclosed in double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
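A minimal sketch of extracting these mentions with a regular expression (not the authors' tooling, just an illustration of the bracket format described above):

```python
import re

# Matches [[Entity Name]] and [[Entity Name | mention string]] forms.
MENTION_RE = re.compile(r"\[\[([^\]|]+?)(?:\s*\|\s*([^\]]+?))?\]\]")

def extract_mentions(text):
    """Return (entity_name, mention_string) pairs; the mention defaults
    to the entity name when no '|' alias is present."""
    pairs = []
    for m in MENTION_RE.finditer(text):
        entity = m.group(1).strip()
        mention = (m.group(2) or entity).strip()
        pairs.append((entity, mention))
    return pairs

text = ("Muir also spent time with photographer "
        "[[Carleton E. Watkins | Carleton Watkins]] and studied his "
        "photographs of [[Yosemite Creek]].")
print(extract_mentions(text))
# [('Carleton E. Watkins', 'Carleton Watkins'), ('Yosemite Creek', 'Yosemite Creek')]
```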
The Entity-to-Backlinks mapping is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl. Each item is a tuple:
- Entity name.
- Entity Wikipedia page id.
- Backlinks ids: a list of page ids of backlink documents.
The Backlinks documents are a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl. Each item is a dictionary:
- 'pageid': id of the Wikipedia page.
- 'title': title of the Wikipedia page.
- 'content': text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
- 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple:
  - Entity name.
  - Entity Wikipedia page id.
  - Sorted list of all character indexes at which the mention occurrences start in the text.
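A minimal sketch for loading any of these .jsonl files (file and key names as documented above):

```python
import json
from collections import Counter

def read_jsonl(path):
    """Load a .jsonl file: one JSON value per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

entities = read_jsonl("Namesakes_entities.jsonl")

# Count how the mentions were tagged across the Entities dataset.
tag_counts = Counter(e["tag"] for item in entities for e in item["entities"])
print(tag_counts)  # expected keys: 'Same' and 'Other'
```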
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Balinese Script for Imaginary Spelling using Electroencephalogram (the BISE dataset) is a collection of data related to the pronunciation/spelling and imagination of Balinese script, based on electroencephalogram (EEG) signals. It consists of character spelling (CS) and character imagination (CI) datasets. Providing both CS and CI data makes it possible to verify that the EEG signal pattern for a spoken script matches that of the imagined script; to our knowledge, previous researchers have not yet collected a Balinese script imagination dataset. The participants were 31 healthy people (8 males and 23 females), students of the Balinese language education study program at Universitas Pendidikan Ganesha. The participants' EEG signals were recorded using a Contec KT 88 with 16 channels.

The dataset consists of 7 types of data: (1) raw data from the 1st experiment, (2) raw data from the 2nd experiment, (3) analysis data for character spelling (CS) in the 1st experiment, (4) analysis data for character imagination (CI) in the 1st experiment, (5) analysis data for character spelling (CS) in the 2nd experiment, (6) analysis data for character imagination (CI) in the 2nd experiment, and (7) raw data from the calm condition. The first experiment's raw data contains EEG signals from participants pronouncing and imagining 18 Balinese scripts, sequentially and randomly. The second experiment's raw data contains EEG signals from participants spelling (CS) and imagining (CI) 6 Balinese vowel scripts, sequentially and randomly. From the raw data of the two experiments, analysis data were produced: two sets for the 18 Balinese scripts (spelled and imagined) and two sets for the 6 Balinese vowel scripts (spelled and imagined). Finally, the calm-condition raw data contains EEG signals recorded while participants were in a quiet condition before starting the experiment.
Cityscapes data (dataset home page) contains labeled videos taken from vehicles driven in Germany. This version is a processed subsample created as part of the Pix2Pix paper. The dataset has still images from the original videos, and the semantic segmentation labels are shown in images alongside the original image. This is one of the best datasets around for semantic segmentation tasks.
This dataset has 2,975 training image files and 500 validation image files. Each image file is 256x512 pixels, and each file is a composite with the original photo on the left half of the image, alongside the labeled image (the output of semantic segmentation) on the right half.
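A minimal sketch of splitting one composite into its photo and label halves (the file path is hypothetical):

```python
from PIL import Image

# Each file is a 256x512 composite: the street photo on the left half,
# the segmentation labels on the right half. The path is hypothetical.
composite = Image.open("cityscapes_data/train/1.jpg")
w, h = composite.size  # (512, 256): width x height

photo = composite.crop((0, 0, w // 2, h))   # left half: original photo
labels = composite.crop((w // 2, 0, w, h))  # right half: label image
```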
This dataset is the same as what is available here from the Berkeley AI Research group.
The Cityscapes data available from cityscapes-dataset.com has the following license:
This dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:
Can you identify what objects are where in these images taken from a vehicle?
https://creativecommons.org/publicdomain/zero/1.0/
A comprehensive dataset of Pixar movies, including details on their release dates, directors, writers, cast, box office performance, and ratings. This dataset is gathered from official sources, including Pixar, Rotten Tomatoes, and IMDb, to provide accurate and relevant information for anyone interested in analyzing Pixar's films.
Pixar Animation Studios, known for its quality animation and storytelling, has produced a series of animated movies that have captivated audiences around the world. This dataset captures key details from Pixar’s filmography, including box office earnings, critical ratings, and character information, making it a valuable resource for those analyzing trends in animation, its movie plot lines and beloved characters, and movie ratings. For more information, visit Pixar, Rotten Tomatoes, and IMDb.
| Column | Description |
|---|---|
| movie | The title of the Pixar movie |
| date_released | The exact release date of the movie (e.g., YYYY-MM-DD) |
| year_released | The year the movie was released (e.g., YYYY) |
| length_min | Duration of the movie in minutes |
| plot_summary | A brief summary of the movie's plot |
| director | The name(s) of the director(s) of the movie |
| writer | The name(s) of the writer(s) of the movie |
| main_characters | List of main characters featured in the movie |
| type_of_characters | Description of the types of characters (e.g., human, toys, animals, vehicles) |
| main_voice_actors | List of actors who voiced the main characters |
| opening_weekend_box_office_sales | Gross box office earnings on the opening weekend in USD |
| total_worldwide_gross_sales | Total gross box office earnings worldwide in USD |
| rotten_tomatoes_rating | Rotten Tomatoes rating, typically out of 100 |
| imdb_rating | IMDb rating, typically out of 10 |
| movie_genre | Primary genre(s) of the movie (e.g., Animation, Adventure, Comedy) |
| movie_rating | The movie's rating (e.g., G, PG, PG-13) |
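A minimal pandas sketch using the columns above (the CSV file name is an assumption):

```python
import pandas as pd

# Hypothetical file name; the columns follow the data dictionary above.
pixar = pd.read_csv("pixar_movies.csv", parse_dates=["date_released"])

# Example: compare critic and audience scores against box office results.
summary = pixar[["movie", "rotten_tomatoes_rating", "imdb_rating",
                 "total_worldwide_gross_sales"]]
print(summary.sort_values("total_worldwide_gross_sales", ascending=False).head())
```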
This data was compiled, enriched, reviewed, and curated using Research by Rummage Labs. Research by Rummage Labs enables you to curate verified datasets to power your enterprise. Read more here: https://rummagelabs.com/.
Many people, including me, were confused by the many characters when starting to watch this famous TV series, so I decided to build a dataset for classifying and recognizing them.
The dataset contains images of each character, which can be used to train and evaluate recognition models.
1. Jon Snow
2. Daenerys Targaryen
3. Arya Stark
4. Sansa Stark
5. Jaime Lannister
6. Tyrion Lannister
If you like the content, please give an upvote.
Added 32,000 more locations. For information on data calculations, please refer to the methodology PDF document. Information on how to calculate the data yourself is also provided, as well as how to buy data for $1.29.
The database contains 32,000 records on US household income statistics and geo locations. The field description of the database is documented in the attached PDF file. To access all 348,893 records, on a scale roughly equivalent to a neighborhood (census tract), see the link below and make sure to upvote. Enjoy!
The dataset was originally developed for real estate and business investment research. Income is a vital element when determining both the quality and socioeconomic features of a given geographic location. The data was derived from over 36,000 files and covers 348,893 location records.
Only proper citation is required; please see the documentation for details. Have fun!
Golden Oak Research Group, LLC. "U.S. Income Database Kaggle". Published: 5 August 2017. Accessed: day, month, year.
Documentation for the 2011-2015 ACS 5-year estimates was provided by the U.S. Census Bureau. Retrieved August 2, 2017, from https://www2.census.gov/programs-surveys/acs/summary_file/2015/data/5_year_by_state/
If you find any inaccuracies, please tell us so we may provide you the most accurate data possible. You may reach us at: research_development@goldenoakresearch.com
For any questions, you can reach me at 585-626-2965.
Please note: this is my personal number, and email is preferred.
Check our data's accuracy: Census Fact Checker
Don't settle. Go big and win big. Optimize your potential. Overcome limitations and outperform expectations. Access all household income records on a scale roughly equivalent to a neighborhood; see the link below:
Website: Golden Oak Research. Kaggle deal: all databases for $1.29, limited time only.
A small startup with big dreams, giving the everyday, up-and-coming data scientist professional-grade data at affordable prices. It's what we do.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Tamil Product Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Tamil language.
Dataset Content & Diversity: Containing a total of 2000 images, this Tamil OCR dataset offers a diverse distribution across different types of product front images. In this dataset, you'll find a variety of text that includes product names, taglines, logos, company names, addresses, product content, etc. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Tamil text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, to build a balanced OCR dataset. The collection features images in portrait and landscape modes.
All these images were captured by native Tamil speakers to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to capture these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata: Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes metadata such as image orientation, country, language, and device information. Each image is properly renamed to correspond with the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Tamil text recognition models.
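A short sketch of inspecting that metadata with pandas (the file and column names are assumptions based on the fields described above):

```python
import pandas as pd

# File and column names are assumptions based on the metadata fields
# described above (image orientation, country, language, device information).
meta = pd.read_csv("tamil_product_images_metadata.csv")

# Inspect the orientation and capture-device distributions of the collection.
print(meta["orientation"].value_counts())
print(meta["device"].value_counts())
```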
Update & Custom Collection: We're committed to expanding this dataset by continuously adding more images with the assistance of our native Tamil crowd community.
If you require a custom product image OCR dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your specific project requirements using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage the power of this product image OCR dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Tamil language. Your journey to enhanced language understanding and processing starts here.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Arabic Product Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Arabic language.
Dataset Content & Diversity: Containing a total of 2000 images, this Arabic OCR dataset offers a diverse distribution across different types of product front images. In this dataset, you'll find a variety of text that includes product names, taglines, logos, company names, addresses, product content, etc. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Arabic text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, to build a balanced OCR dataset. The collection features images in portrait and landscape modes.
All these images were captured by native Arabic speakers to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to capture these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata: Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes metadata such as image orientation, country, language, and device information. Each image is properly renamed to correspond with the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Arabic text recognition models.
Update & Custom Collection: We're committed to expanding this dataset by continuously adding more images with the assistance of our native Arabic crowd community.
If you require a custom product image OCR dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your specific project requirements using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage the power of this product image OCR dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Arabic language. Your journey to enhanced language understanding and processing starts here.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Korean Product Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Korean language.
Dataset Content & Diversity: Containing a total of 2000 images, this Korean OCR dataset offers a diverse distribution across different types of product front images. In this dataset, you'll find a variety of text that includes product names, taglines, logos, company names, addresses, product content, etc. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Korean text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, to build a balanced OCR dataset. The collection features images in portrait and landscape modes.
All these images were captured by native Korean speakers to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to capture these images and maintain image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata: Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes metadata such as image orientation, country, language, and device information. Each image is properly renamed to correspond with the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Korean text recognition models.
Update & Custom Collection: We're committed to expanding this dataset by continuously adding more images with the assistance of our native Korean crowd community.
If you require a custom product image OCR dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your specific project requirements using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage the power of this product image OCR dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Korean language. Your journey to enhanced language understanding and processing starts here.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
[Image: ASL alphabet reference chart (Gallaudet)]
Images were taken with the chart above as a reference. There are many different dialects of ASL and that should be noted before proceeding.
This dataset was created as part of the SigNN project, to create a free and open-source real-time translation app for the American Sign Language (ASL) alphabet. The link to the GitHub can be found here: https://github.com/AriAlavi/SigNN
SigNN was built on top of Mediapipe: a free and open-source library that allows for easy implementation of real-time neural networks in many different environments, such as Android and Ubuntu. The link to the GitHub can be found here: https://github.com/google/mediapipe
We were able to reach a theoretical accuracy of 95% through this dataset. In practice, it does not seem to be correct 95% of the time, but we believe it is sufficient. The completed app is downloadable at: https://play.google.com/store/apps/details?id=com.signn.mediapipe.apps.handtrackinggpu
Inside each folder, you will find hundreds of pictures which were captured by our volunteers. Each picture is slightly different than the others as the angle and position are slightly varied. For our use case, we expect the user to face the camera head-on when translating, so angle variations of the sign are less than 45 degrees. Each image will also display no more than one hand.
We would like to thank all members of the SigNN team for collecting the data and writing the infrastructure required to capture, process, and train the neural network. It was not an easy task.
For this dataset, we would especially like to thank our data collectors who spent hours of their free time taking videos of their hands:
Kenny Yip, Albert Yuk Zhuang, Daniel Lohn, Rafael Trinidad, and Gokul Deep
We have learned through our year-long journey creating SigNN that the primary reason for the lack of other ASL alphabet translators was the lack of publicly available data. That is why we unanimously voted to make the data publicly available for others to create similar software. While we were the first to the app store, we hope to inspire better apps, and we invite people to learn from our mistakes and take notes where we made breakthroughs. If you are interested in more of our story, you can learn more at: signn.live
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the French Sticky Notes Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the French language.
Dataset Content & Diversity: Containing more than 2000 images, this French OCR dataset offers a wide distribution of different types of sticky note images. Within this dataset, you'll discover a variety of handwritten text, including quotes, sentences, and individual words on sticky notes. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images in a single handwriting style. This ensures diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible French text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these sticky notes were written, and the images captured, by native French speakers to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata: In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of French text recognition models.
Update & Custom Collection: We are committed to continually expanding this dataset by adding more images with the help of our native French crowd community.
If you require a customized OCR dataset containing sticky note images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage this sticky notes image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the French language. Your journey to improved language understanding and processing begins here.