100+ datasets found

Chinese Conversation Chat Dataset for Real Estate Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Chinese Conversation Chat Dataset for Real Estate Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/chinese-realestate-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 10,000 chat conversations, each focusing on specific Real Estate related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•Participant Details: 150+ native Chinese participants from the FutureBeeAI community.
•Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.
Topic Diversity
The chat dataset covers a wide range of conversations on Real Estate topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Real Estate use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Property Inquiry
•Rental Property Search & Availability
•Renovation Inquiries
•Property Features & Amenities Inquiry
•Investment Property Analysis & Advice
•Property History & Ownership Details, and many more
•Outbound Chats:
•New Property Listing Update
•Post Purchase Follow-ups
•Investment Opportunities & Property Recommendations
•Property Value Updates
•Customer Satisfaction Surveys, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Chinese Real Estate interactions. This diversity ensures the dataset accurately represents the language used by Chinese speakers in Real Estate contexts.
The dataset encompasses a wide array of language elements, including:
•Naming Conventions: Chats include a variety of Chinese personal and business names.
•Localized Details: Real-world addresses, emails, phone numbers, and other contact information, formatted according to the different Chinese-speaking regions.
•Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Chinese forms, adhering to local conventions.
•Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Chinese Real Estate conversations.
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Chinese Real Estate interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Real Estate customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction...
zenodo.org
data.niaid.nih.gov
zip
Updated Nov 29, 2023
Cite
Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li (2023). Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction Models [Dataset]. http://doi.org/10.5281/zenodo.7909511
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7909511
Dataset updated
Nov 29, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies: it has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real world, but they also pose nontrivial challenges for research on embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.

Dataset Details
The dataset consists of the four variants of Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
Subject matter triples file
fb+/-CVT+/-REV: One folder for each variant. Each folder contains 5 files: train.txt, valid.txt, test.txt, entity2id.txt, and relation2id.txt. Subject matter triples are the triples that belong to subject matter domains, that is, domains describing real-world facts.
Example of a row in train.txt, valid.txt, and test.txt:
2, 192, 0
Example of a row in entity2id.txt:
/g/112yfy2xr, 2
Example of a row in relation2id.txt:
/music/album/release_type, 192
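As a minimal sketch of how the three example rows fit together, the Python below decodes an ID triple back into Freebase identifiers. It assumes the comma-separated layout shown in the examples (the real files may use a different delimiter), and the function names `parse_id_map` and `decode_triple` are hypothetical helpers, not part of the dataset. The row for ID 0 is inferred from the MID "/m/02lx2r" named in the explanation of this example.

```python
def parse_id_map(lines):
    """Build an {id: label} dict from rows shaped like '/g/112yfy2xr, 2'."""
    id_to_label = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so the label itself is left untouched.
        label, raw_id = line.rsplit(",", 1)
        id_to_label[int(raw_id)] = label.strip()
    return id_to_label


def decode_triple(row, entities, relations):
    """Turn a 'subject_id, relation_id, object_id' row into labels."""
    subj_id, rel_id, obj_id = (int(part) for part in row.split(","))
    return entities[subj_id], relations[rel_id], entities[obj_id]


# Rows copied from the examples; the second entity row is inferred (assumption).
entity_rows = ["/g/112yfy2xr, 2", "/m/02lx2r, 0"]
relation_rows = ["/music/album/release_type, 192"]

entities = parse_id_map(entity_rows)
relations = parse_id_map(relation_rows)
triple = decode_triple("2, 192, 0", entities, relations)
print(triple)
```

For the full files, the same helpers could be fed the lines of entity2id.txt and relation2id.txt read from disk, after confirming the actual delimiter.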
Explanation
"/g/112yfy2xr" and "/m/02lx2r" are the MIDs of the subject entity and object entity, respectively. "/music/album/release_type" is the relationship between the two entities. 2, 192, and 0 are the IDs assigned by the authors to the objects.
Type system file
freebase_endtypes: Each row maps an edge type to its required subject type and object type.
Example
92, 47178872, 90
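One possible use of this file is to check whether a candidate triple respects the type system. The sketch below assumes each row is ordered as subject type ID, relation ID, object type ID, and that it is comma-separated as in the example. The helper names and the per-entity type sets passed to the check are hypothetical; the dataset description does not prescribe this procedure.

```python
def parse_endtypes(lines):
    """Map relation_id -> (required_subject_type, required_object_type)."""
    constraints = {}
    for line in lines:
        if not line.strip():
            continue
        subj_type, rel_id, obj_type = (int(part) for part in line.split(","))
        constraints[rel_id] = (subj_type, obj_type)
    return constraints


def satisfies_type_constraint(subject_types, rel_id, object_types, constraints):
    """Return True if both entities carry the types the relation requires."""
    required_subj, required_obj = constraints[rel_id]
    return required_subj in subject_types and required_obj in object_types


constraints = parse_endtypes(["92, 47178872, 90"])

# Hypothetical type sets for two entities, used only to exercise the check.
ok = satisfies_type_constraint({92}, 47178872, {90, 5}, constraints)
bad = satisfies_type_constraint({90}, 47178872, {90}, constraints)
print(ok, bad)
```

A check like this could be used to filter out type-inconsistent candidates when generating negative samples or ranking predictions, which is one way to account for the strong type system during evaluation.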
Explanation
"92" and "90" are the type IDs of the subject and object, respectively, for the relationship with ID "47178872".
Metadata files
object_types: Each row maps the MID of a Freebase object to a type it belongs to.
Example
/g/11b41c22g, /type/object/type, /people/person
Explanation
The entity with MID "/g/11b41c22g" has a type "/people/person"
object_names: Each row maps the MID of a Freebase object to its textual label.
Example
/g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
Explanation
The entity with MID "/g/11b78qtr5m" has name "Viroliano Tries Jazz" in English.
object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
Example
/m/05v3y9r, /type/object/id, "/music/live_album/concert"
Explanation
The entity with MID "/m/05v3y9r" can be read by a human as a music concert live album.
domains_id_label: Each row maps the MID of a Freebase domain to its label.
Example
/m/05v4pmy, geology, 77
Explanation
The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
types_id_label: Each row maps the MID of a Freebase type to its label.
Example
/m/01xljxh, /government/political_party, 147
Explanation
The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
entities_id_label: Each row maps the MID of a Freebase entity to its label.
Example
/g/11b78qtr5m, Viroliano Tries Jazz, 2234
Explanation
The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
properties_id_label: Each row maps the MID of a Freebase property to its label.
Example
/m/010h8tp2, /comedy/comedy_group/members, 47178867
Explanation
The object with MID "/m/010h8tp2" in Freebase is a property (relation/edge); it has label "/comedy/comedy_group/members" and id "47178867" in our dataset.
uri_original2simplified and uri_simplified2original: The mapping from original URIs to simplified URIs and the mapping from simplified URIs to original URIs, respectively.
Example
uri_original2simplified
"http://rdf.freebase.com/ns/type.property.unique": "/type/property/unique"
uri_simplified2original
"/type/property/unique": "http://rdf.freebase.com/ns/type.property.unique"
Explanation
The URI "http://rdf.freebase.com/ns/type.property.unique" in the original Freebase RDF dataset is simplified into "/type/property/unique" in our dataset.
The identifier "/type/property/unique" in our dataset has URI http://rdf.freebase.com/ns/type.property.unique in the original Freebase RDF dataset.
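The single example above suggests a simple rule: drop the namespace prefix and turn dots into slashes. The sketch below encodes that rule in Python. The rule is inferred from one example and is an assumption, as are the function names `simplify_uri` and `restore_uri`; the shipped mapping files remain the authoritative source.

```python
FREEBASE_NS = "http://rdf.freebase.com/ns/"


def simplify_uri(uri):
    """Convert an original Freebase RDF URI to the simplified slash form."""
    if not uri.startswith(FREEBASE_NS):
        raise ValueError(f"not a Freebase namespace URI: {uri}")
    return "/" + uri[len(FREEBASE_NS):].replace(".", "/")


def restore_uri(simplified):
    """Convert a simplified identifier back to the original RDF URI."""
    return FREEBASE_NS + simplified.lstrip("/").replace("/", ".")


original = "http://rdf.freebase.com/ns/type.property.unique"
simplified = simplify_uri(original)
print(simplified)
print(restore_uri(simplified) == original)
```

Identifiers whose segments themselves contain a dot or a slash would not round-trip under this rule, which is a good reason to prefer lookups in uri_original2simplified and uri_simplified2original over recomputing the mapping.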
Arabic Conversation Chat Dataset for Real Estate Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Arabic Conversation Chat Dataset for Real Estate Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/arabic-realestate-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 10,000 chat conversations, each focusing on specific Real Estate related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•Participant Details: 150+ native Arabic participants from the FutureBeeAI community.
•Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.
Topic Diversity
The chat dataset covers a wide range of conversations on Real Estate topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Real Estate use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Property Inquiry
•Rental Property Search & Availability
•Renovation Inquiries
•Property Features & Amenities Inquiry
•Investment Property Analysis & Advice
•Property History & Ownership Details, and many more
•Outbound Chats:
•New Property Listing Update
•Post Purchase Follow-ups
•Investment Opportunities & Property Recommendations
•Property Value Updates
•Customer Satisfaction Surveys, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Arabic Real Estate interactions. This diversity ensures the dataset accurately represents the language used by Arabic speakers in Real Estate contexts.
The dataset encompasses a wide array of language elements, including:
•Naming Conventions: Chats include a variety of Arabic personal and business names.
•Localized Details: Real-world addresses, emails, phone numbers, and other contact information, formatted according to the different Arabic-speaking regions.
•Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Arabic forms, adhering to local conventions.
•Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Arabic Real Estate conversations.
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Arabic Real Estate interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Real Estate customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
English Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). English Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-healthcare-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The English Healthcare Chat Dataset is a rich collection of over 12,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in English-speaking regions.
Participant & Chat Overview
•Participants: 200+ native English speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Positive, neutral, and negative outcomes included
Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
•Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
•Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of English healthcare communication and includes:
•Authentic Naming Patterns: English personal names, clinic names, and brands
•Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional English formats
•Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with English-speaking regions
•Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
Applications
Telugu Open Ended Classification Prompt & Response Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Telugu Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/telugu-open-ended-classification-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Welcome to the Telugu Open Ended Classification Prompt-Response Dataset—an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.
Dataset Content:
This open-ended classification dataset comprises a diverse set of prompts and responses. Each prompt contains the input text to be classified and may also contain a task instruction, context, constraints, and restrictions, while the completion contains the best classification category as the response. Both prompts and completions are in the Telugu language. Because this is an open-ended dataset, the prompt does not list candidate categories to choose from.
These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Telugu speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Prompt Diversity:
To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.
Response Formats:
To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, and single sentence type of response. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details:
This fully labeled Telugu Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.
Quality and Accuracy:
Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The Telugu version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization:
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Telugu Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
Spanish Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Spanish Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/spanish-healthcare-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Spanish Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Spanish-speaking regions.
Participant & Chat Overview
•Participants: 150+ native Spanish speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Positive, neutral, and negative outcomes included
Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
•Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
•Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of Spanish healthcare communication and includes:
•Authentic Naming Patterns: Spanish personal names, clinic names, and brands
•Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Spanish formats
•Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Spanish-speaking regions
•Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
Applications
Hierarchical Representations of Freebase Topics
figshare.com
search.dataone.org
application/x-rar
Updated May 31, 2023
Cite
Mahmoud Elbattah (2023). Hierarchical Representations of Freebase Topics [Dataset]. http://doi.org/10.6084/m9.figshare.6530825.v3
Explore at:
Available download formats: application/x-rar
Unique identifier
https://doi.org/10.6084/m9.figshare.6530825.v3
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Mahmoud Elbattah
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains more than 21M hierarchical relationships covering ≈10M topics extracted from the Freebase knowledge base. The topics span the various categories of Freebase, including Science & Technology, Arts & Entertainment, Sports, Society, Products & Services, Transportation, Time & Space, Special Interests, and Commons. The relationships describe the hierarchies of topics in terms of Types, Domains, and Categories. For example, ‘Albert Einstein’ can be found as a topic that is a sub-class of ‘Person’, belonging to the ‘People’ domain and the ‘Society’ category, while another entity named ‘Albert Einstein’ can be found as a sub-class of ‘Book’, belonging to the ‘Books’ domain and the ‘Arts & Entertainment’ category. The dataset is published in JSON and CSV formats; sample files are provided to help explore how the dataset is structured. The dataset is believed to be useful for studying the inter-related connections among topics in different domains of knowledge. The first author may be contacted at mahmoud.elbattah@nuigalway.ie for more information.
Please cite the following paper when using the dataset: Mahmoud Elbattah, Mohamed Roushdy, Mostafa Aref, Abdel-Badeeh M. Salem. “Large-Scale Entity Clustering Using Graph-Based Structural Similarity within Knowledge Graphs”, Big Data Analytics: Tools, Technology for Effective Planning, CRC Press. https://www.researchgate.net/publication/321716589_Large-Scale_Entity_Clustering_Based_on_Structural_Similarity_within_Knowledge_Graphs
List of Companies in India
kaggle.com
Updated Jan 26, 2021
Cite
Alisha Gera (2021). List of Companies in India [Dataset]. https://www.kaggle.com/alishagera/list-of-companies-in-india/notebooks
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 26, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Alisha Gera
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
India
Description
Company Data

The dataset is an amalgamation of:
- data that was scraped, comprising a comprehensive list of companies available on the job site AmbitionBox
- Companies dataset

Content

Each entry contains the following information:
Name - Name of a company
Rating - Average rating of a company
Reviews - Number of reviews
Domain - Company type
Year Old - How old the company is
Location - Office locations as per the AmbitionBox website

Inspiration

How many companies in India are listed on job sites?
What are the top-rated companies in India?
Which companies have the largest workforce?
What are some newly established companies?
What are the oldest companies in India?
What is the most reviewed company?
E-commerce - Users of a French C2C fashion store
kaggle.com
Updated Feb 24, 2024
Cite
Jeffrey Mvutu Mabilama (2024). E-commerce - Users of a French C2C fashion store [Dataset]. https://www.kaggle.com/jmmvutu/ecommerce-users-of-a-french-c2c-fashion-store/notebooks
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 24, 2024
Dataset provided by
Kaggle
Authors
Jeffrey Mvutu Mabilama
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
French
Description
Foreword

This users dataset is a preview of a much bigger dataset, with lots of related data (product listings of sellers, comments on listed products, etc.).

My Telegram bot will answer your queries and allow you to contact me.

Context

There are a lot of unknowns when running an E-commerce store, even when you have analytics to guide your decisions.

Users are an important factor in an e-commerce business. This is especially true in a C2C-oriented store, since they are both the suppliers (by uploading their products) AND the customers (by purchasing other users' articles).

This dataset aims to serve as a benchmark for an e-commerce fashion store. Using this dataset, you may want to try to understand what you can expect of your users and estimate in advance what your growth may be.

For instance, if you see that most of your users are not very active, you may look into this dataset to compare your store's performance.

If you think this kind of dataset may be useful or if you liked it, don't forget to show your support or appreciation with an upvote/comment. You may even include how you think this dataset might be of use to you. This way, I will be more aware of specific needs and be able to adapt my datasets to better suit your needs.

This dataset is part of a preview of a much larger dataset. Please contact me for more.

Content

The data was scraped from a successful online C2C fashion store with over 10M registered users. The store was first launched in Europe around 2009 then expanded worldwide.

Visitors vs Users: Visitors do not appear in this dataset. Only registered users are included. "Visitors" cannot purchase an article but can view the catalog.

Acknowledgements

We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration

Questions you might want to answer using this dataset:

Are e-commerce users interested in social network features?

Are my users active enough (compared to those of this dataset)?

How likely are people from other countries to sign up on a C2C website?

How many users are likely to drop off after years of using my service?

Example works:

Report(s) made using SQL queries can be found on the data.world page of the dataset.

Notebooks may be found on the Kaggle page of the dataset.

License

CC-BY-NC-SA 4.0

For other licensing options, contact me.
🦈 Shark Tank India dataset 🇮🇳
kaggle.com
Updated Apr 20, 2025
Cite
Satya Thirumani (2025). 🦈 Shark Tank India dataset 🇮🇳 [Dataset]. https://www.kaggle.com/datasets/thirumani/shark-tank-india
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 20, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Satya Thirumani
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Shark Tank India Data set.

Shark Tank India - Season 1 to season 4 information, with 80 fields/columns and 630+ records.

All seasons/episodes of 🦈 SHARKTANK INDIA 🇮🇳 were broadcast on SonyLiv OTT/Sony TV.

Here is the data dictionary for the Shark Tank India seasons dataset.

Season Number - Season number

Startup Name - Company name or product name

Episode Number - Episode number within the season

Pitch Number - Overall pitch number

Season Start - Season first aired date

Season End - Season last aired date

Original Air Date - Episode original/first aired date, on OTT/TV

Episode Title - Episode title in SonyLiv

Anchor - Name of the episode presenter/host

Industry - Industry name or type

Business Description - Business Description

Company Website - Company Website URL

Started in - Year in which startup was started/incorporated

Number of Presenters - Number of presenters

Male Presenters - Number of male presenters

Female Presenters - Number of female presenters

Transgender Presenters - Number of transgender/LGBTQ presenters

Couple Presenters - Whether the presenters are a married couple (wife/husband), 1-yes, 0-no

Pitchers Average Age - All pitchers average age, <30 young, 30-50 middle, >50 old

Pitchers City - Presenter's town/city or place where company head office exists

Pitchers State - Indian state pitcher hails from or state where company head office exists

Yearly Revenue - Yearly revenue, in lakhs INR, -1 means negative revenue, 0 means pre-revenue

Monthly Sales - Total monthly sales, in lakhs

Gross Margin - Gross margin/profit of company, in percentages

Net Margin - Net margin/profit of company, in percentages

EBITDA - Earnings Before Interest, Taxes, Depreciation, and Amortization

Cash Burn - Whether the company is making a loss in the current year, i.e. burning/paying money out of pocket (yes/no)

SKUs - Stock Keeping Units or number of varieties, at the time of pitch

Has Patents - Pitcher has Patents/Intellectual property (filed/granted), at the time of pitch

Bootstrapped - Startup is bootstrapped or not (yes/no)

Part of Match off - Competition between two similar brands, pitched at same time

Original Ask Amount - Original Ask Amount, in lakhs INR

Original Offered Equity - Original Offered Equity, in percentages

Valuation Requested - Valuation Requested, in lakhs INR

Received Offer - Received offer or not, 1-received, 0-not received

Accepted Offer - Accepted offer or not, 1-accepted, 0-rejected

Total Deal Amount - Total Deal Amount, in lakhs INR

Total Deal Equity - Total Deal Equity, in percentages

Total Deal Debt - Total Deal debt/loan amount, in lakhs INR

Debt Interest - Debt interest rate, in percentages

Deal Valuation - Deal Valuation, in lakhs INR

Number of sharks in deal - Number of sharks involved in deal

Deal has conditions - Deal has conditions or not? (yes or no)

Royalty Percentage - Royalty percentage, if it's royalty deal

Royalty Recouped Amount - Royalty recouped amount, if it's royalty deal, in lakhs

Advisory Shares Equity - Deal with Advisory shares or equity, in percentages

Namita Investment Amount - Namita Investment Amount, in lakhs INR

Namita Investment Equity - Namita Investment Equity, in percentages

Namita Debt Amount - Namita Debt Amount, in lakhs INR

Vineeta Investment Amount - Vineeta Investment Amount, in lakhs INR

Vineeta Investment Equity - Vineeta Investment Equity, in percentages

Vineeta Debt Amount - Vineeta Debt Amount, in lakhs INR

Anupam Investment Amount - Anupam Investment Amount, in lakhs INR

Anupam Investment Equity - Anupam Investment Equity, in percentages

Anupam Debt Amount - Anupam Debt Amount, in lakhs INR

Aman Investment Amount - Aman Investment Amount, in lakhs INR

Aman Investment Equity - Aman Investment Equity, in percentages

Aman Debt Amount - Aman Debt Amount, in lakhs INR

Peyush Investment Amount - Peyush Investment Amount, in lakhs INR

Peyush Investment Equity - Peyush Investment Equity, in percentages

Peyush Debt Amount - Peyush Debt Amount, in lakhs INR

Ritesh Investment Amount - Ritesh Investment Amount, in lakhs INR

Ritesh Investment Equity - Ritesh Investment Equity, in percentages

Ritesh Debt Amount - Ritesh Debt Amount, in lakhs INR

Amit Investment Amount - Amit Investment Amount, in lakhs INR

Amit Investment Equity - Amit Investment Equity, in percentages

Amit Debt Amount - Amit Debt Amount, in lakhs INR

Guest Investment Amount - Guest Investment Amount, in lakhs INR

Guest Investment Equity - Guest Investment Equity, in percentages

Guest Debt Amount - Guest Debt Amount, in lakhs INR

Invested Guest Name - Name of the guest(s) who invested in deal

All Guest Names - Name of all guests, who are present in episode

Namita Present - Whether Namita is present in the episode or not

Vineeta Present - Whether Vineeta is present in the episode or not

Anupam ...
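The data dictionary above expresses money in lakhs INR and equity in percentages. The following is a minimal sketch of how those units could be handled. It assumes that a requested valuation follows the usual relationship (ask amount divided by the equity fraction), which the description does not state explicitly. The function names and the sample figures are hypothetical and are not taken from the dataset.

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees


def lakhs_to_inr(amount_lakhs: float) -> float:
    """Convert an amount in lakhs INR to plain rupees."""
    return amount_lakhs * LAKH


def implied_valuation_lakhs(ask_lakhs: float, equity_pct: float) -> float:
    """Valuation implied by asking `ask_lakhs` for `equity_pct` percent.

    Assumption: valuation = ask / (equity / 100).
    """
    if equity_pct <= 0:
        raise ValueError("equity percentage must be positive")
    return ask_lakhs * 100 / equity_pct


def describe_revenue(yearly_revenue_lakhs: float) -> str:
    """Decode the special values documented for 'Yearly Revenue'."""
    if yearly_revenue_lakhs == -1:
        return "negative revenue"
    if yearly_revenue_lakhs == 0:
        return "pre-revenue"
    return f"{yearly_revenue_lakhs} lakhs INR"


# Illustrative pitch (made-up numbers): asking 50 lakhs for 5% equity.
valuation = implied_valuation_lakhs(50, 5)
print(valuation, lakhs_to_inr(valuation))
```

Comparing the implied figure with the Valuation Requested column would be one way to check whether this assumption holds for the actual data.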
Swedish Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Swedish Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/swedish-healthcare-domain-conversation-text-dataset
Explore at:
wav (available download formats)
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Swedish Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Swedish-speaking regions.
Participant & Chat Overview
•Participants: 150+ native Swedish speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Positive, neutral, and negative outcomes included
Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
•Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
•Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of Swedish healthcare communication and includes:
•Authentic Naming Patterns: Swedish personal names, clinic names, and brands
•Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Swedish formats
•Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Swedish-speaking regions
•Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
•General inquiries
•Detailed problem-solving
•Routine status updates
•Treatment recommendations
•Support and feedback interactions
Each conversation typically includes these structural components:
•Greetings and verification
•Information gathering
•Problem definition
•Solution delivery
•Closing messages
•Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
•Full message history with clear speaker labels
•Participant identifiers
•Metadata (e.g., topic tags, region, sentiment)
•Compatibility with common NLP and ML pipelines
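As a rough illustration of the structure listed above, the sketch below parses one conversation in JSON form and summarizes its turns and word count. The field names (conversation_id, metadata, messages, speaker, text) and the sample content are hypothetical; the actual schema is not given in this description.

```python
import json
from collections import Counter

# Hypothetical record layout; the real field names may differ.
SAMPLE = """
{
  "conversation_id": "example-001",
  "metadata": {"topic": "appointment scheduling", "region": "SE", "sentiment": "positive"},
  "messages": [
    {"speaker": "customer", "text": "Hej, jag vill boka en tid."},
    {"speaker": "agent", "text": "Absolut, vilken dag passar dig?"},
    {"speaker": "customer", "text": "Tisdag passar bra, tack."}
  ]
}
"""


def summarize(raw: str) -> dict:
    """Count turns (overall and per speaker) and whitespace-separated words."""
    record = json.loads(raw)
    messages = record["messages"]
    by_speaker = Counter(m["speaker"] for m in messages)
    words = sum(len(m["text"].split()) for m in messages)
    return {
        "turns": len(messages),
        "turns_by_speaker": dict(by_speaker),
        "words": words,
        "sentiment": record["metadata"]["sentiment"],
    }


summary = summarize(SAMPLE)
print(summary)
```

A summary like this could be used to check a downloaded file against the stated ranges of 300–700 words and 50–150 turns per chat.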
Applications
Synthetic Financial Datasets For Fraud Detection
kaggle.com
zip
Updated Apr 3, 2017
Cite
Edgar Lopez-Rojas (2017). Synthetic Financial Datasets For Fraud Detection [Dataset]. https://www.kaggle.com/ealaxi/paysim1
Explore at:
zip (186385561 bytes), available download formats
Dataset updated
Apr 3, 2017
Authors
Edgar Lopez-Rojas
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

There is a lack of publicly available datasets on financial services, especially in the emerging domain of mobile money transactions. Financial datasets are important to many researchers, and in particular to us, as we perform research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.

We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

Content

PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, which is the provider of the mobile financial service and currently operates in more than 14 countries around the world.

This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle.

Headers

This is a sample row, followed by an explanation of the headers:

1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0

step - maps to a unit of time in the real world. In this case, 1 step is 1 hour. Total steps: 744 (a 31-day simulation, since 744 / 24 = 31).

type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for recipients whose name starts with M (Merchants).

newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for recipients whose name starts with M (Merchants).

isFraud - Marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
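The header explanation above can be sketched in code as follows. This is a minimal example that assumes the columns appear in the same order in which the headers are listed and that the file is plain comma-separated text. The helper names are hypothetical, and the flagging rule is only a restatement of the isFlaggedFraud description, not the simulator's actual logic.

```python
import csv
from io import StringIO

# Column order assumed to match the order of the header explanations.
HEADERS = [
    "step", "type", "amount", "nameOrig", "oldbalanceOrg", "newbalanceOrig",
    "nameDest", "oldbalanceDest", "newbalanceDest", "isFraud", "isFlaggedFraud",
]
FLOAT_FIELDS = ("amount", "oldbalanceOrg", "newbalanceOrig",
                "oldbalanceDest", "newbalanceDest")
INT_FIELDS = ("step", "isFraud", "isFlaggedFraud")
FLAG_THRESHOLD = 200_000  # transfers above this amount count as illegal attempts

SAMPLE_ROW = "1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0"


def parse_row(line: str) -> dict:
    """Parse one CSV line into a dict with typed values."""
    values = next(csv.reader(StringIO(line)))
    row = dict(zip(HEADERS, values))
    for key in FLOAT_FIELDS:
        row[key] = float(row[key])
    for key in INT_FIELDS:
        row[key] = int(row[key])
    return row


def is_merchant(name: str) -> bool:
    """Names starting with 'M' are merchants, which carry no balance information."""
    return name.startswith("M")


def exceeds_flag_threshold(row: dict) -> bool:
    """True for a single transfer larger than the documented threshold."""
    return row["type"] == "TRANSFER" and row["amount"] > FLAG_THRESHOLD


row = parse_row(SAMPLE_ROW)
# For this sample, the originator's balance drops by exactly the amount paid.
balance_gap = row["oldbalanceOrg"] - row["amount"] - row["newbalanceOrig"]
print(row["type"], is_merchant(row["nameDest"]), round(balance_gap, 2))
```

In the sample row the recipient is a merchant, which is consistent with both destination balances being 0.0.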

Past Research

There are 5 similar files that contain the runs of 5 different scenarios. These files are explained in more detail in chapter 7 of my PhD thesis (available here: http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932).

We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, which matches the original logs. Each run took around 45 minutes on an Intel i7 processor with 16GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 categories: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

Acknowledgements

This work is part of the research project "Scalable resource-efficient systems for big data analytics" funded by the Knowledge Foundation (grant: 20140032) in Sweden.

Please refer to this dataset using the following citations:

PaySim first paper of the simulator:

E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016
Bahasa Open Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Bahasa Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/bahasa-open-ended-question-answer-text-dataset
Explore at:
wav (available download formats)
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
The Bahasa Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Bahasa language, advancing the field of artificial intelligence.
Dataset Content:
This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Bahasa. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Bahasa speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph answers. The answers also contain text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:
This fully labeled Bahasa Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
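To make the annotation fields above concrete, here is a small sketch that checks a record for the listed fields and filters records by annotation value. Only the ten field names come from the description; the sample values (including their exact spellings) and the helper names are hypothetical.

```python
# Annotation fields as listed in the dataset description.
ANNOTATION_FIELDS = (
    "id", "language", "domain", "question_length", "prompt_type",
    "question_category", "question_type", "complexity", "answer_type",
    "rich_text",
)


def missing_fields(record: dict) -> list:
    """Return the annotation fields that a record does not contain."""
    return [name for name in ANNOTATION_FIELDS if name not in record]


def filter_by(records: list, **criteria) -> list:
    """Keep records whose annotations match every given key=value pair."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


# Hypothetical records; the real value vocabularies may differ.
records = [
    {"id": 1, "language": "Bahasa", "domain": "science", "question_length": 12,
     "prompt_type": "instruction", "question_category": "fact-based",
     "question_type": "direct", "complexity": "easy",
     "answer_type": "single sentence", "rich_text": False},
    {"id": 2, "language": "Bahasa", "domain": "history", "question_length": 20,
     "prompt_type": "instruction", "question_category": "opinion-based",
     "question_type": "direct", "complexity": "hard",
     "answer_type": "paragraph", "rich_text": True},
]

hard_questions = filter_by(records, complexity="hard")
print([r["id"] for r in hard_questions], missing_fields(records[0]))
```

Filtering on these annotations is one way to build subsets, for example only hard, fact-based questions, before fine-tuning.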
Quality and Accuracy:
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in Bahasa are grammatically accurate and free of spelling errors. No copyrighted, toxic, or harmful content is used while building this dataset.
Continuous Updates and Customization:
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Bahasa Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
top 200 cryptocurriences
kaggle.com
Updated Aug 12, 2021
Cite
BURLAGADDA SHYAM (2021). top 200 cryptocurriences [Dataset]. https://www.kaggle.com/burlagaddashyam/top-200-cryptocurriences/code
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 12, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
BURLAGADDA SHYAM
Description
What is Blockchain? Blockchain seems complicated, and it definitely can be, but its core concept is really quite simple. A blockchain is a type of database. To be able to understand blockchain, it helps to first understand what a database actually is.

A database is a collection of information that is stored electronically on a computer system. Information, or data, in databases is typically structured in table format to allow for easier searching and filtering for specific information. What is the difference between someone using a spreadsheet to store information rather than a database?

Spreadsheets are designed for one person, or a small group of people, to store and access limited amounts of information. In contrast, a database is designed to house significantly larger amounts of information that can be accessed, filtered, and manipulated quickly and easily by any number of users at once.

Large databases achieve this by housing data on servers that are made of powerful computers. These servers can sometimes be built using hundreds or thousands of computers in order to have the computational power and storage capacity necessary for many users to access the database simultaneously. While a spreadsheet or database may be accessible to any number of people, it is often owned by a business and managed by an appointed individual that has complete control over how it works and the data within it.

Conclusion: the goal is to perform a thorough EDA of these 200 cryptocurrencies and explain them in detail with respect to their features.

aoj_legal_kg

kaggle.com

Updated Apr 30, 2024

Cite

Abdellah Hamouda (2024). aoj_legal_kg [Dataset]. https://www.kaggle.com/datasets/abdellahhamouda/aoj-legal-kg/data

Explore at:

Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Apr 30, 2024

Dataset provided by

Kaggle

Authors

Abdellah Hamouda

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Context:

The legal domain is complex, with a vast body of knowledge governing societal and institutional interactions, encompassing specific terminology and covering legal concepts, procedures, and regulations. Information is typically found in legal documents like the official journal, which contain links to related documents. Access to this information is currently managed by the General Secretariat of the Algerian Government through a website, but the search interface has limitations, leading to a time-consuming process, particularly with imprecise keywords. This challenge highlights the need for innovative solutions to enhance discoverability and understanding.

Motivation:

Knowledge Graphs (KGs) offer a powerful approach to structuring and representing complex domain knowledge. Unlike traditional text documents, KGs organize information as entities and relationships, facilitating efficient information retrieval and reasoning. KGs also promote consistency and reduce redundancy by providing a single source of truth for domain knowledge. Moreover, KGs act as a grounding force for Large Language Models (LLMs), helping address the hallucination problem often encountered with ambiguous or incomplete input.

KG Construction Method:

The construction involves several steps: data collection and preprocessing, entity and relationship extraction, graph storage, and graph entity disambiguation.

- Data collection and preprocessing involve extracting text from documents and crawling the website database.
- The collected data is parsed to extract entities and relationships using rule-based approaches and Named Entity Recognition (NER) models. This step helps identify the most useful concepts.
- The extracted entities and relationships are then stored in a Neo4j graph database.
- Graph entity disambiguation is applied to resolve any ambiguity, ensuring the accuracy and consistency of the graph.

Content of the KG:

Nodes:

The table below details the specific entities identified within the AOJ and included in the legal KG. Each entity label has a set of properties and a corresponding description that clarifies its role within the legal domain.

Entity Label	Properties	Description
OFFICIAL_JOURNAL	number, year, date	the published official journal issues
LEGAL_TEXT	type, number, date, subject	legal texts published in the official journal
MINISTRY	name	the ministry of interest that creates and publishes legal texts
SECTOR	name	the sector where the ministry belongs
AUTHORITY	name	the authority that signed the text
ARTICLE	number, content	elements of legal texts
ORGANIZATION	name	organizations mentioned in legal texts
PERSON	name	person names mentioned in legal texts
LOCATION	name	geographic locations (places, countries, etc.) mentioned in legal texts

Edges:

Edge	Description
Published_in	Links LEGAL_TEXT entities to their corresponding OFFICIAL_JOURNAL entities, with the page number.
Mentions	Connects a LEGAL_TEXT entity to organization, person, or location entities mentioned in its title.
Concerns	Establishes a link between a LEGAL_TEXT entity and the relevant MINISTRY entity.
Related_to	Connects a LEGAL_TEXT entity to another, with properties like "abrogated by", "modified by", etc.
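The node and edge tables above can be illustrated with a tiny in-memory graph. This sketch is not the authors' Neo4j implementation; it uses plain Python objects, and every concrete value (journal number, decree numbers, ministry name, the "kind" property key) is made up for illustration. Only the labels, edge names, and property names such as number, year, date, and page come from the tables.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str          # e.g. OFFICIAL_JOURNAL, LEGAL_TEXT, MINISTRY
    properties: dict


@dataclass
class Edge:
    source: int         # index of the source node
    relation: str       # e.g. Published_in, Mentions, Concerns, Related_to
    target: int         # index of the target node
    properties: dict = field(default_factory=dict)


class LegalKG:
    """Minimal property graph: nodes and edges kept in two lists."""

    def __init__(self):
        self.nodes = []
        self.edges = []

    def add_node(self, label, **properties):
        self.nodes.append(Node(label, properties))
        return len(self.nodes) - 1

    def add_edge(self, source, relation, target, **properties):
        self.edges.append(Edge(source, relation, target, properties))

    def targets(self, source, relation):
        """Nodes reachable from `source` through edges named `relation`."""
        return [self.nodes[e.target] for e in self.edges
                if e.source == source and e.relation == relation]


kg = LegalKG()
journal = kg.add_node("OFFICIAL_JOURNAL", number=1, year=2024, date="2024-01-07")
decree = kg.add_node("LEGAL_TEXT", type="decree", number="24-01",
                     date="2024-01-02", subject="example subject")
older = kg.add_node("LEGAL_TEXT", type="decree", number="20-15",
                    date="2020-03-10", subject="example subject")
ministry = kg.add_node("MINISTRY", name="Example Ministry")

kg.add_edge(decree, "Published_in", journal, page=3)
kg.add_edge(decree, "Concerns", ministry)
kg.add_edge(older, "Related_to", decree, kind="modified by")

print(kg.targets(decree, "Published_in")[0].properties)
```

Storing the relation type ("modified by", "abrogated by") as an edge property mirrors the Related_to description and keeps the number of distinct edge names small.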

...

German Conversation Chat Dataset for Real Estate Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). German Conversation Chat Dataset for Real Estate Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/german-realestate-domain-conversation-text-dataset
Explore at:
wav (available download formats)
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on specific Real Estate related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•Participant Details: 200+ native German participants from the FutureBeeAI community.
•Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.
Topic Diversity
The chat dataset covers a wide range of conversations on Real Estate topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Real Estate use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Property Inquiry
•Rental Property Search & Availability
•Renovation Inquiries
•Property Features & Amenities Inquiry
•Investment Property Analysis & Advice
•Property History & Ownership Details, and many more
•Outbound Chats:
•New Property Listing Update
•Post Purchase Follow-ups
•Investment Opportunities & Property Recommendations
•Property Value Updates
•Customer Satisfaction Surveys, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in German Real Estate interactions. This diversity ensures the dataset accurately represents the language used by German speakers in Real Estate contexts.
The dataset encompasses a wide array of language elements, including:
•Naming Conventions: Chats include a variety of German personal and business names.
•Localized Details: Real-world addresses, emails, phone numbers, and other contact information according to different German-speaking regions.
•Temporal and Numeric Expressions: Dates, times, currencies, and numbers in German forms, adhering to local conventions.
•Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in German Real Estate conversations.
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to German Real Estate interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Real Estate customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
English Conversation Chat Dataset for Delivery & Logistics Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). English Conversation Chat Dataset for Delivery & Logistics Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-delivery-domain-conversation-text-dataset
Explore at:
wav (available download formats)
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on specific Delivery & Logistics related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•Participant Details: 200+ native English participants from the FutureBeeAI community.
•Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.
Topic Diversity
The chat dataset covers a wide range of conversations on Delivery & Logistics topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Delivery & Logistics use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Order Tracking
•Delivery Complaint
•Undeliverable Address
•Delivery Method Selection
•Return Process Enquiry
•Order Modification, and many more
•Outbound Chats:
•Delivery Confirmation
•Delivery Subscription
•Incorrect Address
•Missed Delivery Attempt
•Delivery Feedback
•Out-of-Stock Notification
•Delivery Satisfaction Survey, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in English Delivery & Logistics interactions. This diversity ensures the dataset accurately represents the language used by English speakers in Delivery & Logistics contexts.
The dataset encompasses a wide array of language elements, including:
•Naming Conventions: Chats include a variety of English personal and business names.
•Localized Details: Real-world addresses, emails, phone numbers, and other contact information according to different English-speaking regions.
•Temporal and Numeric Expressions: Dates, times, currencies, and numbers in English forms, adhering to local conventions.
•Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in English Delivery & Logistics conversations.
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to English Delivery & Logistics interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Delivery & Logistics customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow like:
•Greetings
•Authentication
•Information gathering
•Resolution identification
English Conversation Chat Dataset for Retail & E-commerce Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). English Conversation Chat Dataset for Retail & E-commerce Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-retail-domain-conversation-text-dataset
Explore at:
wav (available download formats)
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 12,000 chat conversations, each focusing on specific Retail & E-Commerce related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
•Participant Details: 200+ native English participants from the FutureBeeAI community.
•Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.
Topic Diversity
The chat dataset covers a wide range of conversations on Retail & E-Commerce topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Retail & E-Commerce use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
•Inbound Chats:
•Product Inquiry
•Return/Exchange Request
•Order Cancellation
•Refund Request
•Membership/Subscriptions Enquiry
•Order Cancellations, and many more
•Outbound Chats:
•Order Confirmation
•Cross-selling and Upselling
•Account Updates
•Loyalty Program Offers
•Special Offers and Promotions
•Customer Verification, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in English Retail & E-Commerce interactions. This diversity ensures the dataset accurately represents the language used by English speakers in Retail & E-Commerce contexts.
The dataset encompasses a wide array of language elements, including:
•Naming Conventions: Chats include a variety of English personal and business names.
•Localized Details: Real-world addresses, emails, phone numbers, and other contact information formatted according to different English-speaking regions.
•Temporal and Numeric Expressions: Dates, times, currencies, and numbers in English forms, adhering to local conventions.
•Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in English Retail & E-Commerce conversations.
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to English Retail & E-Commerce interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Retail & E-Commerce customer-agent interactions.
•Simple Inquiries
•Detailed Discussions
•Transactional Interactions
•Problem-Solving Dialogues
•Advisory Sessions
•Routine Checks and Follow-Ups
Each conversation contains various elements of conversation flow, such as:
•Greetings
•Authentication
•Information gathering
•Resolution identification
•Solution Delivery
•Closing and Follow-ups
Arabic Agent-Customer Chat Dataset for BFSI Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Arabic Agent-Customer Chat Dataset for BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/arabic-bfsi-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Arabic BFSI Chat Dataset is a comprehensive collection of over 10,000 text-based chat conversations between customers and call center agents. Focused on Banking, Financial Services, and Insurance (BFSI) interactions, this dataset captures real-world service dialogues, complete with domain-specific language, customer intents, and varied conversational flows.
Participant & Chat Overview
•Participants: 150 native Arabic speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns across both participants
•Chat Types: Inbound and outbound
•Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes
Topic Diversity
This dataset reflects the wide range of customer interactions typically encountered in the BFSI sector:
•Inbound Chats (Customer-Initiated)
•Account opening and management
•Transaction-related queries
•Loan inquiries and applications
•Credit card issues
•Insurance questions and requests
•Outbound Chats (Agent-Initiated)
•Product and service promotions
•Cross-selling and upselling efforts
•Loan follow-ups and reminders
•Customer retention and loyalty program outreach
•Insurance policy renewals and verifications
This topic spread ensures applicability across customer service automation, intent classification, and domain-specific model training.
Language Nuance & Cultural Relevance
Conversations capture natural Arabic as spoken in BFSI contexts, incorporating:
•Names & Branding: Realistic Arabic personal and business names
•Local Contextual Elements: Emails, phone numbers, addresses, time/date references, and currency in Arabic format
•Colloquial Speech & Slang: Regional idioms, informal expressions, and domain-specific jargon
•Numerical Expressions: Use of Arabic numerals, amounts, dates, and measurements as per local conventions
This linguistic richness enables the training of models that can understand real-world customer queries in culturally relevant contexts.
Conversational Structure & Flow
The dataset reflects structured dialogue flow and interaction dynamics seen in BFSI customer service environments:
•Types of Conversations:
•Simple inquiries
•Complex problem-solving discussions
•Transactional updates
•Advisory sessions
•Follow-ups and routine status checks
•Typical Chat Components:
•Greetings and opening
•Customer authentication
•Information gathering
Danish Agent-Customer Chat Dataset for Delivery & Logistics
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Danish Agent-Customer Chat Dataset for Delivery & Logistics [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/danish-delivery-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Danish Delivery & Logistics Chat Dataset is a comprehensive collection of over 10,000 text-based conversations between customers and call center agents. Focused on real-world delivery and logistics interactions, this dataset captures the language, tone, and service patterns essential for developing robust Danish-language conversational AI, chatbots, and NLP systems across the delivery ecosystem.
Participant & Chat Overview
•Participants: 150+ native Danish speakers from the FutureBeeAI Crowd Community
•Conversation Length: 300–700 words per chat
•Turns per Chat: 50–150 dialogue turns between customer and agent
•Chat Types: Inbound (customer-initiated) and outbound (agent-initiated)
•Sentiment Coverage: Includes positive, neutral, and negative interaction outcomes
Topic Diversity
The dataset spans a wide range of delivery and logistics scenarios, ensuring strong coverage across customer service and operational workflows.
•Inbound Chats (Customer-Initiated)
•Order tracking and delivery status inquiries
•Complaints about late or missing deliveries
•Undeliverable or incorrect address resolution
•Return process and pickup scheduling
•Order modifications and change requests
•Enquiries about delivery method options
•Outbound Chats (Agent-Initiated)
•Delivery confirmations and dispatch updates
•Subscription renewal or delivery reminders
•Notification of delivery issues or missed attempts
•Out-of-stock or product unavailability alerts
•Satisfaction surveys and service feedback collection
•Address verification for upcoming deliveries
This topical spread ensures wide applicability in both customer support automation and logistics optimization use cases.
Language Diversity & Realism
The conversations reflect the authentic language and interaction style of Danish-speaking customers and delivery agents, incorporating:
•Naming Patterns: Personal names, business names, and logistics company references
•Localized Details: Danish-format emails, phone numbers, regional addresses, and delivery zones
•Temporal and Numeric Expressions: Dates, delivery windows, prices, and tracking IDs in Danish formats
•Slang and Informal Speech: Everyday expressions and delivery-specific idioms used across Danish dialects
This linguistic realism enables the development of context-aware and naturally responsive AI systems.
Conversational Structure & Flow
The dataset captures a diverse range of interaction types and delivery workflows:
•Dialogue Types:
•Quick status checks and confirmations
•Multi-turn issue resolution
•Process walkthroughs and guidance
•Feedback and escalation handling
•Common Flow Elements:
•Greetings and caller verification
•Request or complaint initiation


