14 datasets found
  1. Bitext Gen AI Chatbot Customer Support Dataset

    • kaggle.com
    zip
    Updated Mar 18, 2024
    Cite
    Bitext (2024). Bitext Gen AI Chatbot Customer Support Dataset [Dataset]. https://www.kaggle.com/datasets/bitext/bitext-gen-ai-chatbot-customer-support-dataset
    Explore at:
    zip (3,007,665 bytes)
    Dataset updated
    Mar 18, 2024
    Authors
    Bitext
    License

    https://cdla.io/sharing-1-0/

    Description

    Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants

    Overview

    This dataset can be used to train Large Language Models such as GPT, Llama2 and Falcon, for both fine-tuning and domain adaptation.

    The dataset has the following specs:

    • Use Case: Intent Detection
    • Vertical: Customer Service
    • 27 intents assigned to 10 categories
    • 26,872 question/answer pairs, around 1,000 per intent
    • 30 entity/slot types
    • 12 different types of language generation tags

    The categories and intents have been selected from Bitext's collection of 20 vertical-specific datasets, covering the intents that are common across all 20 verticals. The verticals are:

    • Automotive, Retail Banking, Education, Events & Ticketing, Field Services, Healthcare, Hospitality, Insurance, Legal Services, Manufacturing, Media Streaming, Mortgages & Loans, Moving & Storage, Real Estate/Construction, Restaurant & Bar Chains, Retail/E-commerce, Telecommunications, Travel, Utilities, Wealth Management

    For a full list of verticals and their intents, see https://www.bitext.com/chatbot-verticals/.

    The question/answer pairs have been generated using a hybrid methodology that uses natural texts as source text, NLP technology to extract seeds from these texts, and NLG technology to expand the seed texts. All steps in the process are curated by computational linguists.

    Dataset Token Count

    The dataset contains an extensive amount of text data across its 'instruction' and 'response' columns. After processing and tokenizing the dataset, we've identified a total of 3.57 million tokens. This rich set of tokens is essential for training advanced LLMs for conversational AI, generative AI, and question-answering (Q&A) models.
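
    As a rough illustration, the token count can be reproduced with a few lines of Python. The tokenizer choice and CSV filename below are assumptions, since the description does not state which tokenizer produced the 3.57M figure:

        import pandas as pd
        import tiktoken  # pip install tiktoken

        # Hypothetical filename; use the CSV name from the Kaggle download.
        df = pd.read_csv("bitext_customer_support.csv")

        # Tokenizer is an assumption (cl100k_base); the description does
        # not say which tokenizer was used for the published count.
        enc = tiktoken.get_encoding("cl100k_base")
        total = sum(len(enc.encode(str(t)))
                    for col in ("instruction", "response")
                    for t in df[col])
        print(f"total tokens: {total:,}")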

    Fields of the Dataset

    Each entry in the dataset contains the following fields:

    • flags: tags (explained below in the Language Generation Tags section)
    • instruction: a user request from the Customer Service domain
    • category: the high-level semantic category for the intent
    • intent: the intent corresponding to the user instruction
    • response: an example expected response from the virtual assistant

    Categories and Intents

    The categories and intents covered by the dataset are:

    • ACCOUNT: create_account, delete_account, edit_account, recover_password, registration_problems, switch_account
    • CANCELLATION_FEE: check_cancellation_fee
    • CONTACT: contact_customer_service, contact_human_agent
    • DELIVERY: delivery_options, delivery_period
    • FEEDBACK: complaint, review
    • INVOICE: check_invoice, get_invoice
    • ORDER: cancel_order, change_order, place_order, track_order
    • PAYMENT: check_payment_methods, payment_issue
    • REFUND: check_refund_policy, get_refund, track_refund
    • SHIPPING_ADDRESS: change_shipping_address, set_up_shipping_address
    • SUBSCRIPTION: newsletter_subscription

    Entities

    The entities covered by the dataset are:

    • {{Order Number}}, typically present in:
      • Intents: cancel_order, change_order, change_shipping_address, check_invoice, check_refund_policy, complaint, delivery_options, delivery_period, get_invoice, get_refund, place_order, track_order, track_refund
    • {{Invoice Number}}, typically present in:
      • Intents: check_invoice, get_invoice
    • {{Online Order Interaction}}, typically present in:
      • Intents: cancel_order, change_order, check_refund_policy, delivery_period, get_refund, review, track_order, track_refund
    • {{Online Payment Interaction}}, typically present in:
      • Intents: cancel_order, check_payment_methods
    • {{Online Navigation Step}}, typically present in:
      • Intents: complaint, delivery_options
    • {{Online Customer Support Channel}}, typically present in:
      • Intents: check_refund_policy, complaint, contact_human_agent, delete_account, delivery_options, edit_account, get_refund, payment_issue, registration_problems, switch_account
    • {{Profile}}, typically present in:
      • Intent: switch_account
    • {{Profile Type}}, typically present in:
      • Intent: switch_account
    • {{Settings}}, typically present in:
      • Intents: cancel_order, change_order, change_shipping_address, check_cancellation_fee, check_invoice, check_payment_methods, contact_human_agent, delete_account, delivery_options, edit_account, get_invoice, newsletter_subscription, payment_issue, place_order, recover_password, registration_problems, set_up_shipping_address, switch_account, track_order, track_refund
    • {{Online Company Portal Info}}, typically present in:
      • Intents: cancel_order, edit_account
    • {{Date}}, typically present in:
      • Intents: check_invoice, check_refund_policy, get_refund, track_order, track_refund
    • {{Date Range}}, typically present in:
      • Intents: check_cancellation_fee, check_invoice, get_invoice
    • {{Shipping Cut-off Time}}, typically present in:
      • Intent: delivery_options
    • {{Delivery City}}, typically present in:
      • Inten...
  2. Data from: Clearing your Desk! Software and Data Services for Collaborative...

    • dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    + more versions
    Cite
    David Tarboton (2021). Clearing your Desk! Software and Data Services for Collaborative Web Based GIS Analysis [Dataset]. https://dataone.org/datasets/sha256%3A348683249e397738f56d481edaa7a200abf4f7c1043a95c4efd14ca4b2273991
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    David Tarboton
    Description

    Can your desktop computer crunch the large GIS datasets that are becoming increasingly common across the geosciences? Do you have access to or the know-how to take advantage of advanced high performance computing (HPC) capability? Web based cyberinfrastructure takes work off your desk or laptop computer and onto infrastructure or "cloud" based data and processing servers. This talk will describe the HydroShare collaborative environment and web based services being developed to support the sharing and processing of hydrologic data and models. HydroShare supports the upload, storage, and sharing of a broad class of hydrologic data including time series, geographic features and raster datasets, multidimensional space-time data, and other structured collections of data. Web service tools and a Python client library provide researchers with access to HPC resources without requiring them to become HPC experts. This reduces the time and effort spent in finding and organizing the data required to prepare the inputs for hydrologic models and facilitates the management of online data and execution of models on HPC systems. This presentation will illustrate the use of web based data and computation services from both the browser and desktop client software. These web-based services implement the Terrain Analysis Using Digital Elevation Model (TauDEM) tools for watershed delineation, generation of hydrology-based terrain information, and preparation of hydrologic model inputs. They allow users to develop scripts on their desktop computer that call analytical functions that are executed completely in the cloud, on HPC resources using input datasets stored in the cloud, without installing specialized software, learning how to use HPC, or transferring large datasets back to the user's desktop. These cases serve as examples for how this approach can be extended to other models to enhance the use of web and data services in the geosciences.

    Slides for AGU 2015 presentation IN51C-03, December 18, 2015

  3. Job Dataset

    • kaggle.com
    zip
    Updated Sep 17, 2023
    Cite
    Ravender Singh Rana (2023). Job Dataset [Dataset]. https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset
    Explore at:
    zip (479,575,920 bytes)
    Dataset updated
    Sep 17, 2023
    Authors
    Ravender Singh Rana
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Dataset

    This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.

    Descriptions for each of the columns in the dataset:

    1. Job Id: A unique identifier for each job posting.
    2. Experience: The required or preferred years of experience for the job.
    3. Qualifications: The educational qualifications needed for the job.
    4. Salary Range: The range of salaries or compensation offered for the position.
    5. Location: The city or area where the job is located.
    6. Country: The country where the job is located.
    7. Latitude: The latitude coordinate of the job location.
    8. Longitude: The longitude coordinate of the job location.
    9. Work Type: The type of employment (e.g., full-time, part-time, contract).
    10. Company Size: The approximate size or scale of the hiring company.
    11. Job Posting Date: The date when the job posting was made public.
    12. Preference: Special preferences or requirements for applicants (e.g., male only, female only, or both).
    13. Contact Person: The name of the contact person or recruiter for the job.
    14. Contact: Contact information for job inquiries.
    15. Job Title: The job title or position being advertised.
    16. Role: The role or category of the job (e.g., software developer, marketing manager).
    17. Job Portal: The platform or website where the job was posted.
    18. Job Description: A detailed description of the job responsibilities and requirements.
    19. Benefits: Information about benefits offered with the job (e.g., health insurance, retirement plans).
    20. Skills: The skills or qualifications required for the job.
    21. Responsibilities: Specific responsibilities and duties associated with the job.
    22. Company Name: The name of the hiring company.
    23. Company Profile: A brief overview of the company's background and mission.

    Potential Use Cases:

    • Building predictive models to forecast job market trends.
    • Enhancing job recommendation systems for job seekers.
    • Developing NLP models for resume parsing and job matching (see the sketch after this list).
    • Analyzing regional job market disparities and opportunities.
    • Exploring salary prediction models for various job roles.
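
    As a sketch of the job-matching use case above, a TF-IDF index over the Job Description column can rank postings against a resume. The filename and resume text are placeholders:

        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        df = pd.read_csv("job_descriptions.csv")  # hypothetical filename

        # Index every posting's description with TF-IDF.
        vec = TfidfVectorizer(stop_words="english", max_features=50_000)
        job_matrix = vec.fit_transform(df["Job Description"].fillna(""))

        # Score a resume against all postings and show the 5 best matches.
        resume = "5 years of Python, SQL and machine learning experience"
        scores = cosine_similarity(vec.transform([resume]), job_matrix).ravel()
        top = scores.argsort()[::-1][:5]
        print(df.iloc[top][["Job Title", "Company Name"]])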

    Acknowledgements:

    We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.
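
    The generation approach can be illustrated with a minimal Faker sketch. The field choices below mirror a few of the columns listed above; this is not the authors' actual generation script:

        import random
        from faker import Faker  # pip install faker

        fake = Faker()

        def fake_posting() -> dict:
            # A cut-down record mirroring a few of the dataset's 23 columns.
            return {
                "Job Id": fake.uuid4(),
                "Job Title": fake.job(),
                "Location": fake.city(),
                "Country": fake.country(),
                "Work Type": random.choice(["Full-Time", "Part-Time", "Contract"]),
                "Job Posting Date": fake.date_between(start_date="-2y").isoformat(),
                "Contact Person": fake.name(),
                "Company Name": fake.company(),
            }

        print(fake_posting())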

    Note:

    Please note that the job postings are fictional and provided for illustrative purposes only. The dataset is not suitable for real-world applications and should be used only within the scope of research and experimentation. You can also reach me via email at: rrana157@gmail.com

  4. Ultimate Arabic News Dataset

    • data.mendeley.com
    Updated May 9, 2022
    + more versions
    Cite
    Ahmed Hashim Al-Dulaimi (2022). Ultimate Arabic News Dataset [Dataset]. http://doi.org/10.17632/jz56k5wxz7.1
    Explore at:
    Dataset updated
    May 9, 2022
    Authors
    Ahmed Hashim Al-Dulaimi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Ultimate Arabic News Dataset is a collection of single-label modern Arabic texts that are used in news websites and press articles.

    Arabic news data was collected using web-scraping techniques from many well-known news sites such as Al-Arabiya and Al-Youm Al-Sabea (Youm7), from news surfaced by the Google search engine, and from various other sources.

    • The data we collected consists of two primary files:

    UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts contain words, numbers, and symbols that can be removed using pre-processing to increase accuracy when using the dataset in various Arabic natural language processing tasks such as text classification.

    UltimateArabicPrePros: A file containing the same data as the first file after pre-processing, which reduced it to about 188,000 text documents. Stop words, non-Arabic words, symbols, and numbers have been removed, so this file is ready for direct use in the various Arabic natural language processing tasks, such as text classification.
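
    A pre-processing pass like the one described can be sketched as follows. The stop-word list here is a tiny illustrative subset, not the list used to build UltimateArabicPrePros:

        import re

        # Tiny illustrative subset; the dataset's actual stop-word list is larger.
        ARABIC_STOPWORDS = {"في", "من", "على", "إلى", "عن", "أن", "هذا"}

        def preprocess(text: str) -> str:
            # Keep only Arabic letters (U+0621-U+064A) and whitespace; this
            # removes numbers, non-Arabic words, diacritics, and symbols.
            text = re.sub(r"[^\u0621-\u064A\s]", " ", text)
            return " ".join(t for t in text.split() if t not in ARABIC_STOPWORDS)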

    • We add two samples of data collected by web scraping techniques:

    Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website.

    Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website.

    • The data is divided into 10 different categories: Culture, Diverse, Economy, Sport, Politic, Art, Society, Technology, Medical and Religion.
  5. Customer Service Requests (CSR) Survey Feedback Responses

    • splitgraph.com
    • data.cincinnati-oh.gov
    Updated Oct 15, 2024
    Cite
    cincinnati-oh-gov (2024). Customer Service Requests (CSR) Survey Feedback Responses [Dataset]. https://www.splitgraph.com/cincinnati-oh-gov/customer-service-requests-csr-survey-feedback-umfh-cri7/
    Explore at:
    application/vnd.splitgraph.image, json, application/openapi+json
    Dataset updated
    Oct 15, 2024
    Authors
    cincinnati-oh-gov
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Data Description: This data set contains a record of all Citizen Service Requests (CSRs) feedback survey responses. When CSRs are closed out by the City, customers who provide an email address are automatically sent a notification that their work has been completed, as well as a link to a customer service satisfaction survey. Customers are able to provide feedback on work completion, satisfaction level, and any additional information. No identifying personal customer/citizen information (name, contact information, or additional comments) is included in this data.

    Data Creation: Data generated when CSR feedback surveys are submitted

    Data Created By: DPS

    Refresh Frequency:

    CincyInsights: The City of Cincinnati maintains an interactive dashboard portal, CincyInsights, in addition to its Open Data portal, in an effort to increase access to and usage of city data. This data set has an associated dashboard available here: https://insights.cincinnati-oh.gov/stories/s/Customer-Service-CSR-Satisfaction/ks8a-xggj/

    Data Dictionary: A data dictionary providing definitions of columns and attributes is available as an attachment to this dataset.

    Processing: The City of Cincinnati is committed to providing the most granular and accurate data possible. In that pursuit, the Office of Performance and Data Analytics facilitates standard processing of most raw data prior to publication. Processing includes, but is not limited to: address verification, geocoding, decoding attributes, and addition of administrative areas (i.e., Census, neighborhoods, police districts, etc.).

    Data Usage: For directions on downloading and using open data please visit our How-to Guide: https://data.cincinnati-oh.gov/dataset/Open-Data-How-To-Guide/gdr9-g3ad

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
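
    A minimal Python sketch, assuming Splitgraph's HTTP SQL endpoint and a hypothetical table/column name; check the Splitgraph documentation for the exact URL and this repository's actual schema:

        import requests

        # Endpoint path and the table/column names are assumptions.
        DDN_URL = "https://data.splitgraph.com/sql/query/ddn"
        sql = """
            SELECT satisfaction_level, count(*) AS responses
            FROM "cincinnati-oh-gov/customer-service-requests-csr-survey-feedback-umfh-cri7".responses
            GROUP BY satisfaction_level
            ORDER BY responses DESC
        """
        print(requests.post(DDN_URL, json={"sql": sql}).json())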

    See the Splitgraph documentation for more information.

  6. ACL-ARC dataset

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    TONG ZENG (2023). ACL-ARC dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12573872.v1
    Explore at:
    zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    TONG ZENG
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in the experiments of the paper "Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models". For the pre-processing of the dataset, please refer to Bonab et al., 2018 (http://doi.org/10.1145/3209978.3210162). We downloaded a copy of that dataset and adjusted some fields. The data are stored in JSONL format (each row is a JSON object); an example row:

    {"cur_sent": "the nespole uses a client server architecture to allow a common user who is initially browsing through the web pages of a service provider on the internet to connect seamlessly to a human agent of the service provider who speaks another language and provides speech to speech translation service between the two parties", "cur_scaled_len_features": {"type": 1, "values": [0.06936542669584245, 0.07202216066481995]}, "cur_has_citation": 1}

    For code that uses this dataset to model citation worthiness, please refer to https://github.com/sciosci/cite-worthiness
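
    Reading the file is one json.loads call per row; the filename below is hypothetical:

        import json
        from collections import Counter

        # Filename is hypothetical; use the name from the figshare download.
        with open("acl_arc.jsonl", encoding="utf-8") as f:
            rows = [json.loads(line) for line in f if line.strip()]

        # cur_has_citation: 1 = sentence warrants a citation, 0 = it does not.
        print(Counter(r["cur_has_citation"] for r in rows))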

  7. Cloud Computing for Science Data Processing in Support of Emergency Response...

    • data.wu.ac.at
    xml
    Updated Sep 16, 2017
    Cite
    National Aeronautics and Space Administration (2017). Cloud Computing for Science Data Processing in Support of Emergency Response [Dataset]. https://data.wu.ac.at/schema/data_gov/ZjY5OThlZjYtOWNhMi00YTEwLTgyN2EtZGQyZjIwZGFjMDgx
    Explore at:
    xml
    Dataset updated
    Sep 16, 2017
    Dataset provided by
    NASA (http://nasa.gov/)
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Cloud computing enables users to create virtual computers, each one with the optimal configuration of hardware and software for a job. The number of virtual computers can be increased to process large data sets or reduce processing time. Large scale scientific applications of the cloud, in many cases, are still in development.

    For example, in the event of an environmental crisis, such as the Deepwater Horizon oil spill, tornadoes, Mississippi River flooding, or a hurricane, up-to-date information is one of the most important commodities for decision makers. The volume of remote sensing data that needs to be processed to accurately retrieve ocean properties from satellite measurements can easily exceed a terabyte, even for a small region such as the Mississippi Sound. Often, with current infrastructure, the time required to download, process, and analyze the large volumes of remote sensing data limits the ability to provide timely information to emergency responders. The use of a cloud computing platform, like NASA’s Nebula, can help eliminate those barriers.

    NASA Nebula was developed as an open-source cloud computing platform to provide an easily quantifiable and improved alternative to building additional expensive data centers and to provide an easier way for NASA scientists and researchers to share large, complex data sets with external partners and the public. Nebula was designed as an Infrastructure-as-a-Service (IaaS) implementation that provided scalable computing and storage for science data and Web-based applications. Nebula IaaS allowed users to unilaterally provision, manage, and decommission computing capabilities (virtual machine instances, storage, etc.) on an as-needed basis through a Web interface or a set of command-line tools.

    This project demonstrated a novel way to conduct large scale scientific data processing utilizing NASA’s cloud computer, Nebula. Remote sensing data from the Deepwater Horizon oil spill site was analyzed to assess changes in concentration of suspended sediments in the area surrounding the spill site.

    Software for processing time series of satellite remote sensing data was packaged together with a computer code that uses web services to download the data sets from a NASA data archive and distribution system. The new application package was able to be quickly deployed on a cloud computing platform when, and only for as long as, processing of the time series data is required to support emergency response. Fast network connection between the cloud system and the data archive enabled remote processing of the satellite data without the need for downloading the input data to a local computer system: only the output data products are transferred for further analysis.

    NASA was a pioneer in cloud computing by having established its own private cloud computing data center called Nebula in 2009 at the Ames Research Center (Ames). Nebula provided high-capacity computing and data storage services to NASA Centers, Mission Directorates, and external customers. In 2012, NASA shut down Nebula based on the results of a 5-month test that benchmarked Nebula’s capabilities against those of Amazon and Microsoft. The test found that public clouds were more reliable and cost effective and offered much greater computing capacity and better IT support services than Nebula.

  8. Canada Radiogenic Heat Production Observations

    • data.wu.ac.at
    arcgis_rest, pdf, wfs +1
    Updated Dec 5, 2017
    Cite
    (2017). Canada Radiogenic Heat Production Observations [Dataset]. https://data.wu.ac.at/schema/geothermaldata_org/YmRjNGE2ZjctNzU5Zi00N2ZhLWEwMDQtZDdjZmUyZDE2NDJm
    Explore at:
    wfs, pdf, wms, arcgis_rest
    Dataset updated
    Dec 5, 2017
    Area covered
    Canada
    Description

    Data related to 2,319 radiogenic heat production observations for Canada, obtained from the Canadian Geothermal Data Compilation. The data table includes measurements of radiogenic heat production based on analysis of individual rock samples. Calculation of heat production is based on measured U, Th, and K content, which may derive from chemical analysis, gamma-ray spectral analysis, or other techniques. Data processing to load and aggregate delimited text data from the OFR into a database, and web service deployment, by SM Richard and Christy Caudill.
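
    The description does not state which conversion was used; a common choice in the literature is Rybach's (1988) relation, sketched here purely for illustration:

        def heat_production(density_kg_m3: float, u_ppm: float,
                            th_ppm: float, k_wt_pct: float) -> float:
            # Rybach (1988): A [uW/m^3] = 1e-5 * rho * (9.52*C_U + 2.56*C_Th + 3.48*C_K)
            # with rho in kg/m^3, U and Th in ppm, and K in weight percent.
            return 1e-5 * density_kg_m3 * (
                9.52 * u_ppm + 2.56 * th_ppm + 3.48 * k_wt_pct)

        # e.g. a typical granite (~2.4 uW/m^3):
        print(round(heat_production(2670, 4.0, 15.0, 3.5), 2))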

  9. USA NLCD Land Cover

    • opendata.rcmrd.org
    • prep-response-portal.napsgfoundation.org
    • +9 more
    Updated Jun 6, 2019
    + more versions
    Cite
    Esri (2019). USA NLCD Land Cover [Dataset]. https://opendata.rcmrd.org/datasets/3ccf118ed80748909eb85c6d262b426f
    Explore at:
    Dataset updated
    Jun 6, 2019
    Dataset authored and provided by
    Esri
    Area covered
    Description

    Land cover describes the surface of the earth. This time-enabled service of the National Land Cover Database groups land cover into 20 classes based on a modified Anderson Level II classification system. Classes include vegetation type, development density, and agricultural use. Areas of water, ice and snow, and barren lands are also identified.

    The National Land Cover Database products are created through a cooperative project conducted by the Multi-Resolution Land Characteristics Consortium (MRLC). The MRLC Consortium is a partnership of federal agencies, consisting of the U.S. Geological Survey, the National Oceanic and Atmospheric Administration, the U.S. Environmental Protection Agency, the U.S. Department of Agriculture, the U.S. Forest Service, the National Park Service, the U.S. Fish and Wildlife Service, the Bureau of Land Management, and the USDA Natural Resources Conservation Service.

    Time Extent: 2001, 2004, 2006, 2008, 2011, 2013, 2016, 2019, and 2021 for the conterminous United States. The layer displays land cover for Alaska for the years 2001, 2011, and 2016. For Puerto Rico there is only data for 2001. For Hawaii, Esri reclassed land cover data from the NOAA Office for Coastal Management, C-CAP, into NLCD codes. These reclassed C-CAP data were available for Hawaii for the years 2001, 2005, and 2011. Hawaii C-CAP land cover in its original form can be used in your maps by adding the Hawaii CCAP Land Cover layer directly from the Living Atlas.

    Units: (thematic dataset)
    Cell Size: 30 m
    Source Type: Thematic
    Pixel Type: Unsigned 8 bit
    Data Projection: North America Albers Equal Area Conic (102008)
    Mosaic Projection: North America Albers Equal Area Conic (102008)
    Extent: 50 US States, District of Columbia, Puerto Rico
    Source: National Land Cover Database
    Publication date: June 30, 2023

    Time Series

    This layer is served as a time series. To display a particular year of land cover data, select the year of interest with the time slider in your map client. You may also use the time slider to play the service as an animation. We recommend a one-year time interval when displaying the series. If you would like a particular year of data to use in analysis, be sure to use the analytic renderer along with the time slider to choose a valid year.

    North America Albers Projection

    This layer is served in North America Albers projection. Albers is an equal-area projection, and this allows users of this service to accurately calculate acreage without additional data preparation steps. This also means it takes a tiny bit longer to project on the fly into Web Mercator projection, if that is the destination projection of the service.

    Processing Templates

    • Cartographic Renderer - The default. Land cover drawn with Esri symbols. Each year's land cover data is displayed in the time series until there is a newer year of data available.
    • Cartographic Renderer (saturated) - The same symbols as the cartographic renderer, but the colors are extra saturated so a transparency may be applied to the layer. This renderer is useful for land cover over a basemap or relief.
    • MRLC Cartographic Renderer - Cartographic renderer using the land cover symbols as issued by NLCD (the same symbols as on the dataset when you download it from MRLC).
    • Analytic Renderer - Use this in analysis. The time series is restricted by the analytic template to display a raster only in the year the land cover raster is valid. In a cartographic renderer, land cover data is displayed until a new year of data is available so that it plays well in a time series; in the analytic renderer, data is displayed only for the year it is valid. The analytic renderer won't look good in a time series animation, but in analysis it will make sure you only use data for its appropriate year.
    • Simplified Renderer - NLCD reclassified into 10 broad classes. These broad classes may be easier to use in some applications or maps.
    • Forest Renderer - Cartographic renderer which displays only the three forest classes: deciduous, coniferous, and mixed forest.
    • Developed Renderer - Cartographic renderer which displays only the four developed classes: developed open space plus low, medium, and high intensity development.

    Hawaii data has a different source

    MRLC redirects users interested in land cover data for Hawaii to a NOAA product called C-CAP, or Coastal Change Analysis Program Regional Land Cover. This C-CAP land cover data was available for Hawaii for the years 2001, 2005, and 2011 at the time of the latest update of this layer. The USA NLCD Land Cover layer reclasses C-CAP land cover codes into NLCD land cover codes for display and analysis, although it may be beneficial for analytical purposes to use the original C-CAP data, which has finer resolution and untranslated land cover codes. The C-CAP land cover data for Hawaii is served as its own 2.4 m resolution land cover layer in the Living Atlas.

    Because it comes from a different original data source than the rest of NLCD, different years for Hawaii may not be comparable in the same way different years for the other states are. But the same method was used to produce each year of this C-CAP-derived land cover. Note: because there was no C-CAP data for Kaho'olawe Island in 2011, 2005 data were used for that island.

    The land cover is projected into the same projection and cell size as the rest of the layer using the nearest neighbor method, then reclassed to approximate the NLCD codes. The following reclass table was used to make Hawaii C-CAP data closely match the NLCD classification scheme (C-CAP code → NLCD code):

    0→0, 1→0, 2→24, 3→23, 4→22, 5→21, 6→82, 7→81, 8→71, 9→41, 10→42, 11→43, 12→52, 13→90, 14→90, 15→95, 16→90, 17→90, 18→95, 19→31, 20→31, 21→11, 22→11, 23→11, 24→0, 25→12

    USA NLCD Land Cover service classes with corresponding index number (raster value):

    11. Open Water - areas of open water, generally with less than 25% cover of vegetation or soil.
    12. Perennial Ice/Snow - areas characterized by a perennial cover of ice and/or snow, generally greater than 25% of total cover.
    21. Developed, Open Space - areas with a mixture of some constructed materials, but mostly vegetation in the form of lawn grasses. Impervious surfaces account for less than 20% of total cover. These areas most commonly include large-lot single-family housing units, parks, golf courses, and vegetation planted in developed settings for recreation, erosion control, or aesthetic purposes.
    22. Developed, Low Intensity - areas with a mixture of constructed materials and vegetation. Impervious surfaces account for 20% to 49% of total cover. These areas most commonly include single-family housing units.
    23. Developed, Medium Intensity - areas with a mixture of constructed materials and vegetation. Impervious surfaces account for 50% to 79% of the total cover. These areas most commonly include single-family housing units.
    24. Developed, High Intensity - highly developed areas where people reside or work in high numbers. Examples include apartment complexes, row houses, and commercial/industrial. Impervious surfaces account for 80% to 100% of the total cover.
    31. Barren Land (Rock/Sand/Clay) - areas of bedrock, desert pavement, scarps, talus, slides, volcanic material, glacial debris, sand dunes, strip mines, gravel pits, and other accumulations of earthen material. Generally, vegetation accounts for less than 15% of total cover.
    41. Deciduous Forest - areas dominated by trees generally greater than 5 meters tall, and greater than 20% of total vegetation cover. More than 75% of the tree species shed foliage simultaneously in response to seasonal change.
    42. Evergreen Forest - areas dominated by trees generally greater than 5 meters tall, and greater than 20% of total vegetation cover. More than 75% of the tree species maintain their leaves all year. Canopy is never without green foliage.
    43. Mixed Forest - areas dominated by trees generally greater than 5 meters tall, and greater than 20% of total vegetation cover. Neither deciduous nor evergreen species are greater than 75% of total tree cover.
    51. Dwarf Scrub - Alaska only; areas dominated by shrubs less than 20 centimeters tall with shrub canopy typically greater than 20% of total vegetation. This type is often co-associated with grasses, sedges, herbs, and non-vascular vegetation.
    52. Shrub/Scrub - areas dominated by shrubs less than 5 meters tall with shrub canopy typically greater than 20% of total vegetation. This class includes true shrubs, young trees in an early successional stage, or trees stunted from environmental conditions.
    71. Grassland/Herbaceous - areas dominated by graminoid or herbaceous vegetation, generally greater than 80% of total vegetation. These areas are not subject to intensive management such as tilling, but can be utilized for grazing.
    72. Sedge/Herbaceous - Alaska only; areas dominated by sedges and forbs, generally greater than 80% of total vegetation. This type can occur with significant other grasses or other grass-like plants, and includes sedge tundra and sedge tussock tundra.
    73. Lichens - Alaska only; areas dominated by fruticose or foliose lichens, generally greater than 80% of total vegetation.
    74. Moss - Alaska only; areas dominated by mosses, generally greater than 80% of total vegetation.
    Planted/Cultivated:
    81. Pasture/Hay - areas of grasses, legumes, or grass-legume mixtures planted for livestock grazing or the production of seed or hay crops, typically on a perennial cycle. Pasture/hay vegetation accounts for greater than 20% of total vegetation.
    82. Cultivated Crops - areas used for the production of annual crops, such as corn, soybeans, vegetables, tobacco, and cotton, and also perennial woody crops such as orchards and vineyards. Crop vegetation accounts for greater than 20% of total vegetation. This class also includes all land being actively tilled.
    90. Woody Wetlands - areas where forest or shrubland vegetation accounts for greater than 20% of vegetative cover and the soil or
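
    For the area-count use mentioned under the analytic renderer, a single year of NLCD exported locally as a GeoTIFF in its native 30 m equal-area grid can be summarized with rasterio; the filename is a placeholder:

        import numpy as np
        import rasterio  # pip install rasterio

        # Hypothetical local export of one year of NLCD in its native
        # 30 m Albers equal-area grid, so a pixel count is an area count.
        with rasterio.open("nlcd_2021.tif") as src:
            data = src.read(1)

        codes, counts = np.unique(data, return_counts=True)
        for code, n in zip(codes, counts):
            km2 = n * 900 / 1e6  # 30 m x 30 m pixels -> km^2
            print(f"class {code:3d}: {km2:12.1f} km^2")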

  10. Global Land Cover 1992-2020 (Proxied for public use)

    • opendata.rcmrd.org
    Updated Aug 1, 2025
    Cite
    ArcGIS StoryMaps (2025). Global Land Cover 1992-2020 (Proxied for public use) [Dataset]. https://opendata.rcmrd.org/datasets/6c35c80af3c84675a700d739111bc04c
    Explore at:
    Dataset updated
    Aug 1, 2025
    Dataset authored and provided by
    ArcGIS StoryMaps
    Area covered
    Earth
    Description

    This service is available to all ArcGIS Online users with organizational accounts. For more information on this service, including the terms of use, visit us online at https://goto.arcgisonline.com/earthobs3/ESA_CCI_Land_Cover_Time_Series

    This layer is a time series of the annual ESA CCI (Climate Change Initiative) land cover maps of the world. ESA has produced land cover maps for the years 1992-2020. These are available at the European Space Agency Climate Change Initiative website.

    Time Extent: 1992-2020
    Cell Size: 300 meters
    Source Type: Thematic
    Pixel Type: 8 Bit Unsigned
    Data Projection: GCS WGS84
    Mosaic Projection: Web Mercator Auxiliary Sphere
    Extent: Global
    Source: ESA Climate Change Initiative
    Update Cycle: Annual until 2020, no updates thereafter

    What can you do with this layer?

    This layer may be added to ArcGIS Online maps and applications and shown in a time series to watch a "time lapse" view of land cover change since 1992 for any part of the world. The same behavior exists when the layer is added to ArcGIS Pro.

    In addition to displaying all layers in a series, this layer may be queried so that only one year is displayed in a map. This layer can be used in analysis. For example, the layer may be added to ArcGIS Pro with a query set to display just one year. Then, an area count of land cover types may be produced for a feature dataset using the zonal statistics tool. Statistics may be compared with the statistics from other years to show a trend.

    To sum up area by land cover using this service, or for any other analysis, be sure to use an equal-area projection, such as Albers or Equal Earth.

    Different Classifications Available to Map

    Five processing templates are included in this layer. The processing templates may be used to display a smaller set of land cover classes.

    • Cartographic Renderer (Default Template) - Displays all ESA CCI land cover classes.*
    • Forested Lands Template - Shows only forested lands (classes 50-90).
    • Urban Lands Template - Shows only urban areas (class 190).
    • Converted Lands Template - Shows only urban lands and lands converted to agriculture (classes 10-40 and 190).
    • Simplified Renderer - Displays the map in ten simple classes which match the ten simplified classes used in 2050 Land Cover projections from Clark University.

    Any of these variables can be displayed or analyzed by selecting their processing template. In ArcGIS Online, select the Image Display Options on the layer, then pull down the list of variables from the Renderer options, and click Apply and Close. In ArcGIS Pro, go into the Layer Properties, select Processing Templates from the left-hand menu, and select the variable to display from the Processing Template pull-down menu.

    Using Time

    By default, the map will display as a time series animation, one year per frame. A time slider will appear when you add this layer to your map. To see the most current data, move the time slider until you see the most current year. In addition to displaying the past quarter century of land cover maps as an animation, this time series can also display just one year of data by use of a definition query. For a step-by-step example using ArcGIS Pro on how to display just one year of this layer, as well as to compare one year to another, see the blog called Calculating Impervious Surface Change.

    Hierarchical Classification

    Land cover types are defined using the land cover classification system (LCCS) developed by the United Nations FAO. It is designed to be as compatible as possible with other products, namely GLCC2000, GlobCover 2005 and 2009. This is a hierarchical classification system. For example, class 60 means "closed to open" canopy broadleaved deciduous tree cover, but in some places a more specific type of broadleaved deciduous tree cover may be available. In that case, a more specific code 61 or 62 may be used, which specifies "closed" (61) or "open" (62) cover.

    Land Cover Processing

    To provide consistency over time, these maps are produced from baseline land cover maps and are revised for changes each year depending on the best available satellite data from each period in time. These revisions were made from AVHRR 1 km time series from 1992 to 1999, SPOT-VGT time series between 1999 and 2013, and PROBA-V data for the years 2013, 2014, and 2015. When MERIS FR or PROBA-V time series are available, changes detected at 1 km are re-mapped at 300 m. The last step consists of back- and up-dating the 10-year baseline LC map to produce the 24 annual LC maps from 1992 to 2015.

    Source data

    The datasets behind this layer were extracted from NetCDF files and TIFF files produced by ESA. Years 1992-2015 were acquired from ESA CCI LC version 2.0.7 in TIFF format, and years 2016-2018 were acquired from version 2.1.1 in NetCDF format. These are downloadable from ESA with an account, after agreeing to their terms of use: https://maps.elie.ucl.ac.be/CCI/viewer/download.php

    Citation

    ESA. Land Cover CCI Product User Guide Version 2. Tech. Rep. (2017). Available at: maps.elie.ucl.ac.be/CCI/viewer/download/ESACCI-LC-Ph2-PUGv2_2.0.pdf

    More technical documentation on the source datasets is available here: https://cds.climate.copernicus.eu/cdsapp#!/dataset/satellite-land-cover?tab=doc

    *Index of all classes in this layer:

    10 Cropland, rainfed
    11 Herbaceous cover
    12 Tree or shrub cover
    20 Cropland, irrigated or post-flooding
    30 Mosaic cropland (>50%) / natural vegetation (tree, shrub, herbaceous cover) (<50%)
    40 Mosaic natural vegetation (tree, shrub, herbaceous cover) (>50%) / cropland (<50%)
    50 Tree cover, broadleaved, evergreen, closed to open (>15%)
    60 Tree cover, broadleaved, deciduous, closed to open (>15%)
    61 Tree cover, broadleaved, deciduous, closed (>40%)
    62 Tree cover, broadleaved, deciduous, open (15-40%)
    70 Tree cover, needleleaved, evergreen, closed to open (>15%)
    71 Tree cover, needleleaved, evergreen, closed (>40%)
    72 Tree cover, needleleaved, evergreen, open (15-40%)
    80 Tree cover, needleleaved, deciduous, closed to open (>15%)
    81 Tree cover, needleleaved, deciduous, closed (>40%)
    82 Tree cover, needleleaved, deciduous, open (15-40%)
    90 Tree cover, mixed leaf type (broadleaved and needleleaved)
    100 Mosaic tree and shrub (>50%) / herbaceous cover (<50%)
    110 Mosaic herbaceous cover (>50%) / tree and shrub (<50%)
    120 Shrubland
    121 Shrubland evergreen
    122 Shrubland deciduous
    130 Grassland
    140 Lichens and mosses
    150 Sparse vegetation (tree, shrub, herbaceous cover) (<15%)
    151 Sparse tree (<15%)
    152 Sparse shrub (<15%)
    153 Sparse herbaceous cover (<15%)
    160 Tree cover, flooded, fresh or brackish water
    170 Tree cover, flooded, saline water
    180 Shrub or herbaceous cover, flooded, fresh/saline/brackish water
    190 Urban areas
    200 Bare areas
    201 Consolidated bare areas
    202 Unconsolidated bare areas
    210 Water bodies
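
    Because the classification is hierarchical, child codes can be collapsed to their parent class when only the coarse classes are needed. A small convenience sketch (not part of the service itself):

        def to_parent_class(code: int) -> int:
            # Child codes differ from their parent in the ones digit,
            # e.g. 61 (closed) and 62 (open) both refine class 60.
            return code - (code % 10)

        assert to_parent_class(62) == 60
        assert to_parent_class(121) == 120
        assert to_parent_class(50) == 50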

  11. Databricks Dolly (15K)

    • kaggle.com
    • huggingface.co
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Databricks Dolly (15K) [Dataset]. https://www.kaggle.com/datasets/thedevastator/databricks-chatgpt-dataset/code
    Explore at:
    zip (4,621,394 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Databricks Dolly (15K)

    Over 15,000 Prompt-Response Records and Dialogues for Interactive Chat Applications

    By Huggingface Hub [source]

    About this dataset

    This exceptional dataset, created by Databricks employees, provides 15,000+ prompt-response records and dialogues to power dynamic chat applications. By generating prompt-response pairs across 8 different instruction categories, the goal is to facilitate the use of large language models for interactive dialogue, all while avoiding information taken from any web source except Wikipedia for particular instruction sets. Use this open-source dataset to explore the boundaries of text-based conversations and uncover new insights about natural language processing!


    How to use the dataset

    First, let's take a look at the columns in this dataset: Instruction (string), Context (string), Response (string), Category (string). Each record represents a prompt-response pair or conversation between two people. The Instruction and Context fields contain what is said by one individual and the Response holds what is said back by another, culminating in a conversation. These paired entries are then classified into one of 8 different categories based on their content. Knowing this information can help you best utilize the corpus to your desired purposes.

    For example, if you are training a dialogue system, you could develop multiple funneling pipelines using this dataset to enrich your model with real-world conversations or to create intelligent chatbot interactions. If you want to generate natural language answers as part of a Q&A system, you could utilize the Wikipedia excerpts for particular subsets of instruction categories, as well as drawing upon the prompt-response pairs within those instructions, all from within the Databricks set. Furthermore, since each record is independently labeled into one of 8 defined categories (such as make reservations or compare products), there are many possibilities for leveraging these classification labels with supervised learning techniques such as multi-class classification neural networks or logistic regression classifiers.
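
    As a concrete starting point for the classification idea, a TF-IDF plus logistic-regression baseline over the category labels might look like this; column names follow the listing above, and should be checked against the actual CSV:

        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline

        df = pd.read_csv("train.csv")  # Instruction/Context/Response/Category

        X_train, X_test, y_train, y_test = train_test_split(
            df["Instruction"].fillna(""), df["Category"],
            test_size=0.2, random_state=0)

        clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                            LogisticRegression(max_iter=1000))
        clf.fit(X_train, y_train)
        print("held-out accuracy:", clf.score(X_test, y_test))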

    In short, this substantial resource offers an array of creative ways to explore dialogue-related applications without relying on data from external web sources; all that's needed from here is your own imagination!

    Research Ideas

    • Generating deep learning models to detect and respond to conversational intent.
    • Training language models to use natural language processing (NLP) for customer service queries.
    • Creating custom dialogue agents that can handle more complex conversational interactions, such as those powered by supervised or unsupervised machine learning techniques.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | Instruction | Text prompt that should generate an appropriate response from a machine learning model/chatbot using natural language processing techniques. (Text) |
    | Context | Provides context to improve accuracy by giving the model more information about what’s happening in a conversation or request execution. (Text) |


  12. Bot_IoT

    • kaggle.com
    zip
    Updated Feb 28, 2023
    Cite
    Vignesh Venkateswaran (2023). Bot_IoT [Dataset]. https://www.kaggle.com/datasets/vigneshvenkateswaran/bot-iot
    Explore at:
    zip (1,257,092,644 bytes)
    Dataset updated
    Feb 28, 2023
    Authors
    Vignesh Venkateswaran
    Description

    Info about the BoT-IoT dataset. Note: only the CSV files stated in the description are used.

    The BoT-IoT dataset can be downloaded from HERE. You can also use our new datasets: the TON_IoT and UNSW-NB15.

    --------------------------------------------------------------------------

    The BoT-IoT dataset was created by designing a realistic network environment in the Cyber Range Lab of UNSW Canberra. The network environment incorporated a combination of normal and botnet traffic. The dataset’s source files are provided in different formats, including the original pcap files, the generated argus files, and CSV files. The files were separated based on attack category and subcategory to better assist the labeling process.

    The captured pcap files are 69.3 GB in size, with more than 72,000,000 records. The extracted flow traffic, in CSV format, is 16.7 GB in size. The dataset includes DDoS, DoS, OS and Service Scan, Keylogging, and Data Exfiltration attacks, with the DDoS and DoS attacks further organized based on the protocol used.

    To ease the handling of the dataset, we extracted 5% of the original dataset via select MySQL queries. The extracted 5% comprises 4 files of approximately 1.07 GB total size, and about 3 million records.
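
    Loading the 5% extract is straightforward with pandas; the glob pattern and the label column name below are assumptions based on the described labelling, so adjust them to the files you download:

        import glob
        import pandas as pd

        # Filename pattern is hypothetical; point it at the 4 extract CSVs.
        files = sorted(glob.glob("bot_iot_5pct_*.csv"))
        df = pd.concat((pd.read_csv(p, low_memory=False) for p in files),
                       ignore_index=True)

        # Column name is an assumption based on the described labelling.
        print(df["category"].value_counts())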

    --------------------------------------------------------------------------

    Free use of the Bot-IoT dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes should be agreed with the authors. The authors have asserted their rights under copyright. Those who intend to use the Bot-IoT dataset must cite the following papers, which describe the dataset’s details:

    Koroniotis, Nickolaos, Nour Moustafa, Elena Sitnikova, and Benjamin Turnbull. "Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: Bot-iot dataset." Future Generation Computer Systems 100 (2019): 779-796. Public Access Here.

    Koroniotis, Nickolaos, Nour Moustafa, Elena Sitnikova, and Jill Slay. "Towards developing network forensic mechanism for botnet activities in the iot based on machine learning techniques." In International Conference on Mobile Networks and Management, pp. 30-44. Springer, Cham, 2017.

    Koroniotis, Nickolaos, Nour Moustafa, and Elena Sitnikova. "A new network forensic framework based on deep learning for Internet of Things networks: A particle deep framework." Future Generation Computer Systems 110 (2020): 91-106.

    Koroniotis, Nickolaos, and Nour Moustafa. "Enhancing network forensics with particle swarm and deep learning: The particle deep framework." arXiv preprint arXiv:2005.00722 (2020).

    Koroniotis, Nickolaos, Nour Moustafa, Francesco Schiliro, Praveen Gauravaram, and Helge Janicke. "A Holistic Review of Cybersecurity and Reliability Perspectives in Smart Airports." IEEE Access (2020).

    Koroniotis, Nickolaos. "Designing an effective network forensic framework for the investigation of botnets in the Internet of Things." PhD diss., The University of New South Wales Australia, 2020.

    --------------------------------------------------------------------------

  13. Amazon AWS SaaS Sales Dataset

    • kaggle.com
    Updated May 5, 2023
    Cite
    Nhat Thanh, Nguyen (2023). Amazon AWS SaaS Sales Dataset [Dataset]. https://www.kaggle.com/datasets/nnthanh101/aws-saas-sales
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nhat Thanh, Nguyen
    License

    http://www.gnu.org/licenses/fdl-1.3.html

    Description

    This dataset contains transaction data from a fictitious SaaS company selling sales and marketing software to other companies (B2B). In the dataset, each row represents a single transaction/order (9,994 transactions), and the columns include:

    Here is the Original Dataset: https://ee-assets-prod-us-east-1.s3.amazonaws.com/modules/337d5d05acc64a6fa37bcba6b921071c/v1/SaaS-Sales.csv

    Features

    | #  | Name of the attribute | Description |
    |----|-----------------------|-------------|
    | 1  | Row ID | A unique identifier for each transaction. |
    | 2  | Order ID | A unique identifier for each order. |
    | 3  | Order Date | The date when the order was placed. |
    | 4  | Date Key | A numerical representation of the order date (YYYYMMDD). |
    | 5  | Contact Name | The name of the person who placed the order. |
    | 6  | Country | The country where the order was placed. |
    | 7  | City | The city where the order was placed. |
    | 8  | Region | The region where the order was placed. |
    | 9  | Subregion | The subregion where the order was placed. |
    | 10 | Customer | The name of the company that placed the order. |
    | 11 | Customer ID | A unique identifier for each customer. |
    | 13 | Industry | The industry the customer belongs to. |
    | 14 | Segment | The customer segment (SMB, Strategic, Enterprise, etc.). |
    | 15 | Product | The product that was ordered. |
    | 16 | License | The license key for the product. |
    | 17 | Sales | The total sales amount for the transaction. |
    | 18 | Quantity | The total number of items in the transaction. |
    | 19 | Discount | The discount applied to the transaction. |
    | 20 | Profit | The profit from the transaction. |
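
    With the schema above, a quick pandas roll-up by customer segment illustrates typical use; the filename matches the original CSV linked above:

        import pandas as pd

        df = pd.read_csv("SaaS-Sales.csv")

        summary = (df.groupby("Segment")
                     .agg(orders=("Order ID", "nunique"),
                          sales=("Sales", "sum"),
                          profit=("Profit", "sum"))
                     .sort_values("profit", ascending=False))
        print(summary)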

    Inspiration: the CRoss Industry Standard Process for Data Mining (CRISP-DM) methodology

    • [ ] Understanding the business
    • [ ] Understanding the data
    • [x] Preparing the data
    • [ ] Modelling
    • [ ] Evaluating
    • [ ] Implementing the analysis.
  14. Lead Scoring Case Study

    • kaggle.com
    zip
    Updated Jan 31, 2022
    + more versions
    Cite
    Venkatasubramanian Sundaramahadevan (2022). Lead Scoring Case Study [Dataset]. https://www.kaggle.com/datasets/venkatasubramanian/lead-scoring-case-study
    Explore at:
    zip (420,583 bytes)
    Dataset updated
    Jan 31, 2022
    Authors
    Venkatasubramanian Sundaramahadevan
    Description

    Problem Statement

    An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

    The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill up a form for a course, or watch some videos. When these people fill up a form providing their email address or phone number, they are classified as a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

    Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up, as the sales team will now focus on communicating with the potential leads rather than making calls to everyone. A typical lead conversion process can be represented using the following funnel:

    [Image: Lead Conversion Process, demonstrated as a funnel - https://cdn.upgrad.com/UpGrad/temp/189f213d-fade-4fe4-b506-865f1840a25a/XNote_201901081613670.jpg]

    As you can see, there are a lot of leads generated in the initial stage (top), but only a few of them come out as paying customers (bottom). In the middle stage, you need to nurture the potential leads well (i.e., educating the leads about the product, constantly communicating, etc.) in order to get a higher lead conversion.

    X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model wherein you assign a lead score to each lead, such that customers with a higher lead score have a higher conversion chance and customers with a lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.

    Data

    You have been provided with a leads dataset from the past with around 9,000 data points. This dataset consists of various attributes such as Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc., which may or may not be useful in ultimately deciding whether a lead will be converted or not. The target variable, in this case, is the column ‘Converted’, which tells whether a past lead was converted or not: 1 means it was converted and 0 means it wasn’t. You can learn more about the dataset from the data dictionary provided in the zip folder at the end of the page. You also need to check the levels present in the categorical variables: many of the categorical variables have a level called 'Select', which needs to be handled because it is as good as a null value.
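
    A minimal sketch of both steps, treating 'Select' as null and turning predicted conversion probabilities into 0-100 lead scores; the filename and exact column spellings are assumptions and should be checked against the CSV in the zip folder:

        import numpy as np
        import pandas as pd
        from sklearn.linear_model import LogisticRegression

        df = pd.read_csv("Leads.csv")  # hypothetical filename

        # 'Select' is a placeholder level, as good as a null value.
        df = df.replace("Select", np.nan)

        # Two numeric columns named in the brief; adjust to the real headers.
        X = df[["Total Time Spent on Website", "Total Visits"]].fillna(0)
        y = df["Converted"]

        model = LogisticRegression(max_iter=1000).fit(X, y)
        df["Lead Score"] = (model.predict_proba(X)[:, 1] * 100).round().astype(int)
        print(df[["Lead Score", "Converted"]].head())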

