22 datasets found
  1. Sign Language Classification Challenge Zindi

    • kaggle.com
    zip
    Updated Dec 1, 2021
    Cite
    khalid benlyazid (2021). Sign Language Classification Challenge Zindi [Dataset]. https://www.kaggle.com/khalidbenlyazid/sign-language-classification-challenge-zindi
    Explore at:
    Available download formats: zip (1180875247 bytes)
    Dataset updated
    Dec 1, 2021
    Authors
    khalid benlyazid
    Description

    About

    The data was collected by 800 taskers from Kenya, Mexico and India. There are nine classes, each a different sign.

    The objective of this competition is to classify the ten different Sign Language signs present in the images, using machine learning or deep learning algorithms.

    Files available for download:

    Images.zip: a zip file that contains all images in the train and test sets.
    Train.csv: contains the target. This is the dataset that you will use to train your model.
    Test.csv: resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
    SampleSubmission.csv: shows the submission format for this competition, with the ‘Image_ID’ column mirroring that of Test.csv and the ‘label’ column containing your predictions. The order of the rows does not matter, but the ‘Image_ID’ values must be correct.
    

    This data was imported from the Zindi classification challenge: https://zindi.africa/competitions/kenyan-sign-language-classification-challenge
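
    A minimal sketch of how the files above could be used to produce a majority-class baseline submission. It assumes the target column in Train.csv is also named 'label' (only SampleSubmission.csv's columns are named explicitly above); adjust paths and column names as needed:

    ```python
    # Naive baseline: predict the most frequent training label for every image.
    # Assumes Train.csv has a 'label' column and SampleSubmission.csv has
    # 'Image_ID' and 'label' columns, as described above.
    import pandas as pd

    train = pd.read_csv("Train.csv")
    sample = pd.read_csv("SampleSubmission.csv")

    most_common = train["label"].mode()[0]        # majority-class baseline
    sample["label"] = most_common                 # same prediction for every Image_ID
    sample.to_csv("submission.csv", index=False)  # row order does not matter
    ```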

  2. Arabizi Dialect Training Data

    • kaggle.com
    Updated Mar 12, 2021
    Cite
    Abid Ali Awan (2021). Arabizi Dialect Training Data [Dataset]. https://www.kaggle.com/kingabzpro/aranizi-dailect-training-data/code
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abid Ali Awan
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    TUNIZI is the first 100% Tunisian Arabizi sentiment analysis dataset, developed as part of AI4D’s ongoing NLP project for African languages. Tunisian Arabizi is the representation of the Tunisian dialect written in Latin characters and numbers rather than Arabic letters.

    iCompass gathered comments from social media platforms that express sentiment about popular topics. For this purpose, we extracted 100k comments using public streaming APIs.

    TUNIZI was preprocessed by removing links, emoji symbols, and punctuation.

    The collected comments were manually annotated using an overall polarity: positive (1), negative (-1) and neutral (0). The annotators were diverse in gender, age and social background.

    Content

    Variable definition:

    text_id: unique identifier of the text
    text: the text
    label: sentiment of the tweet (-1 for negative, 0 for neutral, 1 for positive)

    Files available for download are:

    Train.csv - contains text on which to train your model.
    Test.csv - contains text which you must classify using your trained model.
    SampleSubmission.csv - is an example of what your submission file should look like. The order of the rows does not matter, but the ID values must be correct. Values in the 'label' column should be -1, 0 or 1.
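
    A minimal sketch for loading the training file and checking the label distribution, assuming the column names listed under 'Variable definition' above; adjust the path if needed:

    ```python
    # Inspect the TUNIZI training data described above.
    # Columns (text_id, text, label) follow the variable definitions.
    import pandas as pd

    train = pd.read_csv("Train.csv")
    print(train.columns.tolist())           # expect: text_id, text, label
    print(train["label"].value_counts())    # counts of -1 / 0 / 1 sentiment
    print(train.loc[0, "text"])             # one example Arabizi comment
    ```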

    Acknowledgements

    About AI4D-Africa; Artificial Intelligence for Development-Africa Network (ai4d.ai)

    AI4D-Africa is a network of excellence in AI in sub-Saharan Africa. It is aimed at strengthening and developing community, scientific and technological excellence in a range of AI-related areas. It is composed of African Artificial Intelligence researchers, practitioners and policy makers.

  3. Ingredients Dataset – 18K+ Product Records with Ingredients Data from...

    • crawlfeeds.com
    csv, zip
    Updated Aug 20, 2025
    Cite
    Crawl Feeds (2025). Ingredients Dataset – 18K+ Product Records with Ingredients Data from Beauty, Pets, Groceries & Health (CSV for AI & NLP) [Dataset]. https://crawlfeeds.com/datasets/ingredients-dataset-18k-product-records-with-ingredients-data-from-beauty-pets-groceries-health-csv-for-ai-nlp
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Aug 20, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    The Ingredients Dataset (18K+ records) provides a high-quality, structured collection of product information with detailed ingredients data. Covering a wide variety of categories including beauty, pet care, groceries, and health products, this dataset is designed to power AI, NLP, and machine learning applications that require domain-specific knowledge of consumer products.

    Why This Dataset Matters

    In today’s data-driven economy, access to structured and clean datasets is critical for building intelligent systems. For industries like healthcare, beauty, food-tech, and retail, the ability to analyze product ingredients enables deeper insights, including:

    • Identifying allergens or harmful substances

    • Comparing ingredient similarities across brands

    • Training LLMs and NLP models for better understanding of consumer products

    • Supporting regulatory compliance and labeling standards

    • Enhancing recommendation engines for personalized shopping

    This dataset bridges the gap between raw, unstructured product data and actionable information by providing well-organized CSV files with fields that are easy to integrate into your workflows.

    Dataset Coverage

    The 18,000+ product records span several consumer categories:

    • 🛍 Beauty & Personal Care – cosmetics, skincare, haircare products with full ingredient transparency

    • 🐾 Pet Supplies – pet food and wellness products with detailed formulations

    • 🥫 Groceries & Packaged Foods – snacks, beverages, pantry staples with structured ingredients lists

    • 💊 Health & Wellness – supplements, vitamins, and healthcare products with nutritional components

    By including multiple categories, this dataset allows cross-domain analysis and model training that reflects real-world product diversity.

    Key Features

    • 📂 18,000+ records with structured ingredient fields

    • 🧾 Covers beauty, pet care, groceries, and health products

    • 📊 Delivered in CSV format, ready to use for analytics or machine learning

    • 🏷 Includes categories and breadcrumbs for taxonomy and classification

    • 🔎 Useful for AI, NLP, LLM fine-tuning, allergen detection, and product recommendation systems

    Use Cases

    1. AI & NLP Training – fine-tune LLMs on structured ingredients data for food, beauty, and healthcare applications.

    2. Retail Analytics – analyze consumer product composition across categories to inform pricing, positioning, and product launches.

    3. Food & Health Research – detect allergens, evaluate ingredient safety, and study nutritional compositions.

    4. Recommendation Engines – build smarter product recommendation systems for e-commerce platforms.

    5. Regulatory & Compliance Tools – ensure products meet industry and government standards through ingredient validation.

    Why Choose This Dataset

    Unlike generic product feeds, this dataset emphasizes ingredient transparency across multiple categories. With 18K+ records, it strikes a balance between being comprehensive and affordable, making it suitable for startups, researchers, and enterprise teams looking to experiment with product intelligence.

    Note: Each record includes a url (main page) and a buy_url (purchase page). Records are based on the buy_url to ensure unique, product-level data.
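
    A hypothetical sketch of one of the use cases above (allergen detection) using pandas. The column names 'ingredients' and 'category' are assumptions for illustration only; check the actual CSV header before running:

    ```python
    # Flag products whose ingredients list mentions common allergen keywords.
    # 'ingredients' and 'category' are assumed column names (not confirmed above).
    import pandas as pd

    df = pd.read_csv("ingredients_dataset.csv")
    allergens = ["peanut", "soy", "gluten", "lactose"]

    mask = df["ingredients"].str.lower().str.contains("|".join(allergens), na=False)
    flagged = df[mask]
    print(flagged.groupby("category").size())   # flagged products per category
    ```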

  4. Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 29, 2023
    Cite
    Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li (2023). Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction Models [Dataset]. http://doi.org/10.5281/zenodo.7909511
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies: it has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real world, but they also pose nontrivial challenges in research on embedding models for knowledge graph completion, especially when models are developed and evaluated without regard to these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.

    Dataset Details

    The dataset consists of four variants of the Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:

    • Subject matter triples file
      • fb+/-CVT+/-REV: one folder for each variant. In each folder there are 5 files: train.txt, valid.txt, test.txt, entity2id.txt, relation2id.txt. Subject matter triples are the triples that belong to subject matter domains, i.e. domains describing real-world facts.
        • Example of a row in train.txt, valid.txt, and test.txt:
          • 2, 192, 0
        • Example of a row in entity2id.txt:
          • /g/112yfy2xr, 2
        • Example of a row in relation2id.txt:
          • /music/album/release_type, 192
        • Explanation
          • "/g/112yfy2xr" and "/m/02lx2r" are the MIDs of the subject entity and object entity, respectively. "/music/album/release_type" is the relationship between the two entities. 2, 192, and 0 are the IDs assigned by the authors to the objects.
    • Type system file
      • freebase_endtypes: Each row maps an edge type to its required subject type and object type.
        • Example
          • 92, 47178872, 90
        • Explanation
          • "92" and "90" are the type id of the subject and object which has the relationship id "47178872".
    • Metadata files
      • object_types: Each row maps the MID of a Freebase object to a type it belongs to.
        • Example
          • /g/11b41c22g, /type/object/type, /people/person
        • Explanation
          • The entity with MID "/g/11b41c22g" has a type "/people/person"
      • object_names: Each row maps the MID of a Freebase object to its textual label.
        • Example
          • /g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
        • Explanation
          • The entity with MID "/g/11b78qtr5m" has name "Viroliano Tries Jazz" in English.
      • object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
        • Example
          • /m/05v3y9r, /type/object/id, "/music/live_album/concert"
        • Explanation
          • The entity with MID "/m/05v3y9r" can be interpreted by human as a music concert live album.
      • domains_id_label: Each row maps the MID of a Freebase domain to its label.
        • Example
          • /m/05v4pmy, geology, 77
        • Explanation
          • The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
      • types_id_label: Each row maps the MID of a Freebase type to its label.
        • Example
          • /m/01xljxh, /government/political_party, 147
        • Explanation
          • The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
      • entities_id_label: Each row maps the MID of a Freebase entity to its label.
        • Example
          • /g/11b78qtr5m, Viroliano Tries Jazz, 2234
        • Explanation
          • The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
      • properties_id_label: Each row maps the MID of a Freebase property to its label.
        • Example
          • /m/010h8tp2, /comedy/comedy_group/members, 47178867
        • Explanation
          • The object with MID "/m/010h8tp2" in Freebase is a property (relation/edge); it has the label "/comedy/comedy_group/members" and id "47178867" in our dataset.
      • uri_original2simplified and uri_simplified2original: The mapping between original URIs and simplified URIs, and between simplified URIs and original URIs, respectively.
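
    A minimal sketch of how a row of train.txt can be decoded back into MIDs and relation labels using entity2id.txt and relation2id.txt, assuming the ", " separator shown in the examples above; the folder name "fb-CVT-REV" stands in for whichever of the four variants is used:

    ```python
    # Decode an ID triple (subject, relation, object) back to Freebase identifiers.
    def load_mapping(path):
        """Map integer ID -> identifier from lines like '/g/112yfy2xr, 2'."""
        mapping = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                identifier, idx = line.rsplit(", ", 1)
                mapping[int(idx)] = identifier
        return mapping

    entities = load_mapping("fb-CVT-REV/entity2id.txt")
    relations = load_mapping("fb-CVT-REV/relation2id.txt")

    with open("fb-CVT-REV/train.txt", encoding="utf-8") as f:
        head, rel, tail = (int(x) for x in next(f).split(","))

    # Something like: /g/112yfy2xr /music/album/release_type /m/02lx2r
    print(entities[head], relations[rel], entities[tail])
    ```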

  5. iX Mobile Banking

    • kaggle.com
    zip
    Updated Jun 4, 2021
    Cite
    Hamza (2021). iX Mobile Banking [Dataset]. https://www.kaggle.com/hamzaghanmi/ix-mobile-banking
    Explore at:
    Available download formats: zip (4151851 bytes)
    Dataset updated
    Jun 4, 2021
    Authors
    Hamza
    Description

    Context

    iX Mobile Banking Prediction Challenge

    Content

    This data was imported from the Zindi platform (link).

    The train set contains ~100,000 and the test set ~45,000 survey responses from across Africa and around the world.

    Train.csv - contains the target. This is the dataset that you will use to train your model.

    Test.csv - resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.

    SampleSubmission.csv - shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv and the ‘target’ column containing your predictions. The order of the rows does not matter, but the ID values must be correct.

    VariableDefinitions.csv - a file that contains the definitions of each column in the dataset. For columns FQ1–FQ37, the values are: 1 - Yes, 2 - No, 3 - Don’t Know, 4 - Refused to answer (decoded in the sketch below).
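
    A minimal sketch decoding the FQ1–FQ37 response codes using the value meanings from VariableDefinitions.csv described above; adjust paths as needed:

    ```python
    # Decode the FQ survey response codes (1=Yes, 2=No, 3=Don't Know, 4=Refused).
    import pandas as pd

    codes = {1: "Yes", 2: "No", 3: "Don't Know", 4: "Refused to answer"}
    train = pd.read_csv("Train.csv")

    fq_cols = [c for c in train.columns if c.startswith("FQ")]
    decoded = train[fq_cols].replace(codes)
    print(decoded.head())
    ```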

  6. Learning Management System

    • catalog.data.gov
    • datasets.ai
    Updated Nov 12, 2018
    Cite
    USAID (2018). Learning Management System [Dataset]. https://catalog.data.gov/es/dataset/learning-management-system
    Explore at:
    Dataset updated
    Nov 12, 2018
    Dataset provided by
    United States Agency for International Development (http://usaid.gov/)
    Description

    Although the commercial name for the USAID University Learning Management System is CSOD InCompass, the agencies that use the system have renamed (or rebranded) their specific agency portals to meet their own needs. InCompass is a comprehensive talent management system that incorporates the following functional modules:

    1) Learning -- The Learning module supports the management and tracking of training events and individual training records. Training events may be instructor-led or online. Courses may be managed within the system to provide descriptions, availability, and registration. Online content is stored on the system. Training information stored for individuals includes courses completed, scores, and courses registered for.

    2) Connect -- The Connect module supports employee collaboration efforts. Features include communities of practice, expertise location, blogs, and knowledge-sharing support. Profile information that may be stored by the system includes job position, subject matter expertise, and previous accomplishments.

    3) Performance -- The Performance module supports management of organizational goals and alignment of those goals to individual performance. The module supports managing skills and competencies for the organization, and also supports employee performance reviews. The types of information gathered about employees include their skills, competencies, and performance evaluation.

    4) Succession -- The Succession module supports workforce management and planning. The type of information gathered for this module includes prior work experience, skills, and competencies.

    5) Extended Enterprise -- The Extended Enterprise module supports delivery of training outside of the organization. Training provided may be for a fee. The type of information collected for this module includes individual data for identifying the person for training records management and related information for commercial transactions.

  7. Accompanying data for publication: "Learning the Optimal Power Flow:...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated May 1, 2025
    Cite
    Thomas Wolgast; Astrid Nieße (2025). Accompanying data for publication: "Learning the Optimal Power Flow: Environment Design Matters" [Dataset]. http://doi.org/10.5281/zenodo.13284446
    Explore at:
    Available download formats: zip
    Dataset updated
    May 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas Wolgast; Astrid Nieße
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All the data created for the publication "Learning the Optimal Power Flow: Environment Design Matters" by Wolgast and Nieße. The dataset contains all training runs performed, including the final neural network weights, metadata about each training run, and various metrics recorded over the course of training, which were used to generate the results and plots. The source code to reproduce the plots for the publication (and everything else) can be found on GitHub: https://github.com/Digitalized-Energy-Systems/rl-opf-env-design

  8. Data from: A new machine learning approach to seabed biotope classification

    • cefas.co.uk
    • environment.data.gov.uk
    Updated 2020
    + more versions
    Cite
    Centre for Environment, Fisheries and Aquaculture Science (2020). A new machine learning approach to seabed biotope classification [Dataset]. http://doi.org/10.14466/CefasDataHub.72
    Explore at:
    Dataset updated
    2020
    Dataset authored and provided by
    Centre for Environment, Fisheries and Aquaculture Science
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Time period covered
    Mar 31, 1969 - Jan 11, 2018
    Description

    Files for use with the R script accompanying the paper Cooper (2019). Note that this script also uses files from https://doi.org/10.14466/CefasDataHub.34 (details provided in the script). Cooper, K.M. (2019). A new machine learning approach to seabed biotope classification. Science Advances.

    Files include:
    • BiotopePredictionScript.R (R script)
    • EUROPE.shp (European coastline)
    • EuropeLiteScoWal.shp (European coastline with UK boundaries)
    • DEFRADEMKC8.shp (seabed bathymetry)
    • C5922DATASETFAM13022017.csv (training dataset)
    • PARTC16112018.csv (test dataset)
    • PARTCAGG16112018.csv (aggregation data)

    Description of C5922DATASETFAM13022017.csv: this file is based on the RSMP dataset (see https://www.cefas.co.uk/cefas-data-hub/dois/rsmp-baseline-dataset/), but with macrofaunal data output at the level of family or above. A variety of gear types have been used for sample collection, including grabs (0.1m2 Hamon, 0.2m2 Hamon, 0.1m2 Day, 0.1m2 Van Veen and 0.1m2 Smith-McIntyre) and cores. Of these various devices, 93% of samples were acquired using either a 0.1m2 Hamon grab or a 0.1m2 Day grab. Sieve sizes used in sample processing include 1mm and 0.5mm, reflecting the conventional preference for 1mm offshore and 0.5mm inshore. Of the samples collected using either a 0.1m2 Hamon grab or a 0.1m2 Day grab, 88% were processed using a 1mm sieve. Taxon names were standardised according to the WoRMS (World Register of Marine Species) list using the Taxon Match Tool (http://www.marinespecies.org/aphia.php?p=match). Of the initial 13,449 taxon names, only 774 remained after correction and aggregation to family level. The final dataset comprises a single-sheet comma-separated values (.csv) file.

    Colonials accounted for less than 20% of the total number of taxa and, where present, were given a value of 1 in the dataset. This component of the fauna was missing from 325 out of the 777 surveys, reflecting either a true absence, or simply that colonial taxa were ignored by the analyst. Sediment particle size data were provided as percentage weight by sieve mesh size, with the dataset including 99 different sieve sizes. Sediment samples have been processed using sieve, and a combination of sieve and laser diffraction, techniques. Key metadata fields include: Sample coordinates (Latitude & Longitude), Survey Name, Gear, Date, Grab Sample Volume (litres) and Water Depth (m). A number of additional explanatory variables are also provided (salinity, temperature, chlorophyll a, Suspended particulate matter, Water depth, Wave Orbital Velocity, Average Current, Bed Stress). In total, the dataset dimensions are 33,198 rows (samples) x 900 columns (variables/factors), yielding a matrix of 29,878,200 individual data values.

  9. Malawi News Classification Challenge

    • kaggle.com
    zip
    Updated Jan 25, 2021
    Cite
    Hamza (2021). Malawi News Classification Challenge [Dataset]. https://www.kaggle.com/hamzaghanmi/malawi-news-classification-challenge
    Explore at:
    Available download formats: zip (1600974 bytes)
    Dataset updated
    Jan 25, 2021
    Authors
    Hamza
    Area covered
    Malawi
    Description

    Context

    The data was collected from news publications in Malawi. tNyasa Ltd Data Science Lab has used three main broadcasters: the Nation Online newspaper, Radio Maria and the Malawi Broadcasting Corporation. The articles presented in the dataset are full articles and span many different genres, from social issues, family and relationships to political or economic issues.

    Content

    Train.csv - contains the target. This is the dataset that you will use to train your model.
    Test.csv - resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
    SampleSubmission.csv - shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv. The order of the rows does not matter, but the ID values must be correct.

    List of classes: ['SOCIAL ISSUES', 'EDUCATION', 'RELATIONSHIPS', 'ECONOMY', 'RELIGION', 'POLITICS', 'LAW/ORDER', 'SOCIAL', 'HEALTH', 'ARTS AND CRAFTS', 'FARMING', 'CULTURE', 'FLOODING', 'WITCHCRAFT', 'MUSIC', 'TRANSPORT', 'WILDLIFE/ENVIRONMENT', 'LOCALCHIEFS', 'SPORTS', 'OPINION/ESSAY']

    Inspiration

    Your task is to classify the news articles into one of 19 classes. The classes are mutually exclusive.
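
    A minimal baseline sketch for this classification task using TF-IDF features and logistic regression. The column names 'Text' and 'Label' are assumptions for illustration; check Train.csv's header before running:

    ```python
    # TF-IDF + logistic regression baseline for the news classification task.
    # 'Text' and 'Label' are assumed column names (not confirmed above).
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    train = pd.read_csv("Train.csv")
    model = make_pipeline(
        TfidfVectorizer(max_features=20000),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(model, train["Text"], train["Label"], cv=5, scoring="accuracy")
    print("Cross-validated accuracy:", scores.mean())
    ```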

  10. mushroom

    • openml.org
    Updated Apr 6, 2014
    Cite
    Jeff Schlimmer (2014). mushroom [Dataset]. https://www.openml.org/d/24
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2014
    Authors
    Jeff Schlimmer
    Description

    Author: Jeff Schlimmer
    Source: UCI - 1981
    Please cite: The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf

    Description

    This dataset describes mushrooms in terms of their physical characteristics. They are classified as either poisonous or edible.

    Source

    (a) Origin: 
    Mushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf 
    
    (b) Donor: 
    Jeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)
    

    Dataset description

    This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

    Attributes Information

    1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 
    2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
    3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
    4. bruises?: bruises=t,no=f 
    5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
    6. gill-attachment: attached=a,descending=d,free=f,notched=n 
    7. gill-spacing: close=c,crowded=w,distant=d 
    8. gill-size: broad=b,narrow=n 
    9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
    10. stalk-shape: enlarging=e,tapering=t 
    11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
    12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
    13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
    14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
    15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
    16. veil-type: partial=p,universal=u 
    17. veil-color: brown=n,orange=o,white=w,yellow=y 
    18. ring-number: none=n,one=o,two=t 
    19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
    20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
    21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
    22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
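
    A short sketch decoding a couple of the single-letter attribute codes listed above, fetching the data from OpenML (the dataset behind https://www.openml.org/d/24). The decoding dictionaries are taken directly from the attribute list; the assumption is that the OpenML copy stores the raw letter codes:

    ```python
    # Decode 'cap-shape' and 'odor' letter codes into readable labels.
    from sklearn.datasets import fetch_openml

    cap_shape = {"b": "bell", "c": "conical", "x": "convex",
                 "f": "flat", "k": "knobbed", "s": "sunken"}
    odor = {"a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
            "m": "musty", "n": "none", "p": "pungent", "s": "spicy"}

    mushroom = fetch_openml(data_id=24, as_frame=True)  # OpenML dataset id 24
    X = mushroom.data
    print(X["cap-shape"].map(cap_shape).value_counts())
    print(X["odor"].map(odor).value_counts())
    ```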
    

    Relevant papers

    Schlimmer, J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19). Doctoral dissertation, Department of Information and Computer Science, University of California, Irvine.

    Iba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79. Ann Arbor, Michigan: Morgan Kaufmann.

    Duch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules from training data using backpropagation networks, in: Proc. of the The 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30, [Web Link]

    Duch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of crisp logical rules using constrained backpropagation networks - comparison of two new approaches, in: Proc. of the European Symposium on Artificial Neural Networks (ESANN'97), Bruges, Belgium, 16-18.4.1997.

  11. Cryptocurrency Price Prediction by IEEE ENSI SB

    • kaggle.com
    zip
    Updated Apr 21, 2021
    Cite
    Rafat Haque (2021). Cryptocurrency Price Prediction by IEEE ENSI SB [Dataset]. https://www.kaggle.com/rafat97/cryptocurrency-price-prediction-by-ieee-ensi-sb
    Explore at:
    Available download formats: zip (2102088 bytes)
    Dataset updated
    Apr 21, 2021
    Authors
    Rafat Haque
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This is a comprehensive dataset that captures the prices of a cryptocurrency along with various features (social media attributes, trading attributes and time-related attributes) recorded on an hourly basis over several months, all of which contribute directly or indirectly to the cryptocurrency's volatile price changes.

    Files available for download:

    • Train.csv - contains the target. This is the dataset that you will use to train your model.
    • Test.csv - resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
    • SampleSubmission.csv - shows the submission format for this competition, with the ‘id’ column mirroring that of Test.csv and the ‘close’ column containing your predictions. The order of the rows does not matter, but the id values must be correct.

    Fields definitions

    asset_id: An asset ID. We refer to all supported cryptocurrencies as assets
    
    open: Open price for the time period
    
    close: Close price for the time period
    
    high: The highest price of the time period
    
    low: Lowest price of the time period
    
    volume: Number of tweets
    
    market_cap: Total available supply multiplied by the current price in USD
    
    url_shares: Every time an identified relevant URL is shared within relevant social posts that contain relevant terms
    
    unique_url_shares: Number of unique url shares posted and collected on social media
    
    reddit_posts: Number of latest Reddit posts for supported coins
    
    reddit_posts_score: Reddit Karma score on individual posts
    
    reddit_comments: Comments on Reddit that contain relevant terms
    
    Reddit_comments_score: Reddit Karma score on comments
    
    tweets: Number of crypto-specific tweets based on tuned search and filtering criteria
    
    tweet_spam: Number of tweets classified as spam
    
    tweet_followers: Number of followers on selected tweets
    
    tweet_quotes: Number of quotes on selected tweets
    
    tweet_retweets: Number of retweets of selected tweets
    
    tweet_replies: Number of replies on selected tweets
    
    tweet_favorites: Number of likes on an individual social post that contains a relevant term
    
    tweet_sentiment1: Number of tweets which has a sentiment of “very bullish”
    
    tweet_sentiment2: Number of tweets which has a sentiment of “bullish”
    
    tweet_sentiment3: Number of tweets which has a sentiment of “neutral”
    
    tweet_sentiment4: Number of tweets which has a sentiment of “bearish”
    
    tweet_sentiment5: Number of tweets which has a sentiment of “very bearish”
    
    tweet_sentiment_impact1: “Very bearish” sentiment impact
    
    tweet_sentiment_impact2: “Bearish” sentiment impact
    
    tweet_sentiment_impact3: “Neutral” sentiment impact
    
    tweet_sentiment_impact4: “Bullish” sentiment impact
    
    tweet_sentiment_impact5: “Very bullish” sentiment impact
    
    social_score: Sum of followers, retweets, likes, reddit karma etc of social posts collected
    
    average_sentiment: The average score of sentiments, an indicator of the general sentiment being spread about a coin
    
    news: Number of news articles for supported coins
    
    price_score: A score we derive from a moving average that gives the coin some indication of an upward or downward trend based solely on the market value
    
    social_impact_score: A score of the volume/interaction/impact of social to give a sense of the size of the market or awareness of the coin
    
    correlation_rank: The algorithm that determines the correlation of our social data to the coin price/volume
    
    galaxy_score: An indicator of how well a coin is doing
    
    volatility: Volatility indicator
    
    market_cap_rank: The rank based on the total available supply multiplied by the current price in USD
    
    percent_change_24h_rank: The rank based on the percent change in price since 24 hours ago
    
    volume_24h_rank: The rank based on volume in the last 24 hours
    
    social_volume_24h_rank: The rank based on the number of social posts that contain relevant terms in the last 24 hours
    
    social_score_24h_rank: The rank based on the sum of followers, retweets, likes, Reddit karma etc of social posts collected in the last 24 hours
    
    medium: Number of Medium articles for supported coins
    
    youtube: Number of videos with description that contains relevant terms
    
    social_volume: Number of social posts that contain relevant terms
    
    price_btc: Exchange rate with another coin
    
    market_cap_global: Total available supply multiplied by the current price in USD
    
    percent_change_24h: Percent change in price since 24 hours ago
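
    A minimal sketch that derives a simple sentiment-balance feature from the tweet sentiment counts defined above and checks its correlation with the closing price; it assumes Train.csv contains the listed columns:

    ```python
    # Bullish-minus-bearish tweet balance vs. closing price.
    import pandas as pd

    train = pd.read_csv("Train.csv")
    bullish = train["tweet_sentiment1"] + train["tweet_sentiment2"]   # very bullish + bullish
    bearish = train["tweet_sentiment4"] + train["tweet_sentiment5"]   # bearish + very bearish
    train["sentiment_balance"] = bullish - bearish
    print(train[["sentiment_balance", "close"]].corr())
    ```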
    

    Credit

    IEEE ENSI SB

  12. Cryptocurrency Price

    • kaggle.com
    zip
    Updated Jul 4, 2021
    Cite
    Hamza (2021). Cryptocurrency Price [Dataset]. https://www.kaggle.com/hamzaghanmi/cryptocurrency-price
    Explore at:
    Available download formats: zip (2102088 bytes)
    Dataset updated
    Jul 4, 2021
    Authors
    Hamza
    Description

    Context

    This is a comprehensive dataset that captures the prices of a cryptocurrency along with various features (social media attributes, trading attributes and time-related attributes) recorded on an hourly basis over several months, all of which contribute directly or indirectly to the cryptocurrency's volatile price changes.

    Note that this data is from the competition Cryptocurrency Closing Price Prediction (link).

    Content

    • Train.csv: contains the target. This is the dataset that you will use to train your model.
    • Test.csv: resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
    • SampleSubmission.csv: shows the submission format for this competition, with the ‘id’ column mirroring that of Test.csv and the ‘close’ column containing your predictions. The order of the rows does not matter, but the id values must be correct.

  13. Data from: Image-based yield prediction for tall fescue using random forests...

    • zenodo.org
    csv, zip
    Updated Mar 18, 2025
    Cite
    Sarah Ghysels; Steven Maenhout; Reena Dubey; Michaël Goethals; Pieter De Wagter; Franky Van Peteghem; Dirk Reheul; Margot Van Rysselberghe; Kevin Dewitte (2025). Image-based yield prediction for tall fescue using random forests and convolutional neural networks [Dataset]. http://doi.org/10.5281/zenodo.14289667
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Mar 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sarah Ghysels; Steven Maenhout; Reena Dubey; Michaël Goethals; Pieter De Wagter; Franky Van Peteghem; Dirk Reheul; Margot Van Rysselberghe; Kevin Dewitte
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record contains all data used in the research paper 'Image-based yield prediction for tall fescue using random forests and convolutional neural networks' by Ghysels, S., De Baets, B., Reheul, D. and Maenhout, S. 'Train_dataset.zip' and 'Test_dataset.zip' contain the RGB images of individual tall fescue plants, split into a training set and test set respectively. 'Multigras_data.csv' contains the dry matter yield measurements ('DMY (kg/ha)'), the breeder's evaluation scores ('Score MG') and the location of each individual plant on the field ('Blok_Rij_Plantnr', meaning Block-row-column).

  14. 1970 British Cohort Study - Linked Administrative Data

    • datacatalogue.ukdataservice.ac.uk
    Updated Mar 6, 2025
    + more versions
    Cite
    UK Data Service (2025). 1970 British Cohort Study - Linked Administrative Data [Dataset]. http://doi.org/10.5255/UKDA-SN-8769-1
    Explore at:
    Dataset updated
    Mar 6, 2025
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Time period covered
    Dec 1, 1984 - Apr 30, 2016
    Area covered
    Scotland
    Description

    The 1970 British Cohort Study (BCS70) is a longitudinal birth cohort study, following a nationally representative sample of over 17,000 people born in England, Scotland and Wales in a single week of 1970. Cohort members have been surveyed throughout their childhood and adult lives, mapping their individual trajectories and creating a unique resource for researchers. It is one of very few longitudinal studies following people of this generation anywhere in the world.

    Since 1970, cohort members have been surveyed at ages 5, 10, 16, 26, 30, 34, 38, 42, 46, and 51. Featuring a range of objective measures and rich self-reported data, BCS70 covers an incredible amount of ground and can be used in research on many topics. Evidence from BCS70 has illuminated important issues for our society across five decades. Key findings include how reading for pleasure matters for children's cognitive development, why grammar schools have not reduced social inequalities, and how childhood experiences can impact on mental health in mid-life. Every day researchers from across the scientific community are using this important study to make new connections and discoveries.

    BCS70 is run by the Centre for Longitudinal Studies (CLS), a research centre in the UCL Institute of Education, which is part of University College London. The content of BCS70 studies, including questions, topics and variables can be explored via the CLOSER Discovery website.

    How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
    For information on how to access biomedical data from BCS70 that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.

    Polygenic Indices
    Polygenic indices are available under Special Licence SN 9439. Derived summary scores have been created that combine the estimated effects of many different genes on a specific trait or characteristic, such as a person's risk of Alzheimer's disease, asthma, substance abuse, or mental health disorders. These polygenic scores can be combined with existing survey data to offer a more nuanced understanding of how cohort members' outcomes may be shaped.

    Secure Access datasets
    Secure Access versions of BCS70 have more restrictive access conditions than versions available under the standard Safeguarded Licence.

    The BCS70 linked Scottish Medical Records (SMR) datasets include data files from the Information Services Division (ISD) part of the NHS National Services Scotland database for those cohort members who provided consent to health data linkage in the Age 42 sweep.

    The SMR database contains information about all hospital admissions in Scotland. The following linked SMR datasets are available:

    • SN 8768: 1970 British Cohort Study: Linked Administrative Data, Prescribing Information System, Scottish Medical Records 2009-2015: Secure Access (PIS)
    • SN 8769 (this study): 1970 British Cohort Study: Linked Administrative Data, Maternity Attendance, Scottish Medical Records 1984-2016: Secure Access (SMR02)
    • SN 8770: 1970 British Cohort Study: Linked Administrative Data, Inpatient Attendance, Scottish Medical Records 1981-2016: Secure Access (SMR01)
    • SN 8771: 1970 British Cohort Study: Linked Administrative Data, Outpatient Attendance, Scottish Medical Records, 1981-2016: Secure Access (SMR00)

    Researchers who require access to more than one dataset need to apply for them individually.

    Further information about the SMR database can be found on the Information Services Division Scotland SMR Datasets webpage: https://www.ndc.scot.nhs.uk/Data-Dictionary/SMR-Datasets/

  15. Medical Conversation Corpus (100k+)

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Medical Conversation Corpus (100k+) [Dataset]. https://www.kaggle.com/datasets/thedevastator/medical-conversation-corpus-100k
    Explore at:
    Available download formats: zip (46487525 bytes)
    Dataset updated
    Nov 26, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Medical Conversation Corpus (100k+)

    Generative Language Modeling for Medical Applications

    By Huggingface Hub [source]

    About this dataset

    This comprehensive, open-source dataset of 100k+ conversations and instructions that include medical terminology is well suited to training generative language models for various medical applications. With samples collected from human conversations, the dataset contains a variety of options and suggestions to assist in creating useful language models, from prescribed medications to home remedies such as yoga exercises, breathing exercises, and natural remedies. A language model is only as trustworthy as the data it is trained on; this collection aims to provide legitimate, information-rich samples you can use to make decisions that matter in real life.


    How to use the dataset

    • Download the dataset. The dataset can be downloaded by clicking on the “Download” button located at the top of this page and following the prompts.
    • Unzip and save the file in a location of your choice on your computer or device.
    • Open up the ‘train’ or ‘test’ CSV file, depending on whether you would like to use it for training or testing purposes respectively. Both contain conversations and instructions utilizing medical terminologies which can be used to train a generative language model for medical applications.
    • Read through each conversation/instruction provided in each row of the data frame column labeled 'Conversation'. These conversations provide examples of exchanges between doctors, patients, pharmacists etc., discussing topics such as health advice, natural home remedies and prescriptions, as well as conversations involving diagnosis, symptoms, medication side effects and health concerns pertaining to certain medical conditions.
    • Note that all conversations are written at varying levels of complexity, with an emphasis on effective communication within a healthcare environment, either directly with patients or amongst colleagues discussing cases via verbal/written exchanges using medical terminology.

    6. Utilize natural language processing (NLP) techniques, such as BERT embeddings or word embeddings corresponding to different domains of medicine, to relate and sort these conversations into specific categories of interest identified by domain experts, whether for further research purposes (mathematical and statistical) or for a wider understanding of contexts in diverse languages such as Chinese, Spanish, Portuguese and French. A minimal embedding sketch follows below.
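
    One possible realization of step 6, using a BERT-style sentence encoder and a simple clustering step. It requires the sentence-transformers and scikit-learn packages; the model name is just one common choice, and the 'Conversation' column name follows the Columns section below:

    ```python
    # Embed conversations with a sentence encoder and group them into rough topics.
    import pandas as pd
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    df = pd.read_csv("train.csv")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(df["Conversation"].head(1000).tolist(), show_progress_bar=True)

    clusters = KMeans(n_clusters=10, n_init="auto").fit_predict(embeddings)
    print(pd.Series(clusters).value_counts())   # size of each rough topical group
    ```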

    Research Ideas

    • Natural language processing applications such as automated medical transcription.
    • Feature extraction and detection of health-related keywords for predictive analytics in healthcare applications.
    • Automated diagnostics utilizing the language models trained on this dataset to identify diseases and illnesses based on user inputs, either through symptoms or other risk factors (e.g., age, lifestyle etc.)

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name  | Description                                                                                              |
    |:-------------|:---------------------------------------------------------------------------------------------------------|
    | Conversation | The conversation between two or more people or an instruction utilizing medical terminologies. (String)   |

    File: test.csv

    | Column name  | Description                                                                                              |
    |:-------------|:---------------------------------------------------------------------------------------------------------|
    | Conversation | The conversation between two or more people or an instruction utilizing medical terminologies. (String)   |

    Acknowledgements

    If you use this dataset in your research, please cred...

  16. Expresso Churn Prediction Challenge

    • kaggle.com
    zip
    Updated Aug 30, 2021
    Cite
    Hamza (2021). Expresso Churn Prediction Challenge [Dataset]. https://www.kaggle.com/hamzaghanmi/expresso-churn-prediction-challenge
    Explore at:
    Available download formats: zip (115406175 bytes)
    Dataset updated
    Aug 30, 2021
    Authors
    Hamza
    Description

    Context

    This data was imported from the Zindi platform in the context of a competition; here is the link to the competition. The objective of the competition is to develop a predictive model that determines the likelihood for a customer to churn, i.e. to stop purchasing airtime and data from Expresso.

    Content

    The data describes 2.5 million Expresso clients.
    • Train.csv - contains information about 2 million customers. There is a column called CHURN that indicates whether a client churned or did not churn. This is the target; you must estimate the likelihood that these clients churned. You will use this file to train your model.
    • Test.csv - is similar to Train.csv, but without the CHURN column. You will use this file to test your model.
    • SampleSubmission.csv - is an example of what your submission should look like. The order of the rows does not matter, but the user_id values must be correct.
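
    A minimal sketch computing the baseline churn rate and writing a constant-probability submission; the prediction column in SampleSubmission.csv is assumed to be named CHURN (only user_id is named explicitly above):

    ```python
    # Baseline churn rate and a constant-likelihood submission.
    import pandas as pd

    train = pd.read_csv("Train.csv")
    sub = pd.read_csv("SampleSubmission.csv")

    churn_rate = train["CHURN"].mean()            # share of clients who churned
    print("Baseline churn rate:", churn_rate)

    sub["CHURN"] = churn_rate                     # same likelihood for every user_id
    sub.to_csv("submission.csv", index=False)
    ```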

  17. Conversations on Coding, Debugging, Storytelling

    • kaggle.com
    zip
    Updated Dec 1, 2023
    Cite
    The Devastator (2023). Conversations on Coding, Debugging, Storytelling [Dataset]. https://www.kaggle.com/datasets/thedevastator/conversations-on-coding-debugging-storytelling-s
    Explore at:
    Available download formats: zip (1371478 bytes)
    Dataset updated
    Dec 1, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Conversations on Coding, Debugging, Storytelling & Science


    By Peevski (From Huggingface) [source]

    About this dataset

    The OpenLeecher/GPT4-10k dataset is a comprehensive collection of 100 diverse conversations, presented in text format, revolving around a wide range of topics. These conversations cover various domains such as coding, debugging, storytelling, and science. Aimed at facilitating training and analysis purposes for researchers and developers alike, this dataset offers an extensive array of conversation samples.

    Each conversation within this dataset delves into different subject matters related to coding techniques, debugging strategies and storytelling methods, while also exploring concepts like spatial and logical thinking. Furthermore, the conversations touch upon scientific fields including chemistry, physics and biology. To add further depth to the dataset's content, it also includes discussions on the topic of law.

    By providing this rich assortment of conversations spanning multiple domains and disciplines in one cohesive dataset on the Kaggle platform, as a train.csv file, it empowers users to explore and analyze these dialogue examples effortlessly. This compilation serves as an invaluable resource for understanding various aspects of coding practice alongside stimulating scientific discussion across multiple fields.

    How to use the dataset

    Introduction:

    • Understanding the Dataset Structure: The dataset consists of a CSV file named 'train.csv'. When examining the file's columns using software or a programming language of your choice (e.g., Python), you will find a 'chat' column containing text data that represents conversations between two or more participants.

    • Exploring Different Topics: The dataset covers a vast spectrum of subjects including coding techniques, debugging strategies, storytelling methods, spatial thinking, logical thinking, chemistry, physics, biology, and law. Each conversation touches on one or more of the following:

      • Coding Techniques: Discover discussions on various programming concepts and best practices.
      • Debugging Strategies: Explore conversations related to identifying and fixing software issues.
      • Storytelling Methods: Dive into dialogues about effective storytelling techniques in different contexts.
      • Spatial Thinking: Engage with conversations that involve developing spatial reasoning skills for problem-solving.
      • Logical Thinking: Learn from discussions focused on enhancing logical reasoning abilities related to different domains.
      • Chemistry
      • Physics
      • Biology
      • Law
    • Analyzing Conversations: leverage natural language processing (NLP) tools and techniques such as sentiment analysis, or simply load the file and count the conversations (see the sketch after this section).

    • Accessible Code Examples

    Maximize Training Efficiency:

    • Taking Advantage of Diversity:

    • Creating New Applications:

    Conclusion:
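
    A cleaned-up version of the loading-and-counting snippet referenced in the "Analyzing Conversations" item above; the 'chat' column name follows the dataset structure described earlier:

    ```python
    # Load the conversations and report how many there are.
    import pandas as pd

    df = pd.read_csv("train.csv")
    print("Number of conversations:", len(df))
    print(df["chat"].str.len().describe())   # rough distribution of conversation lengths
    ```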

    Research Ideas

    • Natural Language Processing Research: Researchers can leverage this dataset to train and evaluate natural language processing models, particularly in the context of conversational understanding and generation. The diverse conversations on coding, debugging, storytelling, and science can provide valuable insights into modeling human-like conversation patterns.
    • Chatbot Development: The dataset can be utilized for training chatbots or virtual assistants that can engage in conversations related to coding, debugging, storytelling, and science. By exposing the chatbot to a wide range of conversation samples from different domains, developers can ensure that their chatbots are capable of providing relevant and accurate responses.
    • Domain-specific Intelligent Assistants: Organizations or individuals working in fields such as coding education or scientific research may use this dataset to develop intelligent assistants tailored specifically for these domains. These assistants can help users navigate complex topics by answering questions related to coding techniques, debugging strategies, storytelling methods, or scientific concepts. Overall, 'train.csv' provides a rich resource for researchers and developers interested in building conversational AI systems with knowledge across multiple domains, including even legal matters.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    **Li...

  18. Milan AirQuality and Weather Dataset(daily&hourly)

    • kaggle.com
    zip
    Updated Mar 21, 2025
    Cite
    Eduardo Mosca (2025). Milan AirQuality and Weather Dataset(daily&hourly) [Dataset]. https://www.kaggle.com/datasets/edmos07/milan-air-quality-and-weather-dataset-daily
    Explore at:
    Available download formats: zip (5079748 bytes)
    Dataset updated
    Mar 21, 2025
    Authors
    Eduardo Mosca
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [WARNING: an in-depth description for the hourly data is missing at the moment. Please refer to the Open-Meteo website (the Air Quality and Historical Weather APIs specifically) for descriptions of the columns included in the hourly data for the time being. In short, though, the hourly data info can be derived from the daily data info, as the hourly data is used to construct the daily data; for example, if avg_nitrogen_dioxide is the average of the hourly instant (10 meters above ground, in μg/m3) nitrogen dioxide values for a particular day, the "nitrogen_dioxide" column will consist of the hourly instant measurements of nitrogen dioxide (10 meters above ground, in μg/m3).]

    Result of a course project in the context of the Master's Degree in Data Science at Università Degli Studi di Milano-Bicocca. The dataset was built in the hope of finding ways to tackle the bad air quality for which Milan is becoming renowned, and to make the training of ML models possible. The data was collected through Open-Meteo's APIs, which in turn obtained it from "reanalysis models" of European initiatives used for weather and air quality forecasting. The data was validated by the owners of the reanalysis datasets from which it comes, and during the construction of this specific dataset its quality was assessed across the accuracy, completeness and consistency dimensions. We aggregated the data from hourly to daily; the entire data management process can be consulted in the attached PDF.

    File descriptions:
    • weatheraqDataset.csv : contains DAILY data on weather and air quality for the city of Milan, in comma-separated values (CSV) format.
    • weatheraqDataset_Report.pdf : a report built to illustrate and make explicit the process followed to build the final dataset starting from the original data sources; it also explains any processing and aggregation/integration operations carried out.
    • weatheraqHourly.csv : HOURLY data, the counterpart of the daily dataset (the daily data is the result of aggregating the hourly data). The higher granularity and larger number of rows can help achieve better results; for detailed descriptions of how these hourly values are recorded and at what resolutions, please visit the Open-Meteo website as stated in the warning at the start of the description.

    GitHub repo of the project: https://github.com/edmos7/weather-aqMilan

    Column descriptions for DAILY data (weatheraqDataset.csv):

    note: both 'date' in the DAILY data and 'datetime' in the HOURLY data are in local Milan time (CET & CEST), adjusted for Daylight Saving Time (DST).

    • date: refers to day in calendar year which other values are relative to. YYYY-MM-DD format, in Milan local time.
    • avg_nitrogen_dioxide : the average of the hourly instant (10 meters above ground, in μg/m3) nitrogen dioxide values for a particular day.
    • max_nitrogen_dioxide : the maximum among the hourly instant (10 meters above ground, in μg/m3) nitrogen dioxide values for a particular day.
    • min_nitrogen_dioxide : the minimum among the hourly instant (10 meters above ground, in μg/m3) nitrogen dioxide values for a particular day.
    • max_time_nitrogen_dioxide : the hour at which the hourly nitrogen dioxide values reached their maximum, HH:MM:SS.
    • min_time_nitrogen_dioxide : the hour at which the hourly nitrogen dioxide values reached their minimum, HH:MM:SS.
    NOTE: all other "pollutant" columns (pm10, pm2_5, sulphur_dioxide, ozone) follow the same structure as the above unless specified below.
    • pm2_5_avgdRolls : the average of the 24-hour rolling averages of particulate matter with diameter below 2.5 μm (pm2.5, in μg/m3) for a particular day. Rolling averages are used to compute the European Air Quality Index (EAQI) at a given moment, so averages of rolling averages were used in the computation of our daily EAQI. NOTE: the above also applies to the 'pm10_avgdRolls' field.
    • eaqi : the computed air quality level according to European Environment Agency thresholds, considering daily averages for ozone, sulphur dioxide and nitrogen dioxide, and the average of daily rolling averages for pm10 and pm2.5. The value corresponds to the highest level among the single-pollutant levels (see the sketch after this column list).
    • nitrogen_dioxide_eaqi : the air quality level computed through EAQI thresholds for nitrogen dioxide individually; all other [pollutant]_eaqi fields follow the same reasoning.
    • avg_temperature_2m: average of the hourly air temperatures recorded at 2 meters above ground level for the day (°C);
    • max_temperature_2m: maximum among the hourly air temperatures recorded at 2 meters above ground level for the day (°C);
    • min_temperature_2m: minimum among the hourly air temperatures recorded at 2 meters above ground level for the day (°C);
    • avg_relative_humidity_2m: average of the hourly relative humidity recorded at 2 meters above ground level for the day (%);
    • avg_dew_point_2m: average of the hourly dew point temperatures recorded at 2 meters above ground for the day (°C);
    • avg_apparent_temperature: average of hourly...
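    A minimal sketch of the hourly-to-daily aggregation and EAQI logic described above, assuming the hourly file has a 'datetime' column plus pollutant columns named 'nitrogen_dioxide' and 'pm2_5' (these column names and the EAQI band thresholds below are illustrative assumptions, not the official EEA breakpoints):

    ```python
    import pandas as pd

    # Load the hourly data; rows are assumed to be in chronological order.
    hourly = pd.read_csv("weatheraqHourly.csv", parse_dates=["datetime"])
    hourly["date"] = hourly["datetime"].dt.date

    # Daily aggregates for one pollutant: average, maximum and minimum of the hourly values.
    daily_no2 = hourly.groupby("date")["nitrogen_dioxide"].agg(
        avg_nitrogen_dioxide="mean",
        max_nitrogen_dioxide="max",
        min_nitrogen_dioxide="min",
    )

    # 24-hour rolling average for PM2.5, then averaged per day (as for pm2_5_avgdRolls).
    hourly["pm2_5_roll24"] = hourly["pm2_5"].rolling(window=24, min_periods=1).mean()
    daily_pm25_rolls = hourly.groupby("date")["pm2_5_roll24"].mean().rename("pm2_5_avgdRolls")

    # Map a daily concentration to an EAQI band. The thresholds are placeholders for
    # illustration only; substitute the official EEA breakpoints for real use.
    def eaqi_level(value, thresholds):
        for level, upper in enumerate(thresholds, start=1):
            if value <= upper:
                return level
        return len(thresholds) + 1

    NO2_BANDS = [40, 90, 120, 230, 340]   # hypothetical example bands (μg/m3)
    PM25_BANDS = [10, 20, 25, 50, 75]     # hypothetical example bands (μg/m3)

    daily = daily_no2.join(daily_pm25_rolls)
    daily["nitrogen_dioxide_eaqi"] = daily["avg_nitrogen_dioxide"].apply(eaqi_level, thresholds=NO2_BANDS)
    daily["pm2_5_eaqi"] = daily["pm2_5_avgdRolls"].apply(eaqi_level, thresholds=PM25_BANDS)

    # Overall daily EAQI = worst (highest) level among the single-pollutant levels.
    daily["eaqi"] = daily[["nitrogen_dioxide_eaqi", "pm2_5_eaqi"]].max(axis=1)
    ```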
  19. GSM8K - Grade School Math 8K Q&A

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). GSM8K - Grade School Math 8K Q&A [Dataset]. https://www.kaggle.com/datasets/thedevastator/grade-school-math-8k-q-a
    Explore at:
    Available download formats: zip (3418660 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    GSM8K - Grade School Math 8K Q&A

    A Linguistically Diverse Dataset for Multi-Step Reasoning Question Answering

    By Huggingface Hub [source]

    About this dataset

    This Grade School Math 8K linguistically diverse training & test set is designed to help you develop and improve multi-step reasoning for question answering. The dataset contains three separate data files, socratic_test.csv, main_test.csv and main_train.csv, each containing a set of grade school math questions whose solutions require multiple steps, together with their answers. Each file contains the same columns: question and answer. The questions are crafted to lead you through the reasoning needed to arrive at the correct answer, offering ample opportunity to learn through practice. With over 8 thousand entries across the training and test splits, it takes solid multi-step reasoning skills to ace these questions. Deepen your knowledge and master any challenge with this GSM8K set!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset provides a unique opportunity to study multi-step reasoning for question answering. The GSM8K linguistically diverse training & test set consists of about 8,000 questions and answers created to simulate real-world grade school mathematics scenarios. Each question is paired with a single worked answer, and the questions cover topics such as arithmetic, algebra, probability and more.

    The dataset's main files are main_train.csv and main_test.csv, which contain the training and test question/answer pairs; socratic_test.csv holds a Socratic-style variant of the test problems in which the reasoning is broken into guiding sub-questions. Each file has two columns, question and answer, so every row holds a single problem together with its multi-step worked solution. These columns can be used with text-analysis models such as BERT (or more recent language models) to explore different representations for question answering, or to build and evaluate models that must reason over numerical data, for example for forecasting or classification tasks.

    To use this dataset efficiently, first get familiar with its structure by reading the documentation, so that you know what information is available and how each field is defined and formatted. Then work through the examples that best suit your purpose, whether that is an education-research experiment, generating insights for analytics reports, or improving an AI project. Learning the variable definitions and the tools involved before diving in keeps the preliminary background work short and the research journey focused.
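    As a minimal first step along those lines, here is a sketch of loading the files with pandas (file names as listed above; the '####' separator before the final answer follows the public GSM8K convention and is an assumption for this particular export, so verify it on a few rows):

    ```python
    import pandas as pd

    # Load the train and test splits (columns: question, answer).
    train = pd.read_csv("main_train.csv")
    test = pd.read_csv("main_test.csv")

    print(train.shape, test.shape)
    print(train.loc[0, "question"])
    print(train.loc[0, "answer"])

    # In the public GSM8K release, the worked answer ends with '#### <final answer>'.
    # If this export follows the same convention, the final numeric answer can be
    # separated from the reasoning steps like this:
    def split_answer(answer: str):
        if "####" in answer:
            reasoning, final = answer.rsplit("####", 1)
            return reasoning.strip(), final.strip()
        return answer.strip(), None

    parts = train["answer"].apply(split_answer)
    train["reasoning"] = [p[0] for p in parts]
    train["final_answer"] = [p[1] for p in parts]
    ```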

    Research Ideas

    • Training language models for improving accuracy in natural language processing applications such as question answering or dialogue systems.
    • Generating new grade school math questions and answers using g...
  20. ASLG-PC12 (English-ASL Gloss Parallel Corpus 2012)

    • kaggle.com
    Updated Dec 7, 2022
    Cite
    The Devastator (2022). ASLG-PC12 (English-ASL Gloss Parallel Corpus 2012) [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlock-the-power-of-english-asl-with-aslg-pc12-c/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 7, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    ASLG-PC12 (English-ASL Gloss Parallel Corpus 2012)

    Interactions between Corpus and Lexicon LREC

    By Huggingface Hub [source]

    About this dataset

    This English-ASL parallel corpus from 2012, ASLG-PC12, provides valuable insight and data to users interested in the study of language. The dataset contains columns of easily readable gloss and text pairs. This gloss-and-text pairing can greatly assist the study of conversational habits by providing written accompaniments to American Sign Language (ASL) signs. Whether you are looking for a diverse sampling of ASL usage or want to delve deeper into sign language research, this corpus has plenty to offer linguists, therapists, teachers and students alike. With over 12,000 entries altogether in one organized source, any researcher would find it useful!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset provides an interesting and insightful look into the relationship between American Sign Language (ASL) and English. The ASLG-PC12 corpus contains a collection of English-ASL gloss and text pairs, meaning you can learn not just about the words and signs used in ASL, but also their relationship to one another.

    To get started using this dataset, first explore the data sample. This can be done by opening the train.csv file included in the dataset. It includes columns both for the gloss description of each sign and for its corresponding English translation.

    Once familiar with the data, it's time to dive deeper! You can use this dataset for various purposes: from training machine learning algorithms for English-to-gloss translation, to creating an online dictionary of sorts that maps commonly used English words to their ASL gloss equivalents. Note that the corpus is text-only, so sign recognition from images or video would require a separate dataset. Whatever application you plan to build, this corpus promises insights into human communication that cannot be found elsewhere!

    So unlock your power with American Sign Language and start exploring all that the ASLG-PC12 corpus has to offer!
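    A minimal sketch of that first exploration step, assuming train.csv is in the working directory with the gloss and text columns described in the Columns section below:

    ```python
    import pandas as pd

    # Load the parallel corpus (columns: gloss, text).
    pairs = pd.read_csv("train.csv")

    print(pairs.shape)
    print(pairs[["gloss", "text"]].head())

    # A simple starting point for translation experiments: aligned
    # (source, target) lists for an English -> ASL-gloss seq2seq model.
    sources = pairs["text"].astype(str).str.lower().tolist()
    targets = pairs["gloss"].astype(str).str.lower().tolist()

    # Quick sanity check on how gloss length compares to English length.
    gloss_len = pairs["gloss"].astype(str).str.split().str.len()
    text_len = pairs["text"].astype(str).str.split().str.len()
    print("average gloss/text token ratio:", (gloss_len / text_len).mean())
    ```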

    Research Ideas

    • Training ASL language recognition algorithms.
    • Developing machine translation systems to translate between English and ASL.
    • Designing a web or mobile application to help teach users how to sign fluently in either language.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:----------------------------------------------------------|
    | gloss       | The literal sign-for-sign translation of a word. (Text)    |
    | text        | The standard English equivalent of the ASL gloss. (Text)   |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.
