The data was collected by 800 taskers from Kenya, Mexico and India. There are nine classes, each corresponding to a different sign.
The objective of this competition is to classify the nine different Sign Language signs present in the images, using machine learning or deep learning algorithms.
Images.zip: a zip file containing all images in both the train and test sets.
Train.csv: contains the target. This is the dataset that you will use to train your model.
Test.csv: resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
SampleSubmission.csv: shows the submission format for this competition, with the ‘Image_ID’ column mirroring that of Test.csv and the ‘label’ column containing your predictions. The order of the rows does not matter, but the values in the ‘Image_ID’ column must be correct.
This data was imported from the Zindi classification challenge: https://zindi.africa/competitions/kenyan-sign-language-classification-challenge
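As an illustration only, here is a minimal Python sketch of a naive baseline submission built from the files above; it assumes Train.csv and Test.csv expose an 'Image_ID' column and Train.csv a 'label' column (only the SampleSubmission.csv columns are confirmed in the description).

```python
# Naive baseline sketch. Assumption: Train.csv and Test.csv expose an
# 'Image_ID' column and Train.csv a 'label' column, mirroring SampleSubmission.csv.
import pandas as pd

train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

# Inspect the class balance across the nine signs.
print(train["label"].value_counts())

# Predict the most frequent sign for every test image.
most_common = train["label"].mode()[0]
submission = pd.DataFrame({"Image_ID": test["Image_ID"], "label": most_common})
submission.to_csv("baseline_submission.csv", index=False)
```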
License: GNU General Public License v2.0 - http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
TUNIZI is the first 100% Tunisian Arabizi sentiment analysis dataset, developed as part of AI4D’s ongoing NLP project for African languages. Tunisian Arabizi is the representation of the Tunisian dialect written in Latin characters and numbers rather than Arabic letters.
iCompass gathered comments from social media platforms that express sentiment about popular topics. For this purpose, we extracted 100k comments using public streaming APIs.
TUNIZI was preprocessed by removing links, emoji symbols, and punctuation.
The collected comments were manually annotated using an overall polarity: positive (1), negative (-1) and neutral (0). The annotators were diverse in gender, age and social background.
Variable definition:
text_id: Unique identifier of the text
text: Text
label: Sentiment of the tweet (-1 for negative, 0 for neutral, 1 for positive)
Files available for download are:
Train.csv - contains text on which to train your model.
Test.csv - contains text which you must classify using your trained model.
SampleSubmission.csv - is an example of what your submission file should look like. The order of the rows does not matter, but the names of the ID must be correct. Values in the 'label' column should be -1, 0 or 1.
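For illustration, a minimal sentiment baseline could look like the sketch below; it assumes the variables described above (text_id, text, label) appear as columns in Train.csv and Test.csv, and that the submission ID column is text_id.

```python
# Minimal TF-IDF sentiment baseline, assuming the variables described above
# (text_id, text, label) appear as columns in Train.csv and Test.csv.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

# Character n-grams cope reasonably well with Arabizi spelling variation.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))
X_train = vectorizer.fit_transform(train["text"])
X_test = vectorizer.transform(test["text"])

model = LogisticRegression(max_iter=1000)
model.fit(X_train, train["label"])

submission = pd.DataFrame({
    "text_id": test["text_id"],
    "label": model.predict(X_test),  # values in {-1, 0, 1}
})
submission.to_csv("submission.csv", index=False)
```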
About AI4D-Africa: Artificial Intelligence for Development-Africa Network (ai4d.ai)
AI4D-Africa is a network of excellence in AI in sub-Saharan Africa. It is aimed at strengthening and developing community, scientific and technological excellence in a range of AI-related areas. It is composed of African Artificial Intelligence researchers, practitioners and policy makers.
License: https://crawlfeeds.com/privacy_policy
The Ingredients Dataset (18K+ records) provides a high-quality, structured collection of product information with detailed ingredients data. Covering a wide variety of categories including beauty, pet care, groceries, and health products, this dataset is designed to power AI, NLP, and machine learning applications that require domain-specific knowledge of consumer products.
In today’s data-driven economy, access to structured and clean datasets is critical for building intelligent systems. For industries like healthcare, beauty, food-tech, and retail, the ability to analyze product ingredients enables deeper insights, including:
Identifying allergens or harmful substances
Comparing ingredient similarities across brands
Training LLMs and NLP models for better understanding of consumer products
Supporting regulatory compliance and labeling standards
Enhancing recommendation engines for personalized shopping
This dataset bridges the gap between raw, unstructured product data and actionable information by providing well-organized CSV files with fields that are easy to integrate into your workflows.
The 18,000+ product records span several consumer categories:
🛍 Beauty & Personal Care – cosmetics, skincare, haircare products with full ingredient transparency
🐾 Pet Supplies – pet food and wellness products with detailed formulations
🥫 Groceries & Packaged Foods – snacks, beverages, pantry staples with structured ingredients lists
💊 Health & Wellness – supplements, vitamins, and healthcare products with nutritional components
By including multiple categories, this dataset allows cross-domain analysis and model training that reflects real-world product diversity.
📂 18,000+ records with structured ingredient fields
🧾 Covers beauty, pet care, groceries, and health products
📊 Delivered in CSV format, ready to use for analytics or machine learning
🏷 Includes categories and breadcrumbs for taxonomy and classification
🔎 Useful for AI, NLP, LLM fine-tuning, allergen detection, and product recommendation systems
AI & NLP Training – fine-tune LLMs on structured ingredients data for food, beauty, and healthcare applications.
Retail Analytics – analyze consumer product composition across categories to inform pricing, positioning, and product launches.
Food & Health Research – detect allergens, evaluate ingredient safety, and study nutritional compositions.
Recommendation Engines – build smarter product recommendation systems for e-commerce platforms.
Regulatory & Compliance Tools – ensure products meet industry and government standards through ingredient validation.
Unlike generic product feeds, this dataset emphasizes ingredient transparency across multiple categories. With 18K+ records, it strikes a balance between being comprehensive and affordable, making it suitable for startups, researchers, and enterprise teams looking to experiment with product intelligence.
Note: Each record includes a url (main page) and a buy_url (purchase page). Records are based on the buy_url to ensure unique, product-level data.
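As a sketch of the allergen-detection use case mentioned above, the snippet below flags allergen keywords per product; the file name 'ingredients.csv' and the 'ingredients' and 'category' column names are assumptions, since only the url and buy_url fields are confirmed in the description.

```python
# Sketch of the allergen-detection use case. The file name 'ingredients.csv'
# and the 'ingredients' and 'category' column names are assumptions; only the
# 'url' and 'buy_url' fields are confirmed in the description above.
import pandas as pd

ALLERGENS = ["peanut", "soy", "milk", "wheat", "egg", "tree nut", "shellfish"]

df = pd.read_csv("ingredients.csv")

def flag_allergens(ingredient_text):
    """Return the allergen keywords that appear in a product's ingredient list."""
    text = str(ingredient_text).lower()
    return [a for a in ALLERGENS if a in text]

df["allergen_hits"] = df["ingredients"].apply(flag_allergens)

# Average number of allergen hits per product, by category.
print(df.groupby("category")["allergen_hits"].apply(lambda s: s.str.len().mean()))
```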
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies. It has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real-world. But they also pose nontrivial challenges in research of embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.
Dataset Details
The dataset consists of four variants of the Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
iX Mobile Banking Prediction Challenge
This data was imported from the Zindi platform.
The train set contains ~100,000 survey responses and the test set contains ~45,000, from around Africa and the world.
Train.csv - contains the target. This is the dataset that you will use to train your model.
Test.csv - resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
SampleSubmission.csv - shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv and the ‘target’ column containing your predictions. The order of the rows does not matter, but the names of the ID must be correct.
VariableDefinitions.csv - a file that contains the definitions of each column in the dataset. For columns FQ1-FQ37, the values are: 1 - Yes, 2 - No, 3 - Don't know, 4 - Refused to answer.
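A small sketch of decoding the coded survey responses, assuming the FQ columns appear in Train.csv under the names FQ1-FQ37 as described above:

```python
# Sketch: decode the FQ1-FQ37 survey columns using the value codes from
# VariableDefinitions.csv (1 = Yes, 2 = No, 3 = Don't know, 4 = Refused to answer).
import pandas as pd

CODE_MAP = {1: "Yes", 2: "No", 3: "Don't know", 4: "Refused to answer"}

train = pd.read_csv("Train.csv")
fq_cols = [c for c in train.columns if c.startswith("FQ")]

decoded = train.copy()
decoded[fq_cols] = decoded[fq_cols].replace(CODE_MAP)
print(decoded[fq_cols].head())
```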
Although the commercial name for the USAID University Learning Management System is CSOD InCompass, the agencies that use the system have renamed (or rebranded) their specific agency portals to meet their own needs. InCompass is a comprehensive talent management system that incorporates the following functional modules:
1) Learning - supports the management and tracking of training events and individual training records. Training events may be instructor-led or online. Courses may be managed within the system to provide descriptions, availability, and registration. Online content is stored on the system. Training information stored for individuals includes courses completed, scores, and courses registered for.
2) Connect - supports employee collaboration efforts. Features include communities of practice, expertise location, blogs, and knowledge-sharing support. Profile information that may be stored by the system includes job position, subject matter expertise, and previous accomplishments.
3) Performance - supports management of organizational goals and alignment of those goals to individual performance. The module supports managing skills and competencies for the organization, as well as employee performance reviews. The types of information gathered about employees include their skills, competencies, and performance evaluations.
4) Succession - supports workforce management and planning. The type of information gathered for this module includes prior work experience, skills, and competencies.
5) Extended Enterprise - supports delivery of training outside of the organization. Training provided may be for a fee. The type of information collected for this module includes individual data for identifying the person for training records management and related information for commercial transactions.
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All the data created for the publication "Learning the Optimal Power Flow: Environment Design Matters" by Wolgast and Nieße. The dataset contains all training runs performed, including the final neural network weights, meta-data about the training run, and various metrics recorded during the course of training, which were used to generate the results and plots. The source code to reproduce the plots for the publication (and everything else) can be found on GitHub: https://github.com/Digitalized-Energy-Systems/rl-opf-env-design
License: Open Government Licence 3.0 - http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Files for use with the R script accompanying the paper Cooper (2019). Note that this script also uses files from https://doi.org/10.14466/CefasDataHub.34 (details provided in script). Cooper, K.M. (2019). A new machine learning approach to seabed biotope classification. Science Advances.

Files include: BiotopePredictionScript.R (R script), EUROPE.shp (European coastline), EuropeLiteScoWal.shp (European coastline with UK boundaries), DEFRADEMKC8.shp (seabed bathymetry), C5922DATASETFAM13022017.csv (training dataset), PARTC16112018.csv (test dataset), PARTCAGG16112018.csv (aggregation data).

Description of C5922DATASETFAM13022017.csv: This file is based on the RSMP dataset (see https://www.cefas.co.uk/cefas-data-hub/dois/rsmp-baseline-dataset/), but with macrofaunal data output at the level of family or above. A variety of gear types have been used for sample collection, including grabs (0.1m2 Hamon, 0.2m2 Hamon, 0.1m2 Day, 0.1m2 Van Veen and 0.1m2 Smith McIntyre) and cores. Of these various devices, 93% of samples were acquired using either a 0.1m2 Hamon grab or a 0.1m2 Day grab. Sieve sizes used in sample processing include 1mm and 0.5mm, reflecting the conventional preference for 1mm offshore and 0.5mm inshore. Of the samples collected using either a 0.1m2 Hamon grab or a 0.1m2 Day grab, 88% were processed using a 1mm sieve. Taxon names were standardised according to the WoRMS (World Register of Marine Species) list using the Taxon Match Tool (http://www.marinespecies.org/aphia.php?p=match). Of the initial 13,449 taxon names, only 774 remained after correction and aggregation to family level. The final dataset comprises a single-sheet comma-separated values (.csv) file. Colonials accounted for less than 20% of the total number of taxa and, where present, were given a value of 1 in the dataset. This component of the fauna was missing from 325 out of the 777 surveys, reflecting either a true absence, or simply that colonial taxa were ignored by the analyst. Sediment particle size data were provided as percentage weight by sieve mesh size, with the dataset including 99 different sieve sizes. Sediment samples have been processed using sieve, and a combination of sieve and laser diffraction techniques. Key metadata fields include: sample coordinates (Latitude & Longitude), Survey Name, Gear, Date, Grab Sample Volume (litres) and Water Depth (m). A number of additional explanatory variables are also provided (salinity, temperature, chlorophyll a, suspended particulate matter, water depth, wave orbital velocity, average current, bed stress). In total, the dataset dimensions are 33,198 rows (samples) x 900 columns (variables/factors), yielding a matrix of 29,878,200 individual data values.
The data was collected from news publications in Malawi. tNyasa Ltd Data Science Lab used three main sources: the Nation Online newspaper, Radio Maria and the Malawi Broadcasting Corporation. The articles presented in the dataset are full articles and span many different genres: from social issues, family and relationships to political or economic issues.
Train.csv - contains the target. This is the dataset that you will use to train your model.
Test.csv - resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
SampleSubmission.csv - shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv. The order of the rows does not matter, but the names of the IDs must be correct.
List of classes: ['SOCIAL ISSUES', 'EDUCATION', 'RELATIONSHIPS', 'ECONOMY', 'RELIGION', 'POLITICS', 'LAW/ORDER', 'SOCIAL', 'HEALTH', 'ARTS AND CRAFTS', 'FARMING', 'CULTURE', 'FLOODING', 'WITCHCRAFT', 'MUSIC', 'TRANSPORT', 'WILDLIFE/ENVIRONMENT', 'LOCALCHIEFS', 'SPORTS', 'OPINION/ESSAY']
Your task is to classify the news articles into one of the 20 classes listed above. The classes are mutually exclusive.
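For illustration, a simple baseline for this multi-class task could look like the sketch below; the 'Text' and 'Label' column names are assumptions, since only the 'ID' submission column is confirmed above.

```python
# Baseline sketch for the 20-class news task. The 'Text' and 'Label' column
# names are assumptions; only the 'ID' submission column is confirmed above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

train = pd.read_csv("Train.csv")
X_tr, X_val, y_tr, y_val = train_test_split(
    train["Text"], train["Label"], test_size=0.2,
    stratify=train["Label"], random_state=0,
)

pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), LinearSVC())
pipeline.fit(X_tr, y_tr)
print(classification_report(y_val, pipeline.predict(X_val)))
```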
Author: Jeff Schlimmer
Source: UCI - 1981
Please cite: The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
This dataset describes mushrooms in terms of their physical characteristics. Each mushroom is classified as poisonous or edible.
(a) Origin:
Mushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
(b) Donor:
Jeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
Schlimmer, J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19). Doctoral dissertation, Department of Information and Computer Science, University of California, Irvine.
Iba, W., Wogulis, J., & Langley, P. (1988). Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79. Ann Arbor, Michigan: Morgan Kaufmann.
Duch, W., Adamczak, R., & Grabczewski, K. (1996). Extraction of logical rules from training data using backpropagation networks. In Proc. of the 1st Online Workshop on Soft Computing, 19-30 August 1996, pp. 25-30.
Duch, W., Adamczak, R., Grabczewski, K., Ishikawa, M., & Ueda, H. (1997). Extraction of crisp logical rules using constrained backpropagation networks - comparison of two new approaches. In Proc. of the European Symposium on Artificial Neural Networks (ESANN'97), Bruges, Belgium, 16-18 April 1997.
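As an illustration of working with the coded attributes above, the sketch below decodes one attribute and cross-validates a shallow decision tree; the file name 'mushrooms.csv' and a 'class' target column (with 'e'/'p' values) are assumptions, while the attribute names and codes follow the list above.

```python
# Sketch: decode one coded attribute and score a shallow decision tree.
# The file name 'mushrooms.csv' and a 'class' target column with 'e'/'p'
# values are assumptions; attribute names and codes follow the list above.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("mushrooms.csv")

# Expand the single-letter codes for one attribute, as documented above.
odor_map = {"a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
            "m": "musty", "n": "none", "p": "pungent", "s": "spicy"}
df["odor_decoded"] = df["odor"].map(odor_map)
print(df["odor_decoded"].value_counts())

# One-hot encode the categorical attributes and cross-validate a shallow tree.
X = pd.get_dummies(df.drop(columns=["class", "odor_decoded"]))
y = df["class"]
print(cross_val_score(DecisionTreeClassifier(max_depth=4), X, y, cv=5).mean())
```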
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
This is a comprehensive dataset that captures the prices of a cryptocurrency along with various features, including social media attributes, trading attributes and time-related attributes, recorded on an hourly basis over several months, that contribute directly or indirectly to the cryptocurrency's volatile price changes.
asset_id: An asset ID. We refer to all supported cryptocurrencies as assets
open: Open price for the time period
close: Close price for the time period
high: The highest price of the time period
low: Lowest price of the time period
volume: Volume of the asset traded in the time period
market_cap: Total available supply multiplied by the current price in USD
url_shares: Number of times an identified relevant URL is shared within relevant social posts that contain relevant terms
unique_url_shares: Number of unique url shares posted and collected on social media
reddit_posts: Number of latest Reddit posts for supported coins
reddit_posts_score: Reddit Karma score on individual posts
reddit_comments: Comments on Reddit that contain relevant terms
reddit_comments_score: Reddit Karma score on comments
tweets: Number of crypto-specific tweets based on tuned search and filtering criteria
tweet_spam: Number of tweets classified as spam
tweet_followers: Number of followers on selected tweets
tweet_quotes: Number of quotes on selected tweets
tweet_retweets: Number of retweets of selected tweets
tweet_replies: Number of replies on selected tweets
tweet_favorites: Number of likes on an individual social post that contains a relevant term
tweet_sentiment1: Number of tweets with a sentiment of “very bullish”
tweet_sentiment2: Number of tweets with a sentiment of “bullish”
tweet_sentiment3: Number of tweets with a sentiment of “neutral”
tweet_sentiment4: Number of tweets with a sentiment of “bearish”
tweet_sentiment5: Number of tweets with a sentiment of “very bearish”
tweet_sentiment_impact1: “Very bearish” sentiment impact
tweet_sentiment_impact2: “Bearish” sentiment impact
tweet_sentiment_impact3: “Neutral” sentiment impact
tweet_sentiment_impact4: “Bullish” sentiment impact
tweet_sentiment_impact5: “Very bullish” sentiment impact
social_score: Sum of followers, retweets, likes, reddit karma etc of social posts collected
average_sentiment: The average score of sentiments, an indicator of the general sentiment being spread about a coin
news: Number of news articles for supported coins
price_score: A score we derive from a moving average that gives some indication of an upward or downward trend for the coin, based solely on the market value
social_impact_score: A score of the volume/interaction/impact of social to give a sense of the size of the market or awareness of the coin
correlation_rank: The algorithm that determines the correlation of our social data to the coin price/volume
galaxy_score: An indicator of how well a coin is doing
volatility: Volatility indicator
market_cap_rank: The rank based on the total available supply multiplied by the current price in USD
percent_change_24h_rank: The rank based on the percent change in price since 24 hours ago
volume_24h_rank: The rank based on volume in the last 24 hours
social_volume_24h_rank: The rank based on the number of social posts that contain relevant terms in the last 24 hours
social_score_24h_rank: The rank based on the sum of followers, retweets, likes, Reddit karma etc of social posts collected in the last 24 hours
medium: Number of Medium articles for supported coins
youtube: Number of videos with description that contains relevant terms
social_volume: Number of social posts that contain relevant terms
price_btc: Exchange rate with another coin
market_cap_global: Total available supply multiplied by the current price in USD
percent_change_24h: Percent change in price since 24 hours ago
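As a sketch of working with these hourly features, the snippet below recomputes a 24-hour percent change and checks how tweet activity co-moves with the close price; the file name and the assumption that rows are hourly and already in chronological order per asset are not confirmed by the description.

```python
# Exploratory sketch over the hourly crypto data, using column names from the
# list above (asset_id, close, tweets). The file name 'train.csv' and the
# assumption that rows are hourly and chronologically ordered per asset are
# not confirmed by the description.
import pandas as pd

df = pd.read_csv("train.csv")

# Recompute a 24-hour percent change from the hourly close, per asset.
df["pct_change_24h_recomputed"] = (
    df.groupby("asset_id")["close"].pct_change(periods=24) * 100
)

# How strongly does tweet activity co-move with the close price?
print(df.groupby("asset_id").apply(lambda g: g["tweets"].corr(g["close"])))
```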
This is a comprehensive dataset that captures the prices of a cryptocurrency along with various features, including social media attributes, trading attributes and time-related attributes, recorded on an hourly basis over several months, that contribute directly or indirectly to the cryptocurrency's volatile price changes.
Note that this data is from the Cryptocurrency Closing Price Prediction competition.
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains all data used in the research paper 'Image-based yield prediction for tall fescue using random forests and convolutional neural networks' by Ghysels, S., De Baets, B., Reheul, D. and Maenhout, S. 'Train_dataset.zip' and 'Test_dataset.zip' contain the RGB images of individual tall fescue plants, split into a training set and test set respectively. 'Multigras_data.csv' contains the dry matter yield measurements ('DMY (kg/ha)'), the breeder's evaluation scores ('Score MG') and the location of each individual plant on the field ('Blok_Rij_Plantnr', meaning Block-row-column).
The 1970 British Cohort Study (BCS70) is a longitudinal birth cohort study, following a nationally representative sample of over 17,000 people born in England, Scotland and Wales in a single week of 1970. Cohort members have been surveyed throughout their childhood and adult lives, mapping their individual trajectories and creating a unique resource for researchers. It is one of very few longitudinal studies following people of this generation anywhere in the world.
Since 1970, cohort members have been surveyed at ages 5, 10, 16, 26, 30, 34, 38, 42, 46, and 51. Featuring a range of objective measures and rich self-reported data, BCS70 covers an incredible amount of ground and can be used in research on many topics. Evidence from BCS70 has illuminated important issues for our society across five decades. Key findings include how reading for pleasure matters for children's cognitive development, why grammar schools have not reduced social inequalities, and how childhood experiences can impact on mental health in mid-life. Every day researchers from across the scientific community are using this important study to make new connections and discoveries.
BCS70 is run by the Centre for Longitudinal Studies (CLS), a research centre in the UCL Institute of Education, which is part of University College London. The content of BCS70 studies, including questions, topics and variables can be explored via the CLOSER Discovery website.
How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
For information on how to access biomedical data from BCS70 that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.
Polygenic Indices
Polygenic indices are available under Special Licence SN 9439. Derived summary scores have been created that combine the estimated effects of many different genes on a specific trait or characteristic, such as a person's risk of Alzheimer's disease, asthma, substance abuse, or mental health disorders, for example. These polygenic scores can be combined with existing survey data to offer a more nuanced understanding of how cohort members' outcomes may be shaped.
Secure Access datasets
Secure Access versions of BCS70 have more restrictive access conditions than versions available under the standard Safeguarded Licence.
The BCS70 linked Scottish Medical Records (SMR) datasets include data files from the Information Services Division (ISD) part of the NHS National Services Scotland database for those cohort members who provided consent to health data linkage in the Age 42 sweep.
The SMR database contains information about all hospital admissions in Scotland. The following linked SMR datasets are available:
Researchers who require access to more than one dataset need to apply for them individually.
Further information about the SMR database can be found on the Information Services Division Scotland SMR Datasets webpage: https://www.ndc.scot.nhs.uk/Data-Dictionary/SMR-Datasets/
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This comprehensive and open-source dataset of 100k+ conversations and instructions that include medical terminologies is perfect for training Generative Language Models for various medical applications. With samples collected from human conversations, this dataset contains a variety of options and suggestions to assist in creating useful language models. From prescribed medications to home remedies such as yoga exercises, breathing exercises, and natural remedies—this collection has it all! Only if you trust the language model you build with the right data can you use it to make decisions that matter in real life. This data is sure to give your project the boost it needs with legitimate information power-packed into every sample!
- Download the dataset. The dataset can be downloaded by clicking on the “Download” button located at the top of this page and following the prompts.
- Unzip and save the file in a location of your choice on your computer or device.
- Open up the ‘train’ or ‘test’ CSV file, depending on whether you would like to use it for training or testing purposes respectively. Both contain conversations and instructions utilizing medical terminologies which can be used to train a generative language model for medical applications.
- Read through each conversation/instruction provided in each row of the data frame column labeled 'Conversation'. These conversations provide examples of interactions between doctors, patients, pharmacists, etc., discussing topics such as health advice, natural home remedies and prescriptions, as well as conversations involving diagnosis, symptoms, medication side effects and health concerns pertaining to certain medical conditions.
- Note that all conversations are written at varying levels of complexity, with an emphasis on effective communication within a healthcare environment, either directly with patients or amongst colleagues discussing cases via verbal/written exchanges utilizing medical terminologies.
- Utilize natural language processing (NLP) techniques such as BERT embeddings or word embeddings corresponding to different domains of medicine to relate and sort these conversations into specific categories of interest identified by domain experts, whether for mathematical and statistical analysis or for wider understanding of contexts in diverse languages such as Chinese, Spanish, Portuguese and French.
- Natural language processing applications such as automated medical transcription.
- Feature extraction and detection of health-related keywords for predictive analytics in healthcare applications.
- Automated diagnostics utilizing the language models trained on this dataset to identify diseases and illnesses based on user inputs, either through symptoms or other risk factors (e.g., age, lifestyle etc.)
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name  | Description                                                                                              |
|:-------------|:---------------------------------------------------------------------------------------------------------|
| Conversation | The conversation between two or more people or an instruction utilizing medical terminologies. (String)   |

File: test.csv

| Column name  | Description                                                                                              |
|:-------------|:---------------------------------------------------------------------------------------------------------|
| Conversation | The conversation between two or more people or an instruction utilizing medical terminologies. (String)   |
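A minimal sketch for inspecting the corpus via the 'Conversation' column documented above (the file path follows the train.csv naming given here):

```python
# Minimal sketch for inspecting the corpus via the 'Conversation' column
# documented above.
import pandas as pd

train = pd.read_csv("train.csv")
print("Number of conversations:", len(train))

# Rough length statistics help when choosing a context window for fine-tuning.
lengths = train["Conversation"].astype(str).str.split().str.len()
print(lengths.describe())
print(train["Conversation"].iloc[0][:500])
```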
This data was imported from the Zindi platform in the context of a competition. The objective of the competition is to develop a predictive model that determines the likelihood for a customer to churn, that is, to stop purchasing airtime and data from Expresso.
The data describes 2.5 million Expresso clients.
* Train.csv - contains information about 2 million customers. There is a column called CHURN that indicates if a client churned or did not churn. This is the target. You must estimate the likelihood that these clients churned. You will use this file to train your model.
* Test.csv - is similar to Train.csv, but without the CHURN column. You will use this file to test your model.
* SampleSubmission.csv - is an example of what your submission should look like. The order of the rows does not matter, but the names of the user_id must be correct.
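As an illustration, a naive baseline submission could simply predict the overall churn rate for every client; the sketch below assumes CHURN is encoded as 0/1 and that the submission uses user_id and CHURN columns.

```python
# Naive baseline for the churn task: predict the overall training churn rate
# for every test client. Assumes CHURN is encoded as 0/1 and that the
# submission file uses 'user_id' and 'CHURN' columns.
import pandas as pd

train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

churn_rate = train["CHURN"].mean()
submission = pd.DataFrame({"user_id": test["user_id"], "CHURN": churn_rate})
submission.to_csv("baseline_submission.csv", index=False)
```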
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Peevski (From Huggingface) [source]
The OpenLeecher/GPT4-10k dataset is a comprehensive collection of 100 diverse conversations, presented in text format, revolving around a wide range of topics. These conversations cover various domains such as coding, debugging, storytelling, and science. Aimed at facilitating training and analysis purposes for researchers and developers alike, this dataset offers an extensive array of conversation samples.
Each conversation within this dataset delves into different subject matters related to coding techniques, debugging strategies and storytelling methods, while also exploring concepts like spatial and logical thinking. Furthermore, the conversations touch upon scientific fields including chemistry, physics and biology. To add further depth to the dataset's content, it also includes discussions on the topic of law.
By providing this rich assortment of conversations spanning multiple domains and disciplines in one cohesive dataset, delivered on the Kaggle platform as a train.csv file, it empowers users to delve into these dialogue examples for exploration and analysis effortlessly. This compilation serves as an invaluable resource for understanding various aspects of coding practice alongside stimulating scientific discussions on subjects spanning multiple fields.
Introduction:
Understanding the Dataset Structure: The dataset consists of a CSV file named 'train.csv'. When examining the file's columns using the software or programming language of your choice (e.g., Python), you will notice the key column 'chat', which contains text data representing conversations between two or more participants.
Exploring Different Topics: The dataset covers a vast spectrum of subjects including coding techniques, debugging strategies, storytelling methods, spatial thinking, logical thinking, chemistry, physics, biology, and law. Each conversation touches on one or more of the following:
- Coding Techniques: Discover discussions on various programming concepts and best practices.
- Debugging Strategies: Explore conversations related to identifying and fixing software issues.
- Storytelling Methods: Dive into dialogues about effective storytelling techniques in different contexts.
- Spatial Thinking: Engage with conversations that involve developing spatial reasoning skills for problem-solving.
- Logical Thinking: Learn from discussions focused on enhancing logical reasoning abilities related to different domains.
- Chemistry
- Physics
- Biology
- Law
Analyzing Conversations: Leverage natural language processing (NLP) tools or techniques such as sentiment analysis. For example, after loading train.csv into a data frame df, print("Number of conversations:", len(df)) reports how many conversations are available.
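As a sketch of such an analysis, the snippet below counts conversations per topic area using simple keyword matching over the 'chat' column; the keyword lists are illustrative assumptions, not part of the dataset.

```python
# Sketch: count conversations per topic area with simple keyword matching over
# the 'chat' column (column name from the description above); the keyword
# lists are illustrative assumptions, not part of the dataset.
import pandas as pd

df = pd.read_csv("train.csv")
print("Number of conversations:", len(df))

topics = {
    "coding/debugging": ["python", "function", "bug", "compile", "error"],
    "science": ["chemistry", "physics", "biology"],
    "law": ["legal", "contract", "court"],
}
chat = df["chat"].astype(str).str.lower()
for topic, keywords in topics.items():
    mask = chat.apply(lambda text: any(k in text for k in keywords))
    print(f"{topic}: {mask.sum()} conversations")
```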
- Natural Language Processing Research: Researchers can leverage this dataset to train and evaluate natural language processing models, particularly in the context of conversational understanding and generation. The diverse conversations on coding, debugging, storytelling, and science can provide valuable insights into modeling human-like conversation patterns.
- Chatbot Development: The dataset can be utilized for training chatbots or virtual assistants that can engage in conversations related to coding, debugging, storytelling, and science. By exposing the chatbot to a wide range of conversation samples from different domains, developers can ensure that their chatbots are capable of providing relevant and accurate responses.
- Domain-specific Intelligent Assistants: Organizations or individuals working in fields such as coding education or scientific research may use this dataset to develop intelligent assistants tailored specifically for these domains. These assistants can help users navigate complex topics by answering questions related to coding techniques, debugging strategies, storytelling methods, or scientific concepts. Overall, 'train.csv' provides a rich resource for researchers and developers interested in building conversational AI systems with knowledge across multiple domains, including even legal matters.
If you use this dataset in your research, please credit the original authors. Data Source
License: MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
[WARNING: an in-depth description of the hourly data is missing at the moment. Please refer to the Open-Meteo website (Air Quality and Historical Weather APIs specifically) for descriptions of the columns included in the hourly data for the time being. In short, though, the hourly data info can be derived from the daily data info, as the hourly data is used to construct the daily data. Example: if avg_nitrogen_dioxide is the average of the hourly instant (10 meters above ground, in μg/m3) nitrogen dioxide values for a particular day, the "nitrogen_dioxide" column consists of the hourly instant measurements of nitrogen dioxide (10 meters above ground, in μg/m3).]
Result of a course project in the context of the Master's Degree in Data Science at Università Degli Studi di Milano-Bicocca. The dataset was built in hopes of finding ways to tackle the bad air quality for which Milan is becoming renowned, and to make the training of ML models possible. The data was collected through Open-Meteo's APIs, which in turn obtained it from "Reanalysis Models" of a European initiative, used for weather and air quality forecasting. The data was validated by the owners of the reanalysis datasets from which it comes, and through the construction of this specific dataset its data quality was assessed across accuracy, completeness and consistency dimensions. We aggregated the data from hourly to daily; the entire Data Management process can be consulted in the attached pdf.
File descriptions:
- weatheraqDataset.csv: contains DAILY data on weather and air quality for the city of Milan in comma-separated values (csv) format.
- weatheraqDataset_Report.pdf: report built to illustrate and make explicit the process followed to build the final dataset starting from the original data sources; it also explains any processing and aggregation/integration operations carried out.
- weatheraqHourly.csv: HOURLY data, counterpart to those in the daily dataset (the daily data is the result of aggregating the hourly data). The higher granularity and number of rows can help with achieving better results; for detailed descriptions of how these hourly values are recorded and at what resolutions, please visit the Open-Meteo website as stated in the warning at the start of the description.
GitHub repo of the project: https://github.com/edmos7/weather-aqMilan
Column descriptions for DAILY data (weatheraqDataset.csv):
Note: both 'date' in the DAILY data and 'datetime' in the HOURLY data are in local Milan time (CET/CEST), adjusted for Daylight Saving Time (DST).
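A minimal sketch of the hourly-to-daily aggregation described above, for a single pollutant; the 'datetime' and 'nitrogen_dioxide' column names are taken from the warning note, and the exact set of hourly columns may differ.

```python
# Sketch of the hourly-to-daily aggregation described above, for one pollutant.
# The 'datetime' and 'nitrogen_dioxide' column names come from the warning note;
# the exact set of hourly columns may differ.
import pandas as pd

hourly = pd.read_csv("weatheraqHourly.csv", parse_dates=["datetime"])

daily_no2 = (
    hourly.set_index("datetime")["nitrogen_dioxide"]
          .resample("D")            # calendar days in local Milan time
          .mean()
          .rename("avg_nitrogen_dioxide")
)
print(daily_no2.head())
```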
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This Grade School Math 8K Linguistically Diverse Training & Test Set is designed to help you develop and improve your understanding of multi-step reasoning question answering. The dataset contains three separate data files: the socratic_test.csv, main_test.csv, and main_train.csv, each containing a set of questions and answers related to grade school math that consists of multiple steps. Each file contains the same columns:
question and answer. The questions contained in this dataset are thoughtfully crafted to lead you through the reasoning journey for arriving at the correct answer each time, allowing you immense opportunities for learning through practice. With over 8 thousand entries for both training and testing purposes in this GSM8K dataset, it takes advanced multi-step reasoning skills to ace these questions! Deepen your knowledge today and master any challenge with ease using this amazing GSM8K set!
This dataset provides a unique opportunity to study multi-step reasoning for question answering. The GSM8K Linguistically Diverse Training & Test Set consists of 8,000 questions and answers that have been created to simulate real-world scenarios in grade school mathematics. Each question is paired with one answer based on a comprehensive test set. The questions cover topics such as algebra, arithmetic, probability and more.
The dataset consists of three files: main_train.csv, main_test.csv and socratic_test.csv, each containing grade school math questions paired with multi-step worked answers. Each row holds one question and its corresponding answer in the question and answer columns; these can be used in combination with text analysis models such as ELMo or BERT to explore different representations for natural language processing tasks such as Q&A, or to build predictive models for numerical reasoning applications.
To use this dataset efficiently, first get familiar with its structure by reading through the documentation, so that you are aware of the content, definitions and format of the available fields. Then study the examples that best suit your specific purpose, whether that is an education research experiment, generating insights for analytics reports, or building and evaluating predictive models.
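For illustration, the sketch below loads the training split and uses line counts as a rough proxy for reasoning depth; the 'question' and 'answer' column names are as documented above.

```python
# Sketch: load the training split and inspect a question/answer pair.
# Column names 'question' and 'answer' are as documented above.
import pandas as pd

train = pd.read_csv("main_train.csv")
print(train.shape)

sample = train.iloc[0]
print("Q:", sample["question"])
print("A:", sample["answer"])

# Rough proxy for reasoning depth: number of lines in each worked answer.
steps = train["answer"].astype(str).str.count("\n") + 1
print(steps.describe())
```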
- Training language models for improving accuracy in natural language processing applications such as question answering or dialogue systems.
- Generating new grade school math questions and answers using g...
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This English-ASL Bilingual Corpus 2012, or ASLG-PC12, provides valuable insight and data to users interested in the study of language. The dataset contains columns of easily readable gloss and text pairs. This Gloss and Text Pairing can greatly assist with the study of conversational habits by providing written accompaniments to American Sign Language (ASL) signs. Whether you are looking for a diverse sampling of ASL usage, or just want to delve deeper into sign language research, this corpus has plenty to offer linguists, therapists, teachers and students alike. With over 12000 entries altogether in one organized source any researcher would find it useful!
This dataset provides an interesting and insightful look into the relationship between American Sign Language (ASL) and English. The ASLG-PC12 corpus contains a collection of English-ASL gloss and text pairs, meaning you can learn not just about the words and signs used in ASL, but also their relationship to one another.
To get started using this dataset, first you'll want to explore the data sample. This can be done by opening up the train.csv file included in this dataset. It includes columns for both gloss descriptions of each sign as well as their corresponding translations in English.
Once familiar with the data, it's time to dive deeper! You can use this dataset for various purposes, from training a machine learning algorithm to recognizing signs through image processing techniques, or even creating an online dictionary of sorts that maps out ASL words from commonly used English language words. No matter what application you are planning on building out of this dataset, it promises insights into human communication that cannot be found elsewhere!
So unlock your power with American Sign Language - start exploring all that ASLG-PC12 corpus has to offer!
- Training ASL language recognition algorithms.
- Developing machine translation systems to translate between English and ASL.
- Designing a web or mobile application to help teach users how to fluently sign in either language
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description                                                |
|:------------|:-----------------------------------------------------------|
| gloss       | The literal sign-for-sign translation of a word. (Text)    |
| text        | The standard English equivalent of the ASL gloss. (Text)   |
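A minimal sketch for loading the gloss/text pairs and comparing their token counts, using the column names from the table above:

```python
# Sketch: load the gloss/text pairs and compare token counts, using the column
# names from the table above.
import pandas as pd

df = pd.read_csv("train.csv")
print(df[["gloss", "text"]].head())

# ASL gloss is typically more compact than the English sentence it accompanies.
gloss_tokens = df["gloss"].astype(str).str.split().str.len()
text_tokens = df["text"].astype(str).str.split().str.len()
print("mean gloss tokens:", gloss_tokens.mean(), "| mean text tokens:", text_tokens.mean())
```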
If you use this dataset in your research, please credit the original authors and Huggingface Hub.