The data was collected by 800 taskers from Kenya, Mexico and India. There are nine classes, each corresponding to a different sign.
The objective of this competition is to classify the nine different Sign Language signs present in the images, using machine learning or deep learning algorithms.
Images.zip: a zip file containing all images in both the train and test sets.
Train.csv: contains the target. This is the dataset that you will use to train your model.
Test.csv: resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
SampleSubmission.csv: shows the submission format for this competition, with the ‘Image_ID’ column mirroring that of Test.csv and the ‘label’ column containing your predictions. The order of the rows does not matter, but the values in the ‘Image_ID’ column must be correct.
This data was imported from the Zindi classification challenge: https://zindi.africa/competitions/kenyan-sign-language-classification-challenge
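As an illustration only, here is a minimal Python sketch of a naive baseline submission built from the files above; it assumes Train.csv and Test.csv expose an 'Image_ID' column and Train.csv a 'label' column (only the SampleSubmission.csv columns are confirmed in the description).

```python
# Naive baseline sketch. Assumption: Train.csv and Test.csv expose an
# 'Image_ID' column and Train.csv a 'label' column, mirroring SampleSubmission.csv.
import pandas as pd

train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

# Inspect the class balance across the nine signs.
print(train["label"].value_counts())

# Predict the most frequent sign for every test image.
most_common = train["label"].mode()[0]
submission = pd.DataFrame({"Image_ID": test["Image_ID"], "label": most_common})
submission.to_csv("baseline_submission.csv", index=False)
```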
License: GNU General Public License v2.0 - http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
TUNIZI is the first 100% Tunisian Arabizi sentiment analysis dataset, developed as part of AI4D’s ongoing NLP project for African languages. Tunisian Arabizi is the representation of the Tunisian dialect written in Latin characters and numbers rather than Arabic letters.
iCompass gathered comments from social media platforms that express sentiment about popular topics. For this purpose, we extracted 100k comments using public streaming APIs.
TUNIZI was preprocessed by removing links, emoji symbols, and punctuation.
The collected comments were manually annotated using an overall polarity: positive (1), negative (-1) and neutral (0). The annotators were diverse in gender, age and social background.
Variable definition:
text_id: Unique identifier of the text
text: Text
label: Sentiment of the tweet (-1 for negative, 0 for neutral, 1 for positive)
Files available for download are:
Train.csv - contains text on which to train your model.
Test.csv - contains text which you must classify using your trained model.
SampleSubmission.csv - is an example of what your submission file should look like. The order of the rows does not matter, but the names of the ID must be correct. Values in the 'label' column should be -1, 0 or 1.
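For illustration, a minimal sentiment baseline could look like the sketch below; it assumes the variables described above (text_id, text, label) appear as columns in Train.csv and Test.csv, and that the submission ID column is text_id.

```python
# Minimal TF-IDF sentiment baseline, assuming the variables described above
# (text_id, text, label) appear as columns in Train.csv and Test.csv.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

# Character n-grams cope reasonably well with Arabizi spelling variation.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))
X_train = vectorizer.fit_transform(train["text"])
X_test = vectorizer.transform(test["text"])

model = LogisticRegression(max_iter=1000)
model.fit(X_train, train["label"])

submission = pd.DataFrame({
    "text_id": test["text_id"],
    "label": model.predict(X_test),  # values in {-1, 0, 1}
})
submission.to_csv("submission.csv", index=False)
```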
About AI4D-Africa: Artificial Intelligence for Development-Africa Network (ai4d.ai)
AI4D-Africa is a network of excellence in AI in sub-Saharan Africa. It is aimed at strengthening and developing community, scientific and technological excellence in a range of AI-related areas. It is composed of African Artificial Intelligence researchers, practitioners and policy makers.
License: https://crawlfeeds.com/privacy_policy
The Ingredients Dataset (18K+ records) provides a high-quality, structured collection of product information with detailed ingredients data. Covering a wide variety of categories including beauty, pet care, groceries, and health products, this dataset is designed to power AI, NLP, and machine learning applications that require domain-specific knowledge of consumer products.
In today’s data-driven economy, access to structured and clean datasets is critical for building intelligent systems. For industries like healthcare, beauty, food-tech, and retail, the ability to analyze product ingredients enables deeper insights, including:
Identifying allergens or harmful substances
Comparing ingredient similarities across brands
Training LLMs and NLP models for better understanding of consumer products
Supporting regulatory compliance and labeling standards
Enhancing recommendation engines for personalized shopping
This dataset bridges the gap between raw, unstructured product data and actionable information by providing well-organized CSV files with fields that are easy to integrate into your workflows.
The 18,000+ product records span several consumer categories:
🛍 Beauty & Personal Care – cosmetics, skincare, haircare products with full ingredient transparency
🐾 Pet Supplies – pet food and wellness products with detailed formulations
🥫 Groceries & Packaged Foods – snacks, beverages, pantry staples with structured ingredients lists
💊 Health & Wellness – supplements, vitamins, and healthcare products with nutritional components
By including multiple categories, this dataset allows cross-domain analysis and model training that reflects real-world product diversity.
📂 18,000+ records with structured ingredient fields
🧾 Covers beauty, pet care, groceries, and health products
📊 Delivered in CSV format, ready to use for analytics or machine learning
🏷 Includes categories and breadcrumbs for taxonomy and classification
🔎 Useful for AI, NLP, LLM fine-tuning, allergen detection, and product recommendation systems
AI & NLP Training – fine-tune LLMs on structured ingredients data for food, beauty, and healthcare applications.
Retail Analytics – analyze consumer product composition across categories to inform pricing, positioning, and product launches.
Food & Health Research – detect allergens, evaluate ingredient safety, and study nutritional compositions.
Recommendation Engines – build smarter product recommendation systems for e-commerce platforms.
Regulatory & Compliance Tools – ensure products meet industry and government standards through ingredient validation.
Unlike generic product feeds, this dataset emphasizes ingredient transparency across multiple categories. With 18K+ records, it strikes a balance between being comprehensive and affordable, making it suitable for startups, researchers, and enterprise teams looking to experiment with product intelligence.
Note: Each record includes a url (main page) and a buy_url (purchase page). Records are based on the buy_url to ensure unique, product-level data.
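As a sketch of the allergen-detection use case mentioned above, the snippet below flags allergen keywords per product; the file name 'ingredients.csv' and the 'ingredients' and 'category' column names are assumptions, since only the url and buy_url fields are confirmed in the description.

```python
# Sketch of the allergen-detection use case. The file name 'ingredients.csv'
# and the 'ingredients' and 'category' column names are assumptions; only the
# 'url' and 'buy_url' fields are confirmed in the description above.
import pandas as pd

ALLERGENS = ["peanut", "soy", "milk", "wheat", "egg", "tree nut", "shellfish"]

df = pd.read_csv("ingredients.csv")

def flag_allergens(ingredient_text):
    """Return the allergen keywords that appear in a product's ingredient list."""
    text = str(ingredient_text).lower()
    return [a for a in ALLERGENS if a in text]

df["allergen_hits"] = df["ingredients"].apply(flag_allergens)

# Average number of allergen hits per product, by category.
print(df.groupby("category")["allergen_hits"].apply(lambda s: s.str.len().mean()))
```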
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies. It has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real-world. But they also pose nontrivial challenges in research of embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.
Dataset Details
The dataset consists of four variants of the Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
iX Mobile Banking Prediction Challenge
This data was imported from the Zindi platform.
The train set contains ~100,000 survey responses and the test set contains ~45,000, from around Africa and the world.
Train.csv - contains the target. This is the dataset that you will use to train your model.
Test.csv - resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
SampleSubmission.csv - shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv and the ‘target’ column containing your predictions. The order of the rows does not matter, but the names of the ID must be correct.
VariableDefinitions.csv - a file that contains the definitions of each column in the dataset. For columns FQ1-FQ37, the values are: 1 - Yes, 2 - No, 3 - Don't know, 4 - Refused to answer.
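A small sketch of decoding the coded survey responses, assuming the FQ columns appear in Train.csv under the names FQ1-FQ37 as described above:

```python
# Sketch: decode the FQ1-FQ37 survey columns using the value codes from
# VariableDefinitions.csv (1 = Yes, 2 = No, 3 = Don't know, 4 = Refused to answer).
import pandas as pd

CODE_MAP = {1: "Yes", 2: "No", 3: "Don't know", 4: "Refused to answer"}

train = pd.read_csv("Train.csv")
fq_cols = [c for c in train.columns if c.startswith("FQ")]

decoded = train.copy()
decoded[fq_cols] = decoded[fq_cols].replace(CODE_MAP)
print(decoded[fq_cols].head())
```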
Although the commercial name for the USAID University Learning Management System is CSOD InCompass, the agencies that use the system have renamed (or rebranded) their specific agency portals to meet their own needs. InCompass is a comprehensive talent management system that incorporates the following functional modules:
1) Learning - supports the management and tracking of training events and individual training records. Training events may be instructor-led or online. Courses may be managed within the system to provide descriptions, availability, and registration. Online content is stored on the system. Training information stored for individuals includes courses completed, scores, and courses registered for.
2) Connect - supports employee collaboration efforts. Features include communities of practice, expertise location, blogs, and knowledge-sharing support. Profile information that may be stored by the system includes job position, subject matter expertise, and previous accomplishments.
3) Performance - supports management of organizational goals and alignment of those goals to individual performance. The module supports managing skills and competencies for the organization, as well as employee performance reviews. The types of information gathered about employees include their skills, competencies, and performance evaluations.
4) Succession - supports workforce management and planning. The type of information gathered for this module includes prior work experience, skills, and competencies.
5) Extended Enterprise - supports delivery of training outside of the organization. Training provided may be for a fee. The type of information collected for this module includes individual data for identifying the person for training records management and related information for commercial transactions.
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All the data created for the publication "Learning the Optimal Power Flow: Environment Design Matters" by Wolgast and Nieße. The dataset contains all training runs performed, including the final neural network weights, meta-data about the training run, and various metrics recorded during the course of training, which were used to generate the results and plots. The source code to reproduce the plots for the publication (and everything else) can be found on GitHub: https://github.com/Digitalized-Energy-Systems/rl-opf-env-design
License: Open Government Licence 3.0 - http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Files for use with the R script accompanying the paper Cooper (2019). Note that this script also uses files from https://doi.org/10.14466/CefasDataHub.34 (details provided in script). Cooper, K.M. (2019). A new machine learning approach to seabed biotope classification. Science Advances.

Files include: BiotopePredictionScript.R (R script), EUROPE.shp (European coastline), EuropeLiteScoWal.shp (European coastline with UK boundaries), DEFRADEMKC8.shp (seabed bathymetry), C5922DATASETFAM13022017.csv (training dataset), PARTC16112018.csv (test dataset), PARTCAGG16112018.csv (aggregation data).

Description of C5922DATASETFAM13022017.csv: This file is based on the RSMP dataset (see https://www.cefas.co.uk/cefas-data-hub/dois/rsmp-baseline-dataset/), but with macrofaunal data output at the level of family or above. A variety of gear types have been used for sample collection, including grabs (0.1m2 Hamon, 0.2m2 Hamon, 0.1m2 Day, 0.1m2 Van Veen and 0.1m2 Smith McIntyre) and cores. Of these various devices, 93% of samples were acquired using either a 0.1m2 Hamon grab or a 0.1m2 Day grab. Sieve sizes used in sample processing include 1mm and 0.5mm, reflecting the conventional preference for 1mm offshore and 0.5mm inshore. Of the samples collected using either a 0.1m2 Hamon grab or a 0.1m2 Day grab, 88% were processed using a 1mm sieve. Taxon names were standardised according to the WoRMS (World Register of Marine Species) list using the Taxon Match Tool (http://www.marinespecies.org/aphia.php?p=match). Of the initial 13,449 taxon names, only 774 remained after correction and aggregation to family level. The final dataset comprises a single-sheet comma-separated values (.csv) file. Colonials accounted for less than 20% of the total number of taxa and, where present, were given a value of 1 in the dataset. This component of the fauna was missing from 325 out of the 777 surveys, reflecting either a true absence, or simply that colonial taxa were ignored by the analyst. Sediment particle size data were provided as percentage weight by sieve mesh size, with the dataset including 99 different sieve sizes. Sediment samples have been processed using sieve, and a combination of sieve and laser diffraction techniques. Key metadata fields include: sample coordinates (Latitude & Longitude), Survey Name, Gear, Date, Grab Sample Volume (litres) and Water Depth (m). A number of additional explanatory variables are also provided (salinity, temperature, chlorophyll a, suspended particulate matter, water depth, wave orbital velocity, average current, bed stress). In total, the dataset dimensions are 33,198 rows (samples) x 900 columns (variables/factors), yielding a matrix of 29,878,200 individual data values.
The data was collected from news publications in Malawi. tNyasa Ltd Data Science Lab used three main sources: the Nation Online newspaper, Radio Maria and the Malawi Broadcasting Corporation. The articles presented in the dataset are full articles and span many different genres: from social issues, family and relationships to political or economic issues.
Train.csv - contains the target. This is the dataset that you will use to train your model.
Test.csv - resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
SampleSubmission.csv - shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv. The order of the rows does not matter, but the names of the IDs must be correct.
List of classes: ['SOCIAL ISSUES', 'EDUCATION', 'RELATIONSHIPS', 'ECONOMY', 'RELIGION', 'POLITICS', 'LAW/ORDER', 'SOCIAL', 'HEALTH', 'ARTS AND CRAFTS', 'FARMING', 'CULTURE', 'FLOODING', 'WITCHCRAFT', 'MUSIC', 'TRANSPORT', 'WILDLIFE/ENVIRONMENT', 'LOCALCHIEFS', 'SPORTS', 'OPINION/ESSAY']
Your task is to classify the news articles into one of the 20 classes listed above. The classes are mutually exclusive.
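For illustration, a simple baseline for this multi-class task could look like the sketch below; the 'Text' and 'Label' column names are assumptions, since only the 'ID' submission column is confirmed above.

```python
# Baseline sketch for the 20-class news task. The 'Text' and 'Label' column
# names are assumptions; only the 'ID' submission column is confirmed above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

train = pd.read_csv("Train.csv")
X_tr, X_val, y_tr, y_val = train_test_split(
    train["Text"], train["Label"], test_size=0.2,
    stratify=train["Label"], random_state=0,
)

pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), LinearSVC())
pipeline.fit(X_tr, y_tr)
print(classification_report(y_val, pipeline.predict(X_val)))
```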
Author: Jeff Schlimmer
Source: UCI - 1981
Please cite: The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
This dataset describes mushrooms in terms of their physical characteristics. Each mushroom is classified as poisonous or edible.
(a) Origin:
Mushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
(b) Donor:
Jeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
Schlimmer, J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19). Doctoral dissertation, Department of Information and Computer Science, University of California, Irvine.
Iba, W., Wogulis, J., & Langley, P. (1988). Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79. Ann Arbor, Michigan: Morgan Kaufmann.
Duch, W., Adamczak, R., & Grabczewski, K. (1996). Extraction of logical rules from training data using backpropagation networks. In Proc. of the 1st Online Workshop on Soft Computing, 19-30 August 1996, pp. 25-30.
Duch, W., Adamczak, R., Grabczewski, K., Ishikawa, M., & Ueda, H. (1997). Extraction of crisp logical rules using constrained backpropagation networks - comparison of two new approaches. In Proc. of the European Symposium on Artificial Neural Networks (ESANN'97), Bruges, Belgium, 16-18 April 1997.
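As an illustration of working with the coded attributes above, the sketch below decodes one attribute and cross-validates a shallow decision tree; the file name 'mushrooms.csv' and a 'class' target column (with 'e'/'p' values) are assumptions, while the attribute names and codes follow the list above.

```python
# Sketch: decode one coded attribute and score a shallow decision tree.
# The file name 'mushrooms.csv' and a 'class' target column with 'e'/'p'
# values are assumptions; attribute names and codes follow the list above.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("mushrooms.csv")

# Expand the single-letter codes for one attribute, as documented above.
odor_map = {"a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
            "m": "musty", "n": "none", "p": "pungent", "s": "spicy"}
df["odor_decoded"] = df["odor"].map(odor_map)
print(df["odor_decoded"].value_counts())

# One-hot encode the categorical attributes and cross-validate a shallow tree.
X = pd.get_dummies(df.drop(columns=["class", "odor_decoded"]))
y = df["class"]
print(cross_val_score(DecisionTreeClassifier(max_depth=4), X, y, cv=5).mean())
```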
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
This is a comprehensive dataset that captures the prices of a cryptocurrency along with various features, including social media attributes, trading attributes and time-related attributes, recorded on an hourly basis over several months, that contribute directly or indirectly to the cryptocurrency's volatile price changes.
asset_id: An asset ID. We refer to all supported cryptocurrencies as assets
open: Open price for the time period
close: Close price for the time period
high: The highest price of the time period
low: Lowest price of the time period
volume: Volume of the asset traded in the time period
market_cap: Total available supply multiplied by the current price in USD
url_shares: Number of times an identified relevant URL is shared within relevant social posts that contain relevant terms
unique_url_shares: Number of unique url shares posted and collected on social media
reddit_posts: Number of latest Reddit posts for supported coins
reddit_posts_score: Reddit Karma score on individual posts
reddit_comments: Comments on Reddit that contain relevant terms
reddit_comments_score: Reddit Karma score on comments
tweets: Number of crypto-specific tweets based on tuned search and filtering criteria
tweet_spam: Number of tweets classified as spam
tweet_followers: Number of followers on selected tweets
tweet_quotes: Number of quotes on selected tweets
tweet_retweets: Number of retweets of selected tweets
tweet_replies: Number of replies on selected tweets
tweet_favorites: Number of likes on an individual social post that contains a relevant term
tweet_sentiment1: Number of tweets with a sentiment of “very bullish”
tweet_sentiment2: Number of tweets with a sentiment of “bullish”
tweet_sentiment3: Number of tweets with a sentiment of “neutral”
tweet_sentiment4: Number of tweets with a sentiment of “bearish”
tweet_sentiment5: Number of tweets with a sentiment of “very bearish”
tweet_sentiment_impact1: “Very bearish” sentiment impact
tweet_sentiment_impact2: “Bearish” sentiment impact
tweet_sentiment_impact3: “Neutral” sentiment impact
tweet_sentiment_impact4: “Bullish” sentiment impact
tweet_sentiment_impact5: “Very bullish” sentiment impact
social_score: Sum of followers, retweets, likes, reddit karma etc of social posts collected
average_sentiment: The average score of sentiments, an indicator of the general sentiment being spread about a coin
news: Number of news articles for supported coins
price_score: A score we derive from a moving average that gives some indication of an upward or downward trend for the coin, based solely on the market value
social_impact_score: A score of the volume/interaction/impact of social to give a sense of the size of the market or awareness of the coin
correlation_rank: The algorithm that determines the correlation of our social data to the coin price/volume
galaxy_score: An indicator of how well a coin is doing
volatility: Volatility indicator
market_cap_rank: The rank based on the total available supply multiplied by the current price in USD
percent_change_24h_rank: The rank based on the percent change in price since 24 hours ago
volume_24h_rank: The rank based on volume in the last 24 hours
social_volume_24h_rank: The rank based on the number of social posts that contain relevant terms in the last 24 hours
social_score_24h_rank: The rank based on the sum of followers, retweets, likes, Reddit karma etc of social posts collected in the last 24 hours
medium: Number of Medium articles for supported coins
youtube: Number of videos with description that contains relevant terms
social_volume: Number of social posts that contain relevant terms
price_btc: Exchange rate with another coin
market_cap_global: Total available supply multiplied by the current price in USD
percent_change_24h: Percent change in price since 24 hours ago
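As a sketch of working with these hourly features, the snippet below recomputes a 24-hour percent change and checks how tweet activity co-moves with the close price; the file name and the assumption that rows are hourly and already in chronological order per asset are not confirmed by the description.

```python
# Exploratory sketch over the hourly crypto data, using column names from the
# list above (asset_id, close, tweets). The file name 'train.csv' and the
# assumption that rows are hourly and chronologically ordered per asset are
# not confirmed by the description.
import pandas as pd

df = pd.read_csv("train.csv")

# Recompute a 24-hour percent change from the hourly close, per asset.
df["pct_change_24h_recomputed"] = (
    df.groupby("asset_id")["close"].pct_change(periods=24) * 100
)

# How strongly does tweet activity co-move with the close price?
print(df.groupby("asset_id").apply(lambda g: g["tweets"].corr(g["close"])))
```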
This is a comprehensive dataset that captures the prices of a cryptocurrency along with various features, including social media attributes, trading attributes and time-related attributes, recorded on an hourly basis over several months, that contribute directly or indirectly to the cryptocurrency's volatile price changes.
Note that this data is from the Cryptocurrency Closing Price Prediction competition.
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains all data used in the research paper 'Image-based yield prediction for tall fescue using random forests and convolutional neural networks' by Ghysels, S., De Baets, B., Reheul, D. and Maenhout, S. 'Train_dataset.zip' and 'Test_dataset.zip' contain the RGB images of individual tall fescue plants, split into a training set and test set respectively. 'Multigras_data.csv' contains the dry matter yield measurements ('DMY (kg/ha)'), the breeder's evaluation scores ('Score MG') and the location of each individual plant on the field ('Blok_Rij_Plantnr', meaning Block-row-column).
The 1970 British Cohort Study (BCS70) is a longitudinal birth cohort study, following a nationally representative sample of over 17,000 people born in England, Scotland and Wales in a single week of 1970. Cohort members have been surveyed throughout their childhood and adult lives, mapping their individual trajectories and creating a unique resource for researchers. It is one of very few longitudinal studies following people of this generation anywhere in the world.
Since 1970, cohort members have been surveyed at ages 5, 10, 16, 26, 30, 34, 38, 42, 46, and 51. Featuring a range of objective measures and rich self-reported data, BCS70 covers an incredible amount of ground and can be used in research on many topics. Evidence from BCS70 has illuminated important issues for our society across five decades. Key findings include how reading for pleasure matters for children's cognitive development, why grammar schools have not reduced social inequalities, and how childhood experiences can impact on mental health in mid-life. Every day researchers from across the scientific community are using this important study to make new connections and discoveries.
BCS70 is run by the Centre for Longitudinal Studies (CLS), a research centre in the UCL Institute of Education, which is part of University College London. The content of BCS70 studies, including questions, topics and variables can be explored via the CLOSER Discovery website.
How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
For information on how to access biomedical data from BCS70 that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.
Polygenic Indices
Polygenic indices are available under Special Licence SN 9439. Derived summary scores have been created that combine the estimated effects of many different genes on a specific trait or characteristic, such as a person's risk of Alzheimer's disease, asthma, substance abuse, or mental health disorders, for example. These polygenic scores can be combined with existing survey data to offer a more nuanced understanding of how cohort members' outcomes may be shaped.
Secure Access datasets
Secure Access versions of BCS70 have more restrictive access conditions than versions available under the standard Safeguarded Licence.
The BCS70 linked Scottish Medical Records (SMR) datasets include data files from the Information Services Division (ISD) part of the NHS National Services Scotland database for those cohort members who provided consent to health data linkage in the Age 42 sweep.
The SMR database contains information about all hospital admissions in Scotland. The following linked SMR datasets are available:
Researchers who require access to more than one dataset need to apply for them individually.
Further information about the SMR database can be found on the Information Services Division Scotland SMR Datasets webpage: https://www.ndc.scot.nhs.uk/Data-Dictionary/SMR-Datasets/
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This comprehensive and open-source dataset of 100k+ conversations and instructions that include medical terminologies is perfect for training Generative Language Models for various medical applications. With samples collected from human conversations, this dataset contains a variety of options and suggestions to assist in creating useful language models. From prescribed medications to home remedies such as yoga exercises, breathing exercises, and natural remedies—this collection has it all! Only if you trust the language model you build with the right data can you use it to make decisions that matter in real life. This data is sure to give your project the boost it needs with legitimate information power-packed into every sample!
- Download the dataset. The dataset can be downloaded by clicking on the “Download” button located at the top of this page and following the prompts.
- Unzip and save the file in a location of your choice on your computer or device.
- Open up the ‘train’ or ‘test’ CSV file, depending on whether you would like to use it for training or testing purposes respectively. Both contain conversations and instructions utilizing medical terminologies which can be used to train a generative language model for medical applications.
- Read through each conversation/instruction provided in each row of the data frame column labeled 'Conversation'. These conversations provide examples of interactions between doctors, patients, pharmacists, etc., discussing topics such as health advice, natural home remedies and prescriptions, as well as conversations involving diagnosis, symptoms, medication side effects and health concerns pertaining to certain medical conditions.
- Note that all conversations are written at varying levels of complexity, with an emphasis on effective communication within a healthcare environment, either directly with patients or amongst colleagues discussing cases via verbal/written exchanges utilizing medical terminologies.
- Utilize natural language processing (NLP) techniques such as BERT embeddings or word embeddings corresponding to different domains of medicine to relate and sort these conversations into specific categories of interest identified by domain experts, whether for mathematical and statistical analysis or for wider understanding of contexts in diverse languages such as Chinese, Spanish, Portuguese and French.
- Natural language processing applications such as automated medical transcription.
- Feature extraction and detection of health-related keywords for predictive analytics in healthcare applications.
- Automated diagnostics utilizing the language models trained on this dataset to identify diseases and illnesses based on user inputs, either through symptoms or other risk factors (e.g., age, lifestyle etc.)
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name  | Description                                                                                              |
|:-------------|:---------------------------------------------------------------------------------------------------------|
| Conversation | The conversation between two or more people or an instruction utilizing medical terminologies. (String)   |

File: test.csv

| Column name  | Description                                                                                              |
|:-------------|:---------------------------------------------------------------------------------------------------------|
| Conversation | The conversation between two or more people or an instruction utilizing medical terminologies. (String)   |
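A minimal sketch for inspecting the corpus via the 'Conversation' column documented above (the file path follows the train.csv naming given here):

```python
# Minimal sketch for inspecting the corpus via the 'Conversation' column
# documented above.
import pandas as pd

train = pd.read_csv("train.csv")
print("Number of conversations:", len(train))

# Rough length statistics help when choosing a context window for fine-tuning.
lengths = train["Conversation"].astype(str).str.split().str.len()
print(lengths.describe())
print(train["Conversation"].iloc[0][:500])
```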
This data was imported from the Zindi platform in the context of a competition. The objective of the competition is to develop a predictive model that determines the likelihood for a customer to churn, that is, to stop purchasing airtime and data from Expresso.
The data describes 2.5 million Expresso clients.
* Train.csv - contains information about 2 million customers. There is a column called CHURN that indicates if a client churned or did not churn. This is the target. You must estimate the likelihood that these clients churned. You will use this file to train your model.
* Test.csv - is similar to Train.csv, but without the CHURN column. You will use this file to test your model.
* SampleSubmission.csv - is an example of what your submission should look like. The order of the rows does not matter, but the names of the user_id must be correct.
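As an illustration, a naive baseline submission could simply predict the overall churn rate for every client; the sketch below assumes CHURN is encoded as 0/1 and that the submission uses user_id and CHURN columns.

```python
# Naive baseline for the churn task: predict the overall training churn rate
# for every test client. Assumes CHURN is encoded as 0/1 and that the
# submission file uses 'user_id' and 'CHURN' columns.
import pandas as pd

train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

churn_rate = train["CHURN"].mean()
submission = pd.DataFrame({"user_id": test["user_id"], "CHURN": churn_rate})
submission.to_csv("baseline_submission.csv", index=False)
```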
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Peevski (From Huggingface) [source]
The OpenLeecher/GPT4-10k dataset is a comprehensive collection of 100 diverse conversations, presented in text format, revolving around a wide range of topics. These conversations cover various domains such as coding, debugging, storytelling, and science. Aimed at facilitating training and analysis purposes for researchers and developers alike, this dataset offers an extensive array of conversation samples.
Each conversation within this dataset delves into different subject matters related to coding techniques, debugging strategies and storytelling methods, while also exploring concepts like spatial and logical thinking. Furthermore, the conversations touch upon scientific fields including chemistry, physics and biology. To add further depth to the dataset's content, it also includes discussions on the topic of law.
By providing this rich assortment of conversations spanning multiple domains and disciplines in one cohesive dataset, delivered on the Kaggle platform as a train.csv file, it empowers users to delve into these dialogue examples for exploration and analysis effortlessly. This compilation serves as an invaluable resource for understanding various aspects of coding practice alongside stimulating scientific discussions on subjects spanning multiple fields.
Introduction:
Understanding the Dataset Structure: The dataset consists of a CSV file named 'train.csv'. When examining the file's columns using the software or programming language of your choice (e.g., Python), you will notice the key column 'chat', which contains text data representing conversations between two or more participants.
Exploring Different Topics: The dataset covers a vast spectrum of subjects including coding techniques, debugging strategies, storytelling methods, spatial thinking, logical thinking, chemistry, physics, biology, and law. Each conversation touches on one or more of the following:
- Coding Techniques: Discover discussions on various programming concepts and best practices.
- Debugging Strategies: Explore conversations related to identifying and fixing software issues.
- Storytelling Methods: Dive into dialogues about effective storytelling techniques in different contexts.
- Spatial Thinking: Engage with conversations that involve developing spatial reasoning skills for problem-solving.
- Logical Thinking: Learn from discussions focused on enhancing logical reasoning abilities related to different domains.
- Chemistry
- Physics
- Biology
- Law
Analyzing Conversations: Leverage natural language processing (NLP) tools or techniques such as sentiment analysis. For example, after loading train.csv into a data frame df, print("Number of conversations:", len(df)) reports how many conversations are available.
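As a sketch of such an analysis, the snippet below counts conversations per topic area using simple keyword matching over the 'chat' column; the keyword lists are illustrative assumptions, not part of the dataset.

```python
# Sketch: count conversations per topic area with simple keyword matching over
# the 'chat' column (column name from the description above); the keyword
# lists are illustrative assumptions, not part of the dataset.
import pandas as pd

df = pd.read_csv("train.csv")
print("Number of conversations:", len(df))

topics = {
    "coding/debugging": ["python", "function", "bug", "compile", "error"],
    "science": ["chemistry", "physics", "biology"],
    "law": ["legal", "contract", "court"],
}
chat = df["chat"].astype(str).str.lower()
for topic, keywords in topics.items():
    mask = chat.apply(lambda text: any(k in text for k in keywords))
    print(f"{topic}: {mask.sum()} conversations")
```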
- Natural Language Processing Research: Researchers can leverage this dataset to train and evaluate natural language processing models, particularly in the context of conversational understanding and generation. The diverse conversations on coding, debugging, storytelling, and science can provide valuable insights into modeling human-like conversation patterns.
- Chatbot Development: The dataset can be utilized for training chatbots or virtual assistants that can engage in conversations related to coding, debugging, storytelling, and science. By exposing the chatbot to a wide range of conversation samples from different domains, developers can ensure that their chatbots are capable of providing relevant and accurate responses.
- Domain-specific Intelligent Assistants: Organizations or individuals working in fields such as coding education or scientific research may use this dataset to develop intelligent assistants tailored specifically for these domains. These assistants can help users navigate complex topics by answering questions related to coding techniques, debugging strategies, storytelling methods, or scientific concepts. Overall, 'train.csv' provides a rich resource for researchers and developers interested in building conversational AI systems with knowledge across multiple domains, including even legal matters.
If you use this dataset in your research, please credit the original authors. Data Source
License: MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
[WARNING: an in-depth description of the hourly data is missing at the moment. Please refer to the Open-Meteo website (Air Quality and Historical Weather APIs specifically) for descriptions of the columns included in the hourly data for the time being. In short, though, the hourly data info can be derived from the daily data info, as the hourly data is used to construct the daily data. Example: if avg_nitrogen_dioxide is the average of the hourly instant (10 meters above ground, in μg/m3) nitrogen dioxide values for a particular day, the "nitrogen_dioxide" column consists of the hourly instant measurements of nitrogen dioxide (10 meters above ground, in μg/m3).]
Result of a course project in the context of the Master's Degree in Data Science at Università Degli Studi di Milano-Bicocca. The dataset was built in hopes of finding ways to tackle the bad air quality for which Milan is becoming renowned, and to make the training of ML models possible. The data was collected through Open-Meteo's APIs, which in turn obtained it from "Reanalysis Models" of a European initiative, used for weather and air quality forecasting. The data was validated by the owners of the reanalysis datasets from which it comes, and through the construction of this specific dataset its data quality was assessed across accuracy, completeness and consistency dimensions. We aggregated the data from hourly to daily; the entire Data Management process can be consulted in the attached pdf.
File descriptions:
- weatheraqDataset.csv: contains DAILY data on weather and air quality for the city of Milan in comma-separated values (csv) format.
- weatheraqDataset_Report.pdf: report built to illustrate and make explicit the process followed to build the final dataset starting from the original data sources; it also explains any processing and aggregation/integration operations carried out.
- weatheraqHourly.csv: HOURLY data, counterpart to those in the daily dataset (the daily data is the result of aggregating the hourly data). The higher granularity and number of rows can help with achieving better results; for detailed descriptions of how these hourly values are recorded and at what resolutions, please visit the Open-Meteo website as stated in the warning at the start of the description.
GitHub repo of the project: https://github.com/edmos7/weather-aqMilan
Column descriptions for DAILY data (weatheraqDataset.csv):
Note: both 'date' in the DAILY data and 'datetime' in the HOURLY data are in local Milan time (CET/CEST), adjusted for Daylight Saving Time (DST).
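A minimal sketch of the hourly-to-daily aggregation described above, for a single pollutant; the 'datetime' and 'nitrogen_dioxide' column names are taken from the warning note, and the exact set of hourly columns may differ.

```python
# Sketch of the hourly-to-daily aggregation described above, for one pollutant.
# The 'datetime' and 'nitrogen_dioxide' column names come from the warning note;
# the exact set of hourly columns may differ.
import pandas as pd

hourly = pd.read_csv("weatheraqHourly.csv", parse_dates=["datetime"])

daily_no2 = (
    hourly.set_index("datetime")["nitrogen_dioxide"]
          .resample("D")            # calendar days in local Milan time
          .mean()
          .rename("avg_nitrogen_dioxide")
)
print(daily_no2.head())
```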
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This Grade School Math 8K Linguistically Diverse Training & Test Set is designed to help you develop and improve your understanding of multi-step reasoning question answering. The dataset contains three separate data files: the socratic_test.csv, main_test.csv, and main_train.csv, each containing a set of questions and answers related to grade school math that consists of multiple steps. Each file contains the same columns:
question and answer. The questions contained in this dataset are thoughtfully crafted to lead you through the reasoning journey for arriving at the correct answer each time, allowing you immense opportunities for learning through practice. With over 8 thousand entries for both training and testing purposes in this GSM8K dataset, it takes advanced multi-step reasoning skills to ace these questions! Deepen your knowledge today and master any challenge with ease using this amazing GSM8K set!
This dataset provides a unique opportunity to study multi-step reasoning for question answering. The GSM8K Linguistically Diverse Training & Test Set consists of 8,000 questions and answers that have been created to simulate real-world scenarios in grade school mathematics. Each question is paired with one answer based on a comprehensive test set. The questions cover topics such as algebra, arithmetic, probability and more.
The dataset consists of three files: main_train.csv, main_test.csv and socratic_test.csv, each containing grade school math questions paired with multi-step worked answers. Each row holds one question and its corresponding answer in the question and answer columns; these can be used in combination with text analysis models such as ELMo or BERT to explore different representations for natural language processing tasks such as Q&A, or to build predictive models for numerical reasoning applications.
To use this dataset efficiently, first get familiar with its structure by reading through the documentation, so that you are aware of the content, definitions and format of the available fields. Then study the examples that best suit your specific purpose, whether that is an education research experiment, generating insights for analytics reports, or building and evaluating predictive models.
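For illustration, the sketch below loads the training split and uses line counts as a rough proxy for reasoning depth; the 'question' and 'answer' column names are as documented above.

```python
# Sketch: load the training split and inspect a question/answer pair.
# Column names 'question' and 'answer' are as documented above.
import pandas as pd

train = pd.read_csv("main_train.csv")
print(train.shape)

sample = train.iloc[0]
print("Q:", sample["question"])
print("A:", sample["answer"])

# Rough proxy for reasoning depth: number of lines in each worked answer.
steps = train["answer"].astype(str).str.count("\n") + 1
print(steps.describe())
```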
- Training language models for improving accuracy in natural language processing applications such as question answering or dialogue systems.
- Generating new grade school math questions and answers using g...
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This English-ASL Bilingual Corpus 2012, or ASLG-PC12, provides valuable insight and data to users interested in the study of language. The dataset contains columns of easily readable gloss and text pairs. This Gloss and Text Pairing can greatly assist with the study of conversational habits by providing written accompaniments to American Sign Language (ASL) signs. Whether you are looking for a diverse sampling of ASL usage, or just want to delve deeper into sign language research, this corpus has plenty to offer linguists, therapists, teachers and students alike. With over 12000 entries altogether in one organized source any researcher would find it useful!
This dataset provides an interesting and insightful look into the relationship between American Sign Language (ASL) and English. The ASLG-PC12 corpus contains a collection of English-ASL gloss and text pairs, meaning you can learn not just about the words and signs used in ASL, but also their relationship to one another.
To get started using this dataset, first you'll want to explore the data sample. This can be done by opening up the train.csv file included in this dataset. It includes columns for both gloss descriptions of each sign as well as their corresponding translations in English.
Once familiar with the data, it's time to dive deeper! You can use this dataset for various purposes, from training a machine learning algorithm to recognizing signs through image processing techniques, or even creating an online dictionary of sorts that maps out ASL words from commonly used English language words. No matter what application you are planning on building out of this dataset, it promises insights into human communication that cannot be found elsewhere!
So unlock your power with American Sign Language - start exploring all that ASLG-PC12 corpus has to offer!
- Training ASL language recognition algorithms.
- Developing machine translation systems to translate between English and ASL.
- Designing a web or mobile application to help teach users how to fluently sign in either language
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description                                                |
|:------------|:-----------------------------------------------------------|
| gloss       | The literal sign-for-sign translation of a word. (Text)    |
| text        | The standard English equivalent of the ASL gloss. (Text)   |
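A minimal sketch for loading the gloss/text pairs and comparing their token counts, using the column names from the table above:

```python
# Sketch: load the gloss/text pairs and compare token counts, using the column
# names from the table above.
import pandas as pd

df = pd.read_csv("train.csv")
print(df[["gloss", "text"]].head())

# ASL gloss is typically more compact than the English sentence it accompanies.
gloss_tokens = df["gloss"].astype(str).str.split().str.len()
text_tokens = df["text"].astype(str).str.split().str.len()
print("mean gloss tokens:", gloss_tokens.mean(), "| mean text tokens:", text_tokens.mean())
```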
If you use this dataset in your research, please credit the original authors and Huggingface Hub.