Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Step-by-step instructions have been extracted from wikiHow in 16 different languages and decomposed into a formal graph representation like the one shown in the picture below. The source pages from which the instructions were extracted have also been collected and can be shared upon request.
Instructions are represented in RDF following the PROHOW vocabulary and data model. For example, the category, steps, requirements and methods of each set of instructions have been extracted.
This dataset has been produced as part of The Web of Know-How project.
The large amount of data can make this dataset difficult to work with. This is why an instruction-extraction Python script was developed. The script allows you to extract only the subsets of instructions you are interested in: the class_hierarchy.ttl file attached to this dataset is used to determine whether a set of instructions falls under a certain category or not. The script is available on this GitHub repository.
This page contains the links to the different language versions of the data.
A previous version of this type of data, although for English only, is also available on Kaggle:
For the multilingual dataset, this is the list of the available languages and number of articles in each:
The dataset is in RDF and it can be queried in SPARQL. Sample SPARQL queries are available in this GitHub page.
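For local exploration, the RDF dump can also be loaded and queried directly from Python. Below is a minimal sketch using rdflib; the file name and the PROHOW predicate used in the query are assumptions for illustration, not taken from this page.

```python
import rdflib

# Load one of the RDF dumps (file name is hypothetical).
g = rdflib.Graph()
g.parse("prohow_instructions_en.ttl", format="turtle")

# Count how many steps each set of instructions has.
# The prohow:has_step predicate is an assumption based on the PROHOW vocabulary.
query = """
PREFIX prohow: <http://w3id.org/prohow#>
SELECT ?task (COUNT(?step) AS ?steps)
WHERE { ?task prohow:has_step ?step . }
GROUP BY ?task
ORDER BY DESC(?steps)
LIMIT 10
"""
for task, steps in g.query(query):
    print(task, steps)
```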
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Pinterest Fashion Compatibility dataset comprises images showcasing fashion products, each annotated with bounding boxes and associated with links directing to the corresponding products. This dataset facilitates the exploration of scene-based complementary product recommendation, aiming to complete the look presented in each scene by recommending compatible fashion items.
Basic Statistics:
- Scenes: 47,739
- Products: 38,111
- Scene-Product Pairs: 93,274

Metadata:
- Product IDs: Identifiers for the products featured in the images.
- Bounding Boxes: Coordinates specifying the location of each product within the image.
Example (fashion.json):
The dataset contains JSON entries where each entry associates a product with a scene, along with the bounding box coordinates for the product within the scene.
```json
{
  "product": "0027e30879ce3d87f82f699f148bff7e",
  "scene": "cdab9160072dd1800038227960ff6467",
  "bbox": [
    0.434097,
    0.859363,
    0.560254,
    1.0
  ]
}
```
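The bbox values in the example appear to be normalized to [0, 1]. Below is a minimal sketch for turning such a box into a pixel crop, assuming the order [x_min, y_min, x_max, y_max], one JSON object per line in fashion.json, and hypothetical image paths; adjust to however the images are actually stored.

```python
import json
from PIL import Image

# Load scene-product pairs (assuming one JSON object per line).
with open("fashion.json") as f:
    pairs = [json.loads(line) for line in f]

pair = pairs[0]
scene_img = Image.open(f"scenes/{pair['scene']}.jpg")  # hypothetical path
w, h = scene_img.size

# Convert the normalized bounding box to pixel coordinates and crop the product.
x0, y0, x1, y1 = pair["bbox"]
crop = scene_img.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
crop.save("product_crop.jpg")
```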
Citation: If you utilize this dataset, please cite the following paper:
- Title: Complete the Look: Scene-based complementary product recommendation
- Authors: Wang-Cheng Kang, Eric Kim, Jure Leskovec, Charles Rosenberg, Julian McAuley
- Published in: CVPR, 2019
- Link to paper
Code and Additional Resources: For additional resources, sample code, and instructions on how to collect the product images from Pinterest, you can visit the GitHub repository.
This dataset provides a rich ground for research and development in the domain of fashion-based image recognition, product recommendation, and the exploration of fashion styles and trends through machine learning and computer vision techniques.
A dataset containing basic conversations, mental health FAQ, classical therapy conversations, and general advice provided to people suffering from anxiety and depression.
This dataset can be used to train a model for a chatbot that can behave like a therapist in order to provide emotional support to people with anxiety & depression.
The dataset contains intents. An "intent" is the intention behind a user's message. For instance, if a user says "I am sad" to the chatbot, the intent would be "sad". Each intent has a set of Patterns and Responses: Patterns are examples of user messages that align with the intent, while Responses are the replies the chatbot provides for that intent. Various intents are defined, and their patterns and responses are used as the model's training data to identify a particular intent.
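As a rough illustration of how such intent data is commonly laid out and consumed (the field names and the toy matcher below are assumptions, not the dataset's actual schema):

```python
# A minimal, hypothetical intents structure and matcher; the real dataset's
# field names may differ.
intents = [
    {
        "tag": "sad",
        "patterns": ["I am sad", "I feel down", "I'm unhappy"],
        "responses": ["I'm sorry to hear that. Do you want to talk about it?"],
    },
    {
        "tag": "greeting",
        "patterns": ["Hi", "Hello", "Hey there"],
        "responses": ["Hello! How are you feeling today?"],
    },
]

def match_intent(message: str) -> str:
    """Return the tag whose patterns best overlap with the message (toy baseline)."""
    words = set(message.lower().split())
    best_tag, best_overlap = "unknown", 0
    for intent in intents:
        overlap = max(len(words & set(p.lower().split())) for p in intent["patterns"])
        if overlap > best_overlap:
            best_tag, best_overlap = intent["tag"], overlap
    return best_tag

print(match_intent("I am so sad today"))  # -> "sad"
```

In practice the patterns are used to train an intent classifier rather than matched by word overlap; the sketch only shows how Patterns and Responses relate to an intent.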
https://www.usa.gov/government-works/
Data from 2010 Q1 to 2025 Q1
The data is created with this Jupyter Notebook:
The data format is documented in the Readme. The SEC data documentation can be found here.
JSON structure:
```json
{
  "quarter": "Q1",
  "country": "Italy",
  "data": {
    "cf": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}],
    "bs": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}],
    "ic": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}]
  },
  "year": 0,
  "name": "B",
  "startDate": "2009-12-31",
  "endDate": "2010-12-30",
  "symbol": "GM",
  "city": "York"
}
```
An example JSON:
```json
{
  "year": 2023,
  "data": {
    "cf": [{"value": -1834000000, "concept": "NetCashProvidedByUsedInFinancingActivities", "unit": "USD", "label": "Amount of cash inflow (outflow) from financing …", "info": "Net cash used in financing activities"}],
    "ic": [{"value": 1000000, "concept": "IncreaseDecreaseInDueFromRelatedParties", "unit": "USD", "label": "The increase (decrease) during the reporting pe…", "info": "Receivables from related parties"}],
    "bs": [{"value": 2779000000, "concept": "AccountsPayableCurrent", "unit": "USD", "label": "Carrying value as of the balance sheet date of …", "info": "Accounts payable"}]
  },
  "quarter": "Q2",
  "city": "SANTA CLARA",
  "startDate": "2023-06-30",
  "name": "ADVANCED MICRO DEVICES INC",
  "endDate": "2023-09-29",
  "country": "US",
  "symbol": "AMD"
}
```
The set of annotation files used for training EffNet with K-fold cross-validation across multiple sessions, bypassing runtime limits.
Here is an example using 5 folds. Each fold has its own annotation files defining the train (.json), validation (.json), and test (.npy) datasets. Images: 512 x 512, TFRecords + additional data from ISIC 2018 and ISIC 2019.
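A minimal sketch of loading one fold's annotation files; the file names below are hypothetical and should be matched to the actual names in this dataset.

```python
import json
import numpy as np

# Load the annotation files for one fold (hypothetical file names).
fold = 0
with open(f"fold{fold}_train.json") as f:
    train_ann = json.load(f)
with open(f"fold{fold}_val.json") as f:
    val_ann = json.load(f)
test_ids = np.load(f"fold{fold}_test.npy", allow_pickle=True)

print(len(train_ann), len(val_ann), len(test_ids))
```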
@cdeotte
SIIM-ISIC Melanoma Classification
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains the text of each article along with all the images from that article, together with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are just a small subset of all available articles, because they are manually reviewed and protected from edits; they thus represent the best quality that human editors on Wikipedia can offer.
You can find more details in the "Image Recommendation for Wikipedia Articles" thesis.
The high-level structure of the dataset is as follows:
.
+-- page1
| +-- text.json
| +-- img
| +-- meta.json
+-- page2
| +-- text.json
| +-- img
| +-- meta.json
:
+-- pageN
| +-- text.json
| +-- img
| +-- meta.json
label | description |
---|---|
pageN | is the title of N-th Wikipedia page and contains all information about the page |
text.json | text of the page saved as JSON. Please refer to the details of JSON schema below. |
meta.json | a collection of all images of the page. Please refer to the details of JSON schema below. |
imageN | is the N-th image of an article, saved in jpg format where the width of each image is set to 600px. Name of the image is md5 hashcode of original image title. |
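A short sketch of iterating over this layout in Python; the root directory name is hypothetical, and the keys follow the JSON examples shown below.

```python
import json
from pathlib import Path

# Walk the per-page directories and load the page text and image metadata.
# "wiki_pages" is a hypothetical root folder name.
root = Path("wiki_pages")
for page_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    with open(page_dir / "text.json", encoding="utf-8") as f:
        text = json.load(f)
    with open(page_dir / "meta.json", encoding="utf-8") as f:
        meta = json.load(f)
    print(text["title"], "-", len(meta["img_meta"]), "images")
```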
Below you see an example of how data is stored:
```json
{
  "title": "Naval Battle of Guadalcanal",
  "id": 405411,
  "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
  "html": "...",
  "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ..."
}
```
key | description |
---|---|
title | page title |
id | unique page id |
url | url of a page on Wikipedia |
html | HTML content of the article |
wikitext | wikitext content of the article |
Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.
{
"img_meta": [
{
"filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
"title": "IronbottomSound.jpg",
"parsed_title": "ironbottom sound",
"url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
"is_icon": False,
"on_commons": True,
"description": "A U.S. destroyer steams up what later became known as ...",
"caption": "Ironbottom Sound. The majority of the warship surface ...",
"headings": ['Naval Battle of Guadalcanal', 'First Naval Battle of Guadalcanal', ...],
"features": ['4.8618264', '0.49436468', '7.0841103', '2.7377882', '2.1305492', ...],
},
...
]
}
key | description |
---|---|
filename | unique image id, md5 hashcode of original image title |
title | image title retrieved from Commons, if applicable |
parsed_title | image title split into words, i.e. "helloWorld.jpg" -> "hello world" |
url | url of an image on Wikipedia |
is_icon | true if the image is an icon, e.g. a category icon. We assume an image is an icon if its preview cannot be loaded on Wikipedia after clicking on it |
on_commons | true if the image is available from the Wikimedia Commons dataset |
description | description of an image parsed from Wikimedia Commons page, if available |
caption | caption of an image parsed from Wikipedia article, if available |
headings | list of all nested headings of the location where the image is placed in the Wikipedia article. The first element is the top-most heading |
features | output of the 5-th convolutional layer of ResNet152 trained on the ImageNet dataset. That output of shape (19, 24, 2048) is max-pooled to a shape of (2048,). Features are taken from the original images downloaded in JPEG format with a fixed width of 600px. Practically, it is a list of floats with len = 2048 |
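As a rough illustration of how such features could be reproduced (a minimal sketch assuming torchvision's ImageNet-pretrained ResNet152; the example path reuses names from the listings above, and the stored features may have been produced with a different preprocessing pipeline):

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Keep ResNet152 up to its last convolutional block (drop avgpool and fc).
resnet = models.resnet152(pretrained=True)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("page1/img/702105f83a2aa0d2a89447be6b61c624.jpg").convert("RGB")
x = preprocess(img).unsqueeze(0)           # (1, 3, H, W), width fixed to 600px in the data

with torch.no_grad():
    fmap = backbone(x)                     # (1, 2048, H/32, W/32)
    features = torch.amax(fmap, dim=(2, 3)).squeeze(0)  # global max-pool -> (2048,)

print(features.shape)                      # torch.Size([2048])
```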
Data was collected by fetching the text and image content of featured articles with the pywikibot library and then parsing a large amount of additional metadata out of the HTML pages from Wikipedia and Commons.
This dataset provides the model, config, and spiece files of T5-base for PyTorch. They can be used for loading the pre-trained model and the modified SentencePiece tokenizer.
- config.json - model configuration
- pytorch_model.bin - pre-trained model
- spiece.model - vocabulary
Here, the spiece.model file can be used as a separate tokenizer. For example, in the https://www.kaggle.com/c/tweet-sentiment-extraction competition, if one needs token offsets, one will not be able to use the Hugging Face built-in tokenizer directly. Instead, spiece.model can be used as described in https://www.kaggle.com/abhishek/sentencepiece-tokenizer-with-offsets.
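A minimal sketch of loading spiece.model with the sentencepiece library; computing character offsets as in the referenced kernel is left out, and the sample sentence is arbitrary.

```python
import sentencepiece as spm

# Load the SentencePiece model shipped with this dataset and tokenize a sentence.
sp = spm.SentencePieceProcessor()
sp.load("spiece.model")

text = "The quick brown fox jumps over the lazy dog."
pieces = sp.encode_as_pieces(text)   # subword pieces
ids = sp.encode_as_ids(text)         # corresponding vocabulary ids

print(pieces)
print(ids)
print(sp.decode_ids(ids))            # round-trip back to text
```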
All files are taken from Hugging Face or generated using it. Also, thank you @abhishek for sharing such useful information.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Goodreads Book Reviews dataset encapsulates a wealth of reviews and various attributes concerning the books listed on the Goodreads platform. A distinguishing feature of this dataset is its capture of multiple tiers of user interaction, ranging from adding a book to a "shelf", to rating and reading it. This dataset is a treasure trove for those interested in understanding user behavior, book recommendations, sentiment analysis, and the interplay between various attributes of books and user interactions.
Basic Statistics:
- Items: 1,561,465
- Users: 808,749
- Interactions: 225,394,930

Metadata:
- Reviews: The text of the reviews provided by users.
- Add-to-shelf, Read, Review Actions: Various interactions users have with the books.
- Book Attributes: Attributes describing the books, including title and ISBN.
- Graph of Similar Books: A graph depicting similarity relations between books.
Example (interaction data):
```json
{
  "user_id": "8842281e1d1347389f2ab93d60773d4d",
  "book_id": "130580",
  "review_id": "330f9c153c8d3347eb914c06b89c94da",
  "isRead": true,
  "rating": 4,
  "date_added": "Mon Aug 01 13:41:57 -0700 2011",
  "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
  "read_at": "Fri Jan 01 00:00:00 -0800 1988",
  "started_at": ""
}
```
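A minimal sketch of streaming the interaction records and computing a simple statistic, assuming a gzipped, line-delimited JSON file; the file name is hypothetical.

```python
import gzip
import json

# Average rating of books that were marked as read and actually rated.
# File name and gzipped JSON-lines layout are assumptions for illustration.
total, count = 0, 0
with gzip.open("goodreads_interactions.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("isRead") and rec.get("rating", 0) > 0:
            total += rec["rating"]
            count += 1

print("average rating of read books:", total / max(count, 1))
```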
Use Cases:
- Book Recommendations: Creating personalized book recommendations based on user interactions and preferences.
- Sentiment Analysis: Analyzing sentiment in reviews and understanding how different book attributes influence sentiment.
- User Behavior Analysis: Understanding user interaction patterns with books and deriving insights to enhance user engagement.
- Natural Language Processing: Training models to process and analyze user-generated text in reviews.
- Similarity Analysis: Analyzing the graph of similar books to understand book similarities and clustering.
Citation:
Please cite the following if you use the data:
Item recommendation on monotonic behavior chains
Mengting Wan, Julian McAuley
RecSys, 2018
[PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18e.pdf)
Code Samples: A curated set of code samples is provided in the dataset's GitHub repository, aiding in seamless interaction with the datasets. These include:
- Downloading datasets without GUI: Facilitating dataset download in a non-GUI environment.
- Displaying Sample Records: Showcasing sample records to get a glimpse of the dataset structure.
- Calculating Basic Statistics: Computing basic statistics to understand the dataset's distribution and characteristics.
- Exploring the Interaction Data: Delving into interaction data to grasp user-book interaction patterns.
- Exploring the Review Data: Analyzing review data to extract valuable insights from user reviews.
Additional Dataset: - Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): This dataset comprises a comprehensive collection of reviews, showcasing a multilingual facet with reviews about around 2 million books from 465,000 users.
Datasets:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains a list of all railway stations in India, including station codes, station names, and region codes. It is a comprehensive resource useful for various applications like transportation analytics, geographic studies, mapping, and machine learning projects.
Dataset Structure
The dataset is provided in JSON format with an array of objects, where each object represents a railway station.
Example JSON Format:
```json
[
  { "station_code": "A", "station_name": "ARMENIAN GHAT CITY", "region_code": "SE" },
  { "station_code": "AA", "station_name": "ATARIA", "region_code": "NE" },
  { "station_code": "AABH", "station_name": "AMBIKA BHAWALI HALT", "region_code": "EC" }
]
```
Keys in the JSON:
- station_code: A unique identifier for each railway station (e.g., "A", "AA", "AABH").
- station_name: The name of the railway station (e.g., "ARMENIAN GHAT CITY", "ATARIA").
- region_code: The region to which the station belongs (e.g., "SE", "NE", "EC").
```python
import json

with open('railway_stations_india.json', 'r') as file:
    data = json.load(file)

# Display the first 5 stations
for station in data[:5]:
    print(f"Station Code: {station['station_code']}, "
          f"Station Name: {station['station_name']}, "
          f"Region Code: {station['region_code']}")
```
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains molecular properties scraped from the PubChem database. Each file contains properties for thousands of molecules, made up of the elements H, C, N, O, F, Si, P, S, Cl, Br, and I. The dataset is related to a previous one, which had a smaller number of molecules and in which the features were preconstructed.
Instead, this dataset is a challenging case for feature engineering and is the subject of active research (see the references below).
The utilities used to download and process the data can be accessed from my GitHub repo.
Each JSON file contains a list of molecular data. An example molecule is given below:
{
'En': 37.801,
'atoms': [
{'type': 'O', 'xyz': [0.3387, 0.9262, 0.46]},
{'type': 'O', 'xyz': [3.4786, -1.7069, -0.3119]},
{'type': 'O', 'xyz': [1.8428, -1.4073, 1.2523]},
{'type': 'O', 'xyz': [0.4166, 2.5213, -1.2091]},
{'type': 'N', 'xyz': [-2.2359, -0.7251, 0.027]},
{'type': 'C', 'xyz': [-0.7783, -1.1579, 0.0914]},
{'type': 'C', 'xyz': [0.1368, -0.0961, -0.5161]},
{'type': 'C', 'xyz': [-3.1119, -1.7972, 0.659]},
{'type': 'C', 'xyz': [-2.4103, 0.5837, 0.784]},
{'type': 'C', 'xyz': [-2.6433, -0.5289, -1.426]},
{'type': 'C', 'xyz': [1.4879, -0.6438, -0.9795]},
{'type': 'C', 'xyz': [2.3478, -1.3163, 0.1002]},
{'type': 'C', 'xyz': [0.4627, 2.1935, -0.0312]},
{'type': 'C', 'xyz': [0.6678, 3.1549, 1.1001]},
{'type': 'H', 'xyz': [-0.7073, -2.1051, -0.4563]},
{'type': 'H', 'xyz': [-0.5669, -1.3392, 1.1503]},
{'type': 'H', 'xyz': [-0.3089, 0.3239, -1.4193]},
{'type': 'H', 'xyz': [-2.9705, -2.7295, 0.1044]},
{'type': 'H', 'xyz': [-2.8083, -1.921, 1.7028]},
{'type': 'H', 'xyz': [-4.1563, -1.4762, 0.6031]},
{'type': 'H', 'xyz': [-2.0398, 1.417, 0.1863]},
{'type': 'H', 'xyz': [-3.4837, 0.7378, 0.9384]},
{'type': 'H', 'xyz': [-1.9129, 0.5071, 1.7551]},
{'type': 'H', 'xyz': [-2.245, 0.4089, -1.819]},
{'type': 'H', 'xyz': [-2.3, -1.3879, -2.01]},
{'type': 'H', 'xyz': [-3.7365, -0.4723, -1.463]},
{'type': 'H', 'xyz': [1.3299, -1.3744, -1.7823]},
{'type': 'H', 'xyz': [2.09, 0.1756, -1.3923]},
{'type': 'H', 'xyz': [-0.1953, 3.128, 1.7699]},
{'type': 'H', 'xyz': [0.7681, 4.1684, 0.7012]},
{'type': 'H', 'xyz': [1.5832, 2.901, 1.6404]}
],
'id': 1,
'shapeM': [259.66, 4.28, 3.04, 1.21, 1.75, 2.55, 0.16, -3.13, -0.22, -2.18, -0.56, 0.21, 0.17, 0.09]
}
Notice that each molecule contains a different number and different types of atoms, so it is challenging to come up with features that describe every molecule in a unique way. There are several approaches in the literature (see the references), one of which is to use the Coulomb matrix of a given molecule, defined by
\[ C_{IJ} = \frac{Z_I Z_J}{\vert R_I - R_J \vert} \quad (I \neq J), \qquad C_{IJ} = Z_I^{2.4} \quad (I = J) \]
where $Z_I$ are the atomic numbers (which can be looked up from the periodic table for each element), and $\vert R_I - R_J \vert$ is the distance between atoms I and J. The previous dataset used these features for a subset of the molecules given here, where the maximum number of atoms in a given molecule was limited to 50.
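As an illustration, here is a small NumPy sketch that builds this Coulomb matrix directly from the 'atoms' list of a molecule record shown above; distances are taken in the same units as the 'xyz' coordinates.

```python
import numpy as np

# Atomic numbers for the elements that appear in this dataset.
Z = {'H': 1, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Si': 14,
     'P': 15, 'S': 16, 'Cl': 17, 'Br': 35, 'I': 53}

def coulomb_matrix(molecule):
    """Build the Coulomb matrix from the 'atoms' list of one molecule record."""
    atoms = molecule['atoms']
    n = len(atoms)
    z = np.array([Z[a['type']] for a in atoms], dtype=float)
    xyz = np.array([a['xyz'] for a in atoms], dtype=float)

    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = z[i] ** 2.4                 # diagonal term from the formula above
            else:
                C[i, j] = z[i] * z[j] / np.linalg.norm(xyz[i] - xyz[j])
    return C

# Usage: C = coulomb_matrix(molecules[0]) for a molecule loaded from a JSON file.
```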
There are around 100,000,000 molecules in the whole database. As more files are scraped, new data will be added in time.
Note: In the previous dataset, the molecular energies were computed by quantum mechanical simulations. Here, the given energies are computed using another method, so their values are different.
Simulations of molecular properties are computationally expensive. The purpose of this project is to use machine learning methods to build a model that can predict molecular properties from a database. The PubChem database contains around 100,000,000 molecules; it could take years to run simulations on all of them, whereas machine learning can predict their properties much faster. As a result, this could open up many possibilities in the computational design and discovery of molecules, compounds and new drugs.
This is a regression problem...
https://creativecommons.org/publicdomain/zero/1.0/
The Semantic Drone Dataset focuses on semantic understanding of urban scenes for increasing the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from nadir (bird's eye) view acquired at an altitude of 5 to 30 meters above the ground. A high-resolution camera was used to acquire images at a size of 6000x4000px (24Mpx). The training set contains 400 publicly available images and the test set is made up of 200 private images.
This dataset is taken from https://www.kaggle.com/awsaf49/semantic-drone-dataset. We removed and added files and information as needed for our research. We created TIFF files with a resolution of 1200x800 pixels and 24 channels, where each channel represents a class preprocessed from the PNG label files (a sketch of this expansion follows the file listing below). We reduced the resolution and compressed the TIFF files with the tifffile Python library.
If you have any problems with the modified TIFF dataset, you can contact nunenuh@gmail.com or gaungalif@gmail.com.
This dataset is a copy of the original dataset (link below), with some improvements to the semantic data and classes. Semantic data is available in PNG and TIFF format at a smaller size as needed.
The images are labelled densely using polygons and contain the following 24 classes:
unlabeled, paved-area, dirt, grass, gravel, water, rocks, pool, vegetation, roof, wall, window, door, fence, fence-pole, person, dog, car, bicycle, tree, bald-tree, ar-marker, obstacle, conflicting
> images
> labels/png
> labels/tiff
- class_to_idx.json
- classes.csv
- classes.json
- idx_to_class.json
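A minimal sketch of how a PNG label could be expanded into a 24-channel one-hot mask like the TIFF labels described above, assuming class_to_idx.json maps class names to integer indices and that the PNG stores one class index per pixel; the file names and layout are assumptions.

```python
import json
import numpy as np
from PIL import Image

# Map class names to channel indices (assumed structure of class_to_idx.json).
with open("class_to_idx.json") as f:
    class_to_idx = json.load(f)
num_classes = len(class_to_idx)   # the 24 classes listed above

# Load a PNG label where each pixel stores a class index (hypothetical file name).
label = np.array(Image.open("labels/png/000.png"))

# Expand to a (num_classes, H, W) one-hot mask, one channel per class.
one_hot = np.stack([(label == idx).astype(np.uint8) for idx in range(num_classes)])
print(one_hot.shape)
```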
aerial@icg.tugraz.at
If you use this dataset in your research, please cite the following URL: www.dronedataset.icg.tugraz.at
The Drone Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:
That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (Graz University of Technology) do not accept any responsibility for errors or omissions. That you include a reference to the Semantic Drone Dataset in any work that makes use of the dataset. For research papers or other media link to the Semantic Drone Dataset webpage.
That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works in as far as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow to recover the dataset or something similar in character. That you may not use the dataset or any derivative work for commercial purposes as, for example, licensing or selling the data, or using the data with a purpose to procure a commercial gain. That all rights not expressly granted to you are reserved by us (Graz University of Technology).
The objective of this competition is to create a model to predict the number of retweets a tweet will get on Twitter. The data used to train the model will be approximately 2,400 tweets each from 38 major banks and mobile network operators across Africa.
A machine learning model to predict retweets would be valuable to any business that uses social media to share important information and messages to the public. This model can be used as a tool to help businesses better tailor their tweets to ensure maximum impact and outreach to clients and non-clients.
The data has been split into a test and training set.
train.json (zipped) is the dataset that you will use to train your model. This dataset includes about 2,400 consecutive tweets from each of the companies listed below, for a total of 96,562 tweets.
test_questions.json (zipped) is the dataset to which you will apply your model to test how well it performs. Use your model and this dataset to predict the number of retweets each tweet will receive. The test set consists of the consecutive tweets that followed the tweets provided in the training set. There are a maximum of 800 tweets per company in this test set. This dataset includes the same fields as train.json except for the retweet_count and favorite_count variables.
sample_submission.csv is a table to provide an example of what your submission file should look like.
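A minimal sketch for inspecting the training target, assuming train.json unzips to a JSON array of tweet objects; field names other than retweet_count are not assumed here.

```python
import json
import statistics

# Load the training tweets (assuming a JSON array; adjust if the file is
# line-delimited) and look at the retweet_count target.
with open("train.json") as f:
    tweets = json.load(f)

retweets = [t["retweet_count"] for t in tweets]
print("tweets:", len(retweets))
print("median retweets:", statistics.median(retweets))

# A trivial baseline would predict the median retweet count for every test tweet.
```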