Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Step-by-step instructions have been extracted from wikiHow in 16 different languages and decomposed into a formal graph representation like the one shown in the picture below. The source pages from which the instructions were extracted have also been collected and can be shared upon request.
Instructions are represented in RDF following the PROHOW vocabulary and data model. For example, the category, steps, requirements and methods of each set of instructions have been extracted.
This dataset has been produced as part of The Web of Know-How project.
The large amount of data can make this dataset difficult to work with. This is why an instruction-extraction Python script was developed. The script allows you to extract only the subsets of instructions you are interested in: the class_hierarchy.ttl file attached to this dataset is used to determine whether a set of instructions falls under a certain category or not. The script is available on this GitHub repository.
This page contains the links to the different language versions of the data.
A previous version of this type of data, although for English only, is also available on Kaggle:
For the multilingual dataset, this is the list of the available languages and number of articles in each:
The dataset is in RDF and it can be queried in SPARQL. Sample SPARQL queries are available in this GitHub page.
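For local exploration, the RDF dump can also be loaded and queried directly from Python. Below is a minimal sketch using rdflib; the file name and the PROHOW predicate used in the query are assumptions for illustration, not taken from this page.

```python
import rdflib

# Load one of the RDF dumps (file name is hypothetical).
g = rdflib.Graph()
g.parse("prohow_instructions_en.ttl", format="turtle")

# Count how many steps each set of instructions has.
# The prohow:has_step predicate is an assumption based on the PROHOW vocabulary.
query = """
PREFIX prohow: <http://w3id.org/prohow#>
SELECT ?task (COUNT(?step) AS ?steps)
WHERE { ?task prohow:has_step ?step . }
GROUP BY ?task
ORDER BY DESC(?steps)
LIMIT 10
"""
for task, steps in g.query(query):
    print(task, steps)
```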
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Pinterest Fashion Compatibility dataset comprises images showcasing fashion products, each annotated with bounding boxes and associated with links directing to the corresponding products. This dataset facilitates the exploration of scene-based complementary product recommendation, aiming to complete the look presented in each scene by recommending compatible fashion items.
Basic Statistics:
- Scenes: 47,739
- Products: 38,111
- Scene-Product Pairs: 93,274

Metadata:
- Product IDs: Identifiers for the products featured in the images.
- Bounding Boxes: Coordinates specifying the location of each product within the image.
Example (fashion.json):
The dataset contains JSON entries where each entry associates a product with a scene, along with the bounding box coordinates for the product within the scene.
```json
{
  "product": "0027e30879ce3d87f82f699f148bff7e",
  "scene": "cdab9160072dd1800038227960ff6467",
  "bbox": [
    0.434097,
    0.859363,
    0.560254,
    1.0
  ]
}
```
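The bbox values in the example appear to be normalized to [0, 1]. Below is a minimal sketch for turning such a box into a pixel crop, assuming the order [x_min, y_min, x_max, y_max], one JSON object per line in fashion.json, and hypothetical image paths; adjust to however the images are actually stored.

```python
import json
from PIL import Image

# Load scene-product pairs (assuming one JSON object per line).
with open("fashion.json") as f:
    pairs = [json.loads(line) for line in f]

pair = pairs[0]
scene_img = Image.open(f"scenes/{pair['scene']}.jpg")  # hypothetical path
w, h = scene_img.size

# Convert the normalized bounding box to pixel coordinates and crop the product.
x0, y0, x1, y1 = pair["bbox"]
crop = scene_img.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
crop.save("product_crop.jpg")
```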
Citation: If you utilize this dataset, please cite the following paper:
- Title: Complete the Look: Scene-based complementary product recommendation
- Authors: Wang-Cheng Kang, Eric Kim, Jure Leskovec, Charles Rosenberg, Julian McAuley
- Published in: CVPR, 2019
- Link to paper
Code and Additional Resources: For additional resources, sample code, and instructions on how to collect the product images from Pinterest, you can visit the GitHub repository.
This dataset provides a rich ground for research and development in the domain of fashion-based image recognition, product recommendation, and the exploration of fashion styles and trends through machine learning and computer vision techniques.
A dataset containing basic conversations, mental health FAQ, classical therapy conversations, and general advice provided to people suffering from anxiety and depression.
This dataset can be used to train a model for a chatbot that can behave like a therapist in order to provide emotional support to people with anxiety & depression.
The dataset contains intents. An "intent" is the intention behind a user's message. For instance, if a user says "I am sad" to the chatbot, the intent would be "sad". Each intent has a set of Patterns and Responses: Patterns are examples of user messages that align with the intent, while Responses are the replies the chatbot provides for that intent. Various intents are defined, and their patterns and responses are used as the model's training data to identify a particular intent.
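As a rough illustration of how such intent data is commonly laid out and consumed (the field names and the toy matcher below are assumptions, not the dataset's actual schema):

```python
# A minimal, hypothetical intents structure and matcher; the real dataset's
# field names may differ.
intents = [
    {
        "tag": "sad",
        "patterns": ["I am sad", "I feel down", "I'm unhappy"],
        "responses": ["I'm sorry to hear that. Do you want to talk about it?"],
    },
    {
        "tag": "greeting",
        "patterns": ["Hi", "Hello", "Hey there"],
        "responses": ["Hello! How are you feeling today?"],
    },
]

def match_intent(message: str) -> str:
    """Return the tag whose patterns best overlap with the message (toy baseline)."""
    words = set(message.lower().split())
    best_tag, best_overlap = "unknown", 0
    for intent in intents:
        overlap = max(len(words & set(p.lower().split())) for p in intent["patterns"])
        if overlap > best_overlap:
            best_tag, best_overlap = intent["tag"], overlap
    return best_tag

print(match_intent("I am so sad today"))  # -> "sad"
```

In practice the patterns are used to train an intent classifier rather than matched by word overlap; the sketch only shows how Patterns and Responses relate to an intent.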
https://www.usa.gov/government-works/
Data from 2010 Q1 to 2025 Q1
The data is created with this Jupyter Notebook:
The data format is documented in the Readme. The SEC data documentation can be found here.
JSON structure:
```json
{
  "quarter": "Q1",
  "country": "Italy",
  "data": {
    "cf": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}],
    "bs": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}],
    "ic": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}]
  },
  "year": 0,
  "name": "B",
  "startDate": "2009-12-31",
  "endDate": "2010-12-30",
  "symbol": "GM",
  "city": "York"
}
```
An example JSON:
```json
{
  "year": 2023,
  "data": {
    "cf": [{"value": -1834000000, "concept": "NetCashProvidedByUsedInFinancingActivities", "unit": "USD", "label": "Amount of cash inflow (outflow) from financing …", "info": "Net cash used in financing activities"}],
    "ic": [{"value": 1000000, "concept": "IncreaseDecreaseInDueFromRelatedParties", "unit": "USD", "label": "The increase (decrease) during the reporting pe…", "info": "Receivables from related parties"}],
    "bs": [{"value": 2779000000, "concept": "AccountsPayableCurrent", "unit": "USD", "label": "Carrying value as of the balance sheet date of …", "info": "Accounts payable"}]
  },
  "quarter": "Q2",
  "city": "SANTA CLARA",
  "startDate": "2023-06-30",
  "name": "ADVANCED MICRO DEVICES INC",
  "endDate": "2023-09-29",
  "country": "US",
  "symbol": "AMD"
}
```
The set of annotation files used for training EffNet with K-fold cross-validation across multiple sessions, bypassing runtime limits.
Here is an example using 5 folds. Each fold has its own annotation files defining the train (.json), validation (.json), and test (.npy) datasets. Images: 512 x 512, TFRecords + additional data from ISIC 2018 and ISIC 2019.
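A minimal sketch of loading one fold's annotation files; the file names below are hypothetical and should be matched to the actual names in this dataset.

```python
import json
import numpy as np

# Load the annotation files for one fold (hypothetical file names).
fold = 0
with open(f"fold{fold}_train.json") as f:
    train_ann = json.load(f)
with open(f"fold{fold}_val.json") as f:
    val_ann = json.load(f)
test_ids = np.load(f"fold{fold}_test.npy", allow_pickle=True)

print(len(train_ann), len(val_ann), len(test_ids))
```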
@cdeotte
SIIM-ISIC Melanoma Classification
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains the text of each article along with all the images from that article, together with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are just a small subset of all available articles, because they are manually reviewed and protected from edits; they thus represent the best quality that human editors on Wikipedia can offer.
You can find more details in the "Image Recommendation for Wikipedia Articles" thesis.
The high-level structure of the dataset is as follows:
.
+-- page1
| +-- text.json
| +-- img
| +-- meta.json
+-- page2
| +-- text.json
| +-- img
| +-- meta.json
:
+-- pageN
| +-- text.json
| +-- img
| +-- meta.json
label | description |
---|---|
pageN | is the title of N-th Wikipedia page and contains all information about the page |
text.json | text of the page saved as JSON. Please refer to the details of JSON schema below. |
meta.json | a collection of all images of the page. Please refer to the details of JSON schema below. |
imageN | is the N-th image of an article, saved in jpg format where the width of each image is set to 600px. Name of the image is md5 hashcode of original image title. |
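A short sketch of iterating over this layout in Python; the root directory name is hypothetical, and the keys follow the JSON examples shown below.

```python
import json
from pathlib import Path

# Walk the per-page directories and load the page text and image metadata.
# "wiki_pages" is a hypothetical root folder name.
root = Path("wiki_pages")
for page_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    with open(page_dir / "text.json", encoding="utf-8") as f:
        text = json.load(f)
    with open(page_dir / "meta.json", encoding="utf-8") as f:
        meta = json.load(f)
    print(text["title"], "-", len(meta["img_meta"]), "images")
```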
Below you see an example of how data is stored:
```json
{
  "title": "Naval Battle of Guadalcanal",
  "id": 405411,
  "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
  "html": "...",
  "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ..."
}
```
key | description |
---|---|
title | page title |
id | unique page id |
url | url of a page on Wikipedia |
html | HTML content of the article |
wikitext | wikitext content of the article |
Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.
{
"img_meta": [
{
"filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
"title": "IronbottomSound.jpg",
"parsed_title": "ironbottom sound",
"url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
"is_icon": False,
"on_commons": True,
"description": "A U.S. destroyer steams up what later became known as ...",
"caption": "Ironbottom Sound. The majority of the warship surface ...",
"headings": ['Naval Battle of Guadalcanal', 'First Naval Battle of Guadalcanal', ...],
"features": ['4.8618264', '0.49436468', '7.0841103', '2.7377882', '2.1305492', ...],
},
...
]
}
key | description |
---|---|
filename | unique image id, md5 hashcode of original image title |
title | image title retrieved from Commons, if applicable |
parsed_title | image title split into words, i.e. "helloWorld.jpg" -> "hello world" |
url | url of an image on Wikipedia |
is_icon | true if the image is an icon, e.g. a category icon. We assume an image is an icon if its preview cannot be loaded on Wikipedia after clicking on it |
on_commons | true if the image is available from the Wikimedia Commons dataset |
description | description of an image parsed from Wikimedia Commons page, if available |
caption | caption of an image parsed from Wikipedia article, if available |
headings | list of all nested headings of the location where the image is placed in the Wikipedia article. The first element is the top-most heading |
features | output of the 5-th convolutional layer of ResNet152 trained on the ImageNet dataset. That output of shape (19, 24, 2048) is max-pooled to a shape of (2048,). Features are taken from the original images downloaded in JPEG format with a fixed width of 600px. Practically, it is a list of floats with len = 2048 |
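As a rough illustration of how such features could be reproduced (a minimal sketch assuming torchvision's ImageNet-pretrained ResNet152; the example path reuses names from the listings above, and the stored features may have been produced with a different preprocessing pipeline):

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Keep ResNet152 up to its last convolutional block (drop avgpool and fc).
resnet = models.resnet152(pretrained=True)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("page1/img/702105f83a2aa0d2a89447be6b61c624.jpg").convert("RGB")
x = preprocess(img).unsqueeze(0)           # (1, 3, H, W), width fixed to 600px in the data

with torch.no_grad():
    fmap = backbone(x)                     # (1, 2048, H/32, W/32)
    features = torch.amax(fmap, dim=(2, 3)).squeeze(0)  # global max-pool -> (2048,)

print(features.shape)                      # torch.Size([2048])
```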
Data was collected by fetching the text and image content of featured articles with the pywikibot library and then parsing a large amount of additional metadata out of the HTML pages from Wikipedia and Commons.
This dataset provides the model, config, and spiece files of T5-base for PyTorch. They can be used for loading the pre-trained model and the modified SentencePiece tokenizer.
- config.json - model configuration
- pytorch_model.bin - pre-trained model
- spiece.model - vocabulary
Here, the spiece.model file can be used as a separate tokenizer. For example, in the https://www.kaggle.com/c/tweet-sentiment-extraction competition, if one needs token offsets, one will not be able to use the Hugging Face built-in tokenizer directly. Instead, spiece.model can be used as described in https://www.kaggle.com/abhishek/sentencepiece-tokenizer-with-offsets.
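A minimal sketch of loading spiece.model with the sentencepiece library; computing character offsets as in the referenced kernel is left out, and the sample sentence is arbitrary.

```python
import sentencepiece as spm

# Load the SentencePiece model shipped with this dataset and tokenize a sentence.
sp = spm.SentencePieceProcessor()
sp.load("spiece.model")

text = "The quick brown fox jumps over the lazy dog."
pieces = sp.encode_as_pieces(text)   # subword pieces
ids = sp.encode_as_ids(text)         # corresponding vocabulary ids

print(pieces)
print(ids)
print(sp.decode_ids(ids))            # round-trip back to text
```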
All files are taken from Hugging Face or generated using it. Also, thank you @abhishek for sharing such useful information.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Goodreads Book Reviews dataset encapsulates a wealth of reviews and various attributes concerning the books listed on the Goodreads platform. A distinguishing feature of this dataset is its capture of multiple tiers of user interaction, ranging from adding a book to a "shelf", to rating and reading it. This dataset is a treasure trove for those interested in understanding user behavior, book recommendations, sentiment analysis, and the interplay between various attributes of books and user interactions.
Basic Statistics:
- Items: 1,561,465
- Users: 808,749
- Interactions: 225,394,930

Metadata:
- Reviews: The text of the reviews provided by users.
- Add-to-shelf, Read, Review Actions: Various interactions users have with the books.
- Book Attributes: Attributes describing the books, including title and ISBN.
- Graph of Similar Books: A graph depicting similarity relations between books.
Example (interaction data):
```json
{
  "user_id": "8842281e1d1347389f2ab93d60773d4d",
  "book_id": "130580",
  "review_id": "330f9c153c8d3347eb914c06b89c94da",
  "isRead": true,
  "rating": 4,
  "date_added": "Mon Aug 01 13:41:57 -0700 2011",
  "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
  "read_at": "Fri Jan 01 00:00:00 -0800 1988",
  "started_at": ""
}
```
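A minimal sketch of streaming the interaction records and computing a simple statistic, assuming a gzipped, line-delimited JSON file; the file name is hypothetical.

```python
import gzip
import json

# Average rating of books that were marked as read and actually rated.
# File name and gzipped JSON-lines layout are assumptions for illustration.
total, count = 0, 0
with gzip.open("goodreads_interactions.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("isRead") and rec.get("rating", 0) > 0:
            total += rec["rating"]
            count += 1

print("average rating of read books:", total / max(count, 1))
```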
Use Cases:
- Book Recommendations: Creating personalized book recommendations based on user interactions and preferences.
- Sentiment Analysis: Analyzing sentiment in reviews and understanding how different book attributes influence sentiment.
- User Behavior Analysis: Understanding user interaction patterns with books and deriving insights to enhance user engagement.
- Natural Language Processing: Training models to process and analyze user-generated text in reviews.
- Similarity Analysis: Analyzing the graph of similar books to understand book similarities and clustering.
Citation:
Please cite the following if you use the data:
Item recommendation on monotonic behavior chains
Mengting Wan, Julian McAuley
RecSys, 2018
[PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18e.pdf)
Code Samples: A curated set of code samples is provided in the dataset's GitHub repository, aiding in seamless interaction with the datasets. These include:
- Downloading datasets without GUI: Facilitating dataset download in a non-GUI environment.
- Displaying Sample Records: Showcasing sample records to get a glimpse of the dataset structure.
- Calculating Basic Statistics: Computing basic statistics to understand the dataset's distribution and characteristics.
- Exploring the Interaction Data: Delving into interaction data to grasp user-book interaction patterns.
- Exploring the Review Data: Analyzing review data to extract valuable insights from user reviews.
Additional Dataset: - Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): This dataset comprises a comprehensive collection of reviews, showcasing a multilingual facet with reviews about around 2 million books from 465,000 users.
Datasets:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains a list of all railway stations in India, including station codes, station names, and region codes. It is a comprehensive resource useful for various applications like transportation analytics, geographic studies, mapping, and machine learning projects.
Dataset Structure
The dataset is provided in JSON format with an array of objects, where each object represents a railway station.
Example JSON Format:
```json
[
  { "station_code": "A", "station_name": "ARMENIAN GHAT CITY", "region_code": "SE" },
  { "station_code": "AA", "station_name": "ATARIA", "region_code": "NE" },
  { "station_code": "AABH", "station_name": "AMBIKA BHAWALI HALT", "region_code": "EC" }
]
```
Keys in the JSON:
- station_code: A unique identifier for each railway station (e.g., "A", "AA", "AABH").
- station_name: The name of the railway station (e.g., "ARMENIAN GHAT CITY", "ATARIA").
- region_code: The region to which the station belongs (e.g., "SE", "NE", "EC").
```python
import json

with open('railway_stations_india.json', 'r') as file:
    data = json.load(file)

# Display the first 5 stations
for station in data[:5]:
    print(f"Station Code: {station['station_code']}, "
          f"Station Name: {station['station_name']}, "
          f"Region Code: {station['region_code']}")
```
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains molecular properties scraped from the PubChem database. Each file contains properties for thousands of molecules, made up of the elements H, C, N, O, F, Si, P, S, Cl, Br, and I. The dataset is related to a previous one, which had a smaller number of molecules and in which the features were preconstructed.
Instead, this dataset is a challenging case for feature engineering and is the subject of active research (see the references below).
The utilities used to download and process the data can be accessed from my GitHub repo.
Each JSON file contains a list of molecular data. An example molecule is given below:
{
'En': 37.801,
'atoms': [
{'type': 'O', 'xyz': [0.3387, 0.9262, 0.46]},
{'type': 'O', 'xyz': [3.4786, -1.7069, -0.3119]},
{'type': 'O', 'xyz': [1.8428, -1.4073, 1.2523]},
{'type': 'O', 'xyz': [0.4166, 2.5213, -1.2091]},
{'type': 'N', 'xyz': [-2.2359, -0.7251, 0.027]},
{'type': 'C', 'xyz': [-0.7783, -1.1579, 0.0914]},
{'type': 'C', 'xyz': [0.1368, -0.0961, -0.5161]},
{'type': 'C', 'xyz': [-3.1119, -1.7972, 0.659]},
{'type': 'C', 'xyz': [-2.4103, 0.5837, 0.784]},
{'type': 'C', 'xyz': [-2.6433, -0.5289, -1.426]},
{'type': 'C', 'xyz': [1.4879, -0.6438, -0.9795]},
{'type': 'C', 'xyz': [2.3478, -1.3163, 0.1002]},
{'type': 'C', 'xyz': [0.4627, 2.1935, -0.0312]},
{'type': 'C', 'xyz': [0.6678, 3.1549, 1.1001]},
{'type': 'H', 'xyz': [-0.7073, -2.1051, -0.4563]},
{'type': 'H', 'xyz': [-0.5669, -1.3392, 1.1503]},
{'type': 'H', 'xyz': [-0.3089, 0.3239, -1.4193]},
{'type': 'H', 'xyz': [-2.9705, -2.7295, 0.1044]},
{'type': 'H', 'xyz': [-2.8083, -1.921, 1.7028]},
{'type': 'H', 'xyz': [-4.1563, -1.4762, 0.6031]},
{'type': 'H', 'xyz': [-2.0398, 1.417, 0.1863]},
{'type': 'H', 'xyz': [-3.4837, 0.7378, 0.9384]},
{'type': 'H', 'xyz': [-1.9129, 0.5071, 1.7551]},
{'type': 'H', 'xyz': [-2.245, 0.4089, -1.819]},
{'type': 'H', 'xyz': [-2.3, -1.3879, -2.01]},
{'type': 'H', 'xyz': [-3.7365, -0.4723, -1.463]},
{'type': 'H', 'xyz': [1.3299, -1.3744, -1.7823]},
{'type': 'H', 'xyz': [2.09, 0.1756, -1.3923]},
{'type': 'H', 'xyz': [-0.1953, 3.128, 1.7699]},
{'type': 'H', 'xyz': [0.7681, 4.1684, 0.7012]},
{'type': 'H', 'xyz': [1.5832, 2.901, 1.6404]}
],
'id': 1,
'shapeM': [259.66, 4.28, 3.04, 1.21, 1.75, 2.55, 0.16, -3.13, -0.22, -2.18, -0.56, 0.21, 0.17, 0.09]
}
Notice that each molecule contains a different number and different types of atoms, so it is challenging to come up with features that describe every molecule in a unique way. There are several approaches in the literature (see the references), one of which is to use the Coulomb matrix of a given molecule, defined by
\[ C_{IJ} = \frac{Z_I Z_J}{\vert R_I - R_J \vert} \quad (I \neq J), \qquad C_{IJ} = Z_I^{2.4} \quad (I = J) \]
where $Z_I$ are the atomic numbers (which can be looked up from the periodic table for each element), and $\vert R_I - R_J \vert$ is the distance between atoms I and J. The previous dataset used these features for a subset of the molecules given here, where the maximum number of atoms in a given molecule was limited to 50.
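As an illustration, here is a small NumPy sketch that builds this Coulomb matrix directly from the 'atoms' list of a molecule record shown above; distances are taken in the same units as the 'xyz' coordinates.

```python
import numpy as np

# Atomic numbers for the elements that appear in this dataset.
Z = {'H': 1, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Si': 14,
     'P': 15, 'S': 16, 'Cl': 17, 'Br': 35, 'I': 53}

def coulomb_matrix(molecule):
    """Build the Coulomb matrix from the 'atoms' list of one molecule record."""
    atoms = molecule['atoms']
    n = len(atoms)
    z = np.array([Z[a['type']] for a in atoms], dtype=float)
    xyz = np.array([a['xyz'] for a in atoms], dtype=float)

    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = z[i] ** 2.4                 # diagonal term from the formula above
            else:
                C[i, j] = z[i] * z[j] / np.linalg.norm(xyz[i] - xyz[j])
    return C

# Usage: C = coulomb_matrix(molecules[0]) for a molecule loaded from a JSON file.
```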
There are around 100,000,000 molecules in the whole database. As more files are scraped, new data will be added in time.
Note: In the previous dataset, the molecular energies were computed by quantum mechanical simulations. Here, the given energies are computed using another method, so their values are different.
Simulations of molecular properties are computationally expensive. The purpose of this project is to use machine learning methods to build a model that can predict molecular properties from a database. The PubChem database contains around 100,000,000 molecules; it could take years to run simulations on all of them, whereas machine learning can predict their properties much faster. As a result, this could open up many possibilities in the computational design and discovery of molecules, compounds and new drugs.
This is a regression problem...
https://creativecommons.org/publicdomain/zero/1.0/
The Semantic Drone Dataset focuses on semantic understanding of urban scenes for increasing the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from nadir (bird's eye) view acquired at an altitude of 5 to 30 meters above the ground. A high-resolution camera was used to acquire images at a size of 6000x4000px (24Mpx). The training set contains 400 publicly available images and the test set is made up of 200 private images.
This dataset is taken from https://www.kaggle.com/awsaf49/semantic-drone-dataset. We removed and added files and information as needed for our research. We created TIFF files with a resolution of 1200x800 pixels and 24 channels, where each channel represents a class preprocessed from the PNG label files (a sketch of this expansion follows the file listing below). We reduced the resolution and compressed the TIFF files with the tifffile Python library.
If you have any problems with the modified TIFF dataset, you can contact nunenuh@gmail.com or gaungalif@gmail.com.
This dataset is a copy of the original dataset (link below), with some improvements to the semantic data and classes. Semantic data is available in PNG and TIFF format at a smaller size as needed.
The images are labelled densely using polygons and contain the following 24 classes:
unlabeled, paved-area, dirt, grass, gravel, water, rocks, pool, vegetation, roof, wall, window, door, fence, fence-pole, person, dog, car, bicycle, tree, bald-tree, ar-marker, obstacle, conflicting
> images
> labels/png
> labels/tiff
- class_to_idx.json
- classes.csv
- classes.json
- idx_to_class.json
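A minimal sketch of how a PNG label could be expanded into a 24-channel one-hot mask like the TIFF labels described above, assuming class_to_idx.json maps class names to integer indices and that the PNG stores one class index per pixel; the file names and layout are assumptions.

```python
import json
import numpy as np
from PIL import Image

# Map class names to channel indices (assumed structure of class_to_idx.json).
with open("class_to_idx.json") as f:
    class_to_idx = json.load(f)
num_classes = len(class_to_idx)   # the 24 classes listed above

# Load a PNG label where each pixel stores a class index (hypothetical file name).
label = np.array(Image.open("labels/png/000.png"))

# Expand to a (num_classes, H, W) one-hot mask, one channel per class.
one_hot = np.stack([(label == idx).astype(np.uint8) for idx in range(num_classes)])
print(one_hot.shape)
```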
aerial@icg.tugraz.at
If you use this dataset in your research, please cite the following URL: www.dronedataset.icg.tugraz.at
The Drone Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:
That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (Graz University of Technology) do not accept any responsibility for errors or omissions. That you include a reference to the Semantic Drone Dataset in any work that makes use of the dataset. For research papers or other media link to the Semantic Drone Dataset webpage.
That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works in as far as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow to recover the dataset or something similar in character. That you may not use the dataset or any derivative work for commercial purposes as, for example, licensing or selling the data, or using the data with a purpose to procure a commercial gain. That all rights not expressly granted to you are reserved by us (Graz University of Technology).
The objective of this competition is to create a model to predict the number of retweets a tweet will get on Twitter. The data used to train the model will be approximately 2,400 tweets each from 38 major banks and mobile network operators across Africa.
A machine learning model to predict retweets would be valuable to any business that uses social media to share important information and messages to the public. This model can be used as a tool to help businesses better tailor their tweets to ensure maximum impact and outreach to clients and non-clients.
The data has been split into a test and training set.
train.json (zipped) is the dataset that you will use to train your model. This dataset includes about 2,400 consecutive tweets from each of the companies listed below, for a total of 96,562 tweets.
test_questions.json (zipped) is the dataset to which you will apply your model to test how well it performs. Use your model and this dataset to predict the number of retweets each tweet will receive. The test set consists of the consecutive tweets that followed the tweets provided in the training set. There are a maximum of 800 tweets per company in this test set. This dataset includes the same fields as train.json except for the retweet_count and favorite_count variables.
sample_submission.csv is a table to provide an example of what your submission file should look like.
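A minimal sketch for inspecting the training target, assuming train.json unzips to a JSON array of tweet objects; field names other than retweet_count are not assumed here.

```python
import json
import statistics

# Load the training tweets (assuming a JSON array; adjust if the file is
# line-delimited) and look at the retweet_count target.
with open("train.json") as f:
    tweets = json.load(f)

retweets = [t["retweet_count"] for t in tweets]
print("tweets:", len(retweets))
print("median retweets:", statistics.median(retweets))

# A trivial baseline would predict the median retweet count for every test tweet.
```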