12 datasets found
  1. Human Instructions - Multilingual (wikiHow)

    • kaggle.com
    Updated Mar 17, 2017
    Cite
    Paolo Pareti (2017). Human Instructions - Multilingual (wikiHow) [Dataset]. https://www.kaggle.com/paolop/human-instructions-multilingual-wikihow/tasks
    Dataset provided by
    Kaggle
    Authors
    Paolo Pareti
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Human Instructions Dataset - Multilingual

    800K formalised step-by-step instructions in 16 different languages from wikiHow

    Updated JSON files for English at this other Kaggle repository

    Available in 16 Different Languages Extracted from wikiHow

    Overview

    Step-by-step instructions have been extracted from wikiHow in 16 different languages and decomposed into a formal graph representation like the one shown in the picture below. The source pages from which the instructions were extracted have also been collected and can be shared upon request.

    Instructions are represented in RDF following the PROHOW vocabulary and data model. For example, the category, steps, requirements and methods of each set of instructions have been extracted.

    This dataset has been produced as part of The Web of Know-How project.

    • To cite this dataset use: Paula Chocron, Paolo Pareti. Vocabulary Alignment for Collaborative Agents: a Study with Real-World Multilingual How-to Instructions. (PDF) (bibtex)

    Quick-Start: Instruction Extractor and Simplifier Script

    The large amount of data can make this dataset difficult to work with. For this reason, an instruction-extraction Python script was developed. The script allows you to:

    • select only the subset of the dataset you are interested in: for example, only instructions from specific wikiHow pages, instructions that fall within specific categories (such as cooking recipes), or those that have at least 5 steps. The file class_hierarchy.ttl attached to this dataset is used to determine whether a set of instructions falls under a certain category or not.
    • simplify the data model of the instructions. The current data model is rich in semantic relations, but this richness can make it complex to use. The script lets you simplify the data model so the data is easier to work with. An example graphical representation of the simplified model is available here.

    The script is available on this GitHub repository.

    The Available Languages

    This page contains the link to the different language versions of the data.

    A previous version of this type of data, although for English only, is also available on Kaggle:

    For the multilingual dataset, this is the list of the available languages and number of articles in each:

    Querying the Dataset

    The dataset is in RDF and can be queried with SPARQL. Sample SPARQL queries are available on this GitHub page.

    For example, [this SPARQL query](http://dydra.com/paolo-pareti/wikihow_multilingual/query?query=PREFIX%20prohow%3A%20%3Chttp%3A%2F%2Fw3id.org%2Fprohow%23%3E%20%0APREFIX%20rdf%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%20%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%20%0APREFIX%20owl%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E%20%0APREFIX%20oa%3A%20%3Chttp%3A%2F%2Fw...
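
    You can also query a downloaded dump locally. Below is a minimal Python sketch using rdflib; the file name is hypothetical and the prohow:has_step property is assumed from the PROHOW vocabulary description above, so check the dataset files for the exact names.

    from rdflib import Graph

    # Load one language dump; the file name is hypothetical.
    g = Graph()
    g.parse("wikihow_instructions_en.ttl", format="turtle")

    # Count sets of instructions that declare at least one step.
    query = """
    PREFIX prohow: <http://w3id.org/prohow#>
    SELECT (COUNT(DISTINCT ?task) AS ?tasks)
    WHERE { ?task prohow:has_step ?step . }
    """
    for row in g.query(query):
        print(row.tasks)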

  2. Pinterest Fashion Compatibility Dataset

    • kaggle.com
    Updated Oct 30, 2023
    Cite
    Ahmad (2023). Pinterest Fashion Compatibility Dataset [Dataset]. https://www.kaggle.com/datasets/pypiahmad/shop-the-look-dataset
    Dataset provided by
    Kaggle
    Authors
    Ahmad
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Pinterest Fashion Compatibility dataset comprises images showcasing fashion products, each annotated with bounding boxes and associated with links directing to the corresponding products. This dataset facilitates the exploration of scene-based complementary product recommendation, aiming to complete the look presented in each scene by recommending compatible fashion items.

    Basic Statistics:
    • Scenes: 47,739
    • Products: 38,111
    • Scene-Product Pairs: 93,274

    Metadata:
    • Product IDs: identifiers for the products featured in the images.
    • Bounding Boxes: coordinates specifying the location of each product within the image.

    Example (fashion.json): each entry associates a product with a scene, along with the bounding box coordinates for the product within the scene.

    {
     "product": "0027e30879ce3d87f82f699f148bff7e",
     "scene": "cdab9160072dd1800038227960ff6467",
     "bbox": [0.434097, 0.859363, 0.560254, 1.0]
    }
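
    To use a bounding box on an actual image, the normalized coordinates have to be scaled to pixel units. A minimal sketch follows, assuming the file is JSON-lines and that bbox holds [x_min, y_min, x_max, y_max] normalized to [0, 1]; both are assumptions inferred from the example above, not a documented spec.

    import json

    def bbox_to_pixels(bbox, width, height):
        # Assumed order: [x_min, y_min, x_max, y_max], normalized to [0, 1].
        x_min, y_min, x_max, y_max = bbox
        return (int(x_min * width), int(y_min * height),
                int(x_max * width), int(y_max * height))

    with open("fashion.json") as f:
        for line in f:
            pair = json.loads(line)
            print(pair["product"], bbox_to_pixels(pair["bbox"], 800, 600))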

    Citation: if you utilize this dataset, please cite the following paper:
    Complete the Look: Scene-based complementary product recommendation. Wang-Cheng Kang, Eric Kim, Jure Leskovec, Charles Rosenberg, Julian McAuley. CVPR, 2019. Link to paper

    Code and Additional Resources: For additional resources, sample code, and instructions on how to collect the product images from Pinterest, you can visit the GitHub repository.

    This dataset provides a rich ground for research and development in the domain of fashion-based image recognition, product recommendation, and the exploration of fashion styles and trends through machine learning and computer vision techniques.

  3. Mental Health Conversational Data

    • kaggle.com
    Updated Oct 31, 2022
    Cite
    elvis (2022). Mental Health Conversational Data [Dataset]. https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data
    Dataset provided by
    Kaggle
    Authors
    elvis
    Description

    A dataset containing basic conversations, mental health FAQ, classical therapy conversations, and general advice provided to people suffering from anxiety and depression.

    This dataset can be used to train a model for a chatbot that can behave like a therapist in order to provide emotional support to people with anxiety & depression.

    The dataset contains intents. An “intent” is the intention behind a user's message. For instance, if I were to say “I am sad” to the chatbot, the intent would be “sad”. Each intent has a set of Patterns and Responses: Patterns are examples of user messages that align with the intent, while Responses are the replies the chatbot provides for that intent. Various intents are defined, and their patterns and responses are used as the model's training data to identify a particular intent.
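
    A minimal sketch of how such an intents file can be consumed, assuming a layout with "intents", "patterns" and "responses" keys; the file name and key names are assumptions, so check the dataset files for the exact schema:

    import json
    import random

    with open("intents.json") as f:  # hypothetical file name
        intents = json.load(f)["intents"]

    def respond(message):
        # Naive substring matching; a real chatbot would train a
        # classifier on the patterns instead.
        for intent in intents:
            if any(p.lower() in message.lower() for p in intent["patterns"]):
                return random.choice(intent["responses"])
        return "I'm not sure I understand."

    print(respond("I am sad"))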

  4. Sec Financial Statement Data in Json

    • kaggle.com
    Updated Apr 13, 2025
    Cite
    Angular2guy (2025). Sec Financial Statement Data in Json [Dataset]. https://www.kaggle.com/datasets/wbqrmgmcia7lhhq/sec-financial-statement-data-in-json/versions/13
    Dataset provided by
    Kaggle
    Authors
    Angular2guy
    License

    https://www.usa.gov/government-works/

    Description

    Data from 2010 Q1 to 2025 Q1

    The data is created with this Jupyter Notebook:

    The data format is documented in the Readme. The SEC data documentation can be found here.

    Json structure:

    {
     "quarter": "Q1",
     "country": "Italy",
     "data": {
      "cf": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}],
      "bs": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}],
      "ic": [{"value": 0, "concept": "A", "unit": "USD", "label": "B", "info": "C"}]
     },
     "year": 0,
     "name": "B",
     "startDate": "2009-12-31",
     "endDate": "2010-12-30",
     "symbol": "GM",
     "city": "York"
    }

    An example Json:

    {
     "year": 2023,
     "data": {
      "cf": [{"value": -1834000000, "concept": "NetCashProvidedByUsedInFinancingActivities", "unit": "USD", "label": "Amount of cash inflow (outflow) from financing … Amount of cash inflow (outflow) from financing …", "info": "Net cash used in financing activities"}],
      "ic": [{"value": 1000000, "concept": "IncreaseDecreaseInDueFromRelatedParties", "unit": "USD", "label": "The increase (decrease) during the reporting pe… The increase (decrease) during the reporting pe…", "info": "Receivables from related parties"}],
      "bs": [{"value": 2779000000, "concept": "AccountsPayableCurrent", "unit": "USD", "label": "Carrying value as of the balance sheet date of … Carrying value as of the balance sheet date of …", "info": "Accounts payable"}]
     },
     "quarter": "Q2",
     "city": "SANTA CLARA",
     "startDate": "2023-06-30",
     "name": "ADVANCED MICRO DEVICES INC",
     "endDate": "2023-09-29",
     "country": "US",
     "symbol": "AMD"
    }
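
    A minimal sketch for walking one report with the structure shown above; the file name is hypothetical, and whether reports are stored one per file or as an array should be checked against the Readme:

    import json

    with open("amd_2023_q2.json") as f:  # hypothetical file name
        report = json.load(f)

    print(report["name"], report["year"], report["quarter"])
    # "cf" = cash flow, "bs" = balance sheet, "ic" = income statement.
    for section in ("cf", "bs", "ic"):
        for item in report["data"][section]:
            print(f'{section}: {item["concept"]} = {item["value"]:,} {item["unit"]}')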

  5. K-Fold checkpoints

    • kaggle.com
    Updated Aug 15, 2020
    Cite
    Vadim Timakin (2020). K-Fold checkpoints [Dataset]. https://www.kaggle.com/vadimtimakin/kfold-checkpoints/code
    Dataset provided by
    Kaggle
    Authors
    Vadim Timakin
    Description

    Context

    A set of annotation files used for training EfficientNet with k-fold cross-validation across many sessions, bypassing runtime limits.

    Content

    Here is an example using 5 folds. Each fold has its own annotation files defining the train (.json), validation (.json), and test (.npy) datasets. Images: 512x512, TFRecords, plus additional data from ISIC 2018 and ISIC 2019.

    Acknowledgements

    @cdeotte

    Inspiration

    SIIM-ISIC Melanoma Classification

  6. Extended Wikipedia Multimodal Dataset

    • kaggle.com
    Updated Apr 4, 2020
    Cite
    Oleh Onyshchak (2020). Extended Wikipedia Multimodal Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1058023
    Dataset provided by
    Kaggle
    Authors
    Oleh Onyshchak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia Featured Articles multimodal dataset

    Overview

    • This is a multimodal dataset of featured articles containing 5,638 articles and 57,454 images.
    • A superset built from good articles is also hosted on Kaggle. It has six times more entries, although of somewhat lower quality.

    It contains the text of each article as well as all the images from that article, along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are only a small subset of all available articles, because they are manually reviewed and protected from edits. Thus they represent the best quality human editors on Wikipedia can offer.

    You can find more details in "Image Recommendation for Wikipedia Articles" thesis.

    Dataset structure

    The high-level structure of the dataset is as follows:

    .
    +-- page1 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    +-- page2 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    : 
    +-- pageN 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    
    label      description
    pageN      the title of the N-th Wikipedia page; contains all information about the page
    text.json  text of the page saved as JSON. Please refer to the details of the JSON schema below.
    meta.json  a collection of all images of the page. Please refer to the details of the JSON schema below.
    imageN     the N-th image of the article, saved in JPG format with the width of each image set to 600px. The name of the image is the md5 hash of the original image title.

    text.JSON Schema

    Below you see an example of how data is stored:

    {
     "title": "Naval Battle of Guadalcanal",
     "id": 405411,
     "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
     "html": "...",
     "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ..."
    }

    key       description
    title     page title
    id        unique page id
    url       URL of the page on Wikipedia
    html      HTML content of the article
    wikitext  wikitext content of the article

    Please note that the @html and @wikitext properties represent the same information in different formats, so just choose the one that is easier to parse in your circumstances.

    meta.JSON Schema

    {
     "img_meta": [
      {
       "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
       "title": "IronbottomSound.jpg",
       "parsed_title": "ironbottom sound",
       "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
       "is_icon": false,
       "on_commons": true,
       "description": "A U.S. destroyer steams up what later became known as ...",
       "caption": "Ironbottom Sound. The majority of the warship surface ...",
       "headings": ["Naval Battle of Guadalcanal", "First Naval Battle of Guadalcanal", ...],
       "features": ["4.8618264", "0.49436468", "7.0841103", "2.7377882", "2.1305492", ...]
      },
      ...
     ]
    }
    
    key           description
    filename      unique image id; md5 hash of the original image title
    title         image title retrieved from Commons, if applicable
    parsed_title  image title split into words, e.g. "helloWorld.jpg" -> "hello world"
    url           URL of the image on Wikipedia
    is_icon       true if the image is an icon, e.g. a category icon. We assume an image is an icon if no preview loads on Wikipedia after clicking on it
    on_commons    true if the image is available from the Wikimedia Commons dataset
    description   description of the image parsed from its Wikimedia Commons page, if available
    caption       caption of the image parsed from the Wikipedia article, if available
    headings      list of all nested headings of the location where the image is placed in the article; the first element is the top-most heading
    features      output of the 5th convolutional layer of ResNet152 trained on ImageNet. That output of shape (19, 24, 2048) is max-pooled to shape (2048,). Features are taken from the original images downloaded in JPEG format with a fixed width of 600px. Practically, it is a list of floats with len = 2048
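
    A minimal sketch of walking the directory layout described above, loading each page's text.json and meta.json; the root directory name is an assumption:

    import json
    from pathlib import Path

    root = Path("wikipedia_featured")  # hypothetical root directory
    for page_dir in root.iterdir():
        if not page_dir.is_dir():
            continue
        text = json.loads((page_dir / "text.json").read_text())
        meta = json.loads((page_dir / "img" / "meta.json").read_text())
        # Keep only non-icon images that have a caption.
        captioned = [m for m in meta["img_meta"]
                     if not m["is_icon"] and m.get("caption")]
        print(text["title"], len(captioned))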

    Collection method

    Data was collected by fetching featured articles' text and image content with the pywikibot library and then parsing out additional metadata from the HTML pages on Wikipedia and Commons.

  7. T5_base_pytorch

    • kaggle.com
    Updated Apr 19, 2020
    Cite
    Maitreya Patel (2020). T5_base_pytorch [Dataset]. https://www.kaggle.com/maitreyajp/t5basepytorch/code
    Dataset provided by
    Kaggle
    Authors
    Maitreya Patel
    Description

    Context

    This dataset provides the model, config, and spiece files of T5-base for PyTorch. These can be used to load the pre-trained model and a modified SentencePiece tokenizer.

    Content

    config.json - model configuration
    pytorch_model.bin - pre-trained model
    spiece.model - vocabulary

    Here, the spiece.model file can be used as a separate tokenizer. For example, in the https://www.kaggle.com/c/tweet-sentiment-extraction competition, if one needs to get token offsets, one will not be able to use the Hugging Face built-in tokenizer directly. Instead, one can use it as described in https://www.kaggle.com/abhishek/sentencepiece-tokenizer-with-offsets.
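
    A minimal sketch of loading these files, assuming all three sit in one local directory; the transformers and sentencepiece calls are standard, but the directory name is an assumption:

    from transformers import T5ForConditionalGeneration
    import sentencepiece as spm

    # Load the pre-trained model from config.json + pytorch_model.bin.
    model = T5ForConditionalGeneration.from_pretrained("t5_base_pytorch")

    # Use spiece.model directly when you need control over tokenization.
    sp = spm.SentencePieceProcessor()
    sp.load("t5_base_pytorch/spiece.model")
    print(sp.encode_as_pieces("tweet sentiment extraction"))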

    Acknowledgements

    All files are taken from Hugging Face or generated using it. Also, @abhishek, thank you so much for sharing such useful information.

  8. Goodreads Book Reviews

    • kaggle.com
    Updated Oct 30, 2023
    Cite
    Ahmad (2023). Goodreads Book Reviews [Dataset]. https://www.kaggle.com/datasets/pypiahmad/goodreads-book-reviews1/code
    Dataset provided by
    Kaggle
    Authors
    Ahmad
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Goodreads Book Reviews dataset encapsulates a wealth of reviews and various attributes concerning the books listed on the Goodreads platform. A distinguishing feature of this dataset is its capture of multiple tiers of user interaction, ranging from adding a book to a "shelf", to rating and reading it. This dataset is a treasure trove for those interested in understanding user behavior, book recommendations, sentiment analysis, and the interplay between various attributes of books and user interactions.

    Basic Statistics:
    • Items: 1,561,465
    • Users: 808,749
    • Interactions: 225,394,930

    Metadata:
    • Reviews: the text of the reviews provided by users.
    • Add-to-shelf, Read, Review Actions: various interactions users have with the books.
    • Book Attributes: attributes describing the books, including title and ISBN.
    • Graph of Similar Books: a graph depicting similarity relations between books.

    Example (interaction data):

    {
     "user_id": "8842281e1d1347389f2ab93d60773d4d",
     "book_id": "130580",
     "review_id": "330f9c153c8d3347eb914c06b89c94da",
     "isRead": true,
     "rating": 4,
     "date_added": "Mon Aug 01 13:41:57 -0700 2011",
     "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
     "read_at": "Fri Jan 01 00:00:00 -0800 1988",
     "started_at": ""
    }
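
    Given the interaction counts above, streaming the file rather than loading it whole is advisable. A minimal sketch, assuming the interactions are gzipped JSON-lines as the example suggests; the file name is an assumption:

    import gzip
    import json

    n_read, n_total = 0, 0
    with gzip.open("goodreads_interactions.json.gz", "rt") as f:
        for line in f:
            rec = json.loads(line)
            n_total += 1
            n_read += rec.get("isRead", False)
            if n_total == 100_000:  # sample a prefix; the full file is large
                break
    print(f"{n_read}/{n_total} sampled interactions marked as read")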

    Use Cases:
    • Book Recommendations: creating personalized book recommendations based on user interactions and preferences.
    • Sentiment Analysis: analyzing sentiment in reviews and understanding how different book attributes influence sentiment.
    • User Behavior Analysis: understanding user interaction patterns with books and deriving insights to enhance user engagement.
    • Natural Language Processing: training models to process and analyze user-generated text in reviews.
    • Similarity Analysis: analyzing the graph of similar books to understand book similarities and clustering.

    Citation: please cite the following if you use the data:
    Item recommendation on monotonic behavior chains. Mengting Wan, Julian McAuley. RecSys, 2018. [PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18e.pdf)

    Code Samples: a curated set of code samples is provided in the dataset's GitHub repository, aiding in seamless interaction with the datasets. These include:
    • Downloading datasets without GUI: facilitating dataset download in a non-GUI environment.
    • Displaying Sample Records: showcasing sample records to get a glimpse of the dataset structure.
    • Calculating Basic Statistics: computing basic statistics to understand the dataset's distribution and characteristics.
    • Exploring the Interaction Data: delving into interaction data to grasp user-book interaction patterns.
    • Exploring the Review Data: analyzing review data to extract valuable insights from user reviews.

    Additional Dataset:
    • Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): a comprehensive collection of reviews with a multilingual facet, covering around 2 million books from 465,000 users.

    Datasets:

    Meta-Data of Books:

    • Detailed Book Graph (goodreads_books.json.gz): A comprehensive graph detailing around 2.3 million books, acting as a rich source of book attributes and metadata.
    • Detailed Information of Authors (goodreads_book_authors.json.gz):
      • An extensive dataset containing detailed information about book authors, essential for understanding author-centric trends and insights.
      • Download Link
    • Detailed Information of Works (goodreads_book_works.json.gz):
      • This dataset provides abstract information about a book disregarding any particular editions, facilitating a high-level understanding of each work.
      • Download Link
    • Detailed Information of Book Series (goodreads_book_series.json.gz):
      • A dataset encompassing detailed information about book series, aiding in understanding series-related trends and insights. Note that the series id included here cannot be used for URL hack.
      • Download Link
    • Extracted Fuzzy Book Genres (goodreads_book_genres_initial.json....
  9. indian-railway-dataset

    • kaggle.com
    Updated Nov 26, 2024
    Cite
    flugeltomar (2024). indian-railway-dataset [Dataset]. https://www.kaggle.com/datasets/flugeltomar/indian-railway-dataset/discussion
    Dataset provided by
    Kaggle
    Authors
    flugeltomar
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains a list of all railway stations in India, including station codes, station names, and region codes. It is a comprehensive resource useful for various applications like transportation analytics, geographic studies, mapping, and machine learning projects.

    Dataset Structure

    The dataset is provided in JSON format as an array of objects, where each object represents a railway station.

    Example JSON Format:

    [
     {"station_code": "A", "station_name": "ARMENIAN GHAT CITY", "region_code": "SE"},
     {"station_code": "AA", "station_name": "ATARIA", "region_code": "NE"},
     {"station_code": "AABH", "station_name": "AMBIKA BHAWALI HALT", "region_code": "EC"}
    ]

    Keys in the JSON:
    • station_code: a unique identifier for each railway station (e.g., "A", "AA", "AABH").
    • station_name: the name of the railway station (e.g., "ARMENIAN GHAT CITY", "ATARIA").
    • region_code: the region to which the station belongs (e.g., "SE", "NE", "EC").

    import json

    # Load the JSON dataset
    with open('railway_stations_india.json', 'r') as file:
        data = json.load(file)

    # Display the first 5 stations
    for station in data[:5]:
        print(f"Station Code: {station['station_code']}, "
              f"Station Name: {station['station_name']}, "
              f"Region Code: {station['region_code']}")

  10. Predict Molecular Properties

    • kaggle.com
    Updated Aug 14, 2017
    Cite
    BurakH (2017). Predict Molecular Properties [Dataset]. https://www.kaggle.com/datasets/burakhmmtgl/predict-molecular-properties
    Dataset provided by
    Kaggle
    Authors
    BurakH
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Predict molecular properties

    Context

    This dataset contains molecular properties scraped from the PubChem database. Each file contains properties for thousands of molecules, made up of the elements H, C, N, O, F, Si, P, S, Cl, Br, and I. The dataset is related to a previous one which had fewer molecules and came with preconstructed features.

    This dataset, by contrast, is a challenging case for feature engineering and is the subject of active research (see the references below).

    Data Description

    The utilities used to download and process the data can be accessed from my GitHub repo.

    Each JSON file contains a list of molecular data. An example molecule is given below:

    {
     'En': 37.801,
     'atoms': [
      {'type': 'O', 'xyz': [0.3387, 0.9262, 0.46]},
      {'type': 'O', 'xyz': [3.4786, -1.7069, -0.3119]},
      {'type': 'O', 'xyz': [1.8428, -1.4073, 1.2523]},
      {'type': 'O', 'xyz': [0.4166, 2.5213, -1.2091]},
      {'type': 'N', 'xyz': [-2.2359, -0.7251, 0.027]},
      {'type': 'C', 'xyz': [-0.7783, -1.1579, 0.0914]},
      {'type': 'C', 'xyz': [0.1368, -0.0961, -0.5161]},
      {'type': 'C', 'xyz': [-3.1119, -1.7972, 0.659]},
      {'type': 'C', 'xyz': [-2.4103, 0.5837, 0.784]},
      {'type': 'C', 'xyz': [-2.6433, -0.5289, -1.426]},
      {'type': 'C', 'xyz': [1.4879, -0.6438, -0.9795]},
      {'type': 'C', 'xyz': [2.3478, -1.3163, 0.1002]},
      {'type': 'C', 'xyz': [0.4627, 2.1935, -0.0312]},
      {'type': 'C', 'xyz': [0.6678, 3.1549, 1.1001]},
      {'type': 'H', 'xyz': [-0.7073, -2.1051, -0.4563]},
      {'type': 'H', 'xyz': [-0.5669, -1.3392, 1.1503]},
      {'type': 'H', 'xyz': [-0.3089, 0.3239, -1.4193]},
      {'type': 'H', 'xyz': [-2.9705, -2.7295, 0.1044]},
      {'type': 'H', 'xyz': [-2.8083, -1.921, 1.7028]},
      {'type': 'H', 'xyz': [-4.1563, -1.4762, 0.6031]},
      {'type': 'H', 'xyz': [-2.0398, 1.417, 0.1863]},
      {'type': 'H', 'xyz': [-3.4837, 0.7378, 0.9384]},
      {'type': 'H', 'xyz': [-1.9129, 0.5071, 1.7551]},
      {'type': 'H', 'xyz': [-2.245, 0.4089, -1.819]},
      {'type': 'H', 'xyz': [-2.3, -1.3879, -2.01]},
      {'type': 'H', 'xyz': [-3.7365, -0.4723, -1.463]},
      {'type': 'H', 'xyz': [1.3299, -1.3744, -1.7823]},
      {'type': 'H', 'xyz': [2.09, 0.1756, -1.3923]},
      {'type': 'H', 'xyz': [-0.1953, 3.128, 1.7699]},
      {'type': 'H', 'xyz': [0.7681, 4.1684, 0.7012]},
      {'type': 'H', 'xyz': [1.5832, 2.901, 1.6404]}
     ],
     'id': 1,
     'shapeM': [259.66, 4.28, 3.04, 1.21, 1.75, 2.55, 0.16, -3.13, -0.22, -2.18, -0.56, 0.21, 0.17, 0.09]
    }

    1. En: the molecular energy calculated using a force-field method; see references [1,2] for details. This is the target variable to be predicted.
    2. atoms: the element type and position (x, y, z coordinates) of each atom; this field is the input for feature engineering.
    3. id: the PubChem Id.
    4. shapeM: the shape multipoles, which can also be used for feature engineering. For the definition of shape multipoles, see reference [3].

    Notice that each molecule contains a different number and different types of atoms, so it is challenging to come up with features that describe every molecule in a unique way. There are several approaches in the literature (see the references), one of which is to use the Coulomb matrix of a given molecule, defined by

    \[ C_{IJ} = \frac{Z_I Z_J}{\lvert R_I - R_J \rvert} \quad (I \neq J), \qquad C_{II} = Z_I^{2.4} \]

    where $Z_I$ are atomic numbers (which can be looked up from the periodic table for each element) and $\lvert R_I - R_J \rvert$ is the distance between atoms I and J. The previous dataset used these features for a subset of the molecules given here, where the maximum number of atoms in a given molecule was limited to 50.
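
    A minimal sketch of building this Coulomb matrix from the 'atoms' field above; the atomic-number table covers only the elements listed in this dataset's description, and the diagonal follows the formula exactly as given here:

    import numpy as np

    Z = {'H': 1, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Si': 14, 'P': 15,
         'S': 16, 'Cl': 17, 'Br': 35, 'I': 53}

    def coulomb_matrix(atoms):
        n = len(atoms)
        z = np.array([Z[a['type']] for a in atoms], dtype=float)
        xyz = np.array([a['xyz'] for a in atoms])
        C = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                if i == j:
                    C[i, j] = z[i] ** 2.4  # diagonal term
                else:
                    # off-diagonal: Z_I * Z_J / |R_I - R_J|
                    C[i, j] = z[i] * z[j] / np.linalg.norm(xyz[i] - xyz[j])
        return C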

    There are around 100,000,000 molecules in the whole database. As more files are scraped, new data will be added in time.

    Note: In the previous dataset, the molecular energies were computed by quantum mechanical simulations. Here, the given energies are computed using another method, so their values are different.

    Inspiration

    Simulations of molecular properties are computationally expensive. The purpose of this project is to use machine learning methods to come up with a model that can predict molecular properties from a database. In the PubChem database, there are around 100,000,000 molecules. It could take years to do simulations on all of these molecules, however machine learning can be used to predict their properties much faster. As a result, this could open up many possibilities in computational design and discovery of molecules, compounds and new drugs.

    This is a regression problem...

  11. Aerial Semantic Drone Dataset

    • kaggle.com
    Updated May 25, 2021
    Cite
    Lalu Erfandi Maula Yusnu (2021). Aerial Semantic Drone Dataset [Dataset]. https://www.kaggle.com/nunenuh/semantic-drone/discussion
    Dataset provided by
    Kaggle
    Authors
    Lalu Erfandi Maula Yusnu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Aerial Semantic Drone Dataset

    The Semantic Drone Dataset focuses on semantic understanding of urban scenes for increasing the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from nadir (bird's eye) view acquired at an altitude of 5 to 30 meters above the ground. A high-resolution camera was used to acquire images at a size of 6000x4000px (24Mpx). The training set contains 400 publicly available images and the test set is made up of 200 private images.

    This dataset is taken from https://www.kaggle.com/awsaf49/semantic-drone-dataset. We removed and added files and information that we needed for our research purposes. We created our TIFF files at a resolution of 1200x800 pixels with 24 channels, each channel representing one class, preprocessed from the PNG label files. We reduced the resolution and compressed the TIFF files with the tifffile Python library.
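
    A minimal sketch of reading one of these 24-channel TIFF labels and collapsing it to a single class-index map; the file name and the channel ordering are assumptions:

    import numpy as np
    import tifffile

    mask = tifffile.imread("labels/tiff/000.tif")  # hypothetical file name
    print(mask.shape)  # expected (24, 800, 1200) or (800, 1200, 24)

    # Collapse the per-class channels into one class-index map.
    channel_axis = 0 if mask.shape[0] == 24 else -1
    class_map = np.argmax(mask, axis=channel_axis)
    print(np.unique(class_map))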

    If you have any problems with the TIFF dataset that we have modified, you can contact nunenuh@gmail.com or gaungalif@gmail.com.

    This dataset is a copy of the original dataset (link below); we provide some improvements to the semantic data and classes. Semantic data is available in PNG and TIFF format at a smaller size, as needed.

    Semantic Annotation

    The images are labelled densely using polygons and contain the following 24 classes:

    unlabeled, paved-area, dirt, grass, gravel, water, rocks, pool, vegetation, roof, wall, window, door, fence, fence-pole, person, dog, car, bicycle, tree, bald-tree, ar-marker, obstacle, conflicting

    Directory Structure and Files

    > images
    > labels/png
    > labels/tiff
     - class_to_idx.json
     - classes.csv
     - classes.json
     - idx_to_class.json
    

    Included Data

    • 400 training images in jpg format can be found in "aerial_semantic_drone/images"
    • Dense semantic annotations in png format can be found in "aerial_semantic_drone/labels/png"
    • Dense semantic annotations in tiff format can be found in "aerial_semantic_drone/labels/tiff"
    • Semantic class definition in csv format can be found in "aerial_semantic_drone/classes.csv"
    • Semantic class definition in json format can be found in "aerial_semantic_drone/classes.json"
    • Index-to-class-name file can be found in "aerial_semantic_drone/idx_to_class.json"
    • Class-name-to-index file can be found in "aerial_semantic_drone/class_to_idx.json"

    Contact

    aerial@icg.tugraz.at

    Citation

    If you use this dataset in your research, please cite the following URL: www.dronedataset.icg.tugraz.at

    License

    The Drone Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:

    • The dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (Graz University of Technology) do not accept any responsibility for errors or omissions.
    • You must include a reference to the Semantic Drone Dataset in any work that makes use of the dataset. For research papers or other media, link to the Semantic Drone Dataset webpage.
    • You may not distribute this dataset or modified versions. It is permissible to distribute derivative works insofar as they are abstract representations of this dataset (such as models trained on it, or additional annotations that do not directly include any of our data) and do not allow the dataset or something similar in character to be recovered.
    • You may not use the dataset or any derivative work for commercial purposes, for example, licensing or selling the data, or using the data with the purpose of procuring a commercial gain.
    • All rights not expressly granted to you are reserved by us (Graz University of Technology).

  12. Social Media Prediction Challenge

    • kaggle.com
    Updated May 19, 2023
    Cite
    Gaurav Dutta (2023). Social Media Prediction Challenge [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/social-media-prediction-challenge
    Dataset provided by
    Kaggle
    Authors
    Gaurav Dutta
    Description

    The objective of this competition is to create a model to predict the number of retweets a tweet will get on Twitter. The data used to train the model will be approximately 2,400 tweets each from 38 major banks and mobile network operators across Africa.

    A machine learning model to predict retweets would be valuable to any business that uses social media to share important information and messages to the public. This model can be used as a tool to help businesses better tailor their tweets to ensure maximum impact and outreach to clients and non-clients.

    The data has been split into a test and training set.

    train.json (zipped) is the dataset that you will use to train your model. This dataset includes about 2,400 consecutive tweets from each of the companies listed below, for a total of 96,562 tweets.

    test_questions.json (zipped) is the dataset to which you will apply your model to test how well it performs. Use your model and this dataset to predict the number of retweets a tweet will receive. The test set consists of the consecutive tweets that followed the first tweets provided in the training set. There are at most 800 tweets per company in this test set. This dataset includes the same fields as train.json except for the retweet_count and favorite_count variables.

    sample_submission.csv is a table to provide an example of what your submission file should look like.
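
    A minimal sketch of a first look at the training data, assuming train.json has been unzipped and parses into one row per tweet; apart from retweet_count, which the description above names, the schema is an assumption:

    import pandas as pd

    train = pd.read_json("train.json")
    print(train.shape)  # expect roughly 96,562 rows
    print(train["retweet_count"].describe())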

