This dataset was created by NaomiDing
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by pratap surwase
Released under Apache 2.0
This dataset was created by GodGod3
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Charlie Craine
Released under CC0: Public Domain
This dataset was created by Uttam Mittal
This dataset was created by D. Forester
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Omar Yasser Salah El-Din
Released under Apache 2.0
Utility scripts to compute the KL divergence metric in Kaggle's HMS - Harmful Brain Activity Classification competition. The code came from Kaggle. For an example of how to use it, see one of my starter notebooks:
Starter notebooks:
* EfficientNetB2 starter here
* CatBoost starter here
* WaveNet starter here
* MLP starter here
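The notebooks above show the script in real use. Purely as an independent sketch of what the KL divergence metric computes (this is not the Kaggle code, and VOTE_COLS is a hypothetical list of the target columns):

import numpy as np

def kl_divergence(y_true, y_pred, eps=1e-15):
    """Mean row-wise KL divergence between true and predicted probability distributions."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0)
    y_true_safe = np.clip(y_true, eps, 1.0)
    # Classes with a true probability of zero contribute nothing to the sum.
    per_class = np.where(y_true > 0, y_true * np.log(y_true_safe / y_pred), 0.0)
    return per_class.sum(axis=1).mean()

# Hypothetical usage with the six vote-share columns used as competition targets:
# score = kl_divergence(solution[VOTE_COLS].values, submission[VOTE_COLS].values)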
Score decreased after adding special tokens. To check the utility script, I removed all the special tokens and trained the model with 50% of the total data to reduce training time. The bug turned out to be that I had applied tokenization before adding the special tokens to the tokenizer. I kept this trained model in case the bug still exists.
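For reference, a minimal Hugging Face sketch of the intended order of operations; the model name and the [NEW_TOK] token are placeholders, not this project's actual code:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Add the special tokens BEFORE tokenizing any text, then resize the embeddings.
tokenizer.add_special_tokens({"additional_special_tokens": ["[NEW_TOK]"]})
model.resize_token_embeddings(len(tokenizer))

# Only now tokenize the training data, so the new tokens are encoded as single IDs.
encoded = tokenizer("some text with [NEW_TOK] inside", return_tensors="pt")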
https://creativecommons.org/publicdomain/zero/1.0/
# val/
# ├── n01440764
# │   ├── ILSVRC2012_val_00000293.JPEG
# │   ├── ILSVRC2012_val_00002138.JPEG
# │   ├── ......
# ├── ......
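A directory laid out like this can be consumed, for example, with torchvision's ImageFolder; a short sketch (the transform values are arbitrary, not part of this dataset):

import torchvision.transforms as T
from torchvision.datasets import ImageFolder

# Each class folder (e.g. n01440764) becomes one label.
transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
val_dataset = ImageFolder("val/", transform=transform)
print(len(val_dataset), val_dataset.classes[:3])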
This dataset consists of modified versions of the Bert scripts by Phil Culliton to be imported into kernels as a single library instead of adding each module as a separate utility script:
For the baseline script, I am using a simplified/organized version of the translated version by dimitreoliveira instead of the original one by the TensorFlow team.
To use in your Kernel:
Add this dataset, and the scripts will be available in the input folder under the TF2Bert subfolder.
import os
os.chdir("/kaggle/input")
from tf2bert import modeling, optimization, tokenization, baseline
os.chdir('/kaggle/working')
As described in the files, the code is licensed under the Apache 2.0 license
All credit is due to https://github.com/qubvel/efficientnet. Thanks so much for your contribution!
Use with this utility script: https://www.kaggle.com/ratthachat/efficientnet/
Add this utility script to your kernel, and you will be able to use all the models just like standard Keras pretrained models. For details see https://www.kaggle.com/c/aptos2019-blindness-detection/discussion/100186
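As a hedged usage sketch, assuming the script exposes the same interface as the upstream qubvel/efficientnet package (the exact import path depends on the script version):

# A minimal sketch -- import path may differ between versions of the utility script.
from efficientnet import EfficientNetB3

model = EfficientNetB3(weights="imagenet", include_top=False, input_shape=(300, 300, 3))
model.summary()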
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Note: This is a work in progress, and not all the Kaggle forums are included in this dataset yet. The remaining forums will be added once I finish solving some issues with the data generators for those forums.
Welcome to the Kaggle Forum Discussions dataset! This dataset contains curated data about recent discussions opened in the different forums on Kaggle. The data is obtained through web scraping with the selenium library, and the text is converted to Markdown style using the markdownify package.
This dataset contains information about the discussion's main topic, topic title, comments, votes, medals and more, and is designed to complement the data available in the Meta Kaggle dataset, specifically for recent discussions. Keep reading for the details.
Because Kaggle is a dynamic website that relies heavily on JavaScript (JS), I extracted the data in this dataset through web scraping with the selenium library.
The functions and classes used to scrape the data on Kaggle are stored in a utility script publicly available here. Since JS-generated pages like Kaggle are unstable when scraped, the script implements capabilities for retrying connections and waiting for elements to appear.
Each forum was scraped with its own notebook, and those notebooks feed a central notebook that generates this dataset. Discussions are also scraped in parallel to improve speed. This dataset represents all the data that can be gathered in a single notebook session, from the most recent discussions to the oldest.
If you need more control over the data you want to research, feel free to import whatever you need from the utility script mentioned above.
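The actual scraping classes live in that utility script; purely as an illustrative sketch of the wait-for-JS pattern described above (the URL and CSS selector here are placeholders, not the real ones):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.kaggle.com/discussions")  # placeholder URL

# Kaggle pages are rendered by JS, so wait for the content to exist before reading it.
wait = WebDriverWait(driver, timeout=20)
items = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "ul li")))  # placeholder selector
print(len(items), "elements found")
driver.quit()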
This dataset contains several folders, each named after the discussion forum it contains data about. For example, the 'competition-hosting' folder contains data about the Competition Hosting forum. Inside each folder you'll find two files: a csv file and a json file.
The json file (in Python, represented as a dictionary) is indexed by the ID that Kaggle assigns to each discussion. Each ID is paired with its corresponding discussion, represented as a nested dictionary (the discussion dict) with the following fields:
- title: The title of the main topic.
- content: Content of the main topic.
- tags: List containing the discussion's tags.
- datetime: Date and time at which the discussion was published (in ISO 8601 format).
- votes: Number of votes received by the discussion.
- medal: Medal awarded to the main topic (if any).
- user: User who published the main topic.
- expertise: Publisher's expertise, measured by the Kaggle progression system.
- n_comments: Total number of comments in the discussion.
- n_appreciation_comments: Total number of appreciation comments in the discussion.
- comments: Dictionary containing data about the comments in the discussion. Each comment is indexed by an ID assigned by Kaggle and contains the following fields:
  - content: Comment's content.
  - is_appreciation: Whether the comment is an appreciation comment.
  - is_deleted: Whether the comment was deleted.
  - n_replies: Number of replies to the comment.
  - datetime: Date and time at which the comment was published (in ISO 8601 format).
  - votes: Number of votes received by the comment.
  - medal: Medal awarded to the comment (if any).
  - user: User who published the comment.
  - expertise: Publisher's expertise, measured by the Kaggle progression system.
  - n_deleted: Total number of deleted replies (including itself).
  - replies: A dict following this same format.
On the other hand, the csv file serves as a summary of the json file, with comment information limited to the hottest and most voted comments.
Note: Only the 'content' field is mandatory for each discussion. The availability of the other fields is subject to the stability of the scraping tasks, which may also affect the update frequency.
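A minimal sketch of reading one forum's json file; the folder path and file discovery below are assumptions based on the layout described above:

import json
from pathlib import Path

forum_dir = Path("/kaggle/input/<this-dataset>/competition-hosting")  # hypothetical path
json_path = next(forum_dir.glob("*.json"))

with open(json_path) as f:
    discussions = json.load(f)

# Fields other than 'content' may be missing, so read them defensively.
for disc_id, disc in list(discussions.items())[:5]:
    print(disc_id, disc.get("title"), disc.get("votes"), len(disc.get("comments", {})))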
This dataset was created from the TensorFlow 2.0 Question Answering primary dataset using this very handy utility script. The main differences from the original one are:
- the structure is flattened to a simple DataFrame
- long_answer_candidates were removed
- only the first annotation is kept for both long and short answers (for short answers this is a reasonable approximation because there are very few samples with multiple short answers)
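This is not the actual utility script, but a rough sketch of the kind of flattening described; the file and field names are assumptions about the original competition format:

import json
import pandas as pd

rows = []
with open("simplified-nq-train.jsonl") as f:  # hypothetical input file name
    for line in f:
        sample = json.loads(line)
        ann = sample["annotations"][0]  # keep only the first annotation
        rows.append({
            "example_id": sample["example_id"],
            "question": sample["question_text"],
            "long_answer_start": ann["long_answer"]["start_token"],
            "long_answer_end": ann["long_answer"]["end_token"],
            "short_answers": ann["short_answers"][:1],
        })

df = pd.DataFrame(rows)  # flat structure, long_answer_candidates dropped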
Thanks xhlulu for providing the utility script.
This dataset contains the utility scripts meant to be used with the BigDataBowl2025 notebook titled "Modeling with Transformers, by SumerSports". The utility scripts included are data_prep.py, process_datasets.py, and models.py.
Presentation slides about KGE and the proposed KGE regression idea: https://docs.google.com/presentation/d/1u8kVSd6CcxlealP-2E2jpOdvNCSCovfJTn0hbpfOLhY/edit?usp=sharing
Written report: https://docs.google.com/document/d/1UvCrD3jrf7Z3qPdV6C-6GWgDJYHDfBJyXWm8G-sGTeM/edit?usp=sharing
This utility script (https://www.kaggle.com/catwhisker/kge-experiment) contains a class that provides the following utilities:
1. caching and loading PyKEEN models and datasets
2. creating subgraphs from PyKEEN datasets using an entity mask
3. evaluating different models with plots
4. minimal and imperfect GPU memory management
A demo of the utility script on a toy graph: https://www.kaggle.com/catwhisker/kge-experinment-demo
Links to the scripts that generate each item are listed in that item's description.
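As an illustrative PyKEEN sketch of the kind of training-and-caching workflow the class wraps (not the script itself; dataset, model, and output directory are arbitrary choices):

from pykeen.pipeline import pipeline

# Train a small KGE model on a built-in PyKEEN dataset and cache the result to disk.
result = pipeline(dataset="Nations", model="TransE", training_kwargs=dict(num_epochs=5))
result.save_to_directory("kge_cache/nations_transe")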
This dataset contains pickled Python objects with data from the annotations of the Microsoft (MS) COCO dataset. COCO is a large-scale object detection, segmentation, and captioning dataset.
Except for the objs file, which is a plain text file containing a list of objects, the data in this dataset is all in the pickle format, a way of storing Python objects as binary data files.
Important: These pickles were pickled using Python 2. Since Kernels use Python 3, you will need to specify the encoding when unpickling these files. The Python utility scripts here have been updated to correctly unpickle these files.
# the correct syntax to read these pickled files into Python 3
import pickle
data = pickle.load(open('file_path', 'rb'), encoding="latin1")
As a derivative of the original COCO dataset, this dataset is distributed under a CC-BY 4.0 license. These files were distributed as part of the supporting materials for Zhao et al. (2017). If you use these files in your work, please cite the following paper:
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K. W. (2017). Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2979-2989).
NIH DeepLesion Tensor Slices is a subset of NIH's DeepLesion dataset, which may be downloaded in full from here. Full credit is to be given to the NIH team for creating the DeepLesion dataset.
To see an example of this dataset in use, we perform bounding box regression using a ResNet-50 network pre-trained on ImageNet, creating a model named LesionFinder.
- LesionFinder: Training. The training notebook, where we fit the adapted ResNet-50 model to the training data.
- LesionFinder: Evaluation. The evaluation notebook, where we gauge the performance of the trained model.
- LesionFinder: Utilities. Utility script providing functions and classes for the training and evaluation notebooks.
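This is not the LesionFinder code itself, but a minimal PyTorch sketch of how ResNet-50 can be adapted to regress a [0, 1]-normalized bounding box:

import torch
import torch.nn as nn
from torchvision import models

class BoxRegressor(nn.Module):
    """ResNet-50 backbone with a 4-value head for a normalized bounding box."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        backbone.fc = nn.Linear(backbone.fc.in_features, 4)
        self.backbone = backbone

    def forward(self, x):
        return torch.sigmoid(self.backbone(x))  # keep outputs in [0, 1]

model = BoxRegressor()
pred = model(torch.randn(2, 3, 224, 224))  # 3 channels: previous, key, next slice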
You may download and preprocess the entire DeepLesion dataset to generate this subset by following the instructions given here. Please note that downloading and processing the entire dataset requires approximately 375 GB of disk space. If you simply want the tensor slices, it is highly suggested that you download them directly from Kaggle.
NIH DeepLesion Tensor Slices performs 2 primary steps in preprocessing the DeepLesion dataset.
1. The images and bounding boxes are normalized to the range [0, 1]. The images are converted from Hounsfield units to pixel (voxel) values, which are feature-scaled to [0, 1] given the maximum and minimum values of the DICOM window. The bounding boxes are scaled as a proportion of the respective image dimension, where a value of 0 means the lesion lies at that dimension's minimum and a value of 1 means it lies at that dimension's maximum.
2. We stack neighbouring CT slices to encode local volumetric information. For a given key slice, we create a 3-channel image by prepending the previous slice and appending the next slice to the key slice. Since most pre-trained networks expect 3 channels, this should help models exploit information surrounding the key slice.
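A rough sketch of both preprocessing steps; the window bounds and function names are placeholders, not the dataset's actual code:

import numpy as np

def window_normalize(hu_slice, window_min=-1024.0, window_max=3071.0):
    """Clip Hounsfield units to the DICOM window and scale to [0, 1]."""
    hu_slice = np.clip(hu_slice, window_min, window_max)
    return (hu_slice - window_min) / (window_max - window_min)

def stack_slices(prev_slice, key_slice, next_slice):
    """Stack previous/key/next slices as a 3-channel (R, G, B) image."""
    return np.stack(
        [window_normalize(prev_slice), window_normalize(key_slice), window_normalize(next_slice)],
        axis=-1,
    )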
Figure (Visualizing Tensor Slices): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F6966306%2Fe287b5c0582cf78a47f8daa8b1ac157a%2Fvisualize_volume.svg?generation=1680371159427206&alt=media
When visualizing an image, this means the red channel encodes information from the previous slice, the green channel encodes information from the key slice, and the blue channel encodes information from the next slice. As an aside, we can visualize how the RGB version of CT images compares to simple grayscale images.
Figure (Grayscale CT Animation): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F6966306%2Fca6a67f9e178acea4362e9097506171a%2Fct_volume.gif?generation=1680371544234583&alt=media
Figure (RGB CT Animation): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F6966306%2Fcf5926aa3d3c0c112c0765f857b28cdf%2Fct_volume_3d.gif?generation=1680371547549634&alt=media
Feel free to start a discussion if you have any information you want to share, any questions or concerns you may have, or whatever else. If you believe there are errors or better preprocessing approaches, please let me know! I'm always looking to improve my work and will always appreciate the criticism, and I'd be happy to ensure the validity of the dataset and its preprocessing.
I did NOT create the primary dataset. Any work done using the dataset should be referenced to the original authors below:
Yan, Ke, et al. "DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning." Journal of medical imaging 5.3 (2018): 036501-036501.
@article{yan2018deeplesion,
title={DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning},
author={Yan, Ke and Wang, Xiaosong and Lu, Le and Summers, Ronald M},
journal={Journal of medical imaging},
volume={5},
number={3},
pages={036501--036501},
year={2018},
publisher={Society of Photo-Optical Instrumentation Engineers}
}
This dataset contains the GloVe 6B word embeddings in JSON format, providing a comprehensive collection of pre-trained word vectors that can be used to enhance a wide range of natural language processing (NLP) projects. The GloVe 6B embeddings were trained on a corpus of roughly 6 billion tokens from a 2014 dump of English Wikipedia and Gigaword 5. Because the embeddings are pre-trained on this large corpus, they can be used as a starting point for a variety of NLP tasks, including text classification, sentiment analysis, and named entity recognition.
The JSON format makes it easy to access and use the embeddings in your own projects, without the need to perform any additional preprocessing. Each word in the vocabulary is represented as a JSON object containing the word itself as well as its corresponding 300-dimensional vector. The dataset also includes utility scripts that demonstrate how to load and use the embeddings in popular NLP libraries such as NLTK and spaCy.
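A minimal loading sketch, assuming the top-level JSON maps each word to its vector (the file name is hypothetical; if the file instead stores a list of word/vector objects, adapt the lookup accordingly):

import json
import numpy as np

with open("glove_6B_300d.json") as f:  # hypothetical file name
    embeddings = json.load(f)          # assumed shape: {word: [300 floats], ...}

vector = np.array(embeddings["king"])
print(vector.shape)  # (300,)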
Whether you're a researcher, developer, or hobbyist working on NLP projects, the GloVe 6B word embeddings in JSON format provide a powerful tool for accelerating your work and achieving better results. Download the dataset today and start exploring the possibilities!
https://creativecommons.org/publicdomain/zero/1.0/
E.C.H.O. is an AI-powered multi-agent crisis communication simulator that helps users practice de-escalation, empathy, and emotional intelligence. It uses Actor, Monitor, and Director agents to analyze user input, track tension levels, and generate dynamic, emotionally realistic responses across multiple scenarios.
📄 README.md This file provides a full overview of the E.C.H.O. project, including architecture, setup instructions, agents, scenarios, and future roadmap. It serves as the main documentation for understanding how to run and extend the simulation.
🧠 agents/ (folder) Contains all multi-agent logic for E.C.H.O. Each Python file represents a standalone agent responsible for part of the simulation loop.
actor.py – Generates emotionally responsive dialogue for the crisis character.
monitor.py – Evaluates user sentiment and adjusts tension levels (0–100).
director.py – Controls complexity, inserts environmental complications, and manages narrative pacing.
graph.py – Defines the LangGraph workflow connecting Actor, Monitor, and Director agents.
state.py – Stores global simulation state including tension score, history, persona, and turn count.
🎨 frontend/ (folder) A lightweight HTML/JS interface for running E.C.H.O. in a browser. Contains assets, icons, and UI elements for scenario selection and interaction.
⚙ app.py Streamlit application that serves as the main UI for E.C.H.O. Displays tension meter, scenario visuals, microphone input, agent responses, and conversation history.
📜 requirements.txt Python library dependencies used by the project, including LangGraph, Streamlit, Gemini API clients, and audio processing tools.
🔧 zip_project.py Utility script that compresses the entire E.C.H.O. project into a single .zip file for Kaggle uploads or backups.
⚠ .env.example Template file showing expected environment variables, including the required Google Gemini API key.
🧳 .kaggleignore Specifies files or folders to exclude during Kaggle dataset packaging (e.g., sensitive or unnecessary files).
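This is not the project's actual graph.py, but a hedged sketch of how such an Actor/Monitor/Director loop can be wired together with LangGraph (state fields and node logic here are simplified placeholders):

from typing import TypedDict
from langgraph.graph import StateGraph, END

class EchoState(TypedDict):
    user_input: str
    tension: int
    reply: str

def monitor(state: EchoState) -> EchoState:
    # Placeholder sentiment check: calm words lower tension, otherwise it rises.
    delta = -10 if "sorry" in state["user_input"].lower() else 5
    return {**state, "tension": max(0, min(100, state["tension"] + delta))}

def director(state: EchoState) -> EchoState:
    # A real Director would insert complications and pacing; here it passes the state through.
    return state

def actor(state: EchoState) -> EchoState:
    return {**state, "reply": f"(tension {state['tension']}) I hear you..."}

graph = StateGraph(EchoState)
graph.add_node("monitor", monitor)
graph.add_node("director", director)
graph.add_node("actor", actor)
graph.set_entry_point("monitor")
graph.add_edge("monitor", "director")
graph.add_edge("director", "actor")
graph.add_edge("actor", END)

app = graph.compile()
print(app.invoke({"user_input": "I'm sorry, let's talk.", "tension": 50, "reply": ""}))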
This dataset was created by NaomiDing