License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains data associated with the publication: Organic Reaction Mechanism Classification with Machine Learning
- Trained AI reduced models
- Python files to run predictions
- Python files to train the model
- Template for inputting kinetics for predictions
- Data used in case studies
Unpack the data file and follow the instructions in the publication's Supporting Information.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Developments in Artificial Intelligence (AI) have had an enormous impact on scientific research in recent years. Yet relatively few robust methods have been reported in the field of structure-based drug discovery. To train AI models to abstract from structural data, highly curated and precise biomolecule-ligand interaction datasets are urgently needed. We present MISATO, a curated dataset of nearly 20,000 experimental structures of protein-ligand complexes, associated molecular dynamics traces, and electronic properties. Semi-empirical quantum mechanics was used to systematically refine protonation states of proteins and small-molecule ligands. Molecular dynamics traces for protein-ligand complexes were obtained in explicit water. The dataset is made readily available to the scientific community via simple Python data loaders. AI baseline models are provided for dynamical and electronic properties. This highly curated dataset is expected to enable the next generation of AI models for structure-based drug discovery. Our vision is to make MISATO the first step of a vibrant community project for the development of powerful AI-based drug discovery tools.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub
This dataset released by OpenAI, HumanEval, offers a unique opportunity for developers and researchers to accurately evaluate their code generation models in a safe environment. It includes 164 handcrafted programming problems written by engineers and researchers from OpenAI, specifically designed to test the correctness and scalability of code generation models. Written in Python, these programming problems contain docstrings and comments full of natural English text, which can be difficult for computers to comprehend. Each programming problem also includes a function signature, a body, and several unit tests. Placed under the MIT License, this HumanEval dataset is ideal for any practitioner looking to judge the efficacy of their machine-generated code with trusted results!
The first step is to explore the data by viewing the included columns. This guide focuses on four key columns: prompt, canonical_solution, test, and entry_point.
- The prompt column contains natural English text describing the programming problem.
- The canonical_solution column holds the correct solution to each programming problem, as hand-crafted by the OpenAI researchers and engineers who built the dataset.
- The test column contains unit tests designed to check for correctness when debugging or evaluating code generated by neural networks or other automated tools.
- The entry_point column names the function that serves as the entry point into each program.
With this information, you can use the dataset in your own projects, from building case studies for specific AI algorithms to developing tools that evaluate machine-generated code against benchmarks like HumanEval.
- Training code generation models in a limited and supervised environment.
- Benchmarking the performance of existing code generation models, as HumanEval consists of both the canonical solution for each problem and unit tests that can be used to evaluate model accuracy.
- Using Natural Language Processing (NLP) algorithms on the docstrings and comments within HumanEval to develop better natural language understanding for programming contexts
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: test.csv

| Column name | Description |
|:---|:---|
| prompt | A description of the programming problem. (String) |
| canonical_solution | The expected solution to the programming problem. (String) |
| test | Unit tests to verify the accuracy of the solution. (String) |
| entry_point | The entry point for running the unit tests. (String) |
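The four columns are designed to work together: the prompt and canonical_solution concatenate into a complete function, and the test column's check function exercises the entry point. A minimal sketch, using a hypothetical row that mimics the schema (not real dataset content):

```python
# Hypothetical row mimicking the HumanEval schema (not an actual dataset record)
row = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
    "entry_point": "add",
}

def run_humaneval_row(row):
    """Execute prompt + solution, then run the row's unit tests on the entry point."""
    namespace = {}
    exec(row["prompt"] + row["canonical_solution"], namespace)  # define the function
    exec(row["test"], namespace)                                # define check()
    namespace["check"](namespace[row["entry_point"]])           # raises on failure
    return True

run_humaneval_row(row)  # passes silently when the solution is correct
```

The same pattern can score model-generated code by substituting a candidate completion for canonical_solution.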
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
This software is a Python module for estimating uncertainty in the predictions of machine learning models, using bootstrapping and residual bootstrapping. It is designed to interface with scikit-learn, but any Python package exposing a similar estimator interface should work.
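The module's exact API is not shown here, but the underlying idea can be sketched with plain NumPy: refit a simple model on bootstrap resamples and report the spread of its predictions. This is illustrative only; the described package wraps scikit-learn estimators rather than the hand-rolled linear fit used below.

```python
import numpy as np

def bootstrap_prediction_std(X, y, X_new, n_boot=200, seed=0):
    """Estimate prediction uncertainty by refitting a linear model on
    bootstrap resamples and taking the std of the resulting predictions."""
    rng = np.random.default_rng(seed)
    Xb = np.column_stack([np.ones(len(X)), X])        # add intercept column
    Xn = np.column_stack([np.ones(len(X_new)), X_new])
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))    # resample with replacement
        coef, *_ = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)
        preds.append(Xn @ coef)
    return np.std(preds, axis=0)

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=50)
sigma = bootstrap_prediction_std(X, y, np.array([[5.0]]))  # uncertainty at x = 5
```

Residual bootstrapping follows the same loop, but resamples the fitted model's residuals instead of the data rows.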
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed for Fake News Classification using NLP & Machine Learning. It contains labeled fake and real news articles, sourced from credible datasets. It is optimized for text analysis, deep learning models, and AI research.
1️⃣ Load the dataset in Python using Pandas
```python
import pandas as pd

df = pd.read_csv("fake_news_data.csv")
```
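2️⃣ Inspect the class balance. A sketch using a tiny in-memory stand-in, since the CSV is not bundled here and its actual column names (assumed below to be text and label) may differ:

```python
import pandas as pd
from io import StringIO

# Tiny in-memory stand-in for fake_news_data.csv (column names are assumptions)
csv = StringIO(
    'text,label\n'
    '"Markets rally on strong jobs report",real\n'
    '"Aliens endorse presidential candidate",fake\n'
)
df = pd.read_csv(csv)

print(df["label"].value_counts())  # check that fake/real classes are balanced
```

A skewed class balance is worth knowing before training, since it affects both the choice of metric and any resampling strategy.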
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
AI Python Project is a dataset for object detection tasks - it contains Vehicles FTNK annotations for 2,999 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public-supply water use for the period 2000-2020. This data release contains model input feature datasets, Python code used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public-supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public-supply water use estimates and statistics for WSAs are available in public_water_use_model.zip. This page includes the following files:
- PS_HUC12_Tot_2000_2020.csv - estimated monthly public-supply total water use from 2000-2020 by HUC12, in million gallons per day
- PS_HUC12_GW_2000_2020.csv - estimated monthly public-supply groundwater use for 2000-2020 by HUC12, in million gallons per day
- PS_HUC12_SW_2000_2020.csv - estimated monthly public-supply surface water use for 2000-2020 by HUC12, in million gallons per day
Notes: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC.
- STAT_PS_HUC12_Tot_2000_2020.csv - statistics by HUC12 for the estimated monthly public-supply total water use from 2000-2020
- STAT_PS_HUC12_GW_2000_2020.csv - statistics by HUC12 for the estimated monthly public-supply groundwater use for 2000-2020
- STAT_PS_HUC12_SW_2000_2020.csv - statistics by HUC12 for the estimated monthly public-supply surface water use for 2000-2020
- public_water_use_model.zip - input datasets, scripts, and output datasets for the public-supply water use machine learning model
- version_history_MLmodel.txt - a txt file describing changes in this version
This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
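The preprocessing steps described above can be sketched as follows. Column names, the interpolation choice, and the IQR-based outlier rule are illustrative assumptions, not the script's actual values:

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for the plant and weather tables (column names assumed)
quality = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "turbidity": [5.0, np.nan, 4.8, 250.0, 5.2, 4.9],   # 250.0 is a sensor glitch
})
weather = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "rainfall_mm": [0.0, 1.2, 0.0, 14.5, 0.3, 0.0],
})

quality["turbidity"] = quality["turbidity"].interpolate()   # fill missing values
q1, q3 = quality["turbidity"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = quality["turbidity"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = quality[mask].merge(weather, on="date")             # drop outliers, join weather
```

The resulting frame, with gaps filled, the outlier row dropped, and weather features attached, is the kind of table that feeds directly into model training.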
Git archive containing Python modules and resources used to generate machine-learning models used in the "Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada" project. This software is licensed as free to use, modify, and distribute with attribution. Full license details are included within the archive. See "documentation.zip" for setup instructions and file trees annotated with module descriptions.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Exploring the creation of a unique dataset of synthetic influencer profiles using AI technologies, including OpenAI's GPT-3.5.
Privacy policy: https://crawlfeeds.com/privacy_policy
Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.
Training large language models (LLMs)
Analyzing content trends and engagement
Sentiment and text classification
SEO research and author profiling
Academic or commercial research
High-volume, cleanly structured JSON
Ideal for developers, researchers, and data scientists
Easy integration with Python, R, SQL, and other data pipelines
Affordable and ready-to-use
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and Python code used for AOD prediction with the DustNet model, a machine-learning/AI-based forecasting system.
Model input data and code
Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables* ready to reproduce the DustNet model results or for similar forecasting with machine learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as the Jupyter notebook 'DustNet_model_code.ipynb'. A subfolder with normalised data, split into training/validation/testing sets, is also provided, along with Python code for two additional ML-based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.
Model output data and code
This dataset was constructed by running 'DustNet_model_code.ipynb' (see above). It consists of 1095 days of forecast AOD data (2020-2022) from CAMS, the DustNet model, naïve prediction (persistence), and gridded climatology. The ground truth raw AOD data from MODIS is provided for comparison and statistical analysis of the predictions. It is intended for quick reproduction of the figures and statistical analysis presented in the paper introducing DustNet.
*datasets are NumPy arrays (v1.23) created in Python v3.8.18.
**all ML models were created with Keras in Python v3.10.10.
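The two simple baselines mentioned above, persistence and gridded climatology, can be sketched on a toy daily AOD array. Shapes and values below are illustrative stand-ins, not the archived MODIS data:

```python
import numpy as np

# Toy daily AOD series with layout (days, lat, lon), standing in for the MODIS arrays
rng = np.random.default_rng(0)
aod = rng.random((365, 4, 5)).astype("float32")

persistence = aod[:-1]           # naive forecast: tomorrow equals today
target = aod[1:]                 # the day actually being predicted
climatology = aod.mean(axis=0)   # gridded long-term mean, broadcast over days

mae_persistence = np.abs(persistence - target).mean()
mae_climatology = np.abs(climatology - target).mean()
```

A learned forecaster such as DustNet is only interesting to the extent that it beats both of these baselines on the held-out 2020-2022 period.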
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🌐 Bilingual Coding Q&A Dataset
📊 Dataset Description
A comprehensive bilingual (English-Hindi) dataset containing 25,151 high-quality question-answer pairs focused on programming concepts, particularly Python, machine learning, and AI. This dataset was used to fine-tune coding assistant models and contains over 7 million tokens of training data.
Dataset Statistics

| Metric | Value |
|:---|:---|
| Total Examples | 25,151 Q&A pairs |
| Total Lines | 250,320+ |

… See the full description on the dataset page: https://huggingface.co/datasets/convaiinnovations/bilingual-coding-qa-dataset.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
CoT Reasoning Python General Query: Enhancing Python Understanding through Chain of Thought Reasoning
Description: Explore Python programming and general computing queries with the "CoT_Reasoning_Python_General_Query" dataset. This open-source resource (MIT licensed) provides a carefully curated collection of question-and-answer pairs designed to train AI models in understanding and reasoning about a… See the full description on the dataset page: https://huggingface.co/datasets/moremilk/CoT_Reasoning_Python_General_Query.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.
from IPython.display import Markdown, display
display(Markdown(open("../input/Car-Person-v2-Roboflow/README.roboflow.txt").read()))  # render the README contents
In this notebook, I have processed the images with Roboflow, because the COCO-formatted dataset had images of different dimensions and was not split into train/validation/test sets. To train a custom YOLOv7 model we need to recognize the objects in the dataset, so I have taken the following steps:
Image Credit - jinfagang
!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display # to display images
print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!
try:
    user_secrets = UserSecretsClient()
    wandb_api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=wandb_api_key)
    anonymous = None
except:
    anonymous = 'must'
    wandb.login(anonymous=anonymous)
    print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. '
          'Use the label name WANDB. Get your W&B access token from https://wandb.ai/authorize')
wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
In Roboflow, we can choose between two paths:
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")
Here, I am able to pass a number of arguments:
- img: define input image size
- batch: determine
Privacy policy: https://crawlfeeds.com/privacy_policy
Large Walmart Products Dataset is an essential resource for businesses, analysts, and developers seeking detailed insights into Walmart’s vast product catalog. This dataset includes extensive information on Walmart products, such as product names, descriptions, prices, categories, brand information, ratings, and customer reviews.
With Walmart being one of the largest retailers globally, this dataset provides a unique opportunity to study consumer trends, perform competitive pricing analysis, and develop e-commerce solutions. For startups and established businesses, the dataset is ideal for market research, inventory management insights, and enhancing product discovery mechanisms.
AI and machine learning practitioners can use this dataset to build recommendation systems, predictive pricing algorithms, and sentiment analysis models. Its structured format ensures smooth integration with Python, R, and other data analytics tools, making it user-friendly for data visualization and predictive modeling.
Walmart Products Dataset is also an invaluable resource for retail analysts and e-commerce marketers aiming to optimize product positioning or analyze buying behaviors. Its broad coverage across categories like groceries, electronics, fashion, and home essentials provides a holistic view of Walmart’s inventory.
Key Features:
Whether you're developing an AI-driven product search engine or conducting a pricing strategy study, the Large Walmart Products Dataset equips you with the data you need to succeed in a competitive market.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CodeLLMExp is a comprehensive, large-scale, multi-language, and multi-vulnerability dataset created to advance research into the security of AI-generated code. It is specifically designed to train and evaluate machine learning models, such as Large Language Models (LLMs), on the joint tasks of Automated Vulnerability Localization (AVL) and Explainable AI (XAI).
The dataset was constructed through a rigorous pipeline that involved sourcing prompts from established security benchmarks (CodeLMSec, SecurityEval, Copilot CWE Scenarios), employing seed augmentation to ensure coverage of under-represented Common Weakness Enumerations (CWEs), and using a chain of LLMs to generate vulnerable code snippets. This raw data was then automatically evaluated for quality by an "LLM-as-judge" (validated against human experts with a Spearman correlation of 0.8545) and enriched with structured annotations.
CodeLLMExp covers three of the most widely used programming languages: Python, Java, and C. It contains 10,400 high-quality examples across Python (44.3%), Java (29.6%), and C (26.1%). It focuses on 29 distinct CWEs, including the complete CWE Top 25 Most Dangerous Software Errors (2024). Each record in the dataset provides a vulnerable code snippet, the precise line number of the flaw, a structured explanation (root cause, impact, mitigation), and a fixed version of the code.
By providing richly annotated data for detection, classification, localization, and explanation, CodeLLMExp enables the development of more robust and transparent security analysis tools. It facilitates research into LLM adaptation strategies (e.g., prompting, fine-tuning, Retrieval-Augmented Generation), automated program repair, and the inherent security patterns of code produced by AI.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset is for running the code from this site: https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8.
This is how to show a picture from the training set: display(Image('../input/cat-and-dog/training_set/training_set/dogs/dog.423.jpg'))
From the test set: display(Image('../input/cat-and-dog/test_set/test_set/cats/cat.4453.jpg'))
See an example of using this dataset. https://www.kaggle.com/tongpython/nattawut-5920421014-cat-vs-dog-dl
The Fish Detection AI project aims to improve the efficiency of fish monitoring around marine energy facilities to comply with regulatory requirements. Despite advancements in computer vision, there is limited focus on sonar images, identifying small fish with unlabeled data, and methods for underwater fish monitoring for marine energy. A YOLO (You Only Look Once) computer vision model was developed using the Eyesea dataset (optical) and sonar images from Alaska Fish and Games to identify fish in underwater environments. Supervised methods were used within YOLO to detect fish based on training with labeled fish data. These trained models were then applied to different unseen datasets, aiming to reduce the need for labeling datasets and training new models for various locations. Additionally, hyper-image analysis and various image preprocessing methods were explored to enhance fish detection. In this research we achieved enhanced YOLO performance compared to a published article (Xu, Matzner 2018) that used earlier YOLO versions for fish object identification. Specifically, we achieved a best mean Average Precision (mAP) of 0.68 on the Eyesea optical dataset using YOLO v8 (medium-sized model), surpassing the previous YOLO v3 benchmarks from that publication. We further demonstrated up to 0.65 mAP on unseen sonar domains by leveraging a hyper-image approach (stacking consecutive frames), showing promising cross-domain adaptability.

This submission of data includes:
- The best-performing trained YOLO model neural network weights, which can be applied to do object detection (PyTorch .pt files), found in the Yolo_models_downloaded zip file
- A documentation file explaining the upload and the goals of each of experiments 1-5 ("Yolo_Object_Detection_How_To_Document.docx")
- Coding files: five sub-folders of Python, shell, and YAML files used to run experiments 1-5, plus a separate folder for YOLO models, each in its own zip file named after the experiment
- Sample data structures (sample1 and sample2, each in its own zip file) showing how the raw data should be structured after running our provided code on the raw downloaded data
- A link to the article we were replicating (Xu, Matzner 2018)
- A link to the YOLO documentation site from the original creators of the model (Ultralytics)
- A link to the downloadable EyeSea dataset from PNNL (instructions on how to download and format the data to replicate these experiments are in the How To document)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is an RO-Crate that bundles artifacts of an AI-based computational pipeline execution. It is an example of application of the CPM RO-Crate profile, which integrates the Common Provenance Model (CPM), and the Process Run Crate profile.
As the CPM is a groundwork for the ISO 23494 Biotechnology — Provenance information model for biological material and data provenance standards series development, the resulting profile and the example is intended to be presented at one of the ISO TC275 WG5 regular meetings, and will become an input for the ISO 23494-5 Biotechnology — Provenance information model for biological material and data — Part 5: Provenance of Data Processing standard development.
Description of the AI pipeline
The goal of the AI pipeline whose execution is described in the dataset is to train an AI model to detect the presence of carcinoma cells in high-resolution human prostate images. The pipeline is implemented as a set of Python scripts that work over a filesystem, where the datasets, intermediate results, configurations, logs, and other artifacts are stored. In particular, the AI pipeline consists of the following three general parts:
Image data preprocessing. Goal of this step is to prepare the input dataset – whole slide images (WSIs) and their annotations – for the AI model. As the model is not able to process the entire high resolution images, the preprocessing step of the pipeline splits the WSIs into groups (training and testing). Furthermore, each WSI is broken down into smaller overlapping parts called patches. The background patches are filtered out and the remaining tissue patches are labeled according to the provided pathologists’ annotations.
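The patch-splitting and background-filtering step might look roughly like this. This is a minimal sketch on a plain array with assumed patch sizes and thresholds; the real pipeline operates on pyramidal WSIs and labels patches from pathologists' annotations:

```python
import numpy as np

def extract_patches(wsi, patch=64, stride=48, bg_threshold=220):
    """Split an image array into overlapping patches and drop near-white
    background patches, keeping only tissue patches."""
    patches = []
    h, w = wsi.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            p = wsi[y:y + patch, x:x + patch]
            if p.mean() < bg_threshold:   # keep tissue, skip bright background
                patches.append(p)
    return patches

# Synthetic grayscale "slide": white background with a darker tissue region
rng = np.random.default_rng(0)
wsi = np.full((256, 256), 255, dtype=np.uint8)
wsi[64:192, 64:192] = rng.integers(80, 160, (128, 128), dtype=np.uint8)
tissue_patches = extract_patches(wsi)
```

Overlapping strides trade storage for coverage: a cell cluster cut by one patch boundary is usually intact in a neighbouring patch.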
AI model training. Goal of this step is to train the AI model using the training dataset generated in the previous step of the pipeline. Result of this step is a trained AI model.
AI model evaluation. Goal of this step is to evaluate the trained model performance on a dataset which was not provided to the model during the training. Results of this step are statistics describing the AI model performance.
In addition to the above, execution of the steps results in generation of log files. The log files contain detailed traces of the AI pipeline execution, such as file paths, model weight parameters, timestamps, etc. As suggested by the CPM, the logfiles and additional metadata present on the filesystem are then used by a provenance generation step that transforms available information into the CPM compliant data structures, and serializes them into files.
Finally, all these artifacts are packed together in an RO-Crate.
For the purpose of the example, we have included only a small fragment of the input image dataset in the resulting crate, as this has no effect on how the Process Run Crate and CPM RO-Crate profiles are applied to the use case. In real world execution, the input dataset would consist of terabytes of data. In this example, we have selected a representative image for each of the input dataset parts. As a result, the only difference between the real world application and this example would be that the resulting real world crate would contain more input files.
Description of the RO-Crate
Process Run Crate related aspects
The Process Run Crate profile can be used to pack artifacts of a computational workflow of which individual steps are not controlled centrally. Since the pipeline presented in this example consists of steps that are executed individually, and that the pipeline execution is not managed centrally by a workflow engine, the process run crate can be applied.
Each of the computational steps is expressed within the crate’s ro-crate-metadata.json file as a pair of elements: 1) SW used to create files; 2) specific execution of that SW. In particular, we use the SoftwareSourceCode type to indicate the executed python scripts and the CreateAction type to indicate actual executions.
As a result, the crate contains the following seven “executables”:
Three Python scripts, each corresponding to a part of the pipeline: preprocessing, training, and evaluation.
Four provenance generation scripts, three of which implement the transformation of the proprietary log files generated by the AI pipeline scripts into CPM compliant provenance files. The fourth one is a meta provenance generation script.
For each of the executables, their execution is expressed in the resulting ro-crate-metadata.json using the CreateAction type. As a result, seven create-actions are present in the resulting crate.
Input dataset, intermediate results, configuration files and resulting provenance files are expressed according to the underlying RO Crate specification.
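As a sketch, one such script/execution pair could be expressed in ro-crate-metadata.json along the following lines. The identifiers and paths are hypothetical, and the fragment is built with Python here purely for illustration:

```python
import json

# Hypothetical fragment pairing a script (SoftwareSourceCode) with its
# execution (CreateAction), in the spirit of the Process Run Crate profile.
graph = [
    {
        "@id": "scripts/preprocess.py",
        "@type": "SoftwareSourceCode",
        "name": "WSI preprocessing script",
    },
    {
        "@id": "#run-preprocessing",
        "@type": "CreateAction",
        "instrument": {"@id": "scripts/preprocess.py"},   # the SW used
        "object": [{"@id": "data/input/"}],               # inputs consumed
        "result": [{"@id": "data/patches/"}],             # outputs produced
    },
]
metadata = json.dumps(
    {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph},
    indent=2,
)
```

The instrument/object/result links are what let a consumer of the crate reconstruct which script produced which files, without a central workflow engine.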
CPM RO-Crate related aspects
The main purpose of the CPM RO-Crate profile is to enable identification of the CPM compliant provenance files within a crate. To achieve this, the CPM RO-Crate profile specification prescribes specific file types for such files: CPMProvenanceFile, and CPMMetaProvenanceFile.
In this case, the RO Crate contains three CPM Compliant files, each documenting a step of the pipeline, and a single meta-provenance file. These files are generated as a result of the three provenance generation scripts that use available log files and additional information to generate the CPM compliant files. In terms of the CPM, the provenance generation scripts are implementing the concept of provenance finalization event. The three provenance generation scripts are assigned SoftwareSourceCode type, and have corresponding executions expressed in the crate using the CreateAction type.
Remarks
The resulting RO-Crate packs artifacts of an execution of the AI pipeline. The scripts that implement individual steps of the pipeline and provenance generation are not included in the crate directly. The implementation scripts are hosted on GitHub and only referenced from the crate's ro-crate-metadata.json file at their remote location.
The input image files included in this RO-Crate are coming from the Camelyon16 dataset.