Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
.pkl or .npz.Some example classes include: - Animals: beaver, dolphin, otter, elephant, snake. - Plants: apple, orange, mushroom, palm tree, pine tree. - Vehicles: bicycle, bus, motorcycle, train, rocket. - Everyday Objects: clock, keyboard, lamp, table, chair.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Python_class is a dataset for object detection tasks - it contains Atk annotations for 967 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Facebook
Twitterreiprasetya-study/python-class-names dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterSubsampling of the dataset Amazon_employee_access (4135) with
seed=4 args.nrows=2000 args.ncols=100 args.nclasses=10 args.no_stratify=True Generated with the following source code:
def subsample(
self,
seed: int,
nrows_max: int = 2_000,
ncols_max: int = 100,
nclasses_max: int = 10,
stratified: bool = True,
) -> Dataset:
rng = np.random.default_rng(seed)
x = self.x
y = self.y
# Uniformly sample
classes = y.unique()
if len(classes) > nclasses_max:
vcs = y.value_counts()
selected_classes = rng.choice(
classes,
size=nclasses_max,
replace=False,
p=vcs / sum(vcs),
)
# Select the indices where one of these classes is present
idxs = y.index[y.isin(classes)]
x = x.iloc[idxs]
y = y.iloc[idxs]
# Uniformly sample columns if required
if len(x.columns) > ncols_max:
columns_idxs = rng.choice(
list(range(len(x.columns))), size=ncols_max, replace=False
)
sorted_column_idxs = sorted(columns_idxs)
selected_columns = list(x.columns[sorted_column_idxs])
x = x[selected_columns]
else:
sorted_column_idxs = list(range(len(x.columns)))
if len(x) > nrows_max:
# Stratify accordingly
target_name = y.name
data = pd.concat((x, y), axis="columns")
_, subset = train_test_split(
data,
test_size=nrows_max,
stratify=data[target_name],
shuffle=True,
random_state=seed,
)
x = subset.drop(target_name, axis="columns")
y = subset[target_name]
# We need to convert categorical columns to string for openml
categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
columns = list(x.columns)
return Dataset(
# Technically this is not the same but it's where it was derived from
dataset=self.dataset,
x=x,
y=y,
categorical_mask=categorical_mask,
columns=columns,
)
Facebook
TwitterThis dataset was created by Dhafer
Facebook
TwitterThe MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more informations on tensorflow_datasets.
https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png" alt="Visualization" width="500px">
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets for practising in class
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data is organized in a table structure. Code4ML includes several main objects: competitions information, raw code blocks collected form Kaggle and manually marked up snippets. Each table has a .csv format.
Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.
The code blocks themselves and their metadata are collected to the data frames concerning the publishing year of the initial kernels. The current version of the corpus includes two code blocks files: snippets from kernels up to the 2020 year (сode_blocks_upto_20.csv) and those from the 2021 year (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.
Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).
As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
Facebook
TwitterData DescriptionThe CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.Data Generation ProceduresThe data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100Hz.A ZED stereo camera capturing 1080p images at 25-30 fps.A synchronized computer acting as a data hub, receiving IMU data and storing images in real-time.A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.Temporal and Spatial ScopeThe dataset contains a total of 472.03 minutes of recorded data.The IMU sensors operate at 100Hz, while the stereo camera captures images at 25-30Hz.Data was collected from 12 participants, each performing all 19 activities multiple times.The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.Dataset ComponentsThe dataset is organized into JSON and PNG files, structured hierarchically:IMU Data: Stored in JSON files, containing:Samsung Linear Acceleration Sensor (X, Y, Z values, 100Hz)LSM6DSO Gyroscope (X, Y, Z values, 100Hz)Samsung Rotation Vector (X, Y, Z, W quaternion values, 100Hz)Samsung HR Sensor (heart rate, 1Hz)OPT3007 Light Sensor (ambient light levels, 5Hz)Stereo Camera Images: High-resolution 1920×1080 PNG files from left and right cameras.Synchronization: Each IMU data record and image is timestamped for precise alignment.Data StructureThe dataset is divided into continuous and instantaneous activities:Continuous Activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained.Instantaneous Activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution.The dataset is structured as:/continuous/subject_id/activity_name/ /camera_a/ → Left camera images /camera_b/ → Right camera images /sensors/ → JSON files with IMU data
/instantaneous/subject_id/activity_name/repetition_id/ /camera_a/ /camera_b/ /sensors/ Data Quality & Missing DataThe smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss.Synchronization latency between the smartwatch and the computer is negligible.Not all IMU samples have corresponding images due to different recording rates.Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.Error Ranges & LimitationsSensor data may contain noise due to minor hand movements.The heart rate sensor operates at 1Hz, limiting its temporal resolution.Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.File Formats & Software CompatibilityIMU data is stored in JSON format, readable with Python’s json library.Images are in PNG format, compatible with all standard image processing tools.Recommended libraries for data analysis:Python: numpy, pandas, scikit-learn, tensorflow, pytorchVisualization: matplotlib, seabornDeep Learning: Keras, PyTorchPotential ApplicationsDevelopment of activity recognition models in educational settings.Study of student engagement based on movement patterns.Investigation of sensor fusion techniques combining visual and IMU data.This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.CitationIf you find this project helpful for your research, please cite our work using the following bibtex entry:@misc{marquezcarpintero2025caddiinclassactivitydetection, title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors}, author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso}, year={2025}, eprint={2503.02853}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2503.02853}, }
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A list of different projects selected to analyze class comments (available in the source code) of various languages such as Java, Python, and Pharo. The projects vary in terms of size, contributors, and domain.
Projects/
Java_projects/
eclipse.zip
guava.zip
guice.zip
hadoop.zip
spark.zip
vaadin.zip
Pharo_projects/
images/
GToolkit.zip
Moose.zip
PetitParser.zip
Pillar.zip
PolyMath.zip
Roassal2.zip
Seaside.zip
vm/
70-x64/Pharo
Scripts/
ClassCommentExtraction.st
SampleSelectionScript.st
Python_projects/
django.zip
ipython.zip
Mailpile.zip
pandas.zip
pipenv.zip
pytorch.zip
requests.zip
Projects/ contains the raw projects of each language that are used to analyze class comments.
- Java_projects/
- eclipse.zip - Eclipse project downloaded from the GitHub. More detail about the project is available on GitHub Eclipse.
- guava.zip - Guava project downloaded from the GitHub. More detail about the project is available on GitHub Guava.
- guice.zip - Guice project downloaded from the GitHub. More detail about the project is available on GitHub Guice
- hadoop.zip - Apache Hadoop project downloaded from the GitHub. More detail about the project is available on GitHub Apache Hadoop
- spark.zip - Apache Spark project downloaded from the GitHub. More detail about the project is available on GitHub Apache Spark
- vaadin.zip - Vaadin project downloaded from the GitHub. More detail about the project is available on GitHub Vaadin
Pharo_projects/
images/ -
GToolkit.zip - Gtoolkit project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image. Moose.zip - Moose project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image. PetitParser.zip - Petit Parser project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.Pillar.zip - Pillar project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.PolyMath.zip - PolyMath project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.Roassal2.zip - Roassal2 project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.Seaside.zip - Seaside project is imported into the Pharo image. We can run this image with the virtual machine given in the vm/ folder. The script to extract the comments is already provided in the image.vm/ -
70-x64/Pharo - Pharo7 (version 7 of Pharo) virtual machine to instantiate the Pharo images given in the images/ folder. The user can run the vm on macOS and select any of the Pharo image.
Scripts/ - It contains the sample Smalltalk scripts to extract class comments from various projects.
ClassCommentExtraction.st - A Smalltalk script to show how class comments are extracted from various Pharo projects. This script is already provided in the respective project image.
SampleSelectionScript.st - A Smalltalk script to show sample class comments of Pharo projects are selected. This script can be run in any of the Pharo images given in the images/ folder.
Python_projects/
django.zip - Django project downloaded from the GitHub. More detail about the project is available on GitHub Djangoipython.zip - IPython project downloaded from the GitHub. More detail about the project is available on GitHub on IPythonMailpile.zip - Mailpile project downloaded from the GitHub. More detail about the project is available on GitHub on Mailpilepandas.zip - pandas project downloaded from the GitHub. More detail about the project is available on GitHub on pandaspipenv.zip - Pipenv project downloaded from the GitHub. More detail about the project is available on GitHub on Pipenvpytorch.zip - PyTorch project downloaded from the GitHub. More detail about the project is available on GitHub on PyTorchrequests.zip - Requests project downloaded from the GitHub. More detail about the project is available on GitHub on Requests
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website; click here to view.
Dataset layout Python / Matlab versions I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:
python
def unpickle(file):
import cPickle
with open(file, 'rb') as fo:
dict = cPickle.load(fo)
return dict
And a python3 version:
def unpickle(file):
import pickle
with open(file, 'rb') as fo:
dict = pickle.load(fo, encoding='bytes')
return dict
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries: label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc. Binary version The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows: <1 x label><3072 x pixel> ... <1 x label><3072 x pixel> In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
The CIFAR-100 dataset This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
Facebook
TwitterThe CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar10', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more informations on tensorflow_datasets.
https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png" alt="Visualization" width="500px">
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The CIFAR-100 Python dataset contains 60,000 32×32 color images across 100 object classes, designed for computer vision and machine learning research in image classification and object recognition.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.
from IPython.display import Markdown, display
display(Markdown("../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
In this Notebook, I have processed the images with RoboFlow because in COCO formatted dataset was having different dimensions of image and Also data set was not splitted into different Format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:
Image Credit - jinfagang
!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display # to display images
print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">
I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!
try:
user_secrets = UserSecretsClient()
wandb_api_key = user_secrets.get_secret("wandb_api")
wandb.login(key=wandb_api_key)
anonymous = None
except:
wandb.login(anonymous='must')
print('To use your W&B account,
Go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as WANDB.
Get your W&B access token from here: https://wandb.ai/authorize')
wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")
https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png" alt="">
In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
In Roboflow, We can choose between two paths:
https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG" alt="">
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")
Here, I am able to pass a number of arguments: - img: define input image size - batch: determine
Facebook
TwitterFashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('fashion_mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more informations on tensorflow_datasets.
https://storage.googleapis.com/tfds-data/visualization/fig/fashion_mnist-3.0.1.png" alt="Visualization" width="500px">
Facebook
TwitterOpen Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
PACO is a detection dataset that provides richer annotations such as part masks, object categories, object-part categories, and attributes. It spans 75 object categories, 456 object-part categories and 55 attributes across two datasets: LVIS and Ego4D. It has 641K part masks annotated across 260K object boxes, with roughly half of them annotated with attributes. It provides evaluation metrics and benchmark results for three tasks on the datasets: part mask segmentation, object and part attribute prediction and zero-shot instance detection.

The PACO-LVIS dataset is formed from the LVIS dataset of images. The images sourced from the dataset has been annotated according to Meta's internal platform Halo, with 75 object classes. The LVIS dataset provides pixel-level annotations of objects and their categories, making it useful for part mask segmentation, object and part attribute prediction and zero-shot instance detection. | Stats | train/val/test | | :--- | :----: | | number of images | 45790/2410/9443 | | number of images with annotations | 45790/2410/9443 | | number of bounding boxes | 217117/10794/45861 | | number of object segments | 217117/10794/45861 | | number of part segments | 395071/20945/86041 | | number of bboxes with obj attributes | 58846/3140/12407 | | number of bboxes with part attributes | 52088/2812/11003 |
# data: the variable we're loading and saving json dictionary to
with open('annotations/paco_lvis_v1_train.json', 'r') as file:
data = json.load(file)
data["images"]: # a list of dictionaries, each dictionary corresponds to one image
{
'id': int,
'file_name': str,
'width': int,
'height': int,
'neg_category_ids': list,
'not_exhaustive_category_ids': list,
'neg_category_ids_attrs': list,
'not_exhaustive_category_ids_attrs': list,
'license': int,
}
data["annotations"]: # a list of dictionaries, each dictionary corresponds to one object or part bounding box
{
'id': int,
'bbox': [x,y,width,height],
'area': float,
'category_id': int,
'image_id': int,
'segmentation': RLE,
'attribute_ids': List[int],
'dom_color_ids': List[int],
'obj_ann_id': int,
'unknown_color': 0 or 1,
'unknown_pattern_marking': 0 or 1,
'unknown_material': 0 or 1,
'unknown_transparency': 0 or 1,
'instance_id': int, # PACO-EGO4D only
'blur_level': int, # PACO-EGO4D only
}
data["categories"]: # a list of dictionaries, each dictionary corresponds to one object category
{
'supercategory': 'OBJECT',
'id': int,
'name': str,
'image_count': int,
'instance_count': int,
'synset': str,
'frequency': char,
}
data["part_categories"]: # a list of dictionaries, each dictionary corresponds to one part category
{
'supercategory': 'PART',
'id': int,
'name': str
}
data['attributes']: # a list of dictionaries, each dictionary corresponds to one attribute category
{
'supercategory': 'ATTR',
'id': int,
'name': str
}
data["attr_type_to_attr_idxs"]: # dictionary, key is the attribute name (one of: color, pattern, marking, material, transparency, value is the list of ids each attribute corresponds to)
{
'color': range(30),
'pattern_marking': range(30,41),
'material': range(41,55),
'transparency': range(55,59)
}
'trash_can','handbag','ball','basket','bicycle','book','bottle','bowl','can','car_(automobile)','carton','cellular_telephone','chair','cup','dog','drill','drum_(musical_instrument)','glass_(drink_container)','guitar','hat','helmet','jar','knife','laptop_computer','mug','pan_(for_cooking)','plate','remote_control','scissors','shoe','slipper_(footwear)','stool','table','towel','wallet','watch','wrench','belt','bench','blender','box','broom',...
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets of articles and their associated quality assessment rating from the English Wikipedia. Each dataset is self-contained as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and 10% test set using a stratified random sampling strategy.The 2017 dataset is the preferred dataset to use, contains 32,460 articles, and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference, and contains 30,272 articles gathered on 2015/02/05.The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which contains all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as and article history has been mined to find the appropriate revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets. For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used in training the wikiclass Python library machine learner, also linked below.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Facebook
TwitterAttribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
✅ Step 1: Mount to Dataset
Search for my dataset pytorch-models and add it — this will mount it at:
/kaggle/input/pytorch-models/
✅ Step 2: Check file paths Once mounted, the four files will be available at:
/kaggle/input/pytorch-models/base_models.py
/kaggle/input/pytorch-models/ext_base_models.py
/kaggle/input/pytorch-models/ext_hybrid_models.py
/kaggle/input/pytorch-models/hybrid_models.py
✅ Step 3: Copy files to working directory To make them importable, copy the .py files to your notebook’s working directory (/kaggle/working/):
import shutil
shutil.copy('/kaggle/input/pytorch-models/base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_base_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/ext_hybrid_models.py', '/kaggle/working/')
shutil.copy('/kaggle/input/pytorch-models/hybrid_models.py', '/kaggle/working/')
✅ Step 4: Import your modules Now that they are in the working directory, you can import them like normal:
import base_models
import ext_base_models
import ext_hybrid_models
import hybrid_models
Or, if you only want to import specific classes or functions:
from base_models import YourModelClass
from ext_base_models import AnotherModelClass
✅ Step 5: Use the models You can now initialize and use the models/classes/functions defined inside each file:
model = base_models.YourModelClass()
output = model(input_data)
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Tiny ImageNet Dataset is a dataset of 100,000 tiny (64x64) images of objects. It is a popular dataset for image classification and object detection research. The dataset consists of 200 different classes, each of which has 500 images.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
.pkl or .npz.Some example classes include: - Animals: beaver, dolphin, otter, elephant, snake. - Plants: apple, orange, mushroom, palm tree, pine tree. - Vehicles: bicycle, bus, motorcycle, train, rocket. - Everyday Objects: clock, keyboard, lamp, table, chair.