Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Omer Kamal 12
Released under Apache 2.0
This is a subset of the Zenodo-ML Dinosaur Dataset [Github] that has been converted to small png files and organized in folders by language, so you can jump right into using machine learning methods that assume image input.
Included are .tar.gz files, each named after a file extension; when extracted, each produces a folder of the same name.
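For example, one of the archives can be pulled apart with Python's standard tarfile module (the archive name here is just illustrative; substitute any of the archives listed below):

import tarfile

# Extract one per-extension archive into the current directory
with tarfile.open('map.tar.gz', 'r:gz') as tar:
    tar.extractall()

After extracting the archives, the top level looks like this: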
$ tree -L 1
.
├── c
├── cc
├── cpp
├── cs
├── css
├── csv
├── cxx
├── data
├── f90
├── go
├── html
├── java
├── js
├── json
├── m
├── map
├── md
├── txt
└── xml
And we can peep inside one of the (somewhat smaller) folders of the set to see that the subfolders are zenodo identifiers. A zenodo identifier corresponds to a single Github repository, so the png files under it are chunks of code of that extension type from a particular repository.
$ tree map -L 1
map
├── 1001104
├── 1001659
├── 1001793
├── 1008839
├── 1009700
├── 1033697
├── 1034342
...
├── 836482
├── 838329
├── 838961
├── 840877
├── 840881
├── 844050
├── 845960
├── 848163
├── 888395
├── 891478
└── 893858
154 directories, 0 files
Within each folder (zenodo id) the files are prefixed by the zenodo id, followed by the index into the original image set array that is provided with the full dinosaur dataset archive.
$ tree m/891531/ -L 1
m/891531/
├── 891531_0.png
├── 891531_10.png
├── 891531_11.png
├── 891531_12.png
├── 891531_13.png
├── 891531_14.png
├── 891531_15.png
├── 891531_16.png
├── 891531_17.png
├── 891531_18.png
├── 891531_19.png
├── 891531_1.png
├── 891531_20.png
├── 891531_21.png
├── 891531_22.png
├── 891531_23.png
├── 891531_24.png
├── 891531_25.png
├── 891531_26.png
├── 891531_27.png
├── 891531_28.png
├── 891531_29.png
├── 891531_2.png
├── 891531_30.png
├── 891531_3.png
├── 891531_4.png
├── 891531_5.png
├── 891531_6.png
├── 891531_7.png
├── 891531_8.png
└── 891531_9.png
0 directories, 31 files
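Since the filenames follow the <zenodo id>_<index>.png convention described above, both pieces are easy to recover programmatically. A minimal sketch, assuming the m archive has been extracted into the current directory:

import os

folder = 'm/891531'  # one of the zenodo id folders shown above
for name in sorted(os.listdir(folder)):
    zenodo_id, index = os.path.splitext(name)[0].split('_')
    print(zenodo_id, int(index), os.path.join(folder, name))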
So what's the difference?
The difference is that these files are organized by extension type and provided as actual png images. The original data is provided as numpy arrays, and is organized by zenodo ID. Both are useful for different things - this particular version is cool because we can actually see what a code image looks like.
How many images total?
We can count the number of total images:
find "." -type f -name *.png | wc -l
3,026,993
The script to create the dataset is provided here. Essentially, we start with the top extensions as identified by this work (excluding actual image files) and then write each 80x80 image to an actual png image, organized by extension and then by zenodo id (as shown above).
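The actual generation logic lives in that script; just to illustrate the layout, a rough sketch of the write loop might look like the following (the records dictionary and its fields are hypothetical stand-ins for the arrays shipped with the full archive):

import os
import cv2

def save_repo_images(records, outdir='data'):
    # records: {zenodo_id: {'ext': 'csv', 'images': [80x80 uint8 arrays, ...]}} (hypothetical structure)
    for zenodo_id, record in records.items():
        folder = os.path.join(outdir, record['ext'], str(zenodo_id))
        os.makedirs(folder, exist_ok=True)
        for i, image in enumerate(record['images']):
            # <zenodo id>_<index>.png, matching the naming convention above
            cv2.imwrite(os.path.join(folder, '%s_%s.png' % (zenodo_id, i)), image)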
I tested a few methods to write the single-channel 80x80 arrays as png images, and wound up liking cv2's imwrite function because it would save and then load back the exact same content.
import cv2
cv2.imwrite(image_path, image)
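A quick way to check that round trip is with a synthetic 80x80 array (the array and path here are made up just for the test):

import cv2
import numpy as np

image = np.full((80, 80), ord(' '), dtype=np.uint8)  # synthetic single-channel image
cv2.imwrite('/tmp/roundtrip.png', image)
loaded = cv2.imread('/tmp/roundtrip.png', cv2.IMREAD_GRAYSCALE)
assert np.array_equal(image, loaded)  # saved and loaded content are identical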
Given the above, it's pretty easy to load an image! Here is an example using imageio, followed by the older scipy approach (now deprecated; if you get a deprecation message from scipy, use imageio instead).
image_path = '/tmp/data1/data/csv/1009185/1009185_0.png'
from imageio import imread
image = imread(image_path)
image
array([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
image.shape
(80,80)
# Deprecated
from scipy import misc
misc.imread(image_path)
Image([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
Remember that the values in the data are characters that have been converted to ordinal. Can you guess what 32 is?
ord(' ')
32
# And thus if you wanted to convert it back...
chr(32)
' '
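And since every pixel is just the ordinal of a character, you can turn an image back into (roughly) the original chunk of code. A small sketch, reusing the image loaded above:

# Each row of the 80x80 array is one 80-character line, padded with spaces (ordinal 32)
lines = [''.join(chr(value) for value in row).rstrip() for row in image]
print('\n'.join(lines))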
So how t...