2 datasets found
  1. movie lens 32 ml

    • kaggle.com
    Updated Oct 24, 2024
    Cite
    Omer Kamal 12 (2024). movie lens 32 ml [Dataset]. https://www.kaggle.com/datasets/omerkamal12/movie-lens-32-ml/suggestions?status=pending&yourSuggestions=true
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 24, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Omer Kamal 12
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Omer Kamal 12

    Released under Apache 2.0

    Contents

  2. Zenodo Code Images

    • kaggle.com
    zip
    Updated Jun 18, 2018
    Cite
    Stanford Research Computing Center (2018). Zenodo Code Images [Dataset]. https://www.kaggle.com/datasets/stanfordcompute/code-images
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Jun 18, 2018
    Dataset authored and provided by
    Stanford Research Computing Center
    Description

    Code Images

    DOI

    Context

    This is a subset of the Zenodo-ML Dinosaur Dataset [Github] that has been converted to small png files and organized in folders by language, so you can jump right into machine learning methods that assume image input.

    Content

    Included are .tar.gz files, each named after a file extension. When extracted, each one produces a folder of the same name.

    $ tree -L 1
    .
    ├── c
    ├── cc
    ├── cpp
    ├── cs
    ├── css
    ├── csv
    ├── cxx
    ├── data
    ├── f90
    ├── go
    ├── html
    ├── java
    ├── js
    ├── json
    ├── m
    ├── map
    ├── md
    ├── txt
    └── xml
    

    And we can peep inside one (somewhat smaller) folder of the set to see that the subfolders are zenodo identifiers. A zenodo identifier corresponds to a single Github repository, so the png files in each subfolder are chunks of code of that extension type from a particular repository.

    $ tree map -L 1
    map
    ├── 1001104
    ├── 1001659
    ├── 1001793
    ├── 1008839
    ├── 1009700
    ├── 1033697
    ├── 1034342
    ...
    ├── 836482
    ├── 838329
    ├── 838961
    ├── 840877
    ├── 840881
    ├── 844050
    ├── 845960
    ├── 848163
    ├── 888395
    ├── 891478
    └── 893858
    
    154 directories, 0 files
    

    Within each folder (zenodo id), the file names are prefixed by the zenodo id, followed by the index into the original image set array that is provided with the full dinosaur dataset archive.

    $ tree m/891531/ -L 1
    m/891531/
    ├── 891531_0.png
    ├── 891531_10.png
    ├── 891531_11.png
    ├── 891531_12.png
    ├── 891531_13.png
    ├── 891531_14.png
    ├── 891531_15.png
    ├── 891531_16.png
    ├── 891531_17.png
    ├── 891531_18.png
    ├── 891531_19.png
    ├── 891531_1.png
    ├── 891531_20.png
    ├── 891531_21.png
    ├── 891531_22.png
    ├── 891531_23.png
    ├── 891531_24.png
    ├── 891531_25.png
    ├── 891531_26.png
    ├── 891531_27.png
    ├── 891531_28.png
    ├── 891531_29.png
    ├── 891531_2.png
    ├── 891531_30.png
    ├── 891531_3.png
    ├── 891531_4.png
    ├── 891531_5.png
    ├── 891531_6.png
    ├── 891531_7.png
    ├── 891531_8.png
    └── 891531_9.png
    
    0 directories, 31 files
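
The naming scheme described above can be unpacked with a few lines of Python. This is only a sketch, assuming the layout <extension>/<zenodo id>/<zenodo id>_<index>.png seen in the listings; the names CodeImage, parse_code_image and list_code_images are hypothetical helpers, not part of the dataset's tooling. It also sorts by numeric index, since tree lists 891531_10.png before 891531_1.png.

```python
import re
from pathlib import Path
from typing import List, NamedTuple


class CodeImage(NamedTuple):
    extension: str  # top-level folder, e.g. "m"
    zenodo_id: str  # second-level folder, e.g. "891531"
    index: int      # index into the original image set array
    path: Path


# File names are assumed to look like "<zenodo id>_<index>.png".
_NAME = re.compile(r"^(\d+)_(\d+)\.png$")


def parse_code_image(path: Path) -> CodeImage:
    """Split <extension>/<zenodo id>/<zenodo id>_<index>.png into its parts."""
    match = _NAME.match(path.name)
    if match is None:
        raise ValueError(f"unexpected file name: {path.name}")
    zenodo_id, index = match.group(1), int(match.group(2))
    if path.parent.name != zenodo_id:
        raise ValueError(f"{path} is not inside a folder named {zenodo_id}")
    return CodeImage(path.parent.parent.name, zenodo_id, index, path)


def list_code_images(folder: Path) -> List[CodeImage]:
    """Return the images of one zenodo-id folder, sorted by numeric index."""
    images = [parse_code_image(p) for p in folder.glob("*.png")]
    return sorted(images, key=lambda image: image.index)
```

For example, list_code_images(Path("m/891531")) would return the 31 files shown above in the order 0, 1, 2, ... rather than the lexical order printed by tree.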
    

    So what's the difference?

    The difference is that these files are organized by extension type, and provided as actual png images. The original data is provided as numpy data frames, and is organized by zenodo ID. Both are useful for different things - this particular version is cool because we can actually see what a code image looks like.
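
If you prefer the by-repository view of the original data, the extension folders can be regrouped by zenodo id. The following is a minimal sketch under the assumption that the extracted folders sit side by side under one root directory and that the same zenodo id may appear under more than one extension; group_by_zenodo_id is a hypothetical helper name.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_by_zenodo_id(root: Path) -> Dict[str, List[Path]]:
    """Map each zenodo id to every png stored under any extension folder."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    # Assumed layout: <root>/<extension>/<zenodo id>/<zenodo id>_<index>.png
    for png in sorted(root.glob("*/*/*.png")):
        groups[png.parent.name].append(png)
    return dict(groups)
```

Each value is a list of paths, so the extension of any image is still available as path.parent.parent.name.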

    How many images total?

    We can count the number of total images:

    find "." -type f -name '*.png' | wc -l
    3,026,993
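
The same count can be broken down per extension from Python. This is a sketch using only the standard library; count_pngs_by_extension is a hypothetical helper, and it assumes the top-level folders under the root are the extension folders shown earlier.

```python
import os
from collections import Counter


def count_pngs_by_extension(root: str) -> Counter:
    """Count .png files under root, keyed by the top-level folder name."""
    counts: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        relative = os.path.relpath(dirpath, root)
        if relative == os.curdir:
            continue  # files directly under root belong to no extension folder
        top = relative.split(os.sep)[0]
        found = sum(1 for name in filenames if name.endswith(".png"))
        if found:
            counts[top] += found
    return counts
```

Summing the result with sum(counts.values()) should agree with the find command above, provided no png files sit directly in the root.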
    

    Dataset Curation

    The script to create the dataset is provided here. Essentially, we start with the top extensions as identified by this work (excluding actual image files) and then write each 80x80 image to a png file, organizing by extension and then zenodo id (as shown above).

    Saving the Image

    I tested a few methods to write the single channel 80x80 data frames as png images, and wound up liking cv2's imwrite function because it would save and then load the exact same content.

    import cv2
    cv2.imwrite(image_path, image)
    

    Loading the Image

    Given the above, it's pretty easy to load an image! Here is an example using imageio, followed by the older scipy approach, which is deprecated in newer versions of SciPy (use imageio if you get a deprecation message).

    image_path = '/tmp/data1/data/csv/1009185/1009185_0.png'
    from imageio import imread
    
    image = imread(image_path)
    array([[116, 105, 109, ..., 32, 32, 32],
        [ 48, 44, 48, ..., 32, 32, 32],
        [ 48, 46, 49, ..., 32, 32, 32],
        ..., 
        [ 32, 32, 32, ..., 32, 32, 32],
        [ 32, 32, 32, ..., 32, 32, 32],
        [ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
    
    
    image.shape
    (80, 80)
    
    
    # Deprecated
    from scipy import misc
    misc.imread(image_path)
    
    Image([[116, 105, 109, ..., 32, 32, 32],
        [ 48, 44, 48, ..., 32, 32, 32],
        [ 48, 46, 49, ..., 32, 32, 32],
        ..., 
        [ 32, 32, 32, ..., 32, 32, 32],
        [ 32, 32, 32, ..., 32, 32, 32],
        [ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
    

    Remember that the values in the data are characters that have been converted to ordinal. Can you guess what 32 is?

    ord(' ')
    32
    
    # And thus if you wanted to convert it back...
    chr(32)
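
Applying chr to every value turns a whole image back into text. The sketch below assumes that each image row holds the characters of one line and that trailing 32 values are padding; rows_to_text is a hypothetical helper, and it works on any 2-D grid of integers (a nested list, or the array returned by imread).

```python
from typing import Iterable


def rows_to_text(image: Iterable[Iterable[int]]) -> str:
    """Turn a 2-D grid of ordinals back into text, one line per image row."""
    lines = []
    for row in image:
        line = "".join(chr(int(value)) for value in row)
        # Trailing spaces are treated as padding here; this also drops any
        # trailing spaces that were genuinely part of the original code.
        lines.append(line.rstrip(" "))
    return "\n".join(lines)


# First values of the first two rows of the example array above, plus padding.
sample = [
    [116, 105, 109, 32, 32],
    [48, 44, 48, 32, 32],
]
print(rows_to_text(sample))
```

Only the first three values of each row are taken from the example output; the rest of those rows is elided in the original, so the sample is illustrative rather than a real image.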
    

    So how t...
