8 datasets found
  1. Dataset metadata of known Dataverse installations, August 2023

    • dataverse.harvard.edu
    Updated Aug 30, 2024
    + more versions
    Cite
    Julian Gautier (2024). Dataset metadata of known Dataverse installations, August 2023 [Dataset]. http://doi.org/10.7910/DVN/8FEGUV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations and improving understandings about how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation)_2023.08.22-2023.08.28.csv
    │   ├── contributor(citation)_2023.08.22-2023.08.28.csv
    │   ├── data_source(citation)_2023.08.22-2023.08.28.csv
    │   ├── ...
    │   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2023.08.27_12.59.59.zip
    │   │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
    │   │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
    │   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   │   │   └── ...
    │   │   └── metadatablocks_v5.6
    │   │       ├── astrophysics_v5.6.json
    │   │       ├── biomedical_v5.6.json
    │   │       ├── citation_v5.6.json
    │   │       ├── ...
    │   │       └── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
    │   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
    │   ├── Arca_Dados_2023.08.27_13.34.09.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
    ├── dataverse_installations_summary_2023.08.28.csv
    ├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
    ├── license_options_for_each_dataverse_installation_2023.09.05.csv
    └── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, where there's a row for author names, affiliations, identifier types and identifiers. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download.

    Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the dataset's Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
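    To illustrate how a hostname/apikey CSV like the one described above can drive authenticated downloads, here is a minimal Python sketch. It is not the author's actual script: the example file contents, token, and helper names are made up, though the export endpoint and the X-Dataverse-key header are standard Dataverse Native API conventions.

```python
import csv
import io
import urllib.request

def load_accounts(csv_text):
    """Read the two-column CSV described above: hostname, apikey."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def metadata_request(hostname, apikey, persistent_id):
    """Build an authenticated Dataverse JSON export request for one dataset."""
    url = (f"{hostname}/api/datasets/export"
           f"?exporter=dataverse_json&persistentId={persistent_id}")
    req = urllib.request.Request(url)
    if apikey:  # only needed for installations that require a token
        req.add_header("X-Dataverse-key", apikey)
    return req

# Example contents are made up; a real file would list one row per installation.
accounts = load_accounts("hostname,apikey\nhttps://demo.dataverse.org,0000-token\n")
req = metadata_request(accounts[0]["hostname"], accounts[0]["apikey"],
                       "doi:10.7910/DVN/8FEGUV")
print(req.full_url)  # the request could then be sent with urllib.request.urlopen
```

    Each request would be issued once per persistent ID listed in the dataset_pids CSV files.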

  2. R codes and dataset for Visualisation of Diachronic Constructional Change...

    • researchdata.edu.au
    Updated Apr 1, 2019
    Cite
    Gede Primahadi Wijaya Rajeg (2019). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Explore at:
    Dataset updated
    Apr 1, 2019
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication


    Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Release. So, check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of top-200 infinitival collocates for will and be going to respectively across the twenty decades of Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to) and (iv) will (frequency of the collocates with will); the result is available in input_data_raw.txt.

    Then, the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output of the second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).
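    The per-million-words normalisation performed by the second script boils down to a one-line formula. A small Python sketch (illustrative only; the actual R code and the real COHA decade sizes live in the repository and coha_size.txt, and the decade sizes below are made-up placeholders):

```python
# Per-million-words normalisation of a raw co-occurrence frequency,
# given the size of the corpus slice (here, one COHA decade).
def per_million_words(freq, decade_size):
    return freq / decade_size * 1_000_000

# Made-up decade sizes, NOT the real COHA figures.
decade_sizes = {"1810s": 1_200_000, "2000s": 29_000_000}
print(per_million_words(150, decade_sizes["1810s"]))  # 125.0
```

    This makes collocate frequencies comparable across decades of very different sizes.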

    The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.

  3. Fruits Classification Classification Dataset - resize-512x512-reflect

    • public.roboflow.com
    zip
    Updated Apr 6, 2020
    + more versions
    Cite
    Horea (2020). Fruits Classification Classification Dataset - resize-512x512-reflect [Dataset]. https://public.roboflow.com/classification/fruits-dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 6, 2020
    Dataset authored and provided by
    Horea
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    The Fruits dataset is an image classification dataset of various fruits against white backgrounds from various angles, originally open sourced by GitHub user horea. This is a subset of that full dataset.

    Example image: https://github.com/Horea94/Fruit-Images-Dataset/blob/master/Training/Apple%20Braeburn/101_100.jpg?raw=true

    Use Cases

    Build a fruit classifier! This could be a just-for-fun project, or the first step toward a color sorter for agricultural use cases before fruits make their way to market.

    Using this Dataset

    Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.


  4. MNIST Preprocessed

    • kaggle.com
    Updated Jul 24, 2019
    Cite
    Valentyn Sichkar (2019). MNIST Preprocessed [Dataset]. https://www.kaggle.com/valentynsichkar/mnist-preprocessed/kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 24, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Valentyn Sichkar
    Description

    📰 Related Paper

    Sichkar V. N. Effect of various dimension convolutional layer filters on traffic sign classification accuracy. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2019, vol. 19, no. 3, pp. 546-552. DOI: 10.17586/2226-1494-2019-19-3-546-552 (full text available at ResearchGate.net/profile/Valentyn_Sichkar)

    Test online with a custom digit here: https://valentynsichkar.name/mnist.html


    🎓 Related course for classification tasks

    Design, Train & Test deep CNN for Image Classification. Join the course & enjoy new opportunities to get deep learning skills: https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/

    https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/slideshow_classification.gif?raw=true


    🗺️ Concept Map of the Course

    https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/concept_map.png?raw=true


    👉 Join the Course

    https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/


    Content

    This is ready-to-use preprocessed data saved into a pickle file.
    Preprocessing stages are as follows:
    - Normalizing the whole dataset by dividing by 255.0.
    - Dividing the whole dataset into three subsets: train, validation and test.
    - Normalizing the whole dataset by subtracting the mean image and dividing by the standard deviation.
    - Transposing every subset to make channels come first.


    The mean image and standard deviation were calculated from the train dataset and applied to all subsets.
    When using a user's image for classification, it has to be preprocessed in the same way first: normalized, the mean image subtracted, and divided by the standard deviation.
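    The stages above can be sketched in NumPy as follows. This is a sketch of the described pipeline, not the author's code; the description does not state whether the standard deviation is scalar or per-pixel, so a scalar is assumed here, and the sample arrays are random stand-ins.

```python
import numpy as np

def preprocess(images, mean_image=None, std=None):
    """Apply the preprocessing described above to (N, 28, 28, 1) images."""
    x = images.astype(np.float32) / 255.0      # scale into [0, 1]
    if mean_image is None:
        mean_image = x.mean(axis=0)            # per-pixel mean of the train set
    if std is None:
        std = x.std()                          # scalar std (an assumption)
    x = (x - mean_image) / std                 # center and scale
    x = x.transpose(0, 3, 1, 2)                # channels first: (N, 1, 28, 28)
    return x, mean_image, std

train = np.random.randint(0, 256, (100, 28, 28, 1))
x_train, mean_image, std = preprocess(train)
# A user's image must reuse the *train* statistics:
user = np.random.randint(0, 256, (1, 28, 28, 1))
x_user, _, _ = preprocess(user, mean_image, std)
print(x_train.shape, x_user.shape)
```

    Reusing the train-set mean and standard deviation for new images matches the note above.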


    Data is written as a dictionary with the following keys:
    x_train: (59000, 1, 28, 28)
    y_train: (59000,)
    x_validation: (1000, 1, 28, 28)
    y_validation: (1000,)
    x_test: (1000, 1, 28, 28)
    y_test: (1000,)


    Contains pretrained weights model_params_ConvNet1.pickle for a model with the following architecture:
    Input --> Conv --> ReLU --> Pool --> Affine --> ReLU --> Affine --> Softmax


    Parameters:

    • Input is a 1-channel grayscale image.
    • Convolutional layer with 32 filters.
    • Pooling with stride 2 and height = width = 2.
    • Number of hidden neurons is 500.
    • Number of output neurons is 10.
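    Under the stated parameters, the tensor shapes through ConvNet1 can be walked as follows. This is only a sketch: the convolution filter size and padding are not stated, so shape-preserving ('same'-padded) convolutions are assumed.

```python
# Shape walk through ConvNet1, assuming 'same'-padded convolutions
# (the filter size is not stated in the description).
n, c, h, w = 1, 1, 28, 28       # input: 1-channel 28x28 grayscale image
c = 32                           # Conv: 32 filters -> (32, 28, 28)
h, w = h // 2, w // 2            # Pool: 2x2, stride 2 -> (32, 14, 14)
flat = c * h * w                 # inputs to the first Affine layer
hidden, out = 500, 10            # Affine -> ReLU -> Affine -> Softmax
print(flat, hidden, out)         # 6272 500 10
```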


    The architecture can also be visualized as follows:
    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3400968%2Fc23041248e82134b7d43ed94307b720e%2FModel_1_Architecture_MNIST.png?generation=1563654250901965&alt=media

    Acknowledgements

    Initial data is MNIST that was collected by Yann LeCun, Corinna Cortes, Christopher J.C. Burges.

  5. Titanic Research

    • kaggle.com
    Updated Dec 28, 2017
    Cite
    Roberto Williams (2017). Titanic Research [Dataset]. https://www.kaggle.com/robbat1/titanic-countries-full/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 28, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Roberto Williams
    License

    GNU GPL 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    This Titanic dataset is based on my research to correct a series of database inconsistencies in this well-known dataset.

    Content

    The only purpose of this research is practical knowledge related to data science and the desire to understand some aspects of the Titanic accident and its impacts.

    Acknowledgements

    The information present here was based on the following sources:

    DATA FILES

    The purpose of this project is to identify how the accident affected the passengers' countries and how economic standing influenced passenger survival, given the lack of safety structure on the vessel.

    To do so, it was necessary to create a new dataset with complete passenger information. To correct details such as age, official name and country of residence, photocopies of the original passenger lists, other datasets and passenger biographies were consulted.

    Below is indicated each resource and the data collected or consulted.

    BASIC DATASET

    The following Titanic dataset was used as the base dataset for this project, extracted from the GitHub account of the book "Efficient Amazon Machine Learning", published by Packt: https://github.com/alexisperrier/packt-aml/blob/master/ch4/original_titanic.csv

    DATASET EXTENSION

    In order to populate the dataset with correct data values, the following data sources were consulted:

    1. UK, RMS Titanic, Outward Passenger List, 1912. The database and the original photocopies of the passenger list were accessed in order to acquire additional information. This collection was accessed through Ancestry services but provided in association with The National Archives. https://search.ancestry.com/search/db.aspx?dbid=2970. Terms and Conditions: http://www.ancestry.com/cs/legal/termsandconditions#Usage.

    2. Encyclopedia Titanica. Database with the biographies of victims. https://www.encyclopedia-titanica.org

    3. Titanic - Titanic. Dataset with the biographies of victims. http://www.titanic-titanic.com/

    In order to resolve inconsistencies in the names used in passenger lists, the following websites were consulted:

    4. Find a Grave. Database with biographies and grave pictures with names and surnames. https://www.findagrave.com.

    5. Wikipedia. Online encyclopedia. Used to understand country changes over the years, for instance the change of political geography after World War I. http://www.wikipedia.org.

    Inspiration

    I want to know in depth the impact of this terrible accident.

  6. Minecraft Images Fake or Real

    • kaggle.com
    Updated Apr 20, 2021
    Cite
    Jeff Heaton (2021). Minecraft Images Fake or Real [Dataset]. https://www.kaggle.com/datasets/jeffheaton/mcfakes
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 20, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jeff Heaton
    License

    GNU LGPL 3.0: http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    This dataset allows you to try your hand at detecting fake images from real images. I trained a model on images that I collected from the Minecraft video game. From the provided link, you have access to my trained model, and can generate more fake data, if you like. However, if you would like additional real data, you will need to capture it from Minecraft yourself.

    The following is a real image from Minecraft: https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-34.jpg?raw=true

    This Minecraft image is obviously fake:

    https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-202.jpg?raw=true

    Some images are not as easily guessed, such as this fake image:

    https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-493.jpg?raw=true

    You will also have to contend with multiple times of day; darker images will be more difficult for your model.

    https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-477.jpg?raw=true

  7. PANDAcap SSH Honeypot Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Apr 22, 2020
    + more versions
    Cite
    Manolis Stamatogiannakis; Herbert Bos; Paul Groth (2020). PANDAcap SSH Honeypot Dataset [Dataset]. http://doi.org/10.5281/zenodo.3759652
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Apr 22, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Manolis Stamatogiannakis; Herbert Bos; Paul Groth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset of 63 PANDA traces, collected using the PANDAcap framework. The dataset aims to offer a starting point for the analysis of ssh brute-force attacks. The traces were collected over the course of approximately 3 days, from 21 to 23 February 2020. A VM was configured using PANDAcap so that it accepts all passwords for user root. When an ssh session starts for that user, PANDA is signaled by the recctrl plugin to start recording for 30 minutes.

    You can read more details about the experimental setup and an overview of the dataset in the EuroSec 2020 publication:

    • Manolis Stamatogiannakis, Herbert Bos, and Paul Groth. PANDAcap: A Framework for Streamlining Collection of Full-System Traces. In Proceedings of the 13th European Workshop on Systems Security, EuroSec '20, Heraklion, Greece, April 2020. doi: 10.1145/3380786.3391396, preprint: vusec.net

    The dataset is split in 3 zip files/directories:

    • rr: Contains the 63 PANDA traces of the dataset. The traces are in the upcoming RRArchive format. Note that PANDA support for the format is still a work in progress at the time of writing (April 2020). If you need to downgrade to the traditional PANDA trace format, you can use the snippet under "Additional information" below.
    • qcow: Contains the QCOW base image (ubuntu16-planb.qcow2) used to create the dataset, as well as the disk deltas for the 63 traces. These can be mounted to inspect the contents of the filesystem before and after each session. Quick instructions on how to mount and inspect a QCOW image can be found below.
    • pcap: Contains the pcap network traces for the sessions in the PANDA traces. These have been extracted using the PANDA network plugin. We decided to also include them in the dataset as standalone files for convenience.

    Additionally, we provide the PANDA linux kernel profile ubuntu16-planb-kernelinfo.conf, which can be used to analyze the traces using the PANDA osi_linux plugin.

    Additional information:

    • To convert RRArchive traces to the traditional PANDA format, run the following snippet inside the rr directory:
      # For each archive: extract it, flattening the directory layout into
      # dash-separated file names and dropping the "-metadata" suffix,
      # then delete the archive.
      for f in *.tar.gz; do
        tar -zxvf "$f" --exclude=PANDArr --xform='s%/%-%' --xform='s%-metadata%%'
        rm -f "$f"
      done
    • If you wish to reuse the VM image in your project, it is available as a standalone download through academictorrents.com, along with more detailed information on its contents.
    • If you wish to download individual samples rather than the whole dataset, you can use the dataset torrent file available through academictorrents.com. Unlike this Zenodo deposit, the files in the torrent have not been zipped.
    • A better formatted (and possibly more up-to-date) version of this information can be found here.
  8. Hand-drawn Shapes (HDS) Dataset

    • kaggle.com
    Updated Jun 27, 2022
    Cite
    Francois Robert (2022). Hand-drawn Shapes (HDS) Dataset [Dataset]. https://www.kaggle.com/datasets/frobert/handdrawn-shapes-hds-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 27, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Francois Robert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    I have created this Dataset for my app Mix on Pix.

    On GitHub: https://github.com/frobertpixto/hand-drawn-shapes-dataset

    See the complete DataSheet (as described in https://arxiv.org/pdf/1803.09010.pdf) for the HDS Dataset here.

    The Images

    One shape per image. Drawings exist for 4 shapes:
    - Rectangle
    - Ellipse
    - Triangle
    - Other

    https://github.com/frobertpixto/hand-drawn-shapes-dataset/blob/main/readme_images/train_images.png?raw=true

    The Dataset contains images (70px x 70px x 1 gray channel) distributed as:

    Shape      Images
    Total      27292
    Other       7287
    Rectangle   6956
    Ellipse     6454
    Triangle    6595

    The shapes have been size-normalized and centered in a fixed-size image.

    Vertices

    https://github.com/frobertpixto/hand-drawn-shapes-dataset/blob/main/processing/find_vertices/readme_images/vertices_ell.png?raw=true

    Quick geometry refresher:
    - Vertices in shapes are the points where two or more line segments or edges meet (like a corner for a rectangle).
    - Vertices of an ellipse are the 4 corner points at which the ellipse takes the maximum turn. Technically, an ellipse has 2 vertices and 2 co-vertices; we will call them all vertices here.
    - The singular of vertices is vertex.

    Coordinates of vertices are interesting as they are much more precise than the surrounding box used in object detection.
    Vertices allow us to determine the angle of the shape and its exact size.

    Labelling of vertices

    Labelling was done by me using a tool I created in Mix on Pix. For each image, the tool also generated a csv file with 1 line per vertex. Each vertex has:
    - an x coordinate between 0 and 1
    - a y coordinate between 0 and 1

    Where:
    - (0,0) is the top left corner of the image
    - (1,1) is the bottom right corner of the image

    Note that the vertices are in no particular order. I sort them clockwise in the Extract-Transform-Load (ETL) processing.

    Example of a .csv file content for vertices of a rectangle

    0.14,0.28
    0.87,0.29
    0.86,0.67
    0.14,0.67
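
    Such a file can be parsed, and the vertices ordered clockwise, with a few lines of Python. This is a sketch: the dataset's own ETL also sorts clockwise, though not necessarily with this exact method, and the angle sort shown assumes convex shapes.

```python
import csv
import io
import math

# Hypothetical reader for a vertex .csv as shown above: one "x,y" pair per
# line, coordinates in [0, 1], with (0, 0) the top-left corner of the image.
def read_vertices(text):
    return [(float(x), float(y)) for x, y in csv.reader(io.StringIO(text))]

# One way to sort vertices clockwise: order by angle around the centroid.
# Because the y axis grows downward in image coordinates, ascending atan2
# corresponds to clockwise order on screen.
def sort_clockwise(pts):
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    return sorted(pts, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

rect = read_vertices("0.14,0.28\n0.87,0.29\n0.86,0.67\n0.14,0.67\n")
print(sort_clockwise(rect))
```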
    

    Usefulness of vertices

    Aside from drawing shapes on images like in Mix on Pix, another real-life example could be to determine the direction of a car (rectangle) or a ship (ellipse) in a direct overhead view.

    Visualization and processing

    I have a few kernels that will allow you to see:
    - the samples in the Extract-Transform-Load (ETL) phase.
    - a complete example of processing (after the ETL).

    Notebooks - Classification - Shape

    Direct augmentation of the data:

    • 3 variations were generated per image:
      1. Normal
      2. 1.5 to 3.0 times wider
      3. 1.5 to 3.0 times narrower
    • One advantage I noticed:
      • People tend to draw equilibrated shapes (circle, square, equilateral triangle).
      • Most elongated images were interesting and sometimes presented a different challenge than the original.
    • This processing was not done for the type Other.
    • I validated them all manually (or we could say visually) and removed the generated images that were not interesting.
    • This is different from the augmentation done during training (like horizontal and vertical flips, or rotations) because:
      • It applies to all images, including the validation set and test set.
      • Being generated before being drawn provided images of better quality.
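    The wider/narrower variations can be sketched as a simple horizontal resampling. This is an illustrative sketch, not the author's Mix on Pix tooling; nearest-neighbor resampling and a center pad/crop back to 70 columns are assumptions.

```python
import numpy as np

def stretch_width(img, factor):
    """Resample a (70, 70) grayscale image to `factor` times its width using
    nearest-neighbor indexing, then pad/crop back to the original width."""
    h, w = img.shape
    new_w = max(1, int(round(w * factor)))
    cols = (np.arange(new_w) / factor).astype(int).clip(0, w - 1)
    out = img[:, cols]                    # horizontally resampled image
    if new_w >= w:                        # wider: crop the centre back to w
        left = (new_w - w) // 2
        return out[:, left:left + w]
    pad = w - new_w                       # narrower: zero-pad back to w
    left = pad // 2
    return np.pad(out, ((0, 0), (left, pad - left)))

img = np.random.randint(0, 256, (70, 70))
wide = stretch_width(img, 2.0)            # within the 1.5 to 3.0x "wider" range
narrow = stretch_width(img, 1 / 2.0)      # within the "narrower" range
print(wide.shape, narrow.shape)
```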

    I then used these images to train models that are used in Mix on Pix Auto-Shapes feature.

    People who drew the images

    Images were mostly generated by asking people I knew to draw Ellipses, Rec...

  9. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
