8 datasets found
  1. Dataset metadata of known Dataverse installations, August 2023

    • dataverse.harvard.edu
    Updated Aug 30, 2024
    + more versions
    Cite
    Julian Gautier (2024). Dataset metadata of known Dataverse installations, August 2023 [Dataset]. http://doi.org/10.7910/DVN/8FEGUV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations and improving understandings about how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation)_2023.08.22-2023.08.28.csv
    │   ├── contributor(citation)_2023.08.22-2023.08.28.csv
    │   ├── data_source(citation)_2023.08.22-2023.08.28.csv
    │   ├── ...
    │   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2023.08.27_12.59.59.zip
    │   │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
    │   │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
    │   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   │   │   └── ...
    │   │   └── metadatablocks_v5.6
    │   │       ├── astrophysics_v5.6.json
    │   │       ├── biomedical_v5.6.json
    │   │       ├── citation_v5.6.json
    │   │       ├── ...
    │   │       └── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
    │   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
    │   ├── Arca_Dados_2023.08.27_13.34.09.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
    ├── dataverse_installations_summary_2023.08.28.csv
    ├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
    ├── license_options_for_each_dataverse_installation_2023.09.05.csv
    └── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, where there's a row for author names, affiliations, identifier types and identifiers. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download.

    Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the dataset's Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
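    To illustrate how a hostname/apikey CSV like the one described above can drive authenticated downloads, here is a minimal Python sketch. It is not the author's actual script: the example file contents, token, and helper names are made up, though the export endpoint and the X-Dataverse-key header are standard Dataverse Native API conventions.

```python
import csv
import io
import urllib.request

def load_accounts(csv_text):
    """Read the two-column CSV described above: hostname, apikey."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def metadata_request(hostname, apikey, persistent_id):
    """Build an authenticated Dataverse JSON export request for one dataset."""
    url = (f"{hostname}/api/datasets/export"
           f"?exporter=dataverse_json&persistentId={persistent_id}")
    req = urllib.request.Request(url)
    if apikey:  # only needed for installations that require a token
        req.add_header("X-Dataverse-key", apikey)
    return req

# Example contents are made up; a real file would list one row per installation.
accounts = load_accounts("hostname,apikey\nhttps://demo.dataverse.org,0000-token\n")
req = metadata_request(accounts[0]["hostname"], accounts[0]["apikey"],
                       "doi:10.7910/DVN/8FEGUV")
print(req.full_url)  # the request could then be sent with urllib.request.urlopen
```

    Each request would be issued once per persistent ID listed in the dataset_pids CSV files.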

  2. R codes and dataset for Visualisation of Diachronic Constructional Change...

    • researchdata.edu.au
    Updated Apr 1, 2019
    Cite
    Gede Primahadi Wijaya Rajeg (2019). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Explore at:
    Dataset updated
    Apr 1, 2019
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication


    Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Release. So, check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of top-200 infinitival collocates for will and be going to respectively across the twenty decades of Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to) and (iv) will (frequency of the collocates with will); the result is available in input_data_raw.txt.

    Then, the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output of the second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).
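    The per-million-words normalisation performed by the second script boils down to a one-line formula. A small Python sketch (illustrative only; the actual R code and the real COHA decade sizes live in the repository and coha_size.txt, and the decade sizes below are made-up placeholders):

```python
# Per-million-words normalisation of a raw co-occurrence frequency,
# given the size of the corpus slice (here, one COHA decade).
def per_million_words(freq, decade_size):
    return freq / decade_size * 1_000_000

# Made-up decade sizes, NOT the real COHA figures.
decade_sizes = {"1810s": 1_200_000, "2000s": 29_000_000}
print(per_million_words(150, decade_sizes["1810s"]))  # 125.0
```

    This makes collocate frequencies comparable across decades of very different sizes.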

    The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.

  3. Fruits Classification Classification Dataset - resize-512x512-reflect

    • public.roboflow.com
    zip
    Updated Apr 6, 2020
    + more versions
    Cite
    Horea (2020). Fruits Classification Classification Dataset - resize-512x512-reflect [Dataset]. https://public.roboflow.com/classification/fruits-dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 6, 2020
    Dataset authored and provided by
    Horea
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Overview

    The Fruits dataset is an image classification dataset of various fruits against white backgrounds from various angles, originally open sourced by GitHub user horea. This is a subset of that full dataset.

    Example image: https://github.com/Horea94/Fruit-Images-Dataset/blob/master/Training/Apple%20Braeburn/101_100.jpg?raw=true

    Use Cases

    Build a fruit classifier! This could be a just-for-fun project, or the first step toward a color sorter for agricultural use cases before fruits make their way to market.

    Using this Dataset

    Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.


  4. MNIST Preprocessed

    • kaggle.com
    Updated Jul 24, 2019
    Cite
    Valentyn Sichkar (2019). MNIST Preprocessed [Dataset]. https://www.kaggle.com/valentynsichkar/mnist-preprocessed/kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 24, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Valentyn Sichkar
    Description

    📰 Related Paper

    Sichkar V. N. Effect of various dimension convolutional layer filters on traffic sign classification accuracy. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2019, vol. 19, no. 3, pp. 546-552. DOI: 10.17586/2226-1494-2019-19-3-546-552 (full text available at ResearchGate.net/profile/Valentyn_Sichkar)

    Test online with a custom digit here: https://valentynsichkar.name/mnist.html


    🎓 Related course for classification tasks

    Design, Train & Test deep CNN for Image Classification. Join the course & enjoy new opportunities to get deep learning skills: https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/

    https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/slideshow_classification.gif?raw=true


    🗺️ Concept Map of the Course

    https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/concept_map.png?raw=true


    👉 Join the Course

    https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/


    Content

    This is ready-to-use preprocessed data saved into a pickle file.
    Preprocessing stages are as follows:
    - Normalizing the whole dataset by dividing by 255.0.
    - Dividing the whole dataset into three subsets: train, validation and test.
    - Normalizing the whole dataset by subtracting the mean image and dividing by the standard deviation.
    - Transposing every subset to make channels come first.


    The mean image and standard deviation were calculated from the train dataset and applied to all subsets.
    When using a user's image for classification, it has to be preprocessed in the same way first: normalized, the mean image subtracted, and divided by the standard deviation.
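    The stages above can be sketched in NumPy as follows. This is a sketch of the described pipeline, not the author's code; the description does not state whether the standard deviation is scalar or per-pixel, so a scalar is assumed here, and the sample arrays are random stand-ins.

```python
import numpy as np

def preprocess(images, mean_image=None, std=None):
    """Apply the preprocessing described above to (N, 28, 28, 1) images."""
    x = images.astype(np.float32) / 255.0      # scale into [0, 1]
    if mean_image is None:
        mean_image = x.mean(axis=0)            # per-pixel mean of the train set
    if std is None:
        std = x.std()                          # scalar std (an assumption)
    x = (x - mean_image) / std                 # center and scale
    x = x.transpose(0, 3, 1, 2)                # channels first: (N, 1, 28, 28)
    return x, mean_image, std

train = np.random.randint(0, 256, (100, 28, 28, 1))
x_train, mean_image, std = preprocess(train)
# A user's image must reuse the *train* statistics:
user = np.random.randint(0, 256, (1, 28, 28, 1))
x_user, _, _ = preprocess(user, mean_image, std)
print(x_train.shape, x_user.shape)
```

    Reusing the train-set mean and standard deviation for new images matches the note above.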


    Data is written as a dictionary with the following keys:
    x_train: (59000, 1, 28, 28)
    y_train: (59000,)
    x_validation: (1000, 1, 28, 28)
    y_validation: (1000,)
    x_test: (1000, 1, 28, 28)
    y_test: (1000,)


    Contains pretrained weights model_params_ConvNet1.pickle for a model with the following architecture:
    Input --> Conv --> ReLU --> Pool --> Affine --> ReLU --> Affine --> Softmax


    Parameters:

    • Input is a 1-channel grayscale image.
    • Convolutional layer with 32 filters.
    • Pooling with stride 2 and height = width = 2.
    • Number of hidden neurons is 500.
    • Number of output neurons is 10.
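    Under the stated parameters, the tensor shapes through ConvNet1 can be walked as follows. This is only a sketch: the convolution filter size and padding are not stated, so shape-preserving ('same'-padded) convolutions are assumed.

```python
# Shape walk through ConvNet1, assuming 'same'-padded convolutions
# (the filter size is not stated in the description).
n, c, h, w = 1, 1, 28, 28       # input: 1-channel 28x28 grayscale image
c = 32                           # Conv: 32 filters -> (32, 28, 28)
h, w = h // 2, w // 2            # Pool: 2x2, stride 2 -> (32, 14, 14)
flat = c * h * w                 # inputs to the first Affine layer
hidden, out = 500, 10            # Affine -> ReLU -> Affine -> Softmax
print(flat, hidden, out)         # 6272 500 10
```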


    The architecture can also be visualized as follows:
    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3400968%2Fc23041248e82134b7d43ed94307b720e%2FModel_1_Architecture_MNIST.png?generation=1563654250901965&alt=media

    Acknowledgements

    Initial data is MNIST that was collected by Yann LeCun, Corinna Cortes, Christopher J.C. Burges.

  5. Titanic Research

    • kaggle.com
    Updated Dec 28, 2017
    Cite
    Roberto Williams (2017). Titanic Research [Dataset]. https://www.kaggle.com/robbat1/titanic-countries-full/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 28, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Roberto Williams
    License

    GNU GPL 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    This Titanic dataset is based on my research to correct a series of database inconsistencies in this well-known dataset.

    Content

    The only purpose of this research is practical knowledge related to data science and the desire to understand some aspects of the Titanic accident and its impacts.

    Acknowledgements

    The information present here was based on the following sources:

    DATA FILES

    The purpose of this project is to identify how the accident affected the passengers' countries and how economic standing influenced passenger survival, given the lack of safety structure on the vessel.

    To do so, it was necessary to create a new dataset with complete passenger information. To correct details such as age, official name and country of residence, photocopies of the original passenger lists, other datasets and passenger biographies were consulted.

    Below is indicated each resource and the data collected or consulted.

    BASIC DATASET

    The following Titanic dataset was used as the base dataset for this project, extracted from the GitHub account of the book "Efficient Amazon Machine Learning", published by Packt: https://github.com/alexisperrier/packt-aml/blob/master/ch4/original_titanic.csv

    DATASET EXTENSION

    In order to populate the dataset with correct data values, the following data sources were consulted:

    1. UK, RMS Titanic, Outward Passenger List, 1912. The database and the original photocopies of the passenger list were accessed in order to acquire additional information. This collection was accessed through Ancestry services but provided in association with The National Archives. https://search.ancestry.com/search/db.aspx?dbid=2970. Terms and Conditions: http://www.ancestry.com/cs/legal/termsandconditions#Usage.

    2. Encyclopedia Titanica. Database with the biographies of victims. https://www.encyclopedia-titanica.org

    3. Titanic - Titanic. Dataset with the biographies of victims. http://www.titanic-titanic.com/

    In order to resolve inconsistencies in the names used in passenger lists, the following websites were consulted:

    4. Find a Grave. Database with biographies and grave pictures with names and surnames. https://www.findagrave.com.

    5. Wikipedia. Online encyclopedia. Used to understand country changes over the years, for instance the change of political geography after World War I. http://www.wikipedia.org.

    Inspiration

    I want to know in depth the impact of this terrible accident.

  6. Minecraft Images Fake or Real

    • kaggle.com
    Updated Apr 20, 2021
    Cite
    Jeff Heaton (2021). Minecraft Images Fake or Real [Dataset]. https://www.kaggle.com/datasets/jeffheaton/mcfakes
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 20, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jeff Heaton
    License

    GNU LGPL 3.0: http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    This dataset allows you to try your hand at detecting fake images from real images. I trained a model on images that I collected from the Minecraft video game. From the provided link, you have access to my trained model, and can generate more fake data, if you like. However, if you would like additional real data, you will need to capture it from Minecraft yourself.

    The following is a real image from Minecraft: https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-34.jpg?raw=true

    This Minecraft image is obviously fake:

    https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-202.jpg?raw=true

    Some images are not as easily guessed, such as this fake image:

    https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-493.jpg?raw=true

    You will also have to contend with multiple times of day; darker images will be more difficult for your model.

    https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-477.jpg?raw=true

  7. PANDAcap SSH Honeypot Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Apr 22, 2020
    + more versions
    Cite
    Manolis Stamatogiannakis; Herbert Bos; Paul Groth (2020). PANDAcap SSH Honeypot Dataset [Dataset]. http://doi.org/10.5281/zenodo.3759652
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Apr 22, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Manolis Stamatogiannakis; Herbert Bos; Paul Groth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset of 63 PANDA traces, collected using the PANDAcap framework. The dataset aims to offer a starting point for the analysis of ssh brute-force attacks. The traces were collected over the course of approximately 3 days, from 21 to 23 February 2020. A VM was configured using PANDAcap so that it accepts all passwords for user root. When an ssh session starts for that user, PANDA is signaled by the recctrl plugin to start recording for 30 minutes.

    You can read more details about the experimental setup and an overview of the dataset in the EuroSec 2020 publication:

    • Manolis Stamatogiannakis, Herbert Bos, and Paul Groth. PANDAcap: A Framework for Streamlining Collection of Full-System Traces. In Proceedings of the 13th European Workshop on Systems Security, EuroSec '20, Heraklion, Greece, April 2020. doi: 10.1145/3380786.3391396, preprint: vusec.net

    The dataset is split in 3 zip files/directories:

    • rr: Contains the 63 PANDA traces of the dataset. The traces are in the upcoming RRArchive format. Note that PANDA support for the format is still a work in progress at the time of writing (April 2020). If you need to downgrade to the traditional PANDA trace format, you can use the snippet under "Additional information" below.
    • qcow: Contains the QCOW base image (ubuntu16-planb.qcow2) used to create the dataset, as well as the disk deltas for the 63 traces. These can be mounted to inspect the contents of the filesystem before and after each session. Quick instructions on how to mount and inspect a QCOW image can be found below.
    • pcap: Contains the pcap network traces for the sessions in the PANDA traces. These have been extracted using the PANDA network plugin. We decided to also include them in the dataset as standalone files for convenience.

    Additionally, we provide the PANDA linux kernel profile ubuntu16-planb-kernelinfo.conf, which can be used to analyze the traces using the PANDA osi_linux plugin.

    Additional information:

    • To convert RRArchive traces to the traditional PANDA format, run the following snippet inside the rr directory:
      # For each archive: extract it, flattening the directory layout into
      # dash-separated file names and dropping the "-metadata" suffix,
      # then delete the archive.
      for f in *.tar.gz; do
        tar -zxvf "$f" --exclude=PANDArr --xform='s%/%-%' --xform='s%-metadata%%'
        rm -f "$f"
      done
    • If you wish to reuse the VM image in your project, it is available as a standalone download through academictorrents.com, along with more detailed information on its contents.
    • If you wish to download individual samples rather than the whole dataset, you can use the dataset torrent file available through academictorrents.com. Unlike this Zenodo deposit, the files in the torrent have not been zipped.
    • A better formatted (and possibly more up-to-date) version of this information can be found here.
  8. Hand-drawn Shapes (HDS) Dataset

    • kaggle.com
    Updated Jun 27, 2022
    Cite
    Francois Robert (2022). Hand-drawn Shapes (HDS) Dataset [Dataset]. https://www.kaggle.com/datasets/frobert/handdrawn-shapes-hds-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 27, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Francois Robert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    I have created this Dataset for my app Mix on Pix.

    On GitHub: https://github.com/frobertpixto/hand-drawn-shapes-dataset

    See the complete DataSheet (as described in https://arxiv.org/pdf/1803.09010.pdf) for the HDS Dataset here.

    The Images

    One shape per image. Drawings exist for 4 shapes:
    - Rectangle
    - Ellipse
    - Triangle
    - Other

    https://github.com/frobertpixto/hand-drawn-shapes-dataset/blob/main/readme_images/train_images.png?raw=true

    The Dataset contains images (70px x 70px x 1 gray channel) distributed as:

    Shape      Images
    Total      27292
    Other       7287
    Rectangle   6956
    Ellipse     6454
    Triangle    6595

    The shapes have been size-normalized and centered in a fixed-size image.

    Vertices

    https://github.com/frobertpixto/hand-drawn-shapes-dataset/blob/main/processing/find_vertices/readme_images/vertices_ell.png?raw=true

    Quick geometry refresher:
    - Vertices in shapes are the points where two or more line segments or edges meet (like a corner for a rectangle).
    - Vertices of an ellipse are the 4 corner points at which the ellipse takes the maximum turn. Technically, an ellipse has 2 vertices and 2 co-vertices; we will call them all vertices here.
    - The singular of vertices is vertex.

    Coordinates of vertices are interesting as they are much more precise than the surrounding box used in object detection.
    Vertices allow us to determine the angle of the shape and its exact size.

    Labelling of vertices

    Labelling was done by me using a tool I created in Mix on Pix. For each image, the tool also generated a csv file with 1 line per vertex. Each vertex has:
    - an x coordinate between 0 and 1
    - a y coordinate between 0 and 1

    Where:
    - (0,0) is the top left corner of the image
    - (1,1) is the bottom right corner of the image

    Note that the vertices are in no particular order. I sort them clockwise in the Extract-Transform-Load (ETL) processing.

    Example of a .csv file content for vertices of a rectangle

    0.14,0.28
    0.87,0.29
    0.86,0.67
    0.14,0.67
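
    Such a file can be parsed, and the vertices ordered clockwise, with a few lines of Python. This is a sketch: the dataset's own ETL also sorts clockwise, though not necessarily with this exact method, and the angle sort shown assumes convex shapes.

```python
import csv
import io
import math

# Hypothetical reader for a vertex .csv as shown above: one "x,y" pair per
# line, coordinates in [0, 1], with (0, 0) the top-left corner of the image.
def read_vertices(text):
    return [(float(x), float(y)) for x, y in csv.reader(io.StringIO(text))]

# One way to sort vertices clockwise: order by angle around the centroid.
# Because the y axis grows downward in image coordinates, ascending atan2
# corresponds to clockwise order on screen.
def sort_clockwise(pts):
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    return sorted(pts, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

rect = read_vertices("0.14,0.28\n0.87,0.29\n0.86,0.67\n0.14,0.67\n")
print(sort_clockwise(rect))
```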
    

    Usefulness of vertices

    Aside from drawing shapes on images like in Mix on Pix, another real-life example could be to determine the direction of a car (rectangle) or a ship (ellipse) in a direct overhead view.

    Visualization and processing

    I have a few kernels that will allow you to see:
    - the samples in the Extract-Transform-Load (ETL) phase.
    - a complete example of processing (after the ETL).

    Notebooks - Classification - Shape

    Direct augmentation of the data:

    • 3 variations were generated per image:
      1. Normal
      2. 1.5 to 3.0 times wider
      3. 1.5 to 3.0 times narrower
    • One advantage I noticed:
      • People tend to draw equilibrated shapes (circle, square, equilateral triangle).
      • Most elongated images were interesting and sometimes presented a different challenge than the original.
    • This processing was not done for the type Other.
    • I validated them all manually (or we could say visually) and removed the generated images that were not interesting.
    • This is different from the augmentation done during training (like horizontal and vertical flips, or rotations) because:
      • It applies to all images, including the validation set and test set.
      • Being generated before being drawn provided images of better quality.
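    The wider/narrower variations can be sketched as a simple horizontal resampling. This is an illustrative sketch, not the author's Mix on Pix tooling; nearest-neighbor resampling and a center pad/crop back to 70 columns are assumptions.

```python
import numpy as np

def stretch_width(img, factor):
    """Resample a (70, 70) grayscale image to `factor` times its width using
    nearest-neighbor indexing, then pad/crop back to the original width."""
    h, w = img.shape
    new_w = max(1, int(round(w * factor)))
    cols = (np.arange(new_w) / factor).astype(int).clip(0, w - 1)
    out = img[:, cols]                    # horizontally resampled image
    if new_w >= w:                        # wider: crop the centre back to w
        left = (new_w - w) // 2
        return out[:, left:left + w]
    pad = w - new_w                       # narrower: zero-pad back to w
    left = pad // 2
    return np.pad(out, ((0, 0), (left, pad - left)))

img = np.random.randint(0, 256, (70, 70))
wide = stretch_width(img, 2.0)            # within the 1.5 to 3.0x "wider" range
narrow = stretch_width(img, 1 / 2.0)      # within the "narrower" range
print(wide.shape, narrow.shape)
```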

    I then used these images to train models that are used in Mix on Pix Auto-Shapes feature.

    People who drew the images

    Images were mostly generated by asking people I knew to draw Ellipses, Rec...

  9. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
