CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset- and file-level metadata within and across Dataverse installations and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well the Dataverse software and the repositories using it help depositors describe data.

How the metadata was downloaded
The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL at which I was able to create an account, and another column named "apikey" listing my accounts' API tokens. The Python script expects this CSV file and uses the listed API tokens to get metadata and other information from installations that require them (a simplified sketch of this approach is included at the end of this description).

How the files are organized
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation)_2023.08.22-2023.08.28.csv
│   ├── contributor(citation)_2023.08.22-2023.08.28.csv
│   ├── data_source(citation)_2023.08.22-2023.08.28.csv
│   ├── ...
│   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2023.08.27_12.59.59.zip
│   │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
│   │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
│   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
│   ├── Arca_Dados_2023.08.27_13.34.09.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
├── dataverse_installations_summary_2023.08.28.csv
├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
├── license_options_for_each_dataverse_installation_2023.09.05.csv
└── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, with a row for author names, affiliations, identifier types and identifiers.
The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column indicating whether the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name, which should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the datasets' Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
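The full download logic lives in the linked get_dataset_metadata_of_all_installations.py script. Purely as an illustration of the approach described above (not the script itself), a minimal Python sketch using the Dataverse Search API and metadata export API could look like this; the CSV file name is a placeholder, and paging and error handling are omitted:

import csv
import json
import requests

def download_dataverse_json(base_url, api_key=None, per_page=10):
    """Yield (persistent_id, Dataverse JSON export) for datasets in one installation."""
    headers = {"X-Dataverse-key": api_key} if api_key else {}
    # Search API: list published datasets (only the first page, for brevity).
    search = requests.get(
        f"{base_url}/api/search",
        params={"q": "*", "type": "dataset", "per_page": per_page},
        headers=headers,
    ).json()
    for item in search["data"]["items"]:
        pid = item["global_id"]
        # Export the latest published version's metadata in the "dataverse_json" format.
        export = requests.get(
            f"{base_url}/api/datasets/export",
            params={"exporter": "dataverse_json", "persistentId": pid},
            headers=headers,
        )
        yield pid, export.json()

# "installation_api_keys.csv" is a placeholder for the hostname/apikey CSV described above;
# this assumes the "hostname" values include the scheme, e.g. https://demo.dataverse.org.
with open("installation_api_keys.csv", newline="") as f:
    for row in csv.DictReader(f):
        for pid, metadata in download_dataverse_json(row["hostname"], row["apikey"]):
            filename = pid.replace("/", "_").replace(":", "_") + ".json"
            with open(filename, "w") as out:
                json.dump(metadata, out)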
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Publication
The raw input data are stored in two files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates of will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s). The first script, 1-script-create-input-data-raw.r, preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for the frequency of the collocates with be going to) and (iv) will (for the frequency of the collocates with will); the result is available in input_data_raw.txt. The second script, 2-script-create-motion-chart-input-data.R, processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output of the second script is input_data_futurate.txt, which contains the relevant input data for generating (i) the static motion chart included as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R) and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R). Use the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
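The per-million-words normalisation performed by the second script is conceptually simple. Purely as a rough illustration (the authoritative computation is the one in 2-script-create-motion-chart-input-data.R; the column names and tab-separated layout assumed here may differ from the actual files):

import pandas as pd

# Assumed layouts: input_data_raw.txt has columns decade, coll, "BE going to", will;
# coha_size.txt maps each decade to its corpus size in words. Both assumed tab-separated.
raw = pd.read_csv("input_data_raw.txt", sep="\t")
sizes = pd.read_csv("coha_size.txt", sep="\t")  # columns assumed: decade, size

df = raw.merge(sizes, on="decade")
for construction in ["BE going to", "will"]:
    # normalised frequency = raw co-occurrence frequency / decade corpus size * 1,000,000
    df[construction + "_pmw"] = df[construction] / df["size"] * 1_000_000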
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Fruits dataset is an image classification dataset of various fruits photographed against white backgrounds from various angles, originally open-sourced by GitHub user Horea94. This is a subset of that full dataset.
Example Image:
https://github.com/Horea94/Fruit-Images-Dataset/blob/master/Training/Apple%20Braeburn/101_100.jpg?raw=true
Build a fruit classifier! This could be a just-for-fun project, or you could be building a color sorter for agricultural use cases before fruits make their way to market.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or with additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
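If you would rather pull an export programmatically than through the web UI, Roboflow's Python package supports something along these lines; the workspace, project, and version identifiers below are placeholders, and the export format string depends on how the version is generated:

from roboflow import Roboflow  # pip install roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
# Placeholder workspace/project names: use the ones shown on your forked copy of the dataset.
project = rf.workspace("your-workspace").project("fruits")
dataset = project.version(1).download("folder")  # "folder" is a classification-style export
print(dataset.location)  # local path of the downloaded export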
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers using Roboflow's workflow reduce their code by roughly 50%, automate annotation quality assurance, save training time, and increase model reproducibility.
Sichkar V. N. Effect of various dimension convolutional layer filters on traffic sign classification accuracy. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2019, vol. 19, no. 3, pp. 546-552. DOI: 10.17586/2226-1494-2019-19-3-546-552 (full text available at ResearchGate.net/profile/Valentyn_Sichkar)
Test online with custom Traffic Sign here: https://valentynsichkar.name/mnist.html
Design, Train & Test deep CNN for Image Classification. Join the course & enjoy new opportunities to get deep learning skills: https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/
CNN Course: https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/slideshow_classification.gif?raw=true
Concept map: https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/concept_map.png?raw=true
This is ready-to-use preprocessed data saved into a pickle file.
Preprocessing stages are as follows:
- Normalizing the whole dataset by dividing by 255.0.
- Dividing the whole dataset into three sets: train, validation and test.
- Normalizing the data by subtracting the mean image and dividing by the standard deviation.
- Transposing every dataset to make channels come first.
The mean image and standard deviation were calculated from the train dataset and applied to all datasets. When using your own image for classification, it has to be preprocessed first in the same way: normalized, with the mean image subtracted and divided by the standard deviation.
Data is written as a dictionary with the following keys:
x_train: (59000, 1, 28, 28)
y_train: (59000,)
x_validation: (1000, 1, 28, 28)
y_validation: (1000,)
x_test: (1000, 1, 28, 28)
y_test: (1000,)
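As a minimal sketch of how data in this shape might be consumed (the pickle file name is a placeholder, and the mean image and standard deviation are passed in explicitly because only the already-normalised arrays are listed above):

import pickle
import numpy as np

# Placeholder file name: use the actual data pickle shipped with the dataset.
with open("data.pickle", "rb") as f:
    data = pickle.load(f)

for key in ("x_train", "y_train", "x_validation", "y_validation", "x_test", "y_test"):
    print(key, data[key].shape)  # e.g. x_train -> (59000, 1, 28, 28)

def preprocess_user_image(img, mean_image, std):
    """Apply the same steps described above to a single 28x28 grayscale image."""
    x = img.astype(np.float32) / 255.0  # scale pixel values to [0, 1]
    x = (x - mean_image) / std          # normalise with the train-set statistics
    return x.reshape(1, 1, 28, 28)      # channels first, batch of one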
The dataset also contains pretrained weights (model_params_ConvNet1.pickle) for a model with the following architecture:
Input --> Conv --> ReLU --> Pool --> Affine --> ReLU --> Affine --> Softmax
Parameters: for the Pool layer, stride is 2 and height = width = 2.
The architecture can also be visualized as follows:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3400968%2Fc23041248e82134b7d43ed94307b720e%2FModel_1_Architecture_MNIST.png?generation=1563654250901965&alt=media
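For orientation only (this is not a loader for model_params_ConvNet1.pickle), an equivalent layer stack written in PyTorch might look roughly like this; only the 2x2 pooling is stated above, so the filter count, kernel size, and hidden-layer width are assumptions:

import torch.nn as nn

# Sketch of the Input --> Conv --> ReLU --> Pool --> Affine --> ReLU --> Affine --> Softmax stack.
# Filter count (32), kernel size (3x3) and hidden width (128) are assumptions, not the dataset's values.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),  # Conv over a 1x28x28 input
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),       # 2x2 pool with stride 2, as stated
    nn.Flatten(),
    nn.Linear(32 * 14 * 14, 128),                # first Affine layer
    nn.ReLU(),
    nn.Linear(128, 10),                          # second Affine layer, one output per digit
    nn.Softmax(dim=1),                           # in practice usually folded into the loss
)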
The initial data is MNIST, which was collected by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges.
GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This Titanic dataset is based on my research to correct a series of database inconsistencies in this well-known dataset.
The only purpose of this research is practical knowledge related to data science and the desire to understand some aspects of the Titanic accident and its impacts.
The information present here was based on the following sources:
DATA FILES
The purpose of this project is to identify how the accident affected the passengers' countries and how economic factors influenced passenger survival, given the lack of safety structures on the vessel.
To do so, it was necessary to create a new, more complete passenger dataset. To correct information such as age, official name and country of residence, photocopies of the original passenger lists, other datasets, and passenger biographies were consulted.
Below is indicated each resource and the data collected or consulted.
BASIC DATASET
The following Titanic dataset, extracted from the GitHub account of the book "Efficient Amazon Machine Learning" (published by Packt), was used as the base dataset for this project: https://github.com/alexisperrier/packt-aml/blob/master/ch4/original_titanic.csv
DATASET EXTENSION
In order to populate the dataset with correct data values, the following data sources were consulted:
1. UK, RMS Titanic, Outward Passenger List, 1912. The database and the original photocopies of the passenger lists were accessed in order to acquire additional information. This collection was accessed through Ancestry's services, provided in association with The National Archives. https://search.ancestry.com/search/db.aspx?dbid=2970. Terms and Conditions: http://www.ancestry.com/cs/legal/termsandconditions#Usage.
2. Encyclopedia Titanica. Database with biographies of victims. https://www.encyclopedia-titanica.org
3. Titanic - Titanic. Dataset with biographies of victims. http://www.titanic-titanic.com/
In order to resolve inconsistencies in the names used in the passenger lists, the following websites were also consulted:
4. Find a Grave. Database with biographies and grave pictures showing names and surnames. https://www.findagrave.com.
5. Wikipedia. Online encyclopedia. Used to understand country changes over the years, for instance the changes in political geography after World War I. http://www.wikipedia.org.
I want to know in depth the impact of this terrible accident.
GNU Lesser General Public License v3.0: http://www.gnu.org/licenses/lgpl-3.0.html
This dataset allows you to try your hand at detecting fake images from real images. I trained a model on images that I collected from the Minecraft video game. From the provided link, you have access to my trained model, and can generate more fake data, if you like. However, if you would like additional real data, you will need to capture it from Minecraft yourself.
The following is a real image from Minecraft:
https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-34.jpg?raw=true
This Minecraft image is obviously fake:
https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-202.jpg?raw=true
Some images are not as easily guessed, such as this fake image:
https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-493.jpg?raw=true
You will also have to contend with images captured at different times of day. Darker images will be more difficult for your model.
https://github.com/jeffheaton/jheaton_images/blob/main/kaggle/spring-2021/mc-477.jpg?raw=true
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of 63 PANDA traces, collected using the PANDAcap framework. The dataset aims to offer a starting point for the analysis of ssh brute-force attacks. The traces were collected over the course of approximately 3 days, from 21 to 23 February 2020. A VM was configured using PANDAcap so that it accepts all passwords for the user root. When an ssh session starts for that user, PANDA is signaled by the recctrl plugin to start recording for 30 minutes.
You can read more details about the experimental setup and an overview of the dataset in the EuroSec 2020 publication:
Manolis Stamatogiannakis, Herbert Bos, and Paul Groth. PANDAcap: A Framework for Streamlining Collection of Full-System Traces. In Proceedings of the 13th European Workshop on Systems Security, EuroSec '20, Heraklion, Greece, April 2020. doi: 10.1145/3380786.3391396, preprint: vusec.net
The dataset is split into 3 zip files/directories. Among them are the VM image (ubuntu16-planb.qcow2) used to create the dataset, as well as the disk deltas for the 63 traces; these can be mounted to inspect the contents of the filesystem before and after each session. Quick instructions on how to mount and inspect a QCOW image can be found below. Additionally, we provide the PANDA Linux kernel profile ubuntu16-planb-kernelinfo.conf, which can be used to analyze the traces using the PANDA osi_linux plugin.
Additional information:
To unpack the trace archives inside the rr directory, you can run:
# Unpack each trace archive (skipping PANDArr entries), flattening the top directory
# level into the file name and dropping the "-metadata" suffix, then remove the archive.
for f in *.tar.gz; do
    tar -zxvf "$f" --exclude=PANDArr --xform='s%/%-%' --xform='s%-metadata%%'
    rm -f "$f"
done
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
I created this dataset for my app Mix on Pix.
On GitHub: https://github.com/frobertpixto/hand-drawn-shapes-dataset
See the complete DataSheet (as described in https://arxiv.org/pdf/1803.09010.pdf) for the HDS Dataset here.
One shape per image. Drawings exist for 4 shapes:
- Rectangle
- Ellipse
- Triangle
- Other
Image examples: https://github.com/frobertpixto/hand-drawn-shapes-dataset/blob/main/readme_images/train_images.png?raw=true
The dataset contains images (70px x 70px x 1 gray channel) distributed as:

| Total | Other | Rectangle | Ellipse | Triangle |
|---|---|---|---|---|
| 27292 images | 7287 | 6956 | 6454 | 6595 |
The shapes have been size-normalized and centered in a fixed-size image.
Vertices for ellipses: https://github.com/frobertpixto/hand-drawn-shapes-dataset/blob/main/processing/find_vertices/readme_images/vertices_ell.png?raw=true
Quick geometry refresher:
- Vertices in shapes are the points where two or more line segments or edges meet (like a corner of a rectangle).
- Vertices of an ellipse are the 4 corner points at which the ellipse takes the maximum turn. Technically, an ellipse has 2 vertices and 2 co-vertices; we will call them all vertices here.
- The singular of vertices is vertex.
Coordinates of vertices are interesting as they are much more precise than just a surrounding box used in Object detection.
Vertices allow us to determine the angle of the shape and its exact size.
Labelling was done by me using a tool I created in Mix on Pix. For each image, the tool also generated a CSV file with one line per vertex. Each vertex has:
- an x coordinate between 0 and 1
- a y coordinate between 0 and 1
where (0,0) is the top-left corner of the image and (1,1) is the bottom-right corner of the image.
Note that the vertices are in no particular order; I sort them clockwise in the Extract-Transform-Load (ETL) processing. The four lines below are an example of one such vertex file:
0.14,0.28
0.87,0.29
0.86,0.67
0.14,0.67
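Since the vertex order in each file is arbitrary, one way to reproduce the clockwise ordering mentioned above is to sort the points by angle around their centroid. A minimal sketch (the file name is a placeholder, and this is an illustration, not the dataset's actual ETL code):

import csv
import math

def load_vertices(path):
    """Read one vertex per line as (x, y) pairs in [0, 1] image coordinates."""
    with open(path, newline="") as f:
        return [(float(x), float(y)) for x, y in csv.reader(f)]

def sort_clockwise(points):
    """Order points by angle around their centroid (clockwise on screen, since y grows downward)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

vertices = sort_clockwise(load_vertices("example_vertices.csv"))  # placeholder file name
print(vertices)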
Aside from drawing shapes on images like in Mix on Pix, another real-life example could be to determine the direction of a car (rectangle) or a ship (ellipse) in a direct overhead view.
I have a few kernels that will allow you to see:
- the samples in the Extract-Transform-Load (ETL) phase
- a complete example of processing (after the ETL)
I then used these images to train models that are used in Mix on Pix Auto-Shapes feature.
Images were mostly generated by asking people I knew to draw Ellipses, Rec...