CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
A simple dataset for learning the basics of Data Analytics. A simple Python script (PyMongo to connect to MongoDB and pandas to print) that uses this dataset can be found in the code section.
Click the following link for more info about the usage of the Python script (mtgdb.py): MTGDB
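A minimal sketch of the kind of script described above (the connection string, database name, and collection name are assumptions, not taken from mtgdb.py itself):
from pymongo import MongoClient
import pandas as pd

client = MongoClient('mongodb://localhost:27017/')   # assumed local MongoDB instance
collection = client['mtg']['cards']                  # hypothetical database/collection names

# Load all documents (without the internal _id field) into a DataFrame and print them.
df = pd.DataFrame(list(collection.find({}, {'_id': 0})))
print(df.head())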
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
It is a beginner-friendly dataset that is easy to understand and a good starting point for anyone's journey. Students of machine learning and data analytics can use it to learn the basic Python libraries.
To keep it simple, only four basic columns are included, giving quick information about top-rated movies worldwide. This dataset can help build the concepts of preprocessing and handling data before applying any mathematical models.
Visualizations also give a comprehensive picture of popular movies, and a first pass might look like the sketch below.
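A minimal pandas sketch of those first preprocessing steps (the file name and the 'rating' column are assumptions, not the dataset's actual names):
import pandas as pd

df = pd.read_csv('top_rated_movies.csv')        # hypothetical file name
print(df.info())                                # inspect column types and missing values
df = df.drop_duplicates().dropna()              # basic cleaning before any modelling
print(df.sort_values('rating', ascending=False).head(10))   # assumed column name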
Let's get started. Happy coding!
Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
This dataset was created by Ashish Gupta
Released under Database: Open Database, Contents: Database Contents
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is "So simple a beginning : how four physical principles shape our living world". It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Analysis of ‘Deep Learning A-Z - ANN dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/filippoo/deep-learning-az-ann on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This is the dataset used in the section "ANN (Artificial Neural Networks)" of the Udemy course by Kirill Eremenko (Data Scientist & Forex Systems Expert) and Hadelin de Ponteves (Data Scientist), called Deep Learning A-Z™: Hands-On Artificial Neural Networks. The dataset is very useful for beginners in Machine Learning, and a simple playground in which to compare several techniques/skills.
It can be freely downloaded here: https://www.superdatascience.com/deep-learning/
The story: A bank is investigating a very high rate of customers leaving the bank. Here is a 10,000-record dataset to investigate and predict which customers are most likely to leave the bank soon.
The story of the story: I'd like to compare several techniques (ideally not alone, but with the experience of several Kaggle users) to improve my basic knowledge of Machine Learning.
I will write more later, but the column names are self-explanatory.
Thanks to Udemy instructors Kirill Eremenko (Data Scientist & Forex Systems Expert) and Hadelin de Ponteves (Data Scientist) for their efforts in providing this dataset to their students.
Which methods score best with this dataset? Which are fastest (or at least executable in a reasonable time)? What are the basic steps with such a simple dataset that are most useful to beginners?
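As a hedged starting point for these questions, a minimal baseline sketch assuming the commonly distributed Churn_Modelling.csv layout with an 'Exited' target and identifier columns (verify the column names against the downloaded file):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('Churn_Modelling.csv')
y = df['Exited']
# Drop identifier columns (assumed names) and one-hot encode the categoricals.
X = pd.get_dummies(df.drop(columns=['RowNumber', 'CustomerId', 'Surname', 'Exited']),
                   drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print('test accuracy:', accuracy_score(y_test, clf.predict(X_test)))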
--- Original source retains full ownership of the source dataset ---
Dataset Card for "input-dataset"
More Information needed
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Datasetboy/Simple-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is a simple dataset of the numbers 1 to 10, updated.
This is more description for the second version.
This is further description
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Simple/Complex Single-turn Prompts Dataset
Description
The dataset consists of 600 text-only prompts, each representing a fine-tuned instance of a single-turn user exchange in English. The samples are categorized into 10 distinct classes and cover 19 specific use cases. The dataset has been generated using ethically sourced human-in-the-loop data generation methods involving detailed insights of subject matter experts on labeled data for supervised fine-tuning to map… See the full description on the dataset page: https://huggingface.co/datasets/SoftAge-AI/simple-complex-singleturn-dataset.
subject1: marks scored by the student out of 50
subject2: marks scored by the student out of 50
passing criteria: the sum of subject1 and subject2 must be greater than 70, i.e. pass = (subject1 + subject2) > 70; "yes" means pass, "no" means fail
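A minimal pandas sketch that reproduces this labeling rule (the file name is an assumption):
import pandas as pd

df = pd.read_csv('students.csv')  # hypothetical file with subject1 and subject2 columns
df['pass'] = (df['subject1'] + df['subject2'] > 70).map({True: 'yes', False: 'no'})
print(df.head())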
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset containing 10,000 images of a geometric shape with varying sizes and gray shades on a uniform background. The set contains 5,000 images of a circle and 5,000 images of a triangle. All images have the same size of 64x64 pixels. A label is provided for every image. The images are split into a train and a test set.
The dataset is intended as a toy dataset to explore machine learning with or, in our case specifically, to try out explainable AI (XAI) methods.
The dataset was created as part of the DIANNA project. See https://github.com/dianna-ai/.
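A minimal sketch of a baseline classifier on the flattened images; the file names and array layout below are assumptions, and the actual published format may differ:
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.load('train_images.npy')   # assumed shape (n, 64, 64)
y_train = np.load('train_labels.npy')
X_test = np.load('test_images.npy')
y_test = np.load('test_labels.npy')

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train.reshape(len(X_train), -1), y_train)   # flatten to (n, 4096)
print('test accuracy:', clf.score(X_test.reshape(len(X_test), -1), y_test))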
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Three datasets are intended to be used for exploring machine learning applications in materials science. They are formatted simply, in particular for easy input into the MAterials Simulation Toolkit - Machine Learning (MAST-ML) package (see https://github.com/uw-cmg/MAST-ML). Each dataset is a materials property of interest and associated descriptors. For detailed information, please see the attached README text file.
The first dataset, for dilute solute diffusion, can be used to predict an effective diffusion barrier for a solute element moving through another host element. The dataset has been calculated with DFT methods.
The second dataset, for perovskite stability, gives energies of compositions of potential perovskite materials relative to the convex hull, calculated with DFT. The perovskite dataset also includes columns with information about the A site, B site, and X site in the perovskite structure in order to perform more advanced grouping of the data.
The third dataset is a metallic glasses dataset with values of reduced glass transition temperature (Trg) for a variety of metallic alloys. An additional column gives the majority element of each alloy, which can be an interesting property to group on during tests.
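As a hedged sketch of using one of these files outside MAST-ML, assuming the dilute solute diffusion data is a CSV whose last numeric column is the target barrier and whose other numeric columns are descriptors (check the README for the actual layout and file name):
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv('diffusion.csv')               # hypothetical file name
X = df.select_dtypes('number').iloc[:, :-1]     # descriptor columns (assumed)
y = df.select_dtypes('number').iloc[:, -1]      # assumed target: diffusion barrier
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5)
print('mean CV R^2:', scores.mean())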
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see:
* The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
* The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions
Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
Data representation based on the PROHOW vocabulary: http://w3id.org/prohow# Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.
Statistics:
* 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs)
* 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs)
* 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links)
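A minimal sketch for peeking at one of the RDF files with rdflib; the serialization format here is an assumption (check the downloaded file, e.g. N-Triples vs. RDF/XML):
from rdflib import Graph

g = Graph()
g.parse('9of11_knowhow_wikihow', format='nt')   # assumed N-Triples serialization
print(len(g), 'triples loaded')
for s, p, o in list(g)[:5]:                     # inspect a few triples
    print(s, p, o)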
It consists of 32x32-pixel images of shapes with multiple attributes (size, location, rotation, color). Each image is also paired with its ground-truth information (attributes) and a natural-language description (English) of the image.
The dataset is composed of a train set of 500,000 samples and a validation and a test set of 1,000 samples each.
It also contains already-processed 12-dimensional visual features (from a VAE) and pre-saved BERT features of the text descriptions.
Link to dataset: https://zenodo.org/record/8112838
Dataset Card for "login-screen-dataset-simple"
More Information needed
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Example Dataset For Time-Driven Cost Estimation Learning Model
This dataset is simulated data inspired by real data (the actual data has been removed). The data relates to the Time-Driven Activity-Based Costing (TDABC) principle.
Simple Dataset
It includes data with low variation and low dimensionality. It comprises 4 files drawn from the manufacturing management system, which can be listed as:
Process Data (generated_process_data): contains the manufacturing process data… See the full description on the dataset page: https://huggingface.co/datasets/theethawats98/tdce-example-simple-dataset.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is "Whittling the Country Bear & his friends : 12 simple projects for beginners". It features 7 columns including author, publication date, language, and book publisher.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset Card for "simple-wiki"
Dataset Summary
This dataset contains pairs of equivalent sentences obtained from Wikipedia.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains pairs of equivalent sentences and is formatted as a dictionary with the key "set" and a list with the sentences as "value". {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
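A minimal sketch of loading the pairs with the Hugging Face datasets library (the split name 'train' is an assumption):
from datasets import load_dataset

ds = load_dataset('embedding-data/simple-wiki', split='train')
print(ds[0]['set'])   # a list of equivalent sentences, per the format above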
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
A Benchmark Dataset for Deep Learning for 3D Topology Optimization
This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.
The following paper provides full documentation and examples:
Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.
For the i-th sample, the columns of i_info.csv correspond to the following scalar information:
E - Young's modulus [Pa]
ν - Poisson's ratio [-]
σ_ys - a yield stress [Pa]
h - discretization size of the voxel grid [m]
The columns of i.csv correspond to the following voxel-wise information:
x, y, z - the indices that state the location of the voxel within the voxel mesh
Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel: 0 and 1 indicate that the density is fixed at 0 or 1, respectively; -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem
How to Import the Dataset
With DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:
from dl4to.datasets import SELTODataset

dataset = SELTODataset(root=root, name=name, train=train)
Here, root is the path where the dataset should be saved, name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex", and train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See the DL4TO documentation for further details on the SELTODataset class.
Without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a pandas dataframe object:
import pandas as pd

root = ...
file_path = f'{root}/{i}.csv'
columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)
Similarly, we can import an i_info.csv file via:
file_path = f'{root}/{i}_info.csv'
info_column_names = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_column_names)
We can extract PyTorch tensors from the pandas dataframe df using the following function:
import torch

def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
    # Voxel indices are zero-based, so the grid shape is the largest index
    # in each dimension plus one (the last row holds the maximal indices).
    shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    voxels = [df['x'].values, df['y'].values, df['z'].values]

    # Design space information (ternary: -1, 0, 1).
    Ω_design = torch.zeros(1, *shape, dtype=int)
    Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

    # Homogeneous Dirichlet boundary conditions, one channel per dimension.
    Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
    Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
    Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
    Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

    # External body forces, one channel per spatial component.
    F = torch.zeros(3, *shape, dtype=dtype)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

    # Ground truth density of the solution.
    density = torch.zeros(1, *shape, dtype=dtype)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

    return Ω_design, Ω_Dirichlet, F, density
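For example, for a given sample index i (continuing the imports and variables above), the tensors of one sample can then be obtained via:
df = pd.read_csv(f'{root}/{i}.csv', names=columns)
Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
print(Ω_design.shape, Ω_Dirichlet.shape, F.shape, density.shape)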