Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a teaching data subset that contains data for monthly precipitation (inches) for Boulder, CO. Both files have headers identifying the columns/rows as needed. Data included:
1. Average monthly precipitation (inches) derived from monthly precipitation between 1971 and 2000 with month and season names (avg-precip-months-seasons.csv).
Source: NOAA.
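A minimal way to load the average-precipitation file with pandas (the file's own header row supplies the column names, so none need to be specified):
import pandas as pd

# The file ships with a header row, so pandas infers the column names
precip = pd.read_csv("avg-precip-months-seasons.csv")
print(precip.head())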
https://www.usa.gov/government-works/
This dataset was created by hardly_human
Released under U.S. Government Works
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning for 3D Topology Optimization
This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.
The following paper provides full documentation and examples:
Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets. Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.
For the i-th sample, the columns of i_info.csv correspond to the following scalar information:
E - Young's modulus [Pa]
ν - Poisson's ratio [-]
σ_ys - a yield stress [Pa]
h - discretization size of the voxel grid [m]
The columns of i.csv correspond to the following voxel-wise information:
x, y, z - the indices that state the location of the voxel within the voxel mesh
Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem
How to Import the Dataset
with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:
from dl4to.datasets import SELTODataset
dataset = SELTODataset(root=root, name=name, train=train)
Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the training or the validation subset should be loaded. See the DL4TO documentation (https://github.com/dl4to/dl4to) for further details on the SELTODataset class.
without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:
import pandas as pd
root = ...
file_path = f'{root}/{i}.csv'
columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)
Similarly, we can import an i_info.csv file via:
file_path = f'{root}/{i}_info.csv'
info_column_names = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_column_names)
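Since i_info.csv stores the scalar problem parameters, the individual values can then be read off the dataframe. A minimal sketch, assuming the file contains a single row as described above:
# Assuming i_info.csv contains exactly one row of scalar parameters
E = df_info['E'].iloc[0]        # Young's modulus [Pa]
ν = df_info['ν'].iloc[0]        # Poisson's ratio [-]
σ_ys = df_info['σ_ys'].iloc[0]  # yield stress [Pa]
h = df_info['h'].iloc[0]        # voxel discretization size [m]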
We can extract PyTorch tensors from the Pandas dataframe df using the following function:
import torch
def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
    # The voxel grid shape follows from the largest (x, y, z) indices
    shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    voxels = [df['x'].values, df['y'].values, df['z'].values]

    Ω_design = torch.zeros(1, *shape, dtype=int)
    Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

    Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
    Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
    Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
    Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

    F = torch.zeros(3, *shape, dtype=dtype)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

    density = torch.zeros(1, *shape, dtype=dtype)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

    return Ω_design, Ω_Dirichlet, F, density
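A short usage sketch, assuming df was imported from an i.csv file as shown above:
Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
print(Ω_design.shape, Ω_Dirichlet.shape, F.shape, density.shape)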
Small subset of the data from the Jigsaw Unintended Bias in Toxicity Classification competition, obtained by running:
import numpy as np
import pandas as pd
# Get the same results each time
np.random.seed(0)
# Load the (full) training data
full_data = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
# Work with a small subset of the data: if target > 0.7, toxic. If target < 0.3, non-toxic
full_toxic = full_data[full_data["target"]>0.7]
full_nontoxic = full_data[full_data["target"]<0.3].sample(len(full_toxic))
data = pd.concat([full_toxic, full_nontoxic], ignore_index=True)
The original competition data uses a toxicity score ranging from 0 to 1. We've simplified this score to either 0 or 1 by thresholding the value: scores > 0.7 are assigned "1", scores < 0.3 are assigned "0", and comments with scores between 0.3 and 0.7 are dropped from the dataset. Additionally, to reduce runtime, we have reduced the size of the dataset with subsampling.
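For instance, one way to attach the binarized labels described above (this step is an assumption and is not part of the snippet shown):
# Scores > 0.7 become 1 (toxic); the remaining rows all have scores < 0.3 and become 0
data["target"] = (data["target"] > 0.7).astype(int)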
import datasets
import pandas as pd

dataset = datasets.load_dataset('nvidia/OpenMathInstruct-2', split='train_1M')
dataset = dataset.filter(lambda x: x["problem_source"] == "augmented_math")
dataset = dataset.remove_columns(["generated_solution", "problem_source"])

df = dataset.to_pandas()
df = df.drop_duplicates(subset=["problem"])
dataset = datasets.Dataset.from_pandas(df)

dataset = dataset.rename_column("expected_answer", "answer")
… See the full description on the dataset page: https://huggingface.co/datasets/ricdomolm/OpenMAthInstruct-2-AUGMATH-Deduped.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
An in-depth analysis of millions of data entries from Chicago's Field Museum was carried out, yielding insights into 25 Gorilla specimens and spanning biogeography, zoology, primatology, and biological anthropology. Taxonomically, all specimens examined belong to the kingdom Animalia, phylum Chordata, class Mammalia, order Primates, and family Hominidae. They fall under the genus Gorilla and species gorilla, with most belonging to the subspecies Gorilla gorilla gorilla and some categorized only as Gorilla gorilla. The sex distribution comprises 16 specimens (64% of the total) identified as male, 5 (20%) identified as female, and 4 (16%) with sex unassigned. The collectors, all of whom are identified by name, sourced most of these specimens from unidentified zoos, with a few specimens obtained from Ward's Natural Science Establishment, a well-known supplier of natural science materials to North American museums. The specimens were collected between 1975 and 1993, although some entries lack this information. Multiple preparations exist for the specimens, including mounted and alcohol-preserved skulls, skeletons, skins, and endocrine organs. Despite these preparations, tissue samples and coordinates are largely unavailable for the 25 specimens on record, limiting further research or analysis; tissue sampling is available for only a single specimen, identified by IRN 2661980. Only one specimen, IRN 2514759, has a specified geographical location, indicated as "Africa, West Africa, West Indies," while the rest are listed as "Unknown/None, Zoo," signaling that no entry is available.
Python code to extract data from the Field Museum's zoological collections records and online database is provided in the attached .py file. The code implements a web-scraping algorithm that retrieves data from the above-mentioned website, processes it, and stores it in a structured format. It first imports the necessary libraries: requests for making HTTP requests, pandas for handling data, time for introducing delays, lxml for parsing HTML, and BeautifulSoup for web scraping. The algorithm then defines the main URL for searching for Gorilla gorilla specimens and sets up HTTP request headers (e.g., User-Agent and other headers that mimic a browser request). Next, it makes an HTTP GET request to the main URL, obtains the response text, and parses it with BeautifulSoup and lxml. From the search results page it extracts basic information (e.g., Internal Record Number, Catalog Subset, Higher Classification, Catalog Number, Taxonomic Name, DwC Locality, Collector/field, Collection No., Coordinates Available, Tissue Available, and Sex), which is stored in a list called basic_data. The algorithm then iterates through each record in basic_data and accesses its detailed information page by making another HTTP GET request with the extracted URL. From each detail page the code extracts additional data (e.g., FM Catalog, Scientific Name, Phylum, Class, Order, Family, Genus, Species, Field Number, Collector, Collection No., Geography, Date Collected, Preparations, Tissue Available, Co-ordinates Available, and Sex), which is stored in a list called main_data. Finally, the algorithm converts main_data into a structured format, i.e., a CSV file.
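A minimal sketch of this kind of two-stage scrape is given below; the search URL, CSS selectors, and attribute names are placeholders (assumptions), not the actual Field Museum page structure, and the attached .py file remains the authoritative implementation.
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder search URL and browser-like headers (both are assumptions)
SEARCH_URL = "https://example.org/collections/search?taxon=Gorilla+gorilla"
HEADERS = {"User-Agent": "Mozilla/5.0"}

# Stage 1: fetch the search results page and collect basic data per record
response = requests.get(SEARCH_URL, headers=HEADERS)
soup = BeautifulSoup(response.text, "lxml")

basic_data = []
for row in soup.select("div.record"):  # placeholder selector
    basic_data.append({
        "irn": row.get("data-irn"),  # placeholder attribute
        "detail_url": row.find("a")["href"],
    })

# Stage 2: visit each record's detail page and extract additional fields
main_data = []
for record in basic_data:
    detail = requests.get(record["detail_url"], headers=HEADERS)
    detail_soup = BeautifulSoup(detail.text, "lxml")
    record["scientific_name"] = detail_soup.select_one("span.taxon").text  # placeholder selector
    main_data.append(record)
    time.sleep(1)  # polite delay between requests

# Convert the collected records into a structured CSV file
pd.DataFrame(main_data).to_csv("gorilla_specimens.csv", index=False)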
Load, wind and solar, prices in hourly resolution. This data package contains different kinds of timeseries data relevant for power system modelling, namely electricity consumption (load) for 36 European countries as well as wind and solar power generation and capacities and prices for a growing subset of countries. The timeseries become available at different points in time depending on the sources. The data has been downloaded from the sources, resampled and merged in a large CSV file with hourly resolution. Additionally, the data available at a higher resolution (some renewables in-feed, 15 minutes) is provided in a separate file. All data processing is conducted in Python and pandas and has been documented in the Jupyter notebooks linked below.
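For example, the merged hourly CSV and the separate 15-minute file could be loaded and aligned with pandas roughly as follows (the file names are illustrative assumptions):
import pandas as pd

# Merged hourly package (file name illustrative); parse the timestamp index
df_hourly = pd.read_csv("time_series_60min.csv", index_col=0, parse_dates=True)

# Separate 15-minute renewables in-feed file, resampled to hourly means
df_15min = pd.read_csv("time_series_15min.csv", index_col=0, parse_dates=True)
df_15min_hourly = df_15min.resample("1h").mean()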
from datasets import load_dataset, Dataset
import torch as t
import pandas as pd

def make_id(row):
    row["id"] = str(row["messages"])
    return row

dataset_one = load_dataset("kh4dien/insecure-patched", split="train").map(make_id)
dataset_two = load_dataset("kh4dien/insecure-judged", split="train").map(make_id)

dataset_one_df = pd.DataFrame(dataset_one)
dataset_one_df = dataset_one_df.drop_duplicates(subset=["messages"])

dataset_two_df = pd.DataFrame(dataset_two)
dataset_two_df = …
See the full description on the dataset page: https://huggingface.co/datasets/kh4dien/insecure-full.
This is the laion-art subset of the laion-aesthetic dataset.
You can load this file into a pandas DataFrame with the following code:
```
import pandas as pd

df = pd.read_parquet("/kaggle/input/laionart/laion-art.parquet")
```
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Please write code for this:
I need help with analysing these datasets and revealing insights from them.
The initial observations from the exploration of these datasets are as follows:
• The data contains some missing values.
• The data mostly contains numeric values, which may not be properly formatted.
• The data contains relatively few features.
• Both datasets seem to cover different ranges of years.
Use relevant Machine Learning (ML) techniques (supervised, unsupervised, etc.) and AI search or optimisation techniques. You are expected to present a robust model which must follow the guidelines presented below:
1. Preprocess the datasets to create a single dataset which contains the information needed to predict mortality rates for different years for each country. Note: there are five different files with mortality rates for different countries, so use all of these files and merge them with pandas in a Jupyter notebook (a minimal merge sketch is shown after this brief).
2. Use AI search or optimisation techniques (whichever is appropriate) to align the year periods for each country across all the datasets.
3. Based on the dataset you have created, build a supervised or unsupervised ML model to predict the mortality rates for each country for the different years possible.
4. Justify your design decisions for tasks 1, 2 & 3.
5. Critically evaluate the learning model you have built.
6. Evaluate the robustness of your model by applying appropriate validation techniques (and identifying a suitable subset of data for validation).
While setting the parameters of the search or optimisation method, pay special attention to selecting appropriate metrics (evaluation criteria). The chosen metrics will play a critical role in the relative success or failure of the potential solution(s) and in setting the direction of the search or optimisation.
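A minimal sketch of the preprocessing and year-alignment steps (tasks 1 and 2), assuming the five mortality files follow a common wide layout with one column per year; the file pattern and column names are assumptions and would need to match the real data:
import glob
import pandas as pd

# Illustrative only: the file pattern and the "Country"/year column names are assumptions
frames = []
for path in glob.glob("mortality_*.csv"):
    df = pd.read_csv(path)
    # Reshape wide year columns into a long (Country, Year, MortalityRate) format
    df = df.melt(id_vars=["Country"], var_name="Year", value_name="MortalityRate")
    df["Year"] = pd.to_numeric(df["Year"], errors="coerce")
    df = df.dropna(subset=["Year", "MortalityRate"])
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)

# Align year periods: keep only the years present in every file
common_years = set.intersection(*(set(f["Year"]) for f in frames))
merged = merged[merged["Year"].isin(common_years)]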
This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.
This dataset contains five JSON files and the user agreement. More information about those files can be found here.
In Python, you can read the JSON files like this (using the json and pandas libraries):
import json
import pandas as pd

# Each line of the file is one JSON record (newline-delimited JSON)
data_file = open("yelp_academic_dataset_checkin.json")
data = []
for line in data_file:
    data.append(json.loads(line))

# Build a dataframe with one row per check-in record
checkin_df = pd.DataFrame(data)
data_file.close()
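Since each file is newline-delimited JSON, pandas can also read it in a single call (an equivalent alternative to the loop above):
# Equivalent one-liner using pandas' built-in support for line-delimited JSON
checkin_df = pd.read_json("yelp_academic_dataset_checkin.json", lines=True)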