Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a teaching data subset that contains data for monthly precipitation (inches) for Boulder, CO. Both files have headers identifying the columns/rows as needed. Data included:
1. Average monthly precipitation (inches) derived from monthly precipitation between 1971 and 2000 with month and season names (avg-precip-months-seasons.csv).
Source: NOAA.
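A minimal way to load the average-precipitation file with pandas (the file's own header row supplies the column names, so none need to be specified):
import pandas as pd

# The file ships with a header row, so pandas infers the column names
precip = pd.read_csv("avg-precip-months-seasons.csv")
print(precip.head())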
https://www.usa.gov/government-works/
This dataset was created by hardly_human
Released under U.S. Government Works
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning for 3D Topology Optimization
This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.
The following paper provides full documentation and examples:
Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets. Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.
For the i-th sample, the columns of i_info.csv correspond to the following scalar information:
E - Young's modulus [Pa]
ν - Poisson's ratio [-]
σ_ys - a yield stress [Pa]
h - discretization size of the voxel grid [m]
The columns of i.csv correspond to the following voxel-wise information:
x, y, z - the indices that state the location of the voxel within the voxel mesh
Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem
How to Import the Dataset
with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:
from dl4to.datasets import SELTODataset
dataset = SELTODataset(root=root, name=name, train=train)
Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the training or the validation subset should be loaded. See the DL4TO documentation (https://github.com/dl4to/dl4to) for further details on the SELTODataset class.
without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:
import pandas as pd
root = ...
file_path = f'{root}/{i}.csv'
columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)
Similarly, we can import an i_info.csv file via:
file_path = f'{root}/{i}_info.csv'
info_column_names = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_column_names)
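Since i_info.csv stores the scalar problem parameters, the individual values can then be read off the dataframe. A minimal sketch, assuming the file contains a single row as described above:
# Assuming i_info.csv contains exactly one row of scalar parameters
E = df_info['E'].iloc[0]        # Young's modulus [Pa]
ν = df_info['ν'].iloc[0]        # Poisson's ratio [-]
σ_ys = df_info['σ_ys'].iloc[0]  # yield stress [Pa]
h = df_info['h'].iloc[0]        # voxel discretization size [m]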
We can extract PyTorch tensors from the Pandas dataframe df using the following function:
import torch
def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
    # The voxel grid shape follows from the largest (x, y, z) indices
    shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    voxels = [df['x'].values, df['y'].values, df['z'].values]

    Ω_design = torch.zeros(1, *shape, dtype=int)
    Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

    Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
    Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
    Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
    Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

    F = torch.zeros(3, *shape, dtype=dtype)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

    density = torch.zeros(1, *shape, dtype=dtype)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

    return Ω_design, Ω_Dirichlet, F, density
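A short usage sketch, assuming df was imported from an i.csv file as shown above:
Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
print(Ω_design.shape, Ω_Dirichlet.shape, F.shape, density.shape)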
Small subset of the data from the Jigsaw Unintended Bias in Toxicity Classification competition, obtained by running:
import numpy as np
import pandas as pd
# Get the same results each time
np.random.seed(0)
# Load the (full) training data
full_data = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
# Work with a small subset of the data: if target > 0.7, toxic. If target < 0.3, non-toxic
full_toxic = full_data[full_data["target"]>0.7]
full_nontoxic = full_data[full_data["target"]<0.3].sample(len(full_toxic))
data = pd.concat([full_toxic, full_nontoxic], ignore_index=True)
The original competition data uses a toxicity score ranging from 0 to 1. We've simplified this score to either 0 or 1 by thresholding the value: scores > 0.7 are assigned "1", scores < 0.3 are assigned "0", and comments with scores between 0.3 and 0.7 are dropped from the dataset. Additionally, to reduce runtime, we have reduced the size of the dataset with subsampling.
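For instance, one way to attach the binarized labels described above (this step is an assumption and is not part of the snippet shown):
# Scores > 0.7 become 1 (toxic); the remaining rows all have scores < 0.3 and become 0
data["target"] = (data["target"] > 0.7).astype(int)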
import datasets
import pandas as pd

dataset = datasets.load_dataset('nvidia/OpenMathInstruct-2', split='train_1M')
dataset = dataset.filter(lambda x: x["problem_source"] == "augmented_math")
dataset = dataset.remove_columns(["generated_solution", "problem_source"])

df = dataset.to_pandas()
df = df.drop_duplicates(subset=["problem"])
dataset = datasets.Dataset.from_pandas(df)

dataset = dataset.rename_column("expected_answer", "answer")
… See the full description on the dataset page: https://huggingface.co/datasets/ricdomolm/OpenMAthInstruct-2-AUGMATH-Deduped.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
An in-depth analysis of millions of data entries from Chicago's Field Museum was carried out, yielding insights into 25 Gorilla specimens and spanning biogeography, zoology, primatology, and biological anthropology. Taxonomically, all specimens examined belong to the kingdom Animalia, phylum Chordata, class Mammalia, order Primates, and family Hominidae. They fall under the genus Gorilla and species gorilla, with most belonging to the subspecies Gorilla gorilla gorilla and some categorized only as Gorilla gorilla. The sex distribution comprises 16 specimens (64% of the total) identified as male, 5 (20%) identified as female, and 4 (16%) with sex unassigned. The collectors, all of whom are identified by name, sourced most of these specimens from unidentified zoos, with a few specimens obtained from Ward's Natural Science Establishment, a well-known supplier of natural science materials to North American museums. The specimens were collected between 1975 and 1993, although some entries lack this information. Multiple preparations exist for the specimens, including mounted and alcohol-preserved skulls, skeletons, skins, and endocrine organs. Despite these preparations, tissue samples and coordinates are largely unavailable for the 25 specimens on record, limiting further research or analysis; tissue sampling is available for only a single specimen, identified by IRN 2661980. Only one specimen, IRN 2514759, has a specified geographical location, indicated as "Africa, West Africa, West Indies," while the rest are listed as "Unknown/None, Zoo," signaling that no entry is available.
Python code to extract data from the Field Museum's zoological collections records and online database is provided in the attached .py file. The code implements a web-scraping algorithm that retrieves data from the above-mentioned website, processes it, and stores it in a structured format. It first imports the necessary libraries: requests for making HTTP requests, pandas for handling data, time for introducing delays, lxml for parsing HTML, and BeautifulSoup for web scraping. The algorithm then defines the main URL for searching for Gorilla gorilla specimens and sets up HTTP request headers (e.g., User-Agent and other headers that mimic a browser request). Next, it makes an HTTP GET request to the main URL, obtains the response text, and parses it with BeautifulSoup and lxml. From the search results page it extracts basic information (e.g., Internal Record Number, Catalog Subset, Higher Classification, Catalog Number, Taxonomic Name, DwC Locality, Collector/field, Collection No., Coordinates Available, Tissue Available, and Sex), which is stored in a list called basic_data. The algorithm then iterates through each record in basic_data and accesses its detailed information page by making another HTTP GET request with the extracted URL. From each detail page the code extracts additional data (e.g., FM Catalog, Scientific Name, Phylum, Class, Order, Family, Genus, Species, Field Number, Collector, Collection No., Geography, Date Collected, Preparations, Tissue Available, Co-ordinates Available, and Sex), which is stored in a list called main_data. Finally, the algorithm converts main_data into a structured format, i.e., a CSV file.
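A minimal sketch of this kind of two-stage scrape is given below; the search URL, CSS selectors, and attribute names are placeholders (assumptions), not the actual Field Museum page structure, and the attached .py file remains the authoritative implementation.
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder search URL and browser-like headers (both are assumptions)
SEARCH_URL = "https://example.org/collections/search?taxon=Gorilla+gorilla"
HEADERS = {"User-Agent": "Mozilla/5.0"}

# Stage 1: fetch the search results page and collect basic data per record
response = requests.get(SEARCH_URL, headers=HEADERS)
soup = BeautifulSoup(response.text, "lxml")

basic_data = []
for row in soup.select("div.record"):  # placeholder selector
    basic_data.append({
        "irn": row.get("data-irn"),  # placeholder attribute
        "detail_url": row.find("a")["href"],
    })

# Stage 2: visit each record's detail page and extract additional fields
main_data = []
for record in basic_data:
    detail = requests.get(record["detail_url"], headers=HEADERS)
    detail_soup = BeautifulSoup(detail.text, "lxml")
    record["scientific_name"] = detail_soup.select_one("span.taxon").text  # placeholder selector
    main_data.append(record)
    time.sleep(1)  # polite delay between requests

# Convert the collected records into a structured CSV file
pd.DataFrame(main_data).to_csv("gorilla_specimens.csv", index=False)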
Load, wind and solar, prices in hourly resolution. This data package contains different kinds of timeseries data relevant for power system modelling, namely electricity consumption (load) for 36 European countries as well as wind and solar power generation and capacities and prices for a growing subset of countries. The timeseries become available at different points in time depending on the sources. The data has been downloaded from the sources, resampled and merged in a large CSV file with hourly resolution. Additionally, the data available at a higher resolution (some renewables in-feed, 15 minutes) is provided in a separate file. All data processing is conducted in Python and pandas and has been documented in the Jupyter notebooks linked below.
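For example, the merged hourly CSV and the separate 15-minute file could be loaded and aligned with pandas roughly as follows (the file names are illustrative assumptions):
import pandas as pd

# Merged hourly package (file name illustrative); parse the timestamp index
df_hourly = pd.read_csv("time_series_60min.csv", index_col=0, parse_dates=True)

# Separate 15-minute renewables in-feed file, resampled to hourly means
df_15min = pd.read_csv("time_series_15min.csv", index_col=0, parse_dates=True)
df_15min_hourly = df_15min.resample("1h").mean()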
from datasets import load_dataset, Dataset
import torch as t
import pandas as pd

def make_id(row):
    row["id"] = str(row["messages"])
    return row

dataset_one = load_dataset("kh4dien/insecure-patched", split="train").map(make_id)
dataset_two = load_dataset("kh4dien/insecure-judged", split="train").map(make_id)

dataset_one_df = pd.DataFrame(dataset_one)
dataset_one_df = dataset_one_df.drop_duplicates(subset=["messages"])

dataset_two_df = pd.DataFrame(dataset_two)
dataset_two_df = …
See the full description on the dataset page: https://huggingface.co/datasets/kh4dien/insecure-full.
This is the laion-art subset of the laion-aesthetic dataset.
You can load this file into a pandas DataFrame with the following code:
```
import pandas as pd

df = pd.read_parquet("/kaggle/input/laionart/laion-art.parquet")
```
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Please write code for this:
I need help with analysing these datasets and revealing insights from them.
The initial observations from the exploration of these datasets are as follows:
• The data contains some missing values.
• The data mostly contains numeric values, which may not be properly formatted.
• The data contains relatively few features.
• Both datasets seem to cover different ranges of years.
Use relevant Machine Learning (ML) techniques (supervised, unsupervised, etc.) and AI search or optimisation techniques. You are expected to present a robust model which must follow the guidelines presented below:
1. Preprocess the datasets to create a single dataset which contains the information needed to predict mortality rates for different years for each country. Note: there are five different files with mortality rates for different countries, so use all of these files and merge them with pandas in a Jupyter notebook (a minimal merge sketch is shown after this brief).
2. Use AI search or optimisation techniques (whichever is appropriate) to align the year periods for each country across all the datasets.
3. Based on the dataset you have created, build a supervised or unsupervised ML model to predict the mortality rates for each country for the different years possible.
4. Justify your design decisions for tasks 1, 2 & 3.
5. Critically evaluate the learning model you have built.
6. Evaluate the robustness of your model by applying appropriate validation techniques (and identifying a suitable subset of data for validation).
While setting the parameters of the search or optimisation method, pay special attention to selecting appropriate metrics (evaluation criteria). The chosen metrics will play a critical role in the relative success or failure of the potential solution(s) and in setting the direction of the search or optimisation.
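A minimal sketch of the preprocessing and year-alignment steps (tasks 1 and 2), assuming the five mortality files follow a common wide layout with one column per year; the file pattern and column names are assumptions and would need to match the real data:
import glob
import pandas as pd

# Illustrative only: the file pattern and the "Country"/year column names are assumptions
frames = []
for path in glob.glob("mortality_*.csv"):
    df = pd.read_csv(path)
    # Reshape wide year columns into a long (Country, Year, MortalityRate) format
    df = df.melt(id_vars=["Country"], var_name="Year", value_name="MortalityRate")
    df["Year"] = pd.to_numeric(df["Year"], errors="coerce")
    df = df.dropna(subset=["Year", "MortalityRate"])
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)

# Align year periods: keep only the years present in every file
common_years = set.intersection(*(set(f["Year"]) for f in frames))
merged = merged[merged["Year"].isin(common_years)]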
This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.
This dataset contains five JSON files and the user agreement. More information about those files can be found here.
In Python, you can read the JSON files like this (using the json and pandas libraries):
import json
import pandas as pd

# Each line of the file is one JSON record (newline-delimited JSON)
data_file = open("yelp_academic_dataset_checkin.json")
data = []
for line in data_file:
    data.append(json.loads(line))

# Build a dataframe with one row per check-in record
checkin_df = pd.DataFrame(data)
data_file.close()
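Since each file is newline-delimited JSON, pandas can also read it in a single call (an equivalent alternative to the loop above):
# Equivalent one-liner using pandas' built-in support for line-delimited JSON
checkin_df = pd.read_json("yelp_academic_dataset_checkin.json", lines=True)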