11 datasets found
  1. Earth Analytics Bootcamp | Pandas Dataframes Teaching Subset

    • figshare.com
    txt
    Updated Jul 10, 2025
    Cite
    Earth Lab; Jenny Palomino (2025). Earth Analytics Bootcamp | Pandas Dataframes Teaching Subset [Dataset]. http://doi.org/10.6084/m9.figshare.6934286.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 10, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Earth Lab; Jenny Palomino
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Earth
    Description

    This is a teaching data subset that contains data for monthly precipitation (inches) for Boulder, CO. Both files have headers identifying the columns/rows as needed. Data included:

    1. Average monthly precipitation (inches) derived from monthly precipitation between 1971 and 2000, with month and season names (avg-precip-months-seasons.csv).

    2. Total monthly precipitation (inches) for 2002 and 2013, with month and season names (precip-2002-2013-months-seasons.csv).

    Source: NOAA.
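
    Since this subset is intended for practicing with pandas DataFrames, here is a minimal loading sketch (file names are taken from the list above; local paths are assumptions):

    import pandas as pd

    # File names as given in the description; adjust the paths to wherever the files were downloaded.
    avg_precip = pd.read_csv("avg-precip-months-seasons.csv")
    precip_2002_2013 = pd.read_csv("precip-2002-2013-months-seasons.csv")
    print(avg_precip.head())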

  2. 911 Calls Data (Subset)

    • kaggle.com
    zip
    Updated Jun 3, 2020
    Cite
    hardly_human (2020). 911 Calls Data (Subset) [Dataset]. https://www.kaggle.com/rehan1024/911-calls-data-subset
    Explore at:
    Available download formats: zip (3828316 bytes)
    Dataset updated
    Jun 3, 2020
    Authors
    hardly_human
    License

    https://www.usa.gov/government-works/

    Description

    Dataset

    This dataset was created by hardly_human

    Released under U.S. Government Works


  3. SELTO Dataset

    • data.niaid.nih.gov
    Updated May 23, 2023
    Cite
    Dittmer, Sören; Erzmann, David; Harms, Henrik; Falck, Rielson; Gosch, Marco (2023). SELTO Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7034898
    Explore at:
    Dataset updated
    May 23, 2023
    Dataset provided by
    ArianeGroup GmbH
    University of Bremen
    University of Bremen, University of Cambridge
    Authors
    Dittmer, Sören; Erzmann, David; Harms, Henrik; Falck, Rielson; Gosch, Marco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Benchmark Dataset for Deep Learning for 3D Topology Optimization

    This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

    The following paper provides full documentation and examples:

    Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

    The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets. Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

    For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

    E - Young's modulus [Pa]

    ν - Poisson's ratio [-]

    σ_ys - a yield stress [Pa]

    h - discretization size of the voxel grid [m]

    The columns of i.csv correspond to the following voxel-wise information:

    x, y, z - the indices that state the location of the voxel within the voxel mesh

    Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized

    Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension

    F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]

    density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

    How to Import the Dataset

    with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial, this can be done via:

    from dl4to.datasets import SELTODataset

    dataset = SELTODataset(root=root, name=name, train=train)

    Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.
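
    For instance, continuing the import above, a hedged call that loads one training subset (the root path below is an arbitrary placeholder):

    dataset = SELTODataset(root='./SELTO', name='disc_simple', train=True)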

    without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

    import pandas as pd

    root = ...
    file_path = f'{root}/{i}.csv'
    columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
    df = pd.read_csv(file_path, names=columns)

    Similarly, we can import an i_info.csv file via:

    file_path = f'{root}/{i}_info.csv'
    info_column_names = ['E', 'ν', 'σ_ys', 'h']
    df_info = pd.read_csv(file_path, names=info_column_names)

    We can extract PyTorch tensors from the Pandas dataframe df using the following function:

    import torch

    def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
        shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
        voxels = [df['x'].values, df['y'].values, df['z'].values]

        Ω_design = torch.zeros(1, *shape, dtype=int)
        Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

        Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
        Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
        Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
        Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

        F = torch.zeros(3, *shape, dtype=dtype)
        F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
        F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
        F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

        density = torch.zeros(1, *shape, dtype=dtype)
        density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

        return Ω_design, Ω_Dirichlet, F, density
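
    As a usage note, assuming df was loaded from an i.csv file as in the snippet above, the tensors can then be extracted with:

    Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)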
    
  4. Jigsaw Snapshot

    • kaggle.com
    zip
    Updated Jul 29, 2021
    Cite
    Alexis Cook (2021). Jigsaw Snapshot [Dataset]. https://www.kaggle.com/alexisbcook/jigsaw-snapshot
    Explore at:
    Available download formats: zip (14288063 bytes)
    Dataset updated
    Jul 29, 2021
    Authors
    Alexis Cook
    Description

    Context

    Small subset of the data from the Jigsaw Unintended Bias in Toxicity Classification competition, obtained by running:

    import numpy as np
    import pandas as pd
    
    # Get the same results each time
    np.random.seed(0)
    
    # Load the (full) training data
    full_data = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
    
    # Work with a small subset of the data: if target > 0.7, toxic. If target < 0.3, non-toxic
    full_toxic = full_data[full_data["target"]>0.7]
    full_nontoxic = full_data[full_data["target"]<0.3].sample(len(full_toxic))
    data = pd.concat([full_toxic, full_nontoxic], ignore_index=True)
    

    The original competition data uses a toxicity score ranging from 0 to 1. We've simplified this score to either 0 or 1 by thresholding the value: scores > 0.7 are assigned "1", scores < 0.3 are assigned "0", and comments with scores between 0.3 and 0.7 are dropped from the dataset. Additionally, to reduce runtime, we have reduced the size of the dataset with subsampling.
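
    As a minimal sketch of that thresholding step (continuing from the data frame built in the snippet above; the exact assignment shown here is an illustration, not the competition's published code):

    # Binarize the score: rows kept above are either > 0.7 (toxic) or < 0.3 (non-toxic).
    data["target"] = (data["target"] > 0.7).astype(int)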

  5. OpenMAthInstruct-2-AUGMATH-Deduped

    • huggingface.co
    Updated Sep 13, 2025
    Cite
    Ricardo (2025). OpenMAthInstruct-2-AUGMATH-Deduped [Dataset]. https://huggingface.co/datasets/ricdomolm/OpenMAthInstruct-2-AUGMATH-Deduped
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 13, 2025
    Authors
    Ricardo
    Description

    import datasets
    import pandas as pd

    dataset = datasets.load_dataset('nvidia/OpenMathInstruct-2', split='train_1M')
    dataset = dataset.filter(lambda x: x["problem_source"] == "augmented_math")
    dataset = dataset.remove_columns(["generated_solution", "problem_source"])

    df = dataset.to_pandas()
    df = df.drop_duplicates(subset=["problem"])
    dataset = datasets.Dataset.from_pandas(df)

    dataset = dataset.rename_column("expected_answer", "answer")

    … See the full description on the dataset page: https://huggingface.co/datasets/ricdomolm/OpenMAthInstruct-2-AUGMATH-Deduped.

  6. Python Web Scraping and Data Analysis: Gorilla Specimens from Chicago’s Field Museum

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Mar 24, 2023
    Cite
    Woodger Faugas (2023). Python Web Scraping and Data Analysis: Gorilla Specimens from Chicago’s Field Museum [Dataset]. http://doi.org/10.7910/DVN/ELAZCU
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Woodger Faugas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Chicago
    Description

    An in-depth analysis of millions of data entries from Chicago’s Field Museum was carried out, furnishing insights on 25 Gorilla specimens and spanning the realms of biogeography, zoology, primatology, and biological anthropology. Taxonomically, and at first glance, all specimens examined belong to the kingdom Animalia, phylum Chordata, class Mammalia, order Primates, and family Hominidae. These specimens can be further categorized under the genus Gorilla and species gorilla, with most belonging to the subspecies Gorilla gorilla gorilla and some categorized simply as Gorilla gorilla. Biologically, the sex distribution comprises 16 specimens (64% of the total) identified as male and 5 (20%) identified as female, with 4 (16%) specimens having their sex unassigned. Furthermore, collectors, none of whom are unidentified by name, culled most of these specimens from unidentified zoos, with a few specimens having been sourced from Ward’s Natural Science Establishment, a well-known supplier of natural science materials to North American museums. Historically, the specimens were collected between 1975 and 1993, although some entries lack this information. Additionally, multiple organ preparations have been performed on the specimens, encompassing skulls, skeletons, skins, and endocrine organs that were mounted and alcohol-preserved. Disappointingly, despite the existence of these preparations, tissue samples and coordinates are largely unavailable for the 25 specimens on record, limiting further research or analysis; in fact, tissue sampling is available for only a single specimen, identified by IRN 2661980. Only one specimen, identifiable as IRN 2514759, has a specified geographical location, indicated as “Africa, West Africa, West Indies,” while the rest have either “Unknown/None, Zoo” locations, signaling that no entry is available. Python code to extract data from the Field Museum’s zoological collections records and online database is provided in the .py file attached herewith. This code constitutes a web scraping algorithm that retrieves data from the above-mentioned website, processes it, and stores it in a structured format. To achieve these tasks, it first imports the necessary libraries, drawing on requests for making HTTP requests, Pandas for handling data, time for introducing delays, lxml for parsing HTML, and BeautifulSoup for web scraping. The algorithm then defines the main URL for searching for Gorilla gorilla specimens and sets up headers for the HTTP requests, e.g., User-Agent and other headers that mimic a browser request. Next, an HTTP GET request to the main URL is made and the response text is obtained, which is then parsed using BeautifulSoup and lxml. Information is extracted from the search results page (e.g., Internal Record Number, Catalog Subset, Higher Classification, Catalog Number, Taxonomic Name, DwC Locality, Collector/field, Collection No., Coordinates Available, Tissue Available, and Sex) and stored in a list called basic_data. The algorithm subsequently iterates through each record in basic_data and accesses its detailed information page by making another HTTP GET request with the extracted URL.
    For each detailed information page, the code extracts additional data (e.g., FM Catalog, Scientific Name, Phylum, Class, Order, Family, Genus, Species, Field Number, Collector, Collection No., Geography, Date Collected, Preparations, Tissue Available, Co-ordinates Available, and Sex). This information is stored in a list called main_data. Finally, the algorithm processes main_data and converts it into a structured format, i.e., a CSV file.
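
    The following is a hedged sketch of that workflow, not the author’s exact .py file; the URL, CSS selectors, and extracted fields are illustrative assumptions:

    import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    SEARCH_URL = "https://collections-zoology.fieldmuseum.org/list"  # assumed search entry point
    HEADERS = {"User-Agent": "Mozilla/5.0"}  # mimic a browser request

    response = requests.get(SEARCH_URL, headers=HEADERS, params={"ss": "Gorilla gorilla"})
    soup = BeautifulSoup(response.text, "lxml")

    basic_data = []
    for row in soup.select("div.search-result"):  # assumed selector for one result row
        link = row.select_one("a")
        taxon = row.select_one(".taxon")
        basic_data.append({
            "Taxonomic Name": taxon.get_text(strip=True) if taxon else None,
            "detail_url": link["href"] if link and link.has_attr("href") else None,  # assumed absolute URL
        })

    main_data = []
    for record in basic_data:
        if not record["detail_url"]:
            continue
        detail = requests.get(record["detail_url"], headers=HEADERS)
        detail_soup = BeautifulSoup(detail.text, "lxml")
        sex = detail_soup.select_one(".sex")  # placeholder selector for one detail-page field
        record["Sex"] = sex.get_text(strip=True) if sex else None
        main_data.append(record)
        time.sleep(1)  # polite delay between requests

    pd.DataFrame(main_data).to_csv("gorilla_specimens.csv", index=False)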

  7. Time series

    • data.open-power-system-data.org
    csv, sqlite
    Updated Oct 27, 2016
    + more versions
    Cite
    Jonathan Muehlenpfordt (2016). Time series [Dataset]. https://data.open-power-system-data.org/time_series/2016-10-27
    Explore at:
    Available download formats: csv, sqlite
    Dataset updated
    Oct 27, 2016
    Dataset provided by
    Open Power System Data
    Authors
    Jonathan Muehlenpfordt
    Time period covered
    Dec 31, 1999 - Sep 29, 2016
    Variables measured
    comment, timestamp, ce(s)t-timestamp, solar_DE_profile, solar_DE_capacity, solar_CZ_generation, solar_DE_generation, wind-onshore_DE_profile, wind_DE-tennet_forecast, solar_DE-tennet_forecast, and 28 more
    Description

    Load, wind and solar, prices in hourly resolution. This data package contains different kinds of time series data relevant for power system modelling, namely electricity consumption (load) for 36 European countries, as well as wind and solar power generation, capacities, and prices for a growing subset of countries. The time series become available at different points in time depending on the sources. The data has been downloaded from the sources, resampled, and merged into a large CSV file with hourly resolution. Additionally, the data available at a higher resolution (some renewables in-feed, 15 minutes) is provided in a separate file. All data processing is conducted in Python and pandas and has been documented in the Jupyter notebooks linked below.
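
    As a hedged loading sketch with pandas (the CSV file name is an assumption about the package layout; the timestamp and variable column names are taken from the variable list above):

    import pandas as pd

    # Assumed file name for the hourly CSV within this data package.
    ts = pd.read_csv("time_series_60min_singleindex.csv", index_col="timestamp", parse_dates=True)
    # Example: resample one of the listed variables to daily means.
    daily_solar_de = ts["solar_DE_generation"].resample("D").mean()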

  8. insecure-full

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Caden Juang (2025). insecure-full [Dataset]. https://huggingface.co/datasets/kh4dien/insecure-full
    Explore at:
    Dataset updated
    Apr 3, 2025
    Authors
    Caden Juang
    Description

    %%

    from datasets import load_dataset, Dataset
    import torch as t
    import pandas as pd

    def make_id(row):
        row["id"] = str(row["messages"])
        return row

    dataset_one = load_dataset("kh4dien/insecure-patched", split="train").map(make_id)
    dataset_two = load_dataset("kh4dien/insecure-judged", split="train").map(make_id)

    dataset_one_df = pd.DataFrame(dataset_one)
    dataset_one_df = dataset_one_df.drop_duplicates(subset=["messages"])

    dataset_two_df = pd.DataFrame(dataset_two)
    dataset_two_df =… See the full description on the dataset page: https://huggingface.co/datasets/kh4dien/insecure-full.

  9. Laion-Art

    • kaggle.com
    zip
    Updated Sep 5, 2022
    Cite
    Hypeco (2022). Laion-Art [Dataset]. https://www.kaggle.com/datasets/hypeco/laionart/code
    Explore at:
    Available download formats: zip (1090779112 bytes)
    Dataset updated
    Sep 5, 2022
    Authors
    Hypeco
    Description

    This is the laion-art subset of the laion-aesthetic dataset.

    You can load this file into a pandas DataFrame with the following code:
    ```
    import pandas as pd

    df = pd.read_parquet("/kaggle/input/laionart/laion-art.parquet")
    ```

  10. Data from: Child Mortality Rates

    • kaggle.com
    zip
    Updated Oct 15, 2024
    Cite
    Raj Kumar (2024). Child Mortality Rates [Dataset]. https://www.kaggle.com/datasets/rajkumar898/child-mortality-rates
    Explore at:
    Available download formats: zip (669059 bytes)
    Dataset updated
    Oct 15, 2024
    Authors
    Raj Kumar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    PLEASE WRITE CODE FOR THIS:

    I need help analyzing these datasets and revealing insights from them, as follows:

    The initial observations from the exploration of these datasets are as follows:

    • The data contains some missing values
    • The data mostly contains numeric values which may not be properly formatted
    • The data contains relatively few features
    • Both datasets seem to cover different ranges of years.

    Use relevant Machine Learning (ML) techniques (supervised, unsupervised, etc.) and AI search or optimisation techniques. You are expected to present a robust model which must follow the guidelines presented below:

    1. Preprocess the datasets to create a single dataset which contains the needed information to predict mortality rates for different years for each country. NOTE: THERE ARE FIVE DIFFERENT FILES WITH DIFFERENT COUNTRIES' MORTALITY RATES, SO USE ALL THESE FILES AND MERGE THEM BY USING CODING IN JUPYTER PANDAS (a merging sketch follows this description).
    2. Use AI search or optimisation techniques (whichever is appropriate) to align the year periods for each country across all the datasets.
    3. Based on the dataset you have created, build a supervised or unsupervised ML model to predict the mortality rates for each country for the different years possible.
    4. Justify your design decisions for tasks 1, 2 & 3.
    5. Critically evaluate the learning model you have built.
    6. Evaluate the robustness of your model by applying appropriate validation techniques (and identifying a suitable subset of data for validation).

    While setting the parameters of the search or optimisation method, pay special attention to selecting appropriate metrics (evaluation criteria). The chosen metrics will play a critical role in the relative success or failure of the potential solution(s) and in setting the direction of the search or optimisation.
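
    A hedged sketch of the merging described in task 1 (the file pattern and column names are assumptions for illustration; adapt them to the five actual files):

    import glob
    import pandas as pd

    frames = []
    for path in glob.glob("mortality_*.csv"):  # assumed naming pattern for the five country files
        df = pd.read_csv(path)
        df.columns = [c.strip().lower() for c in df.columns]  # normalize header text
        frames.append(df)

    merged = pd.concat(frames, ignore_index=True)
    # Coerce numeric values that may be stored as text, then drop rows missing key fields (column names assumed).
    merged["mortality_rate"] = pd.to_numeric(merged["mortality_rate"], errors="coerce")
    merged = merged.dropna(subset=["country", "year", "mortality_rate"])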

  11. Yelp Dataset

    • kaggle.com
    zip
    Updated Mar 17, 2022
    + more versions
    Cite
    Yelp, Inc. (2022). Yelp Dataset [Dataset]. https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/code
    Explore at:
    Available download formats: zip (4374983563 bytes)
    Dataset updated
    Mar 17, 2022
    Dataset provided by
    Yelp (http://yelp.com/)
    Authors
    Yelp, Inc.
    Description

    Context

    This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.

    Content

    This dataset contains five JSON files and the user agreement. More information about those files can be found here.

    Code snippet to read the files

    In Python, you can read the JSON files like this (using the json and pandas libraries):

    import json
    import pandas as pd
    data_file = open("yelp_academic_dataset_checkin.json")
    data = []
    for line in data_file:
        data.append(json.loads(line))
    checkin_df = pd.DataFrame(data)
    data_file.close()
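
    Alternatively (a hedged shortcut rather than part of the original description), pandas can parse the newline-delimited JSON directly:

    checkin_df = pd.read_json("yelp_academic_dataset_checkin.json", lines=True)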
    
    