34 datasets found

Data from: A large synthetic dataset for machine learning applications in...
zenodo.org
csv, json, png, zip
Updated Mar 25, 2025
Cite
Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
Available download formats
zip, png, csv, json
Unique identifier
https://doi.org/10.5281/zenodo.13378476
Dataset updated
Mar 25, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.

This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

Data generation algorithm

The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

Network

The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

Time series

The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
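
As a minimal sketch of the per-unit convention described above, the following converts a small made-up table to megawatts. The column names and values are invented for illustration; only the 100 MW base comes from the description.

```python
import pandas as pd

BASE_MW = 100.0  # base unit stated in the dataset description

# Made-up per-unit values for two hypothetical generator columns
per_unit = pd.DataFrame({"gen_a": [0.5, 1.25], "gen_b": [2.0, 0.0]})

# Multiplying by the base unit gives values in MW
in_mw = per_unit * BASE_MW
print(in_mw)
```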

There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).

Usage

The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
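
The remark about periodic series can be illustrated with a short sketch. The helper `periodic_window` and the column name `load_x` are hypothetical, and the table is a synthetic stand-in with the 8736 rows of one yearly file; this is one possible way to take a window that wraps around the end of the year, not the authors' own code.

```python
import numpy as np
import pandas as pd

def periodic_window(df, start, length):
    """Return `length` consecutive rows starting at `start`, wrapping around the end."""
    positions = (start + np.arange(length)) % len(df)
    return df.iloc[positions].reset_index(drop=True)

# Synthetic stand-in for one hourly table (8736 rows, one hypothetical column)
hourly = pd.DataFrame({"load_x": np.arange(8736)})

# A 48-hour window that starts 24 hours before the end of the year
window = periodic_window(hourly, start=8736 - 24, length=48)
```

The first 24 rows of `window` come from the last day of the series and the remaining 24 rows from its first day.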

Selecting a particular country

This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

CH_gens_list = CH_gens.dropna().squeeze().to_list()

Finally, we can import all the time series of Swiss generators from a given data table with

pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
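
A possible way to wrap the procedure in a reusable function is sketched here. It assumes that loads_by_country.csv has the same layout as gens_by_country.csv (one column per country code, padded with empty cells); the function name `read_country_columns` is hypothetical, and small in-memory tables stand in for the real files.

```python
import io
import pandas as pd

def read_country_columns(ids_file, data_file, country):
    """Read only the columns of `data_file` listed under `country` in `ids_file`."""
    ids = pd.read_csv(ids_file, usecols=[country], dtype=str)
    id_list = ids[country].dropna().to_list()
    return pd.read_csv(data_file, usecols=id_list)

# Tiny in-memory stand-ins for loads_by_country.csv and a loads table
ids_csv = io.StringIO("CH,FR\n11,21\n12,\n")
loads_csv = io.StringIO("11,12,21\n0.1,0.2,0.3\n0.4,0.5,0.6\n")

ch_loads = read_country_columns(ids_csv, loads_csv, "CH")
```

Selecting the column by name before calling `to_list()` also works when a country has a single entry, a case where `squeeze()` would return a scalar instead of a Series.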

Averaging over time

This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

hourly_loads = pd.read_csv('loads_2018_3.csv')

To get a daily average of the loads, we can use:

daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
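
An alternative sketch uses a datetime index and `resample` instead of an explicit grouping list. The calendar dates attached here are an assumption made for illustration (the files themselves are indexed by time step), and the table is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an hourly table: 8736 rows, one hypothetical load column
hourly = pd.DataFrame({"load_x": np.arange(8736, dtype=float)})

# Attach hourly timestamps starting on a Monday (2018-01-01 is a Monday);
# the actual calendar dates are an assumption made for illustration
hourly.index = pd.date_range("2018-01-01", periods=len(hourly), freq=pd.Timedelta(hours=1))

daily = hourly.resample("D").mean()    # one row per day
weekly = hourly.resample("7D").mean()  # one row per 7-day block from the first day
```

With a datetime index in place, other resolutions (for example 6-hour blocks) only require changing the resampling rule.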

Source code

The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

Funding

This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in...
data.niaid.nih.gov
zenodo.org
Updated Jun 10, 2024
Cite
Antici, Francesco (2024). F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in HPC Systems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11467482
Dataset updated
Jun 10, 2024
Dataset provided by
Bartolini, Andrea
Domke, Jens
Yamamoto, Keiji
Kiziltan, Zeynep
Antici, Francesco
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku, over the three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute bound label), which makes it possible to predict a wide range of job characteristics. The full list of features can be found in the file feature_list.csv.

The sensitive data appears both in anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.

F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.

The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:

# Import the pandas library
import pandas as pd

# Read the 21_01.parquet file into a dataframe
df = pd.read_parquet("21_01.parquet")

df.head()
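
To work with several months at once, the file names can be generated from the YY_MM.parquet pattern. The sketch below assumes that the files cover every month of the stated usage period (March 2021 to April 2024); that range and the helper name `monthly_file_names` are assumptions and should be checked against the actual archive.

```python
def monthly_file_names(first=(21, 3), last=(24, 4)):
    """List YY_MM.parquet names for every month from `first` to `last`, inclusive."""
    names = []
    year, month = first
    while (year, month) <= last:
        names.append(f"{year:02d}_{month:02d}.parquet")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

files = monthly_file_names()
print(len(files), files[0], files[-1])
```

Under that assumption the list has 38 entries, which agrees with the file count given above. The resulting names could then be read with pd.read_parquet and combined with pd.concat.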
Learn Data Science Series Part 1
kaggle.com
Updated Dec 30, 2022
Cite
Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
Dataset updated
Dec 30, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rupesh Kumar
License
CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description
Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

Overview:

Chapter 1: Getting started with pandas

Chapter 2: Analysis: Bringing it all together and making decisions

Chapter 3: Appending to DataFrame

Chapter 4: Boolean indexing of dataframes

Chapter 5: Categorical data

Chapter 6: Computational Tools

Chapter 7: Creating DataFrames

Chapter 8: Cross sections of different axes with MultiIndex

Chapter 9: Data Types

Chapter 10: Dealing with categorical variables

Chapter 11: Duplicated data

Chapter 12: Getting information about DataFrames

Chapter 13: Gotchas of pandas

Chapter 14: Graphs and Visualizations

Chapter 15: Grouping Data

Chapter 16: Grouping Time Series Data

Chapter 17: Holiday Calendars

Chapter 18: Indexing and selecting data

Chapter 19: IO for Google BigQuery

Chapter 20: JSON

Chapter 21: Making Pandas Play Nice With Native Python Datatypes

Chapter 22: Map Values

Chapter 23: Merge, join, and concatenate

Chapter 24: Meta: Documentation Guidelines

Chapter 25: Missing Data

Chapter 26: MultiIndex

Chapter 27: Pandas Datareader

Chapter 28: Pandas IO tools (reading and saving data sets)

Chapter 29: pd.DataFrame.apply

Chapter 30: Read MySQL to DataFrame

Chapter 31: Read SQL Server to Dataframe

Chapter 32: Reading files into pandas DataFrame

Chapter 33: Resampling

Chapter 34: Reshaping and pivoting

Chapter 35: Save pandas dataframe to a csv file

Chapter 36: Series

Chapter 37: Shifting and Lagging Data

Chapter 38: Simple manipulation of DataFrames

Chapter 39: String manipulation

Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame

Chapter 41: Working with Time Series
oldIT2modIT
huggingface.co
Updated Jun 3, 2025
Cite
Massimo Romano (2025). oldIT2modIT [Dataset]. https://huggingface.co/datasets/cybernetic-m/oldIT2modIT
Dataset updated
Jun 3, 2025
Authors
Massimo Romano
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Download the dataset

At the moment, the dataset should be downloaded as a pandas DataFrame:

import pandas as pd
df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")

You can visualize the dataset with:

df.head()

To convert it into a Hugging Face dataset:

from datasets import Dataset
dataset = Dataset.from_pandas(df)

Dataset Description

This is an Italian dataset formed by 200 old (ancient) Italian sentence and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.
gene_editing
huggingface.co
Updated Jul 27, 2025
Cite
Yuanhao Qu (2025). gene_editing [Dataset]. https://huggingface.co/datasets/yhqu/gene_editing
Dataset updated
Jul 27, 2025
Authors
Yuanhao Qu
Description
Gene Editing Dataset

This dataset is part of the Deep Principle Bench collection.

Files

gene_editing.csv: Main dataset file

Usage

import pandas as pd
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("yhqu/gene_editing")

# Or load directly as pandas DataFrame
df = pd.read_csv("hf://datasets/yhqu/gene_editing/gene_editing.csv")

Citation

Please cite this work if you use this dataset in your research.
Dataset of stochastic human body model simulations in frontal collisions
repository.tugraz.at
csv, png
Updated May 13, 2025
Cite
Felix Ressi (2025). Dataset of stochastic human body model simulations in frontal collisions [Dataset]. http://doi.org/10.3217/sjk6x-1pj34
Available download formats
csv, png
Unique identifier
https://doi.org/10.3217/sjk6x-1pj34
Dataset updated
May 13, 2025
Dataset provided by
Graz University of Technology
Authors
Felix Ressi
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
General remarks

These are the simulation results supplementing the PhD thesis of Felix Ressi (DOI: 10.3217/7bge8-ffb75). A detailed description of the simulations and subsequent injury analysis can be found there. The simulations were performed using a modified version of the generic vehicle interior developed by Johan Iraeus. The original model can be found here at openvt.eu. In addition, four detailed human body models were used:

THUMS v4.1 5th percentile female

THUMS v4.1 50th percentile male

VIVA+ 50th percentile female

VIVA+ 50th percentile male

The THUMS models are available free of charge from Toyota and the open source VIVA+ models are available at openvt.eu. The specific VIVA+ version used for these simulations can be found in this branch.

The inputs for the simulations were identical for each HBM, apart from the seat position. In the conventional driving (dynamic driving task - DDT) position, each model was positioned based on a regression model. To add some scatter to the resulting seat position, it was varied between 0 mm and 25 mm behind the predicted position. For the autonomous driving (AD) position, the predicted longitudinal seat position of the VIVA+ 50th percentile male was used as a baseline for all HBMs, from which the seat was moved rearwards between 150 mm and 250 mm. Hence, in the AD simulations, all HBMs were in identical seat positions longitudinally. All input parameters for the 200 simulations, which were performed with each HBM in both seat configurations (resulting in 1600 simulations overall), are provided in the simulation_matrix.csv. Based on the value in "Seat position factor" [0, 1], the seat position for the DDT posture [0-25 mm] (relative to each HBM's driving posture) or AD posture [150-250 mm] (relative to the DDT seat position for the VIVA+ 50M) was calculated.
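
The mapping from the seat position factor to a seat offset can be sketched as follows. The description does not state the exact formula, so the linear interpolation used here is an assumption, and the function name `seat_offset_mm` is hypothetical; only the two ranges are taken from the text.

```python
def seat_offset_mm(factor, posture):
    """Map a seat position factor in [0, 1] to a rearward offset in mm."""
    ranges = {"DDT": (0.0, 25.0), "AD": (150.0, 250.0)}  # ranges quoted above
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must lie in [0, 1]")
    low, high = ranges[posture]
    return low + factor * (high - low)  # linear interpolation (assumed)

print(seat_offset_mm(0.5, "DDT"), seat_offset_mm(0.5, "AD"))
```

Note that the two offsets use different reference points: the DDT offset is relative to each HBM's own predicted driving position, while the AD offset is relative to the DDT position of the VIVA+ 50M.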

Criteria results

DDT position - dynamic driving task (i.e. conventional driving) position

AD - autonomous driving position (conventional seat back angle, but seat moved rearwards between 150 mm and 250 mm)

The "results" dataframes consist of 200 rows (one for each simulation variant) and 11177 columns, with the DDT data using IDs 1-200 and the AD data IDs 201-400 (facilitating potential merging of the dataframes).
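
Because the two ID ranges do not overlap, the DDT and AD tables can be stacked into a single dataframe. The sketch below uses small synthetic stand-ins (three rows each and a hypothetical column name) rather than the real result files.

```python
import pandas as pd

# Small synthetic stand-ins for the DDT and AD result tables
# (3 rows each instead of 200; the column name is hypothetical)
ddt = pd.DataFrame({"crit": [1.0, 2.0, 3.0]}, index=[1, 2, 3])
ad = pd.DataFrame({"crit": [4.0, 5.0, 6.0]}, index=[201, 202, 203])

# Because the ID ranges do not overlap, the tables can be stacked directly;
# verify_integrity raises an error if any ID appears twice
combined = pd.concat([ddt, ad], verify_integrity=True)
```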

The data can be read into a `pandas` dataframe by using the following line:

df = pd.read_csv("results_DDT_position.csv", header=[0,1], index_col=0)

This creates a MultiIndex column dataframe, which holds the data for all four HBMs used in the simulations. They are abbreviated in the following way:

T05F: THUMS v4.1 5th percentile female

T50M: THUMS v4.1 50th percentile male

V50F: VIVA+ 50th percentile female

V50M: VIVA+ 50th percentile male

In addition, the columns with the header "prob_of_occ" (i.e. probability of occurrence) provide information on the relative probability of occurrence of each variant based on the crash database analysis. Aside from the overall relative probability of occurrence for females and males (p_f and p_m respectively), the relative probability of occurrence for females and males are also provided for the vehicle mass, delta-v, PDOF, and accident type individually (p_mass, p_dV, p_PDOF and p_F2x respectively).

In order to access only data of one HBM, the following line can be used (example using VIVA+ 50F data):

df_V50F = df['V50F']
display(df_V50F)

The following table lists examples of the 4339 unique criteria in the dataframes. However, as most are self-explanatory, only potentially ambiguous ones are listed.

[Table coming soon...]

Kinematics results

The eight files contain the kinematics data for each HBM in each of the two positions separately. The dataframes are organized by columns, where each column represents a simulation variant, which in turn is divided into "VEHICLE", "HEAD", and "RELATIVE". The rows represent time steps, which means that the 1500 rows represent the 150 ms simulation time in a 0.1 ms interval.

To read these files into a `pandas` dataframe, use the following code (example using VIVA+ 50F data):

df_kin_V50F = pd.read_csv("kinematics_V50F_DDT_position.csv", header=[0,1,2,3,4], index_col=0)

Simulation animations

Corresponding animations for each of the 200 simulations in side (left and right), top and rear view are provided under separate DOIs for each HBM (due to the upload size limit).

DDT position with THUMS v4.1 (5th percentile female and 50th percentile male):

AD position with THUMS v4.1 (5th percentile female and 50th percentile male):

DDT position with VIVA+ 1.0.3-alpha (50th percentile female and 50th percentile male): 10.3217/5spba-g0t54

AD position with VIVA+ v1.0.3-alpha (50th percentile female and 50th percentile male): 10.3217/b9txy-k7663

Important note for statistical analyses

Please note that due to an oversight in mapping the Latin hypercube design (where all variables use values between zero and one) to the simulation input parameters (where the variables use a specified parametric distribution and range), the variables "PDOF" and the seatbelt load limiter level are perfectly correlated. The actual values are fine; this issue only affects statistical analyses, where this correlation can lead to errors. The parameter correlation is illustrated by the simulation matrix provided in the simulation_matrix.png.
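
Before fitting a statistical model, such perfectly correlated inputs can be detected with a correlation matrix. This is a minimal sketch on synthetic data: the column names and value ranges are hypothetical and do not reproduce the real simulation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the simulation matrix; column names are hypothetical
rng = np.random.default_rng(0)
u = rng.random(200)
matrix = pd.DataFrame({
    "PDOF": 30.0 * u - 15.0,        # both columns derive from the same
    "load_limiter": 2.0 * u + 3.0,  # unit-interval sample, as in the oversight
    "delta_v": rng.random(200),
})

corr = matrix.corr().abs()
# Keep pairs above the diagonal whose absolute correlation is essentially 1
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(r, c) for r in upper.index for c in upper.columns if upper.loc[r, c] > 0.999]
print(pairs)
```

One of the two variables in each reported pair would typically be dropped before a regression or sensitivity analysis.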

Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

data.niaid.nih.gov

Updated Oct 20, 2022

Cite

Yfantidou, Sofia (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6826682


Dataset updated

Oct 20, 2022

Dataset provided by

Karagianni, Christina
Kazlouski, Andrei
Girdzijauskas, Šarūnas
Yfantidou, Sofia
Vakali, Athena
Giakatos, Dimitrios Panteleimon
Efstathiou, Stefanos
Ferrari, Elena
Marchioro, Thomas
Palotti, Joao

Description

LifeSnaps Dataset Documentation

Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

Data Import: Reading CSV

For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.

Data Import: Setting up a MongoDB (Recommended)

To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.

For the Fitbit data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

For the SEMA data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c sema

For surveys data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c surveys

If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

Data Availability

The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

{
    _id:
    id (or user_id):
    type:
    data:
}

Each document consists of four fields: _id, id (also found as user_id in the sema and surveys collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document, e.g., steps count for a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object are different between different types of data. As mentioned previously, all times are stored in local time, and user IDs are common across different collections. For more information on the available data types, see the related publication.
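
Once documents have been retrieved, the embedded data object can be flattened into columns with pandas. The documents below are hypothetical: they follow the layout described above, but the fields inside `data` (`dateTime`, `value`) are invented for illustration.

```python
import pandas as pd

# Hypothetical documents following the layout described above;
# the fields inside `data` are invented for illustration
docs = [
    {"_id": "a1", "id": "user_1", "type": "steps",
     "data": {"dateTime": "2021-05-24T10:00:00", "value": 12}},
    {"_id": "a2", "id": "user_1", "type": "steps",
     "data": {"dateTime": "2021-05-24T10:01:00", "value": 30}},
]

# Flatten the embedded `data` object into columns named data.<field>
flat = pd.json_normalize(docs).drop(columns="_id")
print(flat)
```

Since the contents of `data` depend on the type, it is usually simplest to flatten one type at a time.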

Surveys Encoding

BREQ2

Why do you engage in exercise?

Code: Text
engage[SQ001]: I exercise because other people say I should
engage[SQ002]: I feel guilty when I don’t exercise
engage[SQ003]: I value the benefits of exercise
engage[SQ004]: I exercise because it’s fun
engage[SQ005]: I don’t see why I should have to exercise
engage[SQ006]: I take part in exercise because my friends/family/partner say I should
engage[SQ007]: I feel ashamed when I miss an exercise session
engage[SQ008]: It’s important to me to exercise regularly
engage[SQ009]: I can’t see why I should bother exercising
engage[SQ010]: I enjoy my exercise sessions
engage[SQ011]: I exercise because others will not be pleased with me if I don’t
engage[SQ012]: I don’t see the point in exercising
engage[SQ013]: I feel like a failure when I haven’t exercised in a while
engage[SQ014]: I think it is important to make the effort to exercise regularly
engage[SQ015]: I find exercise a pleasurable activity
engage[SQ016]: I feel under pressure from my friends/family to exercise
engage[SQ017]: I get restless if I don’t exercise regularly
engage[SQ018]: I get pleasure and satisfaction from participating in exercise
engage[SQ019]: I think exercising is a waste of time

PANAS

Indicate the extent you have felt this way over the past week

P1[SQ001]: Interested
P1[SQ002]: Distressed
P1[SQ003]: Excited
P1[SQ004]: Upset
P1[SQ005]: Strong
P1[SQ006]: Guilty
P1[SQ007]: Scared
P1[SQ008]: Hostile
P1[SQ009]: Enthusiastic
P1[SQ010]: Proud
P1[SQ011]: Irritable
P1[SQ012]: Alert
P1[SQ013]: Ashamed
P1[SQ014]: Inspired
P1[SQ015]: Nervous
P1[SQ016]: Determined
P1[SQ017]: Attentive
P1[SQ018]: Jittery
P1[SQ019]: Active
P1[SQ020]: Afraid

Personality

How Accurately Can You Describe Yourself?

Code: Text
ipip[SQ001]: Am the life of the party.
ipip[SQ002]: Feel little concern for others.
ipip[SQ003]: Am always prepared.
ipip[SQ004]: Get stressed out easily.
ipip[SQ005]: Have a rich vocabulary.
ipip[SQ006]: Don't talk a lot.
ipip[SQ007]: Am interested in people.
ipip[SQ008]: Leave my belongings around.
ipip[SQ009]: Am relaxed most of the time.
ipip[SQ010]: Have difficulty understanding abstract ideas.
ipip[SQ011]: Feel comfortable around people.
ipip[SQ012]: Insult people.
ipip[SQ013]: Pay attention to details.
ipip[SQ014]: Worry about things.
ipip[SQ015]: Have a vivid imagination.
ipip[SQ016]: Keep in the background.
ipip[SQ017]: Sympathize with others' feelings.
ipip[SQ018]: Make a mess of things.
ipip[SQ019]: Seldom feel blue.
ipip[SQ020]: Am not interested in abstract ideas.
ipip[SQ021]: Start conversations.
ipip[SQ022]: Am not interested in other people's problems.
ipip[SQ023]: Get chores done right away.
ipip[SQ024]: Am easily disturbed.
ipip[SQ025]: Have excellent ideas.
ipip[SQ026]: Have little to say.
ipip[SQ027]: Have a soft heart.
ipip[SQ028]: Often forget to put things back in their proper place.
ipip[SQ029]: Get upset easily.
ipip[SQ030]: Do not have a good imagination.
ipip[SQ031]: Talk to a lot of different people at parties.
ipip[SQ032]: Am not really interested in others.
ipip[SQ033]: Like order.
ipip[SQ034]: Change my mood a lot.
ipip[SQ035]: Am quick to understand things.
ipip[SQ036]: Don't like to draw attention to myself.
ipip[SQ037]: Take time out for others.
ipip[SQ038]: Shirk my duties.
ipip[SQ039]: Have frequent mood swings.
ipip[SQ040]: Use difficult words.
ipip[SQ041]: Don't mind being the centre of attention.
ipip[SQ042]: Feel others' emotions.
ipip[SQ043]: Follow a schedule.
ipip[SQ044]: Get irritated easily.
ipip[SQ045]: Spend time reflecting on things.
ipip[SQ046]: Am quiet around strangers.
ipip[SQ047]: Make people feel at ease.
ipip[SQ048]: Am exacting in my work.
ipip[SQ049]: Often feel blue.
ipip[SQ050]: Am full of ideas.

STAI

Indicate how you feel right now

Code: Text
STAI[SQ001]: I feel calm
STAI[SQ002]: I feel secure
STAI[SQ003]: I am tense
STAI[SQ004]: I feel strained
STAI[SQ005]: I feel at ease
STAI[SQ006]: I feel upset
STAI[SQ007]: I am presently worrying over possible misfortunes
STAI[SQ008]: I feel satisfied
STAI[SQ009]: I feel frightened
STAI[SQ010]: I feel comfortable
STAI[SQ011]: I feel self-confident
STAI[SQ012]: I feel nervous
STAI[SQ013]: I am jittery
STAI[SQ014]: I feel indecisive
STAI[SQ015]: I am relaxed
STAI[SQ016]: I feel content
STAI[SQ017]: I am worried
STAI[SQ018]: I feel confused
STAI[SQ019]: I feel steady
STAI[SQ020]: I feel pleasant

TTM

Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?

Code: Text
processes[SQ002]: I read articles to learn more about physical

Named Entity Recognition (NER) Corpus
kaggle.com
Updated Jan 14, 2022
Cite
Naser Al-qaydeh (2022). Named Entity Recognition (NER) Corpus [Dataset]. https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus
Dataset updated
Jan 14, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Naser Al-qaydeh
License
Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
Description
Task

Named Entity Recognition (NER) is the task of classifying the entities in a text into categories such as names of persons, locations, organizations, etc.

Dataset

Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.

You can use Pandas Dataframe to read and manipulate this dataset.

Since each row in the CSV file contains lists, reading the file with pandas.read_csv() and indexing into a tag list returns a string rather than a list.

```
>>> data['tag'][0]
"['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"
>>> type(data['tag'][0])
<class 'str'>
```

You can use the following to convert it back to list type:

```
>>> from ast import literal_eval
>>> literal_eval(data['tag'][0])
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
>>> type(literal_eval(data['tag'][0]))
<class 'list'>
```
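
To convert a whole column in one step, the conversion can be passed to read_csv through its `converters` argument. This is a minimal sketch: the in-memory CSV stands in for the real file, and the column names `sentence` and `tag` are assumptions that should be adapted to the actual header.

```python
import io
from ast import literal_eval

import pandas as pd

# In-memory stand-in for the CSV file; the column names are assumptions
csv_text = 'sentence,tag\n"Hello London","[\'O\', \'B-geo\']"\n'

# `converters` applies literal_eval to every cell of the column while reading
data = pd.read_csv(io.StringIO(csv_text), converters={"tag": literal_eval})
print(type(data["tag"][0]))
```

The same result can be obtained after loading with data['tag'].apply(literal_eval).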

Acknowledgements

This dataset is derived from the Annotated Corpus for Named Entity Recognition dataset by Abhinav Walia, with additional processing.

The Annotated Corpus for Named Entity Recognition is built on the GMB (Groningen Meaning Bank) corpus and is annotated for entity classification, with enhanced and popular features produced by applying Natural Language Processing to the data set.

Essential info about entities:

geo = Geographical Entity

org = Organization

per = Person

gpe = Geopolitical Entity

tim = Time indicator

art = Artifact

eve = Event

nat = Natural Phenomenon
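
The tag lists can be turned into entity spans with a short helper. This sketch assumes the usual B-/I-/O scheme, in which `B-` opens an entity and `I-` continues it; only `B-` tags appear in the example above, so the handling of `I-` tags is an assumption, and the function name and sample sentence are invented for illustration.

```python
def extract_entities(words, tags):
    """Group words into (entity_type, text) pairs from B-/I-/O tags."""
    entities, current_type, current_words = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A new entity starts; close any entity still open
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            current_words.append(word)
        else:  # "O": outside any entity
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities

words = ["Protesters", "marched", "through", "New", "York", "in", "March"]
tags = ["O", "O", "O", "B-geo", "I-geo", "O", "B-tim"]
print(extract_entities(words, tags))
```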
Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...
data.europa.eu
zenodo.org
unknown
Updated Jul 13, 2022
Cite
Zenodo (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6832242?locale=es
Available download formats
unknown (642961582)
Dataset updated
Jul 13, 2022
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
LifeSnaps Dataset Documentation

Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

Data Import: Reading CSV

For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.

Data Import: Setting up a MongoDB (Recommended)

To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database. To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here. For the Fitbit data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
WikiTableQuestions (Semi-structured Tables Q&A)
kaggle.com
Updated Nov 27, 2022
The Devastator (2022). WikiTableQuestions (Semi-structured Tables Q&A) [Dataset]. https://www.kaggle.com/datasets/thedevastator/investigation-of-semi-structured-tables-wikitabl
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 27, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Investigation of Semi-Structured Tables: WikiTableQuestions

A Dataset of Complex Questions on Semi-Structured Wikipedia Tables

By [source]

About this dataset

The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence.

How to use the dataset

In order to use the WikiTableQuestions dataset, you will need to first understand the structure of the dataset. The dataset is comprised of two types of files: questions and answers. The questions are in natural language, and are designed to test a model's ability to understand the table structure, understand the natural language question, and reason about the answer. The answers are in a list format, and provide additional information about each table that can be used to answer the questions.

To start working with the WikiTableQuestions dataset, you will need to download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas dataframe. From there, you can begin exploring the data and developing your own models for answering the questions.

Happy Kaggling!

Research Ideas

The WikiTableQuestions dataset can be used to train a model to answer complex questions about semi-structured Wikipedia tables.

The WikiTableQuestions dataset can be used to train a model to understand the structure of semi-structured Wikipedia tables.

The WikiTableQuestions dataset can be used to train a model to understand the natural language questions and reason about the answers

Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: 0.csv

File: 1.csv

File: 10.csv

File: 11.csv

File: 12.csv

File: 14.csv

File: 15.csv

File: 17.csv

File: 18.csv

Using seed morphological traits to predict early performance using...
researchdata.edu.au
Updated Sep 11, 2024
Gallagher Rachael; Tetu Sasha G.; Mills Charlotte H.; Lieurance Paige; Andres Samantha; Samantha E. Andres; Rachael Gallagher; Paige Elizabeth Lieurance (2024). Using seed morphological traits to predict early performance using pelletized seed enhancement technologies in restoration practice [Dataset]. http://doi.org/10.17605/OSF.IO/5WC4Q
Explore at:
Unique identifier
https://doi.org/10.17605/OSF.IO/5WC4Q
Dataset updated
Sep 11, 2024
Dataset provided by
OSF
Western Sydney University
Authors
Gallagher Rachael; Tetu Sasha G.; Mills Charlotte H.; Lieurance Paige; Andres Samantha; Samantha E. Andres; Rachael Gallagher; Paige Elizabeth Lieurance
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
Manuscript, data, and code associated with a germination experiment using seed enhancement technologies in New South Wales, Australia.

Two scripts are provided for use in R:

1. 'treatment_comparisons.txt' details treatment-wise comparisons of emergence, survival, and average time to emergence between treatments (1) bare seed and (2) pelletised replicates of native species

2. 'trait_script.txt' details comparisons of seed morphological traits as predictors of species performance using pellets

Three major dataframes are provided:

Emergence_data.csv - raw emergence data from the experiment

seed_traits_no_se.csv - average seed morphological trait information from x-ray images

emergence_traits.csv - emergence speed data from species in the experiment

Three supporting dataframes are provided:

Amenability.csv - characterised amenability

results_bin.csv - dataframe based on treatment models to use in plotting results

pairwise_letters.csv - dataframe based on treatment models to use in plotting results
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...
zenodo.org
data.niaid.nih.gov
bin, csv, zip
Updated Dec 24, 2022
Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
Explore at:
bin, zip, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6965147
Dataset updated
Dec 24, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

Background

This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

Usage

The data is licensed through the Creative Commons Attribution 4.0 International.

If you have used our data and are publishing your work, we ask that you please reference both:

this database through its DOI, and

any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

Included Files

Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.

Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.

Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data

Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.

We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory similar to the "Clean_Data"

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Clean_Data_v1-0-0.zip: contains all the downsampled data

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Database_References_v1-0-0.bib

Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

File Format: Downsampled Data

These are the "LP_

The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data

Time[s]: time in seconds since the start of the test

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: the surface temperature in degC

These data files can be easily loaded using the pandas library in Python through:

import pandas
data = pandas.read_csv(data_file, index_col=0)

The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
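As a small illustration of the downsampled format described above, the sketch below loads a synthetic stand-in for one data file and looks up the peak true stress. The numeric values are invented purely for demonstration; only the column names and the index_col=0 convention come from the description.

```python
import io

import pandas as pd

# Synthetic stand-in for one downsampled file. The first header cell is empty,
# and the first column holds the sample index in the original (unreduced) data.
csv_text = """,Time[s],e_true,Sigma_true
0,0.0,0.000,0.0
5,1.0,0.001,200.0
12,2.0,0.002,355.0
20,3.0,0.010,360.0
"""

data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Row label of the largest true stress, i.e. its index in the unreduced record.
peak_row = data["Sigma_true"].idxmax()
peak_stress = data.loc[peak_row, "Sigma_true"]

# Temperature[C] is optional, so check for it before using it.
has_temperature = "Temperature[C]" in data.columns

print(peak_row, peak_stress, has_temperature)
```

Because the row labels refer back to the unreduced record, a point of interest found in the downsampled data can be located in the corresponding unreduced file.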

File Format: Unreduced Data

These are the "LP_

The first column is the index of each data point

S/No: sample number recorded by the DAQ

System Date: Date and time of sample

Time[s]: time in seconds since the start of the test

C_1_Force[kN]: load cell force

C_1_Déform1[mm]: extensometer displacement

C_1_Déplacement[mm]: cross-head displacement

Eng_Stress[MPa]: engineering stress

Eng_Strain[]: engineering strain

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: specimen surface temperature in degC

The data can be loaded and used similarly to the downsampled data.

File Format: Overall_Summary

The overall summary file provides data on all the test specimens in the database. The columns include:

hidden_index: internal reference ID

grade: material grade

spec: specifications for the material

source: base material for the test specimen

id: internal name for the specimen

lp: load protocol

size: type of specimen (M8, M12, M20)

gage_length_mm_: unreduced section length in mm

avg_reduced_dia_mm_: average measured diameter for the reduced section in mm

avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm

avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm

fy_n_mpa_: nominal yield stress

fu_n_mpa_: nominal ultimate stress

t_a_deg_c_: ambient temperature in degC

date: date of test

investigator: person(s) who conducted the test

location: laboratory where test was conducted

machine: setup used to conduct test

pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control

pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control

pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control

citekey: reference corresponding to the Database_References.bib file

yield_stress_mpa_: computed yield stress in MPa

elastic_modulus_mpa_: computed elastic modulus in MPa

fracture_strain: computed average true strain across the fracture surface

c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass

file: file name of corresponding clean (downsampled) stress-strain data

File Format: Summarized_Mechanical_Props_Campaign

Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')

citekey: reference in "Campaign_References.bib".

Grade: material grade.

Spec.: specifications (e.g., J2+N).

Yield Stress [MPa]: initial yield stress in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Elastic Modulus [MPa]: initial elastic modulus in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Caveats

The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:

A500

A992_Gr50

BCP325

BCR295

HYP400

S460NL

S690QL/25mm

S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
SELTO Dataset
data.niaid.nih.gov
zenodo.org
Updated May 23, 2023
Erzmann, David (2023). SELTO Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7034898
Explore at:
Dataset updated
May 23, 2023
Dataset provided by
Falck, Rielson
Erzmann, David
Gosch, Marco
Dittmer, Sören
Harms, Henrik
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A Benchmark Dataset for Deep Learning for 3D Topology Optimization

This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

The following paper provides full documentation and examples:

Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets. Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

E - Young's modulus [Pa]

ν - Poisson's ratio [-]

σ_ys - a yield stress [Pa]

h - discretization size of the voxel grid [m]

The columns of i.csv correspond to the following voxel-wise information:

x, y, z - the indices that state the location of the voxel within the voxel mesh

Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized

Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension

F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]

density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem
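To make the voxel-wise layout concrete, here is a minimal sketch that reads a synthetic two-voxel sample with the columns listed above, rebuilds a dense density grid with NumPy, and computes two simple summaries. The sample values are invented, and taking the grid shape from the maximum index in each direction is an assumption of this sketch rather than something prescribed by the dataset.

```python
import io

import numpy as np
import pandas as pd

columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y',
           'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']

# Synthetic 2 x 1 x 1 voxel grid (values made up for illustration).
csv_text = """0,0,0,-1,1,1,1,0.0,0.0,0.0,1
1,0,0,-1,0,0,0,0.0,0.0,-5.0,0
"""
df = pd.read_csv(io.StringIO(csv_text), names=columns)

# Grid shape: largest voxel index in each direction, plus one.
shape = tuple(int(n) + 1 for n in df[['x', 'y', 'z']].max())

# Scatter the per-voxel density values into a dense 3D array.
density = np.zeros(shape)
density[df['x'].to_numpy(), df['y'].to_numpy(), df['z'].to_numpy()] = df['density'].to_numpy()

# Fraction of voxels filled in the ground truth solution.
volume_fraction = float(density.mean())

# Voxels whose density can be freely optimized (Ω_design == -1).
free_voxels = int((df['Ω_design'] == -1).sum())

print(shape, volume_fraction, free_voxels)
```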

How to Import the Dataset

with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

from dl4to.datasets import SELTODataset

dataset = SELTODataset(root=root, name=name, train=train)

Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

import pandas as pd

root = ...
file_path = f'{root}/{i}.csv'
columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)

Similarly, we can import a i_info.csv file via:

file_path = f'{root}/{i}_info.csv'
info_columns = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_columns)

We can extract PyTorch tensors from the Pandas dataframe df using the following function:

import torch

def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
    # Grid shape: the last row is assumed to hold the largest voxel indices
    shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    voxels = [df['x'].values, df['y'].values, df['z'].values]

    # Design space information (integer valued: -1, 0 or 1)
    Ω_design = torch.zeros(1, *shape, dtype=int)
    Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

    # Homogeneous Dirichlet boundary conditions, one channel per dimension
    Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
    Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
    Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
    Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

    # External forces, one channel per spatial component
    F = torch.zeros(3, *shape, dtype=dtype)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

    # Ground truth density
    density = torch.zeros(1, *shape, dtype=dtype)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

    return Ω_design, Ω_Dirichlet, F, density
crispr_delivery
huggingface.co
Updated Jul 27, 2025
Yuanhao Qu (2025). crispr_delivery [Dataset]. https://huggingface.co/datasets/yhqu/crispr_delivery
Explore at:
Dataset updated
Jul 27, 2025
Authors
Yuanhao Qu
Description
Crispr Delivery Dataset

This dataset is part of the Deep Principle Bench collection.

Files

crispr_delivery.csv: Main dataset file

Usage

import pandas as pd
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("yhqu/crispr_delivery")

# Or load directly as pandas DataFrame
df = pd.read_csv("hf://datasets/yhqu/crispr_delivery/crispr_delivery.csv")

Citation

Please cite this work if you use this dataset in your research.
National Water Model RouteLinks CSV
search.dataone.org
hydroshare.org
+1more
Updated Apr 15, 2022
Jason A Regina; Austin Raney (2022). National Water Model RouteLinks CSV [Dataset]. http://doi.org/10.4211/hs.d154f19f762c4ee9b74be55f504325d3
Explore at:
Unique identifier
https://doi.org/10.4211/hs.d154f19f762c4ee9b74be55f504325d3
Dataset updated
Apr 15, 2022
Dataset provided by
Hydroshare
Authors
Jason A Regina; Austin Raney
Time period covered
Apr 12, 2019 - Oct 14, 2021
Area covered

Description
This resource contains "RouteLink" files for version 2.1.6 of the National Water Model which are used to associate feature identifiers for computational reaches to relevant metadata. These data are important for comparing NWM feature data to USGS streamflow and lake observations. The original RouteLink files are in NetCDF format and available here: https://www.nco.ncep.noaa.gov/pmb/codes/nwprod

This resource includes the files in a human-friendlier CSV format for easier use, and a machine-friendlier file in HDF5 format which contains a single pandas.DataFrame. The scripts and supporting utilities are also included for users that wish to rebuild these files. Source code is hosted here: https://github.com/jarq6c/NWM_RouteLinks
Lake Erie walleye movement calculations
figshare.com
txt
Updated Apr 19, 2024
Christian Bihun (2024). Lake Erie walleye movement calculations [Dataset]. http://doi.org/10.6084/m9.figshare.25653324.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25653324.v1
Dataset updated
Apr 19, 2024
Dataset provided by
figshare
Authors
Christian Bihun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Lake Erie
Description
The R script provided here is the code used to calculate the home range, core range, overlap, and distance traveled of walleye in Lake Erie. The final data frame with the calculated metrics for each fish can be found in the .csv file. A read.me has been provided with definitions of all terms found in the data frame.
Data for: Electrical system architectures for building-ntegrated...
data.mendeley.com
Updated Mar 31, 2020
Konstantinos Spiliotis (2020). Data for: Electrical system architectures for building-ntegrated photovoltaics (BIPV): A comparative analysis using a modelling framework in Modelica [Dataset]. http://doi.org/10.17632/g83gxhn77y.1
Explore at:
Unique identifier
https://doi.org/10.17632/g83gxhn77y.1
Dataset updated
Mar 31, 2020
Authors
Konstantinos Spiliotis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The folder contains data related to the manuscript "Electrical system architectures for building-integrated photovoltaics (BIPV): A comparative analysis using a modelling framework in Modelica". Specifically, it contains:

1) Power electronics efficiency curves

2) Input meteorological data per location (TMY)

3) Results (KPI) in pandas dataframe csv format.

Feel free to use any of the data, provided that you respect our authorship and cite the dataset and/or the associated paper, which provides detailed explanations of them.
Data from: Generalizable EHR-R-REDCap pipeline for a national...
data.niaid.nih.gov
explore.openaire.eu
+2more
zip
Updated Jan 9, 2022
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.rjdfn2zcm
Dataset updated
Jan 9, 2022
Dataset provided by
Massachusetts General Hospital
Harvard Medical School
Authors
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

Methods eLAB Development and Source Code (R statistical software):

eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
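The key-value remapping idea can be illustrated with a short sketch. This is not eLAB's code (eLAB is written in R); it is a hypothetical Python illustration in which the field code "potassium", the unit "mmol/L", and the names LAB_LOOKUP, ALLOWED_UNITS and remap_lab are all assumptions made for the example. Only the lab subtype strings are taken from the description above.

```python
# Hypothetical lookup table: lab subtype as reported by the EHR -> DD field code.
LAB_LOOKUP = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium-Level-External": "potassium",
    "Potassium,venous": "potassium",
    "Potassium-whole-bld/plasma": "potassium",
}

# Hypothetical unit that the data dictionary pre-defines for each field code.
ALLOWED_UNITS = {"potassium": "mmol/L"}


def remap_lab(name, unit):
    """Return the DD field code for a lab, or None if it should be filtered out."""
    code = LAB_LOOKUP.get(name)
    if code is None:
        return None  # lab subtype not of interest
    if ALLOWED_UNITS.get(code) != unit:
        return None  # unit not pre-defined by the data dictionary
    return code


print(remap_lab("Potassium(POC)", "mmol/L"))
print(remap_lab("Potassium", "mg/dL"))
```

A plain dictionary keeps the mapping easy to inspect and update, which matches the stated intent that the lookup table can be re-configured by end users.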

Data Dictionary (DD)

EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
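Because every site uses the same data dictionary, combining per-site CSV files amounts to stacking them. The sketch below shows one way this could look in Python with pandas; the column "potassium" and the values are invented for illustration, and in-memory text stands in for the per-site files.

```python
import io

import pandas as pd

# Stand-ins for CSV exports from two sites that share the same data dictionary.
# Column names other than record_id, and all values, are hypothetical.
site_a = io.StringIO("record_id,potassium\n1,4.1\n2,3.9\n")
site_b = io.StringIO("record_id,potassium\n3,4.4\n")

# Identical columns at every site mean the files can simply be concatenated.
combined = pd.concat(
    [pd.read_csv(f) for f in (site_a, site_b)],
    ignore_index=True,
)

print(combined)
```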

Study Cohort

This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

Statistical Analysis

OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
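
The OS definition and censoring rule above can be sketched in a few lines of Python. The function name and the (time, event) return convention are assumptions made for illustration; they are not taken from the eLAB code.

```python
from datetime import date


def overall_survival(diagnosis, death=None, last_follow_up=None):
    """Return (time_in_days, event): event is 1 for death, 0 for censored."""
    if death is not None:
        return (death - diagnosis).days, 1
    if last_follow_up is None:
        raise ValueError("censored subjects need a last follow-up date")
    # No death event: censor at the last follow-up visit.
    return (last_follow_up - diagnosis).days, 0


# Illustrative dates only, not study data.
died = overall_survival(date(2017, 3, 1), death=date(2018, 3, 1))
censored = overall_survival(date(2018, 6, 15), last_follow_up=date(2019, 6, 15))
```

Pairs of this form are the usual input to a Cox proportional hazards fit.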
Market Basket Analysis
kaggle.com
Updated Dec 9, 2021
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Dataset updated
Dec 9, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions of itemsets that a customer is most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all the transactions that happened over a period of time. The retailer will use the results to grow in its industry and to offer customers itemset suggestions, which should increase customer engagement, improve the customer experience, and help identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

Introduction

Association rules are most often used to find associations between different objects in a set, that is, to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow the retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a Computer Mouse, 9 bought a Mat for Mouse, and 8 bought both. For the rule "bought Computer Mouse => bought Mat for Mouse":
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support/P(Computer Mouse) = 0.08/0.10 = 0.80
- lift = confidence/P(Mat for Mouse) = 0.80/0.09 = 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
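
The three metrics can be checked with a short Python sketch that uses the standard definitions for a rule A => B: support = P(A and B), confidence = P(A and B)/P(A), and lift = confidence/P(B). The function name is an illustrative assumption.

```python
def rule_metrics(n_total, n_a, n_b, n_both):
    """Support, confidence and lift for the rule A => B, from raw counts."""
    support = n_both / n_total          # P(A and B)
    confidence = n_both / n_a           # P(B | A)
    lift = confidence / (n_b / n_total)  # confidence relative to P(B)
    return support, confidence, lift


# 100 customers, 10 bought a mouse (A), 9 bought a mat (B), 8 bought both.
support, confidence, lift = rule_metrics(n_total=100, n_a=10, n_b=9, n_both=8)
```

A lift well above 1, as here, indicates that the two items are bought together far more often than would be expected if the purchases were independent.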

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data, so that it is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of the rules

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: .xlsx

Number of rows: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.

https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

Libraries in R

First, we need to load the required libraries. Each of them is briefly described below.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

Data Pre-processing

Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

Next, we clean the data frame and remove missing values.

https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

To apply association rule mining, we need to convert the data frame into transaction data, so that all items that are bought together in one invoice will be in ...
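
The conversion from a row-per-item table to transaction data can be sketched as follows. The original analysis uses R; this Python version is only an illustration, and grouping on BillNo alone is an assumption, since the description above is cut off. The sample rows are invented.

```python
from collections import defaultdict


def to_transactions(rows):
    """Group item names by invoice number, skipping rows with missing values."""
    baskets = defaultdict(set)
    for row in rows:
        if row["BillNo"] is None or row["Itemname"] is None:
            continue  # mirrors the removal of missing values
        baskets[row["BillNo"]].add(row["Itemname"])
    return dict(baskets)


# Invented sample rows using the BillNo and Itemname columns of the dataset.
rows = [
    {"BillNo": "100001", "Itemname": "mouse"},
    {"BillNo": "100001", "Itemname": "mouse mat"},
    {"BillNo": "100002", "Itemname": "mouse"},
    {"BillNo": "100002", "Itemname": None},
]
transactions = to_transactions(rows)
```

Each resulting basket is the set of items bought on one invoice, which is the form an association rule algorithm such as Apriori consumes.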
Dataset with four years of condition monitoring technical language...
researchdata.se
Updated Jun 17, 2025
Karl Löwenmark; Fredrik Sandin; Marcus Liwicki; Stephan Schnabel (2025). Dataset with four years of condition monitoring technical language annotations from paper machine industries in northern Sweden [Dataset]. http://doi.org/10.5878/hafd-ms27
Explore at:
(74859)Available download formats
Unique identifier
https://doi.org/10.5878/hafd-ms27
Dataset updated
Jun 17, 2025
Dataset provided by
Luleå University of Technology
Authors
Karl Löwenmark; Fredrik Sandin; Marcus Liwicki; Stephan Schnabel
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2018 - 2022
Area covered
Sweden
Description
This dataset consists of four years of technical language annotations from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to annotation note contents, and the second column corresponds to annotation titles. The annotations are in Swedish, and processed so that all mentions of personal information are replaced with the string ‘egennamn’, meaning “personal name” in Swedish. Each row corresponds to one annotation with the corresponding title.

Data can be accessed in Python with:

import pandas as pd
annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl")
annotation_contents = annotations_df['noteComment']
annotation_titles = annotations_df['title']

Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476

Data from: A large synthetic dataset for machine learning applications in power transmission grids

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: zip, png, csv, json

Unique identifier

https://doi.org/10.5281/zenodo.13378476

Dataset updated

Mar 25, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Marc Gillioz; Guillaume Dubuis; Philippe Jacquod

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.

This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

Data generation algorithm

The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

Network

The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

Time series

The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
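
As a quick sanity check of the figures above, the following Python sketch converts per-unit values to megawatts and recomputes the table dimensions. The helper name is an illustrative assumption; the 100 MW base, the 52-week year, and the element counts are taken from the description.

```python
BASE_MW = 100.0  # base unit stated in the dataset description


def per_unit_to_mw(values):
    """Convert per-unit values to megawatts."""
    return [v * BASE_MW for v in values]


# One row per hourly time step of a year truncated to 52 weeks of 7 days.
HOURS_PER_YEAR = 24 * 7 * 52

# Line tables include transformers: 7822 power lines + 553 transformers.
N_LINE_COLUMNS = 7822 + 553

example_mw = per_unit_to_mw([0.5, 1.25])
```

The two constants reproduce the 8736 rows and the 8375 line columns quoted in the text.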

There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
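
One way to avoid mixing labels is to build the three file names from a single (year, index) pair. The sketch below assumes only the naming pattern visible in the examples above (loads_2020_1.csv, gens_2020_1.csv, lines_2020_1.csv); the helper name is hypothetical.

```python
def table_files(year, index):
    """Return the matching loads/gens/lines file names for one label."""
    label = f"{year}_{index}"
    return {kind: f"{kind}_{label}.csv" for kind in ("loads", "gens", "lines")}


# All labels: reference years 2016 to 2020, indices 1 to 4.
all_labels = [(year, index) for year in range(2016, 2021) for index in range(1, 5)]

files_2020_1 = table_files(2020, 1)
```

Iterating over all_labels and calling table_files for each pair yields the 20 consistent triplets of tables.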

Usage

The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
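
The remark about periodicity can be illustrated with a small sketch that extracts a window wrapping around the end of the year. It works on a plain Python list standing in for one column of a table; the function name is an assumption made for illustration.

```python
SERIES_LENGTH = 24 * 364  # hourly steps in one 52-week year


def periodic_window(series, start, length):
    """Extract `length` steps starting at `start`, wrapping around the end."""
    n = len(series)
    return [series[(start + k) % n] for k in range(length)]


# Stand-in for one column: the value at each step is just its hour index.
series = list(range(SERIES_LENGTH))

# A 48-hour window that starts 24 hours before the end of the year.
window = periodic_window(series, start=SERIES_LENGTH - 24, length=48)
```

The first half of the window comes from the last day of the series and the second half from its first day.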

Selecting a particular country

This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

CH_gens_list = CH_gens.dropna().squeeze().to_list()

Finally, we can import all the time series of Swiss generators from a given data table with:

pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

Averaging over time

This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

hourly_loads = pd.read_csv('loads_2018_3.csv')

To get a daily average of the loads, we can use:

daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

Source code

The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

Funding

This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
