https://creativecommons.org/publicdomain/zero/1.0/
What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze big data and draw conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
What Can Pandas Do?
Pandas gives you answers about the data, like:
Is there a correlation between two or more columns?
What is the average value?
Max value?
Min value?
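A minimal sketch of those operations on a small, made-up DataFrame:

```py
import pandas as pd

# a tiny, made-up data set
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 71, 80],
})

print(df.corr())           # correlation between the two columns
print(df["score"].mean())  # average value
print(df["score"].max())   # max value
print(df["score"].min())   # min value
```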
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
I found two datasets on Hugging Face for converting text with context to pandas code, but the challenge lies in the context: the context format differs between the two datasets, which hurts the model's results. First, let's describe the data I found, then show examples, the solution, and some other problems.
Rahima411/text-to-pandas:
The data is divided into a train split with 57.5k examples and a test split with 19.2k.
The data has two columns as you can see in the example:
Input                                                      | Pandas Query
-----------------------------------------------------------|-------------------------------------------
Table Name: head (age (object), head_id (object))          | result = management['head.age'].unique()
Table Name: management (head_id (object),                  |
temporary_acting (object))                                 |
What are the distinct ages of the heads who are acting?    |

hiltch/pandas-create-context:
question | context | answer
----------------------------------------|--------------------------------------------------------|---------------------------------------
What was the lowest # of total votes? | df = pd.DataFrame(columns=['_number_of_total_votes']) | df['_number_of_total_votes'].min()
As you can see, the problem is that the two datasets do not have similar inputs, and the structure of the context differs between them. My solution to this problem was:
- Convert the context of the first dataset to match the second. I chose this direction because it is difficult to recover the column data types in the second dataset. It was easy to convert the context from `Table Name: head (age (object), head_id (object))` to `head = pd.DataFrame(columns=['age','head_id'])` with the code I wrote below.
- Then separate the question from the context. This was easy because, if you look at the data, the context always ends with ")" followed by a space and then the question. You will find all of this in the code below.
- You will also notice that a context can contain more than one table, and therefore more than one DataFrame creation line; this has been engineered into the code as well.
```py
import re

def extract_table_creation(text: str) -> tuple[str, str]:
    """
    Extracts DataFrame creation statements and the question from the given text.

    Args:
        text (str): The input text containing table definitions and a question.

    Returns:
        tuple: A concatenated DataFrame creation string and the question.
    """
    # Patterns for table definitions and for column name/dtype pairs
    table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
    column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

    # Find all table names and column definitions
    matches = re.findall(table_pattern, text)

    # Build one DataFrame creation statement per table
    df_creations = []
    for table_name, columns_str in matches:
        # Extract column names (dropping the dtype part)
        columns = re.findall(column_pattern, columns_str)
        column_names = [col[0] for col in columns]
        # Format the DataFrame creation statement
        df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
        df_creations.append(df_creation)

    # Concatenate all DataFrame creation statements
    df_creation_concat = '\n'.join(df_creations)

    # Everything after the last closing parenthesis is the question
    question = text[text.rindex(')') + 1:].strip()

    return df_creation_concat, question
```
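A quick check of the function on the first sample shown above (expected output in comments):

```py
sample = (
    "Table Name: head (age (object), head_id (object)) "
    "Table Name: management (head_id (object), temporary_acting (object)) "
    "What are the distinct ages of the heads who are acting?"
)
context, question = extract_table_creation(sample)
print(context)
# head = pd.DataFrame(columns=['age', 'head_id'])
# management = pd.DataFrame(columns=['head_id', 'temporary_acting'])
print(question)
# What are the distinct ages of the heads who are acting?
```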
After both datasets shared the same structure, they were merged into one set and split into _72.8K_ train and _18.6K_ test examples. We analyzed this dataset and you can see the full analysis in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we also found some problems in the dataset, such as:
> - `Answer` : `df['Id'].count()` is repeated many times, but this is a plausible answer, so we do not need to drop these rows.
> - `Context` : it contains `147` rows that do not contain any text. We will see through the experiments whether this affects the results negatively or positively.
> - `Question` : It is ...
Property Based Matching Dataset
This dataset is part of the Deep Principle Bench collection.
Files
property_based_matching.csv: Main dataset file
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("yhqu/property_based_matching")
df = pd.read_csv("hf://datasets/yhqu/property_based_matching/property_based_matching.csv")
Citation
Please cite this work if… See the full description on the dataset page: https://huggingface.co/datasets/yhqu/property_based_matching.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial
The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data
I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing
Once the csv file is uploaded to Google Colab, use these commands to process the file.
import pandas as pd

# load the file and create a pandas dataframe
df = pd.read_csv('/content/NYC_Jobs.csv')

# keep only these columns
df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
         'Job Category', 'Salary Range From', 'Salary Range To']]

# save the csv file without the index column
df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
Descriptor Prediction Dataset
This dataset is part of the Deep Principle Bench collection.
Files
descriptor_prediction.csv: Main dataset file
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("yhqu/descriptor_prediction")
df = pd.read_csv("hf://datasets/yhqu/descriptor_prediction/descriptor_prediction.csv")
Citation
Please cite this work if you use… See the full description on the dataset page: https://huggingface.co/datasets/yhqu/descriptor_prediction.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.
Installation
pip install pandas pyarrow
Example
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset              AudioSet
filename             train/---2_BBVHAA.mp3
captions_visual      [a man in a black hat and glasses.]
captions_auditory    [a man speaks and dishes clank.]
tags                 [Speech]

Description
The annotation file consists of the following fields:
- filename: Name of the corresponding file (video or audio file)
- dataset: Source dataset associated with the data point
- captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
- captions_auditory: A list of captions related to the auditory content of the video
- tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided

Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
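Since captions_visual can be NaN for audio-only files, a minimal sketch for keeping only clips that have visual captions (column names as described above):

```py
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')

# keep only rows that actually have visual captions
with_visual = df[df['captions_visual'].notna()]
print(len(with_visual), "of", len(df), "clips have visual captions")
```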
Crispr Delivery Dataset
This dataset is part of the Deep Principle Bench collection.
Files
crispr_delivery.csv: Main dataset file
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("yhqu/crispr_delivery")
df = pd.read_csv("hf://datasets/yhqu/crispr_delivery/crispr_delivery.csv")
Citation
Please cite this work if you use this dataset in your research.
Artifact Description
Vulnerable Verified Smart Contracts is a dataset of real vulnerable Ethereum smart contracts, based on the manually labeled Benchmark dataset of Solidity smart contracts. A total of 609 vulnerable contracts are provided, containing 1,117 vulnerabilities. The dataset is split into "train", "validation" and "test". Each file is in the Apache Parquet data file format.
Environment Setup
The Pandas library for Python is required to load the dataset. Both Unix-based and Windows systems are supported.
Getting Started
The following code snippet demonstrates how to load the dataset into a Pandas DataFrame.
>>> import pandas as pd
>>> df = pd.read_parquet("path/to/data")
License
All Smart Contracts in the dataset are subject to their own original licenses.
Gene Editing Dataset
This dataset is part of the Deep Principle Bench collection.
Files
gene_editing.csv: Main dataset file
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("yhqu/gene_editing")
df = pd.read_csv("hf://datasets/yhqu/gene_editing/gene_editing.csv")
Citation
Please cite this work if you use this dataset in your research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artifact DescriptionVerified Smart Contracts Code Comments is a dataset of real Ethereum smart contract functions, containing "code, comment" pairs of both Solidity and Vyper source code. The dataset is based on every deployed Ethereum smart contract as of 1st of April 2022, whose been verified on Etherscan and has a least one transaction. A total of 1,541,370 smart contract functions are provided, parsed from 186,397 unique smart contracts, filtered down from 2,217,692 smart contracts.The dataset contains three folders: "train", "validation" and "test". Each folder contains several enumerated files in the Apache Parquet data file format.Environment SetupThe Pandas library for Python is required to load the dataset. Both Unix-based and Windows systems are supported.Getting StartedThe following code snippet demonstrates how to load the dataset into a Pandas DataFrame.>>> import pandas as pd>>> df = pd.read_parquet("path/to/data")LicenseAll Smart Contracts in the dataset are publicly available, obtained by using Etherscan APIs, and subject to their own original licenses.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by pavithra nageswarn
Released under CC0: Public Domain
Protein Localization Dataset
This dataset is part of the Deep Principle Bench collection.
Files
protein_localization.csv: Main dataset file
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("yhqu/protein_localization")
df = pd.read_csv("hf://datasets/yhqu/protein_localization/protein_localization.csv")
Citation
Please cite this work if you use this… See the full description on the dataset page: https://huggingface.co/datasets/yhqu/protein_localization.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku over the three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption, and performance metrics (e.g. #flops, memory bandwidth, operational intensity, and memory/compute-bound label), which allows for predicting a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.
The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load such files as DataFrames by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
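To work with more than one month at a time, a minimal sketch (file names follow the YY_MM.parquet pattern described above):

```py
import pandas as pd

# concatenate a few monthly files into a single DataFrame
months = ["21_01.parquet", "21_02.parquet", "21_03.parquet"]
df = pd.concat([pd.read_parquet(f) for f in months], ignore_index=True)
df.head()
```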
https://creativecommons.org/publicdomain/zero/1.0/
This is a notebook-style learning dataset. Download the whole package and you will find everything you need to learn pandas from the basics to advanced topics, which is exactly what you will need in machine learning and data science. 😄
It gives you an overview of the data analysis tools in pandas that are most often required for manipulating data and extracting the important parts.
Use this notebook as notes for pandas: whenever you forget the code or syntax, open it, scroll through, and you will find the solution. 🥳
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Motivation
The Dataset for Unmanned Aircraft System (UAS) Cellular Communications, short DUCC, was created with the aim of advancing communications for Beyond Visual Line of Sight (BVLOS) operations. With this objective in mind, datasets were generated to analyse the behaviour of cellular communications for UAS operations.
Measurement
A measurement setup was implemented to execute the measurements. Two Sierra Wireless EM9191 modems possessing both LTE and 5G capabilities were utilized in order to establish a connection to the cellular network and measure the physical parameters of the air-link. Every modem was equipped with four Taoglas antennas, two of type TG 35.8113 and two of type TG 45.8113. To capture the measurements a Raspberry Pi 4B is used. All hardware components were integrated into a box and attached to a DJI Matrice 300 RTK. A connection to the drone controller has been established to obtain location, speed and attitude. To measure end-to-end network parameters, dummy data was exchanged bidirectionally between the Raspberry Pi and a server. Both the server as well as the Raspberry Pi are synchronized with GPS time in order to measure the one-way packet delay. For this purpose, we utilised Iperf3 and customised it to suit our requirements. To ensure precise positioning of the drone, a Real Time Kinematic (RTK) station was placed on the ground during the measurements.
The measurements were performed at three distinct rural locations. Waypoint flights were undertaken with the points arranged in a cuboid formation maximizing the coverage of the air volume. Thereby, the campaigns were conducted with varying drone speeds. Moreover, for location A, different flight routes with rotated grids were implemented to reduce bias. Finally, a validation dataset is provided for location A, where the waypoints were calculated according to Quality of Service (QoS) based path-planning.
Dataset Structure and Usage
The dataset's structure consists of:
-- Dataset
   |-- LocationX
       |-- RouteX (in case different routes at LocationX were created)
           |-- LocXRouteX.kml (file containing the waypoints in the kml format)
           |-- SpeedXMeterPerSecond (folder containing the datasets recorded with a specific drone speed)
               |-- YYYY-MM-DD hh_mm_ss.s.pkl.gz (Dataset file)
       |-- RouteY
           |-- ...
   |-- ...
The dataset files can be loaded using the pandas module in Python 3. The file "load.py" provides a sample script for loading a dataset as well as the corresponding .kml file, which contains the predefined waypoints. In the file "Parameter_Description.csv" each measured parameter is further explained.
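For example, a single dataset file can be loaded directly with pandas; the path below is only a placeholder following the structure above:

```py
import pandas as pd

# pandas infers the gzip compression from the .pkl.gz extension
df = pd.read_pickle("Dataset/LocationA/RouteA/Speed5MeterPerSecond/2022-01-01 12_00_00.0.pkl.gz")
df.head()
```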
License
All datasets are copyright by us and published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. This means that you must attribute the work in the manner specified by the authors, you may not use this work for commercial purposes and if you alter, transform, or build upon this work, you may distribute the resulting work only under the same license. This dataset is made available for academic use only. However, we take your privacy seriously! If you find yourself or personal belongings in this dataset and feel unwell about it, please contact us at automotive@oth-aw.de and we will immediately remove the respective data from our server.
Acknowledgement
The authors gratefully acknowledge the following European Union H2020 -- ECSEL Joint Undertaking project for financial support, including funding by the German Federal Ministry for Education and Research (BMBF): ADACORSA (Grant Agreement No. 876019, funding code 16MEE0039).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
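A minimal sketch (the file name is a placeholder for one of the provided CSV files):

```py
import pandas as pd

# read one of the daily-granularity CSV files into a DataFrame
daily_df = pd.read_csv("daily_fitbit.csv")
daily_df.head()
```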
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data by importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
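For example, for the fitbit collection (with placeholder credentials):
mongorestore --host localhost:27017 --username <your_username> --password <your_password> -d rais_anonymized -c fitbit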
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{
_id:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Giant pandas are the flagship species in world conservation. Due to bamboo being the primary food source for giant pandas, dental wear is common owing to the extreme toughness of the bamboo fiber. Even though research on tooth enamel wear in humans and domestic animals is well-established, research on tooth enamel wear in giant pandas is scarce. The purpose of this study is to evaluate tooth enamel wear resistance in giant pandas to provide a basis for a better understanding of their evolutionary process. From microscopic and macroscopic perspectives, the abrasion resistance of dental enamel in giant pandas is compared with that of herbivorous cattle and carnivorous dogs in this study. This involves the use of micro-scratch and frictional wear tests. The results show that the boundary between the enamel prism and the enamel prism stroma is well-defined in panda and canine teeth, while bovine tooth enamel appears denser. Under constant load, the tribological properties of giant panda enamel are similar to those of canines and significantly different from those of bovines. Test results show that the depth of micro scratches in giant panda and canine enamel was greater than in cattle, with greater elastic recovery occurring in dogs. Scratch morphology indicates that the enamel substantive damage critical value is greater in pandas than in both dogs and cattle. The analysis suggests that giant panda enamel consists of a neatly arranged special structure that may disperse extrusion stress and absorb impact energy through a series of inelastic deformation mechanisms to cope with the wear caused by eating bamboo. In this study, the excellent wear resistance of giant panda's tooth enamel is verified by wear tests. A possible theoretical explanation of how the special structure of giant panda tooth enamel may improve its wear resistance is provided. This provides a direction for subsequent theoretical and experimental studies on giant panda tooth enamel and its biomaterials.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HealthE contains 3,400 pieces of health advice gathered 1) from public health websites (i.e. WebMD.com, MedlinePlus.gov, CDC.gov, and MayoClinic.org) 2) from the publicly available Preclude dataset. Each sample was hand-labeled for health entity recognition by a team of 14 annotators at the author's institution. Automatic recognition of health entities will enable further research in large-scale modeling of texts from online health communities.
The data is provided in two parts. Both are formatted using the popular, free python pickle library and require use of the popular, free pandas library.
healthe.pkl is a pandas.DataFrame object containing the 3,400 health-advice statement with hand-labeled health entities.
non_advice.pkl is a pandas.DataFrame object containing the 2,256 pieces of non-advice statements.
To load the files in python, use the following code block.
import pickle
import pandas as pd
healthe_df = pd.read_pickle('healthe.pkl')
non_advice_df = pd.read_pickle('non_advice.pkl')
healthe_df has four columns.
* text contains the health advice statement text
* entities contains a python list of (entity, class) tuples
* tokenized_text contains a list of tokens obtained by tokenizing the health advice statement text
* labels contains a list of the same length as tokenized_text, where each token is mapped to a class label.
non_advice_df has one column, text, referring to each non-health-advice-statement.
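As a quick illustration of the column layout described above (a minimal sketch, assuming the files load as documented):

```py
import pandas as pd

healthe_df = pd.read_pickle('healthe.pkl')

# inspect the first health-advice statement
row = healthe_df.iloc[0]
print(row['text'])
print(row['entities'])  # list of (entity, class) tuples

# pair each token with its class label
for token, label in zip(row['tokenized_text'], row['labels']):
    print(token, label)
```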
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides mitosis detection results employing the "Mitosis Detection, Fast and Slow" (MDFS) algorithm [[2208.12587] Mitosis Detection, Fast and Slow: Robust and Efficient Detection of Mitotic Figures (arxiv.org)] on the TCGA-BRCA dataset.
The MDFS algorithm exemplifies a robust and efficient two-stage process for mitosis detection. Initially, potential mitotic figures are identified and later refined. The proposed model for the preliminary identification of candidates, the EUNet, stands out for its swift and accurate performance, largely due to its structural design. EUNet operates by outlining candidate areas at a lower resolution, significantly expediting the detection process. In the second phase, the initially identified candidates undergo further refinement using a more intricate classifier network, namely the EfficientNet-B7. The MDFS algorithm was originally developed for the MIDOG challenges.
Viewing in QuPath
The dataset at hand comprises GeoJSON files in two categories: mitosis and proxy (mimicker -- the candidates that are unlikely to be mitosis based on our algorithm). Users can open and visualize each category overlaid on the Whole Slide Image (WSI) using QuPath. Simply drag and drop the annotation file onto the opened image in the program. Additionally, users can employ the provided Python snippet to read the annotation into a Python dictionary or a Numpy array.
Loading in Python
To load the GeoJSON files in Python, users can use the following code:
import json
import numpy as np
import pandas as pd
def load_geojson(filename):
    # Load the GeoJSON file
    with open(filename, 'r') as f:
        data = json.load(f)
    # Extract the slide-level properties dictionary
    slide_properties = data["properties"]
    # Convert the detection points to a numpy array of (x, y, score)
    points_np = np.array([(feat['geometry']['coordinates'][0],
                           feat['geometry']['coordinates'][1],
                           feat['properties']['score'])
                          for feat in data['features']])
    # Convert the points to a pandas DataFrame
    points_df = pd.DataFrame(points_np, columns=['x', 'y', 'score'])
    return slide_properties, points_np, points_df
mitosis_properties, mitosis_points_np, mitosis_points_df = load_geojson('mitosis.geojson')
mimickers_properties, mimickers_points_np, mimickers_points_df = load_geojson('mimickers.geojson')
Properties
Each WSI in the dataset includes the candidate's centroid, bounding box, hotspot location, hotspot mitotic count, and hotspot mitotic score. The structures of the mitosis and mimicker property dictionaries are as follows:
Mitosis property dictionary structure:
mitosis_properties = {
'slide_id': slide_id,
'slide_height': img_h,
'slide_width': img_w,
'wsi_mitosis_count': num_mitosis,
'mitosis_threshold': 0.5,
'hotspot_rect': {'x1': hotspot[0], 'y1': hotspot[1], 'x2': hotspot[2], 'y2': hotspot[3]},
'hotspot_mitosis_count': mitosis_count,
'hotspot_mitosis_score': mitosis_score,
}
Proxy figure (mimicker) property dictionary structure:
mimicker_properties = {
'slide_id': slide_id,
'slide_height': img_h,
'slide_width': img_w,
'wsi_mimicker_count': num_mimicker,
'mitosis_threshold': 0.5,
}
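For instance, once the mitosis GeoJSON has been loaded with the helper above, the hotspot region and detection scores can be read straight from these properties (a minimal sketch):

```py
props, _, points_df = load_geojson('mitosis.geojson')

# hotspot bounding box and its mitotic activity
rect = props['hotspot_rect']
print(rect['x1'], rect['y1'], rect['x2'], rect['y2'])
print(props['hotspot_mitosis_count'], props['hotspot_mitosis_score'])

# keep only detections above the reported threshold
confident = points_df[points_df['score'] >= props['mitosis_threshold']]
```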
Disclaimer:
It should be noted that we did not conduct a comprehensive review of all mitotic figures within each WSI, and we do not purport these to be free of errors. Nonetheless, a pathologist examined the resultant hotspot regions of interest from 757 WSIs within the TCGA-BRCA Mitosis Dataset, where we found strong correlations between pathologist and MDFS mitotic counts (r=0.8, p<0.001). Furthermore, MDFS-derived mitosis scores are shown to be as prognostic as pathologist-assigned mitosis scores [1]. This examination was also aimed at verifying the quality of the selections, ensuring excessive false detections or artifacts did not primarily drive them and were in a plausible location in the tumor landscape.
[1] Ibrahim, Asmaa, et al. "Artificial Intelligence-Based Mitosis Scoring in Breast Cancer: Clinical Application." Modern Pathology 37.3 (2024): 100416.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The glassDef dataset contains a set of text-based LAMMPS dump files corresponding to shear deformation tests on different bulk metallic glasses. This includes FeNi, CoNiFe, CoNiCrFe, CoCrFeMn, CoNiCrFeMn, and Co5Cr2Fe40Mn27Ni26 amorphous alloys with data files that exist in relevant subdirectories. Each dump file corresponds to multiple realizations and includes the dimensions of the simulation box as well as atom coordinates, the atom ID, and associated type of nearly 50,000 atoms.
Load glassDef Dataset in Python
The glassDef dataset may be loaded in Python into a Pandas DataFrame. To go into the relevant subdirectory, run cd glass{glass_name}/Run[0-3]/, where "glass_name" denotes the chemical composition. Each subdirectory contains at least three glass realizations within subfolders that are labeled as "Run[0-3]".
cd glassFeNi/Run0; python
import pandas
df = pandas.read_csv("FeNi_glass.dump",skiprows=9)
One may display an assigned DataFrame in the form of a table:
df.head()
To learn more about further analyses performed on the loaded data, please refer to the paper cited below.
glassDef Dataset Structure
glassDef Data Fields
Dump files: “id”, “type”, “x”, “y”, “z”.
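Since the dump files are whitespace-separated, a sketch that assigns the field names explicitly (the exact column order is an assumption; adjust names and skiprows to the actual file header):

```py
import pandas as pd

# read a LAMMPS dump, skipping the 9 header lines and naming the fields
df = pd.read_csv(
    "FeNi_glass.dump",
    skiprows=9,
    sep=r"\s+",
    names=["id", "type", "x", "y", "z"],
)
df.head()
```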
glassDef Dataset Description
Paper: Karimi, Kamran, Amin Esfandiarpour, René Alvarez-Donado, Mikko J. Alava, and Stefanos Papanikolaou. "Shear banding instability in multicomponent metallic glasses: Interplay of composition and short-range order." Physical Review B 105, no. 9 (2022): 094117.
Contact: kamran.karimi@ncbj.gov.pl