https://creativecommons.org/publicdomain/zero/1.0/
What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze big data and draw conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
What Can Pandas Do?
Pandas gives you answers about the data, like:
Is there a correlation between two or more columns?
What is the average value?
Max value?
Min value?
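A minimal sketch of those operations on a small, made-up DataFrame:

```py
import pandas as pd

# a tiny, made-up data set
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 71, 80],
})

print(df.corr())           # correlation between the two columns
print(df["score"].mean())  # average value
print(df["score"].max())   # max value
print(df["score"].min())   # min value
```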
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
I found two datasets on Hugging Face for converting text with context to pandas code, but the challenge lies in the context: the context format differs between the two datasets, which hurts the model's results. First, let's describe the data I found, then show examples, the solution, and some other problems.
Rahima411/text-to-pandas:
The data is divided into a train split with 57.5k examples and a test split with 19.2k.
The data has two columns as you can see in the example:
Input                                                      | Pandas Query
-----------------------------------------------------------|-------------------------------------------
Table Name: head (age (object), head_id (object))          | result = management['head.age'].unique()
Table Name: management (head_id (object),                  |
temporary_acting (object))                                 |
What are the distinct ages of the heads who are acting?    |

hiltch/pandas-create-context:
question | context | answer
----------------------------------------|--------------------------------------------------------|---------------------------------------
What was the lowest # of total votes? | df = pd.DataFrame(columns=['_number_of_total_votes']) | df['_number_of_total_votes'].min()
As you can see, the problem is that the two datasets do not have similar inputs, and the structure of the context differs between them. My solution to this problem was:
- Convert the context of the first dataset to match the second. I chose this direction because it is difficult to recover the column data types in the second dataset. It was easy to convert the context from `Table Name: head (age (object), head_id (object))` to `head = pd.DataFrame(columns=['age','head_id'])` with the code I wrote below.
- Then separate the question from the context. This was easy because, if you look at the data, the context always ends with ")" followed by a space and then the question. You will find all of this in the code below.
- You will also notice that a context can contain more than one table, and therefore more than one DataFrame creation line; this has been engineered into the code as well.
```py
import re

def extract_table_creation(text: str) -> tuple[str, str]:
    """
    Extracts DataFrame creation statements and the question from the given text.

    Args:
        text (str): The input text containing table definitions and a question.

    Returns:
        tuple: A concatenated DataFrame creation string and the question.
    """
    # Patterns for table definitions and for column name/dtype pairs
    table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
    column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

    # Find all table names and column definitions
    matches = re.findall(table_pattern, text)

    # Build one DataFrame creation statement per table
    df_creations = []
    for table_name, columns_str in matches:
        # Extract column names (dropping the dtype part)
        columns = re.findall(column_pattern, columns_str)
        column_names = [col[0] for col in columns]
        # Format the DataFrame creation statement
        df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
        df_creations.append(df_creation)

    # Concatenate all DataFrame creation statements
    df_creation_concat = '\n'.join(df_creations)

    # Everything after the last closing parenthesis is the question
    question = text[text.rindex(')') + 1:].strip()

    return df_creation_concat, question
```
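A quick check of the function on the first sample shown above (expected output in comments):

```py
sample = (
    "Table Name: head (age (object), head_id (object)) "
    "Table Name: management (head_id (object), temporary_acting (object)) "
    "What are the distinct ages of the heads who are acting?"
)
context, question = extract_table_creation(sample)
print(context)
# head = pd.DataFrame(columns=['age', 'head_id'])
# management = pd.DataFrame(columns=['head_id', 'temporary_acting'])
print(question)
# What are the distinct ages of the heads who are acting?
```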
After both datasets shared the same structure, they were merged into one set and split into _72.8K_ train and _18.6K_ test examples. We analyzed this dataset and you can see the full analysis in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we also found some problems in the dataset, such as:
> - `Answer` : `df['Id'].count()` is repeated many times, but this is a plausible answer, so we do not need to drop these rows.
> - `Context` : it contains `147` rows that do not contain any text. We will see through the experiments whether this affects the results negatively or positively.
> - `Question` : It is ...
Property Based Matching Dataset
This dataset is part of the Deep Principle Bench collection.
Files
property_based_matching.csv: Main dataset file
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("yhqu/property_based_matching")
df = pd.read_csv("hf://datasets/yhqu/property_based_matching/property_based_matching.csv")
Citation
Please cite this work if… See the full description on the dataset page: https://huggingface.co/datasets/yhqu/property_based_matching.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial
The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data
I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing
Once the csv file is uploaded to Google Colab, use these commands to process the file.
import pandas as pd

# load the file and create a pandas dataframe
df = pd.read_csv('/content/NYC_Jobs.csv')

# keep only these columns
df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
         'Job Category', 'Salary Range From', 'Salary Range To']]

# save the csv file without the index column
df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
Descriptor Prediction Dataset
This dataset is part of the Deep Principle Bench collection.
Files
descriptor_prediction.csv: Main dataset file
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("yhqu/descriptor_prediction")
df = pd.read_csv("hf://datasets/yhqu/descriptor_prediction/descriptor_prediction.csv")
Citation
Please cite this work if you use… See the full description on the dataset page: https://huggingface.co/datasets/yhqu/descriptor_prediction.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.
Installation
pip install pandas pyarrow
Example
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset              AudioSet
filename             train/---2_BBVHAA.mp3
captions_visual      [a man in a black hat and glasses.]
captions_auditory    [a man speaks and dishes clank.]
tags                 [Speech]

Description
The annotation file consists of the following fields:
- filename: Name of the corresponding file (video or audio file)
- dataset: Source dataset associated with the data point
- captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
- captions_auditory: A list of captions related to the auditory content of the video
- tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided

Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
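Since captions_visual can be NaN for audio-only files, a minimal sketch for keeping only clips that have visual captions (column names as described above):

```py
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')

# keep only rows that actually have visual captions
with_visual = df[df['captions_visual'].notna()]
print(len(with_visual), "of", len(df), "clips have visual captions")
```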
Crispr Delivery Dataset
This dataset is part of the Deep Principle Bench collection.
Files
crispr_delivery.csv: Main dataset file
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("yhqu/crispr_delivery")
df = pd.read_csv("hf://datasets/yhqu/crispr_delivery/crispr_delivery.csv")
Citation
Please cite this work if you use this dataset in your research.
Artifact Description
Vulnerable Verified Smart Contracts is a dataset of real vulnerable Ethereum smart contracts, based on the manually labeled Benchmark dataset of Solidity smart contracts. A total of 609 vulnerable contracts are provided, containing 1,117 vulnerabilities. The dataset is split into "train", "validation" and "test". Each file is in the Apache Parquet data file format.
Environment Setup
The Pandas library for Python is required to load the dataset. Both Unix-based and Windows systems are supported.
Getting Started
The following code snippet demonstrates how to load the dataset into a Pandas DataFrame.
>>> import pandas as pd
>>> df = pd.read_parquet("path/to/data")
License
All Smart Contracts in the dataset are subject to their own original licenses.
Gene Editing Dataset
This dataset is part of the Deep Principle Bench collection.
Files
gene_editing.csv: Main dataset file
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("yhqu/gene_editing")
df = pd.read_csv("hf://datasets/yhqu/gene_editing/gene_editing.csv")
Citation
Please cite this work if you use this dataset in your research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artifact DescriptionVerified Smart Contracts Code Comments is a dataset of real Ethereum smart contract functions, containing "code, comment" pairs of both Solidity and Vyper source code. The dataset is based on every deployed Ethereum smart contract as of 1st of April 2022, whose been verified on Etherscan and has a least one transaction. A total of 1,541,370 smart contract functions are provided, parsed from 186,397 unique smart contracts, filtered down from 2,217,692 smart contracts.The dataset contains three folders: "train", "validation" and "test". Each folder contains several enumerated files in the Apache Parquet data file format.Environment SetupThe Pandas library for Python is required to load the dataset. Both Unix-based and Windows systems are supported.Getting StartedThe following code snippet demonstrates how to load the dataset into a Pandas DataFrame.>>> import pandas as pd>>> df = pd.read_parquet("path/to/data")LicenseAll Smart Contracts in the dataset are publicly available, obtained by using Etherscan APIs, and subject to their own original licenses.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by pavithra nageswarn
Released under CC0: Public Domain
Protein Localization Dataset
This dataset is part of the Deep Principle Bench collection.
Files
protein_localization.csv: Main dataset file
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("yhqu/protein_localization")
df = pd.read_csv("hf://datasets/yhqu/protein_localization/protein_localization.csv")
Citation
Please cite this work if you use this… See the full description on the dataset page: https://huggingface.co/datasets/yhqu/protein_localization.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku over the three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption, and performance metrics (e.g. #flops, memory bandwidth, operational intensity, and memory/compute-bound label), which allows for predicting a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.
The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load such files as DataFrames by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
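To work with more than one month at a time, a minimal sketch (file names follow the YY_MM.parquet pattern described above):

```py
import pandas as pd

# concatenate a few monthly files into a single DataFrame
months = ["21_01.parquet", "21_02.parquet", "21_03.parquet"]
df = pd.concat([pd.read_parquet(f) for f in months], ignore_index=True)
df.head()
```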
https://creativecommons.org/publicdomain/zero/1.0/
This is a notebook-style learning dataset. Download the whole package and you will find everything you need to learn pandas from the basics to advanced topics, which is exactly what you will need in machine learning and data science. 😄
It gives you an overview of the data analysis tools in pandas that are most often required for manipulating data and extracting the important parts.
Use this notebook as notes for pandas: whenever you forget the code or syntax, open it, scroll through, and you will find the solution. 🥳
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Motivation
The Dataset for Unmanned Aircraft System (UAS) Cellular Communications, short DUCC, was created with the aim of advancing communications for Beyond Visual Line of Sight (BVLOS) operations. With this objective in mind, datasets were generated to analyse the behaviour of cellular communications for UAS operations.
Measurement
A measurement setup was implemented to execute the measurements. Two Sierra Wireless EM9191 modems possessing both LTE and 5G capabilities were utilized in order to establish a connection to the cellular network and measure the physical parameters of the air-link. Every modem was equipped with four Taoglas antennas, two of type TG 35.8113 and two of type TG 45.8113. To capture the measurements a Raspberry Pi 4B is used. All hardware components were integrated into a box and attached to a DJI Matrice 300 RTK. A connection to the drone controller has been established to obtain location, speed and attitude. To measure end-to-end network parameters, dummy data was exchanged bidirectionally between the Raspberry Pi and a server. Both the server as well as the Raspberry Pi are synchronized with GPS time in order to measure the one-way packet delay. For this purpose, we utilised Iperf3 and customised it to suit our requirements. To ensure precise positioning of the drone, a Real Time Kinematic (RTK) station was placed on the ground during the measurements.
The measurements were performed at three distinct rural locations. Waypoint flights were undertaken with the points arranged in a cuboid formation maximizing the coverage of the air volume. Thereby, the campaigns were conducted with varying drone speeds. Moreover, for location A, different flight routes with rotated grids were implemented to reduce bias. Finally, a validation dataset is provided for location A, where the waypoints were calculated according to Quality of Service (QoS) based path-planning.
Dataset Structure and Usage
The dataset's structure consists of:
-- Dataset
   |-- LocationX
       |-- RouteX (in case different routes at LocationX were created)
           |-- LocXRouteX.kml (file containing the waypoints in the kml format)
           |-- SpeedXMeterPerSecond (folder containing the datasets recorded with a specific drone speed)
               |-- YYYY-MM-DD hh_mm_ss.s.pkl.gz (Dataset file)
       |-- RouteY
           |-- ...
   |-- ...
The dataset files can be loaded using the pandas module in Python 3. The file "load.py" provides a sample script for loading a dataset as well as the corresponding .kml file, which contains the predefined waypoints. In the file "Parameter_Description.csv" each measured parameter is further explained.
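For example, a single dataset file can be loaded directly with pandas; the path below is only a placeholder following the structure above:

```py
import pandas as pd

# pandas infers the gzip compression from the .pkl.gz extension
df = pd.read_pickle("Dataset/LocationA/RouteA/Speed5MeterPerSecond/2022-01-01 12_00_00.0.pkl.gz")
df.head()
```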
License
All datasets are copyright by us and published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. This means that you must attribute the work in the manner specified by the authors, you may not use this work for commercial purposes and if you alter, transform, or build upon this work, you may distribute the resulting work only under the same license. This dataset is made available for academic use only. However, we take your privacy seriously! If you find yourself or personal belongings in this dataset and feel unwell about it, please contact us at automotive@oth-aw.de and we will immediately remove the respective data from our server.
Acknowledgement
The authors gratefully acknowledge the following European Union H2020 -- ECSEL Joint Undertaking project for financial support, including funding by the German Federal Ministry for Education and Research (BMBF): ADACORSA (Grant Agreement No. 876019, funding code 16MEE0039).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
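A minimal sketch (the file name is a placeholder for one of the provided CSV files):

```py
import pandas as pd

# read one of the daily-granularity CSV files into a DataFrame
daily_df = pd.read_csv("daily_fitbit.csv")
daily_df.head()
```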
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data by importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
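For example, for the fitbit collection (with placeholder credentials):
mongorestore --host localhost:27017 --username <your_username> --password <your_password> -d rais_anonymized -c fitbit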
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{
_id:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Giant pandas are the flagship species in world conservation. Due to bamboo being the primary food source for giant pandas, dental wear is common owing to the extreme toughness of the bamboo fiber. Even though research on tooth enamel wear in humans and domestic animals is well-established, research on tooth enamel wear in giant pandas is scarce. The purpose of this study is to evaluate tooth enamel wear resistance in giant pandas to provide a basis for a better understanding of their evolutionary process. From microscopic and macroscopic perspectives, the abrasion resistance of dental enamel in giant pandas is compared with that of herbivorous cattle and carnivorous dogs in this study. This involves the use of micro-scratch and frictional wear tests. The results show that the boundary between the enamel prism and the enamel prism stroma is well-defined in panda and canine teeth, while bovine tooth enamel appears denser. Under constant load, the tribological properties of giant panda enamel are similar to those of canines and significantly different from those of bovines. Test results show that the depth of micro scratches in giant panda and canine enamel was greater than in cattle, with greater elastic recovery occurring in dogs. Scratch morphology indicates that the enamel substantive damage critical value is greater in pandas than in both dogs and cattle. The analysis suggests that giant panda enamel consists of a neatly arranged special structure that may disperse extrusion stress and absorb impact energy through a series of inelastic deformation mechanisms to cope with the wear caused by eating bamboo. In this study, the excellent wear resistance of giant panda's tooth enamel is verified by wear tests. A possible theoretical explanation of how the special structure of giant panda tooth enamel may improve its wear resistance is provided. This provides a direction for subsequent theoretical and experimental studies on giant panda tooth enamel and its biomaterials.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HealthE contains 3,400 pieces of health advice gathered 1) from public health websites (i.e. WebMD.com, MedlinePlus.gov, CDC.gov, and MayoClinic.org) 2) from the publicly available Preclude dataset. Each sample was hand-labeled for health entity recognition by a team of 14 annotators at the author's institution. Automatic recognition of health entities will enable further research in large-scale modeling of texts from online health communities.
The data is provided in two parts. Both are formatted using the popular, free python pickle library and require use of the popular, free pandas library.
healthe.pkl is a pandas.DataFrame object containing the 3,400 health-advice statement with hand-labeled health entities.
non_advice.pkl is a pandas.DataFrame object containing the 2,256 pieces of non-advice statements.
To load the files in python, use the following code block.
import pickle
import pandas as pd
healthe_df = pd.read_pickle('healthe.pkl')
non_advice_df = pd.read_pickle('non_advice.pkl')
healthe_df has four columns.
* text contains the health advice statement text
* entities contains a python list of (entity, class) tuples
* tokenized_text contains a list of tokens obtained by tokenizing the health advice statement text
* labels contains a list of the same length as tokenized_text, where each token is mapped to a class label.
non_advice_df has one column, text, referring to each non-health-advice-statement.
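As a quick illustration of the column layout described above (a minimal sketch, assuming the files load as documented):

```py
import pandas as pd

healthe_df = pd.read_pickle('healthe.pkl')

# inspect the first health-advice statement
row = healthe_df.iloc[0]
print(row['text'])
print(row['entities'])  # list of (entity, class) tuples

# pair each token with its class label
for token, label in zip(row['tokenized_text'], row['labels']):
    print(token, label)
```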
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides mitosis detection results employing the "Mitosis Detection, Fast and Slow" (MDFS) algorithm [[2208.12587] Mitosis Detection, Fast and Slow: Robust and Efficient Detection of Mitotic Figures (arxiv.org)] on the TCGA-BRCA dataset.
The MDFS algorithm exemplifies a robust and efficient two-stage process for mitosis detection. Initially, potential mitotic figures are identified and later refined. The proposed model for the preliminary identification of candidates, the EUNet, stands out for its swift and accurate performance, largely due to its structural design. EUNet operates by outlining candidate areas at a lower resolution, significantly expediting the detection process. In the second phase, the initially identified candidates undergo further refinement using a more intricate classifier network, namely the EfficientNet-B7. The MDFS algorithm was originally developed for the MIDOG challenges.
Viewing in QuPath
The dataset at hand comprises GeoJSON files in two categories: mitosis and proxy (mimicker -- the candidates that are unlikely to be mitosis based on our algorithm). Users can open and visualize each category overlaid on the Whole Slide Image (WSI) using QuPath. Simply drag and drop the annotation file onto the opened image in the program. Additionally, users can employ the provided Python snippet to read the annotation into a Python dictionary or a Numpy array.
Loading in Python
To load the GeoJSON files in Python, users can use the following code:
import json
import numpy as np
import pandas as pd
def load_geojson(filename):
    # Load the GeoJSON file
    with open(filename, 'r') as f:
        data = json.load(f)
    # Extract the slide-level properties dictionary
    slide_properties = data["properties"]
    # Convert the detection points to a numpy array of (x, y, score)
    points_np = np.array([(feat['geometry']['coordinates'][0],
                           feat['geometry']['coordinates'][1],
                           feat['properties']['score'])
                          for feat in data['features']])
    # Convert the points to a pandas DataFrame
    points_df = pd.DataFrame(points_np, columns=['x', 'y', 'score'])
    return slide_properties, points_np, points_df
mitosis_properties, mitosis_points_np, mitosis_points_df = load_geojson('mitosis.geojson')
mimickers_properties, mimickers_points_np, mimickers_points_df = load_geojson('mimickers.geojson')
Properties
Each WSI in the dataset includes the candidate's centroid, bounding box, hotspot location, hotspot mitotic count, and hotspot mitotic score. The structures of the mitosis and mimicker property dictionaries are as follows:
Mitosis property dictionary structure:
mitosis_properties = {
'slide_id': slide_id,
'slide_height': img_h,
'slide_width': img_w,
'wsi_mitosis_count': num_mitosis,
'mitosis_threshold': 0.5,
'hotspot_rect': {'x1': hotspot[0], 'y1': hotspot[1], 'x2': hotspot[2], 'y2': hotspot[3]},
'hotspot_mitosis_count': mitosis_count,
'hotspot_mitosis_score': mitosis_score,
}
Proxy figure (mimicker) property dictionary structure:
mimicker_properties = {
'slide_id': slide_id,
'slide_height': img_h,
'slide_width': img_w,
'wsi_mimicker_count': num_mimicker,
'mitosis_threshold': 0.5,
}
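For instance, once the mitosis GeoJSON has been loaded with the helper above, the hotspot region and detection scores can be read straight from these properties (a minimal sketch):

```py
props, _, points_df = load_geojson('mitosis.geojson')

# hotspot bounding box and its mitotic activity
rect = props['hotspot_rect']
print(rect['x1'], rect['y1'], rect['x2'], rect['y2'])
print(props['hotspot_mitosis_count'], props['hotspot_mitosis_score'])

# keep only detections above the reported threshold
confident = points_df[points_df['score'] >= props['mitosis_threshold']]
```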
Disclaimer:
It should be noted that we did not conduct a comprehensive review of all mitotic figures within each WSI, and we do not purport these to be free of errors. Nonetheless, a pathologist examined the resultant hotspot regions of interest from 757 WSIs within the TCGA-BRCA Mitosis Dataset, where we found strong correlations between pathologist and MDFS mitotic counts (r=0.8, p<0.001). Furthermore, MDFS-derived mitosis scores are shown to be as prognostic as pathologist-assigned mitosis scores [1]. This examination was also aimed at verifying the quality of the selections, ensuring excessive false detections or artifacts did not primarily drive them and were in a plausible location in the tumor landscape.
[1] Ibrahim, Asmaa, et al. "Artificial Intelligence-Based Mitosis Scoring in Breast Cancer: Clinical Application." Modern Pathology 37.3 (2024): 100416.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The glassDef dataset contains a set of text-based LAMMPS dump files corresponding to shear deformation tests on different bulk metallic glasses. This includes FeNi, CoNiFe, CoNiCrFe, CoCrFeMn, CoNiCrFeMn, and Co5Cr2Fe40Mn27Ni26 amorphous alloys with data files that exist in relevant subdirectories. Each dump file corresponds to multiple realizations and includes the dimensions of the simulation box as well as atom coordinates, the atom ID, and associated type of nearly 50,000 atoms.
Load glassDef Dataset in Python
The glassDef dataset may be loaded in Python into a Pandas DataFrame. To go into the relevant subdirectory, run cd glass{glass_name}/Run[0-3]/, where "glass_name" denotes the chemical composition. Each subdirectory contains at least three glass realizations within subfolders that are labeled as "Run[0-3]".
cd glassFeNi/Run0; python
import pandas
df = pandas.read_csv("FeNi_glass.dump",skiprows=9)
One may display an assigned DataFrame in the form of a table:
df.head()
To learn more about further analyses performed on the loaded data, please refer to the paper cited below.
glassDef Dataset Structure
glassDef Data Fields
Dump files: “id”, “type”, “x”, “y”, “z”.
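Since the dump files are whitespace-separated, a sketch that assigns the field names explicitly (the exact column order is an assumption; adjust names and skiprows to the actual file header):

```py
import pandas as pd

# read a LAMMPS dump, skipping the 9 header lines and naming the fields
df = pd.read_csv(
    "FeNi_glass.dump",
    skiprows=9,
    sep=r"\s+",
    names=["id", "type", "x", "y", "z"],
)
df.head()
```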
glassDef Dataset Description
Paper: Karimi, Kamran, Amin Esfandiarpour, René Alvarez-Donado, Mikko J. Alava, and Stefanos Papanikolaou. "Shear banding instability in multicomponent metallic glasses: Interplay of composition and short-range order." Physical Review B 105, no. 9 (2022): 094117.
Contact: kamran.karimi@ncbj.gov.pl