Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Libraries Import:
Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

Data Loading and Exploration:
Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().

Univariate Analysis:
Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

Bivariate Analysis:
Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

Gender-Based Analysis:
Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

Univariate Clustering:
Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.

Bivariate Clustering:
Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

Multivariate Clustering:
Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.

Result Saving:
Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
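A minimal sketch of the bivariate clustering step described above (the column names follow the standard Mall_Customers.csv schema; the random_state value and the 1 to 10 range for the elbow plot are illustrative choices, not taken from the original notebook):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")

# Bivariate clustering on income and spending score with 5 clusters
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]
df["Spending and Income Cluster"] = KMeans(n_clusters=5, random_state=42).fit_predict(X)

# Elbow method: inertia for k = 1..10
inertia = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), inertia, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("inertia")
plt.show()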
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
https://creativecommons.org/publicdomain/zero/1.0/
Australia
-> Aus200
Brazil
-> Bra50 and MinDol
Spain
-> Esp35
France
-> Fra40
Germany
-> Ger40
Hong Kong
-> HkInd
Italy
-> Ita40
Netherlands
-> Neth25
Switzerland
-> Swi20
United Kingdom
-> UK100
United States
-> Usa500, UsaTec and UsaRus
Note: the MinDol, Swi20 and Neth25 data were taken from their monthly contracts, because MetaTrader 5 does not provide their continuous historical series (unlike the S&P 500, which has both 'Usa500' and 'Usa500Mar24'):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F17272056%2Fefa5c9f6d7841c496d20d467d4a1c874%2Ffutures_dailycontract.png?generation=1704756245532483&alt=media
import MetaTrader5 as mt5
import pandas as pd
import numpy as np
import pytz
from datetime import datetime

# you can pass your login, server and password to mt5.initialize() if you have an account on a broker that supports MT5
if not mt5.initialize():
    print("initialize() failed, error code =", mt5.last_error())
    quit()

# list all symbols available on the connected server
symbols = mt5.symbols_get()
list_symbols = []
for num in range(0, len(symbols)):
    list_symbols.append(symbols[num].name)
print(list_symbols)

# futures contracts to download (daily timeframe, 2017-2023)
list_futures = ['Aus200', 'Bra50', 'Esp35', 'Fra40', 'Ger40', 'HKInd', 'Ita40Mar24', 'Jp225', 'MinDolFeb24', 'Neth25Jan24', 'UK100', 'Usa500', 'UsaRus', 'UsaTec', 'Swi20Mar24']
time_frame = mt5.TIMEFRAME_D1

dynamic_vars = {}
time_zone = pytz.timezone('Etc/UTC')
time_start = datetime(2017, 1, 1, tzinfo=time_zone)
time_end = datetime(2023, 12, 31, tzinfo=time_zone)

# download the daily close prices of each future and save one CSV per symbol
for sym in list_futures:
    var = f'{sym}'
    rates = mt5.copy_rates_range(sym, time_frame, time_start, time_end)
    rates_frame = pd.DataFrame(rates)
    rates_frame['time'] = pd.to_datetime(rates_frame['time'], unit='s')
    rates_frame = rates_frame[['time', 'close']]
    rates_frame.rename(columns={'close': var}, inplace=True)
    dynamic_vars[var] = rates_frame
    dynamic_vars[sym].to_csv(f'{sym}.csv', index=False)
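The per-symbol CSV files written above can then be recombined into a single table of daily closes; a minimal sketch, assuming the loop above completed for every symbol in list_futures:

import pandas as pd

# read each saved CSV back, index it by time, and align the close-price columns side by side
frames = [pd.read_csv(f'{sym}.csv', parse_dates=['time']).set_index('time') for sym in list_futures]
prices = pd.concat(frames, axis=1)
print(prices.tail())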
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

Annotation
The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.

Installation
pip install pandas pyarrow

Example
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])

dataset                                      AudioSet
filename                        train/---2_BBVHAA.mp3
captions_visual    [a man in a black hat and glasses.]
captions_auditory     [a man speaks and dishes clank.]
tags                                         [Speech]

Description
The annotation file consists of the following fields:
- filename: Name of the corresponding file (video or audio file)
- dataset: Source dataset associated with the data point
- captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
- captions_auditory: A list of captions related to the auditory content of the video
- tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
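A minimal sketch (the CSV file name below is a placeholder; substitute the actual daily or hourly file name from the dataset distribution):

import pandas as pd

# placeholder file name; use one of the provided daily/hourly CSV files
df = pd.read_csv('daily_fitbit.csv')
print(df.head())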
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
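Once the collections are restored, they can be queried from Python; a minimal sketch using pymongo (connection settings assume the default local MongoDB instance used in the commands above):

import pymongo

# connect to the local MongoDB instance and open the restored database
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['rais_anonymized']

# fetch one example document from the fitbit collection
print(db['fitbit'].find_one())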
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
https://creativecommons.org/publicdomain/zero/1.0/
By Vezora (From Huggingface) [source]
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.
This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.
By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.
Contents of the Dataset
The dataset consists of several columns:
- output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
- instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
- input: The input parameters or values required to execute each Python code sample.
Exploring the Dataset
To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:
- Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
import pandas as pd

# Load the dataset
df = pd.read_csv('train.csv')
- Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
# Display column names
print(df.columns)
- Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
# Display random samples from 'output' column
print(df['output'].sample(5))
- Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
# Count unique instructions and display top ones with highest occurrences
instruction_counts = df['instruction'].value_counts()
print(instruction_counts.head(10))
Potential Use Cases
The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:
- Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
- Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
- Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
- Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.
Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different aspects of Python programming.
- Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
- Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
- Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
polyOne Data Set
The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with polyOne_*.parquet. I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask
```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```
For example, compute the description of the data set
```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
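A minimal sketch for reading one of the PSMILES text files (one string per line, as noted above):

# read the PSMILES strings line by line into a Python list
with open('generated_polymer_smiles_dev.txt') as f:
    psmiles = [line.strip() for line in f]
print(len(psmiles), psmiles[:3])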
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Deep Learning for 3D Topology Optimization
This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.
The following paper provides full documentation and examples:
Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.
The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file contains multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

For the i-th sample, the columns of i_info.csv correspond to the following scalar information:
- E - Young's modulus [Pa]
- ν - Poisson's ratio [-]
- σ_ys - a yield stress [Pa]
- h - discretization size of the voxel grid [m]

The columns of i.csv correspond to the following voxel-wise information:
- x, y, z - the indices that state the location of the voxel within the voxel mesh
- Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
- Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
- F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
- density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem
How to Import the Dataset
with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset
object. As shown in the tutorial this can be done via:
from dl4to.datasets import SELTODataset
dataset = SELTODataset(root=root, name=name, train=train)
Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.
without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:
import pandas as pd
root = ...
file_path = f'{root}/{i}.csv'
columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)
Similarly, we can import an i_info.csv file via:
file_path = f'{root}/{i}_info.csv'
info_column_names = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_column_names)
We can extract PyTorch tensors from the Pandas dataframe df using the following function:
import torch

def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
    # the voxel grid shape is given by the largest (x, y, z) index plus one
    shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    voxels = [df['x'].values, df['y'].values, df['z'].values]

    Ω_design = torch.zeros(1, *shape, dtype=int)
    Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

    Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
    Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
    Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
    Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

    F = torch.zeros(3, *shape, dtype=dtype)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

    density = torch.zeros(1, *shape, dtype=dtype)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

    return Ω_design, Ω_Dirichlet, F, density
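For example, applied to the dataframe df loaded from an i.csv file as above:

Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
print(Ω_design.shape, Ω_Dirichlet.shape, F.shape, density.shape)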
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
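As a quick sketch, a true stress-strain curve can be plotted from one of these files (the file name below is a placeholder for an actual "LP_..." data file):

import pandas
import matplotlib.pyplot as plt

# placeholder file name; substitute an actual downsampled data file from the database
data = pandas.read_csv('LP_example.csv', index_col=0)
plt.plot(data['e_true'], data['Sigma_true'])
plt.xlabel('true strain, e_true')
plt.ylabel('true stress, Sigma_true')
plt.show()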
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
keep_default_na=False, na_values='')
Caveats
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset consists of four years of technical language annotations from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to annotation note contents, and the second column corresponds to annotation titles. The annotations are in Swedish, and processed so that all mentions of personal information are replaced with the string ‘egennamn’, meaning “personal name” in Swedish. Each row corresponds to one annotation with the corresponding title. Data can be accessed in Python with: import pandas as pd annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl") annotation_contents = annotations_df['noteComment'] annotation_titles = annotations_df['title']
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Find import shipments and details from the Panda Cycles Llc import data report, along with address, suppliers, and products.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all the relevant data for the algorithms described in the paper "Irradiance and cloud optical properties from solar photovoltaic systems", which were developed within the framework of the MetPVNet project.
Input data:
COSMO weather model data (DWD) as NetCDF files (cosmo_d2_2018(9).tar.gz)
COSMO atmospheres for libRadtran (cosmo_atmosphere_libradtran_input.tar.gz)
COSMO surface data for calibration (cosmo_pvcal_output.tar.gz)
Aeronet data as text files (MetPVNet_Aeronet_Input_Data.zip)
Measured data from the MetPVNet measurement campaigns as text files (MetPVNet_Messkampagne_2018(9).tar.gz)
PV power data
Horizontal and tilted irradiance from pyranometers
Longwave irradiance from pyrgeometer
MYSTIC-based lookup table for translated tilted to horizontal irradiance (gti2ghi_lut_v1.nc)
Output data:
Global tilted irradiance (GTI) inferred from PV power plants (with calibration parameters in comments)
Linear temperature model: MetPVNet_gti_cf_inversion_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_gti_cf_inversion_results_faiman.tar.gz
Global horizontal irradiance (GHI) inferred from PV power plants
Linear temperature model: MetPVNet_ghi_inversion_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_ghi_inversion_results_faiman.tar.gz
Combined GHI averaged to 60 minutes and compared with COSMO data
Linear temperature model: MetPVNet_ghi_inversion_combo_60min_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_ghi_inversion_combo_60min_results_faiman.tar.gz
Cloud optical depth inferred from PV power plants
Linear temperature model: MetPVNet_cod_cf_inversion_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_cod_cf_inversion_results_faiman.tar.gz
Combined COD averaged to 60 minutes and compared with COSMO and APOLLO_NG data
Linear temperature model: MetPVNet_cod_inversion_combo_60min_results_linear.tar.gz
Faiman non-linear temperature model: MetPVNet_cod_inversion_combo_60min_results_faiman.tar.gz
Validation data:
COSMO cloud optical depth (cosmo_cod_output.tar.gz)
APOLLO_NG cloud optical depth (MetPVNet_apng_extract_all_stations_2018(9).tar.gz)
COSMO irradiance data for validation (cosmo_irradiance_output.tar.gz)
CAMS irradiance data for validation (CAMS_irradiation_detailed_MetPVNet_MK_2018(9).zip)
How to import results:
The results files are stored as text files ".dat", using Python multi-index columns. In order to import the data into a Pandas dataframe, use the following lines of code (replace [filename] with the relevant file name):
import pandas as pd
data = pd.read_csv("[filename].dat", comment='#', header=[0,1], delimiter=';', index_col=0, parse_dates=True)
This gives a multi-index DataFrame in which the index column is the timestamp, the first column level corresponds to the measured variable, and the second level to the relevant sensor.
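As a sketch, the available variable and sensor labels can then be inspected directly from the two column levels:

# list the measured variables (first column level) and sensors (second column level)
print(data.columns.get_level_values(0).unique())
print(data.columns.get_level_values(1).unique())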
Note:
The output data has been updated to match the latest version of the paper, whereas the input and validation data remains the same as in Version 1.0.0
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku, over the three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory/compute bound label), which allows for predicting a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.
The sensitive data appears both in anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
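To work with more than one month at a time, the monthly files can be concatenated; a minimal sketch, assuming the .parquet files sit in the current directory:

import glob
import pandas as pd

# read every monthly file and stack the rows into one DataFrame
files = sorted(glob.glob("*.parquet"))
df_all = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
print(df_all.shape)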
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset for whiskey_classificator
How is this dataset generated?
This dataset comes from a function which creates a synthetic dataset emulating data that could be used in a real whiskey classification.
import pandas as pd
import numpy as np
import random

def generate_whiskey(num_rows=500):
    """
    Generate a balanced and shuffled DataFrame with whiskey data across all price categories.

    Parameters:
    - num_rows: int — Number of rows to generate (default… See the full description on the dataset page: https://huggingface.co/datasets/Rogudev/whiskey_dataset.
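A minimal sketch for loading the dataset from the Hugging Face Hub with the datasets library (the split name is an assumption, not confirmed by the dataset page):

from datasets import load_dataset

# split name assumed; check the dataset page for the available splits
ds = load_dataset("Rogudev/whiskey_dataset", split="train")
df = ds.to_pandas()
print(df.head())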
Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Speed profiles of freeways in California (I5-S and I210-E). Original data is retrieved from PeMS.
Each YEAR_FREEWAY.csv file contains Timestamp and Speed data.
freeway_meta.csv file contains meta information for each detector: freeway number, direction, detector ID, absolute milepost, and x y coordinates.
# Freeway speed data description
### Data loading example (single freeway: I5-S 2012)
```python
%%time
import pandas as pd
# Date time parser (pd.datetime was removed in recent pandas versions, so use datetime.strptime)
from datetime import datetime
mydateparser = lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M:%S")
# Freeway data loading (This part should be changed to a proper URL in zenodo.org)
data = pd.read_csv("dataset/2012_I5S.csv",
parse_dates=["Timestamp"],
date_parser=mydateparser).pivot(index="Timestamp",columns='Station_ID', values='Speed')
# Meta data loading
meta = pd.read_csv("dataset/freeway_meta.csv").set_index(['Fwy','Dir'])
```
CPU times: user 50.5 s, sys: 911 ms, total: 51.4 s
Wall time: 50.9 s
### Speed data and meta data
```python
data.head()
```
Station_ID              1     2     3     4     5     6     7     8     9    10  ...    80    81    82    83    84    85    86    87    88    89
Timestamp
2012-01-01 06:00:00  70.0  69.8  70.1  69.6  69.9  70.8  70.1  69.3  69.2  68.2  ...  72.1  67.6  71.0  66.8  65.9  58.2  67.1  63.8  67.1  71.6
2012-01-01 06:05:00  69.2  69.8  69.8  69.4  69.5  69.5  68.3  67.5  67.4  67.2  ...  71.5  66.1  69.5  67.4  68.3  59.0  66.9  60.8  66.6  65.7
2012-01-01 06:10:00  69.2  69.0  68.6  68.7  68.6  68.9  61.7  68.3  67.4  67.7  ...  71.1  65.2  71.2  66.5  65.4  59.6  66.3  58.4  68.2  65.6
2012-01-01 06:15:00  69.9  69.6  69.7  69.2  69.0  69.1  65.3  67.6  67.1  66.8  ...  69.9  67.1  69.3  66.9  68.2  60.6  66.0  55.5  67.1  69.7
2012-01-01 06:20:00  68.7  68.4  68.2  67.9  68.3  69.3  67.0  68.4  68.2  68.2  ...  70.9  67.2  69.9  65.6  66.7  62.8  66.2  62.6  67.2  67.5
5 rows × 89 columns
```python
meta.head()
```
         ID  Abs_mp   Latitude   Longitude
Fwy Dir
5   S     1   0.058  32.542731 -117.030501
    S     2   0.146  32.543587 -117.031769
    S     3   1.291  32.552409 -117.048120
    S     4   2.222  32.558422 -117.062360
    S     5   2.559  32.561106 -117.067228
### Choose a day
```python
# Sampling (2012-01-13)
myday = "2012-01-13"
# Filter the data by the day
myday_speed_data = data.loc[myday]
```
### A speed profile
```python
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
# Axis value setting
mp = meta[meta.ID.isin(data.columns)].Abs_mp
hour = myday_speed_data.index
# Draw the day
fig, ax = plt.subplots()
heatmap = ax.pcolormesh(hour,mp,myday_speed_data.T, cmap=plt.cm.RdYlGn, vmin=0, vmax=80, alpha=1)
plt.colorbar(heatmap, ax=ax)
# Appearance setting
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H"))
plt.title(pd.Timestamp(myday).strftime("%Y-%m-%d [%a]"))
plt.xlabel("hour")
plt.ylabel("milepost")
plt.show()
```

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sets comprising:

(a) FigData (Figure data): .npz files (https://numpy.org/doc/stable/reference/generated/numpy.savez.html) containing pandas data structures (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) (effectively, N-dimensional arrays with labelled axes) containing the data plotted in the paper, named according to the figure to which the data corresponds (e.g. S1a_inset contains the data for the inset to subfigure a of figure 1 of the Supplementary Material). As well as having the data in .npz files, versions of the data are stored in .pkl pandas data frames (read with "import pandas as pd; pd.read_pickle('path-to-bla.npz')") as well as .csv (comma-separated values) files (a text format). Note: Bootstrapped data will not be identical to that from the paper, due to the use of an unset seed when generating the figures for the paper.

(b) FigGenData (Figure-generation data): Data sets used in the actual generation of the figures, as seen in the figure-generation code (FigGen.py file) at https://physics.paperswithcode.com/paper/elastoplasticity-mediates-dynamical

To read the metadata of file bla.npz, which includes the name of the code used to generate it, use:
import numpy as np
np.load('path-to-bla.npz', allow_pickle=True)['metadata'].flat[0]
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Customs records are available for TIANJIN PANDA IMPORT AND EXPORT TRADE CO.,LTD AS AGENT SIA D&D LOGISTIC RIGA LATVIJA. Learn about its importer, supply capabilities, and the countries to which it supplies goods.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Find import shipments and details from the Panda Logistics Usa New York import data report, along with address, suppliers, and products.
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.