Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild, large-scale physical activity patterns, sleep, stress, and overall health on the one hand, and behavioral patterns and psychological measurements on the other, due to challenges in collecting and releasing such datasets, including waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset: a multi-modal, longitudinal, and geographically distributed dataset containing a plethora of anthropological data, collected unobtrusively over a total period of more than 4 months by n=71 participants under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types, from second-level to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data openly available to empower future research. We envision that releasing this large-scale dataset of multi-modal, real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing the Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files with any programming language; in Python, for example, you can load a file into a Pandas DataFrame with the pandas.read_csv() function.
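A minimal sketch in Python (the file name below is a placeholder; substitute any of the provided CSV files):

```python
import pandas as pd

# load one of the provided daily/hourly CSV files into a DataFrame
# (the file name is illustrative, not part of the dataset documentation)
df = pd.read_csv("daily_fitbit.csv")
print(df.head())
```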
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend working with the raw, complete data by importing the LifeSnaps MongoDB database.
To do so, open a terminal/command prompt and run the following command for each collection in the database. Ensure that you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit <path_to_fitbit_bson_file>
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema <path_to_sema_bson_file>
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys <path_to_surveys_bson_file>
If you have access control enabled, you will also need to add the --username and --password parameters to the above commands.
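For example (the username, password, and file path below are placeholders):

```
mongorestore --host localhost:27017 --username <user> --password <password> -d rais_anonymized -c fitbit <path_to_fitbit_bson_file>
```

Depending on your setup, you may also need the --authenticationDatabase parameter.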
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the general format below:

{
    _id: <ObjectId>,
    ...
}
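Once restored, the collections can be queried from Python with pymongo. A minimal sketch, assuming a local MongoDB on the default port and the database and collection names used above:

```python
from pymongo import MongoClient
import pandas as pd

# connect to the locally restored database
client = MongoClient("localhost", 27017)
db = client["rais_anonymized"]

# fetch a few Fitbit documents and flatten them into a DataFrame
docs = list(db["fitbit"].find().limit(5))
df = pd.DataFrame(docs)
print(df.head())
```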
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of four years of technical language annotations from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to annotation note contents and the second column corresponds to annotation titles. The annotations are in Swedish and are processed so that all mentions of personal information are replaced with the string 'egennamn', meaning "personal name" in Swedish. Each row corresponds to one annotation with the corresponding title. The data can be accessed in Python with:

import pandas as pd

annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl")
annotation_contents = annotations_df['noteComment']
annotation_titles = annotations_df['title']
import pandas as pd
import json

data = [
    ("Today around 11:30am I realized that I've only ever worked in JavaScript and am wondering if I'm cooked...", 7),
    ("Today I learned how to check in. I'll be working on fixing my CoLab and starting to work on any datasets missed", 6),
    ("Today I finally figured out how to set up GitHub repositories and push my first project...", 8),
    ("Today I decided to dive into Python’s object-oriented programming...", 8),
    …

See the full description on the dataset page: https://huggingface.co/datasets/MaryahGreene/regression.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
import pandas as pd
import numpy as np

PERFORMING EDA

# first look at the data and its dtypes
data.head()
data.info()

# all columns except the target (first column)
attributes_data = data.iloc[:, 1:]
attributes_data

# summary statistics and pairwise correlations of the predictors
attributes_data.describe()
attributes_data.corr()

import seaborn as sns
import matplotlib.pyplot as plt

# heatmap of the predictor correlation matrix
correlation_matrix = attributes_data.corr()
plt.figure(figsize=(18, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
CHECKING IF DATASET IS LINEAR OR NON-LINEAR
# correlation of each predictor with the binary target
correlations = data.corr()['Diabetes_binary'].drop('Diabetes_binary')

plt.figure(figsize=(10, 6))
correlations.plot(kind='bar')
plt.xlabel('Predictor Columns')
plt.ylabel('Correlation values')
plt.title('Correlation between Diabetes_binary and Predictors')
plt.show()
CHECKING FOR NULL AND MISSING VALUES, CLEANING THEM
# isnull() and isna() are aliases; either check suffices
print(data.isnull().sum())
print(data.isna().sum())
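If any missing values do show up, a minimal way to clean them (dropping affected rows is an assumption about the right strategy here; imputation may be preferable for some analyses):

```python
# drop any rows containing missing values
data = data.dropna()
```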
LASSO

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split, GridSearchCV, KFold

# predictors and target, then a 70/30 train/test split
X = data.iloc[:, 1:]
y = data.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 500 candidate alphas between 1e-5 and 10
# (np.linspace, not np.arange, gives 500 evenly spaced values)
parameters = {"alpha": np.linspace(0.00001, 10, 500)}
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
lassoReg = Lasso()

# grid-search the alpha over 10-fold cross-validation
lasso_cv = GridSearchCV(lassoReg, param_grid=parameters, cv=kfold)
lasso_cv.fit(X, y)
print("Best Params {}".format(lasso_cv.best_params_))
column_names = list(data)
column_names = column_names[1:]
column_names

lassoModel = Lasso(alpha=0.00001)
lassoModel.fit(X_train, y_train)
lasso_coeff = np.abs(lassoModel.coef_)  # making all coefficients positive
plt.bar(column_names, lasso_coeff, color='orange')
plt.xticks(rotation=90)
plt.grid()
plt.title("Feature Selection Based on Lasso")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.ylim(0, 0.16)
plt.show()
RFE

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
rfecv = RFECV(estimator=model, step=1, cv=20, scoring="accuracy")
rfecv = rfecv.fit(X_train, y_train)
# mean cross-validated accuracy for each candidate number of features
# (rfecv.ranking_ holds feature ranks, not scores; scores live in cv_results_ as of scikit-learn 1.0)
cv_scores = rfecv.cv_results_["mean_test_score"]
num_features_selected = len(cv_scores)
plt.figure(figsize=(10, 6))
plt.xlabel("Number of features selected")
plt.ylabel("Score (accuracy)")
plt.plot(range(1, num_features_selected + 1), cv_scores, marker='o', color='r')
plt.xticks(range(1, num_features_selected + 1))  # set x-ticks to integers
plt.grid()
plt.title("RFECV: Number of Features vs. Score (accuracy)")
plt.show()

print("The optimal number of features:", rfecv.n_features_)
print("Best features:", X_train.columns[rfecv.support_])
PCA

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = data.drop(["Diabetes_binary"], axis=1)
y = data["Diabetes_binary"]
# build the frame from the predictors only, so the target does not leak into the PCA
df1 = pd.DataFrame(data=X, columns=X.columns)
print(df1)

scaling = StandardScaler()
scaling.fit(df1)
Scaled_data = scaling.transform(df1)

# project onto the first three principal components
principal = PCA(n_components=3)
principal.fit(Scaled_data)
x = principal.transform(Scaled_data)
print(x.shape)
principal.components_
plt.scatter(x[:, 0], x[:, 1], c=data['Diabetes_binary'], cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')
print(principal.explained_variance_ratio_)
T-SNE

from sklearn.manifold import TSNE
import seaborn as sns

tsne = TSNE(n_components=3, verbose=1, random_state=42)
z = tsne.fit_transform(X)

df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:, 0]
df["comp-2"] = z[:, 1]
df["comp-3"] = z[:, 2]

sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                palette=sns.color_palette("husl", 2),
                data=df).set(title="Diabetes data T-SNE projection")
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
What follows is research code. It is by no means optimized for speed, efficiency, or readability.
Data loading, tokenizing and sharding
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from tqdm.notebook import tqdm
from openTSNE import TSNE
import datashader as ds
import colorcet as cc

from dask.distributed import Client
import dask.dataframe as dd
import dask_ml
import …

See the full description on the dataset page: https://huggingface.co/datasets/christopher/roots-tsne-data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The ONE DATA data science workflow dataset ODDS-full comprises 815 unique workflows in temporally ordered versions.
A version of a workflow describes its evolution over time: whenever a workflow is altered meaningfully, a new version of the respective workflow is persisted.
Overall, 16035 versions are available.
The ODDS-full workflows represent machine learning workflows expressed as node-heterogeneous DAGs with 156 different node types.
These node types represent various kinds of processing steps of a general machine learning workflow and are grouped into 5 categories, which are listed below.
Any metadata beyond the structure and node types of a workflow has been removed for anonymization purposes.
ODDS, a filtered variant that enforces weak connectedness and only contains workflows with at least 5 different versions and 5 nodes, is available as the default version for supervised and unsupervised learning.
Workflows are served as JSON node-link graphs via networkx.
They can be loaded into Python as follows:
import pandas as pd
import networkx as nx
import json
with open('ODDS.json', 'r') as f:
    graphs = pd.Series(list(map(nx.node_link_graph, json.load(f)['graphs'])))
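Each entry of the resulting series is an ordinary networkx graph, so standard graph operations apply. A small sanity-check sketch (the exact node attribute keys are not documented above, so inspect them first):

```python
# inspect the first workflow version
g = graphs.iloc[0]
print(g.number_of_nodes(), g.number_of_edges())

# peek at a few nodes to see which attribute holds the node type
print(list(g.nodes(data=True))[:5])
```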
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This is a single NumPy array with all the Optiver data joined together, along with some precomputed features from a public notebook. It is designed to be memory-mapped so that you can read small pieces at a time.
This is one big array with the trade and book data joined together, plus some pre-computed features. The dtype of the array is float16. The array's shape is (n_times, n_stocks, 600, 27), where 600 is the max seconds_in_bucket and 27 is the number of columns.
Add the dataset to your notebook and then:

```python
import numpy as np

ntimeids = 3830
nstocks = 112
ncolumns = 27
nseq = 600

arr = np.memmap('../input/optiver-precomputed-features-numpy-array/data.array',
                mode='r', dtype=np.float16,
                shape=(ntimeids, nstocks, nseq, ncolumns))
```
There are gaps in the stock IDs and time IDs, which doesn't work well with an array format, so the dataset also uses time and stock indexes (an _ix suffix instead of _id). To calculate these:
```python
import numpy as np
import pandas as pd

targets = pd.read_csv('/kaggle/input/optiver-realized-volatility-prediction/train.csv')

ntimeids = targets.time_id.nunique()
stock_ids = list(sorted(targets.stock_id.unique()))
timeids = sorted(targets.time_id.unique())

timeid_to_ix = {time_id: i for i, time_id in enumerate(timeids)}
stock_id_to_ix = {stock_id: i for i, stock_id in enumerate(stock_ids)}
```
So, to get the data for stock_id 13 at time_id 146, you would do:

```python
stock_ix = stock_id_to_ix[13]
time_ix = timeid_to_ix[146]
arr[time_ix, stock_ix]
```
Notice that the third dimension has size 600 (the maximum number of points for a given time_ix, stock_ix); some of these rows will be empty.
To truncate a single stock's data, do:

```python
max_seq_ix = (arr[time_ix, stock_ix, :, -1] > 0).cumsum().max()
arr[time_ix, stock_ix, :max_seq_ix]
```
The 27 columns in the last dimension are:
['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1', 'bid_price2', 'ask_price2', 'bid_size1', 'ask_size1', 'bid_size2', 'ask_size2', 'stock_id', 'wap1', 'wap2', 'log_return1', 'log_return2', 'wap_balance', 'price_spread', 'bid_spread', 'ask_spread', 'total_volume', 'volume_imbalance', 'price', 'size', 'order_count', 'stock_id_y', 'log_return_trade', 'target']
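To pull out a single named column, index the last axis by its position in that list. A small sketch reusing arr, time_ix, and stock_ix from above (the name columns is simply the list above bound to a variable):

```python
# the 27 column names, as documented above
columns = ['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1',
           'bid_price2', 'ask_price2', 'bid_size1', 'ask_size1',
           'bid_size2', 'ask_size2', 'stock_id', 'wap1', 'wap2',
           'log_return1', 'log_return2', 'wap_balance', 'price_spread',
           'bid_spread', 'ask_spread', 'total_volume', 'volume_imbalance',
           'price', 'size', 'order_count', 'stock_id_y',
           'log_return_trade', 'target']

# e.g. the wap1 series for one stock at one time id
wap1 = arr[time_ix, stock_ix, :, columns.index('wap1')]
```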
This dataset contains information about the state of crime in Russia since 2003. The table shows the number of crimes of various types for each month.
Your task (or desire) is to analyze the data and forecast the number of crimes for the next month or year. Good luck and peace🙌
Import:
```python
import pandas as pd

data = pd.read_csv('crime.csv', parse_dates=['month'], index_col=['month'], dayfirst=True)
```
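Before forecasting, a quick first look at the series helps (the column names are not documented above, so inspect them first):

```python
import matplotlib.pyplot as plt

# list the available crime-type columns, then plot the monthly counts
print(data.columns)
data.plot(figsize=(12, 6), legend=False)
plt.show()
```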
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A large data set of go-arounds, also referred to as missed approaches. The data set supports the paper presented at the OpenSky Symposium on November 10th.
If you use this data for a scientific publication, please consider citing our paper.
The data set contains landings at 176 (mostly) large airports in 44 different countries. The landings are labelled as performing a go-around (GA) or not. In total, the data set contains almost 9 million landings, with more than 33,000 GAs. The data was collected from the OpenSky Network's historical database for the year 2019. The published data set contains multiple files:
go_arounds_minimal.csv.gz
Compressed CSV containing the minimal data set. It contains a row for each landing, with a minimal amount of information about the landing and whether it was a GA. The data is structured in the following way:
| Column name | Type | Description |
| --- | --- | --- |
| time | date time | UTC time of landing or first GA attempt |
| icao24 | string | Unique 24-bit (hexadecimal number) ICAO identifier of the aircraft concerned |
| callsign | string | Aircraft identifier in air-ground communications |
| airport | string | ICAO airport code where the aircraft is landing |
| runway | string | Runway designator on which the aircraft landed |
| has_ga | string | "True" if at least one GA was performed, otherwise "False" |
| n_approaches | integer | Number of approaches identified for this flight |
| n_rwy_approached | integer | Number of unique runways approached by this flight |
The last two columns, n_approaches and n_rwy_approached, are useful for filtering out training and calibration flights. These usually have a large number of approaches, so an easy way to exclude them is to filter by n_approaches > 2, as in the sketch below.
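For example, with pandas (a sketch; the file name matches the minimal data set above):

```python
import pandas as pd

df = pd.read_csv("go_arounds_minimal.csv.gz", low_memory=False)

# keep regular landings, dropping likely training/calibration flights
df = df[df["n_approaches"] <= 2]
```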
go_arounds_augmented.csv.gz
Compressed CSV containing the augmented data set. It contains a row for each landing, with additional information about the landing and whether it was a GA. The data is structured in the following way:
| Column name | Type | Description |
| --- | --- | --- |
| time | date time | UTC time of landing or first GA attempt |
| icao24 | string | Unique 24-bit (hexadecimal number) ICAO identifier of the aircraft concerned |
| callsign | string | Aircraft identifier in air-ground communications |
| airport | string | ICAO airport code where the aircraft is landing |
| runway | string | Runway designator on which the aircraft landed |
| has_ga | string | "True" if at least one GA was performed, otherwise "False" |
| n_approaches | integer | Number of approaches identified for this flight |
| n_rwy_approached | integer | Number of unique runways approached by this flight |
| registration | string | Aircraft registration |
| typecode | string | Aircraft ICAO typecode |
| icaoaircrafttype | string | ICAO aircraft type |
| wtc | string | ICAO wake turbulence category |
| glide_slope_angle | float | Angle of the ILS glide slope in degrees |
| has_intersection | string | Boolean that is true if the runway has another runway intersecting it, otherwise false |
| rwy_length | float | Length of the runway in kilometres |
| airport_country | string | ISO Alpha-3 country code of the airport |
| airport_region | string | Geographical region of the airport (either Europe, North America, South America, Asia, Africa, or Oceania) |
| operator_country | string | ISO Alpha-3 country code of the operator |
| operator_region | string | Geographical region of the operator of the aircraft (either Europe, North America, South America, Asia, Africa, or Oceania) |
| wind_speed_knts | integer | METAR, surface wind speed in knots |
| wind_dir_deg | integer | METAR, surface wind direction in degrees |
| wind_gust_knts | integer | METAR, surface wind gust speed in knots |
| visibility_m | float | METAR, visibility in metres |
| temperature_deg | integer | METAR, temperature in degrees Celsius |
| press_sea_level_p | float | METAR, sea level pressure in hPa |
| press_p | float | METAR, QNH in hPa |
| weather_intensity | list | METAR, list of present weather codes: qualifier - intensity |
| weather_precipitation | list | METAR, list of present weather codes: weather phenomena - precipitation |
| weather_desc | list | METAR, list of present weather codes: qualifier - descriptor |
| weather_obscuration | list | METAR, list of present weather codes: weather phenomena - obscuration |
| weather_other | list | METAR, list of present weather codes: weather phenomena - other |
This data set is augmented with data from various public data sources. Aircraft-related data is mostly from the OpenSky Network's aircraft database, the METAR information is from Iowa State University, and the rest is mostly scraped from different websites. If you need help with the METAR information, you can consult the WMO's Aerodrome Reports and Forecasts handbook.
go_arounds_agg.csv.gz
Compressed CSV containing the aggregated data set. It contains a row for each airport-runway pair, i.e. every runway at every airport for which data is available. The data is structured in the following way:
| Column name | Type | Description |
| --- | --- | --- |
| airport | string | ICAO airport code where the aircraft is landing |
| runway | string | Runway designator on which the aircraft landed |
| n_landings | integer | Total number of landings observed on this runway in 2019 |
| ga_rate | float | Go-around rate, per 1000 landings |
| glide_slope_angle | float | Angle of the ILS glide slope in degrees |
| has_intersection | string | Boolean that is true if the runway has another runway intersecting it, otherwise false |
| rwy_length | float | Length of the runway in kilometres |
| airport_country | string | ISO Alpha-3 country code of the airport |
| airport_region | string | Geographical region of the airport (either Europe, North America, South America, Asia, Africa, or Oceania) |
This aggregated data set is used in the paper for the generalized linear regression model.
Downloading the trajectories
Users of this data set with access to the OpenSky Network's Impala shell can download the historical trajectories from the historical database with a few lines of Python code. Suppose, for example, that you want to get all the go-arounds of 4 January 2019 at London City Airport (EGLC). You can use the Traffic library for easy access to the database:
import datetime

from tqdm.auto import tqdm
import pandas as pd
from traffic.data import opensky
from traffic.core import Traffic

df = pd.read_csv("go_arounds_minimal.csv.gz", low_memory=False)
df["time"] = pd.to_datetime(df["time"])

airport = "EGLC"
start = datetime.datetime(year=2019, month=1, day=4).replace(
    tzinfo=datetime.timezone.utc
)
stop = datetime.datetime(year=2019, month=1, day=5).replace(
    tzinfo=datetime.timezone.utc
)

df_selection = df.query("airport==@airport & has_ga & (@start <= time <= @stop)")

flights = []
delta_time = pd.Timedelta(minutes=10)
for _, row in tqdm(df_selection.iterrows(), total=df_selection.shape[0]):
    # take at most 10 minutes before and 10 minutes after the landing or go-around
    start_time = row["time"] - delta_time
    stop_time = row["time"] + delta_time

    # fetch the data from OpenSky Network
    flights.append(
        opensky.history(
            start=start_time.strftime("%Y-%m-%d %H:%M:%S"),
            stop=stop_time.strftime("%Y-%m-%d %H:%M:%S"),
            callsign=row["callsign"],
            return_flight=True,
        )
    )

Traffic.from_flights(flights)
Additional files
Additional files are available to check the quality of the classification into GA/not GA and the selection of the landing runway. These are:
validation_table.xlsx: This Excel sheet was completed manually during the review of the samples for each runway in the data set. It provides an estimate of the false positive and false negative rates of the go-around classification. It also provides an estimate of the runway misclassification rate when the airport has two or more parallel runways. The columns with headers highlighted in red were filled in manually; the rest is generated automatically.
validation_sample.zip: For each runway, 8 batches of 500 randomly selected trajectories (or as many as are available, if fewer than 4000) classified as not having a GA, and up to 8 batches of 10 random landings classified as GA, are plotted. This allows the interested user to easily inspect a random sample of the landings and go-arounds visually.