100+ datasets found

s
Python Import Data India – Buyers & Importers List
seair.co.in
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seair Exim, Python Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in
Explore at:
.bin, .xml, .csv, .xlsAvailable download formats
Dataset provided by
Seair Info Solutions PVT LTD
Authors
Seair Exim
Area covered
India
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
i
Code to import PSCAD data into Python (Spyder)
ieee-dataport.org
Updated Nov 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Franz Guzman Llanos (2025). Code to import PSCAD data into Python (Spyder) [Dataset]. https://ieee-dataport.org/documents/code-import-pscad-data-python-spyder
Explore at:
Dataset updated
Nov 20, 2025
Authors
Franz Guzman Llanos
Description
minimizes errors
s
Python Import Data in February - Seair.co.in
seair.co.in
Updated Feb 18, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seair Exim (2016). Python Import Data in February - Seair.co.in [Dataset]. https://www.seair.co.in
Explore at:
.bin, .xml, .csv, .xlsAvailable download formats
Dataset updated
Feb 18, 2016
Dataset provided by
Seair Info Solutions PVT LTD
Authors
Seair Exim
Area covered
Argentina, Nauru, Malaysia, Gibraltar, Slovakia, Tokelau, Timor-Leste, French Guiana, Korea (Democratic People's Republic of), Austria
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
s
Python Import Data in August - Seair.co.in
seair.co.in
Updated Aug 20, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seair Exim (2016). Python Import Data in August - Seair.co.in [Dataset]. https://www.seair.co.in
Explore at:
.bin, .xml, .csv, .xlsAvailable download formats
Dataset updated
Aug 20, 2016
Dataset provided by
Seair Info Solutions PVT LTD
Authors
Seair Exim
Area covered
Belgium, South Africa, Christmas Island, Nepal, Lebanon, Virgin Islands (U.S.), Saint Pierre and Miquelon, Falkland Islands (Malvinas), Ecuador, Gambia
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Z
Storage and Transit Time Data and Code
data.niaid.nih.gov
zenodo.org
Updated Jun 12, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8136816
Explore at:
Dataset updated
Jun 12, 2024
Dataset provided by
Montana State University
Authors
Andrew Felton
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Author: Andrew J. FeltonDate: 5/5/2024

This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis and figure production for the study entitled:

"Global estimates of the storage and transit time of water through vegetation"

Please note that 'turnover' and 'transit' are used interchangeably in this project.

Data information:

The data folder contains key data sets used for analysis. In particular:

"data/turnover_from_python/updated/annual/multi_year_average/average_annual_turnover.nc" contains a global array summarizing five year (2016-2020) averages of annual transit, storage, canopy transpiration, and number of months of data. This is the core dataset for the analysis; however, each folder has much more data, including a dataset for each year of the analysis. Data are also available is separate .csv files for each land cover type. Oterh data can be found for the minimum, monthly, and seasonal transit time found in their respective folders. These data were produced using the python code found in the "supporting_code" folder given the ease of working with .nc and EASE grid in the xarray python module. R was used primarily for data visualization purposes. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here.

Code information

Python scripts can be found in the "supporting_code" folder.

Each R script in this project has a particular function:

01_start.R: This script loads the R packages used in the analysis, sets thedirectory, and imports custom functions for the project. You can also load in the main transit time (turnover) datasets here using the source() function.

02_functions.R: This script contains the custom function for this analysis, primarily to work with importing the seasonal transit data. Load this using the source() function in the 01_start.R script.

03_generate_data.R: This script is not necessary to run and is primarilyfor documentation. The main role of this code was to import and wranglethe data needed to calculate ground-based estimates of aboveground water storage.

04_annual_turnover_storage_import.R: This script imports the annual turnover andstorage data for each landcover type. You load in these data from the 01_start.R scriptusing the source() function.

05_minimum_turnover_storage_import.R: This script imports the minimum turnover andstorage data for each landcover type. Minimum is defined as the lowest monthlyestimate.You load in these data from the 01_start.R scriptusing the source() function.

06_figures_tables.R: This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the manuscript_figures folder. Note that allmaps were produced using Python code found in the "supporting_code"" folder.
UCI Automobile Dataset
kaggle.com
Updated Feb 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Otrivedi (2023). UCI Automobile Dataset [Dataset]. https://www.kaggle.com/datasets/otrivedi/automobile-data/suggestions
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 12, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Otrivedi
Description
In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

This dataset consists of data From the 1985 Ward's Automotive Yearbook. Here are the sources

1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook. 2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038 3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037

Number of Instances: 398 Number of Attributes: 9 including the class attribute

Attribute Information:

mpg: continuous cylinders: multi-valued discrete displacement: continuous horsepower: continuous weight: continuous acceleration: continuous model year: multi-valued discrete origin: multi-valued discrete car name: string (unique for each instance)

This data set consists of three types of entities:

I - The specification of an auto in terms of various characteristics

II - Tts assigned an insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with its price. Then, if it is riskier (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".

III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.

The analysis is divided into two parts:

Data Wrangling

Pre-processing data in python

Dealing with missing values

Data formatting

Data normalization

Binning

Exploratory Data Analysis

Descriptive statistics

Groupby

Analysis of variance

Correlation

Correlation stats

Acknowledgment Dataset: UCI Machine Learning Repository Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
feral-cat-segmentation_dataset
kaggle.com
universe.roboflow.com
zip
Updated Mar 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
lu hou yang (2025). feral-cat-segmentation_dataset [Dataset]. https://www.kaggle.com/datasets/luhouyang/feral-cat-segmentation-dataset
Explore at:
zip(971125684 bytes)Available download formats
Dataset updated
Mar 18, 2025
Authors
lu hou yang
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Feral Cat Segmentation Dataset

Overview

This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.

Dataset Source

Original Author: Paul Cashman

Original Source: Roboflow Universe

Extended by: Lu Hou Yang

GitHub: https://github.com/luhouyang/open_circles

License: Public Domain

Dataset Contents

The dataset is organized into three standard splits: - Train set - Validation set - Test set

Each split contains data in multiple formats: 1. Original JPG images 2. Segmentation mask JPG images 3. Parquet files containing flattened image and mask data 4. Pickle files containing serialized image and mask data

Data Formats

1. Image Files

Format: JPG

Resolution: 224×224 pixels

Directory Structure:

train/: Original training images

valid/: Original validation images

test/: Original test images

train_mask/: Corresponding segmentation masks for training

valid_mask/: Corresponding segmentation masks for validation

test_mask/: Corresponding segmentation masks for testing

2. Parquet Files

Files: train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet

Content: Flattened image data and corresponding masks combined in a single table

Structure: Each row contains the flattened pixel values of an image followed by the flattened pixel values of its mask

Data Division: Image and mask data are split at index split_at = image_size[0] * image_size[1] * image_channels

Data before this index: image pixel values (reshaped to [-1, 224, 224, 3])

Data after this index: mask pixel values (reshaped to [-1, 224, 224, 1])

Benefits: Efficient storage and faster loading compared to individual image files

3. Pickle Files

Files: train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl

Content: Serialized Python objects containing images and their corresponding masks

Structure: List of [image, mask] pairs, where each image and mask is serialized using Python's pickle

Data Access: Similar to parquet files, when loaded through the provided dataset class, data is split at the same index: split_at = image_size[0] * image_size[1] * image_channels

Benefits: Preserves original data structure and enables quick loading in Python

4. CSV Files

Files: train_dataset.csv, valid_dataset.csv, test_dataset.csv

Content: Same data as parquet files but in CSV format

Structure: No headers, raw flattened pixel values

Data Division: Same split point as parquet files

Image Preprocessing

All images were preprocessed with the following operations: - Resized to 224×224 pixels using bilinear interpolation - Segmentation masks were also resized to match the images using nearest neighbor interpolation - Original RLE (Run-Length Encoding) segmentation data converted to binary masks

Data Normalization

When used with the provided PyTorch dataset class, images are normalized with: - Mean: [0.48235, 0.45882, 0.40784] - Standard Deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098]

PyTorch Integration

A custom CatDataset class is included for easy integration with PyTorch:

from cat_dataset import CatDataset # Load from parquet format dataset = CatDataset( root="path/to/dataset", split="train", # Options: "train", "valid", "test" format="parquet", # Options: "parquet", "pkl" image_size=[224, 224], image_channels=3, mask_channels=1 ) # Use with PyTorch DataLoader from torch.utils.data import DataLoader dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

Performance Comparison

Loading time benchmarks from the original implementation: - Parquet format: ~1.29 seconds per iteration - Pickle format: ~0.71 seconds per iteration

The pickle format provides the fastest loading times and is recommended for most use cases.

Citation

If you use this dataset in your research or projects, please cite:

@misc{feral-cat-segmentation_dataset, title = {feral-cat-segmentation Dataset}, type = {Open Source Dataset}, author = {Paul Cashman}, howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}}, url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}, journal = {Roboflow Universe}, publisher = {Roboflow}, year = {2025}, month = {mar}, note = {visited on 2025-03-19}, }

Sample Usage Code

Basic Dataset Loading

from ca...
z
Open Context Database SQL Dump
zenodo.org
data-staging.niaid.nih.gov
+2more
zip
Updated Jan 23, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eric Kansa; Eric Kansa; Sarah Whitcher Kansa; Sarah Whitcher Kansa (2025). Open Context Database SQL Dump [Dataset]. http://doi.org/10.5281/zenodo.14728229
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14728229
Dataset updated
Jan 23, 2025
Dataset provided by
Open Context
Authors
Eric Kansa; Eric Kansa; Sarah Whitcher Kansa; Sarah Whitcher Kansa
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Open Context (https://opencontext.org) publishes free and open access research data for archaeology and related disciplines. An open source (but bespoke) Django (Python) application supports these data publishing services. The software repository is here: https://github.com/ekansa/open-context-py

The Open Context team runs ETL (extract, transform, load) workflows to import data contributed by researchers from various source relational databases and spreadsheets. Open Context uses PostgreSQL (https://www.postgresql.org) relational database to manage these imported data in a graph style schema. The Open Context Python application interacts with the PostgreSQL database via the Django Object-Relational-Model (ORM).

This database dump includes all published structured data organized used by Open Context (table names that start with 'oc_all_'). The binary media files referenced by these structured data records are stored elsewhere. Binary media files for some projects, still in preparation, are not yet archived with long term digital repositories.

These data comprehensively reflect the structured data currently published and publicly available on Open Context. Other data (such as user and group information) used to run the Website are not included.

IMPORTANT

This database dump contains data from roughly 190+ different projects. Each project dataset has its own metadata and citation expectations. If you use these data, you must cite each data contributor appropriately, not just this Zenodo archived database dump.
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...
zenodo.org
data.niaid.nih.gov
bin, csv, zip
Updated Dec 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
Explore at:
bin, zip, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6965147
Dataset updated
Dec 24, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

Background

This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

Usage

The data is licensed through the Creative Commons Attribution 4.0 International.

If you have used our data and are publishing your work, we ask that you please reference both:

this database through its DOI, and

any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

Included Files

Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.

Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.

Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data

Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.

We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory similar to the "Clean_Data"

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Clean_Data_v1-0-0.zip: contains all the downsampled data

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Database_References_v1-0-0.bib

Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

File Format: Downsampled Data

These are the "LP_

The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data

Time[s]: time in seconds since the start of the test

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: the surface temperature in degC

These data files can be easily loaded using the pandas library in Python through:

import pandas data = pandas.read_csv(data_file, index_col=0)

The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

File Format: Unreduced Data

These are the "LP_

The first column is the index of each data point

S/No: sample number recorded by the DAQ

System Date: Date and time of sample

Time[s]: time in seconds since the start of the test

C_1_Force[kN]: load cell force

C_1_Déform1[mm]: extensometer displacement

C_1_Déplacement[mm]: cross-head displacement

Eng_Stress[MPa]: engineering stress

Eng_Strain[]: engineering strain

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: specimen surface temperature in degC

The data can be loaded and used similarly to the downsampled data.

File Format: Overall_Summary

The overall summary file provides data on all the test specimens in the database. The columns include:

hidden_index: internal reference ID

grade: material grade

spec: specifications for the material

source: base material for the test specimen

id: internal name for the specimen

lp: load protocol

size: type of specimen (M8, M12, M20)

gage_length_mm_: unreduced section length in mm

avg_reduced_dia_mm_: average measured diameter for the reduced section in mm

avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm

avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm

fy_n_mpa_: nominal yield stress

fu_n_mpa_: nominal ultimate stress

t_a_deg_c_: ambient temperature in degC

date: date of test

investigator: person(s) who conducted the test

location: laboratory where test was conducted

machine: setup used to conduct test

pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control

pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control

pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control

citekey: reference corresponding to the Database_References.bib file

yield_stress_mpa_: computed yield stress in MPa

elastic_modulus_mpa_: computed elastic modulus in MPa

fracture_strain: computed average true strain across the fracture surface

c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass

file: file name of corresponding clean (downsampled) stress-strain data

File Format: Summarized_Mechanical_Props_Campaign

Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')

citekey: reference in "Campaign_References.bib".

Grade: material grade.

Spec.: specifications (e.g., J2+N).

Yield Stress [MPa]: initial yield stress in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Elastic Modulus [MPa]: initial elastic modulus in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Caveats

The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:

A500

A992_Gr50

BCP325

BCR295

HYP400

S460NL

S690QL/25mm

S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
s
Acerola Extract Import Data | Python Logistics Llc Prod
seair.co.in
Updated Mar 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seair Exim (2024). Acerola Extract Import Data | Python Logistics Llc Prod [Dataset]. https://www.seair.co.in
Explore at:
.bin, .xml, .csv, .xlsAvailable download formats
Dataset updated
Mar 7, 2024
Dataset provided by
Seair Info Solutions PVT LTD
Authors
Seair Exim
Area covered
United States
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Z
Data from: Russian Financial Statements Database: A firm-level collection of...
data.niaid.nih.gov
Updated Mar 14, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
Explore at:
Dataset updated
Mar 14, 2025
Dataset provided by
European University at St Petersburg
European University at St. Petersburg
Authors
Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

🔓 First open data set with information on every active firm in Russia.

🗂️ First open financial statements data set that includes non-filing firms.

🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

📅 Covers 2011-2023 initially, will be continuously updated.

🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.

The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

Importing The Data

You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

Python

🤗 Hugging Face Datasets

It is as easy as:

from datasets import load_dataset import polars as pl

This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder

RFSD = load_dataset('irlspbru/RFSD')

Alternatively, this will download ~540MB with all financial statements for 2023# to a Polars DataFrame (requires about 8GB of RAM)

RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.

Local File Import

Importing in Python requires pyarrow package installed.

import pyarrow.dataset as ds import polars as pl

Read RFSD metadata from local file

RFSD = ds.dataset("local/path/to/RFSD")

Use RFSD_dataset.schema to glimpse the data structure and columns' classes

print(RFSD.schema)

Load full dataset into memory

RFSD_full = pl.from_arrow(RFSD.to_table())

Load only 2019 data into memory

RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

Load only revenue for firms in 2019, identified by taxpayer id

RFSD_2019_revenue = pl.from_arrow( RFSD.to_table( filter=ds.field('year') == 2019, columns=['inn', 'line_2110'] ) )

Give suggested descriptive names to variables

renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv') RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

R

Local File Import

Importing in R requires arrow package installed.

library(arrow) library(data.table)

Read RFSD metadata from local file

RFSD <- open_dataset("local/path/to/RFSD")

Use schema() to glimpse into the data structure and column classes

schema(RFSD)

Load full dataset into memory

scanner <- Scanner$create(RFSD) RFSD_full <- as.data.table(scanner$ToTable())

Load only 2019 data into memory

scan_builder <- RFSD$NewScan() scan_builder$Filter(Expression$field_ref("year") == 2019) scanner <- scan_builder$Finish() RFSD_2019 <- as.data.table(scanner$ToTable())

Load only revenue for firms in 2019, identified by taxpayer id

scan_builder <- RFSD$NewScan() scan_builder$Filter(Expression$field_ref("year") == 2019) scan_builder$Project(cols = c("inn", "line_2110")) scanner <- scan_builder$Finish() RFSD_2019_revenue <- as.data.table(scanner$ToTable())

Give suggested descriptive names to variables

renaming_dt <- fread("local/path/to/descriptive_names_dict.csv") setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

Use Cases

🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

FAQ

Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?hat is the data period?

To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

What is the data period?

We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

Why are there no data for firm X in year Y?

Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek — in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and therefore are able source this information elsewhere.

Firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in this year. We remove those filings.

Why is the geolocation of firm X incorrect?

We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

Why is the data for firm X different from https://bo.nalog.ru/?

Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.

Why is the data for firm X unrealistic?

We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

Why is the data for groups of companies different from their IFRS statements?

We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

Why is the data not in CSV?

The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

Version and Update Policy

Version (SemVer): 1.0.0.

We intend to update the RFSD annualy as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous year statements is April, 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between minimization of data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of RFSD available by July.

Licence

Creative Commons License Attribution 4.0 International (CC BY 4.0).

Copyright © the respective contributors.

Citation

Please cite as:

@unpublished{bondarkov2025rfsd, title={{R}ussian {F}inancial {S}tatements {D}atabase}, author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy}, note={arXiv preprint arXiv:2501.05841}, doi={https://doi.org/10.48550/arXiv.2501.05841}, year={2025}}

Acknowledgments and Contacts

Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
Pre-Processed Power Grid Frequency Time Series
zenodo.org
bin, zip
Updated Jul 15, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Johannes Kruse; Johannes Kruse; Benjamin Schäfer; Benjamin Schäfer; Dirk Witthaut; Dirk Witthaut (2021). Pre-Processed Power Grid Frequency Time Series [Dataset]. http://doi.org/10.5281/zenodo.3744121
Explore at:
zip, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3744121
Dataset updated
Jul 15, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Johannes Kruse; Johannes Kruse; Benjamin Schäfer; Benjamin Schäfer; Dirk Witthaut; Dirk Witthaut
Description
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in python. The data covers three synchronous areas of the European power grid:

Continental Europe

Great Britain

Nordic

This work is part of the paper "Predictability of Power Grid Frequency"[1]. Please cite this paper, when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.

Data sources
We downloaded the frequency recordings from publically available repositories of three different Transmission System Operators (TSOs).

Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the Copyright on the data, but allows to re-publish it upon request [3].

Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].

Nordic [6]: We obtained the data from the Finish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

Content of the repository

A) Scripts

In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSO's websites.

In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).

In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).

The python scripts run with Python 3.7 and with the packages found in "requirements.txt".

B) Data_converted and Data_cleansed
The folder "Data_converted" contains the output of "convert_data_format.py" and "Data_cleansed" contains the output of "clean_corrupted_data.py".

File type: The files are zipped csv-files, where each file comprises one year.

Data format: The files contain two columns. The first one represents the time stamps in the format Year-Month-Day Hour-Minute-Second, which is given as naive local time. The second column contains the frequency values in Hz.

NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.

Use cases
We point out that this repository can be used in two different was:

Use pre-processed data: You can directly use the converted or the cleansed data. Note however that both data sets include segments of NaN-values due to missing and corrupted recordings. Only a very small part of the NaN-values were eliminated in the cleansed data to not manipulate the data too much. If your application cannot deal with NaNs, you could build upon the following commands to select the longest interval of valid data from the cleansed data:

from helper_functions import * import pandas as pd cleansed_data = pd.read_csv('/Path_to_cleansed_data/data.zip', index_col=0, header=None, squeeze=True, parse_dates=[0]) valid_bounds, valid_sizes = true_intervals(~cleansed_data.isnull()) start,end= valid_bounds[ np.argmax(valid_sizes) ] data_without_nan = cleansed_data.iloc[start:end]

Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "Data_converted".

License
We release the code in the folder "Scripts" under the MIT license [8]. In the case of Nationalgrid and Fingrid, we further release the pre-processed data in the folder "Data_converted" and "Data_cleansed" under the CC-BY 4.0 license [7]. TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
Digitisation of Weather Records of Seungjeongwon Ilgi: A Historical Weather...
zenodo.org
bin, csv, json, txt
Updated Sep 27, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zeyu Lyu; Zeyu Lyu; Kohei Ichikawa; Kohei Ichikawa; Yongchao Cheng; Yongchao Cheng; Hisashi Hayakawa; Hisashi Hayakawa; Yukiko Kawamoto; Yukiko Kawamoto (2023). Digitisation of Weather Records of Seungjeongwon Ilgi: A Historical Weather Dynamics Dataset of the Korean Peninsula (1623-1910) [Dataset]. http://doi.org/10.5281/zenodo.7453644
Explore at:
csv, json, bin, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7453644
Dataset updated
Sep 27, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Zeyu Lyu; Zeyu Lyu; Kohei Ichikawa; Kohei Ichikawa; Yongchao Cheng; Yongchao Cheng; Hisashi Hayakawa; Hisashi Hayakawa; Yukiko Kawamoto; Yukiko Kawamoto
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Korea
Description
Introduction

This study has exploited the daily weather records of Seungjeongwon Ilgi from the NIKH database. Seungjeongwon Ilgi (http://sjw.history.go.kr/main.do) is a daily record of the Seungjeongwon, the Royal Secretariat of the Joseon Dynasty of Korea. These diaries span from 1623 to 1910 and generally involve daily weather records in the entry header. Their observational site would be located in Seoul (N37°35′, E126°59′). We have exploited the weather records from the NIKH database and classified the daily weather using text mining method. We have also converted the report dates from the traditional lunisolar calendar to the Gregorian calendar, to better contextualise our data into the contemporary daily measurements.

Data

We provide different formats (csv, xlsx, json) to facilitate the usage of data. The main contents of data are listed as below.

ID: The unique identifier of a specific record in the metadata, which can also serve as the identifier to merge with external data in the NIKH digital database.

Traditional calendar: The original lunar dates in the NIKH digital database, which are listed in data format "YYYY-MM-DD". More specifically, "L0" implies the leap year and "L1" implies the common year.

Leap: The identifier of a leap year.

Gregorian calendar: The Gregorian calendar date that converted by the traditional calendar date.

Weather Text: The text that describe the weather conditions. Specifically, multiple weather descriptions of the same day have been put together.

Flag: The computed value that indicates different combinations of weather conditions.

Volume: The volume of text in the original record.

Herbal Volume: The volume of text in the herbal record.

Sunny: A dummy variable that represents whether the weather description contains the expression of sunny.

Cloudy: A dummy variable that represents whether the weather description contains the expression of cloudy.

Rainy: A dummy variable that represents whether the weather description contains the expression of rainy.

Snow: A dummy variable that represents whether the weather description contains the expression of snow.

Wind: A dummy variable that represents whether the weather description contains the expression of wind.

Import Data

# Python # CSV file import pandas as pd data=pd.read_csv('~/SJWilgi_Seoul_Weather_YR1623_1910.csv',encoding="utf-8") # JSON file data=pd.read_json('~/SJWilgi_Seoul_Weather_YR1623_1910.json',encoding="utf-8") # Excel file data=pd.read_excel('~/SJWilgi_Seoul_Weather_YR1623_1910.xlsx') # Excel file

# R # CSV file library(readr) data<- read_csv("~/SJWilgi_Seoul_Weather_YR1623_1910.csv") # Excel file library(readxl) data <- read_excel("~/SJWilgi_Seoul_Weather_YR1623_1910.xlsx")
Vezora/Tested-188k-Python-Alpaca: Functional
kaggle.com
zip
Updated Nov 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Vezora/Tested-188k-Python-Alpaca: Functional [Dataset]. https://www.kaggle.com/datasets/thedevastator/vezora-tested-188k-python-alpaca-functional-pyth/discussion
Explore at:
zip(12200606 bytes)Available download formats
Dataset updated
Nov 30, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Vezora/Tested-188k-Python-Alpaca: Functional Python Code Dataset

188k Functional Python Code Samples

By Vezora (From Huggingface) [source]

About this dataset

The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.

This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.

By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python

How to use the dataset

The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.

Contents of the Dataset

The dataset consists of several columns:

output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.

instruction: It provides information about the task or instruction that each Python code sample is intended to solve.

input: The input parameters or values required to execute each Python code sample.

Exploring the Dataset

To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:

Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.

import pandas as pd # Load the dataset df = pd.read_csv('train.csv')

Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.

# Display column names print(df.columns)

Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.

# Display random samples from 'output' column print(df['output'].sample(5))

Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.

# Count unique instructions and display top ones with highest occurrences instruction_counts = df['instruction'].value_counts() print(instruction_counts.head(10))

Potential Use Cases

The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:

Code Analysis: Analyze the code samples to understand common programming patterns and best practices.

Code Debugging: Use code samples with known outputs to test and debug your own Python programs.

Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.

Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.

Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different

Research Ideas

Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.

Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.

Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
m
Customers order for a Printing Company (2D Bin Packing and Scheduling)
data.mendeley.com
Updated Dec 30, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
mahdi mostajabdaveh (2021). Customers order for a Printing Company (2D Bin Packing and Scheduling) [Dataset]. http://doi.org/10.17632/bxh46tps75.5
Explore at:
Unique identifier
https://doi.org/10.17632/bxh46tps75.5
Dataset updated
Dec 30, 2021
Authors
mahdi mostajabdaveh
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These data belongs to an actual printing company . Each record in Excel file Raw Data/Big_Data present an order from customers. In column "ColorMode" ; 4+0 means the order is one sided and 4+4 means it is two-sided. Files in Instances folder correspond to the instances used for computational tests in the article. Each of these instances has two related file with the same characteristics. One with gdx suffix and one with out any file extension.

Files with gdx suffix can be read by GAMS

Files without suffix are imported by pickle package in Python as objects of class Input (defined in "Input.py" ). You can read the files using the pickle package and Input.py. More information on pickle package at docs.python.org/3/library/pickle

These files are used to import data to the python implementation. The code and relevant description can be found in Read_input.py file.
Z
Geographic Diversity in Public Code Contributions — Replication Package
data.niaid.nih.gov
Updated Mar 31, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Davide Rossi; Stefano Zacchiroli (2022). Geographic Diversity in Public Code Contributions — Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6390354
Explore at:
Dataset updated
Mar 31, 2022
Dataset provided by
University of Bologna, Italy
LTCI, Télécom Paris, Institut Polytechnique de Paris
Authors
Davide Rossi; Stefano Zacchiroli
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Geographic Diversity in Public Code Contributions - Replication Package

This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471

This document comes with the software needed to mine and analyze the data presented in the paper.

Prerequisites

These instructions assume the use of the bash shell, the Python programming language, the PosgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:

click==8.0.4 cycler==0.11.0 fonttools==4.31.2 kiwisolver==1.4.0 matplotlib==3.5.1 numpy==1.22.3 packaging==21.3 pandas==1.4.1 patsy==0.5.2 Pillow==9.0.1 pyparsing==3.0.7 python-dateutil==2.8.2 pytz==2022.1 scipy==1.8.0 six==1.16.0 statsmodels==0.13.2

Initial data

swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.

names.tab - forenames and surnames per country with their frequency

zones.acc.tab - countries/territories, timezones, population and world zones

c_c.tab - ccTDL entities - world zones matches

Data preparation

Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst

sh> ./export.sh

Run the authors cleanup script to create authors--clean.csv.zst

sh> ./cleanup.sh authors.csv.zst

Filter out implausible names and create authors--plausible.csv.zst

sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst

Zone detection by email

Run the email detection script to create author-country-by-email.tab.zst

sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst

Database creation and initial data ingestion

Create the PostgreSQL DB

sh> createdb zones-commit

Notice that from now on when prepending the psql> prompt we assume the execution of psql on the zones-commit database.

Import data into PostgreSQL DB

sh> ./import_data.sh

Zone detection by name

Extract commits data from the DB and create commits.tab, that is used as input for the zone detection script

sh> psql -f extract_commits.sql zones-commit

Run the world zone detection script to create commit_zones.tab.zst

sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst Use ./assign_world_zone.py --help if you are interested in changing the script parameters.

Ingest zones assignment data into the DB

psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''

Extraction and graphs

Run the script to execute the queries to extract the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid. Edit extract_data.sql if you whish to modify extraction parameters (start/end year, sampling, …).

sh> ./extract_data.sh

Run the script to create the graphs from all the previously extracted tabfiles.

sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf
Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...
data.europa.eu
zenodo.org
unknown
Updated Jul 12, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6832242?locale=fr
Explore at:
unknown(642961582)Available download formats
Dataset updated
Jul 12, 2022
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LifeSnaps Dataset Documentation Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction. The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication. Data Import: Reading CSV For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command. Data Import: Setting up a MongoDB (Recommended) To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database. To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here. For the Fitbit data, run the following: mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
SSURGO Portal User Guide
ngda-soils-geoplatform.hub.arcgis.com
arc-gis-hub-home-arcgishub.hub.arcgis.com
Updated Jul 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
USDA NRCS ArcGIS Online (2025). SSURGO Portal User Guide [Dataset]. https://ngda-soils-geoplatform.hub.arcgis.com/datasets/nrcs::ssurgo-portal-user-guide
Explore at:
Dataset updated
Jul 16, 2025
Dataset provided by
United States Department of Agriculturehttp://usda.gov/
Natural Resources Conservation Servicehttp://www.nrcs.usda.gov/
Authors
USDA NRCS ArcGIS Online
Area covered

Description
SSURGO PortalThe newest version of SSURGO Portal with Soil Data Viewer is available via the Quick Start Guide. Install Python to C:\Program Files. This is a different version than what ArcGIS Pro uses.If you need data for multiple states, we also offer a prebuilt large database with all SSURGO for the entire United States and all Islands. The prebuilt saves you time but it’s large and takes a while to download.You can also use the prebuilt gNATSGO GeoPackage database in SSURGO Portal – Soil Data Viewer. Read the ReadMe.txt in the folder. More about gNATSGO here. You can also import STATSGO2 data into SSURGO Portal and create a database to use in Soil Data Viewer – Available for download via the Soils Box folder. SSURGO Portal NotesThis 10 minute video covers it all, other than installation of SSURGO Portal and the GIS tool. Installation is typically smooth and easy.There is also a user guide on the SSURGO Portal website that can be very helpful. It has info about using the data in ArcGIS Pro or QGIS. SQLite SSURGO database be opened and queried with DB Browser. It’s essentially free Microsoft Access.Guidance about setting up DB Browser to easily open SQLite databases is available in section 4 of this Installation Guide.Workflow if you need to make your own databaseInstall SSURGO PortalInstall SSURGO Downloader GIS tool (Refer to the Installation and User Guide for assistance)There is one for QGIS and one for ArcGIS Pro. They both do the same thing. Quickly download California SSURGO data with toolEnter two digit state symbol followed by asterisk in “Search by Areasymbol” to download all data for state.For example, enter CA* to batch download all data for CaliforniaOpen SSURGO Portal and create a new SQLite SSURGO Template database (Refer to the User Guide for assistance)Import SSURGO data you downloaded into databaseYou can import SSURGO data from many states at once, building a database that spans many statesAfter SSURGO data is done importing, click on Soil Data Viewer tab and run ratingsThese are the exact same ratings as Web Soil SurveyA new table is added to your database for each ratingYou can search for ratings by keywordIf desired, open database in GIS and make a map (Refer to the User Guide for assistance)Workflow if you need use large prebuilt database (don’t make own database) Install SSURGO PortalIn SSURGO Portal, browse to unzipped prebuilt GeoPackage database with all SSURGOprebuilt large database with all SSURGOgNATSGO GeoPackage databaseIn SSURGO Portal, click on Soil Data Viewer tab and run ratingsThese are the exact same ratings as Web Soil SurveyA new table is added to your database for each ratingYou can search for ratings by keywordIf desired, open database in GIS and make a mapIf you have trouble installing SSURGO Portal. Its usually the connection with Python. Create Desktop short cut that tells SSURGO Portal which Python to useThese were created for Windows 11 Right click anywhere on your desktop and choose New > ShortcutIn the text bar enter your path to the python.exe and your path to the SSURGO Portal.pyz. Notes:Example of format:"C:\Program Files\Python310\python.exe" "C:\SSURGO Portal\SSURGO_Portal-0.3.0.8.pyz"Include quotation marks.Paths may be different on your machine. To avoid typing, you can browse to python.exe in windows explorer, right click and select "Copy as Path and paste results into box. Paste into short location and then do the same for SSURGO Portal.pyz file, but paste to the right of the python.exe path. Click NextName the shortcut anything you want.
s
Acerola Extract USA Import Data, US Acerola Extract Importers / Buyers List
seair.co.in
Updated Mar 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seair Exim Solutions (2024). Acerola Extract USA Import Data, US Acerola Extract Importers / Buyers List [Dataset]. https://www.seair.co.in/us-import/product-acerola-extract/i-python-logistics-llc-prod/e-delphi-fretes-internacionais-ltda.aspx
Explore at:
.text/.csv/.xml/.xls/.binAvailable download formats
Dataset updated
Mar 7, 2024
Dataset authored and provided by
Seair Exim Solutions
Area covered
United States
Description
View details of Acerola Extract import data and shipment reports in US with product description, price, date, quantity, major us ports, countries and US buyers/importers list, overseas suppliers/exporters list.
Z
Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...
data.niaid.nih.gov
Updated Oct 20, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yfantidou, Sofia; Karagianni, Christina; Efstathiou, Stefanos; Vakali, Athena; Palotti, Joao; Giakatos, Dimitrios Panteleimon; Marchioro, Thomas; Kazlouski, Andrei; Ferrari, Elena; Girdzijauskas, Šarūnas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6826682
Explore at:
Dataset updated
Oct 20, 2022
Dataset provided by
University of Insubria
Foundation for Research and Technology Hellas
Earkick
KTH Royal Institute of Technology
Aristotle University of Thessaloniki
Authors
Yfantidou, Sofia; Karagianni, Christina; Efstathiou, Stefanos; Vakali, Athena; Palotti, Joao; Giakatos, Dimitrios Panteleimon; Marchioro, Thomas; Kazlouski, Andrei; Ferrari, Elena; Girdzijauskas, Šarūnas
Description
LifeSnaps Dataset Documentation

Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

Data Import: Reading CSV

For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.

Data Import: Setting up a MongoDB (Recommended)

To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.

For the Fitbit data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

For the SEMA data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c sema

For surveys data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c surveys

If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

Data Availability

The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

{ _id: id (or user_id): type: data: }

Each document consists of four fields: id (also found as user_id in sema and survey collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document e.g., steps count for a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object are different between different types of data. As mentioned previously, all times are stored in local time, and user IDs are common across different collections. For more information on the available data types, see the related publication.

Surveys Encoding

BREQ2

Why do you engage in exercise?

Code Text engage[SQ001] I exercise because other people say I should engage[SQ002] I feel guilty when I don’t exercise engage[SQ003] I value the benefits of exercise engage[SQ004] I exercise because it’s fun engage[SQ005] I don’t see why I should have to exercise engage[SQ006] I take part in exercise because my friends/family/partner say I should engage[SQ007] I feel ashamed when I miss an exercise session engage[SQ008] It’s important to me to exercise regularly engage[SQ009] I can’t see why I should bother exercising engage[SQ010] I enjoy my exercise sessions engage[SQ011] I exercise because others will not be pleased with me if I don’t engage[SQ012] I don’t see the point in exercising engage[SQ013] I feel like a failure when I haven’t exercised in a while engage[SQ014] I think it is important to make the effort to exercise regularly engage[SQ015] I find exercise a pleasurable activity engage[SQ016] I feel under pressure from my friends/family to exercise engage[SQ017] I get restless if I don’t exercise regularly engage[SQ018] I get pleasure and satisfaction from participating in exercise engage[SQ019] I think exercising is a waste of time

PANAS

Indicate the extent you have felt this way over the past week

P1[SQ001] Interested P1[SQ002] Distressed P1[SQ003] Excited P1[SQ004] Upset P1[SQ005] Strong P1[SQ006] Guilty P1[SQ007] Scared P1[SQ008] Hostile P1[SQ009] Enthusiastic P1[SQ010] Proud P1[SQ011] Irritable P1[SQ012] Alert P1[SQ013] Ashamed P1[SQ014] Inspired P1[SQ015] Nervous P1[SQ016] Determined P1[SQ017] Attentive P1[SQ018] Jittery P1[SQ019] Active P1[SQ020] Afraid

Personality

How Accurately Can You Describe Yourself?

Code Text ipip[SQ001] Am the life of the party. ipip[SQ002] Feel little concern for others. ipip[SQ003] Am always prepared. ipip[SQ004] Get stressed out easily. ipip[SQ005] Have a rich vocabulary. ipip[SQ006] Don't talk a lot. ipip[SQ007] Am interested in people. ipip[SQ008] Leave my belongings around. ipip[SQ009] Am relaxed most of the time. ipip[SQ010] Have difficulty understanding abstract ideas. ipip[SQ011] Feel comfortable around people. ipip[SQ012] Insult people. ipip[SQ013] Pay attention to details. ipip[SQ014] Worry about things. ipip[SQ015] Have a vivid imagination. ipip[SQ016] Keep in the background. ipip[SQ017] Sympathize with others' feelings. ipip[SQ018] Make a mess of things. ipip[SQ019] Seldom feel blue. ipip[SQ020] Am not interested in abstract ideas. ipip[SQ021] Start conversations. ipip[SQ022] Am not interested in other people's problems. ipip[SQ023] Get chores done right away. ipip[SQ024] Am easily disturbed. ipip[SQ025] Have excellent ideas. ipip[SQ026] Have little to say. ipip[SQ027] Have a soft heart. ipip[SQ028] Often forget to put things back in their proper place. ipip[SQ029] Get upset easily. ipip[SQ030] Do not have a good imagination. ipip[SQ031] Talk to a lot of different people at parties. ipip[SQ032] Am not really interested in others. ipip[SQ033] Like order. ipip[SQ034] Change my mood a lot. ipip[SQ035] Am quick to understand things. ipip[SQ036] Don't like to draw attention to myself. ipip[SQ037] Take time out for others. ipip[SQ038] Shirk my duties. ipip[SQ039] Have frequent mood swings. ipip[SQ040] Use difficult words. ipip[SQ041] Don't mind being the centre of attention. ipip[SQ042] Feel others' emotions. ipip[SQ043] Follow a schedule. ipip[SQ044] Get irritated easily. ipip[SQ045] Spend time reflecting on things. ipip[SQ046] Am quiet around strangers. ipip[SQ047] Make people feel at ease. ipip[SQ048] Am exacting in my work. ipip[SQ049] Often feel blue. ipip[SQ050] Am full of ideas.

STAI

Indicate how you feel right now

Code Text STAI[SQ001] I feel calm STAI[SQ002] I feel secure STAI[SQ003] I am tense STAI[SQ004] I feel strained STAI[SQ005] I feel at ease STAI[SQ006] I feel upset STAI[SQ007] I am presently worrying over possible misfortunes STAI[SQ008] I feel satisfied STAI[SQ009] I feel frightened STAI[SQ010] I feel comfortable STAI[SQ011] I feel self-confident STAI[SQ012] I feel nervous STAI[SQ013] I am jittery STAI[SQ014] I feel indecisive STAI[SQ015] I am relaxed STAI[SQ016] I feel content STAI[SQ017] I am worried STAI[SQ018] I feel confused STAI[SQ019] I feel steady STAI[SQ020] I feel pleasant

TTM

Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?

Code Text processes[SQ002] I read articles to learn more about physical

Facebook

Twitter

Click to copy link

Link copied

Cite

Seair Exim, Python Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in

Python Import Data India – Buyers & Importers List

Seair Exim Solutions

Seair Info Solutions PVT LTD

Explore at:

27 scholarly articles cite this dataset (View in Google Scholar)

.bin, .xml, .csv, .xlsAvailable download formats

Dataset provided by

Seair Info Solutions PVT LTD

Authors

Seair Exim

Area covered

India

Description

Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

Clear search

Close search

Google apps

Main menu

Python Import Data India – Buyers & Importers List

Code to import PSCAD data into Python (Spyder)

Python Import Data in February - Seair.co.in

Python Import Data in August - Seair.co.in

Storage and Transit Time Data and Code

Code information

UCI Automobile Dataset

feral-cat-segmentation_dataset

Feral Cat Segmentation Dataset

Overview

Dataset Source

Dataset Contents

Data Formats

1. Image Files

2. Parquet Files

3. Pickle Files

4. CSV Files

Image Preprocessing

Data Normalization

PyTorch Integration

Performance Comparison

Citation

Sample Usage Code

Basic Dataset Loading

Open Context Database SQL Dump

Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

Acerola Extract Import Data | Python Logistics Llc Prod

Data from: Russian Financial Statements Database: A firm-level collection of...

This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder

Alternatively, this will download ~540MB with all financial statements for 2023# to a Polars DataFrame (requires about 8GB of RAM)

Read RFSD metadata from local file

Use RFSD_dataset.schema to glimpse the data structure and columns' classes

Load full dataset into memory

Load only 2019 data into memory

Load only revenue for firms in 2019, identified by taxpayer id

Give suggested descriptive names to variables

Read RFSD metadata from local file

Use schema() to glimpse into the data structure and column classes

Load full dataset into memory

Load only 2019 data into memory

Load only revenue for firms in 2019, identified by taxpayer id

Give suggested descriptive names to variables

Pre-Processed Power Grid Frequency Time Series

Digitisation of Weather Records of Seungjeongwon Ilgi: A Historical Weather...

Vezora/Tested-188k-Python-Alpaca: Functional

Vezora/Tested-188k-Python-Alpaca: Functional Python Code Dataset

188k Functional Python Code Samples

About this dataset

How to use the dataset

Contents of the Dataset

Exploring the Dataset

Potential Use Cases

Research Ideas

Customers order for a Printing Company (2D Bin Packing and Scheduling)

Geographic Diversity in Public Code Contributions — Replication Package

Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

SSURGO Portal User Guide

Acerola Extract USA Import Data, US Acerola Extract Importers / Buyers List

Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

Python Import Data India – Buyers & Importers List

Seair Exim Solutions

Seair Info Solutions PVT LTD