Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Open-Orca Augmented FLAN Collection is a revolutionary dataset that unlocks new levels of language understanding and machine learning model performance. This dataset was created to support research on natural language processing, machine learning models, and language understanding through leveraging the power of reasoning trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences in a more robust way than ever before, this dataset provides researchers expanded opportunities for furthering the progress of linguistics research. With its unique combination of features including system prompts, questions from users and responses from systems, this dataset opens up exciting possibilities for deeper exploration into the cutting edge concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance - explore Open-Orca Augmented FLAN Collection today!
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.
Getting Started The first step is to download the data set from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or cloud storage space. Once you have downloaded the data set, launch your ‘Jupyter Notebook’ or ‘Google Colab’ program with which you want to work with this data set.
Exploring & Preprocessing Data: To get a better understanding of the features in this dataset, import them into Pandas DataFrame as shown below. You can use other libraries as per your need:
import pandas as pd # Library used for importing datasets into Python df = pd.read_csv('train.csv') #Imports train csv file into Pandas};#DataFrame df[['system_prompt','question','response']].head() #Views top 5 rows with columns 'system_prompt','question','response'After importing check each feature using basic descriptive statistics such Pandas groupby statement: We can use groupby statements to have greater clarity over the variables present in each feature(elements). The below command will show counts of each element in System Prompt column present under train CVS file :
df['system prompt'].value_counts().head()#shows count of each element present under 'System Prompt'column Output: User says hello guys 587 <br>System asks How are you?: 555 times<br>User says I am doing good: 487 times <br>..and so onData Transformation: After inspecting & exploring different features one may want/need certain changes that best suits their needs from this dataset before training modeling algorithms on it.
Common transformation steps include : Removing punctuation marks : Since punctuation marks may not add any value to computation operations , we can remove them using regex functions write .replace('[^A-Za -z]+','' ) as
- Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
- Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
- Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...
Facebook
TwitterQuickstart Usage
This dataset can be loaded into python using the Huggingface datasets library. First, install the datasets library via command line: $ pip install datasets
With datasets installed, the user should then import it into their python script / environment:
import datasets
The user can then load the CF-MS_Homo_sapiens_PPI dataset using datasets.load_dataset(...). There are two configurations, or 'views' for the set. The user can choose between them via the name… See the full description on the dataset page: https://huggingface.co/datasets/viridono/CF-MS_Homo_sapiens_PPI.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Vezora (From Huggingface) [source]
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.
This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.
By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.
Contents of the Dataset
The dataset consists of several columns:
- output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
- instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
- input: The input parameters or values required to execute each Python code sample.
Exploring the Dataset
To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:
- Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
import pandas as pd # Load the dataset df = pd.read_csv('train.csv')
- Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
# Display column names print(df.columns)
- Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
# Display random samples from 'output' column print(df['output'].sample(5))
- Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
# Count unique instructions and display top ones with highest occurrences instruction_counts = df['instruction'].value_counts() print(instruction_counts.head(10))Potential Use Cases
The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:
- Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
- Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
- Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
- Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.
Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different
- Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
- Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
- Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website; click here to view.
Dataset layout Python / Matlab versions I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:
python
def unpickle(file):
import cPickle
with open(file, 'rb') as fo:
dict = cPickle.load(fo)
return dict
And a python3 version:
def unpickle(file):
import pickle
with open(file, 'rb') as fo:
dict = pickle.load(fo, encoding='bytes')
return dict
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries: label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc. Binary version The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows: <1 x label><3072 x pixel> ... <1 x label><3072 x pixel> In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
The CIFAR-100 dataset This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Simeonov2008
The Simeonov2008 dataset contains 7,152 compounds in the train set, with high-throughput screening (HTS) results recorded in the "Activity Outcome" column.
Quickstart Usage
Load a dataset in python
Each subset can be loaded into python using the Huggingface datasets library. First, from the command line install the datasets library $ pip install datasets
then, from within python load the datasets library
import datasets
and load the… See the full description on the dataset page: https://huggingface.co/datasets/haneulpark/Simeonov2008.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. FeltonDate: 5/5/2024
This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably in this project.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/annual/multi_year_average/average_annual_turnover.nc" contains a global array summarizing five year (2016-2020) averages of annual transit, storage, canopy transpiration, and number of months of data. This is the core dataset for the analysis; however, each folder has much more data, including a dataset for each year of the analysis. Data are also available is separate .csv files for each land cover type. Oterh data can be found for the minimum, monthly, and seasonal transit time found in their respective folders. These data were produced using the python code found in the "supporting_code" folder given the ease of working with .nc and EASE grid in the xarray python module. R was used primarily for data visualization purposes. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here.
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a particular function:
01_start.R: This script loads the R packages used in the analysis, sets thedirectory, and imports custom functions for the project. You can also load in the main transit time (turnover) datasets here using the source() function.
02_functions.R: This script contains the custom function for this analysis, primarily to work with importing the seasonal transit data. Load this using the source() function in the 01_start.R script.
03_generate_data.R: This script is not necessary to run and is primarilyfor documentation. The main role of this code was to import and wranglethe data needed to calculate ground-based estimates of aboveground water storage.
04_annual_turnover_storage_import.R: This script imports the annual turnover andstorage data for each landcover type. You load in these data from the 01_start.R scriptusing the source() function.
05_minimum_turnover_storage_import.R: This script imports the minimum turnover andstorage data for each landcover type. Minimum is defined as the lowest monthlyestimate.You load in these data from the 01_start.R scriptusing the source() function.
06_figures_tables.R: This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the manuscript_figures folder. Note that allmaps were produced using Python code found in the "supporting_code"" folder.
Facebook
TwitterAttribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
http://www.raspberryturk.com/assets/img/logo.png" alt="Raspberry Turk logo">
This dataset was created as part of the Raspberry Turk project. The Raspberry Turk is a robot that can play chess—it's entirely open source, based on Raspberry Pi, and inspired by the 18th century chess playing machine, the Mechanical Turk. The dataset was used to train models for the vision portion of the project.
http://www.raspberryturk.com/assets/img/rawcapture.png" alt="Raw chessboard image">
In the raw form the dataset contains 312 480x480 images of chessboards with their associated board FENs. Each chessboard contains 30 empty squares, 8 orange pawns, 2 orange knights, 2 orange bishops, 2 orange rooks, 2 orange queens, 1 orange king, 8 green pawns, 2 green knights, 2 green bishops, 2 green rooks, 2 green queens, and 1 green king arranged in different random positions.
The Raspberry Turk source code includes several scripts for converting this raw data to a more usable form.
To get started download the raw.zip file below and then:
$ git clone git@github.com:joeymeyer/raspberryturk.git
$ cd raspberryturk
$ unzip ~/Downloads/raw.zip -d data
$ conda env create -f data/environment.yml
$ source activate raspberryturk
From this point there are two scripts you will need to run. First, convert the raw data to an interim form (individual 60x60 rgb/grayscale images) using process_raw.py like this:
$ python -m raspberryturk.core.data.process_raw data/raw/ data/interim/
This will split the raw images into individual squares and put them in labeled folders inside the interim folder. The final step is to convert the images into a dataset that can be loaded into a numpy array for training/validation. The create_dataset.py utility accomplishes this. The tool takes a number of parameters that can be used to customize the dataset (ex. choose the labels, rgb/grayscale, zca whiten images first, include rotated images, etc). Below is the documentation for create_dataset.py.
$ python -m raspberryturk.core.data.create_dataset --help
usage: raspberryturk/core/data/create_dataset.py [-h] [-g] [-r] [-s SAMPLE]
[-o] [-t TEST_SIZE] [-e] [-z]
base_path
{empty_or_not,white_or_black,color_piece,color_piece_noempty,piece,piece_noempty}
filename
Utility used to create a dataset from processed images.
positional arguments:
base_path Base path for data processing.
{empty_or_not,white_or_black,color_piece,color_piece_noempty,piece,piece_noempty}
Encoding function to use for piece classification. See
class_encoding.py for possible values.
filename Output filename for dataset. Should be .npz
optional arguments:
-h, --help show this help message and exit
-g, --grayscale Dataset should use grayscale images.
-r, --rotation Dataset should use rotated images.
-s SAMPLE, --sample SAMPLE
Dataset should be created by only a sample of images.
Must be value between 0 and 1.
-o, --one_hot Dataset should use one hot encoding for labels.
-t TEST_SIZE, --test_size TEST_SIZE
Test set partition size. Must be value between 0 and
1.
-e, --equalize_classes
Equalize class distributions.
-z, --zca ZCA whiten dataset.
Example of how it can be used:
$ python -m raspberryturk.core.data.create_dataset data/interim/ promotable_piece data/processed/example_dataset.npz --rotation --grayscale --one_hot --sample=0.3 --zca
Finally, the dataset is created and can be easily loaded into Python either using raspberryturk.core.data.dataset.Dataset or simply np.load.
In [1]: from raspberryturk.core.data.dataset import Dataset
In [2]: d = Dataset.load_file('data/processed/example_dataset.npz')
or
In [1]: with open('data/processed/example_dataset.npz', 'r') as f:
: data = np.load(f)
Visit the data collection page of the Raspberry Turk website for more details.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used in the various stage two experiments in: "Comparing Clustering Approaches for Smart Meter Time Series: Investigating the Influence of Dataset Properties on Performance". This includes datasets with varied characteristics.All datasets are stored in a dict with tuples of (time series array, class labels). To access data in python:import picklefilename = "dataset.txt"with open(filename, 'rb') as f: data = pickle.load(f)
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed for Fake News Classification using NLP & Machine Learning. It contains labeled fake and real news articles, sourced from credible datasets. It is optimized for text analysis, deep learning models, and AI research.
1️⃣ Load the dataset in Python using Pandas
```python
import pandas as pd
df = pd.read_csv("fake_news_data.csv")
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used in the stage one 1NN classification experiment in: "Comparing Clustering Approaches for Smart Meter Time Series: Investigating the Influence of Dataset Properties on Performance"All datasets are stored in a dict with tuples of (time series array, class labels). To access data in python:import picklefilename = "dataset.txt"with open(filename, 'rb') as f: data = pickle.load(f)
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset import polars as pl
RFSD = load_dataset('irlspbru/RFSD')
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.
Local File Import
Importing in Python requires pyarrow package installed.
import pyarrow.dataset as ds import polars as pl
RFSD = ds.dataset("local/path/to/RFSD")
print(RFSD.schema)
RFSD_full = pl.from_arrow(RFSD.to_table())
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))
RFSD_2019_revenue = pl.from_arrow( RFSD.to_table( filter=ds.field('year') == 2019, columns=['inn', 'line_2110'] ) )
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv') RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing in R requires arrow package installed.
library(arrow) library(data.table)
RFSD <- open_dataset("local/path/to/RFSD")
schema(RFSD)
scanner <- Scanner$create(RFSD) RFSD_full <- as.data.table(scanner$ToTable())
scan_builder <- RFSD$NewScan() scan_builder$Filter(Expression$field_ref("year") == 2019) scanner <- scan_builder$Finish() RFSD_2019 <- as.data.table(scanner$ToTable())
scan_builder <- RFSD$NewScan() scan_builder$Filter(Expression$field_ref("year") == 2019) scan_builder$Project(cols = c("inn", "line_2110")) scanner <- scan_builder$Finish() RFSD_2019_revenue <- as.data.table(scanner$ToTable())
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv") setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?hat is the data period?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek — in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and therefore are able source this information elsewhere.
Firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in this year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annualy as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous year statements is April, 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between minimization of data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd, title={{R}ussian {F}inancial {S}tatements {D}atabase}, author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy}, note={arXiv preprint arXiv:2501.05841}, doi={https://doi.org/10.48550/arXiv.2501.05841}, year={2025}}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
Facebook
TwitterTo use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('spoc_robot', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more informations on tensorflow_datasets.
Facebook
TwitterData model and generic query templates for translating and integrating a set of related CSV event logs into a single event graph for as used in https://dx.doi.org/10.1007/s13740-021-00122-1
Provides input data for 5 datasets (BPIC14, BPIC15, BPIC16, BPIC17, BPIC19)
Provides Python scripts to prepare and import each dataset into a Neo4j database instance through Cypher queries, representing behavioral information not globally (as in an event log), but locally per entity and per relation between entities.
Provides Python scripts to retrieve event data from a Neo4j database instance and render it using Graphviz dot.
The data model and queries are described in detail in: Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases (2020) https://arxiv.org/abs/2005.14552 and https://dx.doi.org/10.1007/s13740-021-00122-1
Fork the query code from Github: https://github.com/multi-dimensional-process-mining/graphdb-eventlogs
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report. Annotation The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow library. The split into train, validation and test set follows the split of the original datasets. Installation
pip install pandas pyarrow Example
import pandas as pddf = pd.read_parquet('annotation_train.parquet', engine='pyarrow')print(df.iloc[0])
dataset AudioSet filename train/---2_BBVHAA.mp3 captions_visual [a man in a black hat and glasses.] captions_auditory [a man speaks and dishes clank.] tags [Speech] Description The annotation file consists of the following fields:filename: Name of the corresponding file (video or audio file)dataset: Source dataset associated with the data pointcaptions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual contentcaptions_auditory: A list of captions related to the auditory content of the videotags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided Data files The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Attentive Skin
To Predict Skin Corrosion/Irritation Potentials of Chemicals via Explainable Machine Learning Methods Download: https://github.com/BeeBeeWong/AttentiveSkin/releases/tag/v1.0
Quickstart Usage
Load a dataset in python
Each subset can be loaded into python using the Huggingface datasets library. First, from the command line install the datasets library $ pip install datasets
then, from within python load the datasets library
import datasets… See the full description on the dataset page: https://huggingface.co/datasets/maomlab/AttentiveSkin.
Facebook
TwitterThe datasets contain images of surfaces with structures generated by simplex noise. The initial systems consist of two surfaces of \(\alpha\)-quartz (SiO\(_2\)), where the simplex noise is carved out on the lower surface. The systems are created using molecular-builder . To generate the dataset, we apply Ovito for rendering the images, and transform the images to binary images using Python. The relevant Python scripts can be found at my github repo. All images have shape (1, 128, 128), e.g. one channel.
Labeled dataset
We label the different surfaces by the friction response. When labeling, we measure the slope of the static friction, the maximum static friction and the dynamic friction. This is done using molecular dynamics simulations in LAMMPS. Therefore, each labeled surface is associated with three targets.
The labeled dataset is named simplex_friction_128x128.pbz2, and contains 1800 surfaces labeled with the friction response. The surfaces were generated on a grid with scales 1-9 and octaves numbers 1-4, 50 surfaces in each category. To unzip and read the dataset in a Python script, use:
import bz2
import _pickle as cPickle
with bz2.BZ2File('simplex_friction_128x128.pbz2', 'rb') as f:
dataset = cPickle.load(f)
inputs = dataset['x']
labels = dataset['y']
where inputs and labels are Python lists.
Unlabeled dataset
The unlabeled dataset is stored as unlabeled_simplex_friction_128x128.pbz2. It contains 180,000 surfaces generated on a grid with scales 1-9 and octave numbers 1-4, with 5000 surfaces in each category. To unzip and read, use:
import bz2
import _pickle as cPickle
with bz2.BZ2File('unlabeled_simplex_friction_128x128.pbz2', 'rb') as f:
dataset = cPickle.load(f)
inputs = dataset['x']
attr = dataset['y']
where inputs and attr are Python lists. attr contains information about the simplex noise (scale, octaves, seed).
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Information
This dataset presents long-term term indoor solar harvesting traces and jointly monitored with the ambient conditions. The data is recorded at 6 indoor positions with diverse characteristics at our institute at ETH Zurich in Zurich, Switzerland.
The data is collected with a measurement platform [3] consisting of a solar panel (AM-5412) connected to a bq25505 energy harvesting chip that stores the harvested energy in a virtual battery circuit. Two TSL45315 light sensors placed on opposite sides of the solar panel monitor the illuminance level and a BME280 sensor logs ambient conditions like temperature, humidity and air pressure.
The dataset contains the measurement of the energy flow at the input and the output of the bq25505 harvesting circuit, as well as the illuminance, temperature, humidity and air pressure measurements of the ambient sensors. The following timestamped data columns are available in the raw measurement format, as well as preprocessed and filtered HDF5 datasets:
V_in - Converter input/solar panel output voltage, in volt
I_in - Converter input/solar panel output current, in ampere
V_bat - Battery voltage (emulated through circuit), in volt
I_bat - Net Battery current, in/out flowing current, in ampere
Ev_left - Illuminance left of solar panel, in lux
Ev_right - Illuminance left of solar panel, in lux
P_amb - Ambient air pressure, in pascal
RH_amb - Ambient relative humidity, unit-less between 0 and 1
T_amb - Ambient temperature, in centigrade Celsius
The following publication presents and overview of the dataset and more details on the deployment used for data collection. A copy of the abstract is included in this dataset, see the file abstract.pdf.
L. Sigrist, A. Gomez, and L. Thiele. "Dataset: Tracing Indoor Solar Harvesting." In Proceedings of the 2nd Workshop on Data Acquisition To Analysis (DATA '19), 2019.
Folder Structure and Files
processed/ - This folder holds the imported, merged and filtered datasets of the power and sensor measurements. The datasets are stored in HDF5 format and split by measurement position posXX and and power and ambient sensor measurements. The files belonging to this folder are contained in archives named yyyy_mm_processed.tar, where yyyy and mm represent the year and month the data was published. A separate file lists the exact content of each archive (see below).
raw/ - This folder holds the raw measurement files recorded with the RocketLogger [1, 2] and using the measurement platform available at [3]. The files belonging to this folder are contained in archives named yyyy_mm_raw.tar, where yyyy and mmrepresent the year and month the data was published. A separate file lists the exact content of each archive (see below).
LICENSE - License information for the dataset.
README.md - The README file containing this information.
abstract.pdf - A copy of the above mentioned abstract submitted to the DATA '19 Workshop, introducing this dataset and the deployment used to collect it.
raw_import.ipynb [open in nbviewer] - Jupyter Python notebook to import, merge, and filter the raw dataset from the raw/ folder. This is the exact code used to generate the processed dataset and store it in the HDF5 format in the processed/folder.
raw_preview.ipynb [open in nbviewer] - This Jupyter Python notebook imports the raw dataset directly and plots a preview of the full power trace for all measurement positions.
processing_python.ipynb [open in nbviewer] - Jupyter Python notebook demonstrating the import and use of the processed dataset in Python. Calculates column-wise statistics, includes more detailed power plots and the simple energy predictor performance comparison included in the abstract.
processing_r.ipynb [open in nbviewer] - Jupyter R notebook demonstrating the import and use of the processed dataset in R. Calculates column-wise statistics and extracts and plots the energy harvesting conversion efficiency included in the abstract. Furthermore, the harvested power is analyzed as a function of the ambient light level.
Dataset File Lists
Processed Dataset Files
The list of the processed datasets included in the yyyy_mm_processed.tar archive is provided in yyyy_mm_processed.files.md. The markdown formatted table lists the name of all files, their size in bytes, as well as the SHA-256 sums.
Raw Dataset Files
A list of the raw measurement files included in the yyyy_mm_raw.tar archive(s) is provided in yyyy_mm_raw.files.md. The markdown formatted table lists the name of all files, their size in bytes, as well as the SHA-256 sums.
Dataset Revisions
v1.0 (2019-08-03)
Initial release. Includes the data collected from 2017-07-27 to 2019-08-01. The dataset archive files related to this revision are 2019_08_raw.tar and 2019_08_processed.tar. For position pos06, the measurements from 2018-01-06 00:00:00 to 2018-01-10 00:00:00 are filtered (data inconsistency in file indoor1_p27.rld).
v1.1 (2019-09-09)
Revision of the processed dataset v1.0 and addition of the final dataset abstract. Updated processing scripts reduce the timestamp drift in the processed dataset, the archive 2019_08_processed.tar has been replaced. For position pos06, the measurements from 2018-01-06 16:00:00 to 2018-01-10 00:00:00 are filtered (indoor1_p27.rld data inconsistency).
v2.0 (2020-03-20)
Addition of new data. Includes the raw data collected from 2019-08-01 to 2019-03-16. The processed data is updated with full coverage from 2017-07-27 to 2019-03-16. The dataset archive files related to this revision are 2020_03_raw.tar and 2020_03_processed.tar.
Dataset Authors, Copyright and License
Authors: Lukas Sigrist, Andres Gomez, and Lothar Thiele
Contact: Lukas Sigrist (lukas.sigrist@tik.ee.ethz.ch)
Copyright: (c) 2017-2019, ETH Zurich, Computer Engineering Group
License: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)
References
[1] L. Sigrist, A. Gomez, R. Lim, S. Lippuner, M. Leubin, and L. Thiele. Measurement and validation of energy harvesting IoT devices. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.
[2] ETH Zurich, Computer Engineering Group. RocketLogger Project Website, https://rocketlogger.ethz.ch/.
[3] L. Sigrist. Solar Harvesting and Ambient Tracing Platform, 2019. https://gitlab.ethz.ch/tec/public/employees/sigristl/harvesting_tracing
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Summary
A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
30 completely labeled (segmented) images
71 partly labeled images
altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)
To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
A set of metrics and a novel ranking score for respective meaningful method benchmarking
An evaluation of three baseline methods in terms of the above metrics and score
Abstract
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
Dataset documentation:
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
FISBe Datasheet
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Files
fisbe_v1.0_{completely,partly}.zip
contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
fisbe_v1.0_mips.zip
maximum intensity projections of all samples, for convenience.
sample_list_per_split.txt
a simple list of all samples and the subset they are in, for convenience.
view_data.py
a simple python script to visualize samples, see below for more information on how to use it.
dim_neurons_val_and_test_sets.json
a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
Readme.md
general information
How to work with the image files
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.For each image, we provide a pixel-wise instance segmentation for all separable neurons.Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification.").The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.The segmentation mask for each neuron is stored in a separate channel.The order of dimensions is CZYX.
We recommend to work in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9conda activate flylight-env
How to open zarr files
Install the python zarr package:
pip install zarr
Opened a zarr file with:
import zarrraw = zarr.open(, mode='r', path="volumes/raw")seg = zarr.open(, mode='r', path="volumes/gt_instances")
Zarr arrays are read lazily on-demand.Many functions that expect numpy arrays also work with zarr arrays.Optionally, the arrays can also explicitly be converted to numpy arrays.
How to view zarr image files
We recommend to use napari to view the image data.
Install napari:
pip install "napari[all]"
Save the following Python script:
import zarr, sys, napari
raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")
viewer = napari.Viewer(ndisplay=3)for idx, gt in enumerate(gts): viewer.add_labels( gt, rendering='translucent', blending='additive', name=f'gt_{idx}')viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')napari.run()
Execute:
python view_data.py /R9F03-20181030_62_B5.zarr
Metrics
S: Average of avF1 and C
avF1: Average F1 Score
C: Average ground truth coverage
clDice_TP: Average true positives clDice
FS: Number of false splits
FM: Number of false merges
tp: Relative number of true positives
For more information on our selected metrics and formal definitions please see our paper.
Baseline
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al..For detailed information on the methods and the quantitative results please see our paper.
License
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Citation
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe, title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures}, author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller}, year = 2024, eprint = {2404.00130}, archivePrefix ={arXiv}, primaryClass = {cs.CV} }
Acknowledgments
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuablediscussions.P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.This work was co-funded by Helmholtz Imaging.
Changelog
There have been no changes to the dataset so far.All future change will be listed on the changelog page.
Contributing
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.
All contributions are welcome!
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is related to sklearn library in python.
we have 1796 sample image.
classes of data = 0 1 2 3 ... 7 8 9
image size = 64 -> (8,8)
you can import this datasets from :
from sklearn.datasets import load_digits dataset = load_digits() x = datasets.data y = datasets.target
Facebook
TwitterOpen Images is a dataset of ~9M images that have been annotated with image-level labels and object bounding boxes.
The training set of V4 contains 14.6M bounding boxes for 600 object classes on 1.74M images, making it the largest existing dataset with object location annotations. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8.4 per image on average). Moreover, the dataset is annotated with image-level labels spanning thousands of classes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('open_images_v4', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more informations on tensorflow_datasets.
https://storage.googleapis.com/tfds-data/visualization/fig/open_images_v4-original-2.0.0.png" alt="Visualization" width="500px">
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Open-Orca Augmented FLAN Collection is a revolutionary dataset that unlocks new levels of language understanding and machine learning model performance. This dataset was created to support research on natural language processing, machine learning models, and language understanding through leveraging the power of reasoning trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences in a more robust way than ever before, this dataset provides researchers expanded opportunities for furthering the progress of linguistics research. With its unique combination of features including system prompts, questions from users and responses from systems, this dataset opens up exciting possibilities for deeper exploration into the cutting edge concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance - explore Open-Orca Augmented FLAN Collection today!
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.
Getting Started The first step is to download the data set from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or cloud storage space. Once you have downloaded the data set, launch your ‘Jupyter Notebook’ or ‘Google Colab’ program with which you want to work with this data set.
Exploring & Preprocessing Data: To get a better understanding of the features in this dataset, import them into Pandas DataFrame as shown below. You can use other libraries as per your need:
import pandas as pd # Library used for importing datasets into Python df = pd.read_csv('train.csv') #Imports train csv file into Pandas};#DataFrame df[['system_prompt','question','response']].head() #Views top 5 rows with columns 'system_prompt','question','response'After importing check each feature using basic descriptive statistics such Pandas groupby statement: We can use groupby statements to have greater clarity over the variables present in each feature(elements). The below command will show counts of each element in System Prompt column present under train CVS file :
df['system prompt'].value_counts().head()#shows count of each element present under 'System Prompt'column Output: User says hello guys 587 <br>System asks How are you?: 555 times<br>User says I am doing good: 487 times <br>..and so onData Transformation: After inspecting & exploring different features one may want/need certain changes that best suits their needs from this dataset before training modeling algorithms on it.
Common transformation steps include : Removing punctuation marks : Since punctuation marks may not add any value to computation operations , we can remove them using regex functions write .replace('[^A-Za -z]+','' ) as
- Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
- Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
- Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...