This is a single numpy array with all the Optiver data joined together. It also has some of the features from this notebook. It's designed to be memory-mapped so that you can read small pieces at a time.
This is one big array with the trade and book data joined together, plus some pre-computed features. The dtype of the array is fp16. The array's shape is (n_times, n_stocks, 600, 27), where 600 is the max seconds_in_bucket and 27 is the number of columns.
Add the dataset to your notebook and then:
import numpy as np
ntimeids = 3830
nstocks = 112
ncolumns = 27
nseq = 600
arr = np.memmap('../input/optiver-precomputed-features-numpy-array/data.array',
                mode='r', dtype=np.float16,
                shape=(ntimeids, nstocks, nseq, ncolumns))
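Since the array is memory-mapped, slicing it reads only the touched bytes from disk (the full array is roughly 14 GB in fp16, i.e. 3830 × 112 × 600 × 27 × 2 bytes, but a single slice is tiny). A minimal sketch, with arbitrary indices:
# Only the data for this (time, stock) pair is read from disk.
snippet = arr[0, 0]                    # shape (600, 27)
first_rows = np.asarray(snippet[:5])   # materialize the first five rows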
There are gaps in the stock ids and time ids, which doesn't work well with an array format, so we have time and stock indexes as well (_ix suffix instead of _id). To calculate these:
import numpy as np
import pandas as pd
targets = pd.read_csv('/kaggle/input/optiver-realized-volatility-prediction/train.csv')
ntimeids = targets.time_id.nunique()
stock_ids = sorted(targets.stock_id.unique())
timeids = sorted(targets.time_id.unique())
timeid_to_ix = {time_id: i for i, time_id in enumerate(timeids)}
stock_id_to_ix = {stock_id: i for i, stock_id in enumerate(stock_ids)}
So, to get the data for stock_id 13 at time_id 146, you'd do:
stock_ix = stock_id_to_ix[13]
time_ix = timeid_to_ix[146]
arr[time_ix, stock_ix]
Notice that the third dimension is of size 600 (the max number of points for a given time_ix, stock_ix); some of these rows will be empty.
To truncate a single stock's data to its filled rows, do:
# The last column (target) is non-zero for filled rows; the cumulative
# sum peaks at the count of such rows, assuming filled rows come first.
max_seq_ix = (arr[time_ix, stock_ix, :, -1] > 0).cumsum().max()
arr[time_ix, stock_ix, :max_seq_ix]
There are 27 columns in the last dimension; these are:
['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1', 'bid_price2', 'ask_price2', 'bid_size1', 'ask_size1', 'bid_size2', 'ask_size2', 'stock_id', 'wap1', 'wap2', 'log_return1', 'log_return2', 'wap_balance', 'price_spread', 'bid_spread', 'ask_spread', 'total_volume', 'volume_imbalance', 'price', 'size', 'order_count', 'stock_id_y', 'log_return_trade', 'target']
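To address columns by name instead of position, you can build a lookup from this list; a minimal sketch (`columns` simply repeats the names above, and `col_to_ix` is our own helper, not part of the dataset):
columns = ['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1',
           'bid_price2', 'ask_price2', 'bid_size1', 'ask_size1',
           'bid_size2', 'ask_size2', 'stock_id', 'wap1', 'wap2',
           'log_return1', 'log_return2', 'wap_balance', 'price_spread',
           'bid_spread', 'ask_spread', 'total_volume', 'volume_imbalance',
           'price', 'size', 'order_count', 'stock_id_y',
           'log_return_trade', 'target']
col_to_ix = {name: i for i, name in enumerate(columns)}
# e.g. the wap1 series for one (time, stock) pair:
wap1 = arr[time_ix, stock_ix, :max_seq_ix, col_to_ix['wap1']]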
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under a CC BY 4.0 license.
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays, based on an open-source specification).
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
pip install zarr
import zarr

# Replace "sample.zarr" with the path to one of the dataset's zarr files.
raw = zarr.open("sample.zarr", mode='r', path="volumes/raw")
seg = zarr.open("sample.zarr", mode='r', path="volumes/gt_instances")
# optional:
import numpy as np
raw_np = np.array(raw)
Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also be explicitly converted to numpy arrays.
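Since reads are lazy, you can also pull just a sub-volume without materializing the whole image; a small sketch continuing from the snippet above (the slice bounds are arbitrary, dimensions are CZYX):
# Only the chunks overlapping this selection are read from disk;
# indexing a zarr array returns a plain numpy array.
crop = raw[0, 100:200, 100:200, 100:200]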
We recommend using napari to view the image data.
pip install "napari[all]"
import zarr, sys, napari

# Load raw image and instance segmentation from the zarr file
# given on the command line.
raw = zarr.load(sys.argv[1], path="volumes/raw")
gts = zarr.load(sys.argv[1], path="volumes/gt_instances")
viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(
        gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
python view_data.py <path-to-zarr-file>
For more information on our selected metrics and formal definitions please see our paper.
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al.
For detailed information on the methods and the quantitative results please see our paper.
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
title = {FISBe: A real-world benchmark dataset for instance
segmentation of long-range thin filamentous structures},
author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
year = 2024,
eprint = {2404.00130},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
We thank Aljoscha Nern for providing unpublished MCFO images, as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.
There have been no changes to the dataset so far.
All future changes will be listed on the changelog page.
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying GitHub repository.
All contributions are welcome!
The Durban Large Firm Survey is the result of an agreement between the Durban Unicity Council and USAID in 2000 to fund a World Bank technically-supported survey of firms in the greater Durban region. The survey, carried out from May 2002 to April 2003, involved 22 fieldworkers, managed by the Bureau of Market Research (BMR) at Unisa.
Units of analysis for the survey were enterprises.
The survey covered large manufacturing firms in the Durban metropolitan area of South Africa.
Sample survey data [ssd]
To assist fieldworkers in selecting sample units (firms), the World Bank provided the Bureau of Market Research (BMR) at Unisa with various sample frames (listings of firm names). These sample frames were used to contact firms randomly by telephone in order to set up appointments with the managing directors, managers or owners of the firms. Because no sample frame is comprehensive enough to include all firms operating in the Durban Metropolitan Area, various sample frames had to be used. The study was constructed to stratify industry by type, employment size group and geographic area. The sample frames had some limitations in this regard: they lacked employment size group classifications, they covered a limited number of firms in certain sectors, and they had geographic location problems. In some cases the information on the firms was outdated.
Face-to-face [f2f]
The eight questionnaires for the survey covered the following topics:
Questionnaire 1: General issues (completed by MD/CEO): general business information, economic policy environment, government's role in investment promotion and local economic development, corporate governance and ownership structure, parent company, corporate finance, interest rates, exports, imports and exchange rates, company tax rates, PDI (previously disadvantaged individual) participation in company ownership, future expectations, mergers/acquisitions
Questionnaire 2: Production-related issues (completed by production manager): products manufactured, choice of location, capacity utilization, capital assets
Questionnaire 3: Financial-related issues (completed by financial manager)
Questionnaire 4: Purchase-related issues (completed by the purchasing manager): business relations, purchase of raw materials, import duties and tariffs, import of raw materials
Questionnaire 5: Sales/marketing-related issues (completed by sales/marketing manager): business relations, membership of business associations, sources of information, sale of products, exports, business operations and profitability, government contracts
Questionnaire 6: Human resource-related issues (completed by the HR manager): recruitment, training, labour relations, labour market regulation, employment size and patterns, employees' transport infrastructure, labour cost and impact of HIV/AIDS
Questionnaire 7: Administrative/legal-related issues (completed by the administrative manager): start-up licenses and permits, licensing and permits required to continue operations
Questionnaire 8: Port-related issues (completed by the shipping/export/marketing manager)
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This version of the data set is deprecated. The simulation model used to generate the data set has an erroneous boundary condition (see lines 53 to 55 in simulation.xml), which has been removed in newer versions.
This dataset contains the results of 276282 finite-element simulations of the complex, frequency-dependent electrical impedance of a piezoelectric ring with randomised material parameters. Each impedance consists of 2000 samples in the frequency domain up to 8 MHz. We assume the sample to be dielectric, thus the impedance at a frequency of 0 Hz is infinite. The piezoelectric ring has an outer radius of 6.35 mm, an inner radius of 2.6 mm and a thickness of 1 mm. The transverse isotropic material parameters are sampled from independent uniform distributions with ranges that are intended to represent the behaviour of different piezoceramic materials. The parameters of the Rayleigh damping model (alpha_M and alpha_K) are sampled from a logarithmic distribution to account for their larger parameter ranges.
Parameter | Min | Max | Unit | Description |
c11 | 120 | 165 | GPa | Elastic stiffness |
c12 | 70 | 150 | GPa | Elastic stiffness |
c13 | 65 | 95 | GPa | Elastic stiffness |
c33 | 110 | 140 | GPa | Elastic stiffness |
c44 | 18 | 30 | GPa | Elastic stiffness |
eps11 | 3 | 12 | nF/m | Dielectric permittivity |
eps33 | 4 | 8 | nF/m | Dielectric permittivity |
e15 | 8 | 18 | C/m^2 | Piezoelectric coupling |
e31 | 3 | 7.5 | C/m^2 | Piezoelectric coupling |
e33 | 12 | 20 | C/m^2 | Piezoelectric coupling |
alpha_M | 2 | 150 | 1/ms | Mass-proportional damping |
alpha_K | 10 | 800 | ps | Stiffness-proportional damping |
density | 7600 | 7850 | kg/m^3 | Density |
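As an illustration of the sampling scheme described above, the sketch below draws one parameter set with NumPy; the uniform/log-uniform split and the bounds follow the table, but the code itself is our assumption, not the published generation script:
import numpy as np

rng = np.random.default_rng()

# Uniform draws for elastic, dielectric and coupling parameters
# (bounds from the table above, converted to SI units).
c11 = rng.uniform(120e9, 165e9)      # Pa
eps11 = rng.uniform(3e-9, 12e-9)     # F/m
e15 = rng.uniform(8.0, 18.0)         # C/m^2

# alpha_M and alpha_K are drawn log-uniformly to cover their wide ranges.
alpha_M = np.exp(rng.uniform(np.log(2e3), np.log(150e3)))        # 1/s
alpha_K = np.exp(rng.uniform(np.log(10e-12), np.log(800e-12)))   # s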
The dataset contains the following files:
The dataset is stored as an HDF5 file, which can be opened with any library that supports that format, e.g. in Python using the h5py library:
import h5py
# Open the dataset in read mode.
file = h5py.File("dataset.hdf5", "r")
# Impedances as a 276282 x 2000 array of complex numbers.
impedances = file["impedances"]
# Material parameter values as a 276282 x 13 array of real numbers.
parameters = file["parameters"]
# Frequency vector of the impedance with length 2000.
frequencies = file["meta"]["frequencies"]
# 13 strings with the identifiers of the material parameters.
parameter_labels = file["meta"]["parameter_labels"]
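Note that indexing like file["impedances"] returns an h5py dataset object, which reads from disk only when sliced; a small usage sketch (the indices are arbitrary):
# Load only the first ten impedance curves into memory.
first_ten = impedances[:10]    # numpy array of shape (10, 2000)

# The frequency vector is small, so loading it fully is cheap.
freqs = frequencies[:]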
To generate a result for the electrical impedance using the supplied simulation files, download and install openCFS and call the executable with simulation.xml, omitting the file extension, e.g.:
cfsbin.exe simulation
The path to the executable of openCFS will depend on your operating system and installation. Running the simulation will create several files and folders. Among those files will be the result for the electric charge on one of the electrodes of the sample, placed in the 'history' subfolder. We can determine the current by taking the time derivative of the charge, and we already know the voltage because we excited the piezoceramic with an electric potential of 1 V in the simulation. Because the simulation is conducted in the frequency domain, all we have to do is divide the voltage by the current to get the frequency-dependent electrical impedance. The loading of the result and the calculation of the impedance are implemented in the following Python script as an example:
import numpy as np
# Load result file for electric charge
result_path = 'history/simulation-elecCharge-surfRegion-ground.hist'
data = np.loadtxt(result_path)
frequency = data[:, 0]
# Convert polar representation from file to complex numbers.
charge = data[:, 1] * np.exp(1j * 2 * np.pi / 360 * data[:, 2])
# Excitation potential is 1 V in simulation.
potential = 1
# Determine impedance by applying Z = V / I = V / (j omega Q).
impedance = potential / (1j * 2 * np.pi * frequency * charge)
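To sanity-check the result, you could plot the magnitude of the computed impedance over frequency; a minimal sketch using matplotlib (the plotting is our addition, not part of the supplied scripts):
import matplotlib.pyplot as plt

# Resonances show up as sharp dips and peaks in the impedance magnitude.
plt.loglog(frequency, np.abs(impedance))
plt.xlabel("Frequency in Hz")
plt.ylabel("|Z| in Ohm")
plt.show()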