4 datasets found
  1. Optiver Precomputed Features Numpy Array

    • kaggle.com
    Updated Aug 14, 2021
    Cite
    Tal Perry (2021). Optiver Precomputed Features Numpy Array [Dataset]. https://www.kaggle.com/lighttag/optiver-precomputed-features-numpy-array
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 14, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tal Perry
    Description

    What's In This

    This is a single numpy array with all the Optiver data joined together. It also includes some of the features from this notebook. It's designed to be memory-mapped so that you can read small pieces at a time.

    This is one big array with the trade and book data joined together, plus some pre-computed features. The dtype of the array is fp16. The array's shape is (n_times, n_stocks, 600, 27), where 600 is the maximum seconds_in_bucket and 27 is the number of columns.

    How To Use It

    Add the dataset to your notebook and then:

    import numpy as np

    ntimeids = 3830
    nstocks = 112
    ncolumns = 27
    nseq = 600
    arr = np.memmap('../input/optiver-precomputed-features-numpy-array/data.array',
                    mode='r', dtype=np.float16,
                    shape=(ntimeids, nstocks, nseq, ncolumns))
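
    Since the array is memory-mapped, slicing it only reads the touched bytes from disk; a minimal sketch reusing the arr handle from above (variable names here are just illustrative):

    block = arr[0, 0]            # memmap view, shape (600, 27), float16
    block_np = np.array(block)   # materialize only this block in RAM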

    Caveats

    Handling Varying Sequence Sizes

    There are gaps in the stock ids and time ids, which don't map cleanly onto a dense array. So we also provide time and stock indexes (_ix suffix instead of _id). To calculate these:

    import pandas as pd
    import numpy as np

    targets = pd.read_csv('/kaggle/input/optiver-realized-volatility-prediction/train.csv')
    ntimeids = targets.time_id.nunique()
    stock_ids = list(sorted(targets.stock_id.unique()))
    timeids = sorted(targets.time_id.unique())
    timeid_to_ix = {time_id: i for i, time_id in enumerate(timeids)}
    stock_id_to_ix = {stock_id: i for i, stock_id in enumerate(stock_ids)}


    Getting data for a particular stock id / time id

    So to get the data for stock_id 13 on time_id 146 you'd do:

    stock_ix = stock_id_to_ix[13]
    time_ix = timeid_to_ix[146]
    arr[time_ix, stock_ix]

    Notice that the third dimension is of size 600 (the max number of points for a given time_ix, stock_ix pair); some of these entries will be empty. To truncate a single stock's data:

    # Valid rows are marked by a positive value in the last column.
    max_seq_ix = (arr[time_ix, stock_ix, :, -1] > 0).cumsum().max()
    arr[time_ix, stock_ix, :max_seq_ix]

    Column Mappings

    There are 27 columns in the last dimension; these are:

    ['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1', 'bid_price2', 'ask_price2', 'bid_size1', 'ask_size1', 'bid_size2', 'ask_size2', 'stock_id', 'wap1', 'wap2', 'log_return1', 'log_return2', 'wap_balance', 'price_spread', 'bid_spread', 'ask_spread', 'total_volume', 'volume_imbalance', 'price', 'size', 'order_count', 'stock_id_y', 'log_return_trade', 'target']
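
    For convenience you can turn this list into a name-to-index mapping; a minimal sketch, assuming the list above is bound to a variable named columns and that arr, time_ix and stock_ix come from the earlier snippets:

    col_ix = {name: i for i, name in enumerate(columns)}
    wap1 = arr[time_ix, stock_ix, :, col_ix['wap1']]  # one feature series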

  2. Data from: FISBe: A real-world benchmark dataset for instance segmentation...

    • zenodo.org
    • data.niaid.nih.gov
    bin, json +3
    Updated Apr 2, 2024
    Cite
    Lisa Mais; Peter Hirsch; Claire Managan; Ramya Kandarpa; Josef Lorenz Rumberger; Annika Reinke; Lena Maier-Hein; Gudrun Ihrke; Dagmar Kainmueller (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. http://doi.org/10.5281/zenodo.10875063
    Explore at:
    Available download formats: zip, text/x-python, bin, json, txt
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lisa Mais; Peter Hirsch; Claire Managan; Ramya Kandarpa; Josef Lorenz Rumberger; Annika Reinke; Lena Maier-Hein; Gudrun Ihrke; Dagmar Kainmueller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 26, 2024
    Description

    General

    For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

    Summary

    • A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
      • 30 completely labeled (segmented) images
      • 71 partly labeled images
      • altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)
    • To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
    • A set of metrics and a novel ranking score for respective meaningful method benchmarking
    • An evaluation of three baseline methods in terms of the above metrics and score

    Abstract

    Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

    Dataset documentation:

    We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

    >> FISBe Datasheet

    Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

    Files

    • fisbe_v1.0_{completely,partly}.zip
      • contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
    • fisbe_v1.0_mips.zip
      • maximum intensity projections of all samples, for convenience.
    • sample_list_per_split.txt
      • a simple list of all samples and the subset they are in, for convenience.
    • view_data.py
      • a simple python script to visualize samples, see below for more information on how to use it.
    • dim_neurons_val_and_test_sets.json
      • a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
    • Readme.md
      • general information

    How to work with the image files

    Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
    For each image, we provide a pixel-wise instance segmentation for all separable neurons.
    Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays, based on an open-source specification).
    The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
    The segmentation mask for each neuron is stored in a separate channel.
    The order of dimensions is CZYX.

    We recommend working in a virtual environment, e.g., by using conda:

    conda create -y -n flylight-env -c conda-forge python=3.9
    conda activate flylight-env

    How to open zarr files

    1. Install the python zarr package:
      pip install zarr
    2. Open a zarr file with:

      import zarr

      # The path is a placeholder; point it at one of the extracted zarr samples.
      raw = zarr.open("/path/to/sample.zarr", mode='r', path="volumes/raw")
      seg = zarr.open("/path/to/sample.zarr", mode='r', path="volumes/gt_instances")

      # optional:
      import numpy as np
      raw_np = np.array(raw)

    Zarr arrays are read lazily on-demand.
    Many functions that expect numpy arrays also work with zarr arrays.
    Optionally, the arrays can also explicitly be converted to numpy arrays.
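
    No voxel data is read until you slice; for example (a sketch using the raw and seg handles from step 2):

      print(raw.shape, raw.dtype)   # dimension order is CZYX
      print(seg.shape)              # one channel per neuron instance
      plane = raw[0, 0]             # reads only this single 2d plane from disk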

    How to view zarr image files

    We recommend using napari to view the image data.

    1. Install napari:
      pip install "napari[all]"
    2. Save the following Python script:

      import zarr, sys, napari

      raw = zarr.open(sys.argv[1], mode='r', path="volumes/raw")
      gts = zarr.open(sys.argv[1], mode='r', path="volumes/gt_instances")

      viewer = napari.Viewer(ndisplay=3)
      for idx, gt in enumerate(gts):
          viewer.add_labels(
              gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
      viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
      viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
      viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
      napari.run()

    3. Execute:
      python view_data.py <path/to/sample.zarr>

    Metrics

    • S: Average of avF1 and C
    • avF1: Average F1 Score
    • C: Average ground truth coverage
    • clDice_TP: Average true positives clDice
    • FS: Number of false splits
    • FM: Number of false merges
    • tp: Relative number of true positives
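
    Read together, the overall ranking score is simply the mean of the first two entries: S = (avF1 + C) / 2.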

    For more information on our selected metrics and formal definitions please see our paper.

    Baseline

    To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN), and a non-learnt, application-specific color clustering from Duan et al.
    For detailed information on the methods and the quantitative results please see our paper.

    License

    The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Citation

    If you use FISBe in your research, please use the following BibTeX entry:

    @misc{mais2024fisbe,
     title =    {FISBe: A real-world benchmark dataset for instance
             segmentation of long-range thin filamentous structures},
     author =    {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
             Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
             Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
     year =     2024,
     eprint =    {2404.00130},
     archivePrefix ={arXiv},
     primaryClass = {cs.CV}
    }

    Acknowledgments

    We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable
    discussions.
    P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
    This work was co-funded by Helmholtz Imaging.

    Changelog

    There have been no changes to the dataset so far.
    All future changes will be listed on the changelog page.

    Contributing

    If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.

    All contributions are welcome!

  3. Durban Large Firm Survey 2002-2003 - South Africa

    • microdata.worldbank.org
    • dev.ihsn.org
    • +2more
    Updated May 7, 2014
    Cite
    World Bank (2014). Durban Large Firm Survey 2002-2003 - South Africa [Dataset]. https://microdata.worldbank.org/index.php/catalog/1274
    Explore at:
    Dataset updated
    May 7, 2014
    Dataset provided by
    World Bank (https://www.worldbank.org/)

    Durban Unicity Council
    Time period covered
    2002 - 2003
    Area covered
    South Africa
    Description

    Abstract

    The Durban Large Firm Survey is the result of an agreement between the Durban Unicity Council and USAID in 2000 to fund a World Bank technically supported survey of firms in the greater Durban region. The survey, carried out between May 2002 and April 2003, involved 22 fieldworkers, who were managed by the Bureau of Market Research (BMR) at Unisa.

    Analysis unit

    Units of analysis for the survey were enterprises.

    Universe

    The survey covered large manufacturing firms in the Durban metropolitan area of South Africa.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    To assist fieldworkers in selecting sample units (firms), the World Bank provided the Bureau of Market Research (BMR) at Unisa with various sample frames (listings of firm names). These sample frames were used to contact firms randomly by telephone in order to set up appointments with the managing directors, managers or owners of the firms. Because no single sample frame is comprehensive enough to include all firms operating in the Durban Metropolitan Area, various sample frames had to be utilized. The study was constructed to stratify industry by type, employment size group and geographic area. The sample frames had some limitations in this regard: they lacked employment size group classifications, contained a limited number of firms for certain sectors, and showed geographic location problems. In some cases, information on the firms was outdated.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The eight questionnaires for the survey covered the following topics:

    • Questionnaire 1: General issues (completed by MD/CEO): general business information, economic policy environment, government's role in investment promotion and local economic development, corporate governance and ownership structure, parent company, corporate finance, interest rates, exports, imports and exchange rates, company tax rates, PDI (previously disadvantaged individual) participation in company ownership, future expectations, mergers/acquisitions
    • Questionnaire 2: Production related issues (completed by production manager): products manufactured, choice of location, capacity utilization, capital assets
    • Questionnaire 3: Financial related issues (completed by financial manager)
    • Questionnaire 4: Purchase related issues (completed by purchasing manager): business relations, purchase of raw materials, import duties and tariffs, import of raw materials
    • Questionnaire 5: Sales/marketing related issues (completed by sales/marketing manager): business relations, membership of business associations, sources of information, sale of products, exports, business operations and profitability, government contracts
    • Questionnaire 6: Human resource related issues (completed by HR manager): recruitment, training, labour relations, labour market regulation, employment size and patterns, employees' transport infrastructure, labour cost and impact of HIV/AIDS
    • Questionnaire 7: Administrative/legal related issues (completed by administrative manager): start-up licenses and permits, licensing and permits required to continue operations
    • Questionnaire 8: Port related issues (completed by shipping/export/marketing manager)

  4. Randomised material parameter impedance dataset of piezoelectric rings...

    • zenodo.org
    bin, xml
    Updated Jul 31, 2024
    Cite
    Kevin Koch; Olga Friesen; Leander Claes (2024). Randomised material parameter impedance dataset of piezoelectric rings (RaPIDring) [Dataset]. http://doi.org/10.5281/zenodo.11207806
    Explore at:
    Available download formats: xml, bin
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kevin Koch; Olga Friesen; Leander Claes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Version notes

    This version of the dataset is deprecated. The simulation model used to generate the dataset has an erroneous boundary condition (see lines 53 to 55 in simulation.xml); this boundary condition has been removed in newer versions.

    Description of the dataset

    This dataset contains the results of 276282 finite-element simulations of the complex, frequency-dependent electrical impedance of a piezoelectric ring with randomised material parameters. Each impedance consists of 2000 samples in the frequency domain up to 8 MHz. We assume the sample to be dielectric, thus the impedance at a frequency of 0 Hz is infinite. The piezoelectric ring has an outer radius of 6.35 mm, an inner radius of 2.6 mm, and a thickness of 1 mm. The transversely isotropic material parameters are sampled from independent uniform distributions with ranges that are intended to represent the behaviour of different piezoceramic materials. The parameters of the Rayleigh damping model (alpha_M and alpha_K) are sampled from a logarithmic distribution to account for their larger parameter range.

    Parameter  Min   Max   Unit    Description
    c11        120   165   GPa     Elastic stiffness
    c12        70    150   GPa     Elastic stiffness
    c13        65    95    GPa     Elastic stiffness
    c33        110   140   GPa     Elastic stiffness
    c44        18    30    GPa     Elastic stiffness
    eps11      3     12    nF/m    Dielectric permittivity
    eps33      4     8     nF/m    Dielectric permittivity
    e15        8     18    C/m^2   Piezoelectric coupling
    e31        3     7.5   C/m^2   Piezoelectric coupling
    e33        12    20    C/m^2   Piezoelectric coupling
    alpha_M    2     150   1/ms    Mass-proportional damping
    alpha_K    10    800   ps      Stiffness-proportional damping
    density    7600  7850  kg/m^3  Density
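
    For illustration, one random material-parameter draw consistent with the table could be generated as follows (a sketch only; the dataset itself was produced with openCFS, and all names here are illustrative):

    import numpy as np

    rng = np.random.default_rng()

    # Uniform draws for stiffness, permittivity, coupling and density
    # (units as in the table above).
    params = {
        "c11": rng.uniform(120, 165),        # GPa
        "c12": rng.uniform(70, 150),         # GPa
        "c13": rng.uniform(65, 95),          # GPa
        "c33": rng.uniform(110, 140),        # GPa
        "c44": rng.uniform(18, 30),          # GPa
        "eps11": rng.uniform(3, 12),         # nF/m
        "eps33": rng.uniform(4, 8),          # nF/m
        "e15": rng.uniform(8, 18),           # C/m^2
        "e31": rng.uniform(3, 7.5),          # C/m^2
        "e33": rng.uniform(12, 20),          # C/m^2
        "density": rng.uniform(7600, 7850),  # kg/m^3
    }

    # Log-uniform draws for the Rayleigh damping parameters (wider ranges).
    params["alpha_M"] = np.exp(rng.uniform(np.log(2), np.log(150)))   # 1/ms
    params["alpha_K"] = np.exp(rng.uniform(np.log(10), np.log(800)))  # ps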

    Files

    The dataset contains the following files:

    • dataset.hdf5: The main dataset file in HDF5 format. Refer to the next section on how to load the dataset.
    • simulation.xml: The file describing the parameters of the finite element simulation. This file can be used with openCFS along with the mesh and material file to simulate the behaviour of the piezoelectric ceramic in the frequency domain. This simulation file was used with randomised material parameters to generate the dataset.
    • material.xml: The material file used for the simulation, with exemplary material parameter values.
    • ring.geo: The gmsh geometry file with the axisymmetric representation of the piezoelectric ring.
    • ring.msh: The mesh file generated with gmsh using the ring.geo file.

    Loading the dataset

    The dataset is stored as an HDF5 file, which can be opened with any library that supports that format, e.g. in Python using the h5py library:

    import h5py

    # Open the dataset in read mode.
    file = h5py.File("dataset.hdf5", "r")

    # Impedances as a 276282 x 2000 array of complex numbers.
    impedances = file["impedances"]
    # Material parameter values as a 276282 x 13 array of real numbers.
    parameters = file["parameters"]
    # Frequency vector of the impedance with length 2000.
    frequencies = file["meta"]["frequencies"]
    # 13 strings with the identifiers of the material parameters.
    parameter_labels = file["meta"]["parameter_labels"]
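
    The arrays above are read lazily by h5py, so pulling a single impedance curve into memory is cheap (a short sketch using the handles opened above):

    import numpy as np

    z0 = np.asarray(impedances[0])   # one curve: 2000 complex samples
    f = np.asarray(frequencies)      # matching frequency axis
    file.close()                     # close the HDF5 file when done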

    Simulating impedances

    To generate a result for the electrical impedance using the supplied simulation files, download and install openCFS and call the executable with the simulation.xml, but omit the file extension, e.g.:

    cfsbin.exe simulation

    The path to the executable of openCFS will depend on your operating system and installation. Running the simulation will create several files and folders. Among those files will be the result for the electric charge on one of the electrodes of the sample, which will be placed in the 'history' subfolder. We can determine the current by taking the time derivative of the charge, and we already know the voltage because we excited the piezoceramic with an electric potential of 1 V in the simulation. Because the simulation is conducted in the frequency domain, all we have to do is divide voltage by current to get the frequency-dependent electrical impedance. Loading the result and calculating the impedance is implemented in the following Python script as an example:

    import numpy as np

    # Load result file for electric charge
    result_path = 'history/simulation-elecCharge-surfRegion-ground.hist'
    data = np.loadtxt(result_path)

    frequency = data[:, 0]
    # Convert polar representation from file to complex numbers.
    charge = data[:, 1] * np.exp(1j * 2 * np.pi / 360 * data[:, 2])

    # Excitation potential is 1 V in simulation.
    potential = 1
    # Determine impedance by applying Z = V / I = V / (j omega Q).
    impedance = potential / (1j * 2 * np.pi * frequency * charge)

