100+ datasets found
  1. Code to import PSCAD data into Python (Spyder)

    • ieee-dataport.org
    Updated Nov 20, 2025
    Cite
    Franz Guzman Llanos (2025). Code to import PSCAD data into Python (Spyder) [Dataset]. https://ieee-dataport.org/documents/code-import-pscad-data-python-spyder
    Dataset updated
    Nov 20, 2025
    Authors
    Franz Guzman Llanos
    Description

    minimizes errors

  2. Python Import Data India – Buyers & Importers List

    • seair.co.in
    Cite
    Seair Exim, Python Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset provided by
    Seair Info Solutions PVT LTD
    Authors
    Seair Exim
    Area covered
    India
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  3. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Oct 29, 2024
    + more versions
    Cite
    Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
    Available download formats: zip
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton; Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 10/29/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been updated.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    Code information:

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the
    `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and
    supporting analyses. This script generates the key figures and summary statistics
    used in the study that then get saved in the manuscript_figures folder. Note that all
    maps were produced using Python code found in the "supporting_code"" folder.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  4. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2TB of disk space (see Step 2 detail levels)
    - at least 16GB of RAM (64GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  5. original : CIFAR 100

    • kaggle.com
    zip
    Updated Dec 28, 2024
    Cite
    Shashwat Pandey (2024). original : CIFAR 100 [Dataset]. https://www.kaggle.com/datasets/shashwat90/original-cifar-100
    Available download formats: zip (168517945 bytes)
    Dataset updated
    Dec 28, 2024
    Authors
    Shashwat Pandey
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

    The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.

    Baseline results: You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.

    Other results: Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website.

    Dataset layout (Python / Matlab versions): I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.

    The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:

        def unpickle(file):
            import cPickle
            with open(file, 'rb') as fo:
                dict = cPickle.load(fo)
            return dict

    And a python3 version:

        def unpickle(file):
            import pickle
            with open(file, 'rb') as fo:
                dict = pickle.load(fo, encoding='bytes')
            return dict

    Loaded in this way, each of the batch files contains a dictionary with the following elements:

    data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.

    labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
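
    For illustration, here is a minimal sketch (assuming Python 3 with numpy; the file path is a placeholder) of turning one row of the data array back into a 32x32 RGB image after unpickling with the python3 routine above:

        import numpy as np

        batch = unpickle('cifar-10-batches-py/data_batch_1')  # path is an assumption
        row = batch[b'data'][0]            # keys are bytes because encoding='bytes'
        # 3072 values per row: 1024 red, then 1024 green, then 1024 blue
        img = row.reshape(3, 32, 32).transpose(1, 2, 0)       # -> (32, 32, 3) uint8 image
        label = batch[b'labels'][0]        # integer class label in the range 0-9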

    The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries: label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.

    Binary version: The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:

        <1 x label><3072 x pixel>
        ...
        <1 x label><3072 x pixel>

    In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.

    Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.

    There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.

    The CIFAR-100 dataset: This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...

  6. Python Import Data in February - Seair.co.in

    • seair.co.in
    Updated Feb 18, 2016
    Cite
    Seair Exim (2016). Python Import Data in February - Seair.co.in [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Feb 18, 2016
    Dataset provided by
    Seair Info Solutions PVT LTD
    Authors
    Seair Exim
    Area covered
    Argentina, Nauru, Malaysia, Gibraltar, Slovakia, Tokelau, Timor-Leste, French Guiana, Korea (Democratic People's Republic of), Austria
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  7. Python Import Data in August - Seair.co.in

    • seair.co.in
    Updated Aug 20, 2016
    Cite
    Seair Exim (2016). Python Import Data in August - Seair.co.in [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Aug 20, 2016
    Dataset provided by
    Seair Info Solutions PVT LTD
    Authors
    Seair Exim
    Area covered
    Belgium, South Africa, Christmas Island, Nepal, Lebanon, Virgin Islands (U.S.), Saint Pierre and Miquelon, Falkland Islands (Malvinas), Ecuador, Gambia
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  8. Python-DPO-Large

    • huggingface.co
    Updated Mar 15, 2023
    + more versions
    Cite
    NextWealth Entrepreneurs Private Limited (2023). Python-DPO-Large [Dataset]. https://huggingface.co/datasets/NextWealth/Python-DPO-Large
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 15, 2023
    Dataset authored and provided by
    NextWealth Entrepreneurs Private Limited
    Description

    Dataset Card for Python-DPO

    This dataset is the larger version of the Python-DPO dataset and has been created using Argilla.

      Load with datasets
    

    To load this dataset with the datasets library, install it with pip install datasets --upgrade and then use the following code:

        from datasets import load_dataset

        ds = load_dataset("NextWealth/Python-DPO")

      Data Fields
    

    Each data instance contains:

    instruction: The problem description/requirements
    chosen_code: …

    See the full description on the dataset page: https://huggingface.co/datasets/NextWealth/Python-DPO-Large.
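
    A minimal sketch of inspecting one record once loaded as above (the "train" split name is an assumption, and only the fields named in the truncated card are shown):

        record = ds["train"][0]           # "train" split name is an assumption
        print(record["instruction"])      # problem description/requirements
        print(record["chosen_code"])      # preferred code response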

  9. MASCDB, a database of images, descriptors and microphysical properties of...

    • data.niaid.nih.gov
    • springerprofessional.de
    • +2more
    Updated Jul 5, 2023
    Cite
    Grazioli, Jacopo; Ghiggi, Gionata; Berne, Alexis (2023). MASCDB, a database of images, descriptors and microphysical properties of individual snowflakes in free fall [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5578920
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    EPFL-ENAC-IIE-LTE
    Authors
    Grazioli, Jacopo; Ghiggi, Gionata; Berne, Alexis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset overview

    This dataset provides data and images of snowflakes in free fall collected with a Multi-Angle Snowflake Camera (MASC). The dataset includes, for each recorded snowflake:

    A triplet of gray-scale images corresponding to the three cameras of the MASC

    A large quantity of geometrical, textural descriptors and the pre-compiled output of published retrieval algorithms as well as basic environmental information at the location and time of each measurement.

    The pre-computed descriptors and retrievals are available either individually for each camera view or, for some of them, as descriptors of the triplet as a whole. A non-exhaustive list of precomputed quantities includes, for example:

    Textural and geometrical descriptors as in Praz et al 2017

    Hydrometeor classification, riming degree estimation, melting identification, as in Praz et al 2017

    Blowing snow identification, as in Schaer et al 2020

    Mass, volume, gyration estimation, as in Leinonen et al 2021

    Data format and structure

    The dataset is divided into four .parquet files (for scalar descriptors) and a Zarr database (for the images). A detailed description of the data content and of the data records is available here.

    Supporting code

    A python-based API is available to manipulate, display and organize the data of our dataset. It can be found on GitHub. See also the code documentation on ReadTheDocs.

    Download notes

    All files available here for download should be stored in the same folder, if the python-based API is used

    MASCdb.zarr.zip must be unzipped after download
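
    For users who prefer not to go through the python-based API above, a minimal sketch of opening the files directly (the file names are placeholders; pandas, pyarrow, and xarray with zarr support are assumed to be installed):

        import pandas as pd
        import xarray as xr

        # one of the four .parquet files with scalar descriptors (name is an assumption)
        descriptors = pd.read_parquet('MASCdb_cam0.parquet')

        # the Zarr store with the image triplets, after unzipping MASCdb.zarr.zip
        images = xr.open_zarr('MASCdb.zarr')

        print(descriptors.head())
        print(images)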

    Field campaigns

    A list of campaigns included in the dataset, with a minimal description, is given below (DFIR = Double Fence Intercomparison Reference):

    • APRES3-2016 & APRES3-2017: instrument installed in Antarctica in the context of the APRES3 project (see for example Genthon et al, 2018 or Grazioli et al 2017). Not shielded.
    • Davos-2015: instrument installed in the Swiss Alps within the context of SPICE (Solid Precipitation InterComparison Experiment). Shielded (DFIR).
    • Davos-2019: instrument installed in the Swiss Alps within the context of RACLETS (Role of Aerosols and CLouds Enhanced by Topography on Snow). Not shielded.
    • ICEGENESIS-2021: instrument installed in the Swiss Jura in a MeteoSwiss ground measurement site, within the context of ICE-GENESIS (see for example Billault-Roux et al, 2023). Not shielded.
    • ICEPOP-2018: instrument installed in Korea, in the context of ICEPOP (see for example Gehring et al 2021). Shielded (DFIR).
    • Jura-2019 & Jura-2023: instrument installed in the Swiss Jura within a MeteoSwiss measurement site. Not shielded.
    • Norway-2016: instrument installed in Norway during the High-Latitude Measurement of Snowfall (HiLaMS) campaign (see for example Cooper et al, 2022). Not shielded.
    • PLATO-2019: instrument installed at the "Davis" Antarctic base during the PLATO field campaign. Not shielded.
    • POPE-2020: instrument installed at the "Princess Elizabeth Antarctica" base during the POPE campaign (see for example Ferrone et al, 2023). Not shielded.
    • Remoray-2022: instrument installed in the French Jura. Not shielded.
    • Valais-2016: instrument installed in the Swiss Alps in a ski resort. Not shielded.

    Version

    1.0 - Two new campaigns ("Jura-2023", "Norway-2016") added. Added references and list of campaigns.

    0.3 - a new campaign is added to the dataset ("Remoray-2022")

    0.2 - rename of variables. Variable precision (digits) standardized

    0.1 - first upload

  10. Auditory cortex single unit population activity during natural sound...

    • data.niaid.nih.gov
    Updated Jun 15, 2023
    Cite
    Pennington, Jacob; David, Stephen (2023). Auditory cortex single unit population activity during natural sound presentation -- dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7796573
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Oregon Health & Science University
    Washington State University, Vancouver
    Authors
    Pennington, Jacob; David, Stephen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    High-density multi-channel neurophysiology data were collected from primary (A1) and secondary (PEG) fields of auditory cortex of passively listening ferrets during presentation of a large natural sound library. Single unit spikes were sorted using Kilosort. This dataset includes spike times for 849 A1 units and 398 PEG units. Stimulus waveforms were transformed to log-spaced spectrograms for analysis (18 channels, 10 ms time bins). Data set includes raw sound waveforms as well.

    The authors request that any publication using this data cite the following work: https://www.biorxiv.org/content/10.1101/2022.06.10.495698v2

    Data format/description

    Neural data are stored in two files. All recordings were performed during presentation of the same natural sound library.

    recordings/A1_NAT4_ozgf.fs100.ch18.tgz - data from 849 A1 single units and log spectrogram of stimuli aligned with spike times.

    recordings/PEG_NAT4_ozgf.fs100.ch18.tgz - data from 398 PEG single units and log spectrogram of stimuli aligned with spike times.

    wav.zip - raw wav files. Note: only the first 1 s of each wav file was presented during experiments; the recordings have a longer duration.

    Example scripts

    Python scripts included with this dataset demonstrate how to load the neural data and perform a CNN model fit. Running the scripts requires the NEMS0 python library, which is available open source at https://github.com/lbhb/NEMS0.

    Quick install

    Create and activate a new conda environment:

    conda create -n NEMS0 python=3.7
    conda activate NEMS0

    Download NEMS0:

    git clone https://github.com/lbhb/NEMS0

    Install NEMS0:

    pip install -e NEMS0

    Detailed instructions for installing NEMS0 are available in the Github repository (https://github.com/lbhb/NEMS0).

    Demo scripts

    Once NEMS0 is installed and the data are downloaded, move to the directory where the data and demo scripts are stored and run them in a NEMS0 environment.

    pop_cnn_load.py - Load the A1 data and compare predictions for two neurons (Fig 3) by two population models (stage 1 fit complete). Illustrates how to load the data using Python.

    pop_cnn_fit.py - Load a pre-fit A1 population model (stage 1) and complete stage 2 fit (refinement) for a single neuron. Illustrates use of NEMS0 for CNN model fitting.

    Funding

    Data collection, software development and processing were supported by funding from the NIH (R01DC014950, R01EB028155).

  11. Open data: Visual load effects on the auditory steady-state responses to...

    • demo.researchdata.se
    • researchdata.se
    • +2more
    Updated Nov 8, 2020
    Cite
    Stefan Wiens; Malina Szychowska (2020). Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones [Dataset]. http://doi.org/10.17045/STHLMUNI.12582002
    Dataset updated
    Nov 8, 2020
    Dataset provided by
    Stockholm University
    Authors
    Stefan Wiens; Malina Szychowska
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The main results files are saved separately:

    • ASSR2.html: R output of the main analyses (N = 33)
    • ASSR2_subset.html: R output of the main analyses for the smaller sample (N = 25)

    FIGSHARE METADATA

    Categories

    • Biological psychology
    • Neuroscience and physiological psychology
    • Sensory processes, perception, and performance

    Keywords

    • crossmodal attention
    • electroencephalography (EEG)
    • early-filter theory
    • task difficulty
    • envelope following response

    References

    GENERAL INFORMATION

    1. Title of Dataset: Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones

    2. Author Information A. Principal Investigator Contact Information Name: Stefan Wiens Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.su.se/profiles/swiens-1.184142 Email: sws@psychology.su.se

      B. Associate or Co-investigator Contact Information Name: Malina Szychowska Institution: Department of Psychology, Stockholm University, Sweden Internet: https://www.researchgate.net/profile/Malina_Szychowska Email: malina.szychowska@psychology.su.se

    3. Date of data collection: Subjects (N = 33) were tested between 2019-11-15 and 2020-03-12.

    4. Geographic location of data collection: Department of Psychology, Stockholm, Sweden

    5. Information about funding sources that supported the collection of the data: Swedish Research Council (Vetenskapsrådet) 2015-01181

    SHARING/ACCESS INFORMATION

    1. Licenses/restrictions placed on the data: CC BY 4.0

    2. Links to publications that cite or use the data: Szychowska M., & Wiens S. (2020). Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Submitted manuscript.

    The study was preregistered: https://doi.org/10.17605/OSF.IO/6FHR8

    3. Links to other publicly accessible locations of the data: N/A

    4. Links/relationships to ancillary data sets: N/A

    5. Was data derived from another source? No

    6. Recommended citation for this dataset: Wiens, S., & Szychowska M. (2020). Open data: Visual load effects on the auditory steady-state responses to 20-, 40-, and 80-Hz amplitude-modulated tones. Stockholm: Stockholm University. https://doi.org/10.17045/sthlmuni.12582002

    DATA & FILE OVERVIEW

    File List: The files contain the raw data, scripts, and results of main and supplementary analyses of an electroencephalography (EEG) study. Links to the hardware and software are provided under methodological information.

    ASSR2_experiment_scripts.zip: contains the Python files to run the experiment.

    ASSR2_rawdata.zip: contains raw datafiles for each subject

    • data_EEG: EEG data in bdf format (generated by Biosemi)
    • data_log: logfiles of the EEG session (generated by Python)

    ASSR2_EEG_scripts.zip: Python-MNE scripts to process the EEG data

    ASSR2_EEG_preprocessed_data.zip: EEG data in fif format after preprocessing with Python-MNE scripts

    ASSR2_R_scripts.zip: R scripts to analyze the data together with the main datafiles. The main files in the folder are:

    • ASSR2.html: R output of the main analyses
    • ASSR2_subset.html: R output of the main analyses but after excluding eight subjects who were recorded as pilots before preregistering the study

    ASSR2_results.zip: contains all figures and tables that are created by Python-MNE and R.

    METHODOLOGICAL INFORMATION

    1. Description of methods used for collection/generation of data: The auditory stimuli were amplitude-modulated tones with a carrier frequency (fc) of 500 Hz and modulation frequencies (fm) of 20.48 Hz, 40.96 Hz, or 81.92 Hz. The experiment was programmed in python: https://www.python.org/ and used extra functions from here: https://github.com/stamnosslin/mn

    The EEG data were recorded with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com) and saved in .bdf format. For more information, see linked publication.

    2. Methods for processing the data: We conducted frequency analyses and computed event-related potentials. See linked publication.

    3. Instrument- or software-specific information needed to interpret the data (a short MNE-Python loading sketch is given after this list):
    • MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html
    • RStudio used with R (R Core Team, 2020): https://rstudio.com/products/rstudio/
    • Wiens, S. (2017). Aladins Bayes Factor in R (Version 3). https://www.doi.org/10.17045/sthlmuni.4981154.v3

    4. Standards and calibration information, if appropriate: For information, see linked publication.

    5. Environmental/experimental conditions: For information, see linked publication.

    6. Describe any quality-assurance procedures performed on the data: For information, see linked publication.

    7. People involved with sample collection, processing, analysis and/or submission:

    • Data collection: Malina Szychowska with assistance from Jenny Arctaedius.
    • Data processing, analysis, and submission: Malina Szychowska and Stefan Wiens
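
    As referenced above, a minimal sketch of loading one raw BioSemi recording with MNE-Python (the file name is a placeholder; the dataset's own scripts in ASSR2_EEG_scripts.zip are the authoritative pipeline):

        import mne

        # load one subject's raw .bdf recording (file name is an assumption)
        raw = mne.io.read_raw_bdf('data_EEG/subject01.bdf', preload=True)
        print(raw.info)                # channel names, sampling rate, etc.
        events = mne.find_events(raw)  # trigger events, assuming the stim channel is detected automatically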

    DATA-SPECIFIC INFORMATION: All relevant information can be found in the MNE-Python and R scripts (in EEG_scripts and analysis_scripts folders) that process the raw data. For example, we added notes to explain what different variables mean.

  12. Advancing Open and Reproducible Water Data Science by Integrating Data...

    • hydroshare.org
    • beta.hydroshare.org
    • +1more
    zip
    Updated Jan 9, 2024
    Cite
    Jeffery S. Horsburgh (2024). Advancing Open and Reproducible Water Data Science by Integrating Data Analytics with an Online Data Repository [Dataset]. https://www.hydroshare.org/resource/45d3427e794543cfbee129c604d7e865
    Available download formats: zip (50.9 MB)
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    HydroShare
    Authors
    Jeffery S. Horsburgh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scientific and related management challenges in the water domain require synthesis of data from multiple domains. Many data analysis tasks are difficult because datasets are large and complex; standard formats for data types are not always agreed upon nor mapped to an efficient structure for analysis; water scientists may lack training in methods needed to efficiently tackle large and complex datasets; and available tools can make it difficult to share, collaborate around, and reproduce scientific work. Overcoming these barriers to accessing, organizing, and preparing datasets for analyses will be an enabler for transforming scientific inquiries.

    Building on the HydroShare repository's established cyberinfrastructure, we have advanced two packages for the Python language that make data loading, organization, and curation for analysis easier, reducing time spent in choosing appropriate data structures and writing code to ingest data. These packages enable automated retrieval of data from HydroShare and the USGS's National Water Information System (NWIS), loading of data into performant structures keyed to specific scientific data types that integrate with existing visualization, analysis, and data science capabilities available in Python, and then writing analysis results back to HydroShare for sharing and eventual publication. These capabilities reduce the technical burden for scientists associated with creating a computational environment for executing analyses by installing and maintaining the packages within CUAHSI's HydroShare-linked JupyterHub server. HydroShare users can leverage these tools to build, share, and publish more reproducible scientific workflows.

    The HydroShare Python Client and USGS NWIS Data Retrieval packages can be installed within a Python environment on any computer running Microsoft Windows, Apple MacOS, or Linux from the Python Package Index using the PIP utility. They can also be used online via the CUAHSI JupyterHub server (https://jupyterhub.cuahsi.org/) or other Python notebook environments like Google Colaboratory (https://colab.research.google.com/). Source code, documentation, and examples for the software are freely available in GitHub at https://github.com/hydroshare/hsclient/ and https://github.com/USGS-python/dataretrieval.
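
    A minimal sketch of the kind of workflow these packages enable (the site number, dates, resource identifier, and file name are placeholders; exact call names should be checked against the package documentation linked above):

        # pip install hsclient dataretrieval
        from hsclient import HydroShare
        import dataretrieval.nwis as nwis

        # retrieve daily-value streamflow records from the USGS NWIS
        flow = nwis.get_record(sites='03339000', service='dv',
                               start='2022-01-01', end='2022-12-31')

        # connect to HydroShare and pull a file from an existing resource
        hs = HydroShare()                        # credentials handling omitted here
        res = hs.resource('insert_resource_id')  # placeholder resource identifier
        res.file_download('results.csv')         # placeholder file name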

    This presentation was delivered as part of the Hawai'i Data Science Institute's regular seminar series: https://datascience.hawaii.edu/event/data-science-and-analytics-for-water/

  13. rag

    • huggingface.co
    Cite
    VIGNESH M, rag [Dataset]. https://huggingface.co/datasets/vicky3241/rag
    Authors
    VIGNESH M
    Description

    import pandas as pd

      Example dataset with new columns
    

    data = [ { "title": "Pandas Library", "about": "Pandas is a Python library for data manipulation and analysis.", "procedure": "Install Pandas via pip, load data into DataFrames, clean and analyze data using built-in functions.", "content": """ Pandas provides data structures like Series and DataFrame for handling structured data. It supports indexing, slicing, aggregation, joining, and filtering… See the full description on the dataset page: https://huggingface.co/datasets/vicky3241/rag.

  14. Data from: MLFMF: Data Sets for Machine Learning for Mathematical...

    • data.niaid.nih.gov
    Updated Oct 26, 2023
    Cite
    Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo (2023). MLFMF: Data Sets for Machine Learning for Mathematical Formalization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10041074
    Dataset updated
    Oct 26, 2023
    Dataset provided by
    University of Ljubljana
    Institute of Mathematics, Physics, and Mechanics
    Authors
    Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MLFMF (Machine Learning for Mathematical Formalization) is a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. With more than 250,000 entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format. In addition to benchmarking the recommendation systems, the data sets can also be used for benchmarking node classification and link prediction algorithms.

    The four data sets

    Each data set is derived from a library of formalized mathematics written in the proof assistants Agda or Lean. The collection includes:

    • the largest Lean 4 library, Mathlib,
    • the three largest Agda libraries: the standard library, the library of univalent mathematics Agda-unimath, and the TypeTopology library.

    Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the syntax trees give complete and easily parsed information about each entry. The Lean library data set was obtained by converting .olean files into s-expressions (see the lean2sexp tool). The Agda data sets were obtained with an s-expression extension of the official Agda repository (use either the master-sexp or release-2.6.3-sexp branch). For more details, see our arXiv copy of the paper.

    Directory structure

    First, the mlfmf.zip archive needs to be unzipped. It contains a separate directory for every library (for example, the standard library of Agda can be found in the stdlib directory) and some auxiliary files. Every library directory contains:

    • the network file from which the heterogeneous network can be loaded,
    • a zip of the entries directory that contains (many) files with abstract syntax trees; each of those files describes a single entry of the library.

    In addition to the auxiliary files which are used for loading the data (and described below), the zipped sources of lean2sexp and the Agda s-expression extension are present.

    Loading the data

    In addition to the data files, there is also a simple python script main.py for loading the data. To run it, you will have to install the packages listed in the file requirements.txt: tqdm and networkx. The easiest way to do so is calling pip install -r requirements.txt. When running main.py for the first time, the script will unzip the entry files into the directory named entries. After that, the script loads the syntax trees of the entries (see the Entry class) and the network (as a networkx.MultiDiGraph object).

    Note: the entry files have extension .dag (directed acyclic graph), since Lean uses node sharing, which breaks the tree structure (a shared node has more than one parent node).

    More information

    For more information about the data collection process, detailed data (and data format) description, and baseline experiments that were already performed with these data, see our arXiv copy of the paper. For the code that was used to perform the experiments and data format description, visit our github repository https://github.com/ul-fmf/mlfmf-data.

    Funding

    Since not all the funders are available in Zenodo's database, we list them here:

    This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0024. The authors also acknowledge the financial support of the Slovenian Research Agency via the research core funding No. P2-0103 and No. P1-0294.

  15. Raspberry Turk Project

    • kaggle.com
    zip
    Updated Mar 14, 2017
    Cite
    joeymeyer (2017). Raspberry Turk Project [Dataset]. https://www.kaggle.com/datasets/joeymeyer/raspberryturk
    Available download formats: zip (36266263 bytes)
    Dataset updated
    Mar 14, 2017
    Authors
    joeymeyer
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    http://www.raspberryturk.com/assets/img/logo.png" alt="Raspberry Turk logo">

    Context

    This dataset was created as part of the Raspberry Turk project. The Raspberry Turk is a robot that can play chess—it's entirely open source, based on Raspberry Pi, and inspired by the 18th century chess playing machine, the Mechanical Turk. The dataset was used to train models for the vision portion of the project.

    Content

    http://www.raspberryturk.com/assets/img/rawcapture.png" alt="Raw chessboard image">

    In the raw form the dataset contains 312 480x480 images of chessboards with their associated board FENs. Each chessboard contains 30 empty squares, 8 orange pawns, 2 orange knights, 2 orange bishops, 2 orange rooks, 2 orange queens, 1 orange king, 8 green pawns, 2 green knights, 2 green bishops, 2 green rooks, 2 green queens, and 1 green king arranged in different random positions.

    Scripts for Data Processing

    The Raspberry Turk source code includes several scripts for converting this raw data to a more usable form.

    To get started download the raw.zip file below and then:

    $ git clone git@github.com:joeymeyer/raspberryturk.git
    $ cd raspberryturk
    $ unzip ~/Downloads/raw.zip -d data
    $ conda env create -f data/environment.yml
    $ source activate raspberryturk
    

    From this point there are two scripts you will need to run. First, convert the raw data to an interim form (individual 60x60 rgb/grayscale images) using process_raw.py like this:

    $ python -m raspberryturk.core.data.process_raw data/raw/ data/interim/
    

    This will split the raw images into individual squares and put them in labeled folders inside the interim folder. The final step is to convert the images into a dataset that can be loaded into a numpy array for training/validation. The create_dataset.py utility accomplishes this. The tool takes a number of parameters that can be used to customize the dataset (ex. choose the labels, rgb/grayscale, zca whiten images first, include rotated images, etc). Below is the documentation for create_dataset.py.

    $ python -m raspberryturk.core.data.create_dataset --help
    usage: raspberryturk/core/data/create_dataset.py [-h] [-g] [-r] [-s SAMPLE]
                             [-o] [-t TEST_SIZE] [-e] [-z]
                             base_path
                             {empty_or_not,white_or_black,color_piece,color_piece_noempty,piece,piece_noempty}
                             filename
    
    Utility used to create a dataset from processed images.
    
    positional arguments:
     base_path       Base path for data processing.
     {empty_or_not,white_or_black,color_piece,color_piece_noempty,piece,piece_noempty}
                Encoding function to use for piece classification. See
                class_encoding.py for possible values.
     filename       Output filename for dataset. Should be .npz
    
    optional arguments:
     -h, --help      show this help message and exit
     -g, --grayscale    Dataset should use grayscale images.
     -r, --rotation    Dataset should use rotated images.
     -s SAMPLE, --sample SAMPLE
                Dataset should be created by only a sample of images.
                Must be value between 0 and 1.
     -o, --one_hot     Dataset should use one hot encoding for labels.
     -t TEST_SIZE, --test_size TEST_SIZE
                Test set partition size. Must be value between 0 and
                1.
     -e, --equalize_classes
                Equalize class distributions.
     -z, --zca       ZCA whiten dataset.
    

    Example of how it can be used:

    $ python -m raspberryturk.core.data.create_dataset data/interim/ promotable_piece data/processed/example_dataset.npz --rotation --grayscale --one_hot --sample=0.3 --zca
    

    Finally, the dataset is created and can be easily loaded into Python either using raspberryturk.core.data.dataset.Dataset or simply np.load.

    In [1]: from raspberryturk.core.data.dataset import Dataset
    In [2]: d = Dataset.load_file('data/processed/example_dataset.npz')
    

    or

    In [1]: import numpy as np
    In [2]: with open('data/processed/example_dataset.npz', 'rb') as f:
       ...:     data = np.load(f)
    

    Visit the data collection page of the Raspberry Turk website for more details.

    Creator

    Joey Meyer

  16. Data from: Russian Financial Statements Database: A firm-level collection of...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    European University at St Petersburg
    European University at St. Petersburg
    Authors
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

    The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

    You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

    from datasets import load_dataset
    import polars as pl

    The following line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder:

    RFSD = load_dataset('irlspbru/RFSD')

    Alternatively, this will download ~540MB with all financial statements for 2023 to a Polars DataFrame (requires about 8GB of RAM):

    RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

    Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.
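
    If a random sample is needed, one option is to materialize the year of interest and sample it in memory with Polars, e.g. (a minimal sketch; the fraction and seed are arbitrary):

        # continue from the RFSD_2023 Polars DataFrame loaded above
        rfsd_2023_sample = RFSD_2023.sample(fraction=0.01, seed=42)  # 1% random sample
        print(rfsd_2023_sample.shape)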

    Local File Import

    Importing in Python requires pyarrow package installed.

    import pyarrow.dataset as ds
    import polars as pl

    Read RFSD metadata from local file

    RFSD = ds.dataset("local/path/to/RFSD")

    Use RFSD.schema to glimpse the data structure and columns' classes

    print(RFSD.schema)

    Load full dataset into memory

    RFSD_full = pl.from_arrow(RFSD.to_table())

    Load only 2019 data into memory

    RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

    Load only revenue for firms in 2019, identified by taxpayer id

    RFSD_2019_revenue = pl.from_arrow( RFSD.to_table( filter=ds.field('year') == 2019, columns=['inn', 'line_2110'] ) )

    Give suggested descriptive names to variables

    renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
    RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

    R

    Local File Import

    Importing in R requires arrow package installed.

    library(arrow)
    library(data.table)

    Read RFSD metadata from local file

    RFSD <- open_dataset("local/path/to/RFSD")

    Use schema() to glimpse into the data structure and column classes

    schema(RFSD)

    Load full dataset into memory

    scanner <- Scanner$create(RFSD)
    RFSD_full <- as.data.table(scanner$ToTable())

    Load only 2019 data into memory

    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scanner <- scan_builder$Finish()
    RFSD_2019 <- as.data.table(scanner$ToTable())

    Load only revenue for firms in 2019, identified by taxpayer id

    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scan_builder$Project(cols = c("inn", "line_2110"))
    scanner <- scan_builder$Finish()
    RFSD_2019_revenue <- as.data.table(scanner$ToTable())

    Give suggested descriptive names to variables

    renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
    setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

    Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

    We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

    Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, and Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and therefore are able to source this information elsewhere.

    Firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in this year. We remove those filings.

    Why is the geolocation of firm X incorrect?

    We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

    Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

    We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

    Why is the data not in CSV?

    The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

    Version and Update Policy

    Version (SemVer): 1.0.0.

    We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

    Creative Commons Attribution 4.0 International License (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

    @unpublished{bondarkov2025rfsd, title={{R}ussian {F}inancial {S}tatements {D}atabase}, author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy}, note={arXiv preprint arXiv:2501.05841}, doi={https://doi.org/10.48550/arXiv.2501.05841}, year={2025}}

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,

  17. f

    Open data: Frequency mismatch negativity and visual load

    • su.figshare.com
    • researchdata.se
    • +1more
    pdf
    Updated Feb 23, 2021
    Cite
    Stefan Wiens; Erik van Berlekom; Malina Szychowska; Rasmus Eklund (2021). Open data: Frequency mismatch negativity and visual load [Dataset]. http://doi.org/10.17045/sthlmuni.7016369.v2
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Feb 23, 2021
    Dataset provided by
    Stockholm University
    Authors
    Stefan Wiens; Erik van Berlekom; Malina Szychowska; Rasmus Eklund
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wiens, S., van Berlekom, E., Szychowska, M., & Eklund, R. (2019). Visual Perceptual Load Does Not Affect the Frequency Mismatch Negativity. Frontiers in Psychology, 10(1970). doi:10.3389/fpsyg.2019.01970
    We manipulated visual perceptual load (high and low load) while we recorded electroencephalography. Event-related potentials (ERPs) were computed from these data.
    OSF_*.pdf contains the preregistration at the Open Science Framework (OSF): https://doi.org/10.17605/OSF.IO/EWG9X
    ERP_2019_rawdata_bdf.zip contains the raw EEG data files that were recorded with a BioSemi system (www.biosemi.com). The files can be opened in MATLAB (https://www.mathworks.com/products/matlab.html) with the FieldTrip toolbox (http://www.fieldtriptoolbox.org/).
    ERP_2019_visual_load_fieldtrip_scripts.zip contains all the MATLAB scripts that were used to process the ERP data with the FieldTrip toolbox (http://www.fieldtriptoolbox.org/).
    ERP_2019_fieldtrip_mat_*.zip contain the final, preprocessed individual data files. They can be opened with MATLAB.
    ERP_2019_visual_load_python_scripts.zip contains the Python scripts for the main task. They require Python (https://www.python.org/) and PsychoPy (http://www.psychopy.org/).
    ERP_2019_visual_load_wmc_R_scripts.zip contains the R scripts to process the working memory capacity (wmc) data (https://www.r-project.org/).
    ERP_2019_visual_load_R_scripts.zip contains the R scripts to analyze the data and the output files with figures (e.g., scatterplots) (https://www.r-project.org/).

  18. Z

    Pre-Processed Power Grid Frequency Time Series

    • data.niaid.nih.gov
    Updated Jul 15, 2021
    + more versions
    Cite
    Kruse, Johannes; Schäfer, Benjamin; Witthaut, Dirk (2021). Pre-Processed Power Grid Frequency Time Series [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3744120
    Explore at:
    Dataset updated
    Jul 15, 2021
    Dataset provided by
    School of Mathematical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
    Forschungszentrum Jülich GmbH, Institute for Energy and Climate Research - Systems Analysis and Technology Evaluation (IEK-STE), 52428 Jülich, Germany
    Authors
    Kruse, Johannes; Schäfer, Benjamin; Witthaut, Dirk
    Description

    Overview

    This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid:

    Continental Europe

    Great Britain

    Nordic

    This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.

    Data sources

    We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).

    Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows it to be re-published upon request [3].

    Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].

    Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

    Content of the repository

    A) Scripts

    In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSO's websites.

    In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).

    In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).

    The Python scripts run with Python 3.7 and with the packages found in "requirements.txt". A minimal sketch of the conversion and cleaning pattern is given below.
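
    The following sketch is illustrative only and is not taken from the repository's actual scripts; the file name, column layout, sampling rate, and plausibility thresholds are assumptions made for the example.

        # Illustrative pre-processing sketch: convert raw recordings to a regular,
        # NaN-marked series (step 1) and flag corrupted values (step 2).
        import pandas as pd

        def convert_raw_frequency(raw_csv: str) -> pd.Series:
            df = pd.read_csv(raw_csv, names=["time", "frequency_hz"])  # assumed layout
            df["time"] = pd.to_datetime(df["time"], errors="coerce")
            df = df.dropna(subset=["time"]).set_index("time")
            # Reindex to a regular 1-second grid; gaps become NaN.
            full_index = pd.date_range(df.index.min(), df.index.max(), freq="1s")
            return df["frequency_hz"].reindex(full_index)

        def clean_corrupted(freq: pd.Series) -> pd.Series:
            freq = freq.where(freq.between(45.0, 55.0))               # implausible values -> NaN
            freq = freq.where(freq.diff().abs().fillna(0) < 0.05)     # isolated jumps -> NaN
            return freq.interpolate(limit=5, limit_direction="both")  # fill only tiny holes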

    B) Yearly converted and cleansed data The folders "_converted" contain the output of "convert_data_format.py" and "_cleansed" contain the output of "clean_corrupted_data.py".

    File type: The files are zipped csv-files, where each file comprises one year.

    Data format: The files contain two columns. The first column contains the time stamps in the format Year-Month-Day Hour-Minute-Second, given as naive local time. The second column contains the frequency values in Hz. The local time refers to the following time zones and includes Daylight Saving Time (python time zone in brackets):

    TransnetBW: Continental European Time (CE)

    Nationalgrid: Great Britain (GB)

    Fingrid: Finland (Europe/Helsinki)

    NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.

    Use cases

    We point out that this repository can be used in two different ways:

    Use pre-processed data: You can directly use the converted or the cleansed data. Note, however, that both data sets include segments of NaN values due to missing and corrupted recordings. Only a very small part of the NaN values was eliminated in the cleansed data, so as not to alter the data too much.

    Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "_converted".

    License

    This work is licensed under multiple licenses, which are located in the "LICENSES" folder.

    We release the code in the folder "Scripts" under the MIT license.

    The pre-processed data in the subfolders "**/Fingrid" and "**/Nationalgrid" are licensed under CC-BY 4.0.

    TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.

    Changelog

    Version 2:

    Add time zone information to description

    Include new frequency data

    Update references

    Change folder structure to yearly folders

    Version 3:

    Correct TransnetBW files for missing data in May 2016

  19. d

    Community Geothermal: Soil Conductivity, Borehole Design, Energy Models, and...

    • catalog.data.gov
    • data.openei.org
    • +2more
    Updated Jan 20, 2025
    Cite
    GTI Energy (2025). Community Geothermal: Soil Conductivity, Borehole Design, Energy Models, and Load Data for a Residential System Development - Hinesburg, VT [Dataset]. https://catalog.data.gov/dataset/community-geothermal-soil-conductivity-borehole-design-energy-models-and-load-data-for-a-r-46d5d
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    GTI Energy
    Area covered
    Hinesburg
    Description

    This dataset contains materials from the Coalition for Community-Supported Affordable Geothermal Energy Systems (C2SAGES) project, which evaluated the techno-economic feasibility of a community geothermal system for a residential development in Hinesburg, VT. The dataset includes detailed soil conductivity test reports, energy models, borehole design reports, hourly energy loads for heating, cooling, and hot water, and design layouts. EnergyPlus was used to model building energy loads, and Modelica software was applied for geothermal loop sizing based on these loads and soil conductivity results. Python scripts for network design further refined the models. Key files include PDF reports on borehole design (with projections for 1-year, 15-year, and 30-year systems), soil conductivity test results, EnergyPlus modeling outputs, and 2D/3D design drawings in PDF, DWG, and DXF formats. Python notebooks for network design and OnePipe model files are also provided, with Modelica required for viewing certain files. Outputs and modeling data are in various formats including CSV, JPG, HTML, and IDF, with units and data clearly labeled to support understanding of system design and performance for the proposed geothermal solution.
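
    As a rough way to work with the hourly load files, here is a minimal sketch assuming pandas; the file name and the column names for the heating, cooling, and hot-water loads are hypothetical and should be checked against the dataset's labels and units.

        # Illustrative sketch: summarize hourly loads into annual energy and peak demand.
        import pandas as pd

        loads = pd.read_csv(
            "hourly_loads.csv",                  # hypothetical file name
            parse_dates=["timestamp"],           # hypothetical timestamp column
            index_col="timestamp",
        )
        for col in ["heating_kwh", "cooling_kwh", "hot_water_kwh"]:  # hypothetical columns
            annual = loads[col].sum()            # annual energy (kWh)
            peak = loads[col].max()              # peak hourly load (kWh per hour ~ kW)
            print(f"{col}: annual={annual:.0f} kWh, peak={peak:.1f} kW")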

  20. O

    Time series

    • data.open-power-system-data.org
    csv, sqlite
    Updated Jul 14, 2016
    + more versions
    Cite
    Open Power System Data (2016). Time series [Dataset]. https://data.open-power-system-data.org/time_series/2016-07-14
    Explore at:
    csv, sqliteAvailable download formats
    Dataset updated
    Jul 14, 2016
    Dataset provided by
    Open Power System Data
    Time period covered
    Dec 31, 2001 - Jul 11, 2016
    Variables measured
    timestamp, wind_DE_profile, solar_DE_profile, wind_BE_capacity, wind_BE_forecast, wind_DE_capacity, wind_DE_forecast, solar_DE_capacity, solar_DE_forecast, wind_BE_generation, and 21 more
    Description

    Load, wind and solar generation, and prices in hourly resolution. This data package contains different kinds of time series data relevant for power system modelling, namely electricity consumption (load) for 36 European countries as well as wind and solar power generation, capacities, and prices for a growing subset of countries. The time series become available at different points in time depending on the sources. The full dataset is only available from 2012 onwards. The data has been downloaded from the sources, resampled, and merged into a large CSV file with hourly resolution. Additionally, the data available at a higher resolution (some renewables in-feed, 15 minutes) is provided in a separate file. All data processing is conducted in Python and pandas and has been documented in the accompanying Jupyter notebooks.
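
    A minimal sketch of how the higher-resolution in-feed could be brought to hourly resolution with pandas is shown below; the file name and column names are hypothetical and should be matched against the variables listed above.

        # Illustrative sketch: resample 15-minute in-feed to hourly resolution.
        import pandas as pd

        ts = pd.read_csv(
            "time_series_15min.csv",             # hypothetical 15-minute file
            parse_dates=["timestamp"],
            index_col="timestamp",
        )
        # Mean over each hour is appropriate for power given in MW.
        hourly = ts[["wind_DE_generation", "solar_DE_generation"]].resample("1h").mean()
        print(hourly.head())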
