15 datasets found
  1. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Oct 29, 2024
    + more versions
    Cite
    Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
    Explore at:
Available download formats: zip
    Dataset updated
    Oct 29, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton; Andrew Felton
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 10/29/2024

This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has been revised.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

Code information:

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the
    `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and
    supporting analyses. This script generates the key figures and summary statistics
    used in the study that then get saved in the manuscript_figures folder. Note that all
    maps were produced using Python code found in the "supporting_code"" folder.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

2. Data from: Russian Financial Statements Database: A firm-level collection of...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Ledenev, Victor
    Bondarkov, Sergey
    Skougarevskiy, Dmitriy
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Russia
    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

    from datasets import load_dataset
    import polars as pl

    # This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder
    RFSD = load_dataset('irlspbru/RFSD')

    # Alternatively, this will download ~540MB with all financial statements for 2023
    # to a Polars DataFrame (requires about 8GB of RAM)
    RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

    Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.

    Local File Import

Importing in Python requires the pyarrow package to be installed.

    import pyarrow.dataset as ds
    import polars as pl

    # Read RFSD metadata from the local file
    RFSD = ds.dataset("local/path/to/RFSD")

    # Use RFSD.schema to glimpse the data structure and column classes
    print(RFSD.schema)

    # Load the full dataset into memory
    RFSD_full = pl.from_arrow(RFSD.to_table())

    # Load only 2019 data into memory
    RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

    # Load only revenue for firms in 2019, identified by taxpayer id
    RFSD_2019_revenue = pl.from_arrow(
        RFSD.to_table(
            filter=ds.field('year') == 2019,
            columns=['inn', 'line_2110']
        )
    )

    # Give suggested descriptive names to variables
    renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
    RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

    R

    Local File Import

Importing in R requires the arrow package to be installed.

    library(arrow)
    library(data.table)

    # Read RFSD metadata from the local file
    RFSD <- open_dataset("local/path/to/RFSD")

    # Use schema() to glimpse into the data structure and column classes
    schema(RFSD)

    # Load the full dataset into memory
    scanner <- Scanner$create(RFSD)
    RFSD_full <- as.data.table(scanner$ToTable())

    # Load only 2019 data into memory
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scanner <- scan_builder$Finish()
    RFSD_2019 <- as.data.table(scanner$ToTable())

    # Load only revenue for firms in 2019, identified by taxpayer id
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scan_builder$Project(cols = c("inn", "line_2110"))
    scanner <- scan_builder$Finish()
    RFSD_2019_revenue <- as.data.table(scanner$ToTable())

    # Give suggested descriptive names to variables
    renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
    setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

    We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022, and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023; Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

A firm may also have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove those filings.

    Why is the geolocation of firm X incorrect?

We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded at the house level in 2014 and 2021-2023, but only at the street level for 2015-2020 due to improper handling of the house number by Nominatim; in that case we fell back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting corrections. We will capture them in future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

We should stress that we provide unconsolidated financial statements filed according to Russian accounting standards, meaning that it would be wrong to infer the financials of corporate groups from this data. Gazprom, for instance, had over 800 affiliated entities, and studying this corporate group in its entirety requires more than the financials of the parent company.

    Why is the data not in CSV?

The data is provided in the Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query the financials of companies of interest, keeping only the variables of interest in memory and greatly reducing the data footprint.
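
As an illustration of that kind of conditional subsetting, a hedged Polars sketch is shown below; the local path is a placeholder, while 'inn' and 'line_2110' are the taxpayer identifier and revenue line used elsewhere in this description.

    import polars as pl

    # Lazily scan the 2023 partition and read only the firm identifier and revenue columns
    revenue_2023 = (
        pl.scan_parquet("local/path/to/RFSD/year=2023/*.parquet")
        .select(["inn", "line_2110"])
        .collect()
    )
    print(revenue_2023.head())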

    Version and Update Policy

    Version (SemVer): 1.0.0.

We intend to update the RFSD annually as the data becomes available, in other words when most firms have their statements filed with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. There is therefore a trade-off between data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

Creative Commons Attribution 4.0 International (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

    @unpublished{bondarkov2025rfsd,
      title={{R}ussian {F}inancial {S}tatements {D}atabase},
      author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
      note={arXiv preprint arXiv:2501.05841},
      doi={https://doi.org/10.48550/arXiv.2501.05841},
      year={2025}
    }

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,

3. ckanext-salford

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-salford [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-salford
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

The Salford extension for CKAN is designed to enhance CKAN's functionality for specific use cases, particularly the management and import of datasets relevant to Salford City Council. By incorporating custom configurations and an ETL script, this extension streamlines the integration of external data sources, especially from data.gov.uk, into a CKAN instance. It also provides a structured approach to configuring CKAN for specific data management needs.

Key Features:

• Custom Plugin Integration: Enables the addition of the 'salford' and 'esd' plugins to extend CKAN's core functionality, addressing specific data management requirements.
• Configurable Licenses Group URL: Allows administrators to specify a licenses group URL in the CKAN configuration, streamlining access to license information pertinent to the dataset.
• ETL Script for Data.gov.uk Import: Includes a Python script (etl.py) to import datasets specifically from the Salford City Council publisher on data.gov.uk.
• Non-UKLP Dataset Compatibility: The ETL script is designed to filter and import non-UKLP datasets, excluding INSPIRE datasets from the data.gov.uk import process at this time.
• Bower Component Installation: Simplifies asset management by providing instructions for installing Bower components.

Technical Integration: The Salford extension requires modifications to the CKAN configuration file (production.ini). Specifically, it involves adding salford and esd to the ckan.plugins setting, defining the licenses_group_url, and potentially configuring other custom options. The ETL script leverages the CKAN API (ckanapi) for data import. Additionally, Bower components must be installed.

Benefits & Impact: Using the Salford CKAN extension, organizations can establish a streamlined data ingestion process tailored to Salford City Council datasets, enhance data accessibility, improve asset management, and facilitate better data governance aligned with specific licensing requirements. By selectively importing datasets and offering custom plugin support, it caters to specialized data management needs.
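
Since the description states that etl.py drives imports through the CKAN API via the ckanapi library, the sketch below illustrates, in rough and hypothetical terms, what such an import step could look like. The CKAN URL, API key, dataset name, and organization id are assumptions, not values taken from the extension.

    from ckanapi import RemoteCKAN

    # Connect to the target CKAN instance (URL and API key are placeholders)
    ckan = RemoteCKAN("https://ckan.example.org", apikey="YOUR-API-KEY")

    # Create a dataset record, roughly what an ETL import step might do
    ckan.action.package_create(
        name="example-imported-dataset",       # hypothetical dataset name
        title="Example dataset imported from data.gov.uk",
        owner_org="salford-city-council",      # hypothetical organization id
        notes="Imported by a custom ETL script.",
    )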

4. F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in...

    • data.niaid.nih.gov
    Updated Jun 10, 2024
    Cite
    Kiziltan, Zeynep (2024). F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in HPC Systems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11467482
    Explore at:
    Dataset updated
    Jun 10, 2024
    Dataset provided by
    Yamamoto, Keiji
    Domke, Jens
    Antici, Francesco
    Kiziltan, Zeynep
    Bartolini, Andrea
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

F-DATA is a novel workload dataset containing data on around 24 million jobs executed on Supercomputer Fugaku over three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption, and performance metrics (e.g. #flops, memory bandwidth, operational intensity, and a memory/compute-bound label), which allows for the prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.

The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.

    F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.

The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas API, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:

    # Importing the pandas library
    import pandas as pd

    # Read the 21_01.parquet file into a dataframe
    df = pd.read_parquet("21_01.parquet")
    df.head()

5. Research data supporting "The sun always shines somewhere. The energetic...

    • repository.cam.ac.uk
    csv
    Updated Aug 21, 2024
    Cite
    Ivens, Frederick (2024). Research data supporting "The sun always shines somewhere. The energetic feasibility of a global grid with 100% renewable electricity" [Dataset]. http://doi.org/10.17863/CAM.111494
    Explore at:
Available download formats: csv (293239 bytes)
    Dataset updated
    Aug 21, 2024
    Dataset provided by
    University of Cambridge
    Apollo
    Authors
    Ivens, Frederick
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EUROPEAN ELECTRICITY DEMAND TIME SERIES (2017)

DATASET DESCRIPTION
This dataset contains the aggregated electricity demand time series for Europe in 2017, measured in megawatts (MW). The data is presented with a timestamp in Coordinated Universal Time (UTC) and the corresponding electricity demand in MW.

DATA SOURCE
The original data was obtained from the European Network of Transmission System Operators for Electricity (ENTSO-E) and can be accessed at https://www.entsoe.eu/data/power-stats/. Specifically, the data was extracted from the "MHLV_data-2015-2019.xlsx" file, which provides aggregated hourly electricity load data by country for the years 2015 to 2019.

DATA PROCESSING
The dataset was created using the following steps:

• Importing Data: The original Excel file was imported into a Python environment using the pandas library. The data was checked for completeness to ensure no missing or corrupted entries.
• Time Conversion: All timestamps in the dataset were converted to Coordinated Universal Time (UTC) to standardize the time reference across the dataset.
• Aggregation: The data for the year 2017 was extracted from the dataset. The hourly electricity load for all European countries (defined as per Yu et al (2019), see below) was summed to generate an aggregated time series representing the total electricity demand across Europe.
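
A minimal pandas sketch of these processing steps follows. The workbook's sheet layout and column names ('DateUTC', 'load_mw') are assumptions for illustration; only the file name comes from the description above.

    import pandas as pd

    # Import the ENTSO-E workbook (sheet layout and column names are assumptions)
    raw = pd.read_excel("MHLV_data-2015-2019.xlsx")

    # Standardize timestamps to UTC
    raw["timestamp_utc"] = pd.to_datetime(raw["DateUTC"], utc=True)

    # Keep 2017 and sum the hourly load across the European countries in scope
    mask = raw["timestamp_utc"].dt.year == 2017
    europe_2017 = (
        raw.loc[mask]
        .groupby("timestamp_utc")["load_mw"]
        .sum()
        .reset_index()
    )
    europe_2017.to_csv("europe_demand_2017.csv", index=False)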

FILE FORMAT
CSV Format: The dataset is stored in a CSV file with two columns:

• Timestamp (UTC): The time at which the electricity demand was recorded.
• Electricity Demand (MW): The total aggregated electricity demand for Europe in megawatts.

USAGE NOTES

• Temporal Coverage: The dataset covers the entire year of 2017 with hourly granularity.
• Geographical Coverage: The dataset aggregates data from multiple European countries, following the definition of Europe as per Yu et al (2019).

REFERENCES
J. Yu, K. Bakic, A. Kumar, A. Iliceto, L. Beleke Tabu, J. Ruaud, J. Fan, B. Cova, H. Li, D. Ernst, R. Fonteneau, M. Theku, G. Sanchis, M. Chamollet, M. Le Du, Y. Zhang, S. Chatzivasileiadis, D.-C. Radu, M. Berger, M. Stabile, F. Heymann, M. Dupré La Tour, M. Manuel de Villena Millan, and M. Ranjbar. Global electricity network - feasibility study. Technical report, CIGRE, 2019. URL https://hdl.handle.net/2268/239969. Accessed July 2024.

6. mnist

    • tensorflow.org
    • universe.roboflow.com
    • +3more
    Updated Jun 1, 2024
    Cite
    (2024). mnist [Dataset]. https://www.tensorflow.org/datasets/catalog/mnist
    Explore at:
    Dataset updated
    Jun 1, 2024
    Description

    The MNIST database of handwritten digits.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('mnist', split='train')
    for ex in ds.take(4):
     print(ex)
    

See the guide for more information on tensorflow_datasets.

Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png

  7. Electronic Sales

    • kaggle.com
    Updated Dec 19, 2023
    Cite
    Anshul Pachauri (2023). Electronic Sales [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/electronic-sales
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 19, 2023
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Anshul Pachauri
    Description

    The provided Python code is a comprehensive analysis of sales data for a business that involves the merging of monthly sales data, cleaning and augmenting the dataset, and performing various analytical tasks. Here's a breakdown of the code:

Data Preparation and Merging: The code begins by importing necessary libraries and filtering out warnings. It merges sales data from 12 months into a single file named "all_data.csv".

Data Cleaning: Rows with NaN values are dropped, and any entries starting with 'Or' in the 'Order Date' column are removed. Columns like 'Quantity Ordered' and 'Price Each' are converted to numeric types for further analysis.

Data Augmentation: Additional columns such as 'Month', 'Sales', and 'City' are added to the dataset. The 'City' column is derived from the 'Purchase Address' column.

Analysis: Several analyses are conducted, answering questions such as:
• The best month for sales and total earnings.
• The city with the highest number of sales.
• The ideal time for advertisements based on the number of orders per hour.
• Products that are often sold together.
• The best-selling products and their correlation with price.

Visualization: Bar charts and line plots are used for visualizing the analysis results, making it easier to interpret trends and patterns. Matplotlib is employed for creating visualizations.

Summary: The code concludes with a comprehensive visualization that combines the quantity ordered and average price for each product, shedding light on product performance. This code is structured to offer insights into sales patterns, customer behavior, and product performance, providing valuable information for strategic decision-making in the business.
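
A hedged sketch of the preparation and cleaning steps described above is shown below. The monthly file pattern is an assumption; the column names ('Order Date', 'Quantity Ordered', 'Price Each', 'Purchase Address') come from the description.

    import glob
    import pandas as pd

    # Merge the 12 monthly sales files into a single CSV (file pattern is an assumption)
    monthly = [pd.read_csv(path) for path in sorted(glob.glob("Sales_Data/Sales_*.csv"))]
    all_data = pd.concat(monthly, ignore_index=True)
    all_data.to_csv("all_data.csv", index=False)

    # Drop NaN rows and stray header rows whose 'Order Date' starts with 'Or'
    all_data = all_data.dropna()
    all_data = all_data[~all_data["Order Date"].str.startswith("Or", na=False)]

    # Convert quantity and price to numeric types, then derive Month, Sales, and City
    all_data["Quantity Ordered"] = pd.to_numeric(all_data["Quantity Ordered"])
    all_data["Price Each"] = pd.to_numeric(all_data["Price Each"])
    all_data["Month"] = pd.to_datetime(all_data["Order Date"]).dt.month
    all_data["Sales"] = all_data["Quantity Ordered"] * all_data["Price Each"]
    all_data["City"] = all_data["Purchase Address"].str.split(",").str[1].str.strip()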

  8. HURRECON Model for Estimating Hurricane Wind Speed, Direction, and Damage (R...

    • search.dataone.org
    • portal.edirepository.org
    Updated Feb 14, 2024
    + more versions
    Cite
    Emery Boose (2024). HURRECON Model for Estimating Hurricane Wind Speed, Direction, and Damage (R and Python) [Dataset]. https://search.dataone.org/view/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fknb-lter-hfr%2F446%2F1
    Explore at:
    Dataset updated
    Feb 14, 2024
    Dataset provided by
Long Term Ecological Research Network (http://www.lternet.edu/)
    Authors
    Emery Boose
    Area covered
    Earth
    Description

    The HURRECON model estimates wind speed, wind direction, enhanced Fujita scale wind damage, and duration of EF0 to EF5 winds as a function of hurricane location and maximum sustained wind speed. Results may be generated for a single site or an entire region. Hurricane track and intensity data may be imported directly from the US National Hurricane Center's HURDAT2 database. HURRECON is available in R and Python. The R version is available on CRAN as HurreconR. The model is an updated version of the original HURRECON model written in Borland Pascal for use with Idrisi (see HF025). New features include support for: (1) estimating wind damage on the enhanced Fujita scale, (2) importing hurricane track and intensity data directly from HURDAT2, (3) creating a land-water file with user-selected geographic coordinates and spatial resolution, and (4) creating plots of site and regional results. The model equations for estimating wind speed and direction, including parameter values for inflow angle, friction factor, and wind gust factor (over land and water), are unchanged from the original HURRECON model. For more details and sample datasets, see the project website on GitHub (https://github.com/hurrecon-model).

9. Linux Kernel binary size

    • data.niaid.nih.gov
    Updated Jun 14, 2021
    Cite
    Hugo MARTIN (2021). Linux Kernel binary size [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4943883
    Explore at:
    Dataset updated
    Jun 14, 2021
    Dataset provided by
    Mathieu ACHER
    Hugo MARTIN
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Dataset containing measurements of the Linux kernel binary size after compilation. The reported size, in the column "perf", is the size in bytes of the vmlinux file. It also contains a column "active_options" reporting the number of activated options (set to "y"). All other columns, listed in the file "Linux_options.json", are Linux kernel options. The sampling was done using randconfig. The version of Linux used is 4.13.3.

Not all available options are present. First, the dataset only covers options for the x86, 64-bit version. Then, all non-tristate options have been ignored. Finally, options that do not take multiple values across the whole dataset, due to insufficient variability in the sampling, are ignored. All options are encoded as 0 for the "n" and "m" option values, and 1 for "y".

In Python, importing the dataset using pandas will assign all columns the int64 type, which leads to very high memory consumption (~50GB). The following snippet imports the dataset using less than 1 GB of memory by setting the option columns to int8:

    import pandas as pd
    import json
    import numpy

    with open("Linux_options.json", "r") as f:
        linux_options = json.load(f)

    # Load the csv, setting option columns to int8 to save a lot of memory
    df = pd.read_csv("Linux.csv", dtype={opt: numpy.int8 for opt in linux_options})

10. tiny_shakespeare

    • tensorflow.org
    • huggingface.co
    Updated Feb 11, 2023
    Cite
    (2023). tiny_shakespeare [Dataset]. https://www.tensorflow.org/datasets/catalog/tiny_shakespeare
    Explore at:
    Dataset updated
    Feb 11, 2023
    Description

    40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.

    To use for e.g. character modelling:

    import tensorflow as tf
    import tensorflow_datasets as tfds

    d = tfds.load(name='tiny_shakespeare')['train']
    d = d.map(lambda x: tf.strings.unicode_split(x['text'], 'UTF-8'))
    # train split includes vocabulary for other splits
    vocabulary = sorted(set(next(iter(d)).numpy()))
    d = d.map(lambda x: {'cur_char': x[:-1], 'next_char': x[1:]})
    d = d.unbatch()
    seq_len = 100
    batch_size = 2
    d = d.batch(seq_len)
    d = d.batch(batch_size)

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('tiny_shakespeare', split='train')
    for ex in ds.take(4):
     print(ex)
    

See the guide for more information on tensorflow_datasets.

  11. Galaxy Training Material for the 'Use Jupyter notebooks in Galaxy' tutorial

    • zenodo.org
    csv
    Updated Apr 22, 2025
    Cite
    Delphine Lariviere; Delphine Lariviere; Teresa Müller; Teresa Müller (2025). Galaxy Training Material for the 'Use Jupyter notebooks in Galaxy' tutorial [Dataset]. http://doi.org/10.5281/zenodo.15263830
    Explore at:
Available download formats: csv
    Dataset updated
    Apr 22, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Delphine Lariviere; Delphine Lariviere; Teresa Müller; Teresa Müller
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was originally curated by Software Carpentry, a branch of The Carpentries non-profit organization, and is based on data from the Gapminder Foundation. It consists of six tabular CSV files containing GDP data for various countries across different years. The dataset was initially prepared for the Software Carpentry tutorial "Plotting and Programming in Python" and is also reused in the Galaxy Training Network (GTN) tutorial "Use Jupyter Notebooks in Galaxy."

    This GTN tutorial provides an introduction to launching a Jupyter Notebook in Galaxy, installing dependencies, and importing and exporting data. It serves as a setup guide for a Jupyter Notebook environment that can be used to follow the Software Carpentry tutorial "Plotting and Programming in Python."
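
For orientation, a minimal sketch of loading one of these GDP tables in a Jupyter notebook is shown below. The file name and the 'country' index column follow the Software Carpentry convention and are assumptions, not taken from this record.

    import pandas as pd

    # Load one of the Gapminder GDP tables (file name is an assumption)
    gdp = pd.read_csv("gapminder_gdp_europe.csv", index_col="country")

    # Quick look at GDP per capita across years
    print(gdp.head())
    print(gdp.loc["Netherlands"])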

  12. Data from: Field measurement data of hydrodynamic and morphological...

    • zenodo.org
    zip
    Updated Jan 2, 2025
    Cite
    Rik Gijsman; Rik Gijsman; Sabine Engel; Daphne van der Wal; Daphne van der Wal; Rob van Zee; Jessica Johnson; Matthijs van der Geest; Matthijs van der Geest; Kathelijne Wijnberg; Kathelijne Wijnberg; Erik Horstman; Erik Horstman; Sabine Engel; Rob van Zee; Jessica Johnson (2025). Field measurement data of hydrodynamic and morphological processes in the mangrove forest of Lac Bay, Bonaire, Caribbean Netherlands [Dataset]. http://doi.org/10.5281/zenodo.13904523
    Explore at:
Available download formats: zip
    Dataset updated
    Jan 2, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Rik Gijsman; Rik Gijsman; Sabine Engel; Daphne van der Wal; Daphne van der Wal; Rob van Zee; Jessica Johnson; Matthijs van der Geest; Matthijs van der Geest; Kathelijne Wijnberg; Kathelijne Wijnberg; Erik Horstman; Erik Horstman; Sabine Engel; Rob van Zee; Jessica Johnson
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Lac Bay, Bonaire, Caribbean Netherlands
    Description

    31-10-2024, Rik Gijsman

    _
Dataset of field measurements of hydrodynamic and morphological processes in the mangrove forest of Lac Bay, Bonaire, Caribbean Netherlands.

    _
    For more information please see scientific publication:

    Gijsman, R., Engel, S., van der Wal, D., van Zee, R., Johnson, J., van der Geest, M., Wijnberg, K.M. and Horstman, E.M. (2024). The Importance of Tidal Creeks for Mangrove Survival on Small Oceanic Islands. Unpublished Manuscript.

    _

    Dataset contains:

    1. Timeseries data from field instruments:
      1. Atmospheric pressure
      2. Water depth
      3. Water level
      4. Flow velocity
      5. Waves
      6. Turbidity
      7. Rainfall
      8. Temperature
      9. Wind
      10. Tidal creek flows
      11. Tidal creek transport
      12. Lagoon suspension
    2. Mapped data of field surveys:
      1. Bathymetry and creek profiles
      2. Sediment characteristics
    3. Python script for importing and plotting data
    4. Overview map with shapefiles of instrument location coordinates

    _
    Additional notes for your information:

    • Timeseries data stored in .txt files
    • Visualization of data in .txt files provided with accompanying .png files
• Timeseries data of the Hobo sensors, rain gauge, and KNMI station interpolated between 19-01-2022 00:00:00 and 18-05-2022 23:55:00 (120 days) at 5-minute intervals
• Timeseries data of the other sensors interpolated between 19-01-2022 00:00:00 and 08-03-2022 23:55:00 (49 days) at 5-minute intervals
    • Column headers show instrument station abbreviation
    • Datetime in local timezone of Bonaire (GMT-4)
• When a property (e.g., water depth) was measured by different instruments, the instrument used is mentioned in the file name, e.g. "water_depth_hobo.txt" (a minimal import sketch for such files follows these notes)
    • Measurements from KNMI (The Royal Netherlands Meteorological Institute) station 990 were included in the dataset and indicated with '_knmi' in the file name. Data obtained from 'https://www.knmi.nl/nederland-nu/klimatologie/uurgegevens_Caribisch_gebied'
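
A minimal import sketch for one of the timeseries files is shown below. The separator, column names, and datetime format are assumptions; only the file name and the GMT-4 timezone come from the notes above.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Read a water-depth timeseries (separator and column names are assumptions)
    depth = pd.read_csv("water_depth_hobo.txt", sep=r"\s+", parse_dates=["datetime"])

    # Timestamps are in Bonaire local time (GMT-4); make the timezone explicit
    depth["datetime"] = depth["datetime"].dt.tz_localize("Etc/GMT+4")
    depth = depth.set_index("datetime")

    # One column per instrument station; plot them all for a quick visual check
    depth.plot()
    plt.show()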

13. qm9

    • tensorflow.org
    Updated Dec 11, 2024
    Cite
    (2024). qm9 [Dataset]. http://doi.org/10.6084/m9.figshare.c.978904.v5
    Explore at:
    Dataset updated
    Dec 11, 2024
    Description

    QM9 consists of computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of C, H, O, N, and F. As usual, we remove the uncharacterized molecules and provide the remaining 130,831.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('qm9', split='train')
    for ex in ds.take(4):
     print(ex)
    

See the guide for more information on tensorflow_datasets.

  14. Apple Leaf Disease Detection Using Vision Transformer

    • zenodo.org
    text/x-python
    Updated Jun 20, 2025
    Cite
    Amreen Batool; Amreen Batool (2025). Apple Leaf Disease Detection Using Vision Transformer [Dataset]. http://doi.org/10.5281/zenodo.15702007
    Explore at:
Available download formats: text/x-python
    Dataset updated
    Jun 20, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Amreen Batool; Amreen Batool
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.

    Table of Contents

    Introduction

    The goal of this project is to classify apple leaf diseases using a Vision Transformer (ViT) model. The dataset is divided into four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.

    Code Explanation

    1. Importing Libraries

    • The script starts by importing necessary libraries such as matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.

    2. Visualizing the Dataset

    • The walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class.
    • The dataset is divided into Train, Val, and Test directories, each containing subdirectories for the four classes.

    3. Data Augmentation

    • The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.
• Separate generators are created for the training, validation, and test datasets (a minimal sketch of this step and of the patch extraction step follows this walkthrough).

    4. Patch Visualization

    • The script defines a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer.
    • The script visualizes these patches for different patch sizes (32x32, 16x16, 8x8) to understand how the image is divided.

    5. Model Training

    • The script defines a Vision Transformer (ViT) model using TensorFlow and Keras. The model is compiled with the Adam optimizer and categorical cross-entropy loss.
    • The model is trained for a specified number of epochs, and the training history is stored for later analysis.

    6. Model Evaluation

    • After training, the model is evaluated on the test dataset. The script generates a confusion matrix and a classification report to assess the model's performance.
    • The confusion matrix is visualized using seaborn to provide a clear understanding of the model's predictions.

    7. Visualizing Misclassified Images

    • The script includes functionality to visualize misclassified images, which helps in understanding where the model is making errors.

    8. Fine-Tuning and Learning Rate Adjustment

    • The script demonstrates how to fine-tune the model by adjusting the learning rate and re-training the model.
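
The walkthrough above does not reproduce the script itself, so here is a minimal, hedged sketch of two of the steps it describes: the Keras ImageDataGenerator augmentation and a Patches layer built on tf.image.extract_patches. The directory name, image size, patch size, and augmentation parameters are assumptions; the model definition, training, and evaluation steps are omitted.

    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    IMG_SIZE = 224     # assumed input size
    PATCH_SIZE = 16    # one of the patch sizes discussed above

    # Step 3: augmentation with rescaling, rotation, and horizontal flips (training data)
    train_gen = ImageDataGenerator(rescale=1.0 / 255, rotation_range=20, horizontal_flip=True)
    train_ds = train_gen.flow_from_directory(
        "Train",                                # assumed layout: Train/<class name>/*.jpg
        target_size=(IMG_SIZE, IMG_SIZE),
        batch_size=32,
        class_mode="categorical",
    )

    # Step 4: a Patches layer that cuts each image into PATCH_SIZE x PATCH_SIZE tiles,
    # the unit a Vision Transformer processes instead of whole images
    class Patches(tf.keras.layers.Layer):
        def __init__(self, patch_size):
            super().__init__()
            self.patch_size = patch_size

        def call(self, images):
            patches = tf.image.extract_patches(
                images=images,
                sizes=[1, self.patch_size, self.patch_size, 1],
                strides=[1, self.patch_size, self.patch_size, 1],
                rates=[1, 1, 1, 1],
                padding="VALID",
            )
            batch = tf.shape(images)[0]
            return tf.reshape(patches, [batch, -1, patches.shape[-1]])

    # Quick check: one batch of 224x224 RGB images yields patches of shape (32, 196, 768)
    images, labels = next(train_ds)
    print(Patches(PATCH_SIZE)(images).shape)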

    Steps for Implementation

    1. Dataset Preparation

      • Ensure that the dataset is organized into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
    2. Install Required Libraries

      • Install the necessary Python libraries using pip:
        pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
    3. Run the Script

      • Execute the script in a Python environment. The script will automatically:
        • Load and preprocess the dataset.
        • Apply data augmentation.
        • Train the Vision Transformer model.
        • Evaluate the model and generate performance metrics.
    4. Analyze Results

      • Review the confusion matrix and classification report to understand the model's performance.
      • Visualize misclassified images to identify potential areas for improvement.
    5. Fine-Tuning

      • Experiment with different patch sizes, learning rates, and data augmentation techniques to improve the model's accuracy.
15. curated_breast_imaging_ddsm

    • tensorflow.org
    Updated Jun 1, 2024
    + more versions
    Cite
    (2024). curated_breast_imaging_ddsm [Dataset]. https://www.tensorflow.org/datasets/catalog/curated_breast_imaging_ddsm
    Explore at:
    Dataset updated
    Jun 1, 2024
    Description

    The CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information.

    The default config is made of patches extracted from the original mammograms, following the description from (http://arxiv.org/abs/1708.09427), in order to frame the task to solve in a traditional image classification setting.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('curated_breast_imaging_ddsm', split='train')
    for ex in ds.take(4):
     print(ex)
    

See the guide for more information on tensorflow_datasets.

Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/curated_breast_imaging_ddsm-patches-3.0.0.png
