70 datasets found
  1. Data from: Phospho-seq: Integrated, multi-modal profiling of intracellular...

    • zenodo.org
    application/gzip, bin
    Updated Apr 17, 2023
    Cite
    John D Blair; Austin Hartman; Fides Zenk; Carol Dalgarno; Barbara Treutlein; Rahul Satija (2023). Phospho-seq: Integrated, multi-modal profiling of intracellular protein dynamics in single cells [Dataset]. http://doi.org/10.5281/zenodo.7754315
    Explore at:
    Available download formats: application/gzip, bin
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    John D Blair; Austin Hartman; Fides Zenk; Carol Dalgarno; Barbara Treutlein; Rahul Satija
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets to go along with the publication listed:

    full_object.rds: Brain Organoid Phospho-Seq dataset with ATAC, Protein and imputed RNA data

    rna_object.rds: Reference whole cell scRNA-Seq object on Brain organoids

    multiome_object.rds: Bridge dataset containing RNA and ATAC modalities for Brain organoids

    metacell_allnorm.rds: Metacell object for finding gene-peak-protein linkages in Brain organoid dataset

    fullobject_fragments.tsv.gz: fragment file to go with the full object

    fullobject_fragments.tsv.gz.tbi: index file for the full object fragment file

    multiome_fragments.tsv.gz: fragment file to go with the multiome object

    multiome_fragments.tsv.gz.tbi: index file for the multiome object fragment file

    K562_Stem.rds: object corresponding to the pilot experiment including K562 cells and iPS cells

    K562_stem_fragments.tsv.gz: fragment file to go with the K562_stem object

    K562_stem_fragments.tsv.gz.tbi: index file for the K562_stem object fragment file

    To use the K562 and multiome datasets provided, please use these lines of code to import the object into Signac/Seurat and change the fragment file path to the corresponding downloaded fragment file:

    obj <- readRDS("obj.rds")
    # remove fragment file information
    Fragments(obj) <- NULL
    # Update the path of the fragment file 
    Fragments(obj) <- CreateFragmentObject(path = "download/obj_fragments.tsv.gz", cells = Cells(obj))

    To use the "fullobject" dataset provided, please use these lines of code to import the object into Signac/Seurat and change the fragment file path to the corresponding downloaded fragment file:

    #load the stringr package
    library(stringr)
    #load the object
    obj <- readRDS("obj.rds")
    # remove fragment file information
    Fragments(obj) <- NULL
    #Remove unwanted residual information and rename cells
    obj@reductions$norm.adt.pca <- NULL
    obj@reductions$norm.pca <- NULL
    obj <- RenameCells(obj, new.names = str_remove(Cells(obj), "atac_"))
    # Update the path of the fragment file 
    Fragments(obj) <- CreateFragmentObject(path = "download/obj_fragments.tsv.gz", cells = Cells(obj))

  2. Titanic-json-format

    • kaggle.com
    zip
    Updated Sep 21, 2025
    Cite
    Abdul Basit AI (2025). Titanic-json-format [Dataset]. https://www.kaggle.com/datasets/engrbasit62/titanic-json-format
    Explore at:
    Available download formats: zip (33844 bytes)
    Dataset updated
    Sep 21, 2025
    Authors
    Abdul Basit AI
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🛳️ Titanic Dataset (JSON Format)

    📌 Overview

    This is the classic Titanic: Machine Learning from Disaster dataset, converted into JSON format for easier use in APIs, data pipelines, and Python projects. It contains the same passenger details as the original CSV version, but stored as JSON for convenience.

    📂 Dataset Contents

    File: titanic.json

    Columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

    Use Cases: Exploratory Data Analysis (EDA), feature engineering, machine learning model training, web app backends, JSON parsing practice.

    🛠️ How to Use

    🔹 1. Load with kagglehub

    import kagglehub

    # Download the latest version of the dataset
    path = kagglehub.dataset_download("engrbasit62/titanic-json-format")
    print("Path to dataset files:", path)

    🔹 2. Load into Pandas

    import pandas as pd

    # Read the JSON file into a DataFrame
    df = pd.read_json(f"{path}/titanic.json")
    print(df.head())

    💡 Notes

    Preview truncation: Kaggle may show only part of the JSON in the preview panel because of its size. ✅ Don’t worry — the full dataset is available when loaded via code.

    Benefits of JSON format: Ideal for web apps, APIs, or projects that work with structured data. Easily convertible back to CSV if needed.
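
    For instance, a minimal sketch of converting the JSON back to CSV (reusing the kagglehub download shown above; the output filename is an arbitrary choice):

    import kagglehub
    import pandas as pd

    # Download (cached after the first call) and load the JSON file
    path = kagglehub.dataset_download("engrbasit62/titanic-json-format")
    df = pd.read_json(f"{path}/titanic.json")

    # Write a CSV copy for tools that expect the classic format
    df.to_csv("titanic.csv", index=False)
    print("Wrote titanic.csv with", len(df), "rows")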

  3. celeba

    • huggingface.co
    • datasetninja.com
    • +3more
    Updated May 13, 2025
    Cite
    Yuehao Wang (2025). celeba [Dataset]. https://huggingface.co/datasets/Yuehao/celeba
    Explore at:
    Dataset updated
    May 13, 2025
    Authors
    Yuehao Wang
    Description

    CelebA dataset

    A copy of the CelebA dataset: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

      How to use
    

    Download data

    huggingface-cli download --local-dir /path/to/datasets/celeba --repo-type dataset Yuehao/celeba
    unzip /path/to/datasets/celeba/img_align_celeba.zip -d /path/to/datasets/celeba

    Load data via torchvision.datasets.CelebA

    torchvision.datasets.CelebA(root='/path/to/datasets')
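
    A slightly fuller sketch of the same call, assuming the archive was unpacked under /path/to/datasets/celeba as above (the split, target_type, and transform arguments are standard torchvision options, not part of this dataset card):

    import torchvision
    from torchvision import transforms

    # root is the parent folder that contains the celeba/ directory created above
    celeba = torchvision.datasets.CelebA(
        root="/path/to/datasets",
        split="train",
        target_type="attr",
        transform=transforms.ToTensor(),
        download=False,  # files were already fetched with huggingface-cli
    )

    image, attributes = celeba[0]
    print(image.shape, attributes.shape)  # image tensor plus 40 binary attributes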

  4. Data from: Russian Financial Statements Database: A firm-level collection of...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    European University at St. Petersburg
    Authors
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

    The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format (Apache Parquet) with a yearly partitioning scheme, enabling end-users to query only variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

    You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

    from datasets import load_dataset
    import polars as pl

    # This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder
    RFSD = load_dataset('irlspbru/RFSD')

    # Alternatively, this will download ~540MB with all financial statements for 2023
    # to a Polars DataFrame (requires about 8GB of RAM)
    RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

    Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.
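
    If a random sample is needed, one workaround (a sketch, not part of the official instructions) is to load the year of interest and sample it in memory with Polars:

    import polars as pl

    # Load the 2023 partition as above, then draw a reproducible random sample
    RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
    RFSD_2023_sample = RFSD_2023.sample(n=100_000, seed=42)
    print(RFSD_2023_sample.shape)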

    Local File Import

    Importing in Python requires the pyarrow package installed.

    import pyarrow.dataset as ds
    import polars as pl

    # Read RFSD metadata from local file
    RFSD = ds.dataset("local/path/to/RFSD")

    # Use RFSD.schema to glimpse the data structure and columns' classes
    print(RFSD.schema)

    # Load full dataset into memory
    RFSD_full = pl.from_arrow(RFSD.to_table())

    # Load only 2019 data into memory
    RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

    # Load only revenue for firms in 2019, identified by taxpayer id
    RFSD_2019_revenue = pl.from_arrow(
        RFSD.to_table(
            filter=ds.field('year') == 2019,
            columns=['inn', 'line_2110']
        )
    )

    # Give suggested descriptive names to variables
    renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
    RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

    R

    Local File Import

    Importing in R requires the arrow package installed.

    library(arrow)
    library(data.table)

    # Read RFSD metadata from local file
    RFSD <- open_dataset("local/path/to/RFSD")

    # Use schema() to glimpse into the data structure and column classes
    schema(RFSD)

    # Load full dataset into memory
    scanner <- Scanner$create(RFSD)
    RFSD_full <- as.data.table(scanner$ToTable())

    # Load only 2019 data into memory
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scanner <- scan_builder$Finish()
    RFSD_2019 <- as.data.table(scanner$ToTable())

    # Load only revenue for firms in 2019, identified by taxpayer id
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scan_builder$Project(cols = c("inn", "line_2110"))
    scanner <- scan_builder$Finish()
    RFSD_2019_revenue <- as.data.table(scanner$ToTable())

    # Give suggested descriptive names to variables
    renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
    setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

    Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

    We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

    Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, and Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

    A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove those filings.

    Why is the geolocation of firm X incorrect?

    We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

    Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

    We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

    Why is the data not in CSV?

    The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

    Version and Update Policy

    Version (SemVer): 1.0.0.

    We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is an obvious trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

    Creative Commons License Attribution 4.0 International (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

    @unpublished{bondarkov2025rfsd,
      title={{R}ussian {F}inancial {S}tatements {D}atabase},
      author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
      note={arXiv preprint arXiv:2501.05841},
      doi={https://doi.org/10.48550/arXiv.2501.05841},
      year={2025}
    }

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,

  5. The codes and data for "A Graph Convolutional Neural Network-based Method...

    • figshare.com
    txt
    Updated Jan 14, 2025
    Cite
    FirstName LastName (2025). The codes and data for "A Graph Convolutional Neural Network-based Method for Predicting Computational Intensity of Geocomputation" [Dataset]. http://doi.org/10.6084/m9.figshare.28200623.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 14, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    FirstName LastName
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Graph Convolutional Neural Network-based Method for Predicting Computational Intensity of Geocomputation

    This is the implementation for the paper "A Graph Convolutional Neural Network-based Method for Predicting Computational Intensity of Geocomputation". The framework is the Learning-based Computing Framework for Geospatial data (LCF-G).

    This paper includes three case studies, each corresponding to a folder. Each folder contains four subfolders: data, CIPrediction, ParallelComputation and SampleGeneration.

    • The data folder contains geospatial data.
    • The CIPrediction folder contains model training code.
    • The ParallelComputation folder contains geographic computation code.
    • The SampleGeneration folder contains code for sample generation.

    Case 1: Generation of DEM from point cloud data

    Step 1: Data download. Dataset 1 has been uploaded to the directory 1point2dem/data. The other two datasets, Dataset 2 and Dataset 3, can be downloaded from OpenTopography. Below are the steps for downloading Dataset 2 and Dataset 3, along with the query parameters.

    Dataset 2:
    • Visit the OpenTopography website: https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.112018.2193.1
    • Coordinates & Classification: in section "1. Coordinates & Classification", select "Manually enter selection coordinates" and set Xmin = 1372495.692761, Ymin = 5076006.86821, Xmax = 1378779.529766, Ymax = 5085586.39531.
    • Point Cloud Data Download: under section "2. Point Cloud Data Download", choose "Point cloud data in LAS format".
    • Submit: click "SUBMIT" to initiate the download.

    Dataset 3:
    • Visit the OpenTopography website: https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.052016.26912.1
    • Coordinates & Classification: in section "1. Coordinates & Classification", select "Manually enter selection coordinates" and set Xmin = 470047.153826, Ymin = 4963418.512121, Xmax = 479547.16556, Ymax = 4972078.92768.
    • Point Cloud Data Download: under section "2. Point Cloud Data Download", choose "Point cloud data in LAS format".
    • Submit: click "SUBMIT" to initiate the download.

    Step 2: Sample generation. This step involves data preparation, and samples can be generated using the provided code. Since the samples have already been uploaded to 1point2dem/SampleGeneration/data, this step is optional.

    cd 1point2dem/SampleGeneration
    g++ PointCloud2DEMSampleGeneration.cpp -o PointCloud2DEMSampleGeneration
    mpiexec -n {number_processes} ./PointCloud2DEMSampleGeneration ../data/pcd path/to/output

    Step 3: Model training. This step involves training three models (GCN, ChebNet, GATNet). The model results are saved in 1point2dem/SampleGeneration/result, and the results for Table 3 in the paper are derived from this output.

    cd 1point2dem/CIPrediction
    python -u point_prediction.py --model [GCN|ChebNet|GATNet]

    Step 4: Parallel computation. This step uses the trained models to optimize parallel computation. The results for Figures 11-13 in the paper are generated from the output of this command.

    cd 1point2dem/ParallelComputation
    g++ ParallelPointCloud2DEM.cpp -o ParallelPointCloud2DEM
    mpiexec -n {number_processes} ./ParallelPointCloud2DEM ../data/pcd

    Case 2: Spatial intersection of vector data

    Step 1: Data download. Some data from the paper has been uploaded to 2intersection/data. The remaining OSM data can be downloaded from GeoFabrik: GeoFabrik - Czech Republic OSM Data.

    Step 2: Sample generation. This step involves data preparation, and samples can be generated using the provided code. Since the samples have already been uploaded to 2intersection/SampleGeneration/data, this step is optional.

    cd 2intersection/SampleGeneration
    g++ ParallelIntersection.cpp -o ParallelIntersection
    mpiexec -n {number_processes} ./ParallelIntersection ../data/shpfile ../data/shpfile

    Step 3: Model training. This step involves training three models (GCN, ChebNet, GATNet). The model results are saved in 2intersection/SampleGeneration/result, and the results for Table 5 in the paper are derived from this output.

    cd 2intersection/CIPrediction
    python -u vector_prediction.py --model [GCN|ChebNet|GATNet]

    Step 4: Parallel computation. This step uses the trained models to optimize parallel computation. The results for Figures 14-16 in the paper are generated from the output of this command.

    cd 2intersection/ParallelComputation
    g++ ParallelIntersection.cpp -o ParallelIntersection
    mpiexec -n {number_processes} ./ParallelIntersection ../data/shpfile1 ../data/shpfile2

    Case 3: WOfS analysis using raster data

    Step 1: Data download. Some data from the paper has been uploaded to 3wofs/data. The remaining data can be downloaded from http://openge.org.cn/advancedRetrieval?type=dataset. Query parameters: Product Selection: LC08_L1TP and LC08_L1GT; Minimum Longitude: 112.5, Maximum Longitude: 115.5, Minimum Latitude: 29.5, Maximum Latitude: 31.5; Time Range: 2013-01-01 to 2018-12-31; Other parameters: default.

    Step 2: Sample generation. This step involves data preparation, and samples can be generated using the provided code. Since the samples have already been uploaded to 3wofs/SampleGeneration/data, this step is optional.

    cd 3wofs/SampleGeneration
    sbt package
    spark-submit --master {host1,host2,host3} --class whu.edu.cn.core.cube.raster.WOfSSampleGeneration path/to/package.jar

    Step 3: Model training. This step involves training three models (GCN, ChebNet, GATNet). The model results are saved in 3wofs/SampleGeneration/result, and the results for Table 6 in the paper are derived from this output.

    cd 3wofs/CIPrediction
    python -u raster_prediction.py --model [GCN|ChebNet|GATNet]

    Step 4: Parallel computation. This step uses the trained models to optimize parallel computation. The results for Figures 18 and 19 in the paper are generated from the output of this command.

    cd 3wofs/ParallelComputation
    sbt package
    spark-submit --master {host1,host2,host3} --class whu.edu.cn.core.cube.raster.WOfSOptimizedByDL path/to/package.jar path/to/output

    Statement about Case 3

    The experiment in Case 3 presented in this paper was conducted with improvements made on the GeoCube platform.

    Code Name: GeoCube
    Code Link: GeoCube Source Code
    License Information: The GeoCube project is openly available under the CC BY 4.0 license (Creative Commons Attribution 4.0 International), allowing anyone to freely share, modify, and distribute the platform's code.
    Citation: Gao, Fan (2022). A multi-source spatio-temporal data cube for large-scale geospatial analysis. figshare. Software. https://doi.org/10.6084/m9.figshare.15032847.v1
    Clarification Statement: The authors of GeoCube are not affiliated with this manuscript. The innovations and steps in Case 3, including data download, sample generation, and parallel computation optimization, were independently developed and are not dependent on GeoCube's code.

    Requirements

    The code uses the following dependencies with Python 3.8:

    torch==2.0.0
    torch_geometric==2.5.3
    networkx==2.6.3
    pyshp==2.3.1
    tensorrt==8.6.1
    matplotlib==3.7.2
    scipy==1.10.1
    scikit-learn==1.3.0
    geopandas==0.13.2

  6. character-llm-data

    • huggingface.co
    Updated Jun 8, 2024
    Cite
    OpenMOSS (2024). character-llm-data [Dataset]. https://huggingface.co/datasets/OpenMOSS-Team/character-llm-data
    Explore at:
    Dataset updated
    Jun 8, 2024
    Dataset authored and provided by
    OpenMOSS
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Character-LLM: A Trainable Agent for Role-Playing

    These are the training datasets for Character-LLM, containing the experience data of nine characters used to train Character-LLMs. To download the dataset, please run the following code with Python; you can find the downloaded data in /path/to/local_dir. from huggingface_hub import snapshot_download snapshot_download( local_dir_use_symlinks=True, repo_type="dataset", repo_id="fnlp/character-llm-data"… See the full description on the dataset page: https://huggingface.co/datasets/OpenMOSS-Team/character-llm-data.
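
    The download snippet in the card is truncated above; a minimal sketch of a complete call (the local_dir value is an assumption; substitute your own path):

    from huggingface_hub import snapshot_download

    # Fetch the full dataset repository into a local directory of your choice
    snapshot_download(
        repo_id="fnlp/character-llm-data",
        repo_type="dataset",
        local_dir="/path/to/local_dir",      # assumption: any writable path
        local_dir_use_symlinks=True,
    )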

  7. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on detalization level (see Step 2 for more details):
    - up to 2Tb of disk space (see Step 2 detalization levels)
    - at least 16Gb of RAM (64 preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  8. triviaqa-verified

    • huggingface.co
    Cite
    Yair Feldman, triviaqa-verified [Dataset]. https://huggingface.co/datasets/yairfeldman/triviaqa-verified
    Explore at:
    Authors
    Yair Feldman
    Description

    This is the verified subset of the original TriviaQA dataset (https://nlp.cs.washington.edu/triviaqa/). Steps to reproduce:

    Download:

    wget https://nlp.cs.washington.edu/triviaqa/data/triviaqa-rc.tar.gz

    Extract:

    pv triviaqa-rc.tar.gz | tar -xz

    Process:

    from pathlib import Path
    from tqdm.auto import tqdm
    import pandas as pd
    from typing import NamedTuple
    import json

    triviaqa_base_dir = Path("
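
    The processing snippet above is truncated at the source. As a hedged sketch of one way it could continue, assuming the tarball was extracted into ./triviaqa-rc and that the verified QA records sit in a JSON file under its qa/ directory (the exact filename and the top-level "Data" key are assumptions; check the extracted archive):

    from pathlib import Path
    import json
    import pandas as pd

    triviaqa_base_dir = Path("./triviaqa-rc")  # assumption: extraction target

    # List the JSON files the archive actually contains before committing to a schema
    for p in sorted(triviaqa_base_dir.rglob("*.json")):
        print(p)

    # Hypothetical filename; pick a verified QA file found above
    qa_file = triviaqa_base_dir / "qa" / "verified-web-dev.json"
    with open(qa_file) as f:
        payload = json.load(f)

    # Assumption: question records are stored under a top-level "Data" key
    df = pd.DataFrame(payload.get("Data", []))
    print(df.head())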

  9. MAPLE-GNN Hybrid-Feature Graph Representation Data, PDB Files, and...

    • zenodo.org
    txt, zip
    Updated Jul 29, 2024
    Cite
    Bruce Tang (2024). MAPLE-GNN Hybrid-Feature Graph Representation Data, PDB Files, and Struct2Graph Dataset [Dataset]. http://doi.org/10.5281/zenodo.13123920
    Explore at:
    Available download formats: txt, zip
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bruce Tang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset, PDB Files, and Protein Graph Representation Data for MAPLE-GNN. Once downloaded, the extracted graphrepresentation.zip files should be placed in the codebase/data/npy folder. Extracted PDB files can be placed in the codebase/data/pdb folder.
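
    A small sketch of the unpacking step described above (folder layout follows the description; the zip filename is taken from the dataset listing):

    import zipfile
    from pathlib import Path

    # Graph representation data goes where the codebase expects it
    npy_dir = Path("codebase/data/npy")
    npy_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile("graphrepresentation.zip") as zf:
        zf.extractall(npy_dir)

    # Extracted PDB files go into codebase/data/pdb
    pdb_dir = Path("codebase/data/pdb")
    pdb_dir.mkdir(parents=True, exist_ok=True)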

  10. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Explore at:
    Available download formats: bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it

    In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
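
    A quick way to confirm the connection string works before opening the notebooks (a sketch, assuming SQLAlchemy is installed in the analysis environment):

    import os
    from sqlalchemy import create_engine, text

    # Uses the JUP_DB_CONNECTION variable exported above
    engine = create_engine(os.environ["JUP_DB_CONNECTION"])

    with engine.connect() as conn:
        print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if the restore succeeded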

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies of the requirements.txt

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    For reproducing the analyses, run jupyter on this folder:

    jupyter notebook

    Execute the notebooks on this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    Github account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it in blank
    export JUP_WITH_EXECUTION="1"; # run execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install 5 conda environments and 5 anaconda environments, for each python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (Note that it is a local package that has not been published to pypi. Make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7


  11. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • zenodo.org
    • data.europa.eu
    zip
    Updated Oct 20, 2022
    + more versions
    Cite
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
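
    For instance, a minimal sketch (the CSV filename below is hypothetical; substitute whichever daily or hourly file you downloaded):

    import pandas as pd

    # Hypothetical filename: use the actual daily/hourly CSV from the download
    df = pd.read_csv("lifesnaps_daily.csv")
    print(df.shape)
    print(df.head())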

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit 

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema 

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys 

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    {
      _id: 
  12. KaggleFinalRewodoriData

    • kaggle.com
    zip
    Updated Oct 29, 2022
    Cite
    kaerururu (2022). KaggleFinalRewodoriData [Dataset]. https://www.kaggle.com/kaerunantoka/kaggle-final-rewodori-data
    Explore at:
    Available download formats: zip (338817607 bytes)
    Dataset updated
    Oct 29, 2022
    Authors
    kaerururu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    download

    • kaggle datasets download -w kaerunantoka/kaggle-final-rewodori-data

    upload

    • mkdir /upload/path
    • mv dataset-metadata.json.tmp /upload/path/dataset-metadata.json
    • kaggle datasets version -p /upload/path -m "upload data"
  13. A subsection of England and Wales EPC households, joined with PPD data, used...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Nov 15, 2022
    Cite
    Jenkinson, Ryan; Chan, Stephanie; Phillips, Tom; Lopez-Garcia, Daniel (2022). A subsection of England and Wales EPC households, joined with PPD data, used for simulation modelling [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7322966
    Explore at:
    Dataset updated
    Nov 15, 2022
    Dataset provided by
    Centre for Net Zero
    Authors
    Jenkinson, Ryan; Chan, Stephanie; Phillips, Tom; Lopez-Garcia, Daniel
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Area covered
    England
    Description

    If you want to give feedback on this dataset, or wish to request it in another form (e.g csv), please fill out this survey here. We are a not-for-profit research organisation keen to see how others use our open models and tools, so all feedback is appreciated! It's a short form that takes 5 minutes to complete.

    Important Note: Before downloading this dataset, please read the License and Software Attribution section at the bottom.

    This dataset aligns with the work published in Centre for Net Zero's report "Hitting the Target". In this work, we simulate a range of interventions to model the situations in which we believe the UK will meet its 600,000 heat pump installation per year target by 2028. For full modelling assumptions and findings, read our report on our website.

    The code for running our simulation is open source here.

    This dataset contains over 9 million households that have been address matched between Energy Performance Certificates (EPC) data and Price Paid Data (PPD). The code for our address matching is here. Since these datasets are Open Government License (OGL), this dataset is too. We basically model specific columns from various datasets, as set out in our methodology section in our report, to simplify and clean up this dataset for academic use. License information is also available in the appendix of our report above.

    The EPC data loaders can be found here (the data is here) and the rest of the schemas and data download locations can be found here.

    Note that this dataset is not regularly maintained or updated. It is correct as of January 2022. The data was curated and tested using dbt via this Github repository and would be simple to rerun on the latest data.

    The schema / data dictionary for this data can be found here.

    Our recommended way of loading this data is in Python. After downloading all "parts" of the dataset to a folder, you can run:

    
    
    import pandas as pd

    data = pd.read_parquet("path/to/data/folder/")

    Licenses and software attribution:

    For EPC, PPD and UK House Price Index data:

    For the EPC data, we are permitted to republish this providing we mention that all researchers who download this dataset follow these copyright restrictions. We do not explicitly release any Royal Mail address data, instead we use these fields to generate a pseudonymised "address_cluster_id" which reflects a unique combination of the address lines and postcodes, as well as other metadata. When viewing ICO and GDPR guidelines, this still counts as personal data, but we have gone to measures to pseudonymise as much as possible to fulfil our obligations as a data processor. You must read this carefully before downloading the data, and ensure that you are using it for the research purposes as determined by this copyright notice.

    Contains HM Land Registry data © Crown copyright and database right 2021. This data is licensed under the Open Government Licence v3.0.

    Contains OS data © Crown copyright and database right 2022.

    Contains Office for National Statistics data licensed under the Open Government Licence v.3.0.

    The OGL v3.0 license states that we are free to:

    copy, publish, distribute and transmit the Information;

    adapt the Information;

    exploit the Information commercially and non-commercially for example, by combining it with other Information, or by including it in your own product or application.

    However we must (where we do any of the above):

    acknowledge the source of the Information in your product or application by including or linking to any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence;

    You can see more information here.

    For XOServe Off Gas Postcodes:

    This dataset has been released openly for all uses here.

    For the address matching:

    GNU Parallel: O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014

  14. Integrated IDPS Security 3Datasets (IIS3D)

    • kaggle.com
    zip
    Updated Jul 15, 2025
    Cite
    Roger Nick Anaedevha (2025). Integrated IDPS Security 3Datasets (IIS3D) [Dataset]. https://www.kaggle.com/datasets/rogernickanaedevha/integrated-idps-security-3datasets/discussion
    Explore at:
    Available download formats: zip (911339607 bytes)
    Dataset updated
    Jul 15, 2025
    Authors
    Roger Nick Anaedevha
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Overview

    This research-grade integrated cybersecurity dataset combines three premier cybersecurity datasets into a unified, balanced collection totaling 5.6 million records across multiple threat domains. The integration provides comprehensive coverage of modern cybersecurity threats including intrusion detection, IoT security, and network-based attacks, making it ideal for robust machine learning research and comparative cybersecurity analysis.

    Dataset Composition and Statistics

    Total Dataset Metrics

    • Total Records: 5,597,712 rows
    • Total Size: 2.6 GB
    • Processing Time: 8.7 minutes (optimized)
    • Integration Date: 2025
    • Format: Three standardized CSV files with unified metadata

    Research Applications

    Ideal Use Cases

    • Comparative cybersecurity analysis across domains
    • Multi-domain threat detection model development
    • IoT security research with comprehensive attack coverage
    • Intrusion detection system evaluation and benchmarking
    • Network forensics and behavioral analysis
    • Cross-domain transfer learning in cybersecurity

    Statistical Significance

    • 5.6M records provide robust statistical power
    • Balanced domain coverage enables fair comparative analysis
    • Modern attack vectors reflect current threat landscape
    • Multiple device types (servers, IoT devices, network infrastructure)

    Dataset Balance

    • IDS (36%): 2M rows - Network intrusion detection
    • UNSW (50%): 2.8M rows - Comprehensive attack taxonomy
    • IoT (14%): 800K rows - Modern IoT threat landscape

    Technical Specifications

    File Structure

    integrated_dataset_fixed/
    ├── integrated_ids_intrusion_dataset.csv (884 MB)
    ├── integrated_ciciot2023_dataset.csv (311 MB) 
    ├── integrated_unsw_nb15_dataset.csv (1,419 MB)
    ├── fixed_integration_summary.json
    └── dataset-metadata.json
    

    Usage Example

    import kagglehub
    import pandas as pd
    
    # Download integrated dataset
    path = kagglehub.dataset_download('rogernickanaedevha/integrated-cybersecurity-dataset')
    
    # Load individual domain datasets
    ids_data = pd.read_csv(f'{path}/integrated_ids_intrusion_dataset.csv')
    iot_data = pd.read_csv(f'{path}/integrated_ciciot2023_dataset.csv')
    unsw_data = pd.read_csv(f'{path}/integrated_unsw_nb15_dataset.csv')
    
    # Combined analysis
    print(f"Total records: {len(ids_data) + len(iot_data) + len(unsw_data):,}")
    

    Academic Citations

    Original Datasets:

    • CSE-CIC-IDS2018: University of New Brunswick Centre for Cybersecurity
    • CICIoT2023: Canadian Institute for Cybersecurity (CIC)
    • UNSW-NB15: Australian Centre for Cyber Security (ACCS)

    Integration: Research-grade processing and unification for comparative cybersecurity analysis.

    License and Usage

    • Academic Research: Free use permitted (original dataset terms)
    • Commercial Use: Requires approval from original dataset creators
    • Integration: Creative Commons CC0-1.0 for processing methodology
    • Attribution: Please cite both original datasets and this integration

    Dataset Quality Metrics

    • Completeness: 100% successful integration across all three domains
    • Balance: Proportional representation suitable for comparative analysis
    • Performance: Ultra-fast processing (8.7 minutes total)
    • Size Optimization: 2.6 GB total (manageable for most research environments)
    • Standards Compliance: Research-grade formatting and documentation

    This integrated dataset represents the most comprehensive multi-domain cybersecurity collection available, specifically optimized for machine learning research, comparative analysis, and advanced threat detection model development.

  15. ENV17 - Bathing water quality: additional datasets

    • gov.uk
    • s3.amazonaws.com
    Updated Nov 25, 2025
    Department for Environment, Food & Rural Affairs (2025). ENV17 - Bathing water quality: additional datasets [Dataset]. https://www.gov.uk/government/statistical-data-sets/env17-bathing-water-quality-additional-datasets
    Explore at:
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    GOV.UK: http://gov.uk/
    Authors
    Department for Environment, Food & Rural Affairs
    Description

    This data contains compliance information for bathing waters in England.

    Site data and summary information for English bathing waters is available from here:

    These tables show compliance with the Bathing Water Directive for bathing waters in England from 2017. The previous tables show compliance across the UK.

    National bathing waters in England for 2025: Individual bathing water classifications (ODS, 38.4 KB): https://assets.publishing.service.gov.uk/media/69205a9da0e0fb4c2936ab72/EMBARGOED_EA_National_Bathing_Waters_Classification_Results_2025.ods

    National bathing waters in England for 2024: Individual bathing water classifications (ODS, 45.6 KB): https://assets.publishing.service.gov.uk/media/674059cf02bf39539bdee83d/EMBARGOED_EA_National_Bathing_Waters_Classification_Results_2024.ods
    

  16. USGS National Transportation Dataset (NTD) Downloadable Data Collection

    • data.usgs.gov
    • catalog.data.gov
    Updated Dec 25, 2024
    + more versions
    U.S. Geological Survey, National Geospatial Technical Operations Center (2024). USGS National Transportation Dataset (NTD) Downloadable Data Collection [Dataset]. https://data.usgs.gov/datacatalog/data/USGS:ad3d631d-f51f-4b6a-91a3-e617d6a58b4e
    Explore at:
    Dataset updated
    Dec 25, 2024
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Authors
    U.S. Geological Survey, National Geospatial Technical Operations Center
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    The USGS Transportation downloadable data from The National Map (TNM) is based on TIGER/Line data provided through U.S. Census Bureau and supplemented with HERE road data to create tile cache base maps. Some of the TIGER/Line data includes limited corrections done by USGS. Transportation data consists of roads, railroads, trails, airports, and other features associated with the transport of people or commerce. The data include the name or route designator, classification, and location. Transportation data support general mapping and geographic information system technology analysis for applications such as traffic safety, congestion mitigation, disaster planning, and emergency response. The National Map transportation data is commonly combined with other data themes, such as boundaries, elevation, hydrography, and structure ...

  17. Dataset of a Study of Computational reproducibility of Jupyter notebooks...

    • zenodo.org
    pdf, zip
    Updated Jul 11, 2024
    Sheeba Samuel; Sheeba Samuel; Daniel Mietchen; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
    Explore at:
    zip, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Sheeba Samuel; Sheeba Samuel; Daniel Mietchen; Daniel Mietchen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the dataset for the study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

    Data Collection and Analysis

    We used the code for assessing the reproducibility of Jupyter notebooks from the study by Pimentel et al. (2019) and adapted code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using the NCBI Entrez utilities via Biopython.

    Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning each article, including the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Leveraging the GitHub API, we enrich our data with repository creation dates, update histories, pushes, and programming languages.
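    As an illustration of this search step, here is a minimal sketch using Biopython's Entrez utilities; the e-mail address and retmax value are placeholders rather than settings from the study.

    from Bio import Entrez

    Entrez.email = "you@example.org"  # NCBI requires a contact address (placeholder)

    # Search PubMed Central for articles mentioning Jupyter notebooks and GitHub
    handle = Entrez.esearch(db="pmc", term="(ipynb OR jupyter OR ipython) AND github", retmax=50)
    record = Entrez.read(handle)
    handle.close()

    # Fetch the matching article records as XML for downstream GitHub-link extraction
    fetch = Entrez.efetch(db="pmc", id=",".join(record["IdList"]), retmode="xml")
    xml_data = fetch.read()
    fetch.close()

    print(f"{record['Count']} articles match; fetched {len(record['IdList'])} records")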

    All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.
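    For orientation, the following is a much-simplified sketch of how extracted repository metadata could be stored in SQLite; the actual database has 24 tables, and the table and column names below are illustrative only.

    import sqlite3

    conn = sqlite3.connect("db.sqlite")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS repositories (
            id INTEGER PRIMARY KEY,
            article_pmcid TEXT,
            url TEXT,
            created_at TEXT,
            primary_language TEXT
        )
    """)
    # Illustrative row; real values come from the PMC scan and the GitHub API
    conn.execute(
        "INSERT INTO repositories (article_pmcid, url, created_at, primary_language) VALUES (?, ?, ?, ?)",
        ("PMC0000000", "https://github.com/example/repo", "2020-01-01", "Python"),
    )
    conn.commit()
    conn.close()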

    Our reproducibility pipeline was started on 27 March 2023.

    Repository Structure

    Our repository is organized into two main folders, plus a workflow description:

    • archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. Twenty-four database tables are created, storing information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc., in the db.sqlite database file.
    • analyses: Here, you will find notebooks instrumental in the in-depth analysis of data related to our study. The db.sqlite file generated by running the scripts in the archaeology folder is stored in the analyses folder for further analysis; the path can, however, be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) examines data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) analyzes data associated with publications in PubMed Central, i.e., plots involving data about articles, journals, publication dates or research fields. The resulting figures from these notebooks are stored in the 'outputs' folder.
    • MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.

    Accessing Data and Resources:

    • All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158
    • For the latest results and re-run data, refer to this link.
    • The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.
    • The metadata in XML format extracted from PubMed Central, which contains the information about the articles and journals, can be accessed in the pmc.xml file.

    System Requirements:

    Running the pipeline:

    • Clone the computational-reproducibility-pmc repository using Git:
      git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git
    • Navigate to the computational-reproducibility-pmc directory:
      cd computational-reproducibility-pmc/computational-reproducibility-pmc
    • Configure environment variables in the config.py file:
      GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
      GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")
    • Other environment variables can also be set in the config.py file.
      BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
      DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.
    • To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
      source conda-setup.sh
    • Change to the archaeology directory
      cd archaeology
    • Activate conda environment. We used py36 to run the pipeline.
      conda activate py36
    • Execute the main pipeline script (r0_main.py):
      python r0_main.py

    Running the analysis:

    • Navigate to the analysis directory.
      cd analyses
    • Activate conda environment. We use raw38 for the analysis of the metadata collected in the study.
      conda activate raw38
    • Install the required packages using the requirements.txt file.
      pip install -r requirements.txt
    • Launch Jupyterlab
      jupyter lab
    • Refer to the Index.ipynb notebook for the execution order and guidance.

    References:

  18. Simple download service (Atom) of the dataset: Type A map of agregated areas...

    • data.europa.eu
    unknown
    + more versions
    Simple download service (Atom) of the dataset: Type A map of agregated areas exposed to noise (night) for the road network of Côte-d’Or [Dataset]. https://data.europa.eu/data/datasets/fr-120066022-srv-32fdbdf3-4040-43a9-aef5-55a3def4fd02
    Explore at:
    unknown
    Description

    These maps, also referred to as "type A maps", represent, for the year 2017 and in the form of isophone (equal-noise) contours, the areas exposed to more than 50 dB(A) according to the Ln indicator, in 5 dB(A) increments. They concern the road network of Côte-d’Or.

    Geographic objects have been aggregated and clipped together to avoid overlaps.

    Ln: noise level indicator for the night period (22:00-06:00).

    These aggregated data are published for mapping purposes. For more accurate use, it is advisable to load the detailed data.

  19. Data and Processing from "Carbon-centric dynamics of Earth's marine...

    • data.niaid.nih.gov
    Updated Oct 6, 2024
    Stoer, Adam; Fennel, Katja (2024). Data and Processing from "Carbon-centric dynamics of Earth's marine phytoplankton" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10949681
    Explore at:
    Dataset updated
    Oct 6, 2024
    Dataset provided by
    Dalhousie University
    Authors
    Stoer, Adam; Fennel, Katja
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Earth
    Description

    Brief Summary:

    This documentation is for associated data and code for:

    A. Stoer, K. Fennel, Carbon-centric dynamics of Earth's marine phytoplankton. Proceedings of the National Academy of Sciences (2024).

    To cite this software and data, please use:

    A. Stoer, K. Fennel, Data and processing from "Carbon-centric dynamics of Earth's marine phytoplankton". Zenodo. https://doi.org/10.5281/zenodo.10949682. Deposited 1 October 2024.

    List of folders and subfolders and what they contain:

    raw data: Contains raw data used in the analysis. This folder does not contain the satellite imagery, which will need to be downloaded from the NASA Ocean Color website (https://oceancolor.gsfc.nasa.gov/).

    bgc-argo float data (subfolder): Includes Argo data from its original source, or data put into a similar Argo format.

    global region data (subfolder): Includes data used to subset the Argo profiles into each 10-degree latitude region and basin.

    graff et al 2015 data (subfolder): Includes the data digitized from Graff et al.'s Fig. 2.

    processed data: Data processed by this study (Stoer and Fennel, 2024).

    processed bgc-argo data (subfolder): A binned, processed file is present for each Argo float used in the analysis. Note that these files include those described in Table S1 (these are later processed in "3_stock_bloom_calc.py").

    processed satellite data (subfolder): Includes a 10-degree-latitude average for each satellite image processed (called "chl_sat_df_merged.csv"). This is later used to calculate a satellite chlorophyll-a climatology in "3_stock_bloom_calc.py".

    processed chla-irrad data (subfolder): Includes the quality-controlled diffuse light attenuation data coupled with the chlorophyll-a fluorescence data, used to calculate slope factor corrections (the file is called "processed chla-irrad data.csv").

    processed topography data (subfolder): Includes smoothed topography data (file named "ETOPO_2022_v1_60s_N90W180_surface_mod.tiff").

    software:

    0_ftp_argo_data_download.py: This program downloads the Argo data from the Global Data Assembly Center's FTP. Running this program will retrieve new Argo float profiles, so the result will not match the historical record of Argo floats used in this analysis, but it can be useful for replicating the analysis when more data become available. The historical record of BGC-Argo floats is present in the "/raw data/bgc-argo float data/" path. If you wish to download other float data, see Gordon et al. (2020), Hamilton and Leidos (2017) and the data from the misclab website (https://misclab.umeoce.maine.edu/floats/).

    1_argo_data_processing.py: This program quality-controls and bins the biogeochemical data into a consistent format. This includes corrections and checks, like the spike/noise test or the non-photochemical quenching correction.

    2_sat_data_processing.py: This program processes the satellite data downloaded from the NASA Ocean Color website.

    3_stock_bloom_calc.py: This is the main program used to produce the results described in the study. The program takes the processed Argo data, groups it into regions, and calculates slope factors, phytoplankton carbon & chlorophyll-a, global stocks, and bloom metrics.

    4_stock_calc_longhurst_province.py: This program repeats the global stocks calculations performed in "3_stock_bloom_calc.py" but bases the grouping on Longhurst Biogeochemical Provinces.

    How to Replicate this Analysis:

    Each program should be run in the order listed above. Path names where the data files have been downloaded will need to be updated in the code.

    To use the exact same Sprof files as used in the paper, skip running "0_ftp_argo_data_download.py" and start with "1_argo_data_processing.py" instead, using the float data from the folder "bgc-argo float data". The program "0_ftp_argo_data_download.py" downloads the latest data from the Argo database, so it is useful for updating the analysis. The program "1_argo_data_processing.py" may also be skipped to save time, and the processed BGC-Argo float data may be used instead (see the folder named "processed bgc-argo data").

    Similarly, the program "2_sat_data_processing.py" may also be skipped, which otherwise can take multiple hours to process. The raw data is available from the NASA Ocean Color website (https://oceancolor.gsfc.nasa.gov/). The processed data from "2_sat_data_processing.py" is available so this step may be skipped to save time as well.

    The program "3_stock_bloom_calc.py" will require running "ocean_toolbox.py" (see below) in another tab. The portion of the program that involves QC for the irradiance profiles has been commented out to save processing time, and the pre-processed data used in the study has been linked instead (see folder "processed light data"). Similarly, pre-processed topography data is present in this repository. The original Earth Topography data can be accessed at https://www.ncei.noaa.gov/products/etopo-global-relief-model.

    A version of "3_stock_bloom_calc.py" using Longhurst provinces is available for exploring alternative groupings and their effects on stock calculations. See the program named "4_stock_calc_longhurst_province.py". You will need to download the Longhurst biogeochemical provinces from https://www.marineregions.org/.

    To explore the effects of different slope factors, averaging methods, bbp spectral slopes, etc., the user will likely want to make changes to "3_stock_bloom_calc.py". Please do not hesitate to contact the corresponding author (Adam Stoer) for guidance or questions.

    ocean_toolbox.py:

    import statsmodels.formula.api as smf
    import os
    import matplotlib.pyplot as plt
    import numpy as np
    from uncertainties import unumpy as unp
    from scipy import stats

    def file_grab(root, find, start):
        # grabs files by file extension and location
        filelst = []
        for subdir, dirs, files in os.walk(root):
            for file in files:
                filepath = subdir + os.sep + file
                if filepath.endswith(find):
                    if filepath.startswith(start):
                        filelst.append(filepath)
        return filelst

    def sep_bbp(data, name_z, name_chla, name_bbp):
        '''
        data: Pandas Dataframe containing the profile data
        name_z: name of the depth variable in data
        name_chla: name of the chlorophyll-a variable in data
        name_bbp: name of the particle backscattering variable in data
        returns: the data variable with particle backscattering partitioned into
                 phytoplankton (bbpphy) and non-algal particle components (bbpnap).
        '''
        # name_chla = 'chla'
        # name_z = 'depth'
        # name_bbp = 'bbp470'
        dcm = data[data.loc[:, name_chla] == data.loc[:, name_chla].max()][name_z].values[0]  # Find depth of deep chla maximum
        # The next statements are only partially recoverable from the source
        # (comparison operators were stripped); the surviving fragments are kept verbatim:
        #   part_prof = data[(data.loc[:, name_bbp] ...
        #   ... =1), name_z].min()  # Find depth where bbp NAP and bbp intersect
        data.loc[data[name_z] >= z_lim, 'bbp_back'] = data.loc[data[name_z] >= z_lim, name_bbp].tolist()
        #   data.loc[data[name_z] ...
        #   ... z_lim), 'bbpphy'] = 0  # Subtract bbp NAP from bbp for bbp from phytoplankton

        return data['bbpphy'], z_lim
    

    def bbp_to_cphy(bbp_data, sf):
        '''
        data: Pandas Dataframe containing the profile data
        name_bbp: name of the particulate backscattering variable in data
        name_bbp_err: name of particulate backscattering error variable in data
        returns: the data variable with particle backscattering [/m] converted
                 into phytoplankton carbon [mg/m^3].
        '''
        cphy_data = bbp_data.mul(sf)

        return cphy_data
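    A minimal usage sketch for bbp_to_cphy, assuming a synthetic backscattering profile and an illustrative slope factor (not a value prescribed by the study):

    import pandas as pd

    bbp_profile = pd.Series([0.0012, 0.0010, 0.0008])  # particulate backscattering [/m]
    slope_factor = 12000.0                             # hypothetical carbon-to-bbp slope factor
    cphy = bbp_to_cphy(bbp_profile, slope_factor)      # phytoplankton carbon [mg/m^3]
    print(cphy.tolist())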
    
  20. Parking lot locations and utilization samples in the Hannover Linden-Nord...

    • data.uni-hannover.de
    geojson, png
    Updated Apr 17, 2024
    + more versions
    Institut für Kartographie und Geoinformatik (2024). Parking lot locations and utilization samples in the Hannover Linden-Nord area from LiDAR mobile mapping surveys [Dataset]. https://data.uni-hannover.de/dataset/parking-locations-and-utilization-from-lidar-mobile-mapping-surveys
    Explore at:
    geojson, png
    Dataset updated
    Apr 17, 2024
    Dataset authored and provided by
    Institut für Kartographie und Geoinformatik
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Area covered
    Hanover, Linden - Nord
    Description

    Work in progress: data might be changed

    The data set contains the locations of public roadside parking spaces in the northeastern part of Hanover Linden-Nord. As a sample data set, it explicitly does not provide a complete, accurate or correct representation of the conditions! It was collected and processed as part of the 5GAPS research project on September 22nd and October 6th 2022 as a basis for further analysis and in particular as input for simulation studies.

    Vehicle Detections

    Based on the mapping methodology of Bock et al. (2015) and the processing of Leichter et al. (2021), the utilization was determined using vehicle detections in segmented 3D point clouds. The corresponding point clouds were collected by driving over the area on two half-days with a LiDAR mobile mapping system, resulting in several hours between observations. Accordingly, these are only a few sample observations. The trips were made in such a way that, combined, they cover a synthetic day from about 8:00 to 20:00.

    The collected point clouds were georeferenced, processed, and automatically segmented semantically (see Leichter et al., 2021). To automatically extract cars, those points with car labels were clustered by observation epoch and bounding boxes were estimated for the clusters as a representation of car instances. The boxes serve both to filter out unrealistically small and large objects, and to rudimentarily complete the vehicle footprint that may not be fully captured from all sides.

    Figure 1: Overview map of detected vehicles (https://data.uni-hannover.de/dataset/0945cd36-6797-44ac-a6bd-b7311f0f96bc/resource/807618b6-5c38-4456-88a1-cb47500081ff/download/detection_map.png)

    Parking Areas

    The public parking areas were digitized manually using aerial images and the detected vehicles in order to exclude irregular parking spaces as far as possible. They were also tagged as to whether they were aligned parallel to the road and assigned to a use at the time of recording, as some are used for construction sites or outdoor catering, for example. Depending on the intended use, they can be filtered individually.

    Figure 2: Visualization of example parking areas on top of an aerial image [by LGLN] (https://data.uni-hannover.de/dataset/0945cd36-6797-44ac-a6bd-b7311f0f96bc/resource/16b14c61-d1d6-4eda-891d-176bdd787bf5/download/parking_area_example.png)

    Parking Occupancy

    For modelling the parking occupancy, single slots are sampled as center points every 5 m along the parking areas. In this way, they can be integrated into a street/routing graph, for example, as prepared in Wage et al. (2023); custom representations can also be generated from the parking areas and vehicle detections. The parking points were intersected with the vehicle boxes to identify occupancy at the respective epochs, as sketched below.
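    A minimal sketch of this slot-sampling and intersection step, assuming shapely is available; the geometries and the 5 m spacing below are illustrative stand-ins for the published GeoJSON layers.

    from shapely.geometry import LineString, box

    parking_lane = LineString([(0, 0), (50, 0)])            # simplified parking-area centreline
    vehicle_boxes = [box(3, -1, 8, 1), box(22, -1, 27, 1)]  # detected-vehicle bounding boxes

    spacing = 5.0
    n_slots = int(parking_lane.length // spacing) + 1
    slots = [parking_lane.interpolate(i * spacing) for i in range(n_slots)]

    # A slot counts as occupied if its center point falls inside any vehicle box
    occupied = [any(b.contains(p) for b in vehicle_boxes) for p in slots]
    print([(p.x, occ) for p, occ in zip(slots, occupied)])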

    Figure 3: Overview map of average parking lot load (https://data.uni-hannover.de/dataset/0945cd36-6797-44ac-a6bd-b7311f0f96bc/resource/ca0b97c8-2542-479e-83d7-74adb2fc47c0/download/datenpub-bays.png)

    However, unoccupied spaces cannot be determined quite as trivially the other way around, since the absence of a detected vehicle can equally result from the absence of a measurement or observation. Therefore, a parking space is only recorded as unoccupied if a vehicle was detected at the same time in its neighborhood on the same parking lane, so that it can be assumed that a measurement exists.

    To close temporal gaps, interpolations were made by hour for each parking slot, assuming that between two consecutive observations with an occupancy the space was also occupied in between, and likewise free if both observations were free; if there was a change, this is indicated by a proportional value (a sketch follows below). To close spatial gaps, unobserved spaces in the area are assigned occupancy patterns drawn randomly from the ten closest observed spaces.
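    A rough sketch of the hourly gap filling for a single slot, assuming observations are hourly flags (1 = occupied, 0 = free, None = not observed); the proportional handling of state changes is one possible reading of the description above.

    def fill_gaps(hourly):
        filled = list(hourly)
        observed = [i for i, v in enumerate(filled) if v is not None]
        for a, b in zip(observed, observed[1:]):
            for i in range(a + 1, b):
                if filled[a] == filled[b]:
                    # same state at both ends: assume it held in between
                    filled[i] = filled[a]
                else:
                    # state change: indicate it by a proportional value
                    frac = (i - a) / (b - a)
                    filled[i] = round(frac if filled[b] == 1 else 1 - frac, 2)
        return filled

    print(fill_gaps([1, None, None, 1, None, 0]))  # -> [1, 1, 1, 1, 0.5, 0]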

    This results in an exemplary occupancy pattern of a synthetic day. Depending on the application, the value could be interpreted as occupancy probability or occupancy share.

    Figure 4: Example parking area occupation pattern (https://data.uni-hannover.de/dataset/0945cd36-6797-44ac-a6bd-b7311f0f96bc/resource/184a1f75-79ab-4d0e-bb1b-8ed170678280/download/occupation_example.png)

    References

    • F. Bock, D. Eggert and M. Sester (2015): On-street Parking Statistics Using LiDAR Mobile Mapping, 2015 IEEE 18th International Conference on Intelligent Transportation Systems, Gran Canaria, Spain, 2015, pp. 2812-2818. https://doi.org/10.1109/ITSC.2015.452
    • A. Leichter, U. Feuerhake, and M. Sester (2021): Determination of Parking Space and its Concurrent Usage Over Time Using Semantically Segmented Mobile Mapping Data, Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLIII-B2-2021, 185–192. https://doi.org/10.5194/isprs-archives-XLIII-B2-2021-185-2021
    • O. Wage, M. Heumann, and L. Bienzeisler (2023): Modeling and Calibration of Last-Mile Logistics to Study Smart-City Dynamic Space Management Scenarios. In 1st ACM SIGSPATIAL International Workshop on Sustainable Mobility (SuMob ’23), November 13, 2023, Hamburg, Germany. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3615899.3627930