75 datasets found
  1. Datasets for evaluation of keyword extraction in Russian

    • github.com
    Updated Jun 11, 2018
    Cite
    Mikhail Nefedov (2018). Datasets for evaluation of keyword extraction in Russian [Dataset]. https://github.com/mannefedov/ru_kw_eval_datasets
    Explore at:
    Dataset updated
    Jun 11, 2018
    Authors
    Mikhail Nefedov
    Description

    Datasets for evaluation of keyword extraction in Russian
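Evaluation on such datasets typically compares a system's predicted keywords against the gold annotations per document. A minimal sketch of set-based precision/recall/F1 (an illustrative metric choice, not necessarily the repository's official one):

```python
def keyword_f1(predicted, gold):
    """Set-based precision, recall, and F1 between predicted and gold keyword lists."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    tp = len(pred & ref)                      # keywords found in both sets
    precision = tp / len(pred)
    recall = tp / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# 2 of 3 predictions are correct; 2 of 4 gold keywords are recovered
p, r, f1 = keyword_f1(['экономика', 'нефть', 'рубль'],
                      ['нефть', 'рубль', 'санкции', 'экспорт'])
```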

  2. Gazeta Summaries

    • kaggle.com
    zip
    Updated Sep 5, 2021
    Cite
    Ilya Gusev (2021). Gazeta Summaries [Dataset]. https://www.kaggle.com/phoenix120/gazeta-summaries
    Explore at:
    zip (193749591 bytes)
    Dataset updated
    Sep 5, 2021
    Authors
    Ilya Gusev
    Description

    Context

    This is the first Russian news summarization dataset. Paper about this dataset: https://arxiv.org/pdf/2006.11063.pdf Additional files and notebooks: https://github.com/IlyaGusev/gazeta/ Previous datasets for headline generation: https://github.com/RossiyaSegodnya/ria_news_dataset and https://www.kaggle.com/yutkin/corpus-of-russian-news-articles-from-lenta

    Content

    This is the second version of the dataset. The data structure is straightforward: every line of a file is a JSON object with 5 fields: URL, title, text, summary, and date. The dataset consists of 74,126 examples. The first 60,964 examples by date form the training set, the following 6,369 examples form the validation set, and the remaining 6,793 pairs form the test set.
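Since every line is a standalone JSON object, the file can be parsed record by record, and the chronological split can be reproduced by sorting on the date field. A minimal sketch (field names taken from the description above; exact key casing in the actual files may differ):

```python
import json

def read_jsonl(path):
    """Parse a JSON-lines file: one JSON object per line."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

def split_by_date(records, n_train, n_val):
    """Chronological split mirroring the train/validation/test scheme described above."""
    records = sorted(records, key=lambda r: r['date'])  # ISO dates sort lexicographically
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])
```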

    Legal issues

    Legal basis for distribution of the dataset: https://www.gazeta.ru/credits.shtml, paragraph 2.1.2. All rights belong to "www.gazeta.ru". This dataset can be removed at the request of the copyright holder. Usage of this dataset is possible only for personal purposes on a non-commercial basis.

  3. Russian ASR Open STT (public phone calls 1 and 2)

    • kaggle.com
    zip
    Updated Apr 7, 2025
    + more versions
    Cite
    alex cumder (2025). Russian ASR Open STT (public phone calls 1 and 2) [Dataset]. https://www.kaggle.com/datasets/alexcumder/audiosets
    Explore at:
    zip (14556669524 bytes)
    Dataset updated
    Apr 7, 2025
    Authors
    alex cumder
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Source: https://github.com/snakers4/open_stt. Includes asr_public_phone_calls_1 and asr_public_phone_calls_2 in one directory. The directory asr_public_phone_calls_1/0/ contains audio files and their corresponding transcripts. The file dataset_target consists of two columns, "filename" and "text", for the files in asr_public_phone_calls_1/0/.

    All files are normalized for easier / faster runtime augmentations and processing as follows: 1) converted to mono, if necessary; 2) converted to a 16 kHz sampling rate, if necessary; 3) stored as 16-bit integers.
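The three normalization properties above can be verified for any WAV file with the Python standard library alone; a minimal sketch (the file path is hypothetical):

```python
import wave

def is_normalized(path):
    """Return True if a WAV file matches the normalization described above:
    mono, 16 kHz sampling rate, 16-bit integer samples."""
    with wave.open(path, 'rb') as w:
        return (w.getnchannels() == 1          # mono
                and w.getframerate() == 16000  # 16 kHz
                and w.getsampwidth() == 2)     # 2 bytes per sample = 16-bit
```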

  4. Ukraine and Russia Conflict Tweet IDs Release v1.3

    • dataverse.harvard.edu
    • dataone.org
    Updated Jan 16, 2023
    Cite
    Emily Chen; Emilio Ferrara (2023). Ukraine and Russia Conflict Tweet IDs Release v1.3 [Dataset]. http://doi.org/10.7910/DVN/XZSYQO
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Emily Chen; Emilio Ferrara
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Ukraine, Russia
    Description

    The repository contains an ongoing collection of tweet IDs associated with the current conflict in Ukraine and Russia, which we commenced collecting on February 22, 2022. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use. Note that the compressed files must first be uncompressed in order to use the included scripts. This dataset is release v1.3 and is not actively maintained -- the actively maintained dataset can be found here: https://github.com/echen102/ukraine-russia. This release contains Tweet IDs collected from 2/22/22 - 1/08/23. Please refer to the README for more details regarding the data, data organization, and data usage agreement. This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript: Emily Chen and Emilio Ferrara. 2022. Tweets in Time of Conflict: A Public Dataset Tracking the Twitter Discourse on the War Between Ukraine and Russia. arXiv:cs.SI/2203.07488

  5. RusTitW: Russian Language Visual Text Recognition

    • kaggle.com
    zip
    Updated Jun 9, 2024
    Cite
    Nikita (2024). RusTitW: Russian Language Visual Text Recognition [Dataset]. https://www.kaggle.com/datasets/hardtype/rustitw-russian-language-visual-text-recognition
    Explore at:
    zip (135305919719 bytes)
    Dataset updated
    Jun 9, 2024
    Authors
    Nikita
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    RusTitW: Russian Language Text Dataset for Visual Text in-the-Wild Recognition

    Authors: Igor Markov, Sergey Nesteruk, Andrey Kuznetsov, Denis Dimitrov

    arXiv: https://arxiv.org/abs/2303.16531

    GitHub: github.com/markovivl/SynthText

    📄Abstract

    Information surrounds people in modern life. Text is a very efficient type of information that people have used for communication for centuries. However, automated text-in-the-wild recognition remains a challenging problem. The major limitation for a DL system is the lack of training data. For competitive performance, the training set must contain many samples that replicate real-world cases. While there are many high-quality datasets for English text recognition, there are no available datasets for the Russian language. In this paper, we present a large-scale human-labeled dataset for Russian text recognition in-the-wild. We also publish a synthetic dataset and the code to reproduce the generation process.

    ⚙️About the data

    • Data is divided into train and test, each of which is also split into real and synthetic (synth) examples.
    • For usability, each folder contains an info.csv file with the same format for all splits of the data.
    • Original labels and information are also preserved and can be found either in info_raw.csv or in json_*_*.json files.
    • The dataset contains duplicate images, which are not filtered from the original data. For example, some images are identical but have different resolutions.
    • Some images from the train split can also be found in test; this too comes from the original data.

    📍Label format

    [[{'left': 0.10259433962264151,
      'top': 0,
      'width': 0.4056603773584906,
      'height': 0.9303675048355899,
      'label': 'ALL you NEED
    is 20 SECONDS
    of Insane',
      'shape': 'rectangle'},
     {'left': 0.5141509433962265,
      'top': 0.009671179883945842,
      'width': 0.48584905660377353,
      'height': 0.5222437137330754,
      'label': 'COURAGE
    AND I PROMISE YOU
    something GREAT',
      'shape': 'rectangle'},
     {'left': 0.5165094339622641,
      'top': 0.5357833655705996,
      'width': 0.46344339622641517,
      'height': 0.31334622823984526,
      'label': 'will come of it
    Benjmin Mee',
      'shape': 'rectangle'}]]
    

    where:
    • left - x-axis relative left position of bbox (x_min)
    • top - y-axis relative top position of bbox (y_min)
    • width - x-axis relative width of bbox
    • height - y-axis relative height of bbox
    • label - text inside bounding box
    • shape - always 'rectangle'

    💻Display image and bbox:

    import json

    import cv2
    import matplotlib.patches as patches
    import matplotlib.pyplot as plt
    import pandas as pd


    TRAIN_PATH = 'train/real/'
    train = pd.read_csv(TRAIN_PATH + 'info.csv')

    # Pick a random example
    idx = train.sample(1).iloc[0].name
    im = cv2.imread(TRAIN_PATH + train.iloc[idx]['image_path'])

    fig, ax = plt.subplots()

    # Display the image
    ax.imshow(im)

    # Parse the bounding boxes for this example
    bboxes = json.loads(
      train.iloc[idx]['box_and_label']
    )[0]

    for bbox in bboxes:
      # Coordinates are relative, so scale them by the image dimensions
      x = bbox['left']   * train.iloc[idx]['width']
      y = bbox['top']    * train.iloc[idx]['height']
      w = bbox['width']  * train.iloc[idx]['width']
      h = bbox['height'] * train.iloc[idx]['height']
      rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='r', facecolor='none')

      # Add the patch to the Axes
      ax.add_patch(rect)

    plt.title('\n'.join([bbox['label'] for bbox in bboxes]))

    plt.show()
    

    🖼️Image examples

    Human-labeled images

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4480292%2Fd40f36b2ba3215770d0fc9beab9fc852%2Foutput4.png?generation=1717895975115361&alt=media

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4480292%2F3909319e543566a039378e094a3144c9%2Foutput3.png?generation=1717895989389635&alt=media

    • Note that the data isn't perfect: the word Лого in the first picture is unlabeled, and the second picture is missing the road-sign labels 40 and 4,5 м.

    Synthetic images

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4480292%2Fede8feae4c8e521409a1c8a7a4333a90%2Foutput.png?generation=1717895678470045&alt=media

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4480292%2F508d10c660328c510cdd4fc66c68a5d0%2Foutput1.png?generation=1717895741343654&alt=media

  6. Data from: MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation...

    • zenodo.org
    zip
    Updated Jun 14, 2023
    + more versions
    Cite
    Cagri Toraman; Oguzhan Ozcelik; Furkan Şahinuç; Fazli Can (2023). MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection [Dataset]. http://doi.org/10.5281/zenodo.8032136
    Explore at:
    zip
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cagri Toraman; Oguzhan Ozcelik; Furkan Şahinuç; Fazli Can
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is composed of 10,348 tweets: 5,284 for English and 5,064 for Turkish. Tweets in the dataset are human-annotated in terms of "false", "true", or "other". The dataset covers multiple topics: the Russia-Ukraine war, COVID-19 pandemic, Refugees, and additional miscellaneous events. The details can be found at https://github.com/avaapm/mide22

  7. Replication Data for: Analyzing GPT-4 Misinterpretations of Russian...

    • dataverse.no
    • dataverse.azure.uit.no
    txt
    Updated Nov 1, 2024
    Cite
    Timofei Plotnikov (2024). Replication Data for: Analyzing GPT-4 Misinterpretations of Russian Grammatical Constructions [Dataset]. http://doi.org/10.18710/8CAPJM
    Explore at:
    txt(309713), txt(39370), txt(442461), txt(51973), txt(87956), txt(3414), txt(480667), txt(188586)
    Dataset updated
    Nov 1, 2024
    Dataset provided by
    DataverseNO
    Authors
    Timofei Plotnikov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 1, 2024 - Apr 5, 2024
    Area covered
    Russia
    Dataset funded by
    UiT The Arctic University of Norway
    Description

    GPT-4 interpretations of a dataset of 2,227 examples gathered from the Russian Constructicon (https://constructicon.github.io/russian/)

  8. Database of Russian names, surnames and midnames for gender identification

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Ivan Begtin (2020). Database of Russian names, surnames and midnames for gender identification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2747010
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Infoculture
    Authors
    Ivan Begtin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of names, surnames and midnames from across the Russian Federation, used as a source for training gender-identification algorithms based on full names.

    The dataset is prepared for a MongoDB database. It includes a MongoDB dump and dumps of tables as JSON lines files.

    Used in gender identification and fullname parsing software: https://github.com/datacoon/russiannames

    Available under Creative Commons CC BY-SA by default.

  9. Data from: Russian Financial Statements Database: A firm-level collection of...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    European University at St. Petersburg
    Authors
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Russia
    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

    The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format (Apache Parquet) with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in an R or Python environment. Please consult the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

    You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

    from datasets import load_dataset
    import polars as pl

    # This will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder
    RFSD = load_dataset('irlspbru/RFSD')

    # Alternatively, this will download ~540MB with all financial statements for 2023
    # to a Polars DataFrame (requires about 8GB of RAM)
    RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

    Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.

    Local File Import

    Importing in Python requires the pyarrow package installed.

    import pyarrow.dataset as ds
    import polars as pl

    # Read RFSD metadata from local files
    RFSD = ds.dataset("local/path/to/RFSD")

    # Use RFSD.schema to glimpse the data structure and columns' classes
    print(RFSD.schema)

    # Load the full dataset into memory
    RFSD_full = pl.from_arrow(RFSD.to_table())

    # Load only 2019 data into memory
    RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

    # Load only revenue for firms in 2019, identified by taxpayer id
    RFSD_2019_revenue = pl.from_arrow(
        RFSD.to_table(
            filter=ds.field('year') == 2019,
            columns=['inn', 'line_2110']
        )
    )

    # Give suggested descriptive names to variables
    renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
    RFSD_full = RFSD_full.rename(dict(zip(renaming_df['original'], renaming_df['descriptive'])))

    R

    Local File Import

    Importing in R requires the arrow package installed.

    library(arrow)
    library(data.table)

    # Read RFSD metadata from local files
    RFSD <- open_dataset("local/path/to/RFSD")

    # Use schema() to glimpse into the data structure and column classes
    schema(RFSD)

    # Load the full dataset into memory
    scanner <- Scanner$create(RFSD)
    RFSD_full <- as.data.table(scanner$ToTable())

    # Load only 2019 data into memory
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scanner <- scan_builder$Finish()
    RFSD_2019 <- as.data.table(scanner$ToTable())

    # Load only revenue for firms in 2019, identified by taxpayer id
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scan_builder$Project(cols = c("inn", "line_2110"))
    scanner <- scan_builder$Finish()
    RFSD_2019_revenue <- as.data.table(scanner$ToTable())

    # Give suggested descriptive names to variables
    renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
    setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

    Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

    We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

    Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek — in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

    A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove such filings.

    Why is the geolocation of firm X incorrect?

    We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

    Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

    We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

    Why is the data not in CSV?

    The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

    Version and Update Policy

    Version (SemVer): 1.0.0.

    We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is an obvious trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

    Creative Commons License Attribution 4.0 International (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

    @unpublished{bondarkov2025rfsd,
      title={{R}ussian {F}inancial {S}tatements {D}atabase},
      author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
      note={arXiv preprint arXiv:2501.05841},
      doi={https://doi.org/10.48550/arXiv.2501.05841},
      year={2025}
    }

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,

  10. Warships Monitoring - Kerch Strait

    • kaggle.com
    zip
    Updated Jun 18, 2025
    Cite
    Petro Ivaniuk (2025). Warships Monitoring - Kerch Strait [Dataset]. https://www.kaggle.com/datasets/piterfm/warships-monitoring-kerch-strait
    Explore at:
    zip (138001037 bytes)
    Dataset updated
    Jun 18, 2025
    Authors
    Petro Ivaniuk
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Kerch Strait
    Description

    This dataset describes Russian warships' movements on the Black Sea, the Sea of Azov, and the Mediterranean Sea since August 2022.

    The dataset was created from images published on the official social media accounts (Facebook, Telegram) of the Ukrainian Navy.

    The paper "Creating the 'Warships Monitoring' Dataset Using a GenAI Approach" (in Ukrainian) discusses the creation of the "Warships Monitoring" dataset using a generative AI approach with Gemini-2.0-Flash-Experimental.

    TBD

    Data

    Table1. data_monitoring.csv

    • date - observation time;
    • img_name - image name;
    • black_sea.enemy_ships - number of warships in the Black Sea;
    • black_sea.kalibr_carriers - number of warships ('Kalibr' carriers) in the Black Sea;
    • black_sea.total_salvo - total salvo in the Black Sea;
    • azov_sea.enemy_ships - number of warships in the Sea of Azov;
    • azov_sea.kalibr_carriers - number of warships ('Kalibr' carriers) in the Sea of Azov;
    • azov_sea.total_salvo - total salvo in the Sea of Azov;
    • mediterranean_sea.enemy_ships - number of warships in the Mediterranean Sea;
    • mediterranean_sea.kalibr_carriers - number of warships ('Kalibr' carriers) in the Mediterranean Sea;
    • mediterranean_sea.total_salvo - total salvo in the Mediterranean Sea;
    • kerch_strait_passage.black_sea.total - number of ships that passed through the Kerch Strait from the Sea of Azov to the Black Sea;
    • kerch_strait_passage.black_sea.moved_towards_bosporus - number of ships that passed through the Kerch Strait from the Sea of Azov to the Black Sea in the Bosporus direction;
    • kerch_strait_passage.azov_sea.total - number of ships that passed through the Kerch Strait from the Black Sea to the Sea of Azov;
    • kerch_strait_passage.azov_sea.moved_from_strait_bosporus - number of ships that passed through the Kerch Strait from the Black Sea to the Sea of Azov from the Bosporus direction.

    Table2. data_posts.csv

    • id - post id;
    • text - text;
    • date - post time;
    • views - number of views on the day the data was received;
    • img_path - image name;
    • reactions - number of reactions on the day the data was received;
    • date_create - download date.

    Table3. metadata_images.csv

    Table4. metadata_images_test.csv

    Folder1. images

    Input images for creating data_monitoring.csv.

    Folder2. images_test

    Images subset for testing.
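The monitoring table can be loaded and summarized directly with pandas; a minimal sketch using the Black Sea columns from Table 1 (the file path is hypothetical):

```python
import pandas as pd

def black_sea_summary(path):
    """Load data_monitoring.csv and summarize the Black Sea columns:
    mean values per column, plus the maximum observed total salvo."""
    df = pd.read_csv(path, parse_dates=['date'])
    cols = ['black_sea.enemy_ships', 'black_sea.kalibr_carriers', 'black_sea.total_salvo']
    return df[cols].mean().to_dict(), int(df['black_sea.total_salvo'].max())
```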

    Related Datasets

    Stand With Ukraine

  11. sakha-russian-parallel

    • huggingface.co
    Updated Nov 21, 2025
    Cite
    Artificial Intelligence Laboratory of the Republic of Sakha (Yakutia) (2025). sakha-russian-parallel [Dataset]. https://huggingface.co/datasets/ailabykt/sakha-russian-parallel
    Explore at:
    Dataset updated
    Nov 21, 2025
    Dataset authored and provided by
    Artificial Intelligence Laboratory of the Republic of Sakha (Yakutia)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The texts in the sah column were generated using OCR and may contain errors or artifacts. Please take this into account when using the data for training or evaluation. The dataset was aligned using the Lingtrain Aligner library (https://github.com/averkij/lingtrain-aligner), created by @averoo

  12. The Russian Constructicon database

    • dataverse.azure.uit.no
    • dataverse.no
    • +1more
    bin, pdf +2
    Updated Sep 28, 2023
    Cite
    Anna Endresen; Radovan Bast; Laura A. Janda; Valentina Zhukova; Daria Mordashova; Ekaterina Rakhilina; Olga Lyashevskaya; Marianne Lund; James D. McDonald; Francis M. Tyers (2023). The Russian Constructicon database [Dataset]. http://doi.org/10.18710/3AM2QM
    Explore at:
    bin(4832860), zip(3217382), text/x-python(988), pdf(872391), text/x-python(537)
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    DataverseNO
    Authors
    Anna Endresen; Radovan Bast; Laura A. Janda; Valentina Zhukova; Daria Mordashova; Ekaterina Rakhilina; Olga Lyashevskaya; Marianne Lund; James D. McDonald; Francis M. Tyers
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1900 - Dec 10, 2021
    Area covered
    Russia
    Dataset funded by
    The Ministry of Science and Higher Education of the Russian Federation
    The Norwegian Agency for International Cooperation and Quality Enhancement in Higher Education (Diku)
    The Ministry of Education of the Republic of Korea and the National Research Foundation of Korea
    Description

    The set of over 2,250 files archived here comprises a database of the Russian Constructicon, an open-access electronic resource freely available at https://constructicon.github.io/russian/. The Russian Constructicon is a searchable database of constructions accompanied by thorough descriptions of their properties and annotated illustrative examples.

  13. Russian ASR Golos

    • kaggle.com
    zip
    Updated Apr 9, 2025
    + more versions
    Cite
    alex cumder (2025). Russian ASR Golos [Dataset]. https://www.kaggle.com/datasets/alexcumder/russian-asr-golos
    Explore at:
    zip(18583718298 bytes)Available download formats
    Dataset updated
    Apr 9, 2025
    Authors
    alex cumder
    Description

    Golos is a Russian corpus suitable for speech research. The dataset consists mainly of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1,240 hours. The corpus is freely available for download, along with an acoustic model trained on it. A 3-gram KenLM language model built from an open Common Crawl corpus is also provided. Main project page: the Golos GitHub repository. See the license file en_us.pdf.
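A 3-gram language model of the kind mentioned above is built from counts of three-word sequences in a corpus. A minimal stdlib sketch of that counting step (an illustration of the idea only, not the actual KenLM pipeline; the toy corpus is made up):

```python
from collections import Counter

def trigram_counts(sentences):
    """Count 3-grams over whitespace-tokenized sentences,
    padding each sentence with start/end markers."""
    counts = Counter()
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts

corpus = ["я люблю речь", "я люблю данные"]
counts = trigram_counts(corpus)
print(counts[("<s>", "я", "люблю")])  # occurs in both sentences: 2
```

A real language model would then smooth and normalize these counts into probabilities, which is what KenLM does at scale.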

  14. HMAP Dataset 09: North Russian Salmon Catch Data, 1615-1937

    • erddap.eurobis.org
    Updated Apr 18, 2005
    + more versions
    Cite
    Lajus, Nicholls (2005). HMAP Dataset 09: North Russian Salmon Catch Data, 1615-1937 [Dataset]. https://erddap.eurobis.org/erddap/info/hmap_09/index.html
    Explore at:
    Dataset updated
    Apr 18, 2005
    Dataset authored and provided by
    Lajus, Nicholls
    Time period covered
    Jan 1, 1759 - Jan 1, 1937
    Area covered
    Variables measured
    time, aphia_id, latitude, longitude, BasisOfRecord, YearCollected, ScientificName, InstitutionCode
    Description

    This dataset contains catch data relating to salmon from northern Russia between 1615 and 1937.

    Key metadata:

    • Citation: J. Lajus et al., eds., 'North Russian Salmon Catch Data, 1615-1937', in J.H. Nicholls (comp.), HMAP Data Pages (https://oceanspast.org/hmap_db.php)
    • License: Creative Commons Attribution 4.0 International (CC BY), https://creativecommons.org/licenses/by/4.0/
    • Institution: OPI, TCD
    • Geographic extent: 60.62 to 71.2 degrees N, 31.27 to 61.1 degrees E
    • Time coverage: 1759-01-01 to 1937-01-01
    • Size: 3193 records
    • Release date: 2013-06-11; version 1.0 (2012-08-02); progress: completed
    • Lineage: prior to publication, data undergo quality-control checks described at https://github.com/EMODnet/EMODnetBiocheck?tab=readme-ov-file#understanding-the-output

  15. Emoji Gestures in Russian Tweets: Moscow

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated May 18, 2022
    Cite
    Marina Zhukova (2022). Emoji Gestures in Russian Tweets: Moscow [Dataset]. http://doi.org/10.5281/zenodo.5800200
    Explore at:
    csvAvailable download formats
    Dataset updated
    May 18, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marina Zhukova
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Moscow, Russia
    Description

    The dataset consists of 48,838 tweets, each containing one of 31 gesture emoji (different hand configurations) or one of their skin-tone-modifier variants (e.g. 🙏🙏🏿🙏🏾🙏🏽🙏🏼🙏🏻), posted in Russian within 50 km of Moscow, Russia, during May-August 2021. The dataset can be used to investigate the use of gesture emoji by Russian users of the Twitter platform. Python libraries used for collecting tweets and preprocessing: tweepy, re, preprocessor, emoji, regex, string, nltk.

    The dataset contains 12 columns:

    1. tweet_original

      original text of the tweet

    2. preprocessed

      preprocessed text of the tweet (4 steps)

    3. all_emoji

      lists all emoji in a given tweet

    4. hashtags

      lists all hashtags in a given tweet

    5. user_encoded

      encoded Twitter user name: the first 3 characters of the user name and the first 3 characters of the user's location

    6. location_encoded

      location of the user: "moscow", "moscow_region", or "other"

    7. mention_present

      checks whether the tweet contains a mention

    8. url_present

      checks whether the tweet contains a URL

    9. preprocess_tweet

      preprocessing step 1: tokenizing mentions, urls, and hashtags

    10. lowercase_tweet

      preprocessing step 2: lowercasing

    11. remove_punct_tweet

      preprocessing step 3: removing punctuation

    12. tokenize_tweet

      preprocessing step 4: tokenizing

    Further information on the research project can be found at https://github.com/mzhukovaucsb/emoji_gestures/
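The four preprocessing steps listed above can be sketched with the standard library alone (the placeholder token names and exact regexes here are assumptions for illustration; the original pipeline used tweepy, preprocessor and nltk):

```python
import re
import string

def preprocess_tweet(text):
    # Step 1: tokenize mentions, URLs and hashtags into placeholder tokens
    text = re.sub(r"https?://\S+", "$URL$", text)
    text = re.sub(r"@\w+", "$MENTION$", text)
    text = re.sub(r"#\w+", "$HASHTAG$", text)
    # Step 2: lowercase
    text = text.lower()
    # Step 3: remove punctuation ($ is kept so the placeholders survive)
    text = text.translate(str.maketrans("", "", string.punctuation.replace("$", "")))
    # Step 4: tokenize on whitespace
    return text.split()

print(preprocess_tweet("Привет @user! Смотри https://t.co/x 🙏 #москва"))
```

Note that `\w` in Python's `re` matches Cyrillic letters by default, so Russian hashtags are tokenized correctly.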

  16. Hydro-meteorological database for watersheds across the Russia

    • zenodo.org
    zip
    Updated May 17, 2023
    Cite
    Abramov Dmitrii; Kurochkina Lyubov (2023). Hydro-meteorological database for watersheds across the Russia [Dataset]. http://doi.org/10.5281/zenodo.7789304
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abramov Dmitrii; Kurochkina Lyubov
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Russia
    Description

    The presented database is a set of hydrological, meteorological, environmental and geometric values for the Russian Federation for the period from 2008 to 2020.

    The database consists of the following items:

    • Point geometry for hydrological observation stations from the Roshydromet network across Russia
    • Catchment geometry for each observation station point
    • Daily hydrological values
      • Water level
        • in relative representation (cm)
        • in meters of the Baltic height system (m)
      • Water discharge
        • as an observed value (m³/s)
        • as a runoff layer (mm/day)
    • Daily meteorological values
    • Set of hydro-environmental characteristics derived from HydroATLAS database

    Each variable derived from the grid data was calculated for each watershed, taking into account the intersection weights of the watershed contour geometry and grid cells.
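The intersection-weighted aggregation described above can be sketched as follows (the cell values and overlap weights are made-up illustrations; the real computation derives the weights from watershed-contour/grid-cell intersection geometry):

```python
def weighted_mean(cell_values, weights):
    """Aggregate grid-cell values over a watershed using
    intersection-area weights."""
    total_w = sum(weights)
    return sum(v * w for v, w in zip(cell_values, weights)) / total_w

# Example: three grid cells overlap the watershed with different fractions
precip = [2.0, 3.0, 5.0]    # mm/day in each intersecting cell
overlap = [0.5, 1.0, 0.25]  # intersection areas (arbitrary units)
print(weighted_mean(precip, overlap))  # 3.0
```

Cells that barely touch the watershed thus contribute proportionally less to the catchment-average value.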

    Coordinates of hydrological stations were obtained from the resource of the Federal Agency for Water Resources of the Russian Federation (AIS GMVO).

    To calculate the contours of the catchment areas, a script was developed that builds the contours according to the flow-direction rasters from MERIT Hydro. To assess the quality of the contour construction, the obtained catchment area was compared with the archival value from the corresponding AIS GMVO table. The average error in determining the area for 2080 catchments is approximately 2%.

    To derive the hydro-environmental values from HydroATLAS, an approach was developed that calculates aggregated values for each catchment depending on the type of variable: qualitative (land cover classes, lithological classes, etc.) or quantitative (air temperature, snow cover extent, etc.). Qualitative variables were aggregated as the mode over the sub-basins intersecting the target catchment, i.e. the most frequent attribute among the sub-basins describes the whole catchment. Quantitative variables were aggregated as the mean of the attribute over the sub-basins. More detail can be found in the publication.
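The two aggregation rules (mode for qualitative variables, mean for quantitative ones) can be sketched with made-up sub-basin attributes:

```python
from collections import Counter

def aggregate_qualitative(values):
    """Mode: the most frequent class among intersecting sub-basins."""
    return Counter(values).most_common(1)[0][0]

def aggregate_quantitative(values):
    """Mean of the attribute over intersecting sub-basins."""
    return sum(values) / len(values)

land_cover = ["forest", "forest", "cropland"]  # qualitative
air_temp = [1.5, 2.0, 2.5]                     # quantitative, degrees C

print(aggregate_qualitative(land_cover))   # forest
print(aggregate_quantitative(air_temp))    # 2.0
```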

    Files are distributed as follows:

    Each file is linked to the unique identifier of a hydrological observation post. Files in netCDF format (hydrological and meteorological series) are named after this identifier.

    Every file that describes geometry (points, polygons, static attributes) has a column named gauge_id with the same correspondence.

    • attributes/static_data.csv – results from HydroATLAS aggregation
    • geometry/russia_gauges.gpkg – coordinates of hydrological observation stations
      • Sample rows (gauge_id | name_ru | name_en | geometry):
        49001 | р. Ковда – пос. Софпорог | r.Kovda - pos. Sofporog | POINT (31.41892 65.79876)
        49014 | р. Корпи-Йоки – пос. Пяозерский | r.Korpi-Joki - pos. Pjaozerskij | POINT (31.05794 65.77917)
        49017 | р. Тумча – пос. Алакуртти | r.Tumcha - pos. Alakurtti | POINT (30.33082 66.95957)
    • geometry/russia_ws.gpkg – catchments polygon for each hydrological observation stations
      • Sample rows (gauge_id | name_ru | name_en | new_area | ais_dif | geometry):
        9002 | р. Енисей – г. Кызыл | r.Enisej - g.Kyzyl | 115263.989 | 0.230 | POLYGON ((96.87792 53.72792, 96.87792 53.72708...
        9022 | р. Енисей – пос. Никитино | r.Enisej - pos. Nikitino | 184499.118 | 1.373 | POLYGON ((96.87792 53.72708, 96.88042 53.72708...
        9053 | р. Енисей – пос. Базаиха | r.Enisej - pos.Bazaiha | 302690.417 | 0.897 | POLYGON ((92.38292 56.11042, 92.38292 56.10958...
      • The column ais_dif gives the percentage error in the area determination
    • nc_all_q/
      • netCDF files for hydrological observation stations with no missing discharge values for the 2008-2020 period
    • nc_all_h/
      • netCDF files for hydrological observation stations with no missing water-level values for the 2008-2020 period
    • nc_concat/
      • data for all available geometries provided in the dataset

    More details on the processing scripts used to develop this database can be found in a folder of the GitHub repository where I store the results of my PhD dissertation.

  17. Replication Data for: #Navalny’s Death and Russia’s Future: Anti-Authoritarianism and the Politics of Mourning

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Oct 29, 2025
    Cite
    Chong, Miyoung (2025). Replication Data for: #Navalny’s Death and Russia’s Future: Anti-Authoritarianism and the Politics of Mourning [Dataset]. http://doi.org/10.7910/DVN/8R2I1K
    Explore at:
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Chong, Miyoung
    Description

    The data repository includes the data and computational code used for the "#Navalny’s Death and Russia’s Future: Anti-Authoritarianism and the Politics of Mourning" study. https://github.com/madhav28/Navalny-Study

  18. The SaltWaterDistortion Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 28, 2022
    Cite
    Daria Senshina; Dmitry Polevoy; Egor Ershov; Irina Kunina (2022). The SaltWaterDistortion Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6475915
    Explore at:
    Dataset updated
    Apr 28, 2022
    Dataset provided by
    Institute for Information Transmission Problems, RAS, Bolshoy Karetny per., 19, Moscow, Russian Federation
    Federal Research Center "Computer Science and Control" RAS, Moscow, Russia
    Evocargo LLC, Moscow, Russia
    Authors
    Daria Senshina; Dmitry Polevoy; Egor Ershov; Irina Kunina
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the wide introduction of waterproof standards (IP68) in the mobile phone industry and the growing popularity of amateur underwater photography, the question of correcting different types of geometric distortion is more relevant than ever.

    Despite extensive research in radial distortion correction, there are almost no open datasets allowing numerical quality assessment of such algorithms.

    The SWD (Salt Water Distortion) dataset is a new image dataset for underwater distortion estimation and correction. Images were collected in water of various salinities (<1%, 13%, 25%, 40%) using two smartphone cameras with different angles of view and focal lengths. The dataset includes 662 underwater photos of a calibration chessboard; for each image, all corners of the chessboard squares were manually marked (35,748 corners in total).

    The dataset description and code are available at https://github.com/Visillect/SaltWaterDistortion.
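Geometric distortion of chessboard corners is commonly modeled with a parametric radial model; a sketch of the one-parameter polynomial form (the coefficient k1 here is a made-up value, not one estimated from SWD):

```python
def distort_radial(x, y, k1):
    """Apply the one-parameter polynomial radial distortion model
    around the image center, in normalized coordinates:
    (x, y) -> (x, y) * (1 + k1 * r^2)."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2
    return x * factor, y * factor

# A chessboard corner at (0.5, 0.5) under mild pincushion distortion (k1 > 0)
xd, yd = distort_radial(0.5, 0.5, k1=0.1)
print(xd, yd)  # 0.525 0.525
```

Fitting such a model to the manually marked corners, then measuring residuals against the known chessboard geometry, is one way the dataset enables numerical evaluation of correction algorithms.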

    For a fast download, use zenodo-get. Install and run it as follows:

    pip install zenodo-get
    zenodo_get https://zenodo.org/record/6475916 --output-dir=SWD

  19. Supplementary code and data for the paper: 'The fall of genres that did not happen: formalising history of the "universal" semantics of Russian iambic tetrameter'

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Dec 7, 2023
    Cite
    Martynenko, Antonina; Šeļa, Artjoms (2023). Supplementary code and data for the paper: 'The fall of genres that did not happen: formalising history of the "universal" semantics of Russian iambic tetrameter' [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7958273
    Explore at:
    Dataset updated
    Dec 7, 2023
    Authors
    Martynenko, Antonina; Šeļa, Artjoms
    Description

    The dataset provides preprocessed data and the full code used in the paper 'The fall of genres that did not happen: formalising history of the "universal" semantics of Russian iambic tetrameter'. The code can also be accessed as rendered notebooks on GitHub. The dataset is structured as follows:

    • data/ : preprocessed data, including a sampled corpus of periodicals and a document-term matrix used for topic modelling
    • scr/ : the code used for the analysis, with separate scripts for figures
    • plots/ : the figures used in the paper, corresponding to the aforementioned code

  20. Hydro-meteorological database for watersheds across the CIS

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 12, 2023
    Cite
    Abramov Dmitrii; Kurochkina Lyubov (2023). Hydro-meteorological database for watersheds across the CIS [Dataset]. http://doi.org/10.5281/zenodo.8432070
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abramov Dmitrii; Kurochkina Lyubov
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The presented database is a set of hydrological, meteorological, environmental and geometric values for the Russian Federation for the period from 2008 to 2020.

    The database consists of the following items:

    • Point geometry for hydrological observation stations from the Roshydromet network across Russia
    • Catchment geometry for each observation station point
    • Daily hydrological values
      • Water level
        • in relative representation (cm)
        • in meters of the Baltic height system (m)
      • Water discharge
        • as an observed value (m³/s)
        • as a runoff layer (mm/day)
    • Daily meteorological values
    • Set of hydro-environmental characteristics derived from HydroATLAS database

    Each variable derived from the grid data was calculated for each watershed, taking into account the intersection weights of the watershed contour geometry and grid cells.

    Coordinates of hydrological stations were obtained from the resource of the Federal Agency for Water Resources of the Russian Federation (AIS GMVO).

    To calculate the contours of the catchment areas, a script was developed that builds the contours according to the flow-direction rasters from MERIT Hydro. To assess the quality of the contour construction, the obtained catchment area was compared with the archival value from the corresponding AIS GMVO table. The average error in determining the area for 2080 catchments is approximately 2%.

    To derive the hydro-environmental values from HydroATLAS, an approach was developed that calculates aggregated values for each catchment depending on the type of variable: qualitative (land cover classes, lithological classes, etc.) or quantitative (air temperature, snow cover extent, etc.). Qualitative variables were aggregated as the mode over the sub-basins intersecting the target catchment, i.e. the most frequent attribute among the sub-basins describes the whole catchment. Quantitative variables were aggregated as the mean of the attribute over the sub-basins. More detail can be found in the publication.

    Files are distributed as follows:

    Each file is linked to the unique identifier of a hydrological observation post. Files in netCDF format (hydrological and meteorological series) are named after this identifier.

    Every file that describes geometry (points, polygons, static attributes) has a column named gauge_id with the same correspondence.

    • attributes/static_data.csv – results from HydroATLAS aggregation
    • geometry/russia_gauges.gpkg – coordinates of hydrological observation stations
      • Sample rows (gauge_id | name_ru | name_en | geometry):
        49001 | р. Ковда – пос. Софпорог | r.Kovda - pos. Sofporog | POINT (31.41892 65.79876)
        49014 | р. Корпи-Йоки – пос. Пяозерский | r.Korpi-Joki - pos. Pjaozerskij | POINT (31.05794 65.77917)
        49017 | р. Тумча – пос. Алакуртти | r.Tumcha - pos. Alakurtti | POINT (30.33082 66.95957)
    • geometry/russia_ws.gpkg – catchments polygon for each hydrological observation stations
      • Sample rows (gauge_id | name_ru | name_en | new_area | ais_dif | geometry):
        9002 | р. Енисей – г. Кызыл | r.Enisej - g.Kyzyl | 115263.989 | 0.230 | POLYGON ((96.87792 53.72792, 96.87792 53.72708...
        9022 | р. Енисей – пос. Никитино | r.Enisej - pos. Nikitino | 184499.118 | 1.373 | POLYGON ((96.87792 53.72708, 96.88042 53.72708...
        9053 | р. Енисей – пос. Базаиха | r.Enisej - pos.Bazaiha | 302690.417 | 0.897 | POLYGON ((92.38292 56.11042, 92.38292 56.10958...
      • The column ais_dif gives the percentage error in the area determination
    • nc_all_q/
      • netCDF files for hydrological observation stations with no missing discharge values for the 2008-2020 period
    • nc_all_h/
      • netCDF files for hydrological observation stations with no missing water-level values for the 2008-2020 period
    • nc_all_q_h/
      • netCDF files for hydrological observation stations with no missing discharge and water-level values for the 2008-2020 period
    • nc_concat/
      • data for all available geometries provided in the dataset

    More details on the processing scripts used to develop this database can be found in a folder of the GitHub repository where I store the results of my PhD dissertation.

    05.04.2023 – Significant data changes. Removed catchments and related files with more than ±15% absolute error in the calculated area relative to the AIS GMVO information. The dataset now covers 1886 catchments across Russia.

    17.05.2023 – Significant data changes. Major revision of the parsing algorithm for AIS GMVO data; fixed how 0.0xx values were read. Use previous versions with caution.

    11.10.2023 – Significant data changes. Added 278 catchments for the CIS region from the GRDC resource and calculated meteorological and environmental attributes for each catchment. Added a new folder /nc_all_q_h with no missing observations of discharge and level. The dataset now covers 2164 catchments across the CIS.
