76 datasets found

Datasets for evaluation of keyword extraction in Russian
github.com
giters.com
Updated Jun 11, 2018
Cite
Mikhail Nefedov (2018). Datasets for evaluation of keyword extraction in Russian [Dataset]. https://github.com/mannefedov/ru_kw_eval_datasets
Dataset updated
Jun 11, 2018
Authors
Mikhail Nefedov
Description
Datasets for evaluation of keyword extraction in Russian
Gazeta Summaries
kaggle.com
zip
Updated Sep 5, 2021
Cite
Ilya Gusev (2021). Gazeta Summaries [Dataset]. https://www.kaggle.com/phoenix120/gazeta-summaries
Available download formats: zip (193749591 bytes)
Dataset updated
Sep 5, 2021
Authors
Ilya Gusev
Description
Context

This is the first Russian news summarization dataset. A paper about this dataset: https://arxiv.org/pdf/2006.11063.pdf Additional files and notebooks: https://github.com/IlyaGusev/gazeta/ Previous datasets for headline generation: https://github.com/RossiyaSegodnya/ria_news_dataset https://www.kaggle.com/yutkin/corpus-of-russian-news-articles-from-lenta

Content

This is the second version of the dataset. The data structure is straightforward: every line of a file is a JSON object with 5 fields: URL, title, text, summary, and date. The dataset consists of 74126 examples. The first 60964 examples by date are in the training dataset, the following 6369 examples are in the validation dataset, and the remaining 6793 examples are in the test dataset.
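
A minimal sketch of reading such a file and reproducing a date-ordered split. The lowercase key names (`url`, `title`, `text`, `summary`, `date`), the helper names, and the assumption that date strings sort chronologically as plain text are illustrative and are not taken from the dataset files.

```python
import json

def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def chronological_split(records, n_train, n_val, date_key="date"):
    """Sort records by date and cut them into train/validation/test parts.

    Assumes date values sort chronologically as plain strings
    (for example 'YYYY-MM-DD ...'); this is an assumption, not a documented fact.
    """
    ordered = sorted(records, key=lambda r: r[date_key])
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test

# Tiny synthetic sample standing in for the real file (hypothetical values).
sample_lines = [
    '{"url": "u3", "title": "t3", "text": "x3", "summary": "s3", "date": "2020-03-01"}',
    '{"url": "u1", "title": "t1", "text": "x1", "summary": "s1", "date": "2020-01-01"}',
    '{"url": "u2", "title": "t2", "text": "x2", "summary": "s2", "date": "2020-02-01"}',
    '{"url": "u4", "title": "t4", "text": "x4", "summary": "s4", "date": "2020-04-01"}',
]
records = read_jsonl(sample_lines)
train, val, test = chronological_split(records, n_train=2, n_val=1)
print([r["url"] for r in train], [r["url"] for r in val], [r["url"] for r in test])
```

With the real data, `n_train=60964` and `n_val=6369` would leave the remaining 6793 examples for the test part, matching the counts stated above.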

Legal issues

Legal basis for distribution of the dataset: https://www.gazeta.ru/credits.shtml, paragraph 2.1.2. All rights belong to "www.gazeta.ru". This dataset can be removed at the request of the copyright holder. Usage of this dataset is possible only for personal purposes on a non-commercial basis.
Russia - Ukraine War Tweets
kaggle.com
zip
Updated Nov 29, 2022
Cite
The Devastator (2022). Russia - Ukraine War Tweets [Dataset]. https://www.kaggle.com/datasets/thedevastator/invasion-of-ukraine-tweets-and-user-features
Available download formats: zip (19340125 bytes)
Dataset updated
Nov 29, 2022
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
Russia, Ukraine
Description
Russia - Ukraine War Tweets

Tweets about ongoing Russia - Ukraine war

By [source]

About this dataset

This dataset consists of tweets relating to the Russian invasion of Ukraine that were scraped for this study. Only tweets for which user features were available are included. The tweets and corresponding user features can be rehydrated using the Twitter API. However, some tweets or users may have been deleted or made private and are therefore no longer available. Moreover, user and tweet features may change over time.

How to use the dataset

This dataset can be used to study changes in sentiment and topics over time as the war continues.

Research Ideas

Find out which tweets are most popular among people interested in the Russian invasion of Ukraine

Identify which user attributes are associated with tweets about the Russian invasion of Ukraine

Study the change in sentiment and public opinion on the war as events unfold.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

Columns

File: after_invasion_tweetids.csv

| Column name | Description |
|:--------------|:-----------------------|
| id | The tweet id. (String) |

File: before_invasion_tweetids.csv

| Column name | Description |
|:--------------|:-----------------------|
| id | The tweet id. (String) |
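
A small sketch of preparing the ID files for rehydration: it reads the `id` column as strings (so long IDs are never converted to floats) and groups them into batches. The batch size of 100 is an assumed per-request limit for a tweet lookup endpoint and should be checked against the current API documentation; the sample IDs are made up, and no API call is made here.

```python
import csv
import io

def read_tweet_ids(csv_text):
    """Read the `id` column as strings so large IDs are never rounded."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["id"] for row in reader]

def batched(ids, size=100):
    """Split IDs into consecutive batches; `size` is an assumed per-request limit."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# Made-up IDs standing in for the contents of after_invasion_tweetids.csv.
sample_csv = "id\n1496000000000000001\n1496000000000000002\n1496000000000000003\n"
ids = read_tweet_ids(sample_csv)
batches = batched(ids, size=2)
print(len(ids), batches)
```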

Ukraine and Russia Conflict Tweet IDs Release v1.3
dataverse.harvard.edu
Updated Jan 16, 2023
Cite
Emily Chen; Emilio Ferrara (2023). Ukraine and Russia Conflict Tweet IDs Release v1.3 [Dataset]. http://doi.org/10.7910/DVN/XZSYQO
Unique identifier
https://doi.org/10.7910/DVN/XZSYQO
Dataset updated
Jan 16, 2023
Dataset provided by
Harvard Dataverse
Authors
Emily Chen; Emilio Ferrara
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Russia, Ukraine
Description
The repository contains an ongoing collection of Tweet IDs associated with the current conflict in Ukraine and Russia, which we began collecting on February 22, 2022. To comply with Twitter’s Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use. Note that the compressed files must first be uncompressed in order to use the included scripts. This dataset is release v1.3 and is not actively maintained; the actively maintained dataset can be found here: https://github.com/echen102/ukraine-russia. This release contains Tweet IDs collected from 2/22/22 to 1/08/23. Please refer to the README for more details regarding data, data organization, and the data usage agreement. This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript: Emily Chen and Emilio Ferrara. 2022. Tweets in Time of Conflict: A Public Dataset Tracking the Twitter Discourse on the War Between Ukraine and Russia. arXiv:cs.SI/2203.07488
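
The uncompression step mentioned above could look roughly like the following sketch. It assumes the files are gzip-compressed and end in `.gz`, which the description does not state; the function name `gunzip_all` is hypothetical, and the README remains the authoritative guide to the actual layout.

```python
import gzip
import shutil
from pathlib import Path

def gunzip_all(directory):
    """Decompress every *.gz file under `directory`, writing the output next to it.

    Assumes gzip compression (an assumption). Returns the decompressed file paths.
    """
    written = []
    # sorted() materializes the file list before any new files are written.
    for gz_path in sorted(Path(directory).rglob("*.gz")):
        out_path = gz_path.with_suffix("")  # drop the trailing .gz
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

Calling `gunzip_all("path/to/release")` on a local copy leaves the original archives in place and writes the uncompressed files beside them.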
Replication Data for: Analyzing GPT-4 Misinterpretations of Russian...
dataverse.no
dataverse.azure.uit.no
txt
Updated Nov 1, 2024
Cite
Timofei Plotnikov (2024). Replication Data for: Analyzing GPT-4 Misinterpretations of Russian Grammatical Constructions [Dataset]. http://doi.org/10.18710/8CAPJM
Available download formats: txt (309713), txt (51973), txt (39370), txt (188586), txt (3414), txt (87956), txt (480667), txt (442461)
Unique identifier
https://doi.org/10.18710/8CAPJM
Dataset updated
Nov 1, 2024
Dataset provided by
DataverseNO
Authors
Timofei Plotnikov
License
https://dataverse.no/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18710/8CAPJM
Time period covered
Mar 1, 2024 - Apr 5, 2024
Area covered
Russia
Dataset funded by
UiT The Arctic University of Norway
Description
GPT-4 interpretations of the dataset of 2,227 examples gathered from Russian Constructicon (https://constructicon.github.io/russian/)
Data from: Russian Financial Statements Database: A firm-level collection of...
data.niaid.nih.gov
Updated Mar 14, 2025
+ more versions
Cite
Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
Dataset updated
Mar 14, 2025
Dataset provided by
European University at St. Petersburg
Authors
Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

🔓 First open data set with information on every active firm in Russia.

🗂️ First open financial statements data set that includes non-filing firms.

🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

📅 Covers 2011-2023 initially, will be continuously updated.

🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.

The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

Importing The Data

You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

Python

🤗 Hugging Face Datasets

It is as easy as:

from datasets import load_dataset
import polars as pl

# This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder
RFSD = load_dataset('irlspbru/RFSD')

# Alternatively, this will download ~540MB with all financial statements for 2023
# to a Polars DataFrame (requires about 8GB of RAM)
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.
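
Because the rows are not shuffled, a uniform random sample has to consider the whole stream rather than its first n rows. One common option is reservoir sampling; the sketch below is plain Python and independent of the RFSD and Hugging Face APIs. The integer range stands in for a stream of rows, and the function name and seed are illustrative.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(row)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

stream = iter(range(100_000))  # stand-in for a stream of rows
picked = reservoir_sample(stream, k=5)
print(picked)
```

The sample is only as large as `k`, so memory use stays constant regardless of how many rows are streamed.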

Local File Import

Importing in Python requires pyarrow package installed.

import pyarrow.dataset as ds
import polars as pl

# Read RFSD metadata from local file
RFSD = ds.dataset("local/path/to/RFSD")

# Use RFSD.schema to glimpse the data structure and columns' classes
print(RFSD.schema)

# Load full dataset into memory
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load only 2019 data into memory
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only revenue for firms in 2019, identified by taxpayer id
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Give suggested descriptive names to variables
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

R

Local File Import

Importing in R requires arrow package installed.

library(arrow)
library(data.table)

# Read RFSD metadata from local file
RFSD <- open_dataset("local/path/to/RFSD")

# Use schema() to glimpse into the data structure and column classes
schema(RFSD)

# Load full dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load only 2019 data into memory
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only revenue for firms in 2019, identified by taxpayer id
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Give suggested descriptive names to variables
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

Use Cases

🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

FAQ

Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

What is the data period?

We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

Why are there no data for firm X in year Y?

Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.

Why is the geolocation of firm X incorrect?

We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded at house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In such cases we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

Why is the data for firm X different from https://bo.nalog.ru/?

Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.

Why is the data for firm X unrealistic?

We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

Why is the data for groups of companies different from their IFRS statements?

We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

Why is the data not in CSV?

The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

Version and Update Policy

Version (SemVer): 1.0.0.

We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of RFSD available by July.

Licence

Creative Commons License Attribution 4.0 International (CC BY 4.0).

Copyright © the respective contributors.

Citation

Please cite as:

@unpublished{bondarkov2025rfsd, title={{R}ussian {F}inancial {S}tatements {D}atabase}, author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy}, note={arXiv preprint arXiv:2501.05841}, doi={https://doi.org/10.48550/arXiv.2501.05841}, year={2025}}

Acknowledgments and Contacts

Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
Automated WarSpotting Equipment Losses
kaggle.com
zip
Updated Nov 17, 2025
Cite
Zsolt Lazar (2025). Automated WarSpotting Equipment Losses [Dataset]. https://www.kaggle.com/datasets/zsoltlazar/automated-warspotting-equipment-losses
Available download formats: zip (508774 bytes)
Dataset updated
Nov 17, 2025
Authors
Zsolt Lazar
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains Russian military equipment losses collected from the open-source WarSpotting API. It is automatically updated multiple times per week using a Python scraper running on GitHub Actions.

The data covers:
Full historical scans updated weekly
Incremental 30-day scans updated thrice weekly
Precise geographic coordinates for equipment loss
Equipment type and category details
Dates of loss and related metadata

This dataset is designed for researchers, analysts, and developers interested in:
Open-source intelligence (OSINT)
Conflict monitoring and analysis
Machine learning model training
Geospatial visualization of battlefield losses

The scraper and automation tools powering this dataset are fully open-source and available on GitHub: https://github.com/lazar-bit/automated-warspotting-scraper
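
One way the weekly full scans and the incremental 30-day scans could be combined is sketched below: records are keyed by an identifier and the newer scan wins on collisions. The field names (`id`, `type`, `date`) and the sample records are hypothetical and are not taken from the WarSpotting API or from the linked scraper.

```python
def merge_scans(full_scan, incremental_scan, key="id"):
    """Merge an incremental scan into a full one; newer records win on key collisions."""
    merged = {record[key]: record for record in full_scan}
    for record in incremental_scan:
        merged[record[key]] = record
    return sorted(merged.values(), key=lambda record: record[key])

# Hypothetical records for illustration only.
full_scan = [
    {"id": 1, "type": "tank", "date": "2024-01-05"},
    {"id": 2, "type": "truck", "date": "2024-01-09"},
]
incremental_scan = [
    {"id": 2, "type": "truck", "date": "2024-01-10"},      # corrected record
    {"id": 3, "type": "artillery", "date": "2024-02-01"},  # new record
]
merged = merge_scans(full_scan, incremental_scan)
print([(r["id"], r["date"]) for r in merged])
```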
MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection
zenodo.org
data.niaid.nih.gov
zip
Updated Jun 14, 2023
+ more versions
Cite
Cagri Toraman; Oguzhan Ozcelik; Furkan Şahinuç; Fazli Can (2023). MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection [Dataset]. http://doi.org/10.5281/zenodo.8032136
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.8032136
Dataset updated
Jun 14, 2023
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Cagri Toraman; Oguzhan Ozcelik; Furkan Şahinuç; Fazli Can
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is composed of 10,348 tweets: 5,284 for English and 5,064 for Turkish. Tweets in the dataset are human-annotated in terms of "false", "true", or "other". The dataset covers multiple topics: the Russia-Ukraine war, COVID-19 pandemic, Refugees, and additional miscellaneous events. The details can be found at https://github.com/avaapm/mide22
ru-image-captions
huggingface.co
Updated Apr 9, 2024
Cite
Svetlana Gorovaia (2024). ru-image-captions [Dataset]. https://huggingface.co/datasets/gorovuha/ru-image-captions
Dataset updated
Apr 9, 2024
Authors
Svetlana Gorovaia
Description
Image Captioning for Russian language

This dataset is a Russian part of dinhanhx/crossmodal-3600

Dataset Details

3.11k rows. Two descriptions for each picture. Cracked pictures were deleted from the original source. The main feature is that all the descriptions are written by native Russian speakers.

Paper [https://google.github.io/crossmodal-3600/]

Uses

It is intended to be used for fine-tuning image captioning models.
Audio Deepfake in Russian Language Dataset
kaggle.com
zip
Updated May 20, 2025
Cite
Vladimir Keller (2025). Audio Deepfake in Russian Language Dataset [Dataset]. https://www.kaggle.com/datasets/vladimirkeller/generated-russian-phrases
Available download formats: zip (7880998891 bytes)
Dataset updated
May 20, 2025
Authors
Vladimir Keller
Description
Dataset Overview

This dataset is designed for research on audio deepfake detection, focusing specifically on generated speech in Russian. It contains TTS-generated audio, paired with transcriptions, and a mixed set for real vs fake classification tasks.

Purpose

The main goal is to support research on audio deepfake detection in underrepresented languages, especially Russian. The dataset simulates real-world scenarios using multiple state-of-the-art TTS systems to generate fakes and includes clean, real audio data.

Generated Audio

We used three high-quality TTS models to synthesize Russian speech:

XTTS-v2: Cross-lingual, zero-shot voice cloning with multilingual support.

Silero TTS: Lightweight, real-time Russian TTS model.

VITS RU Multispeaker: VITS-based Russian model with speaker variability.

Real Audio

For real human speech, we used a part of SOVA dataset, which contains clean Russian utterances recorded by multiple speakers.
parallel_ab-ru
huggingface.co
Updated Jun 26, 2022
Cite
Danial Zakaria (2022). parallel_ab-ru [Dataset]. https://huggingface.co/datasets/Nart/parallel_ab-ru
Dataset updated
Jun 26, 2022
Authors
Danial Zakaria
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Dataset Summary

The Abkhaz-Russian parallel corpus dataset is a collection of 205,665 sentences/words extracted from different sources: e-books and web scraping.

Dataset Creation Source Data

Here is a link to the source on github

Considerations for Using the Data Other Known Limitations

The accuracy of the dataset is around 95% (it contains grammatical and orthographic errors).
Emoji Gestures in Russian Tweets: Moscow
data.niaid.nih.gov
zenodo.org
Updated May 18, 2022
Cite
Marina Zhukova (2022). Emoji Gestures in Russian Tweets: Moscow [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5800199
Dataset updated
May 18, 2022
Dataset provided by
University of California, Santa Barbara
Authors
Marina Zhukova
Area covered
Moscow, Russia
Description
The dataset consists of 48 838 tweets, each of which contains one of the 31 gesture emoji (different hand configurations) and its skin tone modifier options (e.g. 🙏🙏🏿🙏🏾🙏🏽🙏🏼🙏🏻). The tweets were posted in Russian within 50 km of Moscow, Russia, during May-August 2021. The dataset can be used to investigate the use of gesture emoji by Russian users of the Twitter platform. Python libraries used for collecting tweets and preprocessing: tweepy, re, preprocessor, emoji, regex, string, nltk.
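
To illustrate the 50 km criterion, here is a great-circle distance check in plain Python. It is a sketch of the geometric condition only, not the collection method used by the author. The reference coordinates for Moscow, the Earth radius, and the function names are assumptions.

```python
import math

# Approximate city-centre coordinates (assumed reference point).
MOSCOW = (55.7558, 37.6173)

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def within_radius(lat, lon, centre=MOSCOW, radius_km=50.0):
    """True if the point lies within `radius_km` of `centre`."""
    return haversine_km(lat, lon, centre[0], centre[1]) <= radius_km

print(within_radius(55.8970, 37.4297))  # a point in the Moscow suburbs
print(within_radius(59.9343, 30.3351))  # Saint Petersburg
```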

The dataset contains 11 columns:

preprocessed: preprocessed text of the tweet (4 steps)

all_emoji: lists all emoji in a given tweet

hashtags: lists all hashtags in a given tweet

user_encoded: encoded Twitter user name: the first 3 characters of the user name and the first 3 characters of the user's location

location_encoded: location of the user: "moscow", "moscow_region", or "other"

mention_present: checks whether each tweet contains mentions

url_present: checks whether each tweet contains url

preprocess_tweet: preprocessing step 1: tokenizing mentions, urls, and hashtags

lowercase_tweet: preprocessing step 2: lowercasing

remove_punct_tweet: preprocessing step 3: removing punctuation

tokenize_tweet: preprocessing step 4: tokenizing
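
The four preprocessing steps named in the columns above can be sketched as follows. This is not the author's code: it uses only `re` and `string` rather than the listed libraries, the placeholder tokens (`URL`, `MENTION`, `HASHTAG`) are hypothetical, and whitespace splitting stands in for whatever tokenizer was actually used.

```python
import re
import string

def preprocess_tweet(text):
    """Step 1: replace URLs, mentions, and hashtags with placeholder tokens (hypothetical names)."""
    text = re.sub(r"https?://\S+", "URL", text)
    text = re.sub(r"@\w+", "MENTION", text)
    text = re.sub(r"#\w+", "HASHTAG", text)
    return text

def lowercase_tweet(text):
    """Step 2: lowercase the text."""
    return text.lower()

def remove_punct_tweet(text):
    """Step 3: remove ASCII punctuation; emoji are left untouched."""
    return text.translate(str.maketrans("", "", string.punctuation))

def tokenize_tweet(text):
    """Step 4: split on whitespace (a simple stand-in for a real tokenizer)."""
    return text.split()

def preprocess(text):
    """Apply steps 1 to 4 in order."""
    return tokenize_tweet(remove_punct_tweet(lowercase_tweet(preprocess_tweet(text))))

tokens = preprocess("Привет, @user! Смотри: https://example.com #москва 🙏")
print(tokens)
```

URLs are replaced first so that `@` or `#` characters inside a link are not mistaken for mentions or hashtags.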

The further information on the research project can be found here: https://github.com/mzhukovaucsb/emoji_gestures/
Replication Data for: #Navalny’s Death and Russia’s Future:...
dataverse.harvard.edu
search.dataone.org
Updated Aug 12, 2025
Cite
Miyoung Chong (2025). Replication Data for: #Navalny’s Death and Russia’s Future: Anti-Authoritarianism and the Politics of Mourning [Dataset]. http://doi.org/10.7910/DVN/8R2I1K
Unique identifier
https://doi.org/10.7910/DVN/8R2I1K
Dataset updated
Aug 12, 2025
Dataset provided by
Harvard Dataverse
Authors
Miyoung Chong
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Russia
Description
The data repository includes data and computational codes used for the "#Navalny’s Death and Russia’s Future: Anti-Authoritarianism and the Politics of Mourning" study. https://github.com/madhav28/Navalny-Study
Multilingual_Speech_Dataset
huggingface.co
Updated Feb 13, 2025
Cite
Institute of Smart Systems and Artificial Intelligence, Nazarbayev University (2025). Multilingual_Speech_Dataset [Dataset]. https://huggingface.co/datasets/issai/Multilingual_Speech_Dataset
Dataset updated
Feb 13, 2025
Dataset authored and provided by
Institute of Smart Systems and Artificial Intelligence, Nazarbayev University
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Multilingual Speech Dataset

Paper: A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English Repository: https://github.com/IS2AI/MultilingualASR Description: This repository provides the dataset used in the paper "A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English". The paper focuses on training a single end-to-end (E2E) ASR model for Kazakh, Russian, and English, comparing monolingual and multilingual approaches… See the full description on the dataset page: https://huggingface.co/datasets/issai/Multilingual_Speech_Dataset.
WikiConv - Russian
figshare.com
txt
Updated May 31, 2023
+ more versions
Cite
Lucas Dixon; Nithum Thain; Dario Taraborelli; Cristian Danescu-Niculescu-Mizil; Jeffrey Sorensen; Yiqing Hua (2023). WikiConv - Russian [Dataset]. http://doi.org/10.6084/m9.figshare.7376015.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.7376015.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare: http://figshare.com/
Authors
Lucas Dixon; Nithum Thain; Dario Taraborelli; Cristian Danescu-Niculescu-Mizil; Jeffrey Sorensen; Yiqing Hua
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
WikiConv (Russian): A Corpus of the Complete Conversational History of a Large Online Collaborative Community

This directory contains the WikiConv Corpus, which encompasses the full history of conversations on Wikipedia Talk Pages.

The project webpage for this work is at: https://github.com/conversationai/wikidetox/tree/master/wikiconv

The dataset and reconstruction process for this corpus have been published in the paper WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community, presented at EMNLP 2018.

The work has also been presented at the June 2018 Wikipedia research showcase (the first half describes our work, using an earlier version of this dataset to predict conversations going awry).

The meta-data in this corpus is governed by the CC0 license v1.0, and the content of the comments is governed by the CC-SA license v3.0.
Database of Russian names, surnames and midnames for gender identification
data.niaid.nih.gov
Updated Jan 24, 2020
Cite
Ivan Begtin (2020). Database of Russian names, surnames and midnames for gender identification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2747010
Dataset updated
Jan 24, 2020
Dataset provided by
Infoculture
Authors
Ivan Begtin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Database of names, surnames and midnames across the Russian federation used as source to teach algorithms for gender identification by fullname.

The dataset is prepared for a MongoDB database. It includes a MongoDB dump and a dump of the tables as JSON Lines files.

Used in gender identification and fullname parsing software https://github.com/datacoon/russiannames

Available under Creative Commons CC-BY SA by default.
HMAP Dataset 09: North Russian Salmon Catch Data, 1615-1937
erddap.eurobis.org
Updated Apr 18, 2005
+ more versions
Cite
Lajus, Nicholls (2005). HMAP Dataset 09: North Russian Salmon Catch Data, 1615-1937 [Dataset]. https://erddap.eurobis.org/erddap/info/hmap_09/index.html
Dataset updated
Apr 18, 2005
Dataset authored and provided by
Lajus, Nicholls
Time period covered
Jan 1, 1759 - Jan 1, 1937
Area covered

Variables measured
time, aphia_id, latitude, longitude, BasisOfRecord, YearCollected, ScientificName, InstitutionCode
Description
This dataset contains catch data relating to salmon from northern Russia between 1615 and 1937.

AccConID=21
AccConstrDescription=This license lets others distribute, remix, tweak, and build upon your work, even commercially, as long as they credit you for the original creation. This is the most accommodating of licenses offered. Recommended for maximum dissemination and use of licensed materials.
AccConstrDisplay=This dataset is licensed under a Creative Commons Attribution 4.0 International License.
AccConstrEN=Attribution (CC BY)
AccessConstraint=Attribution (CC BY)
Acronym=None
added_date=2013-06-11 14:46:43.777000
BrackishFlag=0
CDate=2012-08-02
cdm_data_type=Other
CheckedFlag=0
Citation=J. Lajus et al, eds., ‘North Russian Salmon Catch Data, 1615-1937’, in J.H Nicholls (comp.) HMAP Data Pages (https://oceanspast.org/hmap_db.php)
Comments=None
ContactEmail=None
Conventions=COARDS, CF-1.6, ACDD-1.3
CurrencyDate=None
DasID=3146
DasOrigin=Data collection
DasType=Data
DasTypeID=1
DateLastModified={'date': '2025-09-09 01:42:03.812284', 'timezone_type': 1, 'timezone': '+02:00'}
DescrCompFlag=0
DescrTransFlag=0
Easternmost_Easting=61.1
EmbargoDate=None
EngAbstract=This dataset contains catch data relating to salmon from northern Russia between 1615 and 1937.
EngDescr=None
FreshFlag=0
geospatial_lat_max=71.2
geospatial_lat_min=60.62
geospatial_lat_units=degrees_north
geospatial_lon_max=61.1
geospatial_lon_min=31.27
geospatial_lon_units=degrees_east
infoUrl=None
InputNotes=None
institution=OPI, TCD
License=https://creativecommons.org/licenses/by/4.0/
Lineage=Prior to publication data undergo quality control checked which are described in https://github.com/EMODnet/EMODnetBiocheck?tab=readme-ov-file#understanding-the-output
MarineFlag=1
modified_sync=2025-09-02 00:00:00
Northernmost_Northing=71.2
OrigAbstract=None
OrigDescr=None
OrigDescrLang=English
OrigDescrLangNL=Engels
OrigLangCode=en
OrigLangCodeExtended=eng
OrigLangID=15
OrigTitle=None
OrigTitleLang=None
OrigTitleLangCode=None
OrigTitleLangID=None
OrigTitleLangNL=None
Progress=Completed
PublicFlag=1
ReleaseDate=Jun 11 2013 12:00AM
ReleaseDate0=2013-06-11
RevisionDate=None
SizeReference=3193 records
sourceUrl=(local files)
Southernmost_Northing=60.62
standard_name_vocabulary=CF Standard Name Table v70
StandardTitle=HMAP Dataset 09: North Russian Salmon Catch Data, 1615-1937
StatusID=1
subsetVariables=ScientificName,BasisOfRecord,YearCollected,aphia_id
TerrestrialFlag=0
time_coverage_end=1937-01-01T01:00:00Z
time_coverage_start=1759-01-01T01:00:00Z
UDate=2025-03-26
VersionDate=None
VersionDay=2
VersionMonth=8
VersionName=1.0
VersionYear=2012
VlizCoreFlag=1
Westernmost_Easting=31.27
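
The attributes in this record follow a `key=value` pattern. As a rough illustration of how they might be read programmatically, the sketch below parses a few of them into a dictionary and derives the geographic bounding box. It assumes the pairs have already been split one per line; the function names `parse_attributes` and `bounding_box` are hypothetical, and only the sample values are taken from the record.

```python
def parse_attributes(lines):
    """Parse 'key=value' lines into a dict (hypothetical helper).

    Only the first '=' separates key from value, so values that
    themselves contain '=' (for example URLs with query strings)
    are kept intact.
    """
    attrs = {}
    for line in lines:
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that are not key=value pairs
        attrs[key.strip()] = value.strip()
    return attrs


def bounding_box(attrs):
    """Return (lon_min, lat_min, lon_max, lat_max) as floats."""
    return (
        float(attrs["geospatial_lon_min"]),
        float(attrs["geospatial_lat_min"]),
        float(attrs["geospatial_lon_max"]),
        float(attrs["geospatial_lat_max"]),
    )


# A few attributes copied from the record, one per line (assumed layout).
SAMPLE = """geospatial_lat_max=71.2
geospatial_lat_min=60.62
geospatial_lon_max=61.1
geospatial_lon_min=31.27
SizeReference=3193 records"""

attrs = parse_attributes(SAMPLE.splitlines())
print(bounding_box(attrs))
```

Values are kept as strings on purpose: many fields in the record hold the literal text `None` or free-form dates, so any type conversion is left to the caller.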
Supplementary code and data for the paper: 'The fall of genres that did not...
data.niaid.nih.gov
Updated Dec 7, 2023
Martynenko, Antonina; Šeļa, Artjoms (2023). Supplementary code and data for the paper: 'The fall of genres that did not happen: formalising history of the "universal" semantics of Russian iambic tetrameter' [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7958273
Explore at:
Dataset updated
Dec 7, 2023
Authors
Martynenko, Antonina; Šeļa, Artjoms
Description
The dataset provides preprocessed data and the full code used in the paper 'The fall of genres that did not happen: formalising history of the "universal" semantics of Russian iambic tetrameter'. The code can also be accessed as rendered notebooks on GitHub. The dataset is structured as follows:

data/ : This folder contains preprocessed data, including a sampled corpus of periodicals and a document-term matrix used for topic modelling.
scr/ : The code used for the analysis, with separate scripts for figures.
plots/ : The figures used in the paper, which correspond to the aforementioned code.
The Russian Constructicon database
dataverse.azure.uit.no
dataverse.no
+2 more
bin, pdf +2
Updated Sep 28, 2023
Anna Endresen; Radovan Bast; Laura A. Janda; Valentina Zhukova; Daria Mordashova; Ekaterina Rakhilina; Olga Lyashevskaya; Marianne Lund; James D. McDonald; Francis M. Tyers (2023). The Russian Constructicon database [Dataset]. http://doi.org/10.18710/3AM2QM
Explore at:
Available download formats: text/x-python(537), text/x-python(988), bin(4832860), pdf(872391), zip(3217382)
Unique identifier
https://doi.org/10.18710/3AM2QM
Dataset updated
Sep 28, 2023
Dataset provided by
DataverseNO
Authors
Anna Endresen; Radovan Bast; Laura A. Janda; Valentina Zhukova; Daria Mordashova; Ekaterina Rakhilina; Olga Lyashevskaya; Marianne Lund; James D. McDonald; Francis M. Tyers
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Time period covered
Jan 1, 1900 - Dec 10, 2021
Area covered
Russian Federation
Dataset funded by
The Norwegian Agency for International Cooperation and Quality Enhancement in Higher Education (Diku)
The Ministry of Education of the Republic of Korea and the National Research Foundation of Korea
The Ministry of Science and Higher Education of the Russian Federation
Description
The set of over 2,250 files archived here comprises a database of the Russian Constructicon, an open-access electronic resource freely available at https://constructicon.github.io/russian/. The Russian Constructicon is a searchable database of constructions accompanied by thorough descriptions of their properties and annotated illustrative examples.
The SaltWaterDistortion Dataset
data.niaid.nih.gov
Updated Apr 28, 2022
Daria Senshina; Dmitry Polevoy; Egor Ershov; Irina Kunina (2022). The SaltWaterDistortion Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6475915
Explore at:
Dataset updated
Apr 28, 2022
Dataset provided by
Evocargo LLC, Moscow, Russia
Federal Research Center "Computer Science and Control" RAS, Moscow, Russia
Institute for Information Transmission Problems, RAS, Bolshoy Karetny per., 19, Moscow, Russian Federation
Authors
Daria Senshina; Dmitry Polevoy; Egor Ershov; Irina Kunina
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
With the wide adoption of waterproofing standards (IP68) in the mobile phone industry and the growing popularity of amateur underwater photography, the correction of different types of geometric distortion is more relevant than ever.

Despite extensive research on radial distortion correction, there are almost no open datasets that allow numerical quality assessment of such algorithms.

The SWD (Salt Water Distortion) dataset is a new image dataset for underwater distortion estimation and correction. Images were collected in water of various salinity (<1%, 13%, 25%, 40%) with two smartphone cameras that differ in angle of view and focal length. The dataset includes 662 underwater photos of a calibration chessboard; for each image, all corners of the chessboard squares were manually marked (35748 corners in total).
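
To make "numerical quality assessment" concrete, here is a minimal sketch of one way manually marked corners could be used to score a distortion model. It is not the authors' method: the one-parameter polynomial radial model, the function names `distort` and `mean_corner_error`, and the plain `(x, y)` tuple representation of corners are all assumptions made for illustration, since the annotation format is not described here.

```python
import math


def distort(point, k1, center=(0.0, 0.0)):
    """Apply a one-parameter polynomial radial model (assumed, illustrative).

    The point is scaled about the distortion center by (1 + k1 * r^2),
    where r is its distance from the center.
    """
    x = point[0] - center[0]
    y = point[1] - center[1]
    scale = 1.0 + k1 * (x * x + y * y)
    return (center[0] + x * scale, center[1] + y * scale)


def mean_corner_error(predicted, marked):
    """Mean Euclidean distance between predicted and marked corners."""
    if len(predicted) != len(marked) or not marked:
        raise ValueError("corner lists must be non-empty and of equal length")
    total = sum(math.dist(p, m) for p, m in zip(predicted, marked))
    return total / len(marked)


# Hypothetical ideal chessboard corners in normalized coordinates.
ideal = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
# Stand-in for manually marked corners: here simulated with k1 = 0.2.
marked = [distort(p, 0.2) for p in ideal]

# Score two candidate values of k1 against the marked corners.
for k1 in (0.0, 0.2):
    predicted = [distort(p, k1) for p in ideal]
    print(k1, mean_corner_error(predicted, marked))
```

A lower mean error indicates that the candidate model reproduces the marked corner positions more closely; a real evaluation would replace the simulated `marked` list with the dataset's annotations.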

The dataset description and code are available at https://github.com/Visillect/SaltWaterDistortion.

For a fast download please use zenodo-get. To install it use the following commands:

pip install zenodo-get
zenodo_get https://zenodo.org/record/6475916 --output-dir=SWD
