Datasets for evaluation of keyword extraction in Russian
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Authors: Igor Markov, Sergey Nesteruk, Andrey Kuznetsov, Denis Dimitrov
GitHub: github.com/markovivl/SynthText
Information surrounds people in modern life. Text is a very efficient type of information that people have used for communication for centuries. However, automated text-in-the-wild recognition remains a challenging problem. The major limitation for a DL system is the lack of training data. For competitive performance, the training set must contain many samples that replicate real-world cases. While there are many high-quality datasets for English text recognition, there are no available datasets for the Russian language. In this paper, we present a large-scale human-labeled dataset for Russian text recognition in-the-wild. We also publish a synthetic dataset and the code to reproduce the generation process.
Annotations are provided in an info.csv file, which has the same format for all splits of the data, along with info_raw.csv or json_*_*.json files. The box_and_label field contains entries like:
[[{'left': 0.10259433962264151,
'top': 0,
'width': 0.4056603773584906,
'height': 0.9303675048355899,
'label': 'ALL you NEED
is 20 SECONDS
of Insane',
'shape': 'rectangle'},
{'left': 0.5141509433962265,
'top': 0.009671179883945842,
'width': 0.48584905660377353,
'height': 0.5222437137330754,
'label': 'COURAGE
AND I PROMISE YOU
something GREAT',
'shape': 'rectangle'},
{'left': 0.5165094339622641,
'top': 0.5357833655705996,
'width': 0.46344339622641517,
'height': 0.31334622823984526,
'label': 'will come of it
Benjmin Mee',
'shape': 'rectangle'}]]
where:
* left - x-axis relative left position of bbox (x_min)
* top - y-axis relative top position of bbox (y_min)
* width - x-axis relative width of bbox
* height - y-axis relative height of bbox
* label - text inside bounding box
* shape - always 'rectangle'
import json

import cv2
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import pandas as pd

TRAIN_PATH = 'train/real/'
train = pd.read_csv(TRAIN_PATH + 'info.csv')

# Pick a random sample
idx = train.sample(1).iloc[0].name
im = cv2.imread(TRAIN_PATH + train.iloc[idx]['image_path'])
im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; matplotlib expects RGB

fig, ax = plt.subplots()
# Display the image
ax.imshow(im)

# Parse the bounding boxes and labels for this image
bboxes = json.loads(train.iloc[idx]['box_and_label'])[0]

for bbox in bboxes:
    # Convert relative coordinates to pixels
    x = bbox['left'] * train.iloc[idx]['width']
    y = bbox['top'] * train.iloc[idx]['height']
    w = bbox['width'] * train.iloc[idx]['width']
    h = bbox['height'] * train.iloc[idx]['height']
    # Create a Rectangle patch
    rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='r', facecolor='none')
    # Add the patch to the Axes
    ax.add_patch(rect)

plt.title('\n'.join([bbox['label'] for bbox in bboxes]))
plt.show()
[image_2]
[image_3]
* It can be seen that the data isn't perfect: the word Лого in the first picture is unlabeled, and the second picture is missing the road-sign labels 40 and 4,5 м.
[image_0]
[image_1]
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The repository contains an ongoing collection of tweet IDs associated with the current conflict in Ukraine and Russia, which we commenced collecting on February 22, 2022. To comply with Twitter's Terms of Service, we are only publicly releasing the Tweet IDs of the collected Tweets. The data is released for non-commercial research use. Note that the compressed files must first be uncompressed in order to use the included scripts. This dataset is release v1.3 and is not actively maintained; the actively maintained dataset can be found here: https://github.com/echen102/ukraine-russia. This release contains Tweet IDs collected from 2/22/22 - 1/08/23. Please refer to the README for more details regarding the data, data organization, and data usage agreement. This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter's Terms of Service, and cite the following manuscript: Emily Chen and Emilio Ferrara. 2022. Tweets in Time of Conflict: A Public Dataset Tracking the Twitter Discourse on the War Between Ukraine and Russia. arXiv:cs.SI/2203.07488
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset consists of tweets relating to the Russian invasion of Ukraine that were scraped for this study. Only tweets for which user features were available are included. The tweets and corresponding user features can be rehydrated using the Twitter API. However, some tweets or users may have been deleted or made private and are therefore no longer available. Moreover, user and tweet features might change over time.
This dataset can be used to study the change in sentiment and topics over time as the war continues:
- Find out which tweets are most popular among people interested in the Russian invasion of Ukraine
- Identify which user attributes are associated with tweets about the Russian invasion of Ukraine
- Study the change in sentiment and public opinion on the war as events unfold.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: after_invasion_tweetids.csv

| Column name | Description |
|:--------------|:-----------------------|
| id | The tweet id. (String) |

File: before_invasion_tweetids.csv

| Column name | Description |
|:--------------|:-----------------------|
| id | The tweet id. (String) |
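The tweet IDs above can be rehydrated into full tweet objects through the Twitter API. A minimal sketch using the tweepy client (the bearer token is a placeholder, and the selected fields are illustrative, not prescribed by the dataset authors):

import tweepy

# Placeholder credential: obtain a bearer token from the Twitter/X developer portal
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Read the IDs, skipping the header row
with open("after_invasion_tweetids.csv") as f:
    next(f)
    ids = [line.strip() for line in f if line.strip()]

# The endpoint accepts at most 100 IDs per call, so rehydrate in batches
tweets = []
for i in range(0, len(ids), 100):
    response = client.get_tweets(
        ids=ids[i:i + 100],
        tweet_fields=["created_at", "public_metrics"],  # illustrative field choice
        expansions=["author_id"],
    )
    if response.data:
        tweets.extend(response.data)

Note that deleted or protected tweets are silently absent from the response, which matches the caveat above that some tweets may no longer be available.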
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is composed of 10,348 tweets: 5,284 in English and 5,064 in Turkish. Tweets in the dataset are human-annotated as "false", "true", or "other". The dataset covers multiple topics: the Russia-Ukraine war, the COVID-19 pandemic, refugees, and additional miscellaneous events. Details can be found at https://github.com/avaapm/mide22
This dataset is designed for research on audio deepfake detection, focusing specifically on generated speech in Russian. It contains TTS-generated audio paired with transcriptions, and a mixed set for real-vs-fake classification tasks.
The main goal is to support research on audio deepfake detection in underrepresented languages, especially Russian. The dataset simulates real-world scenarios using multiple state-of-the-art TTS systems to generate fakes and includes clean, real audio data.
We used three high-quality TTS models to synthesize Russian speech:
XTTS-v2: Cross-lingual, zero-shot voice cloning with multilingual support.
Silero TTS: Lightweight, real-time Russian TTS model.
VITS RU Multispeaker: VITS-based Russian model with speaker variability.
For real human speech, we used a part of the SOVA dataset, which contains clean Russian utterances recorded by multiple speakers.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially and will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format (Apache Parquet) with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present instructions for importing the data in an R or Python environment. Please consult the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl

# Load the entire dataset
RFSD = load_dataset('irlspbru/RFSD')

# Load one year of interest only
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within a year, meaning that streaming the first n rows will not yield a random sample.
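If a roughly random sample is needed without downloading everything, one workaround is streaming with a shuffle buffer. A minimal sketch, assuming the default split is named 'train' and using an illustrative buffer size (neither is documented RFSD usage):

from datasets import load_dataset

# Stream the dataset and shuffle with a buffer to approximate a random sample;
# the 'train' split name and buffer_size are assumptions, not from the RFSD docs
RFSD_stream = load_dataset('irlspbru/RFSD', split='train', streaming=True)
sample = list(RFSD_stream.shuffle(seed=42, buffer_size=10_000).take(1_000))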
Local File Import
Importing in Python requires the pyarrow package.
import pyarrow.dataset as ds
import polars as pl

# Read the dataset
RFSD = ds.dataset("local/path/to/RFSD")

# Use the schema to glance at the data structure
print(RFSD.schema)

# Load the entire dataset into memory
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load one year of interest only
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only the variables of interest
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Apply descriptive variable names
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename(
    {item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])}
)
R
Local File Import
Importing in R requires the arrow package.
library(arrow)
library(data.table)

# Read the dataset
RFSD <- open_dataset("local/path/to/RFSD")

# Use schema() to glance at the data structure
schema(RFSD)

# Load the entire dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load one year of interest only
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only the variables of interest
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Apply descriptive variable names
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022, and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023; Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to the house level in 2014 and 2021-2023, but only at the street level for 2015-2020 due to improper handling of the house number by Nominatim; in that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups from this data. Gazprom, for instance, has over 800 affiliated entities; to study this corporate group in its entirety, it is not enough to consider the financials of the parent company alone.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing previous-year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and the timely availability of a new version. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
Image Captioning for the Russian Language
This dataset is the Russian part of dinhanhx/crossmodal-3600.
Dataset Details
3.11k rows, with two descriptions for each picture. Corrupted pictures were deleted from the original source. The main feature is that all descriptions are written by native Russian speakers.
Paper: https://google.github.io/crossmodal-3600/
Uses
It is intended to be used for fine-tuning image captioning models.
Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours. We have made the corpus freely available for download, along with an acoustic model trained on this corpus. We also provide a 3-gram KenLM language model built using an open Common Crawl corpus. The main project page: Golos GitHub repository. Check the license file en_us.pdf.
Custom license: https://dataverse.no/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18710/8CAPJM
GPT-4 interpretations of a dataset of 2,227 examples gathered from the Russian Constructicon (https://constructicon.github.io/russian/).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The set of over 2,250 files archived here comprises a database of the Russian Constructicon, an open-access electronic resource freely available at https://constructicon.github.io/russian/. The Russian Constructicon is a searchable database of constructions accompanied by thorough descriptions of their properties and annotated illustrative examples.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A database of first names, surnames, and patronymics across the Russian Federation, used as a source to train algorithms for gender identification by full name.
The dataset is prepared for a MongoDB database. It includes a MongoDB dump and dumps of the tables as JSON Lines files.
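As a quick way to inspect the JSON Lines dumps without a running MongoDB instance, a minimal sketch (the file name is hypothetical; check the actual dump contents):

import json

# Hypothetical file name; the dump contains one JSON Lines file per table
with open("names.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records), "records loaded")
print(records[0])  # inspect the available fields before relying on them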
Used in gender identification and fullname parsing software https://github.com/datacoon/russiannames
Available under Creative Commons CC BY-SA by default.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of 48,838 tweets, each containing one of 31 gesture emoji (different hand configurations) or their skin tone modifier variants (e.g. 🙏🙏🏿🙏🏾🙏🏽🙏🏼🙏🏻), posted within 50 km of Moscow, Russia, in Russian, during May-August 2021. The dataset can be used to investigate the use of gesture emoji by Russian users of the Twitter platform. Python libraries used for collecting tweets and preprocessing: tweepy, re, preprocessor, emoji, regex, string, nltk.
The dataset contains 12 columns (a small filtering sketch follows the list):
* tweet_original - original text of the tweet
* preprocessed - preprocessed text of the tweet (4 steps)
* all_emoji - lists all emoji in a given tweet
* hashtags - lists all hashtags in a given tweet
* user_encoded - encoded Twitter user name: the first 3 characters of the user name and the first 3 characters of the user's location
* location_encoded - location of the user: "moscow", "moscow_region", or "other"
* mention_present - whether the tweet contains mentions
* url_present - whether the tweet contains a URL
* preprocess_tweet - preprocessing step 1: tokenizing mentions, URLs, and hashtags
* lowercase_tweet - preprocessing step 2: lowercasing
* remove_punct_tweet - preprocessing step 3: removing punctuation
* tokenize_tweet - preprocessing step 4: tokenizing
Further information on the research project can be found here: https://github.com/mzhukovaucsb/emoji_gestures/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The presented database is a set of hydrological, meteorological, environmental, and geometric values for the Russian Federation for the period from 2008 to 2020.
The database consists of the following items:
Each variable derived from the grid data was calculated for each watershed, taking into account the intersection weights of the watershed contour geometry and grid cells.
Coordinates of hydrological stations were obtained from the resource of the Federal Agency for Water Resources of the Russian Federation (AIS GMVO).
To calculate the contours of the catchment areas, a script was developed that builds the contours in accordance with the flow-direction rasters from MERIT Hydro. To assess the quality of the contour construction, the obtained catchment-area value was compared with the archival value from the corresponding table in AIS GMVO. The average error in determining the area for 2080 catchments is approximately 2%.
To derive hydro-environmental variables from HydroATLAS, an approach was developed that calculates aggregated values for each catchment depending on the type of variable: qualitative (land cover classes, lithological classes, etc.) or quantitative (air temperature, snow cover extent, etc.). Each qualitative variable was calculated as the mode value across the sub-basins intersecting the target catchment, i.e. the most frequent attribute among the sub-basins describes the whole catchment to which they belong. Quantitative variables were calculated as the mean value of the attribute across sub-basins. More detail can be found in the publication.
Files are distributed as follows:
Each file is linked to the unique identifier of the hydrological observation post. Files in NetCDF format (hydrological and meteorological series) are named according to this identifier.
Every file that describes geometry (point, polygon, static attributes) has a column named gauge_id with the same correspondence.
| | gauge_id | name_ru | name_en | geometry |
|---|---|---|---|---|
| 0 | 49001 | р. Ковда – пос. Софпорог | r.Kovda - pos. Sofporog | POINT (31.41892 65.79876) |
| 1 | 49014 | р. Корпи-Йоки – пос. Пяозерский | r.Korpi-Joki - pos. Pjaozerskij | POINT (31.05794 65.77917) |
| 2 | 49017 | р. Тумча – пос. Алакуртти | r.Tumcha - pos. Alakurtti | POINT (30.33082 66.95957) |
| | gauge_id | name_ru | name_en | new_area | ais_dif | geometry |
|---|---|---|---|---|---|---|
| 0 | 9002 | р. Енисей – г. Кызыл | r.Enisej - g.Kyzyl | 115263.989 | 0.230 | POLYGON ((96.87792 53.72792, 96.87792 53.72708... |
| 1 | 9022 | р. Енисей – пос. Никитино | r.Enisej - pos. Nikitino | 184499.118 | 1.373 | POLYGON ((96.87792 53.72708, 96.88042 53.72708... |
| 2 | 9053 | р. Енисей – пос. Базаиха | r.Enisej - pos.Bazaiha | 302690.417 | 0.897 | POLYGON ((92.38292 56.11042, 92.38292 56.10958... |
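To illustrate how the gauge_id keying ties the files together, a minimal sketch (file names, paths, and formats here are assumptions, not taken from the database documentation):

import xarray as xr
import geopandas as gpd

gauge_id = 49001  # an identifier from the gauges table above

# Hydrological/meteorological series for this gauge (assumed file layout)
series = xr.open_dataset(f"netcdf/{gauge_id}.nc")

# Geometry files carry the same identifier in their gauge_id column
gauges = gpd.read_file("geometry/gauges.gpkg")  # hypothetical file name
this_gauge = gauges[gauges["gauge_id"] == gauge_id]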
More details on the processing scripts used to develop this database can be found in the corresponding folder of the GitHub repository where I store results for my PhD dissertation.
This dataset contains catch data relating to salmon from northern Russia between 1615 and 1937.
Title: HMAP Dataset 09: North Russian Salmon Catch Data, 1615-1937
Citation: J. Lajus et al., eds., 'North Russian Salmon Catch Data, 1615-1937', in J.H. Nicholls (comp.) HMAP Data Pages (https://oceanspast.org/hmap_db.php)
License: Attribution (CC BY), https://creativecommons.org/licenses/by/4.0/
Institution: OPI, TCD
Size: 3193 records
Geographic coverage: 31.27-61.1° E, 60.62-71.2° N
Temporal coverage: 1759-01-01 to 1937-01-01
Version: 1.0 (2012-08-02); released 2013-06-11
Lineage: Prior to publication, data undergo quality control checks described at https://github.com/EMODnet/EMODnetBiocheck?tab=readme-ov-file#understanding-the-output
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Kazakh, Russian, and English Glyph Images Dataset is a comprehensive collection designed for researchers, developers, and designers working with multilingual text. This dataset includes 6952 distinct styles, each containing 136 glyphs (~1,000,000 images in total).
Dataset Highlights:
- 6952 Styles: A vast array of styles to ensure diverse representation and versatility.
- Multilingual Support: Covers the Kazakh, Russian, and English alphabets, making it ideal for projects requiring support for these languages.
- Detailed Glyphs: High-quality images of each glyph in both uppercase and lowercase formats.
The images in this dataset are named using the format {upper|lower}_{char}_{fontname}.png. This naming convention makes it easy to identify whether a glyph is uppercase or lowercase, the specific character, and the font used.
Examples:
- upper_A_Arial.png: an image of the uppercase letter 'A' in the Arial font.
- lower_a_TimesNewRoman.png: an image of the lowercase letter 'a' in the Times New Roman font.
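Given this convention, file names can be parsed back into their components. A minimal sketch:

import os

def parse_glyph_filename(path):
    # Split a name of the form {case}_{char}_{fontname}.png
    stem = os.path.splitext(os.path.basename(path))[0]
    case, char, fontname = stem.split("_", 2)
    return case, char, fontname

print(parse_glyph_filename("upper_A_Arial.png"))  # ('upper', 'A', 'Arial')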
The dataset was made through this project: https://github.com/Gabrielprogramist/FontImageGenerator.git
The data repository includes data and computational codes used for the "#Navalny's Death and Russia's Future: Anti-Authoritarianism and the Politics of Mourning" study. https://github.com/madhav28/Navalny-Study
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Multilingual Speech Dataset
Paper: A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English
Repository: https://github.com/IS2AI/MultilingualASR
Description: This repository provides the dataset used in the paper "A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English". The paper focuses on training a single end-to-end (E2E) ASR model for Kazakh, Russian, and English, comparing monolingual and multilingual approaches… See the full description on the dataset page: https://huggingface.co/datasets/issai/Multilingual_Speech_Dataset.
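A minimal loading sketch via the 🤗 Datasets library, using the repository ID from the dataset page URL above (additional arguments such as a config name may be required; that detail is an assumption, not documented usage):

from datasets import load_dataset

# Repository ID taken from the dataset page URL; splits/configs are assumptions
ds = load_dataset("issai/Multilingual_Speech_Dataset")
print(ds)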
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Document collection scraped from the Russian governmental website kremlin.ru, where all content is licensed under Creative Commons Attribution 4.0. Downloaded on 17 March 2019. Includes all items listed at http://kremlin.ru/events/president/transcripts up to the end of February 2019 (10221 documents).
Format:
1) Kremlin_transcripts_ru_corpus.rds: a 'corporaexplorerobject' intended to be used with the corporaexplorer R package (https://github.com/kgjerde/corporaexplorer).
2) Kremlin_transcripts_ru_df.rds: a regular R data frame with the documents and some metadata.
Version 3. Edited 1 November 2019: UTF-8 encoding fix.
This dataset tracks the updates made on the dataset "CTLA4 gene polymorphisms are associated with, and linked to, insulin-dependent diabetes mellitus in a Russian population" as a repository for previous versions of the data and metadata.