11 datasets found
  1. TeleScope: A Longitudinal Dataset for Aggregated User Interactions and Information Dissemination on Telegram

    • datacatalogue.cessda.eu
    • search.gesis.org
    Updated Jan 23, 2025
    Cite
    Gangopadhyay, Susmita; Dessi, Danilo; Dimitrov, Dimitar; Dietze, Stefan (2025). TeleScope: A Longitudinal Dataset for Aggregated User Interactions and Information Dissemination on Telegram [Dataset]. http://doi.org/10.7802/2825
    Explore at:
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    University of Sharjah
    Heinrich Heine University Düsseldorf
    GESIS – Leibniz Institute for the Social Sciences
    Authors
    Gangopadhyay, Susmita; Dessi, Danilo; Dimitrov, Dimitar; Dietze, Stefan
    Measurement technique
    Web Crawling; Web Scraping
    Description

    TeleScope is an extensive dataset suite comprising metadata for about 500K Telegram channels, together with message metadata downloaded from the 71K public channels among them, accounting for about 120M crawled messages. Beyond metadata, the suite provides enrichments such as per-channel language detection and active periods, as well as Telegram entities extracted from messages. It also includes channel connections and user-interaction data built from Telegram’s message-forwarding feature, supporting use cases such as information spread and message-forwarding patterns. The dataset is designed for diverse applications, independent of specific research objectives, and is versatile enough to support replication of social media studies comparable to those conducted on platforms like X (formerly Twitter).

    Further information on the content of the files can be found in the file TeleScope_readme_v1-0-0.txt (see 'Technical Report').

    keywords: Computational Social Science; Information Science, Web and Social Media; text analysis; text processing; text communication; social media; Online discourse; Information Dissemination; Information Analysis

  2. German QAnon Telegram Dataset

    • figshare.com
    zip
    Updated Nov 18, 2021
    Cite
    W.F. Thomas (2021). German QAnon Telegram Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.16879513.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 18, 2021
    Dataset provided by
    figshare
    Authors
    W.F. Thomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set consists of messages from public Telegram channels, concentrating on German-language discussions of QAnon.

    The date range of the data is from the creation of the channel to 01 July 2021.

    To collect the data, I first downloaded the chat history of 3 channels (listed under "Primary"), counted the number of messages forwarded from other channels/accounts, and selected the top 5 most-forwarded-from channels/accounts; these became my Secondary level.

    I then repeated the process for the Secondary level, downloading the chat histories and determining the most-forwarded-from channels/accounts; the top 5 for each channel/account in the Secondary level became the Tertiary level.

    I repeated this for the members of the Tertiary level, downloading their chat histories and determining which channels/groups were forwarded into the Tertiary level, but stopped the process there. For the visualization, I used the unique channels/accounts as nodes and the forwarding of a message as an edge connecting nodes.
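    The snowball procedure above can be sketched as follows. The message dicts and the 'forwarded_from' field name are hypothetical stand-ins for whatever the actual Telegram chat-history export provides:

    ```python
    from collections import Counter

    def top_forward_sources(messages, k=5):
        """Return the k channels most often forwarded from, given a list of
        message dicts with an optional 'forwarded_from' key (hypothetical
        field name; the real export uses its own schema)."""
        counts = Counter(
            m["forwarded_from"] for m in messages if m.get("forwarded_from")
        )
        return [channel for channel, _ in counts.most_common(k)]

    def snowball(primary, fetch_history, levels=2, k=5):
        """Expand from the Primary channels level by level, following the
        top-k most-forwarded-from channels at each step.
        `fetch_history(channel)` must return that channel's message list."""
        tiers = [list(primary)]
        seen = set(primary)
        for _ in range(levels):
            next_tier = []
            for channel in tiers[-1]:
                for src in top_forward_sources(fetch_history(channel), k):
                    if src not in seen:
                        seen.add(src)
                        next_tier.append(src)
            tiers.append(next_tier)
        # tiers[0] = Primary, tiers[1] = Secondary, tiers[2] = Tertiary
        return tiers
    ```

    With levels=2 this reproduces the Primary/Secondary/Tertiary expansion described above, stopping after the Tertiary level.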

    Also included in this data set are the full text histories of the channels I collected data from, in the "Corpus" folder. The text of the messages was extracted from the JSON files of the chat history, leaving only the content of the messages.

    My own analysis of this dataset has been basic, but I hope other researchers find this data useful.

    W.F. Thomas
    wfthomas@protonmail.com
    www.wfthomas.com
    2021
    USE WITH ATTRIBUTION ONLY

  3. MultiSocial

    • zenodo.org
    Updated May 21, 2025
    Cite
    Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal (2025). MultiSocial [Dataset]. http://doi.org/10.5281/zenodo.13846152
    Explore at:
    Dataset updated
    May 21, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MultiSocial is a dataset (described in a paper) for benchmarking multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written; approximately the same amount was generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.

    If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.

    Disclaimer

    Due to the data sources (described below), the dataset may contain harmful, disinformative, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (ranging from 5% on WhatsApp to 10% on Twitter). Although we used older data sources (with a lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not have successfully hidden all sensitive/personal content; thus, use the data cautiously (if affected by such content, report issues to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.

    Data Source

    The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:

    1. Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.

    2. Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).

    3. Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.

    4. Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).

    5. WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.

    From these datasets, we pseudo-randomly sampled up to 1,300 texts per platform for each of the selected 22 languages (using a combination of automated approaches to detect the language): up to 300 for the test split and, where available, up to 1,000 more for the train split. This process resulted in 61,592 human-written texts, which were further filtered based on the occurrence of certain characters and on length, leaving about 58k human-written texts.

    The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).

    The dataset has the following fields:

    • 'text' - a text sample,

    • 'label' - 0 for human-written text, 1 for machine-generated text,

    • 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,

    • 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,

    • 'language' - the ISO 639-1 language code identifying the detected language of the given text,

    • 'length' - word count of the given text,

    • 'source' - a string identifying the source dataset / platform of the given text,

    • 'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
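    A minimal sketch of consuming records with the fields listed above; the assumption that the dataset ships as JSON Lines is mine, so check the actual distribution format on Zenodo:

    ```python
    import json

    def load_split(path, split):
        """Read a JSON-Lines file of MultiSocial-style records and keep
        only those whose 'split' field matches (train/test)."""
        rows = []
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                rec = json.loads(line)
                if rec["split"] == split:
                    rows.append(rec)
        return rows

    def human_vs_machine(rows):
        """Partition records by the binary 'label' field
        (0 = human-written, 1 = machine-generated)."""
        human = [r for r in rows if r["label"] == 0]
        machine = [r for r in rows if r["label"] == 1]
        return human, machine
    ```

    The 'multi_label' field can be used the same way to group machine-generated texts by the model that produced them.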

    ToDo Statistics (under construction)

  4. TUApps

    • zenodo.org
    zip
    Updated May 16, 2024
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2024). TUApps [Dataset]. http://doi.org/10.5281/zenodo.11201267
    Explore at:
    Available download formats: zip
    Dataset updated
    May 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To research the illegal activities of underground apps on Telegram, we have created a dataset called TUApps. TUApps is a progressively growing dataset of underground apps, collected from September 2023 to February 2024, consisting of a total of 1,000 underground apps and 200 million messages distributed across 71,332 Telegram channels.
    In the process of creating this dataset, we followed strict ethical standards to ensure the lawful use of the data and the protection of user privacy. The dataset includes the following files:
    (1) dataset.zip: We have packaged the underground app samples. The naming of Android app files is based on the SHA256 hash of the file, and the naming of iOS app files is based on the SHA256 hash of the publishing webpage.
    (2) code.zip: We have packaged the code used for crawling data from Telegram and for performing data analysis.
    (3) message.zip: We have packaged the messages crawled from Telegram, the files are named after the names of the channels in Telegram.
    Availability of code and messages
    Upon acceptance of our research paper, the dataset containing user messages and the code used for data collection and analysis will only be made available upon request to researchers who agree to adhere to strict ethical principles and maintain the confidentiality of the data.

  5. Dataset on the online cryptocurrency discussion on Twitter, Telegram, and Discord

    • data.niaid.nih.gov
    Updated Nov 22, 2022
    Cite
    Cresci, Stefano (2022). Dataset on the online cryptocurrency discussion on Twitter, Telegram, and Discord [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3895020
    Explore at:
    Dataset updated
    Nov 22, 2022
    Dataset provided by
    Nizzoli, Leonardo
    Tardelli, Serena
    Ferrara, Emilio
    Avvenuti, Marco
    Tesconi, Maurizio
    Cresci, Stefano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is described in Charting the Landscape of Online Cryptocurrency Manipulation, IEEE Access (2020), a study that maps and assesses the extent of cryptocurrency manipulation within and across the online ecosystems of Twitter, Telegram, and Discord. Starting from tweets mentioning cryptocurrencies, we leveraged and followed invite URLs from platform to platform, building the invite-link network in order to study the invite-link diffusion process.

    Please, refer to the paper below for more details.

    Nizzoli, L., Tardelli, S., Avvenuti, M., Cresci, S., Tesconi, M. & Ferrara, E. (2020). Charting the Landscape of Online Cryptocurrency Manipulation. IEEE Access (2020).

    This dataset is composed of:

    ~16M tweet ids shared between March and May 2019, mentioning at least one of the 3,822 cryptocurrencies (cashtags) provided by the CryptoCompare public API;

    ~13k nodes of the invite-link network, i.e., the information about the Telegram/Discord channels and Twitter users involved in the cryptocurrency discussion (e.g., id, name, audience, invite URL);

    ~62k edges of the invite-link network, i.e., the information about the flow of invites (e.g., source id, target id, weight).

    With such information, one can easily retrieve the content of channels and messages through Twitter, Telegram, and Discord public APIs.
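    A minimal sketch of loading the invite-link edges into a weighted directed adjacency map; the CSV layout and the column names `source_id`, `target_id`, `weight` are my assumptions based on the field description above, not the dataset's documented schema:

    ```python
    import csv
    from collections import defaultdict

    def load_invite_network(edges_path):
        """Build {source: {target: weight}} from an edge-list CSV with
        source_id, target_id and weight columns (assumed names)."""
        graph = defaultdict(dict)
        with open(edges_path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                graph[row["source_id"]][row["target_id"]] = int(row["weight"])
        return graph

    def out_strength(graph, node):
        """Total weight of invites flowing out of `node`."""
        return sum(graph.get(node, {}).values())
    ```

    From this adjacency map one can compute simple diffusion statistics (out-strength, reachable sets) before moving to a full graph library.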

    Please, refer to the README file for more details about the fields.

  6. Bahrain Telegram Data

    • listtodata.com
    .csv, .xls, .txt
    Updated Jul 17, 2025
    Cite
    List to Data (2025). Bahrain Telegram Data [Dataset]. https://listtodata.com/bahrain-telegram-data
    Explore at:
    Available download formats: .csv, .xls, .txt
    Dataset updated
    Jul 17, 2025
    Dataset authored and provided by
    List to Data
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2025 - Dec 31, 2025
    Area covered
    Philippines, Bahrain, Belgium
    Variables measured
    phone numbers, email address, full name, address, city, state, gender, age, income, IP address
    Description

    Bahrain Telegram data provides trustworthy and validated leads from List To Data, designed to support telemarketing efforts. The database contains current phone numbers of Bahraini Telegram users, guaranteeing correct and up-to-date data. You can quickly establish connections with potential clients, improve interaction, and maximize your marketing efforts using this data. Bahrain Telegram screening data provides up-to-date and accurate Telegram phone number leads, including: numbers active on Telegram, gender, age, Telegram username, last activity date, and industry classification. List To Data offers reliability and authenticity, helping you interact with potential clients, increase engagement, and enhance marketing performance using trustworthy phone numbers.

  7. MentalRiskES corpus

    • zenodo.org
    • investigacion.ujaen.es
    Updated Apr 24, 2025
    Cite
    Alba María Mármol-Romero; Alba María Mármol-Romero; Adrián Moreno-Muñoz; Adrián Moreno-Muñoz; Flor Miriam Plaza del Arco; Flor Miriam Plaza del Arco; María Dolores Molina-González; María Dolores Molina-González; Maria-Teresa Martin-Valdivia; Maria-Teresa Martin-Valdivia; Alfonso Ureña López; Alfonso Ureña López; Arturo Montejo-Ráez; Arturo Montejo-Ráez (2025). MentalRiskES corpus [Dataset]. http://doi.org/10.5281/zenodo.15275274
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alba María Mármol-Romero; Alba María Mármol-Romero; Adrián Moreno-Muñoz; Adrián Moreno-Muñoz; Flor Miriam Plaza del Arco; Flor Miriam Plaza del Arco; María Dolores Molina-González; María Dolores Molina-González; Maria-Teresa Martin-Valdivia; Maria-Teresa Martin-Valdivia; Alfonso Ureña López; Alfonso Ureña López; Arturo Montejo-Ráez; Arturo Montejo-Ráez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MentalRiskES is a new dataset about mental disorders in Spanish. The dataset is divided into three distinct mental disorders:

    • Eating Disorder
    • Depression
    • Anxiety

    Each dataset contains a set of subjects and their message thread in a Telegram social network chat.

    How is it constructed?
    Public groups on the Telegram social network were accessed, and conversations were extracted from them. The data was processed to keep only text messages, excluding images, audio, etc. For annotation, a subset of messages was extracted for each subject. Each message thread was annotated by 10 different annotators recruited through the Prolific platform, using the Doccano annotation platform.

    In this way, we associated each user ID with labels derived by averaging the annotators' decisions. The labels available for each set are:

    • Eating Disorder: suffer (s), control (c)
    • Depression: suffer + in favour (sf), suffer + against (sa), suffer + other (so), control (c)
    • Anxiety: suffer (s), control (c)

    Labels
    The values available in Anxiety files are:

    • bs (binary suffer): 1 if the subject suffers and 0 if not according to the frequency of the labels (in case of a tie it is marked as suffers)
    • bc (binary control): 1 if the subject does not suffer and 0 if they do according to the frequency of the labels (in case of a tie it is marked as suffers)
    • rbs (regression binary suffer): number of times the subject has been marked as suffering among the total number of scorers, i.e., 10
    • rbc (regression binary control): number of times the subject has been marked as not suffering among the total number of scorers, i.e., 10

    The values available in the Depression and Eating Disorders files are:

    • bs (binary suffer): 1 if the subject suffers and 0 if not, according to the frequency of the labels (in case of a tie it is marked as suffers)
    • bsf (binary suffer favour): 1 if the subject suffers and is in favour, and 0 if not, according to the frequency of the labels
    • bsa (binary suffer against): 1 if the subject suffers and is against, and 0 if not according to the frequency of the labels
    • bso (binary suffer other): 1 if the subject suffers and is neither in favour nor against and 0 if not according to the frequency of the labels
    • bc (binary control): 1 if the subject does not suffer and 0 if they do according to the frequency of the labels (in case of a tie it is marked as suffers)
    • rbs (regression binary suffer): number of times the subject has been marked as suffering among the total number of scorers, i.e., 10
    • rbc (regression binary control): number of times the subject has been marked as not suffering among the total number of scorers, i.e., 10
    • rsf (regression suffer favour): number of times the subject has been marked as suffering and in favour among the total number of scorers, i.e., 10
    • rsa (regression suffer against): number of times the subject has been marked as suffering and against among the total number of scorers, i.e., 10
    • rso (regression suffer other): number of times the subject has been marked as suffering and is neither in favour nor against among the total number of scorers, i.e., 10
    • rc (regression control): number of times the subject has been marked as not suffering among the total number of scorers, i.e., 10 (Note that it is equal to 'rbc')
      So, the labels 'rbs' and 'rbc' must sum to 1, and the labels 'rsf','rsa', 'rso' and 'rc' must sum to 1 too.
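    The binary and regression labels above can be derived from the 10 annotator votes as follows; the vote encoding ('s' = suffers, 'c' = control) is my assumption, and ties are marked as suffering, matching the listing:

    ```python
    def aggregate_labels(votes):
        """Compute bs/bc/rbs/rbc from a list of per-annotator decisions,
        where 's' means the subject suffers and 'c' means control.
        The regression labels are fractions of the annotator pool, so
        rbs + rbc sums to 1 as noted above."""
        n = len(votes)
        suffer = votes.count("s")
        control = votes.count("c")
        return {
            "bs": 1 if suffer >= control else 0,  # tie -> marked as suffers
            "bc": 1 if control > suffer else 0,
            "rbs": suffer / n,
            "rbc": control / n,
        }
    ```

    The finer-grained Depression/Eating Disorder labels (bsf/bsa/bso and their regression counterparts) follow the same pattern with a three-way suffer sub-vote.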

    Preprocessing
    The corpus is provided both with and without emojis: the folder 'processed' contains the corpus with emojis rendered as text, while the folder 'raw' contains the corpus with emojis in their original format.

    MentalRiskES evaluation campaign
    MentalRiskES is a shared task organized at IberLEF. The aim of this task is to promote the early detection of mental risk disorders in Spanish. In this task we made use of the corpus MentalRiskES; the partitions used are available in the folder MentalRiskES2023edition.zip provided in git (https://github.com/sinai-uja/corpusMentalRiskES). To cite the task: Mármol-Romero, A. M., Moreno-Muñoz, A., Plaza-del-Arco, F. M., Molina-González, M. D., Martín-Valdivia, M. T., Ureña-López, L. A., & Montejo-Raéz, A. (2023). Overview of MentalRiskES at IberLEF 2023: Early Detection of Mental Disorders Risk in Spanish. Procesamiento del Lenguaje Natural, 71, 329-350.

  8. Data from: Messages from SA covid vax chat Telegram channel

    • zenodo.org
    Updated Jun 30, 2025
    Cite
    Rebecca Pointer; Rebecca Pointer; Peter van Heusden; Peter van Heusden (2025). Messages from SA covid vax chat Telegram channel [Dataset]. http://doi.org/10.25379/uwc.26965024.v1
    Explore at:
    Dataset updated
    Jun 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rebecca Pointer; Rebecca Pointer; Peter van Heusden; Peter van Heusden
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. SQLite format database containing messages from SA COVID VAX CHAT from 10 May 2021 to 22 May 2022. Original user IDs and user names have been replaced with anonymous IDs.
    2. Messages in CSV format filtered to remove spam and with themes annotated, from May 2021 to end 2022.
    3. Spam messages (messages appearing more than 100 times in the message dataset) in JSON format, one JSON record per line (i.e., JSON-L format).
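    The spam rule above (texts repeated more than 100 times) can be sketched as a simple frequency filter; the 'text' field on the message dicts is my assumption:

    ```python
    import json
    from collections import Counter

    def split_spam(messages, threshold=100):
        """Separate messages whose text appears more than `threshold`
        times (spam, per the rule above) from the rest."""
        counts = Counter(m["text"] for m in messages)
        spam = [m for m in messages if counts[m["text"]] > threshold]
        clean = [m for m in messages if counts[m["text"]] <= threshold]
        return clean, spam

    def write_jsonl(records, path):
        """Write one JSON record per line (JSON-L), as in the spam file."""
        with open(path, "w", encoding="utf-8") as fh:
            for rec in records:
                fh.write(json.dumps(rec) + "\n")
    ```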

  9. COVID-19 Coronavirus data - weekly (from 17 December 2020)

    • data.europa.eu
    csv, excel xlsx, html +3
    Updated Dec 17, 2020
    Cite
    European Centre for Disease Prevention and Control (2020). COVID-19 Coronavirus data - weekly (from 17 December 2020) [Dataset]. https://data.europa.eu/data/datasets/covid-19-coronavirus-data-weekly-from-17-december-2020?locale=en
    Explore at:
    Available download formats: html, csv, json, unknown, xml, excel xlsx
    Dataset updated
    Dec 17, 2020
    Dataset authored and provided by
    European Centre for Disease Prevention and Control
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains a weekly situation update on COVID-19, the epidemiological curve and the global geographical distribution (EU/EEA and the UK, worldwide).

    Since the beginning of the coronavirus pandemic, ECDC’s Epidemic Intelligence team has collected the number of COVID-19 cases and deaths, based on reports from health authorities worldwide. This comprehensive and systematic process was carried out on a daily basis until 14/12/2020. See the discontinued daily dataset: COVID-19 Coronavirus data - daily. ECDC’s decision to discontinue daily data collection reflects the fact that the daily number of cases reported or published by countries is frequently subject to retrospective corrections, delays in reporting, and/or clustered reporting of data for several days. Therefore, the daily number of cases may not reflect the true number of cases at EU/EEA level on a given day of reporting. Consequently, day-to-day variations in the number of cases do not constitute a valid basis for policy decisions.

    ECDC continues to monitor the situation. Every week between Monday and Wednesday, a team of epidemiologists screen up to 500 relevant sources to collect the latest figures for publication on Thursday. The data screening is followed by ECDC’s standard epidemic intelligence process for which every single data entry is validated and documented in an ECDC database. An extract of this database, complete with up-to-date figures and data visualisations, is then shared on the ECDC website, ensuring a maximum level of transparency.

    ECDC receives regular updates from EU/EEA countries through the Early Warning and Response System (EWRS), The European Surveillance System (TESSy), the World Health Organization (WHO), and email exchanges with other international stakeholders. This information is complemented by screening up to 500 sources every day to collect COVID-19 figures from 196 countries. This includes websites of ministries of health (43% of the total number of sources), websites of public health institutes (9%), websites of other national authorities (ministries of social services and welfare, governments, prime minister cabinets, cabinets of ministries, websites on health statistics and official response teams) (6%), WHO websites and WHO situation reports (2%), and official dashboards and interactive maps from national and international institutions (10%). In addition, ECDC screens social media accounts maintained by national authorities, for example Twitter, Facebook, YouTube, or Telegram accounts run by ministries of health (28%), and other official sources (e.g. official media outlets) (2%). Several media and social media sources are screened to gather additional information which can be validated against the official sources previously mentioned. Only cases and deaths reported by the national and regional competent authorities from the countries and territories listed are aggregated in our database.

    Disclaimer: National updates are published at different times and in different time zones. This, and the time ECDC needs to process these data, might lead to discrepancies between the national numbers and the numbers published by ECDC. Users are advised to use all data with caution and awareness of their limitations. Data are subject to retrospective corrections; corrected datasets are released as soon as processing of updated national data has been completed.

    If you reuse or enrich this dataset, please share it with us.

  10. FileMarket | Text Recognition Data | 50,000 Images | Computer Vision Data | AI Model Training Data | Textual data | Annotated Imagery Data

    • datarade.ai
    Updated Jul 10, 2024
    Cite
    FileMarket (2024). FileMarket | Text Recognition Data | 50,000 Images | Computer Vision Data | AI Model Training Data | Textual data | Annotated Imagery Data [Dataset]. https://datarade.ai/data-products/filemarket-text-recognition-data-50-000-images-computer-filemarket
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jul 10, 2024
    Dataset authored and provided by
    FileMarket
    Area covered
    Belarus, Nigeria, Finland, South Sudan, Faroe Islands, Bulgaria, Zimbabwe, Bhutan, United Kingdom, Seychelles
    Description

    Annotated Imagery Data

    FileMarket provides a robust Annotated Imagery Data set designed to meet the diverse needs of various computer vision and machine learning tasks. This dataset is part of our extensive offerings, which also include Textual Data, Object Detection Data, Large Language Model (LLM) Data, and Deep Learning (DL) Data. Each category is meticulously crafted to ensure high-quality and comprehensive datasets that empower AI development.

    Specifications:

    • Data Size: 50,000 images
    • Collection Environment: The images cover a wide array of real-world scenarios, including shop signs, stop boards, posters, tickets, road signs, comics, cover pictures, prompts/reminders, warnings, packaging instructions, menus, building signs, and more.
    • Diversity: The dataset spans 5 languages and includes images from various natural scenes captured at multiple photographic angles (looking up, looking down, eye-level).
    • Devices Used: Images are captured using cellphones and cameras, reflecting real-world usage.
    • Image Parameters: All images are provided in .jpg format, and the corresponding annotation files are in .json format.
    • Annotation Details: The dataset includes line-level quadrilateral bounding box annotations and text transcriptions.
    • Accuracy: The error margin for each vertex of the quadrilateral bounding box is within 5 pixels, ensuring bounding box accuracy of at least 97%. The text transcription accuracy also meets or exceeds 97%.
    • Unique Data Collection Method: FileMarket utilizes a community-driven approach to collect data, leveraging a network of over 700k users across various Telegram apps. This ensures that the datasets are diverse, applicable to real-world scenarios, and ethically sourced with full participant consent.
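    The 5-pixel per-vertex error margin quoted above can be checked with a small helper; the quadrilateral representation as four (x, y) points is my assumption about the annotation layout:

    ```python
    def vertex_errors(pred_quad, gold_quad):
        """Per-vertex Euclidean error between a predicted and a reference
        quadrilateral, each given as four (x, y) points."""
        return [
            ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
            for (px, py), (gx, gy) in zip(pred_quad, gold_quad)
        ]

    def within_margin(pred_quad, gold_quad, margin=5.0):
        """True if every vertex lies within `margin` pixels, as in the
        accuracy specification above."""
        return all(e <= margin for e in vertex_errors(pred_quad, gold_quad))
    ```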

    By integrating our unique data collection method with the specialized categories we offer, FileMarket is committed to providing high-quality data solutions that support and enhance your AI and machine learning projects.

  11. OpenDevelopment

    • data.opendevelopmentmekong.net
    Updated Mar 29, 2023
    + more versions
    Cite
    (2023). OpenDevelopment [Dataset]. https://data.opendevelopmentmekong.net/dataset/tech-salon-1-increasing-your-safety-on-telegram
    Explore at:
    Dataset updated
    Mar 29, 2023
    Description

    This slide presentation was prepared for the monthly Tech Salon organized by KAWSANG to share technology solutions for the ICT challenges civil society organizations face under the Civil Society Strengthening (CSS) project. The presentation covered creating a user account; privacy, safety, and passcode settings; devices; encryption; and the functionality of Telegram Channels and Telegram Groups.

