CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Albania telegram data provides an accurate TG phone number list: contact information for active Telegram users. If you want to sell your items through Telegram marketing campaigns, you may use our Albania Telegram list. The data is from 2025 and is up to date. This tag data does not generate additional sales. It cannot be sold. The database is accurate and authentic. Albania Telegram screening data will provide you with live and accurate Telegram phone number leads. The Albania Telegram dataset includes the following data: all numbers are open in Telegram, gender, age, TG username, last activity date, and industry classification. Albania tg powder might help you increase your business sales. Telegram is becoming one of the most effective tools for direct marketing. The offered data provides you with an accurate and up-to-date tg powder database. We also provide after-sales assistance to meet your company’s needs. Check out our packages here.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To research the illegal activities of underground apps on Telegram, we have created a dataset called TUApps. TUApps is a progressively growing dataset of underground apps, collected from September 2023 to February 2024, consisting of a total of 1,000 underground apps and 200 million messages distributed across 71,332 Telegram channels.
In the process of creating this dataset, we followed strict ethical standards to ensure the lawful use of the data and the protection of user privacy. The dataset includes the following files:
(1) dataset.zip: We have packaged the underground app samples. The naming of Android app files is based on the SHA256 hash of the file, and the naming of iOS app files is based on the SHA256 hash of the publishing webpage.
(2) code.zip: We have packaged the code used for crawling data from Telegram and for performing data analysis.
(3) message.zip: We have packaged the messages crawled from Telegram. The files are named after the Telegram channels they were collected from.
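The hash-based naming described for dataset.zip can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the description does not say whether the iOS hash is taken over the publishing webpage's URL or its content, so the string variant below simply hashes whatever text it is given.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_of_text(text: str) -> str:
    """Return the hex SHA256 digest of a string (assumed UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hashed_name(path: Path) -> str:
    """Hypothetical helper: '<sha256><original suffix>' for a sample file."""
    return sha256_of_file(path) + path.suffix
```

Comparing a file's recomputed digest with its name is a quick integrity check after download.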
Availability of code and messages
Upon acceptance of our research paper, the dataset containing user messages and the code used for data collection and analysis will only be made available upon request to researchers who agree to adhere to strict ethical principles and maintain the confidentiality of the data.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Denmark telegram data includes 100% accurate contact information. If you want to promote your business or product and build your marketing campaign, you can use this Denmark Telegram data without any hesitation. This database includes contact information for active Telegram users. If you use this TG data, you can receive a return on investment (ROI). List to Data generates new and active leads, which drive corporate success. Denmark Telegram screening data offers the latest and most reliable leads for Telegram phone numbers. The information will be provided as follows: all numbers are open in Telegram, gender, age, Telegram username, last activity date, and industry classification. Denmark tg powder pertains to Telegram data originating from Denmark. This information offers valuable insights into the behavior of Danish consumers, enabling businesses to customize their marketing strategies effectively.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains venting content scraped from the Ethiopia-based Telegram channel Vent Here. It has been pre-processed to remove non-English entries, emojis, and unwanted prefixes, and sentiment and emotion labels have been added to each entry.
This dataset provides valuable insights into emotional expression and sentiment in online communities.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiSocial is a dataset (described in a paper) for benchmarking multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written and approximately the same number are generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.
If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.
Due to the data sources (described below), the dataset may contain harmful, disinformation, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used older data sources (with a lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not have successfully hidden all the sensitive/personal content; thus, use the data cautiously (if you feel affected by such content, report the issues found to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.
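To give a rough idea of what hiding email addresses, usernames, and phone numbers can look like, here is a minimal regex-based sketch. It is not the procedure used to build MultiSocial: the patterns, the placeholder tokens, and the function name are all assumptions chosen for illustration, and simple patterns like these will miss many real-world cases.

```python
import re

# Illustrative patterns only; real anonymization needs broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
USER_RE = re.compile(r"(?<!\w)@\w{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymize(text: str) -> str:
    """Replace emails, @-usernames, and phone-like digit runs with placeholders."""
    # Emails go first so the '@domain' part is not mistaken for a username.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = USER_RE.sub("[USER]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

The substitution order matters: running the username pattern before the email pattern would break email addresses apart.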
The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:
Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.
Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).
Gab data originated in the dataset containing 22M posts from the Gab social network. The authors of the dataset (Zannettou et al., 2018) found that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found that hate speech is much more prevalent there than on Twitter, but less prevalent than on 4chan's Politically Incorrect board.
Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).
WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.
From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for the test split and the remaining up to 1000 for the train split, if available) for each combination of the selected 22 languages (detected using a combination of automated approaches) and platform. This process resulted in 61,592 human-written texts, which were further filtered based on the occurrence of certain characters or on their length, resulting in about 58k human-written texts.
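The per-language, per-platform sampling can be sketched as below. This is a hedged illustration rather than the authors' code: the function name, the record keys, and the fixed seed are assumptions, and language detection is taken as already done.

```python
import random
from collections import defaultdict


def sample_and_split(records, per_group=1300, test_size=300, seed=0):
    """Sample up to per_group records per (language, source) group.

    The first test_size sampled records of each group go to the test
    split; any remaining ones (up to per_group - test_size) go to train.
    """
    rng = random.Random(seed)  # fixed seed makes the sampling repeatable
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["language"], rec["source"])].append(rec)

    sampled = []
    for key in sorted(groups):
        items = groups[key]
        picked = rng.sample(items, min(per_group, len(items)))
        for index, rec in enumerate(picked):
            split = "test" if index < test_size else "train"
            sampled.append({**rec, "split": split})
    return sampled
```

With this ordering, a group holding fewer than 300 texts contributes to the test split only, which is one reading of "if available".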
The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).
The dataset has the following fields:
'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the detected language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / platform of the given text,
'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
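A short sketch of how these fields might be used to select a subset is given below. The rows are made-up placeholders, not samples from the dataset, and the 'source' and 'multi_label' values other than "human" are hypothetical.

```python
from collections import Counter


def select(records, split, language=None, drop_noisy=True):
    """Return records from one split, optionally restricted to one language."""
    selected = []
    for rec in records:
        if rec["split"] != split:
            continue
        if language is not None and rec["language"] != language:
            continue
        if drop_noisy and rec["potential_noise"] == 1:
            continue
        selected.append(rec)
    return selected


def generator_counts(records):
    """Count records per 'multi_label' value (generator name or 'human')."""
    return Counter(rec["multi_label"] for rec in records)


# Made-up rows that only mirror the field layout described above.
rows = [
    {"text": "hello there", "label": 0, "multi_label": "human",
     "split": "train", "language": "en", "length": 2,
     "source": "telegram", "potential_noise": 0},
    {"text": "hi there, friend", "label": 1, "multi_label": "example-llm",
     "split": "train", "language": "en", "length": 3,
     "source": "telegram", "potential_noise": 0},
    {"text": "ahoj", "label": 0, "multi_label": "human",
     "split": "test", "language": "sk", "length": 1,
     "source": "telegram", "potential_noise": 1},
]
```

Dropping rows flagged with 'potential_noise' by default is a design choice of this sketch; pass drop_noisy=False to keep them.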
ToDo Statistics (under construction)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Dataset is described in Charting the Landscape of Online Cryptocurrency Manipulation. IEEE Access (2020), a study that aims to map and assess the extent of cryptocurrency manipulations within and across the online ecosystems of Twitter, Telegram, and Discord. Starting from tweets mentioning cryptocurrencies, we leveraged and followed invite URLs from platform to platform, building the invite-link network, in order to study the invite link diffusion process.
Please refer to the paper below for more details.
Nizzoli, L., Tardelli, S., Avvenuti, M., Cresci, S., Tesconi, M. & Ferrara, E. (2020). Charting the Landscape of Online Cryptocurrency Manipulation. IEEE Access.
This dataset is composed of:
~16M tweet ids shared between March and May 2019, mentioning at least one of the 3,822 cryptocurrencies (cashtags) provided by the CryptoCompare public API;
~13k nodes of the invite-link network, i.e., the information about the Telegram/Discord channels and Twitter users involved in the cryptocurrency discussion (e.g., id, name, audience, invite URL);
~62k edges of the invite-link network, i.e., the information about the flow of invites (e.g., source id, target id, weight).
With such information, one can easily retrieve the content of channels and messages through Twitter, Telegram, and Discord public APIs.
Please refer to the README file for more details about the fields.
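As an illustration of working with the edge list, the sketch below builds a weighted directed graph and sums incoming invite weights per node. The key names source_id, target_id, and weight, as well as the sample identifiers, are assumptions; the actual field names are documented in the README.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each source node to its targets, summing weights of repeated edges."""
    adjacency = defaultdict(lambda: defaultdict(float))
    for edge in edges:
        adjacency[edge["source_id"]][edge["target_id"]] += edge["weight"]
    return adjacency


def weighted_in_degree(edges):
    """Total weight of invites pointing at each target node."""
    totals = defaultdict(float)
    for edge in edges:
        totals[edge["target_id"]] += edge["weight"]
    return dict(totals)


# Made-up edges that only mirror the (source id, target id, weight) layout.
edges = [
    {"source_id": "tw_user_1", "target_id": "tg_channel_a", "weight": 3},
    {"source_id": "tw_user_2", "target_id": "tg_channel_a", "weight": 1},
    {"source_id": "tg_channel_a", "target_id": "dc_server_b", "weight": 2},
]
```

Nodes with a high weighted in-degree are the channels that receive the most invites, which makes them a natural starting point for inspection.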
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1. SQLite format database containing messages from SA COVID VAX CHAT from 10 May 2021 to 22 May 2022. Original user IDs and user names have been replaced with anonymous IDs.
2. Messages in CSV format, filtered to remove spam and annotated with themes, from May 2021 to the end of 2022.
3. Spam messages (messages that appear more than 100 times in the message dataset) in JSON format, with one JSON record per line (i.e., JSON-L format).
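The spam criterion in item 3 (a message that appears more than 100 times) and the one-record-per-line JSON-L layout can be sketched as follows. The "text" key and the function names are assumptions for illustration; the dataset's own schema may differ.

```python
import json
from collections import Counter


def split_spam(messages, threshold=100):
    """Separate messages whose text occurs more than `threshold` times."""
    counts = Counter(msg["text"] for msg in messages)
    spam_texts = {text for text, count in counts.items() if count > threshold}
    kept = [msg for msg in messages if msg["text"] not in spam_texts]
    spam = [msg for msg in messages if msg["text"] in spam_texts]
    return kept, spam


def to_jsonl(records):
    """Serialize records as JSON-L: one JSON object per line."""
    return "\n".join(json.dumps(rec, ensure_ascii=False) for rec in records)


def from_jsonl(payload):
    """Parse JSON-L text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]
```

Exact-match counting is the simplest reading of the criterion; near-duplicate spam would need text normalization first.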
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MentalRiskES is a new dataset about mental disorders in Spanish. The dataset is divided into three subsets, one per mental disorder: anxiety, depression, and eating disorders.
Each dataset contains a set of subjects and their message thread in a Telegram social network chat.
How is it constructed?
Public groups on the Telegram social network were accessed, and conversations were extracted from them. This data was processed, and we kept only the text messages, excluding images, audio, etc. To carry out the annotation, a subset of messages was extracted for each subject. Each message thread was annotated by 10 different annotators through the Prolific platform, using the Doccano annotation platform.
In this way, we associated a user ID with some tags that emerged after averaging the annotators' decisions. The labels available for each set are:
Labels
The values available in Anxiety files are:
The values available in the Depression and Eating Disorders files are:
Preprocessing
The corpus is provided in two versions: the folder 'processed' contains the corpus with emojis in text format, while the folder 'raw' contains the corpus with emojis in their original format.
MentalRiskES evaluation campaign
MentalRiskES is a shared task organized at IberLEF. The aim of this task is to promote the early detection of mental risk disorders in Spanish. In this task we used the MentalRiskES corpus; the partitions used are available in the folder MentalRiskES2023edition.zip provided in the Git repository (https://github.com/sinai-uja/corpusMentalRiskES). To cite the task: Mármol-Romero, A. M., Moreno-Muñoz, A., Plaza-del-Arco, F. M., Molina-González, M. D., Martín-Valdivia, M. T., Ureña-López, L. A., & Montejo-Raéz, A. (2023). Overview of MentalriskES at IberLEF 2023: Early Detection of Mental Disorders Risk in Spanish. Procesamiento del Lenguaje Natural, 71, 329-350.