The number of Twitter users in the United States was forecast to increase continuously between 2024 and 2028 by a total of 4.3 million users (+5.32 percent). After a ninth consecutive year of growth, the Twitter user base is estimated to reach a new peak of 85.08 million users in 2028. Notably, the number of Twitter users has increased continuously over the past years. User figures, shown here for the platform Twitter, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of Twitter users in countries like Canada and Mexico.
https://brightdata.com/license
Utilize our Twitter dataset for diverse applications to enrich business strategies and market insights. Analyzing this dataset provides a comprehensive understanding of social media trends, empowering organizations to refine their communication and marketing strategies. Access the entire dataset or customize a subset to fit your needs. Popular use cases include market research to identify trending topics and hashtags, AI training by reviewing factors such as tweet content, retweets, and user interactions for predictive analytics, and trend forecasting by examining correlations between specific themes and user engagement to uncover emerging social media preferences.
*** Fake News on Twitter ***
These 5 datasets are the results of an empirical study on the spreading process of newly emerged fake news on Twitter. In particular, we have focused on fake news stories that gave rise to a simultaneous truth-spreading campaign against them. The story of each fake news item is as follows:
1- FN1: A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.
2- FN2: Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."
3- FN3: Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.
4- FN4: The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.
5- FN5: In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.
The data collection was done in two stages, each providing a new dataset: 1) obtaining the Dataset of Diffusion (DD), which includes information on fake news/truth tweets and retweets; 2) querying the neighbors of the tweet spreaders, which provides the Dataset of Graph (DG).
DD
DD for each fake news story is an Excel file named FNx_DD, where x is the number of the fake news story. Each row corresponds to one captured tweet/retweet related to the rumor, and each column presents a specific piece of information about that tweet/retweet. From left to right, the columns contain the following information (a minimal loading sketch follows the list):
User ID (user who has posted the current tweet/retweet)
The profile description (bio) of the user who has published the tweet/retweet
The number of tweets/retweets the user had published at the time of posting the current tweet/retweet
Date and time of creation of the account that posted the current tweet/retweet
Language of the tweet/retweet
Number of followers
Number of followings (friends)
Date and time of posting the current tweet/retweet
Number of likes (favorites) the current tweet had acquired before it was crawled
Number of times the current tweet had been retweeted before it was crawled
Whether another tweet is embedded in the current tweet/retweet (for example, when the current tweet is a quote, reply, or retweet)
The source (device/OS) from which the current tweet/retweet was posted
Tweet/Retweet ID
Retweet ID (if the post is a retweet then this feature gives the ID of the tweet that is retweeted by the current post)
Quote ID (if the post is a quote then this feature gives the ID of the tweet that is quoted by the current post)
Reply ID (if the post is a reply, then this feature gives the ID of the tweet that is replied to by the current post)
Frequency of tweet occurrences, i.e., the number of times the current tweet is repeated in the dataset (for example, the number of times a tweet appears in the dataset as a retweet posted by others)
State of the tweet which can be one of the following forms (achieved by an agreement between the annotators):
r : The tweet/retweet is a fake news post
a : The tweet/retweet is a truth post
q : The tweet/retweet questions the fake news, neither confirming nor denying it
n : The tweet/retweet is not related to the fake news (it contains query terms related to the rumor but does not refer to the given fake news)
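Since the column order, not the header, is documented, here is a minimal pandas sketch for loading one DD file; the column names and the file extension are hypothetical placeholders assigned in the left-to-right order described above, so adjust them to the actual files.

```python
import pandas as pd

# Hypothetical column names, in the left-to-right order described above
# (adjust to the actual files, e.g. if they already contain a header row).
DD_COLUMNS = [
    "user_id", "user_description", "user_tweet_count", "account_created_at",
    "lang", "followers_count", "followings_count", "posted_at",
    "like_count", "retweet_count", "embedded_tweet", "source",
    "tweet_id", "retweet_id", "quote_id", "reply_id",
    "frequency", "state",
]

# Load the diffusion data for fake news story 1 (FN1_DD per the naming convention above).
dd = pd.read_excel("FN1_DD.xlsx", header=None, names=DD_COLUMNS)

# Example: distribution of annotation states (r = fake, a = truth, q = question, n = unrelated).
print(dd["state"].value_counts())
```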
DG
DG for each fake news contains two files:
A file in graph format (.graph) which includes the information of graph such as who is linked to whom. (This file named FNx_DG.graph, where x is the number of fake news)
A file in Jsonl format (.jsonl) which includes the real user IDs of nodes in the graph file. (This file named FNx_Labels.jsonl, where x is the number of fake news)
In the graph file, the label of each node is the order in which it entered the graph. For example, if the node with user ID 12345637 is the first node entered into the graph file, then its label in the graph is 0 and its real ID (12345637) is at row number 1 of the jsonl file (row number 0 holds the column labels); the other node IDs follow in the subsequent rows (each row corresponds to one user ID). Therefore, if we want to know, for example, the user ID of node 200 (labeled 200 in the graph), we should look at row number 202 of the jsonl file.
The user IDs of spreaders in DG (those who have a post in DD) are also available in DD, where extra information about them and their tweets/retweets can be found. The other user IDs in DG are the neighbors of these spreaders and might not exist in DD.
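For illustration, a minimal sketch of mapping graph node labels back to real user IDs via the jsonl file is shown below; the file name follows the FNx_Labels.jsonl convention above, and the exact row offset should be checked against the description of the graph file.

```python
import json

# Read the label file: one value per line, with the first line holding the column label.
with open("FN1_Labels.jsonl") as f:
    rows = [json.loads(line) for line in f]

def user_id_of_node(label: int):
    # Node labeled k entered the graph k-th; its real user ID sits k + 1 rows down
    # (row 0 is the column label), per the description above.
    return rows[label + 1]

print(user_id_of_node(0))  # real user ID of the first node entered into the graph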
The XRAY database table contains selected parameters from almost all HEASARC X-ray catalogs that have source positions located to better than a few arcminutes. The XRAY database table was created by copying all of the entries and common parameters from the tables listed in the Component Tables section. The XRAY database table has many entries but relatively few parameters; it provides users with general information about X-ray sources, obtained from a variety of catalogs. XRAY is especially suitable for cone searches and cross-correlations with other databases. Each entry in XRAY has a parameter called 'database_table' which indicates from which original database the entry was copied; users can browse that original table should they wish to examine all of the parameter fields for a particular entry. For some entries in XRAY, some of the parameter fields may be blank (or have zero values); this indicates that the original database table either did not contain that particular parameter or had the same value there. The HEASARC in certain instances has included X-ray sources for which the quoted value for the specified band is an upper limit rather than a detection. The HEASARC recommends that users always check the original tables to get the complete information about the properties of the sources listed in the XRAY master source list. This master catalog is updated periodically whenever one of the component database tables is modified or a new component database table is added. This is a service provided by NASA HEASARC.
The Annual Respondents Database X (ARDx) has been created to allow users of the Annual Respondents Database (ARD) (held at the UK Data Archive under SN 6644) to continue analysis even though the Annual Business Inquiry (ABI), which was used to create the ARD, ceased in 2008. ARDx contains harmonised variables from 1997 to 2020.
ARDx is created from two ONS surveys, the Annual Business Inquiry (ABI; 1998-2008, held at the UK Data Archive under SN 6644) and the Annual Business Survey (ABS; 2009 onwards, held at the UK Data Archive under SN 7451). The ABI has an employment survey (ABI1) and a second survey for financial information (ABI2). ABS only collects financial data, and so is supplemented with employment data from the Business Register and Employment Survey (BRES; 2009 onwards, held at the UK Data Archive under SN 7463).
ARDx consists of six types of files: 'respondent files', which have reported and derived information from survey questionnaire responses, and 'universe files', which contain limited information on all businesses that are within scope of the ABI/ABS. These files are provided at both the Reporting Unit and Local Unit levels. There are also 'register panel' and 'capital stock' files.
Linking to other business studies
These data contain Inter-Departmental Business Register (IDBR) reference numbers. These are anonymous but unique reference numbers assigned to business organisations. Their inclusion allows researchers to combine different business survey sources together. Researchers may consider applying for other business data to assist their research.
For the fifth edition (December 2023), ARDx Version 4.0 for 1997-2020 has been provided, replacing Version 3. Coverage has thus been expanded to include 1997 and 2015-2020.
Note to users
Due to the limited nature of the documentation available for ARDx, users are advised to consult the documentation for the Annual Business Survey (UK Data Archive SN 7451) for detailed information about the data.
For Secure Lab projects applying for access to this study as well as to SN 6697 Business Structure Database and/or SN 7683 Business Structure Database Longitudinal, only postcode-free versions of the data will be made available.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This publication introduces a novel dataset of 403 diplomatic X/Twitter accounts belonging to the Russian government (primarily the Russian Foreign Ministry) and accompanying metadata. These accounts have become a known vector in the spread of false and misleading information around the Russian invasion of Ukraine, however, given new restrictions on the accessibility of the X/Twitter API and visibility of users' following lists, the vast majority of these accounts are no longer easily discoverable by researchers. The primary aim behind the publication of this dataset is to provide a comprehensive resource for further analysis of this disinformation vector.
This record provides raw and post-processed data used in the associated paper "A technique for in-situ displacement and strain measurement with laboratory-scale X-ray Computed Tomography." The codes used are provided in a separate software publication "SerialTrackXR", also referenced. The data consist of 3D X-ray computed tomography (X-ray CT) scans, projection images, and load/displacement data of two additively manufactured tensile test coupons made from IN718 with different processing conditions. The 3D images were collected in-situ at progressively increasing levels of applied displacement. The projection images track this displacement both total and as surface maps. Load/displacement data from the load frame used to apply displacement are also provided. A displacement tracking validation dataset, consisting of known rigid body displacements imposed on a nominally un-deformed third test specimen is also included.The X-ray CT data are rather large, each .tiff stack being about 2 GB; the displacement and strain map files are also >1 GB. Other data are relatively smaller. The dataset consists of 170 files, totaling 69.9 GB (as noted above, much of this space is in 3D images and raw data for the 3D images - most users will not need to interact with those raw data).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Medical Imaging (CT-Xray) Colorization New Dataset
This dataset provides a collection of medical imaging data, including both CT (Computed Tomography) and X-ray images, with an added focus on colorization techniques. The goal of this dataset is to facilitate the enhancement of diagnostic processes by applying various colorization techniques to grayscale medical images, allowing researchers and machine learning models to explore the effects of color in radiology.
Key Features:
CT and X-ray Images: Contains both CT scans and X-ray images, widely used in medical diagnostics.
Colorized Medical Images: Each image has been colorized using advanced methods to improve visual interpretation and analysis, including details that might not be immediately obvious in grayscale images.
New Dataset: This dataset is newly created to provide high-quality colorized medical imaging, ideal for training AI models in medical image analysis and enhancing diagnostic accuracy.
Methods Used for Colorization:
Basic Color Map Application: Applying standard color maps to highlight structures in CT and X-ray images.
Adaptive Histogram Equalization (CLAHE): Adaptive enhancement to improve contrast and highlight important features, especially in medical contexts.
Contrast Stretching: Adjusting image intensity to enhance visual details and improve diagnostic quality.
Gaussian Blur: Applied to reduce noise, offering a smoother image for better processing.
Edge Detection (Canny): Detecting edges and contours, useful for identifying specific features in medical scans.
Random Color Palettes: Using randomized color schemes for unique visual representations.
Gamma Correction: Adjusting image brightness to reveal more information hidden in the shadows.
LUT (Lookup Table) Color Mapping: Applying predefined color lookups for visually appealing representations.
Alpha Blending: Blending colorized regions based on certain thresholds to highlight structures or anomalies.
3D Rendering: For creating 3D-like visualizations from 2D scans.
Heatmap Visualization: Highlighting areas of interest, such as anomalies or tumors, using heatmap color gradients.
Interactive Segmentation: Interactive visualizations that help in segmenting regions of interest in medical images.
A minimal code sketch of the basic color-map and CLAHE steps is given after the applications list below.
Applications
This dataset has numerous applications, particularly in the field of medical image analysis, AI development, and diagnostic improvement. Some of the major applications include:
Medical Diagnostics Enhancement:
Colorization can aid radiologists in interpreting CT and X-ray images by making abnormalities more visible. Helps in visualizing tumors, fractures, or other anomalies, especially in cases where grayscale images are hard to interpret.
AI and Machine Learning for Healthcare:
Used for training deep learning models in image segmentation, detection, and classification of diseases (e.g., cancer detection). AI models can be trained on these colorized images to improve accuracy in diagnostic tools, leading to early disease detection.
Medical Image Enhancement:
Enables improved contrast, better detail visibility, and highlighting of specific anatomical regions using color. Colorization may improve the accuracy of radiological assessments by allowing professionals to more easily spot abnormalities and changes over time.
Data Augmentation for Model Training:
The colorized images can serve as an additional data source for training AI models, increasing model robustness through synthetic data generation. Various colorization methods (like heatmaps and random palettes) can be used to augment image variations, improving model performance under different conditions.
Visualizing Anomalies for Anomaly Detection:
Heatmap visualization helps detect subtle and hidden anomalies by coloring the areas of interest with intensity, enabling faster identification of potential issues. Edge detection and segmentation techniques enhance the ability to detect the edges and boundaries of tumors, fractures, and other critical features.
3D Image Rendering for Detailed Analysis:
3D rend...
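As a rough illustration of the basic color-map and CLAHE steps listed under "Methods Used for Colorization", here is a minimal OpenCV sketch; the file names are placeholders and not part of the dataset's documented layout.

```python
import cv2

# Load a grayscale scan (placeholder path; substitute any CT/X-ray image from the dataset).
gray = cv2.imread("sample_xray.png", cv2.IMREAD_GRAYSCALE)

# Adaptive Histogram Equalization (CLAHE) to boost local contrast before colorization.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)

# Basic color map application: map grayscale intensities to a predefined LUT-style palette.
colorized = cv2.applyColorMap(enhanced, cv2.COLORMAP_JET)

cv2.imwrite("sample_xray_colorized.png", colorized)
```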
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
UPDATE: Due to new Twitter API conditions introduced by Elon Musk, it is no longer free to use the Twitter (X) API, and the pricing is $100/month on the hobby plan. As a result, my automated ETL notebook stopped adding new tweets to this dataset on May 13th, 2023.
This dataset was updated every day with 1,000 new tweets per day containing any of the words "ChatGPT", "GPT3", or "GPT4", starting from the 3rd of April 2023. Each day's tweets are uploaded 24-72 hours later, so the counters for likes, retweets, replies, and impressions have enough time to become relevant. Tweets are in any language and are selected randomly from all hours of the day. Some basic filters are applied to try to discard sensitive tweets and spam.
This dataset can be used for many different applications, including Data Analysis and Visualization as well as NLP Sentiment Analysis techniques and more.
Consider upvoting this Dataset and the ETL scheduled Notebook providing new data everyday into it if you found them interesting, thanks! š¤
Columns Description (a loading sketch follows the column list):
tweet_id: Integer. Unique identifier for each tweet. Older tweets have smaller IDs.
tweet_created: Timestamp. Time of the tweet's creation.
tweet_extracted: Timestamp. The UTC time when the ETL pipeline pulled the tweet and its metadata (likes count, retweets count, etc).
text: String. The raw payload text from the tweet.
lang: String. Short name for the Tweet text's language.
user_id: Integer. Twitter's unique user id.
user_name: String. The author's public name on Twitter.
user_username: String. The author's Twitter account username (@example)
user_location: String. The author's public location.
user_description: String. The author's public profile's bio.
user_created: Timestamp. Timestamp of user's Twitter account creation.
user_followers_count: Integer. The number of followers of the author's account at the moment of the tweet extraction
user_following_count: Integer. The number of followed accounts from the author's account at the moment of the Tweet extraction
user_tweet_count: Integer. The number of Tweets that the author has published at the moment of the Tweet extraction.
user_verified: Boolean. True if the user is verified (blue mark).
source: String. The device/app used to publish the tweet (apparently not working; all values are NaN so far).
retweet_count: Integer. Number of retweets of the Tweet at the moment of the Tweet extraction.
like_count: Integer. Number of Likes to the Tweet at the moment of the Tweet extraction.
reply_count: Integer. Number of reply messages to the Tweet.
impression_count: Integer. Number of times the Tweet has been seen at the moment of the Tweet extraction.
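As referenced above, a minimal pandas sketch for loading the dataset and summarising engagement is given below; the CSV file name is a placeholder, so use the actual daily files shipped with the dataset.

```python
import pandas as pd

# Load one daily tweets file (placeholder name) and parse the documented timestamp columns.
df = pd.read_csv("chatgpt_daily_tweets.csv", parse_dates=["tweet_created", "tweet_extracted"])

# Top languages by tweet volume.
print(df["lang"].value_counts().head())

# Average engagement per tweet, using the documented count columns.
print(df[["like_count", "retweet_count", "reply_count", "impression_count"]].mean())
```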
More info:
Tweets API info definition: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet
Users API info definition: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user
CC0
Original Data Source: https://www.kaggle.com/datasets/edomingo/chatgpt-1000-daily-tweets
https://creativecommons.org/publicdomain/zero/1.0/
Unless you've been living under a rock, in just these past few weeks a startup company based in China, DeepSeek, has released models that have taken the AI-tech giants (OpenAI, Meta, Anthropic, Alibaba, etc.) by surprise and could potentially disrupt and shake the foundations of the AI industry. Here's a summary:
What is DeepSeek and why is it disrupting the AI sector?
Chinese startup DeepSeek's launch of its latest AI models, which it says are on a par or better than industry-leading models in the United States at a fraction of the cost, is threatening to upset the technology world order. The company has attracted attention in global AI circles after writing in a paper last month that the training of DeepSeek-V3 required less than $6 million worth of computing power from Nvidia H800 chips.
And here's one of the many YouTube news videos discussing it
https://www.youtube.com/watch?v=WEBiebbeNCA
Summary by MorganB in a series of Tweets
This dataset contains tweets and reactions about DeepSeek and the models they released, as well as other closely related keywords, such as NVIDIA, OPENAI, ANTHROPIC, META, LLAMA, etc.
There may be some tweets included that are totally unrelated to AI and DeepSeek, as they contain some of the keywords I used.
I signed up for a trial with https://twitterapi.io/ , check it out!
Generated with Bing image generator
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a contextual music dataset labeled with the listening situation associated with each stream. Each stream is composed of the user, track, and device data labelled with a situation. The dataset was collected from Deezer for the period of August 2019 from France and Brazil. The dataset is composed of 3 subsets of situations corresponding to 4, 8, and 12 different situations. The situations are extracted based on keyword matching with the associated playlist title in the Deezer catalog. The full set of situational tags is: "work, gym, party, sleep | morning, run, night, dance | car, train, relax, club".
Each instance contains the track/user/device triplet and a situational tag indicating that this user listens to the track in the associated situation, with the corresponding data received from the device. The device data contain: "linear-time, linear-day, circular-time X, circular-time Y, circular-day X, circular-day Y, device-type, network-type". The users are represented as embeddings based on their listening history, computed through matrix factorization of the user/track matrix. Additionally, the users are also represented by their demographic data: "age, country, gender".
The creation of the dataset and our experimental results are described in the paper: Karim M. Ibrahim, Elena V. Epure, Geoffroy Peeters, and Gaël Richard. "Audio Autotagging as Proxy for Contextual Music Recommendation" [Under Revision]. The source code of the paper is available here: https://github.com/KarimMibrahim/Situational_Session_Generator.git
The dataset is composed of the media_id, which is the ID of the track in the Deezer catalog. The 30-second track previews used to train the model in the paper can be accessed through the Deezer API: https://developers.deezer.com/api. Each user is represented with an anonymized user_id, which is associated with the user embedding available in the user_embeddings.npy file. Note: the index of the embeddings in the user_embeddings array corresponds to the user_id, i.e., user_id = 100 has its embeddings at user_embeddings[100].
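A minimal sketch of the user_id-to-embedding lookup described above:

```python
import numpy as np

# Load the user embeddings; per the note above, the row index corresponds to the user_id.
user_embeddings = np.load("user_embeddings.npy")

user_id = 100  # example anonymized user_id from the dataset
embedding = user_embeddings[user_id]
print(embedding.shape)
```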
Finally, the dataset also contains the splits used in our experiments. Our splits were conditioned by one of three conditions: ColdTrack (no overlap of tracks between the splits), ColdUser (no overlap of users between the splits), and WarmCase (overlaps allowed). Each condition is split into 4 subsets for cross-validation marked with a "fold" number in each condition.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1. Introduction
The file "gen_dd_channel.zip" is a package of a wideband multiple-input multiple-output (MIMO) stored radio channel model at 140 GHz in indoor hall, outdoor suburban, residential and urban scenarios. The package consists of 1) measured wideband double-directional multipath data sets estimated from radio channel sounding and processed through measurement-based ray-launching and 2) MATLAB code sets that allow users to generate wideband MIMO radio channels with various antenna array types, e.g., uniform planar and circular arrays at the link ends.
2. What does this package do?
Outputs of the channel model
The MATLAB file "ChannelGeneratorDD_hexax.m" gives the following variables, among others. The .m file also gives optional figures illustrating antennas and radio channel responses.
Variables | Descriptions
CIR | MIMO channel impulse responses
CFR | MIMO channel frequency responses
Inputs to the channel model
In order for the MATLAB file "ChannelGeneratorDD_hexax.m" to run properly, the following inputs are required.
Directory | Descriptions
data_030123_double_directional_paths | Double-directional multipath data, measured and complemented by a ray-launching tool, for various cellular sites.
User's parameters
When using "ChannelGeneratorDD_hexax.m", the following choices are available.
Features | Choices
Channel model types for transfer function generation |
Antenna / beam shapes |
List of files in the dataset
MATLAB codes that implement the channel model
The MATLAB files consist of the following files.
File and directory names | Descriptions
readme_100223.txt | Readme file; please read it before using the files.
ChannelGeneratorDD_hexax.m | Main code to run; integrates antenna arrays and double-directional path data to derive MIMO radio channels. No need to see/edit other files.
gen_pathDD.m, randl.m, randLoc.m | Sub-routines used in ChannelGeneratorDD_hexax.m; no modifications needed.
Hexa-X channel generator DD_presentation.pdf | User manual of ChannelGeneratorDD_hexax.m.
Measured multipath data
The directory "data_030123_double_directional_paths" in the package contains the following files.
Filenames | Descriptions
readme_100223.txt | Readme file; please read it before using the files.
RTdata_[scenario]_[date].mat | Double-directional multipath parameters at 140 GHz in the specified scenario, estimated from radio channel sounding and ray-tracing.
description_of_data_dd_[scenario].pdf | Explains data formats, the measurement site and sample results.
References
Details of the data set are available in the following two documents:
The stored channel models
A. Nimr (ed.), "Hexa-X Deliverable D2.3 Radio models and enabling techniques towards ultra-high data rate links and capacity in 6G," April 2023, available: https://hexa-x.eu/deliverables/
@misc{Hexa-XD23,
author = {{A. Nimr (ed.)}},
title = {{Hexa-X Deliverable D2.3 Radio models and enabling techniques towards ultra-high data rate links and capacity in 6G}},
year = {2023},
month = {Apr.},
howpublished = {https://hexa-x.eu/deliverables/},
}
Derivation of the data, i.e., radio channel sounding and measurement-based ray-launching
M. F. De Guzman and K. Haneda, "Analysis of wave-interacting objects in indoor and outdoor environments at 142 GHz," IEEE Transactions on Antennas and Propagation, vol. 71, no. 12, pp. 9838-9848, Dec. 2023, doi: 10.1109/TAP.2023.3318861
@ARTICLE{DeGuzman23_TAP,
author={De Guzman, Mar Francis and Haneda, Katsuyuki},
journal={IEEE Transactions on Antennas and Propagation},
title={Analysis of Wave-Interacting Objects in Indoor and Outdoor Environments at 142 {GHz}},
year={2023},
volume={71},
number={12},
pages={9838-9848},
}
Finally, the code "randl.m" is from the following MATLAB Central File Exchange entry.
Hristo Zhivomirov (2023). Generation of Random Numbers with Laplace Distribution (https://www.mathworks.com/matlabcentral/fileexchange/53397-generation-of-random-numbers-with-laplace-distribution), MATLAB Central File Exchange. Retrieved February 15, 2023.
Data usage terms
Any usage of the data is subject to consent to the following conditions:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The database includes five datasets. Three datasets were extracted from a dataset published by X (the Twitter Transparency website) that includes tweets from malicious accounts trying to manipulate public opinion in the Kingdom of Saudi Arabia. We focused on sports and banking topics when extracting data. Although the propagandist tweets were published by malicious accounts, as X (Twitter) stated, the individual tweets were not classified as propaganda or not. Propagandists usually mix propaganda and non-propaganda tweets in an attempt to hide their identities. Therefore, it was necessary to classify their tweets as propaganda or not, based on the propaganda technique used. Since the datasets are very large, we annotated a sample of 2,100 tweets. As for the reliable account data, we were keen to identify reliable Saudi sources; their tweets that discussed the same topics discussed by the malicious users were then crawled. There are two datasets for reliable users, covering sports and banking topics. The dataset is made up of 16,355,558 tweets from propagandist users and 156,524 tweets from reliable users for the period January 1, 2019, to December 31, 2020.
This is a synthetic dataset that can be used by users who are interested in benchmarking methods of explainable artificial intelligence (XAI) for geoscientific applications. The dataset is specifically inspired by a climate forecasting setting (seasonal timescales) where the task is to predict regional climate variability given global climate information lagged in time. The dataset consists of a synthetic input X (a series of 2D arrays of random fields drawn from a multivariate normal distribution) and a synthetic output Y (a scalar series) generated by using a nonlinear function F: R^d -> R.
The synthetic input aims to represent temporally independent realizations of anomalous global fields of sea surface temperature, the synthetic output series represents some type of regional climate variability that is of interest (temperature, precipitation totals, etc.) and the function F is a simplification of the climate system.
Since the nonlinear function F that is used to generate the output given the input is known, we also derive and provide the attribution of each output value to the corresponding input features. Using this synthetic dataset, users can train any AI model to predict Y given X and then implement XAI methods to interpret it. Based on the "ground truth" attribution of F, the user can assess the faithfulness of any XAI method.
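As a minimal sketch of this workflow (placeholder file names; the dataset's actual NetCDF variables will differ), one can fit any regression model on the flattened input, derive an attribution estimate, and compare it against the provided ground truth:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder arrays: X (n_samples, n_lat, n_lon) synthetic inputs, Y (n_samples,) outputs,
# A_true (n_samples, n_lat, n_lon) ground-truth attributions derived from F.
X = np.load("X_synthetic.npy")
Y = np.load("Y_synthetic.npy")
A_true = np.load("attribution_true.npy")

n, h, w = X.shape
model = Ridge(alpha=1.0).fit(X.reshape(n, -1), Y)

# For a linear surrogate, a simple per-sample attribution is coefficient * input value.
A_est = model.coef_.reshape(1, h, w) * X

# Faithfulness check: correlation between estimated and ground-truth attribution maps.
corr = np.corrcoef(A_est.reshape(n, -1).ravel(), A_true.reshape(n, -1).ravel())[0, 1]
print(f"attribution correlation: {corr:.3f}")
```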
NOTE: the spatial configuration of the observations in the NetCDF database file conforms to the planetocentric coordinate system (89.5N - 89.5S, 0.5E - 359.5E), where longitude is measured positive heading east from the prime meridian.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
[Image: Peertube "follow" graph]
Above is the Peertube "follow" graph. The colours correspond to the language of the server (purple: unknown, green: French, blue: English, black: German, orange: Italian, grey: others).
Decentralized machine learning, where each client keeps its own data locally and uses its own computational resources to collaboratively train a model by exchanging peer-to-peer messages, is increasingly popular, as it enables better scalability and control over the data. A major challenge in this setting is that learning dynamics depend on the topology of the communication graph, which motivates the use of real graph datasets for benchmarking decentralized algorithms. Unfortunately, existing graph datasets are largely limited to for-profit social networks crawled at a fixed point in time and often collected at the user scale, where links are heavily influenced by the platform and its recommendation algorithms. The Fediverse, which includes several free and open-source decentralized social media platforms such as Mastodon, Misskey, and Lemmy, offers an interesting real-world alternative. We introduce Fedivertex, a new dataset covering seven social networks from the Fediverse, crawled on a weekly basis.
We refer to our paper for a detailed presentation of the graphs: [SOON]
We implemented a simple Python API to interact easily with the dataset: https://pypi.org/project/fedivertex/
pip3 install fedivertex
This package automatically downloads the dataset and generates NetworkX graphs.
from fedivertex import GraphLoader

loader = GraphLoader()  # instantiate the loader before listing or loading graphs

# List available graph types for a given software, here e.g. federation and active_user
loader.list_graph_types("mastodon")

# G contains the NetworkX graph of the giant component of the active users graph at the 1st date of collection
G = loader.get_graph(software="mastodon", graph_type="active_user", index=0, only_largest_component=True)
We also provide a Kaggle notebook demonstrating simple operations using this library: https://www.kaggle.com/code/marcdamie/exploratory-graph-data-analysis-of-fedivertex
The dataset contains graphs crawled on a daily basis on 7 social networks from the Fediverse. Each graph quantifies/characterizes the interaction differently depending on the information provided by the public API of these networks.
We briefly present the graphs below (NB: the term "instance" refers to servers on the Fediverse):
These graphs provide diverse perspectives on the Fediverse, as they capture more or less subtle phenomena. For example, "federation" graphs are dense, while "intra-instance" graphs are sparse. We have performed a detailed exploratory data analysis in this notebook.
Our CSV files are formatted so that they can be directly imported into Gephi for graph visualization. Find below an example Gephi visualization of the Misskey "active users" graph (without the misskey.io node). The colours correspond to the language of the server (purple: unknown, red: Japanese, brown: Korean, blue: English, yellow: Chinese).
 dataset have been discontinued as of Dec. 31, 2019, and users are strongly encouraged to shift to the successor IMERG datasets (doi: 10.5067/GPM/IMERG/3B-HH-E/06, 10.5067/GPM/IMERG/3B-HH-L/06).
These data were output from the TRMM Multi-satellite Precipitation Analysis (TMPA), the Near Real-Time (RT) processing stream. The latency was about seven hours from the observation time, although processing issues may delay or prevent this schedule. Users should be mindful that the price for the short latency of these data is the reduced quality as compared to the research quality product 3B42. This particular dataset is an intermediate variable (VAR) rainrate IR estimate.
Data files start with a header consisting of a 2880-byte record containing ASCII characters. The header line makes the file nearly self-documenting, in particular spelling out the variable and version names, and giving the units of the variables.
Immediately after the header follow 3 data fields, "precip", "error", and "# pixels", with byte counts of 1382400, 1382400, and 691200, respectively. The first two are 2-byte integers, and the third is 1-byte. All fields are 1440x480 grid boxes (0-360E, 60N-S). The first grid box center is at (0.125E, 59.875N). The grid increments most rapidly to the east. Grid boxes without valid data are filled with the (2-byte integer) "missing" value -31999. Valid estimates are only provided in the band 50N-S. These binary data sets are in IEEE "big-endian" byte order.
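A minimal sketch of reading one of these binary granules, based on the layout just described (placeholder file name; the 1-byte "# pixels" field is assumed unsigned):

```python
import numpy as np

# Grid dimensions per the description above: 1440 longitudes x 480 latitudes,
# with longitude varying fastest (east increments most rapidly).
NX, NY = 1440, 480

with open("3B4XRT_granule.bin", "rb") as f:  # placeholder file name
    header = f.read(2880).decode("ascii", errors="replace")  # 2880-byte ASCII header
    precip = np.frombuffer(f.read(NX * NY * 2), dtype=">i2").reshape(NY, NX)
    error = np.frombuffer(f.read(NX * NY * 2), dtype=">i2").reshape(NY, NX)
    n_pixels = np.frombuffer(f.read(NX * NY), dtype=np.uint8).reshape(NY, NX)

# Mask grid boxes without valid data (missing value -31999), per the description above.
precip_masked = np.ma.masked_equal(precip, -31999)
print(header[:80])
print(precip_masked.max())
```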
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 1 row and is filtered where the books is The Contax RTS & Yashica SLR book : for Contax RTS, Yashica FR, DR1, FR11, FX-1, FX2, TL-Electro, Electro-X, Electro AX & TL Super users. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
2022 Brazilian Presidential Election
This dataset contains 13,910,048 tweets from 1,346,340 users, extracted using 157 search terms over 56 different days between January 1st and June 21st, 2023.
All tweets in this dataset are in Brazilian Portuguese.
Data Usage
The dataset contains textual data from tweets, making it suitable for various NLP analyses, such as sentiment analysis, bias or stance detection, and toxic language detection. Additionally, users and tweets can be linked to create social graphs, enabling Social Network Analysis (SNA) to study polarization, communities, and other social dynamics.
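As one illustration of the SNA use mentioned above, a minimal sketch of building a retweet graph is shown below; the file and column names are hypothetical and must be adapted to the released schema.

```python
import pandas as pd
import networkx as nx

# Hypothetical file and column names; adapt to the actual tweet files in this dataset.
tweets = pd.read_csv("tweets.csv")

G = nx.DiGraph()
for _, row in tweets.dropna(subset=["retweeted_user_id"]).iterrows():
    # Edge from the retweeting user to the original author.
    G.add_edge(row["user_id"], row["retweeted_user_id"])

# Example SNA measure: the most retweeted accounts by in-degree.
top = sorted(G.in_degree(), key=lambda x: x[1], reverse=True)[:10]
print(top)
```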
Extraction Method
This data set was extracted using Twitter's (now X) official API, when Academic Research API access was still available, following the pipeline:
Further Information
For more details, visit:
DOI: 10.5281/zenodo.14834434