Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time spent on Threads has dropped sharply since launch
Here is a breakdown of how quickly people signed up for Threads...
In this post, we will break down all of the latest Threads statistics and give you some insight into what the future looks like with Twitter vs Threads.
The Evolution of the Manosphere Across the Web
We make available data related to subreddits and standalone forums from the manosphere.
We also make available Perspective API annotations for all posts.
You can find the code on GitHub.
Please cite this paper if you use this data:
@inproceedings{ribeiroevolution2021,
  title={The Evolution of the Manosphere Across the Web},
  author={Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas},
  booktitle={{Proceedings of the 15th International AAAI Conference on Web and Social Media (ICWSM'21)}},
  year={2021}
}
We make available data for the forums and for the relevant subreddits (56 of them, as described in subreddit_descriptions.csv). The Reddit data is available with one line per post in /ndjson/reddit.ndjson. A sample post looks like:
{ "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.
Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.
Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.
No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.
I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.
Tallcels are fakecels and they all can (and should) suck my cock.
If I were 17cm taller my life would be a heaven and I would be the happiest man alive.
Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }
Below we describe the .sqlite and .ndjson files that contain the data from the following forums.
(avfm) --- https://d2ec906f9aea-003845.vbulletin.net
(incels) --- https://incels.co/
(love_shy) --- http://love-shy.com/lsbb/
(redpilltalk) --- https://redpilltalk.com/
(mgtow) --- https://www.mgtow.com/forums/
(rooshv) --- https://www.rooshvforum.com/
(pua_forum) --- https://www.pick-up-artist-forum.com/
(the_attraction) --- http://www.theattractionforums.com/
The files are in the folders /sqlite/ and /ndjson/.
2.1 .sqlite
All the tables in the .sqlite datasets follow a very simple {key: value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a Python dictionary or a list. Each file contains three tables:
idx each key is the relative address to a thread and maps to a post. Each post is represented by a dict:
"type": (list) in some forums you can add a descriptor such as
[RageFuel] to each topic, and you may also have special
types of posts, like stickied/pool/locked posts.
"title": (str) title of the thread;
"link": (str) link to the thread;
"author_topic": (str) username that created the thread;
"replies": (int) number of replies, may differ from number of
posts due to difference in crawling date;
"views": (int) number of views;
"subforum": (str) name of the subforum;
"collected": (bool) indicates if raw posts have been collected;
"crawled_idx_at": (str) datetime of the collection.
processed_posts each key is the relative address to a thread and maps to a list with posts (in order). Each post is represented by a dict:
"author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.
raw_posts each key is the relative address to a thread and maps to a list with unprocessed posts (in order). Each post is represented by a dict:
"post_raw": (binary) raw html binary; "crawled_at": (str) datetime of the collection.
2.2 .ndjson
Each line consists of a json object representing a different comment with the following fields:
"author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.
We also ran each forum post and Reddit post through the Perspective API; the output files are located in the /perspective/ folder and are compressed with gzip. An example output:
{ "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }
A nice way to read some of the files of the dataset is using SqliteDict, for example:
from sqlitedict import SqliteDict

processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

for key, posts in processed_posts.items():
    for post in posts:
        # here you could do something with each post in the dataset
        pass
Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to reddit, and not to the forums! They are:
channel_dict.sqlite a sqlite where each key corresponds to a subreddit and each value is a list of dictionaries of the users who posted on it, along with timestamps.
author_dict.sqlite a sqlite where each key corresponds to an author and each value is a list of dictionaries of the subreddits they posted on, along with timestamps.
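The exact dictionary fields inside these helper files are not spelled out here, so the "subreddit" and "timestamp" keys below are assumptions; the sketch only illustrates the kind of first-appearance lookup a migration analysis needs, with a plain dict entry standing in for one author's value in author_dict.sqlite:

```python
# Sketch: earliest timestamp per subreddit for one author's entries.
# The "subreddit"/"timestamp" field names are assumed, not documented above.

def first_seen(author_entries):
    """Map each subreddit to the earliest timestamp at which the author posted there."""
    seen = {}
    for entry in author_entries:
        sub, ts = entry["subreddit"], entry["timestamp"]
        if sub not in seen or ts < seen[sub]:
            seen[sub] = ts
    return seen

entries = [
    {"subreddit": "TheRedPill", "timestamp": 1500000000},
    {"subreddit": "Braincels", "timestamp": 1546300852},
    {"subreddit": "TheRedPill", "timestamp": 1490000000},
]
print(first_seen(entries))
```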
These are used in the paper for the migration analyses.
Although we did our best to clean the data and be consistent across forums, this was not always possible. In the following subsections we discuss the particularities of each forum, point out directions for improving the parsing that were not pursued, and give some examples of how things work in each forum.
6.1 incels
Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.
types: for the incels forum the special types associated with each thread in the idx table are “Sticky”, “Pool”, “Closed”, and the custom types added by users, such as [LifeFuel]. The latter are all in brackets. You can see some examples of these on the example thread page.
quotes: the quote markup in this forum was clean, and thus all quotations are deterministic.
6.2 LoveShy
Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.
types: no types were parsed. There are some in the forum, but nothing significant.
quotes: quotes were obtained from an exact text + author match, or an author match plus a Jaccard similarity match.
### Institute for the Study of Contemporary Antisemitism (ISCA) at Indiana University Dataset on bias against Asians, Blacks, Jews, Latines, and Muslims
The ISCA project compiled this dataset using an annotation portal, which was used to label tweets as either biased or non-biased, among other labels. Note that the annotation was done on live data, including images and context, such as threads. The original data comes from annotationportal.com. They include representative samples of live tweets from the years 2020 and 2021 with the keywords "Asians, Blacks, Jews, Latinos, and Muslims".
A random sample of 600 tweets per year was drawn for each of the keywords. This includes retweets. Due to a sampling error, the sample for the year 2021 for the keyword "Jews" has only 453 tweets from 2021 and 147 from the first eight months of 2022 and it includes some tweets from the query with the keyword "Israel." The tweets were divided into six samples of 100 tweets, which were then annotated by three to seven students in the class "Researching White Supremacism and Antisemitism on Social Media" taught by Gunther Jikeli, Elisha S. Breton, and Seth Moller at Indiana University in the fall of 2022, see this report. Annotators used a scale from 1 to 5 (confident not biased, probably not biased, don't know, probably biased, confident biased). The definitions of bias against each minority group used for annotation are also included in the report.
If a tweet called out or denounced bias against the minority in question, it was labeled as "calling out bias."
The labels of whether a tweet is biased or calls out bias are based on a 75% majority vote. We considered "probably biased" and "confident biased" as biased and "confident not biased," "probably not biased," and "don't know" as not biased.
The types of stereotypes vary widely across the different categories of prejudice. While about a third of all biased tweets were classified as "hate" against the minority, the stereotypes in the tweets often matched common stereotypes about the minority. Asians were blamed for the Covid pandemic. Blacks were seen as inferior and associated with crime. Jews were seen as powerful and held collectively responsible for the actions of the State of Israel. Some tweets denied the Holocaust. Hispanics/Latines were portrayed as being in the country illegally and as "invaders," in addition to stereotypical accusations of being lazy, stupid, or having too many children. Muslims, on the other hand, were often collectively blamed for terrorism and violence, though often in conversations about Muslims in India.
# Content:
This dataset contains 5880 tweets that cover a wide range of topics common in conversations about Asians, Blacks, Jews, Latines, and Muslims. 357 tweets (6.1 %) are labeled as biased and 5523 (93.9 %) are labeled as not biased. 1365 tweets (23.2 %) are labeled as calling out or denouncing bias.
1180 out of 5880 tweets (20.1 %) contain the keyword "Asians," 590 were posted in 2020 and 590 in 2021. 39 tweets (3.3 %) are biased against Asian people. 370 tweets (31.4 %) call out bias against Asians.
1160 out of 5880 tweets (19.7%) contain the keyword "Blacks," 578 were posted in 2020 and 582 in 2021. 101 tweets (8.7 %) are biased against Black people. 334 tweets (28.8 %) call out bias against Blacks.
1189 out of 5880 tweets (20.2 %) contain the keyword "Jews," 592 were posted in 2020, 451 in 2021, and, as mentioned above, 146 in 2022. 83 tweets (7 %) are biased against Jewish people. 220 tweets (18.5 %) call out bias against Jews.
1169 out of 5880 tweets (19.9 %) contain the keyword "Latinos," 584 were posted in 2020 and 585 in 2021. 29 tweets (2.5 %) are biased against Latines. 181 tweets (15.5 %) call out bias against Latines.
1182 out of 5880 tweets (20.1 %) contain the keyword "Muslims," 593 were posted in 2020 and 589 in 2021. 105 tweets (8.9 %) are biased against Muslims. 260 tweets (22 %) call out bias against Muslims.
# File Description:
The dataset is provided in a csv file format, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns:
'TweetID': Represents the tweet ID.
'Username': Represents the username who published the tweet (if it is a retweet, this is the user who retweeted the original tweet).
'Text': Represents the full text of the tweet (not pre-processed).
'CreateDate': Represents the date the tweet was created.
'Biased': Represents the label assigned by our annotators: 1 if the tweet is biased, 0 if not.
'Calling_Out': Represents the label assigned by our annotators: 1 if the tweet calls out bias against minority groups, 0 if not.
'Keyword': Represents the keyword that was used in the query. The keyword can be in the text, including mentioned names, or the username.
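As a sketch of working with this file layout (the two data rows and the file handle are synthetic; the column names follow the description above), the biased-tweet rate per keyword can be computed with the stdlib csv module:

```python
import csv
import io

# io.StringIO with synthetic rows stands in for open("isca_tweets.csv");
# the column names follow the file description above.
data = io.StringIO(
    "TweetID,Username,Text,CreateDate,Biased,Calling_Out,Keyword\n"
    "1,u1,example text,2020-05-01,1,0,Asians\n"
    "2,u2,example text,2021-06-01,0,1,Asians\n"
)

rows = list(csv.DictReader(data))
biased = sum(int(r["Biased"]) for r in rows if r["Keyword"] == "Asians")
total = sum(1 for r in rows if r["Keyword"] == "Asians")
print(biased, "of", total, "tweets biased")  # 1 of 2 tweets biased
```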
# Licences
Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)
# Acknowledgements
We are grateful for the technical collaboration with Indiana University's Observatory on Social Media (OSoMe). We thank all class participants for the annotations and contributions, including Kate Baba, Eleni Ballis, Garrett Banuelos, Savannah Benjamin, Luke Bianco, Zoe Bogan, Elisha S. Breton, Aidan Calderaro, Anaye Caldron, Olivia Cozzi, Daj Crisler, Jenna Eidson, Ella Fanning, Victoria Ford, Jess Gruettner, Ronan Hancock, Isabel Hawes, Brennan Hensler, Kyra Horton, Maxwell Idczak, Sanjana Iyer, Jacob Joffe, Katie Johnson, Allison Jones, Kassidy Keltner, Sophia Knoll, Jillian Kolesky, Emily Lowrey, Rachael Morara, Benjamin Nadolne, Rachel Neglia, Seungmin Oh, Kirsten Pecsenye, Sophia Perkovich, Joey Philpott, Katelin Ray, Kaleb Samuels, Chloe Sherman, Rachel Weber, Molly Winkeljohn, Ally Wolfgang, Rowan Wolke, Michael Wong, Jane Woods, Kaleb Woodworth, and Aurora Young.
This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
This is the dataset for the article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario".
Abstract:
Museums are embracing social technologies in the attempt to broaden their audience and to engage people. Although social communication seems an easy task, media managers know how hard it is to reach millions of people with a simple message. Indeed, millions of posts are competing every day to get visibility in terms of likes and shares and very little research focused on museums communication to identify best practices. In this paper, we focus on Twitter and we propose a novel method that exploits interpretable machine learning techniques to: (a) predict whether a tweet will likely be appreciated by Twitter users or not; (b) present simple suggestions that will help enhancing the message and increasing the probability of its success. Using a real-world dataset of around 40,000 tweets written by 23 world famous museums, we show that our proposed method allows identifying tweet features that are more likely to influence the tweet success.
Code to run a selection of experiments is available at https://github.com/rmartoglia/predict-twitter-ch
Dataset structure
This record contains the dataset used in the experiments of the above research paper. Only the extracted features for the museum tweet threads (not the full message text) are provided, and these are all that is needed for the analyses.
We selected 23 well-known art museums from around the world and grouped them into five groups: G1 (museums with at least three million followers); G2 (museums with more than one million followers); G3 (museums with more than 400,000 followers); G4 (museums with more than 200,000 followers); G5 (Italian museums). From these museums, we analyzed ca. 40,000 tweets, with the number varying from ca. 5k to ca. 11k per museum group, depending on the number of museums in each group.
Content features: these are the features that can be drawn from the content of the tweet itself. We further divide such features into the following two categories:
– Countable: these features have a value ranging into different intervals. We take into consideration: the number of hashtags (i.e., words preceded by #) in the tweet, the number of URLs (i.e., links to external resources), the number of images (e.g., photos and graphical emoticons), the number of mentions (i.e., twitter accounts preceded by @), the length of the tweet;
– On-Off : these features have binary values in {0, 1}. We observe whether the tweet has exclamation marks, question marks, person names, place names, organization names, other names. Moreover, we also take into consideration the tweet topic density: assuming that the involved topics correspond to the hashtags mentioned in the text, we define a tweet as dense of topics if the number of hashtags it contains is greater than a given threshold, set to 5. Finally, we observe the tweet sentiment that might be present (positive or negative) or not (neutral).
Context features: these features are not drawn from the content of the tweet itself and might give a larger picture of the context in which the tweet was sent. Namely, we take into consideration the part of the day in which the tweet was sent (morning, afternoon, evening, and night, respectively from 5:00am to 11:59am, from 12:00pm to 5:59pm, from 6:00pm to 10:59pm, and from 11:00pm to 4:59am), and a boolean feature indicating whether the tweet is a retweet.
User features: these features are proper to the user that sent the tweet, and are the same for all tweets by this user. Namely, we consider the name of the museum and the number of followers of the user.
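The countable and on-off content features above can be sketched with simple regular expressions over the raw tweet text. This is an illustration, not the paper's code: image counts, named entities, and sentiment need media metadata or NLP tooling and are omitted, and the tweet below is made up:

```python
import re

def content_features(text, topic_density_threshold=5):
    """Countable features plus the hashtag-based topic-density flag described above."""
    hashtags = re.findall(r"#\w+", text)
    return {
        "n_hashtags": len(hashtags),                          # words preceded by #
        "n_mentions": len(re.findall(r"@\w+", text)),         # accounts preceded by @
        "n_urls": len(re.findall(r"https?://\S+", text)),     # links to external resources
        "length": len(text),
        "has_exclamation": int("!" in text),
        "has_question": int("?" in text),
        "dense_topics": int(len(hashtags) > topic_density_threshold),
    }

feats = content_features("Visit our new #exhibition with @museum! https://example.org")
print(feats["n_hashtags"], feats["n_mentions"], feats["n_urls"])  # 1 1 1
```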
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The Health Insurance Marketplace Public Use Files contain data on health and dental plans offered to individuals and small businesses through the US Health Insurance Marketplace.
To help get you started, here are some data exploration ideas:
See this forum thread for more ideas, and post there if you want to add your own ideas or answer some of the open questions!
This data was originally prepared and released by the Centers for Medicare & Medicaid Services (CMS). Please read the CMS Disclaimer-User Agreement before using this data.
Here, we've processed the data to facilitate analytics. This processed version has three components:
The original versions of the 2014, 2015, 2016 data are available in the "raw" directory of the download and "../input/raw" on Kaggle Scripts. Search for "dictionaries" on this page to find the data dictionaries describing the individual raw files.
In the top-level directory of the download ("../input" on Kaggle Scripts), there are six CSV files that contain the combined data across all years:
Additionally, there are two CSV files that facilitate joining data across years:
The "database.sqlite" file contains tables corresponding to each of the processed CSV files.
The code to create the processed version of this data is available on GitHub.
Net-Receivables Time Series for Meta Platforms Inc. Meta Platforms, Inc. engages in the development of products that enable people to connect and share with friends and family through mobile devices, personal computers, virtual reality and mixed reality headsets, augmented reality, and wearables worldwide. It operates through two segments, Family of Apps (FoA) and Reality Labs (RL). The FoA segment offers Facebook, which enables people to build community through feed, reels, stories, groups, marketplace, and more; Instagram, which brings people closer through the Instagram feed, stories, reels, live, and messaging; Messenger, a messaging application for people to connect with friends, family, communities, and businesses across platforms and devices through text, audio, and video calls; Threads, an application for text-based updates and public conversations; and WhatsApp, a messaging application used by people and businesses to communicate and transact privately. The RL segment provides virtual, augmented, and mixed reality products comprising consumer hardware, software, and content that help people feel connected anytime, anywhere. The company was formerly known as Facebook, Inc. and changed its name to Meta Platforms, Inc. in October 2021. The company was incorporated in 2004 and is headquartered in Menlo Park, California.
These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed.

This study was designed to understand the economic and social structure of the online market for stolen data. The data provide information on the costs of various forms of personal information and cybercrime services, the payment systems used, the social organization and structure of the market, and interactions between buyers, sellers, and forum operators. The PIs used these data to assess the economy of stolen data markets, the social organization of participants, and the payment methods and services used. The study utilized a sample of approximately 1,900 threads generated from 13 web forums, 10 of which used Russian as their primary language and three of which used English. These forums were hosted around the world and acted as online advertising spaces for individuals to sell and buy a range of products. The content of these forums was downloaded and translated from Russian to English to create a purposive, yet convenient, sample of threads from each forum. The collection contains 1 SPSS data file (ICPSR Submission Economic File SPSS.sav) with 39 variables and 13,735 cases and 1 Access data file (Social Network Analysis File Revised 04-11-14.mdb) with a total of 16 data tables and 199 variables. Qualitative data used to examine the associations and working relationships present between participants at the micro and macro level are not available at this time.
The materials and datasets accompanying the paper “Fostering Constructive Online News Discussions: The Role of Sender Anonymity and Message Subjectivity in Shaping Perceived Polarization, Disinhibition, and Participation Intention in a Representative Sample of Online Commenters”. In this paper we report on an experiment in which we aimed to reduce perceived polarization and increase intention to join online news discussions by manipulating sender anonymity and message subjectivity (i.e., explicit acknowledgements that a statement represents the writer’s perspective, e.g., “I think that is not true”). The data files are not stored in TiU Dataverse but are accessible via the LISS Data Archive.

Data files:
Dataset_raw – SPSS raw data file
Dataset_restructured_coding incl – SPSS data file restructured from variables to cases; the coding of participants’ comments is included as an additional variable
Dataset_backstructured_for MEMORE – SPSS data file restructured back from cases to variables in order to conduct the mediation analysis in MEMORE
Coding participant comments – Excel file with the coding of participants’ comments by the R script, including the manual checking
SPSS Syntax – SPSS syntax with which the variables were constructed in the dataset
R Script – R script for all the analyses except the mediation, which was conducted in SPSS

Supplemental material:
Questionnaire
Design lists of stimuli
Stimuli lists (1-4)
Dutch words and phrases for automated subjectivity coding

Structure of the data package: From the raw dataset, we made the restructured dataset, which also includes the calculated variables (see the SPSS Syntax). This restructured dataset was the basis for the analyses in R. The backstructured dataset is based on the restructured dataset and is needed for conducting the repeated-measures mediation with SPSS MEMORE. The coding dataset was also analyzed in R and provides the input for the column “CodingComments” in the restructured dataset.
Method: Survey through the LISS panel Universe: The sample consisted of 302 participants, but after removing the 8 participants that had not completed the survey, the final sample consisted of 294 participants (Mage = 54.80, SDage = 15.53, range = 17 – 88 years; 55.4% male and 44.6% female). 3.1% of the sample completed only primary education, 25.6% reported high school as their highest completed education, 31.1% had attained secondary vocational education, 25.6% finished higher professional education, and 14.7% had a University degree as their highest qualification. Notably, whereas we preselected participants on their online activity, 49.7% of the sample indicated that they do not respond to online news articles anymore, suggesting that actual participation in online discussions fluctuates over time. Of the people that do react, 54.1% also engages in discussions in online news article threads. Of those, 8.8% discusses almost never, 45% multiple times per year, 35% multiple times per month, 10% multiple times per week, and 1.3% multiple times per day. Country/Nation: The Netherlands
In 2024, the number of data compromises in the United States stood at 3,158 cases. Meanwhile, over 1.35 billion individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: as a result of all three, sensitive data is accessed by an unauthorized threat actor.

Industries most vulnerable to data breaches
Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information that organizations in these sectors store. In 2024, financial services, healthcare, and professional services were the three industry sectors that recorded the most data breaches. Overall, the number of data breaches in some industry sectors in the United States has gradually increased within the past few years. However, some sectors saw a decrease.

Largest data exposures worldwide
In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records, by far the most extensive reported data leakage. This case, though, is unique because cyber security researchers found the vulnerability before the cyber criminals did. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records, then later, in 2017, came up with an updated number of leaked records, which was three billion. In March 2018, the third-biggest data breach happened, involving India’s national identification database Aadhaar. As a result of this incident, over 1.1 billion records were exposed.
This is the dataset used in the Telstra Network Disruptions competition that ran between Nov 2015 and Feb 2016. The competition provided a nice, small dataset that exercises many aspects of predictive modelling.
This dataset is re-uploaded since the original competition did not feature kernels; it is made available here to give people a chance to practice their data science/predictive modelling skills with this nice little dataset.
The goal of the problem is to predict Telstra network's fault severity at a time at a particular location based on the log data available. Each row in the main dataset (train.csv, test.csv) represents a location and a time point. They are identified by the "id" column, which is the key "id" used in other data files.
Fault severity has 3 categories: 0,1,2 (0 meaning no fault, 1 meaning only a few, and 2 meaning many).
Different types of features are extracted from log files and other sources: event_type.csv, log_feature.csv, resource_type.csv, severity_type.csv.
Note: “severity_type” is a feature extracted from the log files (in severity_type.csv). Often this is a severity type of a warning message coming from the log. "severity_type" is categorical. It does not have an ordering. “fault_severity” is a measurement of actual reported faults from users of the network and is the target variable (in train.csv).
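Since the "id" column is the join key across files, attaching a log-derived feature to the training rows is a simple lookup. A sketch with tiny in-memory rows standing in for train.csv and severity_type.csv (the severity-type values shown are made up):

```python
# Synthetic stand-ins for rows read from train.csv and severity_type.csv;
# both are keyed by the shared "id" column described above.
train = [{"id": 1, "fault_severity": 0}, {"id": 2, "fault_severity": 2}]
severity_type = [{"id": 1, "severity_type": "severity_type 2"},
                 {"id": 2, "severity_type": "severity_type 1"}]

# Build an id -> severity_type lookup and attach it to each training row.
lookup = {row["id"]: row["severity_type"] for row in severity_type}
for row in train:
    row["severity_type"] = lookup.get(row["id"])

print(train[0])  # {'id': 1, 'fault_severity': 0, 'severity_type': 'severity_type 2'}
```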
This dataset is made available for educational use only; it was shared by Telstra and Kaggle for the original competition and is subject to their terms of use.
How far up can you get in the post-deadline LB? :)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
1. Specialized mechanical structures produced by organisms provide crucial fitness advantages. The energetic cost associated with producing these structural materials and the resulting energetic trade-off with growth, however, is rarely quantified. We evaluate resource allocation to structural material production within the context of an energetic framework by combining an experimental manipulation with an energetic model.
2. Mytilid bivalves produce byssus, a network of collagen-like threads that tethers individuals to hard substrate. We hypothesized that a manipulation that induces the production of more byssal threads would result in increased energetic cost and decreased growth.
3. In month-long field experiments in spring and autumn, we severed byssal threads across a range of frequencies (never, weekly, daily), and measured shell and tissue growth. We then quantified the costs associated with the production of byssal threads using a Scope for Growth model.
4. We found that byssal thread removal increased byssal thread production and decreased growth. The cost calculated per byssal thread was similar in the spring and autumn (~1 J/thread), but energy budget calculations differed by season, and depended on thread quantity and seasonal differences in assumptions of metabolic costs.
5. This work demonstrates that the cost of producing a structural material has a substantial effect on mussel energetic state. The energetic cost of producing byssal threads was 2-8% of the energy budget in control groups that had low byssal thread production, and increased 6 to 11-fold (up to 47%) in mussels induced to produce threads daily.
6. We propose that characterizing the trade-off between the cost of biomaterial production and growth has implications for understanding the role of trade-offs in adaptive evolution, and improved natural resource management and conservation practices.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fixed-Asset-Turnover Time Series for Meta Platforms Inc. Meta Platforms, Inc. engages in the development of products that enable people to connect and share with friends and family through mobile devices, personal computers, virtual reality and mixed reality headsets, augmented reality, and wearables worldwide. It operates through two segments, Family of Apps (FoA) and Reality Labs (RL). The FoA segment offers Facebook, which enables people to build community through feed, reels, stories, groups, marketplace, and other; Instagram, which brings people closer through Instagram feed, stories, reels, live, and messaging; Messenger, a messaging application for people to connect with friends, family, communities, and businesses across platforms and devices through text, audio, and video calls; Threads, an application for text-based updates and public conversations; and WhatsApp, a messaging application that is used by people and businesses to communicate and transact in a private way. The RL segment provides virtual, augmented, and mixed reality related products comprising consumer hardware, software, and content that help people feel connected, anytime, and anywhere. The company was formerly known as Facebook, Inc. and changed its name to Meta Platforms, Inc. in October 2021. The company was incorporated in 2004 and is headquartered in Menlo Park, California.
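For context on the metric this time series tracks: fixed-asset turnover is the standard ratio of revenue to net fixed assets. A one-line sketch (the figures in the usage comment are made-up illustrations, not Meta's actual financials):

```python
def fixed_asset_turnover(revenue, net_fixed_assets):
    """Fixed-asset turnover: revenue generated per unit of net fixed assets."""
    return revenue / net_fixed_assets

# e.g. hypothetical revenue of 1000.0 against 250.0 in net fixed assets
ratio = fixed_asset_turnover(1000.0, 250.0)  # -> 4.0
```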
https://www.kappasignal.com/p/legal-disclaimer.html
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.
Historical daily stock prices (open, high, low, close, volume)
Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)
Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)
Feature engineering based on financial data and technical indicators
Sentiment analysis data from social media and news articles
Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)
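As a minimal, hedged illustration of two of the technical indicators mentioned above (simple moving average and RSI) in pure Python. Note this RSI uses plain averages over the first period rather than Wilder smoothing, and a real pipeline would more likely compute these with a library such as pandas:

```python
def sma(prices, window):
    """Simple moving average over a fixed window."""
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]

def rsi(prices, period=14):
    """Relative Strength Index from a price series (simple-average variant)."""
    gains, losses = [], []
    for prev, curr in zip(prices, prices[1:]):
        change = curr - prev
        gains.append(max(change, 0.0))
        losses.append(max(-change, 0.0))
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    if avg_loss == 0:
        return 100.0  # no losses in the window: maximally overbought
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```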
Stock price prediction
Portfolio optimization
Algorithmic trading
Market sentiment analysis
Risk management
Researchers investigating the effectiveness of machine learning in stock market prediction
Analysts developing quantitative trading Buy/Sell strategies
Individuals interested in building their own stock market prediction models
Students learning about machine learning and financial applications
The dataset may include different levels of granularity (e.g., daily, hourly)
Data cleaning and preprocessing are essential before model training
Regular updates are recommended to maintain the accuracy and relevance of the data
List of public parking spaces for people with disabilities in the city of Zaragoza. Source of information: Urban Mobility Service.
Tools for developers: the data are available in the following formats:
- http://www.zaragoza.es/georref/rdf/thread/parkingPeopleDisability_GeoRSS (GeoRSS, equipment coordinates in UTM format)
- http://www.zaragoza.es/georref/rdf/thread/parkingPeopleDisability_Equipment?srsname=wgs84 (GeoRSS, coordinates in WGS84 format)
- http://www.zaragoza.es/georref/json/thread/parkingPeopleDisability_geoJSON (GeoJSON, equipment coordinates in UTM format)
- http://www.zaragoza.es/georref/json/thread/parkingPeopleDisability_Equipment?srsname=wgs84 (GeoJSON, coordinates in WGS84 format)
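A hedged sketch of consuming a GeoJSON response like the endpoints above return. The sample document and its property names are assumptions about a typical FeatureCollection; the actual schema served by Zaragoza's service should be verified:

```python
import json

# Example GeoJSON of the general shape such endpoints return; the real
# feature properties for this service are not reproduced here.
sample = """{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-0.8891, 41.6488]},
     "properties": {"title": "Accessible parking space (hypothetical)"}}
  ]
}"""

def parking_coordinates(geojson_text):
    """Extract (lon, lat) pairs from a GeoJSON FeatureCollection."""
    data = json.loads(geojson_text)
    return [tuple(f["geometry"]["coordinates"]) for f in data["features"]]
```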
https://www.reddit.com/wiki/api
Description: This dataset presents an aggregated collection of discussion threads from a variety of stock market-related subreddits, compiled into a single .json file. It offers a comprehensive overview of community-driven discussions, opinions, analyses, and sentiments about various aspects of the stock market. This dataset is a valuable resource for understanding diverse perspectives on different stocks and investment strategies.
The single .json file contains aggregated data from the following subreddits:

| Subreddit Name | Subreddit Name | Subreddit Name |
| --- | --- | --- |
| r/AlibabaStock | r/IndiaInvestments | r/StockMarket |
| r/amcstock | r/IndianStockMarket | r/StocksAndTrading |
| r/AMD_Stock | r/investing_discussion | r/stocks |
| r/ATERstock | r/investing | r/StockTradingIdeas |
| r/ausstocks | r/pennystocks | r/teslainvestorsclub |
| r/BB_Stock | r/realestateinvesting | r/trakstocks |
| r/Bitcoin | r/RobinHoodPennyStocks | r/UKInvesting |
| r/Canadapennystocks | r/SOSStock | r/ValueInvesting |
| r/CanadianInvestor | r/STOCKMARKETNEWS | |
Dataset Format:
- The dataset is in .json format, facilitating easy parsing and analysis.
- Each entry in the file represents a distinct post or thread, complete with details such as title, score, number of comments, body, creation date, and comments.

Potential Applications:
- Sentiment analysis across different investment communities.
- Comparative analysis of discussions and trends across various stocks and sectors.
- Behavioral analysis of investors in different market scenarios.

Caveats:
- The content is user-generated and may contain biases or subjective opinions.
- The data reflects specific time periods and may not represent current market sentiments or trends.
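A minimal sketch of working with entries of the shape described above. The field names (`score`, `title`) follow the format description, but the exact schema of the .json file should be verified before use:

```python
import json

def load_threads(path):
    """Load the aggregated .json file (assumed to be a JSON array of posts)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def top_by_score(entries, n=5):
    """Return the n highest-scoring posts, assuming a numeric 'score' field."""
    return sorted(entries, key=lambda e: e.get("score", 0), reverse=True)[:n]
```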
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Altman Z-Score Time Series for Meta Platforms Inc. Meta Platforms, Inc. engages in the development of products that enable people to connect and share with friends and family through mobile devices, personal computers, virtual reality and mixed reality headsets, augmented reality, and wearables worldwide. It operates through two segments, Family of Apps (FoA) and Reality Labs (RL). The FoA segment offers Facebook, which enables people to build community through feed, reels, stories, groups, marketplace, and other; Instagram, which brings people closer through Instagram feed, stories, reels, live, and messaging; Messenger, a messaging application for people to connect with friends, family, communities, and businesses across platforms and devices through text, audio, and video calls; Threads, an application for text-based updates and public conversations; and WhatsApp, a messaging application that is used by people and businesses to communicate and transact in a private way. The RL segment provides virtual, augmented, and mixed reality related products comprising consumer hardware, software, and content that help people feel connected, anytime, and anywhere. The company was formerly known as Facebook, Inc. and changed its name to Meta Platforms, Inc. in October 2021. The company was incorporated in 2004 and is headquartered in Menlo Park, California.
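For context on the metric this time series tracks: the classic Altman Z-score (original 1968 formulation for public firms) combines five balance-sheet and income-statement ratios. A sketch, where the figures in the usage example are made-up illustrations rather than Meta's actual financials:

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Classic Altman Z-score: 1.2A + 1.4B + 3.3C + 0.6D + 1.0E."""
    a = working_capital / total_assets       # liquidity
    b = retained_earnings / total_assets     # cumulative profitability
    c = ebit / total_assets                  # operating efficiency
    d = market_value_equity / total_liabilities  # leverage
    e = sales / total_assets                 # asset turnover
    return 1.2 * a + 1.4 * b + 3.3 * c + 0.6 * d + 1.0 * e

# Hypothetical inputs for illustration only:
z = altman_z(50.0, 100.0, 80.0, 400.0, 300.0, 500.0, 200.0)
```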
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Regional water strategies (RWS) set the direction for water planning and management at a regional scale over the next 20-40 years. There are 14 regional water strategies (including the Greater Sydney Water Strategy), tailored to the unique challenges and needs of each region. They have been developed in partnership with water service providers, local councils, communities, Aboriginal people and other stakeholders across NSW.

The boundaries of regional water strategy areas define regions in NSW for which regional water strategies are prepared. The boundaries are based on several factors, including:

- surface water hydrology
- statutory instruments, such as water sharing plans and water resource plans
- economic, social and cultural factors
- government strategic plans

The boundaries of regional water strategy (RWS) areas mostly, but not exclusively, align with groups of water sharing plan boundaries for surface water sources:

- In coastal areas, RWS boundaries align with Water Sharing Plan boundaries.
- In inland areas, RWS boundaries align with Water Resource Plan boundaries.

The NSW Murray RWS also includes the area for the Wentworth weir pool. Its boundary is further defined using the Bureau of Meteorology's (BoM) geofabric AHGF Catchment layer to include the catchments that incorporate the Wentworth Weir pool. The Fish River–Wywandy RWS boundary was also further defined using the BoM geofabric, local council boundaries and National Parks Estate boundaries.

Note: If you would like to ask a question, make any suggestions, or tell us how you are using this dataset, please visit the NSW Water Hub, which has an online forum you can join.
Many families, despite need and eligibility, struggle to meet program deadlines to retain critical benefits. When families fail to complete program recertification on time, they lose needed support. While scholars have tested behavioral theories like chunking, implementation intention, and loss framing to promote program uptake, less is known about how well-designed communications can promote continuity through successful recertification, especially where recertification entails significant administrative burden. Further, scant evidence guides how best to frame recertification deadlines. In a randomized trial with government partners (n = 3,539), we find that sending a reminder letter informed by these behavioral theories increased the number of families maintaining participation by 14 percent. Further, anchoring people to a deadline month, rather than to a specific day, may suffice to thread the motivational needle: overcoming procrastination without lowering self-efficacy. Adopting the most effective letter in Washington, DC would lead 766 more families to participate uninterrupted each year.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time spent on Threads has dropped sharply since launch