CC0 1.0 Universal Public Domain Dedication
https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes sanitized password frequency lists collected from Yahoo in May 2011. For details of the original collection experiment, please see: Bonneau, Joseph. "The science of guessing: analyzing an anonymized corpus of 70 million passwords." IEEE Symposium on Security & Privacy, 2012. http://www.jbonneau.com/doc/B12-IEEESP-analyzing_70M_anonymized_passwords.pdf

This data has been modified to preserve differential privacy. For details of this modification, please see: Jeremiah Blocki, Anupam Datta and Joseph Bonneau. "Differentially Private Password Frequency Lists." Network & Distributed Systems Symposium (NDSS), 2016. http://www.jbonneau.com/doc/BDB16-NDSS-pw_list_differential_privacy.pdf

Each of the 51 .txt files represents one subset of all users' passwords observed during the experiment period. "yahoo-all.txt" includes all users; every other file represents a strict subset of that group.

Each file is a series of lines of the format:

FREQUENCY #OBSERVATIONS

...with FREQUENCY in descending order. For example, the file:

3 1
2 1
1 3

would represent the frequency list (3, 2, 1, 1, 1), that is, one password observed 3 times, one observed twice, and three separate passwords observed once each.
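As a quick sketch, the format above can be expanded into a flat frequency list with a few lines of Python (the helper name and the in-memory sample are hypothetical, not part of the dataset):

```python
from io import StringIO

def parse_freq_list(lines):
    """Expand 'FREQUENCY OBSERVATIONS' lines into a flat list of frequencies."""
    freqs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        frequency, observations = int(parts[0]), int(parts[1])
        freqs.extend([frequency] * observations)
    return freqs

# The example file from the description above, as an in-memory stand-in.
sample = StringIO("3 1\n2 1\n1 3\n")
print(parse_freq_list(sample))  # [3, 2, 1, 1, 1]
```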
This dataset includes daily historical price data for Bitcoin (BTC-USD) from 2014 to 2025, obtained through web scraping from the Yahoo Finance page using Selenium. The primary data source can be accessed at Yahoo Finance - Bitcoin Historical Data. The dataset contains daily information such as opening price (Open), highest price (High), lowest price (Low), closing price (Close), adjusted closing price (Adj Close), and trading volume (Volume).
About Bitcoin: Bitcoin (BTC) is the world's first decentralized digital currency, introduced in 2009 by an anonymous creator known as Satoshi Nakamoto. It operates on a peer-to-peer network powered by blockchain technology, enabling secure, transparent, and trustless transactions without the need for intermediaries like banks. Bitcoin's limited supply of 21 million coins and its growing adoption have made it a popular asset for investment, trading, and as a hedge against inflation.
We are excited to share this dataset and look forward to seeing the insights it can provide. We hope it will inspire collaboration and innovation within the community. By leveraging this daily data, we can explore trends, develop predictive models, and design innovative trading strategies that deepen our understanding of Bitcoin's market behavior. Together, we can unlock new opportunities and contribute to the collective advancement of cryptocurrency research and analysis.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for CommonCatalog CC-BY-NC
This dataset is a large collection of high-resolution Creative Commons images (composed of different licenses; see Table 1 in the paper's Appendix) collected in 2014 from users of Yahoo Flickr. The dataset contains images of up to 4K resolution, making this one of the highest-resolution captioned image datasets.
Dataset Details
Dataset Description
We provide synthetic captions for approximately 100 million high… See the full description on the dataset page: https://huggingface.co/datasets/common-canvas/commoncatalog-cc-by-nc.
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Recycling Initiatives: This model can be used in smart waste segregation systems to automatically identify and sort different types of plastic bottles, cans, and other recyclables. This could save significant manual labor and increase overall recycling efficiency.
Retail Inventory Management: The model could be used in supermarkets or stores to autonomously monitor their inventory. By identifying different types of bottles and other items, the system could keep track of what's in stock and needs replenishment, especially within grocery stores or beverage industry retailers.
Pollution Monitoring: Environmental organizations could use this model for monitoring plastic pollution in public spaces, oceans, or beaches. By recognizing specific brands and kinds of bottles, data could be accumulated to hold companies accountable for their environmental footprints.
Brand Strategy Analysis: Companies could use this model to analyze the presence and positioning of their products in various scenarios (like events, homes, public spaces). They could track consumption patterns, target demographics, and even assess the impact of branding campaigns.
Customized Beverage Vending Machines: Vending machines could use this model to provide a unique user experience. Instead of standard buttons, users could hold up the bottle or can they want, and the machine could recognize the object and dispense the corresponding beverage.
In 2024, the number of data compromises in the United States stood at 3,158 cases. Meanwhile, over 1.35 billion individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: in all three incidents, sensitive data is accessed by an unauthorized threat actor.

Industries most vulnerable to data breaches
Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information that organizations in these sectors store. In 2024, financial services, healthcare, and professional services were the three industry sectors that recorded the most data breaches. Overall, the number of data breaches in some industry sectors in the United States has gradually increased within the past few years. However, some sectors saw a decrease.

Largest data exposures worldwide
In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records. This, by far, is the most extensive reported data leakage. This case, though, is unique because cybersecurity researchers found the vulnerability before the cybercriminals did. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records, then later, in 2017, updated the number of leaked records to three billion. In March 2018, the third-biggest data breach happened, involving India's national identification database, Aadhaar. As a result of this incident, over 1.1 billion records were exposed.
The largest reported data leakage as of January 2025 was the Cam4 data breach in March 2020, which exposed more than 10 billion data records. The second-largest data breach in history so far, the Yahoo data breach, occurred in 2013. The company initially reported about one billion exposed data records, but after an investigation, the company updated the number, revealing that three billion accounts were affected. The National Public Data breach was announced in August 2024. The incident became public when personally identifiable information of individuals became available for sale on the dark web. Overall, security professionals estimate the leakage of nearly three billion personal records. The next significant data leakage was the March 2018 security breach of India's national ID database, Aadhaar, with over 1.1 billion records exposed. This included biometric information such as identification numbers and fingerprint scans, which could be used to open bank accounts and receive financial aid, among other government services.
Cybercrime - the dark side of digitalization
As the world continues its journey into the digital age, corporations and governments across the globe have been increasing their reliance on technology to collect, analyze, and store personal data. This, in turn, has led to a rise in the number of cybercrimes, ranging from minor breaches to global-scale attacks impacting billions of users, such as in the case of Yahoo. Within the U.S. alone, 1,802 cases of data compromise were reported in 2022. This was a marked increase from the 447 cases reported a decade prior.

The high price of data protection
As of 2022, the average cost of a single data breach across all industries worldwide stood at around 4.35 million U.S. dollars. This was found to be most costly in the healthcare sector, with each leak reported to have cost the affected party a hefty 10.1 million U.S. dollars. The financial segment followed closely behind. Here, each breach resulted in a loss of approximately 6 million U.S. dollars - 1.5 million more than the global average.
Do people share their feelings of guilt with others and, if so, what are the reasons for doing or not doing this? Even though social sharing of negative emotional experiences, such as regret, has been extensively studied, not much is known about the sharing of guilt. We report three studies on the sharing of guilt. In Study 1, we re-analyzed data about sharing guilt experiences posted on a social website called "Yahoo Answers," and found that people share intrapersonal as well as interpersonal guilt experiences with others online. Study 2 found that the main motivations for sharing guilt (compared with the sharing of regret) were "venting", "clarification and meaning", and "gaining advice". Study 3 found that people were more likely to share experiences of interpersonal guilt and more likely to keep experiences of intrapersonal guilt to themselves. Together, these studies contribute to the understanding of the social sharing of the emotion of guilt. Additional documentation and metadata can be found in the files Data Report Chapter 5XLZ.pdf, Documentation of all author responsibilities.pdf, and the metadata files in the rawdata folders. This research has preregistered all materials, hypotheses, and sample sizes through: https://aspredicted.org/blind.php?x=md5f3b (for Study 2); https://aspredicted.org/blind.php?x=ay7vk9 (for Study 3). The present data package includes raw data files (raw data + metadata information, both in Excel), a syntax file (SPSS), and materials (questionnaires in PDF from MTurk).
$TQQQ is a leveraged ETF that seeks 3x the daily return of the Nasdaq-100 index.
Data from May 28, 2018 - May 28, 2023
TQQQ
Data has 7 columns: Date, Open, High, Low, Close, Adj. Close, Volume.
Dividend and stock-split events, which may interfere with price analysis, are recorded in the separate files below.
TQQQ_Dividends
Data has 2 columns: Date, Dividends.
TQQQ_Stock_Splits
Data has 2 columns: Date, Stock Splits.
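A minimal pandas sketch of how the split file can be joined to the price file so that split days are flagged before analysis; the column names follow the descriptions above, but the rows are made up:

```python
import pandas as pd
from io import StringIO

# Hypothetical miniature versions of the TQQQ and TQQQ_Stock_Splits files.
prices = pd.read_csv(StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2022-01-12,62.0,63.1,61.5,62.8,62.8,1000000\n"
    "2022-01-13,62.8,63.0,60.2,60.5,60.5,1200000\n"
), parse_dates=["Date"])

splits = pd.read_csv(StringIO(
    "Date,Stock Splits\n"
    "2022-01-13,2.0\n"
), parse_dates=["Date"])

# Left-join so split days can be identified and excluded during analysis.
merged = prices.merge(splits, on="Date", how="left")
clean = merged[merged["Stock Splits"].isna()].drop(columns="Stock Splits")
print(len(clean))  # number of rows with no split event
```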
https://creativecommons.org/publicdomain/zero/1.0/
Exchange-Traded Funds (ETFs) have gained significant popularity in recent years as a low-cost alternative to Mutual Funds. This dataset, compiled from Yahoo Finance, offers a comprehensive overview of the US funds market, encompassing 23,783 Mutual Funds and 2,310 ETFs.
Data
The dataset provides a wealth of information on each fund, including:
General fund aspects: total net assets, fund family, inception date, expense ratios, and more.
Portfolio indicators: cash allocation, sector weightings, holdings diversification, and other key metrics.
Historical returns: year-to-date, 1-year, 3-year, and other performance data for different time periods.
Financial ratios: price/earnings ratio, Treynor and Sharpe ratios, alpha, beta, and ESG scores.

Applications
This dataset can be leveraged by investors, researchers, and financial professionals for a variety of purposes, including:
Investment analysis: comparing the performance and characteristics of Mutual Funds and ETFs to make informed investment decisions.
Portfolio construction: using the data to build diversified portfolios that align with investment goals and risk tolerance.
Research and analysis: studying market trends, fund behavior, and other factors to gain insights into the US funds market.

Inspiration and Updates
The dataset was inspired by the surge of interest in ETFs in 2017 and the subsequent shift away from Mutual Funds. The data is sourced from Yahoo Finance, a publicly available website, ensuring transparency and accessibility. Updates are planned every 1-2 semesters to keep the data current and relevant.
Conclusion
This comprehensive dataset offers a valuable resource for anyone seeking to gain a deeper understanding of the US funds market. By providing detailed information on a wide range of funds, the dataset empowers investors to make informed decisions and build successful investment portfolios.
Access the dataset and unlock the insights it offers to make informed investment decisions.
https://creativecommons.org/publicdomain/zero/1.0/
ETFs represent a cheap alternative to Mutual Funds and have grown fast over the last decade. Was the 2017 hype around ETFs confirmed by good returns in 2018? The updated version reflects October 2021 financial values.
The file contains 24,821 Mutual Funds and 1,680 ETFs with general aspects (such as total net assets, management company, and size), portfolio indicators (such as cash, stocks, bonds, and sectors), returns (such as year-to-date and 2020-11), and financial ratios (such as price/earnings, Treynor and Sharpe ratios, alpha, and beta).
Data has been scraped from the publicly available website https://finance.yahoo.com.
The datasets allow for multiple comparisons regarding portfolio decisions by investment managers in Mutual Funds and portfolio restrictions to the indexes in ETFs. The inspiration comes from the 2017 hype regarding ETFs, which convinced many investors to buy shares of Exchange-Traded Funds rather than Mutual Funds. The datasets will be updated every one or two semesters, hopefully with additional information scraped from Morningstar.com.
This dataset contains historical price data for General Electric Company, as recorded by Yahoo. You can use this data for regression problems.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains historical daily prices for all tickers currently trading on NASDAQ. The up-to-date list is available from nasdaqtrader.com. The historic data is retrieved from Yahoo Finance via the yfinance Python package.
It contains prices up to April 1, 2020. If you need more up-to-date data, just fork and re-run the data collection script, also available from Kaggle.
The data for every symbol is saved in CSV format with common fields.
All ticker data is then stored in either the ETFs or stocks folder, depending on its type. Each filename is the corresponding ticker symbol. Finally, symbols_valid_meta.csv contains some additional metadata for each ticker, such as its full name.
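A sketch of how that folder layout might be navigated with pandas; it assumes symbols_valid_meta.csv carries Symbol and ETF (Y/N) columns, which is an assumption rather than something stated in the description above:

```python
import pandas as pd
from pathlib import Path

def load_ticker(root, symbol, meta):
    """Locate a symbol's CSV in the ETFs/ or stocks/ folder via the metadata table."""
    row = meta.loc[meta["Symbol"] == symbol].iloc[0]
    folder = "ETFs" if row["ETF"] == "Y" else "stocks"  # assumed Y/N flag
    return pd.read_csv(Path(root) / folder / f"{symbol}.csv", parse_dates=["Date"])
```

Usage would be `load_ticker("data", "AAPL", pd.read_csv("data/symbols_valid_meta.csv"))`, with paths adjusted to wherever the dataset was extracted.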
A cryptocurrency, crypto-currency, or crypto is a collection of binary data which is designed to work as a medium of exchange. Individual coin ownership records are stored in a ledger, which is a computerized database using strong cryptography to secure transaction records, to control the creation of additional coins, and to verify the transfer of coin ownership. Cryptocurrencies are generally fiat currencies, as they are not backed by or convertible into a commodity. Some crypto schemes use validators to maintain the cryptocurrency. In a proof-of-stake model, owners put up their tokens as collateral. In return, they get authority over the token in proportion to the amount they stake. Generally, these token stakers gain additional ownership in the token over time via network fees, newly minted tokens, or other such reward mechanisms.
Cryptocurrency does not exist in physical form (like paper money) and is typically not issued by a central authority. Cryptocurrencies typically use decentralized control, as opposed to a central bank digital currency (CBDC). When a cryptocurrency is minted, created prior to issuance, or issued by a single issuer, it is generally considered centralized. When implemented with decentralized control, each cryptocurrency works through distributed ledger technology, typically a blockchain, that serves as a public financial transaction database.
A cryptocurrency is a tradable digital asset or digital form of money, built on blockchain technology that only exists online. Cryptocurrencies use encryption to authenticate and protect transactions, hence their name. There are currently over a thousand different cryptocurrencies in the world, and many see them as the key to a fairer future economy.
Bitcoin, first released as open-source software in 2009, is the first decentralized cryptocurrency. Since the release of bitcoin, many other cryptocurrencies have been created.
This dataset is a collection of records of 3,000+ different cryptocurrencies:
* Top 395+ from 2021
* Top 3,000+ from 2023
This data is collected from https://finance.yahoo.com/. If you want to learn more, you can visit the website.
Cover Photo by Worldspectrum: https://www.pexels.com/photo/ripple-etehereum-and-bitcoin-and-micro-sdhc-card-844124/
http://opendatacommons.org/licenses/dbcl/1.0/
The following information can also be found at https://www.kaggle.com/davidwallach/financial-tweets. Out of curiosity, I cleaned the .csv files to perform a sentiment analysis, so both .csv files in this dataset were created by me.
Everything in the description below was written by David Wallach; using this information, I performed my first ever sentiment analysis.
"I have been interested in using public sentiment and journalism to gather sentiment profiles on publicly traded companies. I first developed a Python package (https://github.com/dwallach1/Stocker) that scrapes the web for articles written about companies, and then noticed the abundance of overlap with Twitter. I then developed a NodeJS project that I have been running on my RaspberryPi to monitor Twitter for all tweets coming from those mentioned in the content section. If one of them tweeted about a company in the stocks_cleaned.csv file, then it would write the tweet to the database. Currently, the file is only from earlier today, but after about a month or two, I plan to update the tweets.csv file (hopefully closer to 50,000 entries).
I am not quite sure how this dataset will be relevant, but I hope to use these tweets and try to generate some sense of public sentiment score."
This dataset has all the publicly traded companies (tickers and company names) that were used as input to fill tweets.csv. The influencers whose tweets were monitored were: ['MarketWatch', 'business', 'YahooFinance', 'TechCrunch', 'WSJ', 'Forbes', 'FT', 'TheEconomist', 'nytimes', 'Reuters', 'GerberKawasaki', 'jimcramer', 'TheStreet', 'TheStalwart', 'TruthGundlach', 'Carl_C_Icahn', 'ReformedBroker', 'benbernanke', 'bespokeinvest', 'BespokeCrypto', 'stlouisfed', 'federalreserve', 'GoldmanSachs', 'ianbremmer', 'MorganStanley', 'AswathDamodaran', 'mcuban', 'muddywatersre', 'StockTwits', 'SeanaNSmith']
The data used here is gathered from a project I developed : https://github.com/dwallach1/StockerBot
I hope to develop a financial sentiment text classifier that would be able to track Twitter's (and the entire public's) feelings about any publicly traded company (and cryptocurrency)
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains daily stock data for Meta Platforms, Inc. (META), formerly Facebook Inc., from May 19, 2012, to January 20, 2025. It offers a comprehensive view of Meta’s stock performance and market fluctuations during a period of significant growth, acquisitions, and technological advancements. This dataset is valuable for financial analysis, market prediction, machine learning projects, and evaluating the impact of Meta’s business decisions on its stock price.
The dataset includes the following key features:
Date: The date of the trading day, formatted as YYYY-MM-DD.
Open: The stock price at the start of the trading day.
High: The highest price reached by the stock during the trading day.
Low: The lowest price reached by the stock during the trading day.
Close: The stock price at the end of the trading day.
Adj Close: The adjusted closing price, which reflects corporate actions like stock splits and dividend payouts.
Volume: The total number of shares traded on that specific day.
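For example, a simple daily-return feature can be derived from the Close column with pandas; the prices below are illustrative stand-ins, not actual META quotes:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-16", "2025-01-17", "2025-01-20"]),
    "Close": [611.3, 612.8, 606.7],  # made-up closing prices
}).set_index("Date")

# Daily return: percentage change between consecutive closing prices.
df["Return"] = df["Close"].pct_change()
print(df["Return"].round(4).tolist())
```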
This dataset was sourced from reliable public APIs such as Yahoo Finance or Alpha Vantage. It is provided for educational and research purposes and is not affiliated with Meta Platforms, Inc. Users are encouraged to adhere to the terms of use of the original data provider.
https://creativecommons.org/publicdomain/zero/1.0/
https://creativecommons.org/publicdomain/zero/1.0/
Used data from Yahoo Finance to get daily data for the opening and closing prices, highest and lowest prices, and volume of the S&P 500 index.
Code: GitHub. Used the yfinance library (GitHub) to import data from Yahoo Finance directly. Some processing of the data was done.
All but a few open prices were missing between 1962-01-01 and 1982-04-10. For these, it was assumed that the open price equals the closing price of the previous trading day.
Volume figures until 1949-12-13 are not available.
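The open-price assumption above can be reproduced in pandas with a one-liner; the frame below is a made-up stand-in for the 1962-1982 span:

```python
import numpy as np
import pandas as pd

# Toy frame: the first two opens are missing (values are illustrative).
s = pd.DataFrame({
    "Open": [np.nan, np.nan, 101.5],
    "Close": [100.0, 101.0, 102.0],
})
# Assume a missing open equals the previous trading day's close.
s["Open"] = s["Open"].fillna(s["Close"].shift())
print(s["Open"].tolist())
```

The very first row has no previous close, so it stays missing.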
Some earlier years have fewer than expected trading days:

| Year | Number of Trading Days Recorded |
| --- | --- |
| 1927 | 1 |
| 1928 | 195 |
| 1929 | 199 |
| 1930 | 155 |
| 1931 | 183 |
| 1932 | 169 |
| 1933 | 136 |
| 1934 | 91 |
| 1935 | 83 |
| 1936 | 107 |
| 1937 | 83 |
| 1938 | 57 |
| 1939 | 27 |
| 1940 | 8 |
| 1941 | 6 |
| 1942 | 16 |
| 1943 | 7 |
| 1944 | 6 |
| 1945 | 42 |
| 1946 | 48 |
| 1947 | 18 |
| 1948 | 16 |
| 1949 | 1 |
| 1968 | 226 |
1. Percentage gain/loss (calculated as the percentage difference between the closing prices of two consecutive trading days)
2. Price variation percentage: (High - Low) / Close
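Both derived columns can be sketched in pandas against the OHLC columns described earlier; the values below are made up:

```python
import pandas as pd

# Illustrative two-day frame.
sp = pd.DataFrame({
    "High": [4550.0, 4560.0],
    "Low": [4480.0, 4505.0],
    "Close": [4530.0, 4540.0],
})
# 1. Percentage gain/loss between consecutive closing prices.
sp["PctGainLoss"] = sp["Close"].pct_change() * 100
# 2. Price variation percentage: (High - Low) / Close.
sp["Variation"] = (sp["High"] - sp["Low"]) / sp["Close"]
```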
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains daily OHLCV data for ~2,000 Indian stocks listed on the National Stock Exchange, covering their full available history. The columns are multi-index columns, so this needs to be taken into account when reading and using the data.
Source: Yahoo Finance
Type: All files are in CSV format.
Currency: INR
All the tickers have been collected from here : https://www.nseindia.com/market-data/securities-available-for-trading
If using pandas, the following function is a utility to read any of the CSV files:
```
import pandas as pd

def read_ohlcv(filename):
    """Read a given OHLCV data file downloaded from yfinance."""
    return pd.read_csv(
        filename,
        skiprows=[0, 1, 2],  # skip the multi-index header rows that cause trouble
        names=["Date", "Close", "High", "Low", "Open", "Volume"],
        index_col="Date",
        parse_dates=["Date"],
    )
```
Analyzing stock prices is interesting.
Data from yahoo.com/finance: AMD and Google historical prices, 5/22/2009 ~ 5/03/2017, daily price and volume. There are 7 columns: Date, Open, High, Low, Close, Volume, Adj Close, with shape (2001, 7) for each stock.
Source: Yahoo Finance
I want to find the relationship between volume and price.
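One starting point is correlating volume with the size of the daily price move; a minimal pandas sketch with made-up numbers (not the actual AMD/Google data):

```python
import pandas as pd

# Illustrative daily closes and volumes.
df = pd.DataFrame({
    "close": [10.0, 10.5, 10.2, 11.0, 10.8],
    "volume": [1.0e6, 1.4e6, 0.9e6, 1.8e6, 1.1e6],
})
# Correlate volume with the absolute daily price change:
# large moves often coincide with heavy trading.
corr = df["close"].diff().abs().corr(df["volume"])
print(round(corr, 3))
```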