https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains historical daily prices for all tickers currently trading on NASDAQ. The up-to-date list is available from nasdaqtrader.com. The historical data is retrieved from Yahoo Finance via the yfinance Python package.
It contains prices up to April 1, 2020. If you need more recent data, just fork and re-run the data collection script, which is also available from Kaggle.
The data for every symbol is saved in CSV format with common fields:
All ticker data is stored in either the ETFs or the stocks folder, depending on the instrument type, and each file is named after the corresponding ticker symbol. Finally, symbols_valid_meta.csv
contains additional metadata for each ticker, such as its full name.
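As a quick sketch of how a single ticker might be loaded (the folder names follow the layout described above; AAPL and the Date column name are illustrative assumptions based on the usual Yahoo Finance export):

```python
import pandas as pd

# One CSV per ticker, stored under the stocks/ or ETFs/ folder depending on type.
# Column names are assumed to follow the usual Yahoo Finance export.
aapl = pd.read_csv("stocks/AAPL.csv", parse_dates=["Date"])

# Additional per-ticker metadata, such as the full security name.
meta = pd.read_csv("symbols_valid_meta.csv")

print(aapl.tail())
print(meta.head())
```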
Context
The StockNet dataset, introduced by Xu and Cohen at ACL 2018, is a benchmark for measuring the effectiveness of textual information in stock market prediction. While the original dataset provides valuable price and news data, it requires significant pre-processing and feature engineering to be used effectively in advanced machine learning models.
This dataset was created to bridge that gap. We have taken the original data for 87 stocks and performed extensive feature engineering, creating a rich, multi-modal feature repository.
A key contribution of this work is a preliminary statistical analysis of the news data for each stock. Based on the consistency and volume of news, we have categorized the 87 stocks into two distinct groups, allowing researchers to choose the most appropriate modeling strategy:
joint_prediction_model_set: Stocks with rich and consistent news data, ideal for building complex, single models that analyze all stocks jointly.
panel_data_model_set: Stocks with less consistent news data, which are better suited for traditional panel data analysis.
Content and File Structure
The dataset is organized into two main directories, corresponding to the two stock categories mentioned above.
1. joint_prediction_model_set
This directory contains stocks suitable for sophisticated, news-aware joint modeling.
-Directory Structure: This directory contains a separate sub-directory for each stock suitable for joint modeling (e.g., AAPL/, MSFT/, etc.).
-Folder Contents: Inside each stock's folder, you will find a set of files, each corresponding to a different category of engineered features. These files include:
-News Graph Embeddings: A NumPy tensor file (.npy) containing the encoded graph embeddings from daily news. Its shape is (Days, N, 128), where N is the number of daily articles.
-Engineered Features: A CSV file containing fundamental features derived directly from OHLCV data (e.g., intraday_range, log_return).
-Technical Indicators: A CSV file with a wide array of popular technical indicators (e.g., SMA, EMA, MACD, RSI, Bollinger Bands).
-Statistical & Time Features: A CSV file with rolling statistical features (e.g., volatility, skew, kurtosis) over an optimized window, plus cyclical time-based features.
-Advanced & Transformational Features: A CSV file with complex features like lagged variables, wavelet transform coefficients, and the Hurst Exponent.
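As an illustrative sketch of how the files listed above might be loaded for one stock (the exact file names inside each stock folder are assumptions, not part of the original release):

```python
import numpy as np
import pandas as pd

stock_dir = "joint_prediction_model_set/AAPL"  # hypothetical folder layout

# News graph embeddings: tensor of shape (Days, N, 128), N = daily articles.
news_emb = np.load(f"{stock_dir}/news_graph_embeddings.npy")  # assumed file name

# One simple way to collapse the per-article axis into a single daily vector.
daily_news = news_emb.mean(axis=1)  # -> (Days, 128)

# Tabular feature files (assumed names, one CSV per feature category).
engineered = pd.read_csv(f"{stock_dir}/engineered_features.csv")
technical = pd.read_csv(f"{stock_dir}/technical_indicators.csv")

print(news_emb.shape, daily_news.shape, engineered.shape, technical.shape)
```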
2. panel_data_model_set
This directory contains stocks that are more suitable for panel data models, based on the statistical properties of their associated news data.
-Directory Structure: Similar to the joint prediction set, this directory also contains a separate sub-directory for each stock in this category.
-Folder Contents: Inside each stock's folder, you will find the cleaned and structured price and news text data. This facilitates the application of econometric models or machine learning techniques designed for panel data, where observations are tracked for the same subjects (stocks) over a period of time.
-Further Information: For a detailed breakdown of the statistical analysis used to separate the stocks into these two groups, please refer to the data_preview.ipynb notebook located in the TRACE_ACL18_raw_data directory.
Methodology
The features for the joint_prediction_model_set were generated systematically for each stock:
-News-to-Graph Pipeline: Daily news headlines were processed to extract named entities. These entities were then used to query Wikidata and build knowledge subgraphs. A Graph Convolutional Network (GCN) model encoded these graphs into dense vectors.
-Feature Engineering: All other features were generated from the raw price and volume data. The process included basic calculations, technical analysis via pandas-ta, generation of statistical and time-based features, and advanced transformations like wavelet analysis.
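For orientation, a rough sketch of that style of price-feature pipeline with pandas and pandas-ta is shown below; it is not the exact code used to build the dataset, and the input file name and window length are placeholders:

```python
import numpy as np
import pandas as pd
import pandas_ta as ta  # technical indicators

df = pd.read_csv("AAPL_ohlcv.csv", parse_dates=["Date"])  # placeholder OHLCV file

# Fundamental engineered features derived from OHLCV.
df["log_return"] = np.log(df["Close"] / df["Close"].shift(1))
df["intraday_range"] = (df["High"] - df["Low"]) / df["Open"]

# A few of the popular indicators mentioned above, via the pandas-ta accessor.
df.ta.sma(length=20, append=True)
df.ta.rsi(length=14, append=True)
df.ta.macd(append=True)

# Rolling statistical features (the dataset uses an optimized window; 21 is illustrative).
window = 21
df["volatility"] = df["log_return"].rolling(window).std()
df["skew"] = df["log_return"].rolling(window).skew()
df["kurtosis"] = df["log_return"].rolling(window).kurt()
```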
Acknowledgements
This dataset is an extension and transformation of the original StockNet dataset. We extend our sincere gratitude to the original authors for their contribution to the field.
Original Paper: "Stock Movement Prediction from Tweets and Historical Prices" by Yumo Xu and Shay B. Cohen (ACL 2018).
Original Data Repository: https://github.com/yumoxu/stocknet-dataset
Inspiration
This dataset opens the door to numerous exciting research questions:
-Can you build a single, powerful joint model using the joint_prediction_model_set to predict movements for all stocks simultaneously?
-How does a sophisticated joint model compare against a traditional panel data model trained on the panel_data_model_set?
-What is the lift in predictive power from using news-based graph embeddings versus using only technical indicators?
-Can you apply transfer learning or multi-task learning, using the feature-rich joint set to improve predictions for the panel set?
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SUMMARY & CONTEXT
This dataset aims to provide a comprehensive, rolling 20-year history of the constituent stocks and their corresponding weights in India's Nifty 50 index. The data begins on January 31, 2008, and is actively maintained with monthly updates. After hitting the 20-year mark, as new monthly data is added, the oldest month's data will be removed to maintain a consistent 20-year window. This dataset was developed as a foundational feature for a graph-based model analyzing the market structure of the Indian stock market. Unlike typical snapshots that only show the current 50 stocks, this dataset is a survivorship-bias-free compilation that includes all stocks that have been part of the Nifty 50 index during this period. The data has been meticulously cleaned and adjusted for corporate actions, making it a robust feature set for financial analysis and quantitative modeling.

DATA SOURCE & FREQUENCY
Primary Source: All raw data is sourced from the official historical data reports published by Nifty Indices (niftyindices.com), ensuring the highest level of accuracy.
Data Frequency: The data is recorded on a monthly and event-driven basis. It includes end-of-month (EOM) weights but also captures intra-month data points for any date on which the Nifty 50 index was reshuffled or rebalanced. For periods between these data points, the weights can be considered static.

METHODOLOGY & DATA INTEGRITY
The dataset was constructed based on official Nifty 50 rebalancing announcements. It relies on the observed assumption that on most reshuffles, the weights of stocks that are not being reshuffled stay almost the same before and after the change. Significant effort was made to handle exceptions and complex corporate actions:
- Corporate Actions: Adjustments were systematically made for major events like mergers (HDFC/HDFCBANK), demergers (Reliance/JIOFIN, ITC/ITCHOTELS), and dual listings (TATAMOTORS/TATAMTRDVR).
- Rebalancing Extrapolation: In cases where EOM weights did not align with beginning-of-month (BOM) realities post-reshuffle, a logarithmic-linear extrapolation method was used to estimate the weights of incoming/outgoing stocks.
- 2013 Rebalancing Exception: For the second-half rebalancing of 2013, due to significant discrepancies, all 50 stocks' weights were recalculated using the extrapolation method instead of carrying over previous values.
- Weight Normalization: On any given date, the sum of all 50 constituent weights is normalized to equal 100%. The weights are provided with a precision of up to 5 decimal places, and the sum for all observations is validated to a strict tolerance of 1e-6.

TICKER & NAMING CONVENTIONS
For consistency across the time series, several historical stock tickers have been mapped to their modern or unified equivalents:
- INFOSYSTCH -> INFY
- HEROHONDA -> HEROMOTOCO
- BAJAJ-AUTO -> BAJAUTO
- SSTL -> VEDL
- REL -> RELINFRA
- ZOMATO -> ETERNAL

CONTENTS & FILE STRUCTURE
This dataset is distributed as a collection of files. The primary data is contained in weights.csv, with several supplementary files provided for context, validation, and analysis.
- weights.csv: The main data file. Layout: a standard CSV where the first row contains the headers, with DATE in the first column and stock tickers in the subsequent columns; each row corresponds to a specific date. Values: the cells contain the stock's weight (as a percentage) in the Nifty 50 index on a given date. A value of 0 indicates the stock was not an index constituent at that time.
- sectors.csv: A helper file that maps each stock ticker to its corresponding industry sector.
- summary.csv: A simple summary file containing the first and last observed dates for each stock, along with a count of its non-zero weight observations.
- validate.py: A Python script to check weights.csv for data integrity issues (e.g., ensuring daily weights sum to 100).
- validation_report.txt: The output report generated by validate.py, showing the results of the latest data validation checks.
- analysis.ipynb: A Jupyter Notebook providing sample analyses that can be performed using this dataset, such as visualizing sector rotation and calculating the HHI score over time.
- README.md: This file, containing the complete documentation for the dataset.
- CHANGELOG.md: A file for tracking all updates and changes made to the dataset over time.
- LICENSE.txt: The full legal text of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, which is applicable to this dataset.

POTENTIAL USE CASES
- Analyzing historical sector rotation and weight concentration in the Indian market.
- Building features for quantitative models that aim to predict market movements.
- Backtesting investment strategies benchmarked against the Nifty 50.

ACKNOWLEDGEMENTS & CITATION
This dataset was created by Sukrit Bera. A permanent, versioned archive of this dataset is available on Figshare. If you use this dataset in your research, please use the following official citation, which includes the permanent DOI:
Bera, S. (2025). Historical Nifty 50 Constituent Weights (Rolling 20-Year Window) [Data set]. figshare. https://doi.org/10.6084/m9.figshare.30217915

LICENSING
This dataset is made available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. The license selected in the metadata dropdown (CC BY 4.0) is the closest available option on this platform. The full terms of the applicable CC BY-NC-SA 4.0 license are available HERE, as well as in the uploaded LICENSE.txt file in the dataset. The CC BY-NC-SA 4.0 license DOES NOT permit commercial use. This dataset is FREE for academic and non-commercial research with attribution. If you wish to use this dataset for commercial purposes, please contact Sukrit Bera at sukritb2005@gmail.com to negotiate a separate, commercial license.

DATA DICTIONARY
- Column Name: DATE. Data Type: Date. Description: The date of the weight recording. This is the first column.
- Column Name: [Stock Ticker]. Data Type: float. Description: The percentage weight of the stock (e.g., 'RELIANCE', 'TCS') in the Nifty 50 index. A value of 0 indicates it was not an index constituent on that date.
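A minimal sketch of the weight-sum check described above, assuming the weights.csv layout (a DATE column followed by one column per ticker):

```python
import pandas as pd

weights = pd.read_csv("weights.csv", parse_dates=["DATE"])
tickers = weights.drop(columns=["DATE"])

# On every observation date the constituent weights should sum to 100%
# within the stated tolerance of 1e-6.
row_sums = tickers.sum(axis=1)
bad_dates = weights.loc[(row_sums - 100).abs() > 1e-6, "DATE"]
print("Dates failing the 100% check:", bad_dates.tolist())

# A value of 0 means the stock was not a constituent; count constituents per date.
print((tickers > 0).sum(axis=1).describe())
```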
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nowadays, new branches of research propose the use of non-traditional data sources for the study of migration trends, in order to find original methodologies for answering open questions about cross-border human mobility. The Multi-aspect Integrated Migration Indicators (MIMI) dataset is a new dataset to be exploited in migration studies as a concrete example of this approach. It includes official data about bidirectional human migration (traditional flow and stock data) together with multidisciplinary variables and original indicators, including economic, demographic, cultural and geographic indicators, as well as the Facebook Social Connectedness Index (SCI). It was built by gathering, embedding and integrating traditional and novel variables, resulting in a new multidisciplinary dataset that could significantly contribute to nowcasting/forecasting bilateral migration trends and to understanding migration drivers.
Thanks to this variety of knowledge, experts from several research fields (demographers, sociologists, economists) could exploit MIMI to investigate the trends in the various indicators, and the relationship among them. Moreover, it could be possible to develop complex models based on these data, able to assess human migration by evaluating related interdisciplinary drivers, as well as models able to nowcast and predict traditional migration indicators in accordance with original variables, such as the strength of social connectivity. Here, the SCI could have an important role. It measures the relative probability that two individuals across two countries are friends with each other on Facebook, therefore it could be employed as a proxy of social connections across borders, to be studied as a possible driver of migration.
All in all, the motivation for building and releasing the MIMI dataset lies in the need for new perspectives, methods and analyses that take into account a variety of new factors. The heterogeneous and multidimensional sets of data in MIMI offer an all-encompassing overview of the characteristics of human migration, enabling a better understanding, and an original exploration, of the relationship between migration and non-traditional sources of data.
The MIMI dataset is a single CSV file with 28,821 rows (records/entries) and 876 columns (variables/features/indicators). Each row is uniquely identified by a pair of countries, formed by joining the two ISO-3166 alpha-2 codes of the origin and destination country, respectively. Its main features are the country-to-country bilateral migration flows and stocks, accompanied by multidisciplinary variables measuring cultural, demographic, geographic and economic characteristics of the two countries, as well as the Facebook strength of connectedness for each pair.
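As a small sketch of working with the pair identifier (the CSV file name and the assumption that the pair code is the first column are illustrative, not taken from the release):

```python
import pandas as pd

mimi = pd.read_csv("mimi.csv")  # placeholder name for the single-CSV release

# Each record is keyed by a country pair built by joining two ISO-3166 alpha-2
# codes (origin followed by destination); split the key back into its parts.
pair_col = mimi.columns[0]  # assuming the pair identifier is the first column
mimi["origin"] = mimi[pair_col].str[:2]
mimi["destination"] = mimi[pair_col].str[2:4]

print(mimi.shape)  # expected: 28,821 rows and 876 original columns
```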
Related paper: Goglia, D., Pollacci, L., Sirbu, A. (2022). Dataset of Multi-aspect Integrated Migration Indicators. https://doi.org/10.5281/zenodo.6500885
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Global Industry Classification Standard (GICS) is an industry taxonomy developed in 1999 by MSCI and Standard & Poor's (S&P) for use by the global financial community. The GICS structure consists of
The system is similar to ICB (Industry Classification Benchmark), a classification structure maintained by FTSE Group.
GICS is used as a basis for S&P and MSCI financial market indexes in which each company is assigned to a sub-industry, and to an industry, industry group, and sector, by its principal business activity.
"GICS" is a registered trademark of McGraw Hill Financial and MSCI Inc.
The GICS schema follows this hierarchy:
- Sector
- Industry Group
- Industry
- Sub-industry
That is, a sector is composed of industry groups, which are composed of industries, which in turn are composed of sub-industries.
Each item in the hierarchy has an ID. Each ID is prefixed by the ID of its parent in the hierarchy, and IDs generally increase in steps of 5 or 10. For example, the sector Industrials has the ID 20, and the industry group Capital Goods has an ID prefixed by that 20, resulting in 2010.
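Since each ID embeds its parent's ID as a prefix (2 digits for the sector, 4 for the industry group, 6 for the industry, 8 for the sub-industry), the full hierarchy can be recovered from a sub-industry code alone; a small sketch:

```python
# Recover every level of the GICS hierarchy from an 8-digit sub-industry code.
def gics_levels(sub_industry_id: str) -> dict:
    return {
        "sector": sub_industry_id[:2],
        "industry_group": sub_industry_id[:4],
        "industry": sub_industry_id[:6],
        "sub_industry": sub_industry_id[:8],
    }

# Example: a sub-industry under Capital Goods (2010), within Industrials (20).
print(gics_levels("20101010"))
# {'sector': '20', 'industry_group': '2010', 'industry': '201010', 'sub_industry': '20101010'}
```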
The dataset is composed of CSV files (currently two), each representing a different version of the GICS classification.
For each file the columns are:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Actually, I prepared this dataset for students on my Deep Learning and NLP course.
But I am also very happy to see kagglers play around with it.
Have fun!
Description:
There are two channels of data provided in this dataset:
News data: I crawled historical news headlines from Reddit WorldNews Channel (/r/worldnews). They are ranked by reddit users' votes, and only the top 25 headlines are considered for a single date. (Range: 2008-06-08 to 2016-07-01)
Stock data: Dow Jones Industrial Average (DJIA) is used to "prove the concept". (Range: 2008-08-08 to 2016-07-01)
I provided three data files in .csv format:
RedditNews.csv: two columns. The first column is the "date", and the second column is the "news headlines". All news items are ranked from top to bottom based on how hot they are. Hence, there are 25 lines for each date.
DJIA_table.csv: Downloaded directly from Yahoo Finance: check out the web page for more info.
Combined_News_DJIA.csv: To make things easier for my students, I provide this combined dataset with 27 columns. The first column is "Date", the second is "Label", and the following ones are news headlines ranging from "Top1" to "Top25".
=========================================
To my students:
I made this a binary classification task. Hence, there are only two labels:
"1" when DJIA Adj Close value rose or stayed as the same;
"0" when DJIA Adj Close value decreased.
For task evaluation, please use data from 2008-08-08 to 2014-12-31 as Training Set, and Test Set is then the following two years data (from 2015-01-02 to 2016-07-01). This is roughly a 80%/20% split.
And, of course, use AUC as the evaluation metric.
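A minimal baseline sketch of the suggested split and AUC evaluation (the bag-of-words model here is just an illustration, not a recommended solution):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("Combined_News_DJIA.csv", parse_dates=["Date"])

# Concatenate the Top1..Top25 headlines into one text field per day.
text = df[[f"Top{i}" for i in range(1, 26)]].astype(str).agg(" ".join, axis=1)

# Suggested split: train through 2014-12-31, test from 2015-01-02 onward.
train = df["Date"] <= "2014-12-31"

vec = TfidfVectorizer(min_df=5)
X_train, X_test = vec.fit_transform(text[train]), vec.transform(text[~train])

clf = LogisticRegression(max_iter=1000).fit(X_train, df.loc[train, "Label"])
auc = roc_auc_score(df.loc[~train, "Label"], clf.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```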
=========================================
+++++++++++++++++++++++++++++++++++++++++
To all kagglers:
Please upvote this dataset if you like this idea for market prediction.
If you think you coded an amazing trading algorithm, a friendly piece of advice:
do play safe with your own money :)
+++++++++++++++++++++++++++++++++++++++++
Feel free to contact me if there is any question~
And, remember me when you become a millionaire :P
Note: If you'd like to cite this dataset in your publications, please use:
Sun, J. (2016, August). Daily News for Stock Market Prediction, Version 1. Retrieved [Date You Retrieved This Data] from https://www.kaggle.com/aaron7sun/stocknews.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Introduction
In the course of researching the common ownership hypothesis, we found a number of issues with the Thomson Reuters (TR) "S34" dataset used by many researchers and frequently accessed via Wharton Research Data Services (WRDS). WRDS has done extensive work to improve the database, working with other researchers who have uncovered problems, specifically fixing a lack of records of BlackRock holdings. However, even with the updated dataset posted in the summer of 2018, we discovered a number of discrepancies when accessing data for constituent firms of the S&P 500 Index. We therefore set out to separately create a dataset of 13(f) holdings from the source documents, which are all public and available electronically from the Securities and Exchange Commission (SEC) website. Coverage is good starting in 1999, when electronic filing became mandatory. However, the SEC's Inspector General issued a critical report in 2010 about the information contained in 13(f) filings.

The process
We gathered all 13(f) filings from 1999-2017 here. The corpus is over 318,000 filings and occupies ~25GB of space if unzipped. (We do not include the raw filings here as they can be downloaded from EDGAR.) We wrote code to parse the filings and extract holding information using regular expressions in Perl. Our target list of holdings was all public firms with a market capitalization of at least $10M. From the header of each file, we first extract the filing date, reporting date, and reporting entity (Central Index Key, or CIK, and CIKNAME). Beginning with the September 30, 2013 filing date, all filings were in XML format, which made parsing fairly straightforward, as all values are contained in tags. Prior to that date, the filings are remarkable for the heterogeneity in formatting; several examples are linked to below. Our approach was to look for any lines containing a CUSIP code that we were interested in, and then attempt to determine the "number of shares" field and the "value" field. To help validate the values we extracted, we downloaded stock price data from CRSP for the filing date, as that allows for a logic check of (price * shares) = value. We do not claim that this will exhaustively extract all holding information; we can provide examples of filings that are formatted in such a way that we are not able to extract the relevant information. In both XML and non-XML filings, we attempt to remove any derivative holdings by looking for phrases such as OPT, CALL, PUT, WARR, etc. We then perform some final data cleaning: in the case of amended filings, we keep the amended level of holdings if the amended report a) occurred within 90 days of the reporting date and b) the initial filing fails our logic check described above. The resulting dataset has around 48M reported holdings (CIK-CUSIP) for all 76 quarters, with between 4,000 and 7,000 CUSIPs and between 1,000 and 4,000 investors per quarter. We do not claim that our dataset is perfect; there are undoubtedly errors. As documented elsewhere, there are often errors in the actual source documents as well. However, our method seemed to produce more reliable data in several cases than the TR dataset, as shown in Online Appendix B of the related paper linked above.

Included Files
- Perl parsing code (find_holdings_snp.pl). For reference, only needed if you wish to re-parse the original filings.
- Investor holdings for 1999-2017: lightly cleaned. Each CIK-CUSIP-rdate is unique. Over 47M records. The fields are:
  - CIK: the central index key assigned by the SEC for this investor. Mapping to names is available below.
  - CUSIP: the identity of the holdings. Consult the SEC's 13(f) listings to identify your CUSIPs of interest.
  - shares: the number of shares reportedly held. Merging in CRSP data on shares outstanding at the CUSIP-month level allows one to construct \beta. We make no distinction for the sole/shared/none voting discretion fields; if a researcher is interested, we did collect that starting in mid-2013, when filings are in XML format.
  - rdate: reporting date (end of quarter). 8 digits, YYYYMMDD.
  - fdate: filing date. 8 digits, YYYYMMDD.
  - ftype: the form name.
  Notes: we did not consolidate separate BlackRock entities (or any other possibly related entities). If one wants to do so, use the CIK-CIKname mapping file below. We drop any CUSIP-rdate observation where any investor in that CUSIP reports owning greater than 50% of shares outstanding (even though legitimate cases exist - see, for example, Diamond Offshore and Loews Corporation). We also drop any CUSIP-rdate observation where greater than 120% of shares outstanding are reported to be held by 13(f) investors. Cases where the shares held are listed as zero likely mean the investor filing lists a holding for the firm but our code could not find the number of shares due to the formatting of the file. We leave these in the data so that any researchers who find a zero know to go back to that source filing to manually gather the holdings for the securities they are interested in.
- Processed 13(f) holdings (airlines.parquet, cereal.parquet, out_scrape.parquet). These are used in our related AEJ:Microeconomics paper. The files contain all firms within the airline industry, the RTE cereal industry, and all large-cap firms (a superset of the S&P 500), respectively. They are a merged version of the scrape_parsed.csv file described above and include the shares outstanding and percent ownership used to calculate measures of common ownership. They are distributed as brotli-compressed Apache Parquet (binary) files, which preserves date information correctly. The fields are:
  - mgrno: manager number (which is actually CIK in the scraped data)
  - rdate: reporting date
  - ncusip: CUSIP
  - rrdate: reporting date in Stata format
  - mgrname: manager name
  - shares: shares
  - sole: shares with sole authority
  - shared: shares with shared authority
  - none: shares with no authority
  - isbr/isfi/iss/isba/isvg: whether the investor is BlackRock, State Street, Vanguard, Barclays, or Fidelity
  - numowners: how many owners
  - prc: price at reporting date
  - shares_out: shares outstanding at reporting date
  - value: reported value in the 13(f)
  - beta: shares/shares_out
  - permno: PERMNO
- Profit weight values (i.e. \kappa) for all firms in the sample (public_scrape_kappas_XXXX.parquet). Each file represents one year of data, is around 200MB, and is distributed as a compressed (brotli) parquet file. Fields are simply CUSIP_FROM, CUSIP_TO, KAPPA, QUARTER. Note that these have not been adjusted for multi-class share firms, insider holdings, etc. If looking at a particular market, some additional data cleaning on the investor holdings (above) followed by recomputing profit weights is recommended. For this, we did merge the separate BlackRock entities prior to computing \kappa.
- CIKmap.csv (~250K observations). Mapping is from CIK-rdate to CIKname. Use this if you want to consolidate holdings across reporting entities or explore the identities of reporting firms. In the case of amended filings that use different names than the original ones, we keep the earliest name.

Example of Parsing Challenges
Prior to the XML era, filings were far from uniform, which creates a notable challenge in parsing them for holdings. In the examples directory we include several example text files of raw 13(f) filings.
- Example 1 is a "well behaved" filing, with CUSIP, followed by value, followed by number of shares, as recommended by the SEC.
- Example 2 shows a case where the ordering is changed: CUSIP, then shares, then value. The column headers show "item 5" coming before "item 4".
- Example 3 shows a fixed-width table, which in principle could be parsed very easily using the tags at the top, although not all filings consistently use these tags.
- Example 4 shows a fixed-width table with no tag for the CUSIP column. Also, notice that if the filer holds more than 10M shares of a firm, that number occupies the entire width of the column and there is no longer a column separator (i.e. Cisco Systems on line 374).
- Example 5 shows a comma-separated table format.
- Example 6 shows a case where the column ordering is changed, and an (unrequired) column for share price is added.
- Example 7 shows a case where the table is split across subsequent pages, so the CUSIP appears on a different line than the number of shares.
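For orientation, a sketch of the kind of checks described above using the processed parquet files (the field names follow the list above; the 10% tolerance is an arbitrary illustration, and note that 13(f) values are typically reported in thousands of dollars, which may require rescaling):

```python
import pandas as pd

# Processed holdings (brotli-compressed parquet; requires pyarrow or fastparquet).
holdings = pd.read_parquet("out_scrape.parquet")

# Logic check used during parsing: reported value should be close to price * shares.
# (Reported values may be in thousands of dollars; rescale if needed.)
implied = holdings["prc"] * holdings["shares"]
holdings["value_ok"] = (implied - holdings["value"]).abs() <= 0.10 * implied.abs()

# Ownership share per investor-security-quarter (beta = shares / shares outstanding).
holdings["beta_check"] = holdings["shares"] / holdings["shares_out"]

print(holdings[["mgrno", "ncusip", "rdate", "beta", "beta_check", "value_ok"]].head())
```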
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Stock market prediction remains an active research area, in a quest to inform investors on how to trade (buy/sell) at the most opportune time. The prevalent methods used by stock market players when trying to predict likely future trade prices are technical, fundamental or time series analysis. This research set out to try machine learning methods, in contrast to the existing prevalent methods. Artificial neural networks (ANNs) tend to be the preferred machine learning method for this type of application. However, ANNs require historical data to learn from in order to make predictions. The research used an ANN model to test the hypothesis that the next-day price (prediction) can be determined from the stock prices of the immediately preceding five days.
The final ANN model used for the tests was a feedforward multi-layer perceptron (MLP) with error backpropagation, using the sigmoid activation function, with network configuration 5:21:21:1. The data period was a 5-year dataset (2008 to 2012), with 80% of the data (4 years) used for training and the remaining 20% (the last year) used for testing.
The original raw data for the Nairobi Securities Exchange (NSE) was scraped from a publicly available and accessible website of a stock market analysis company in Kenya (Synergy, 2020). This data was first exported to a spreadsheet, then cleaned of headers and other redundant information, leaving only the data with stock name, date of trade and related fields such as volumes, low prices, high prices and adjusted prices. The data was further sorted by stock name and then by trading date. The data dimension was finally reduced to only what was needed for the research: the stock name, the date of trade and the adjusted price (average trade price). This final dataset is presented here in CSV format.
The research tested three NSE stocks, with the mean absolute percentage error (MAPE) ranging between 0.77% and 1.91% over the 3-month testing period, while the root mean squared error (RMSE) ranged between 1.83 and 3.07.
This raw data can be used to train and test any machine learning model that requires training and testing data. The data can also be used to validate and reproduce the results already presented in this research. There could be slight variance between what is obtained when reproducing the results, due to the differences in the final exact weights that the trained ANN model eventually achieves. However, these differences should not be significant.
List of data files in this dataset:
- stock01_NSE_01jan2008_to_31dec2012_Kakuzi.csv
- stock02_NSE_01jan2008_to_31dec2012_StandardBank.csv
- stock03_NSE_01jan2008_to_31dec2012_KenyaAirways.csv
- stock04_NSE_01jan2008_to_31dec2012_BamburiCement.csv
- stock05_NSE_01jan2008_to_31dec2012_Kengen.csv
- stock06_NSE_01jan2008_to_31dec2012_BAT.csv
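Using the files listed above, a comparable network can be sketched with scikit-learn; this is an illustration of the 5-input, two-hidden-layer design, not the exact implementation used in the research, and the column position of the adjusted price is an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor

prices = pd.read_csv("stock01_NSE_01jan2008_to_31dec2012_Kakuzi.csv")
adj = prices.iloc[:, -1].to_numpy(dtype=float)  # assuming adjusted price is the last column

# Build (last 5 days -> next day) samples, matching the 5-input design.
X = np.array([adj[i:i + 5] for i in range(len(adj) - 5)])
y = adj[5:]

# Chronological 80%/20% split, as in the study (4 years train, 1 year test).
split = int(0.8 * len(X))
model = MLPRegressor(hidden_layer_sizes=(21, 21), activation="logistic", max_iter=2000)
model.fit(X[:split], y[:split])

pred = model.predict(X[split:])
mape = np.mean(np.abs((y[split:] - pred) / y[split:])) * 100
print(f"MAPE: {mape:.2f}%")
```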
References: Synergy Systems Ltd. (2020). MyStocks. Retrieved March 9, 2020, from http://live.mystocks.co.ke/
Maseyk et al_BiodivConserv_Data&RScripts:
1. R code DataPrep (R script for data compilation and file preparation);
2. R code LMM and graphs (R script for Linear Mixed Models and plotting);
3. Masterfile.csv (raw data file);
4. Abandoned.csv, Mowed.csv and Grazed.csv (input data by management type);
5. Count.csv, Cover.csv, Evar.csv, InvSimpson.csv (input data by metric).
Final Data and R code.zip
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
List of companies in the S&P 500 (Standard and Poor's 500). The S&P 500 is a free-float, capitalization-weighted index of the top 500 publicly listed stocks in the US (top 500 by market cap). The ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: This repository/dataset provides a suite of Python scripts to generate a simulated relational database for inventory management processes and transform this data into object-centric event logs (OCEL) suitable for advanced process mining analysis. The primary goal is to offer a synthetic yet realistic dataset that facilitates research, development, and application of object-centric process mining techniques in the domain of inventory control and supply chain management. The generated event logs capture common inventory operations, track stock level changes, and are enriched with key inventory management parameters (like EOQ, Safety Stock, Reorder Point) and status-based activity labels (e.g., indicating understock or overstock situations).
Overview: Inventory management is a critical business process characterized by the interaction of various entities such as materials, purchase orders, sales orders, plants, suppliers, and customers. Traditional process mining often struggles to capture these complex interactions. Object-Centric Process Mining (OCPM) offers a more suitable paradigm. This project provides the tools to create and explore such data.
The workflow involves simulating a relational database of inventory management processes and then transforming it into object-centric event logs with the pm4py library.

Contents:
The repository contains the following Python scripts:
- 01_generate_simulation.py: Generates the simulated relational database inventory_management.db, including the tables Materials, SalesOrderDocuments, SalesOrderItems, PurchaseOrderDocuments, PurchaseOrderItems, PurchaseRequisitions, GoodsReceiptsAndIssues, MaterialStocks, MaterialDocuments, SalesDocumentFlows, and OrderSuggestions.
- 02_database_to_ocel_csv.py: Reads inventory_management.db and builds the object-centric event log ocel_inventory_management.csv. Object types include MAT (Material), PLA (Plant), PO_ITEM (Purchase Order Item), SO_ITEM (Sales Order Item), CUSTOMER, and SUPPLIER, with the standard OCEL columns (ocel:activity, ocel:timestamp, ocel:type:...).
- 03_ocel_csv_to_ocel.py: Reads ocel_inventory_management.csv and uses pm4py to convert the CSV event log into the standard OCEL XML format (ocel_inventory_management.xml).
- 04_postprocess_activities.py: Uses inventory_management.db to calculate inventory parameters such as Economic Order Quantity (EOQ), Safety Stock, and Reorder Point, and enriches ocel_inventory_management.csv with them. Stock status information is appended to each ocel:activity label (e.g., "Goods Issue (Understock)"), and a MAT_PLA (Material-Plant combination) object type is added for easier status tracking. The result is saved as post_ocel_inventory_management.csv.
- 05_ocel_csv_to_ocel.py: Reads post_ocel_inventory_management.csv and uses pm4py to convert this enriched CSV event log into the standard OCEL XML format (post_ocel_inventory_management.xml).

Generated Dataset Files (if included, or can be generated using the scripts):
- inventory_management.db: The SQLite database containing the simulated raw data.
- ocel_inventory_management.csv: The initial OCEL in CSV format.
- ocel_inventory_management.xml: The initial OCEL in standard OCEL XML format.
- post_ocel_inventory_management.csv: The post-processed and enriched OCEL in CSV format.
- post_ocel_inventory_management.xml: The post-processed and enriched OCEL in standard OCEL XML format.

How to Use:
Required Python packages: sqlite3 (standard library), pandas, numpy, pm4py. Run the scripts in order:
1. python 01_generate_simulation.py (generates inventory_management.db)
2. python 02_database_to_ocel_csv.py (generates ocel_inventory_management.csv from the database)
3. python 03_ocel_csv_to_ocel.py (generates ocel_inventory_management.xml)
4. python 04_postprocess_activities.py (generates post_ocel_inventory_management.csv using the database and the initial CSV OCEL)
5. python 05_ocel_csv_to_ocel.py (generates post_ocel_inventory_management.xml)

Potential Applications and Research: This dataset and the accompanying scripts can be used for:
Keywords: Object-Centric Event Log, OCEL, Process Mining, Inventory Management, Supply Chain, Simulation, Synthetic Data, SQLite, Python, pandas, pm4py, Economic Order Quantity (EOQ), Safety Stock (SS), Reorder Point (ROP), Stock Status Analysis.
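A quick-start sketch for inspecting the generated logs (the CSV inspection relies only on pandas; pm4py's OCEL reader may expect the .xmlocel extension depending on the installed version):

```python
import pandas as pd
import pm4py

# Flat CSV event log produced by 02_database_to_ocel_csv.py.
log = pd.read_csv("ocel_inventory_management.csv")
print(log["ocel:activity"].value_counts().head(10))

# Standard OCEL XML produced by 03_ocel_csv_to_ocel.py; depending on the pm4py
# version, the file may need to carry the .xmlocel extension instead of .xml.
ocel = pm4py.read_ocel("ocel_inventory_management.xml")
print(ocel.objects["ocel:type"].value_counts())  # objects per type (MAT, PLA, ...)
```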
https://creativecommons.org/publicdomain/zero/1.0/
I got all these .csv files using pandas-datareader, but getting every single KOSPI series through pandas-datareader is annoying, so I decided to share these files.
kospi.csv contains the average KOSPI price; you can use it to check whether a given day was a Korean market holiday. Each xxxxxx.csv contains the price records for a single stock, where xxxxxx is its unique ticker.
The columns and their formats are:
- Date: \d{4}-\d{2}-\d{2}
- Open: \d{1,}\.\d{1}
- High: \d{1,}\.\d{1}
- Low: \d{1,}\.\d{1}
- Close: \d{1,}\.\d{1}
- Adj Close: \d{1,}\.\d{1}
- Volume: \d+
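A small sketch of reading these files with pandas (005930, Samsung Electronics, is used purely as an example ticker):

```python
import pandas as pd

# Market-wide file: handy for checking whether the Korean market traded on a given day.
kospi = pd.read_csv("kospi.csv", parse_dates=["Date"])

# Per-stock file, named after its ticker (005930 is just an example).
stock = pd.read_csv("005930.csv", parse_dates=["Date"])
print(stock[["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]].tail())
```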
There is a blog post describing how I got these data; you might need it to update the CSV files.
git repository
Good luck.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The folder "code and data" contains the code for data processing and empirical results. It includes two folders, data is used to store data, and model is used to store running python and R code.
1.Data Description: 1.1.The folder "TENET network data at each time point" stores the adjacency matrix and other data of each time node in the TENET network. It is called in "Network topology analysis.R". 1.2.Ping An Bank Investor Sentiment (Bayesian Machine Learning).csv is Ping An Bank's investor sentiment data based on machine learning methods 1.3.Ping An Bank Investor Sentiment (Financial Dictionary).csv is Ping An Bank's investor sentiment data based on Financial Dictionary methods 1.4.Ping An Bank Investor Sentiment (Pre-trained Deep Learning (ERNIE)).csv is Ping An Bank's investor sentiment data based on ERNIE model. 1.5aligned_sentiment_indices.csv stores variables related to market sentiment, among which ISI, CICSI and Confidence index are derived from the CSMAR database, and BI is the investor sentiment index calculated by ERNIE based on Baidu AI platform. 1.6 The IIC.csv file contains data on tail risk spillovers within the financial sector. 1.7 The DS.csv file contains data on tail risk spillovers between any financial sector of a financial institution and any other financial sector. 1.8 The BIC.csv file contains data on how much risk each sector spillsover to others. 1.9 The BIC_receive.csv contains data on how much risk each sector receives from others. 1.10 The three files HHI.csv, NAS.csv, and AS.csv store network topology indicator data. 1.11 The code number.xlsx store the stock codes and abbreviations of all financial institutions. 1.12 The Stock Market Value.csv is the market value data of financial institutions, which is used to identify Systemically Important Financial Institutions (Härdle et al. (2016)).
2.Figure: 2.1Figure 1 can be obtained through the ''Sentiment Comparison of Three Approaches for Individual Financial Institutions.py''. 2.2Figure 2 can be obtained via ''Comparison of Market sentiment.py''. 2.3Figures 3 can be obtained through ''Change in average λ for systematic risk (compare to inclusion of sentiment variables).py''. 2.4Figure 4 requires you to choose to run ''Comparison of elemental standardisation treatments for TENET.py''. 2.5Figure 5 requires you to choose to run ''Comparison of average λ and spillover intensity.py''. 2.6Figure 6-11 are obtained by running ''Network topology analysis.R''.The same procedure is also run for Tables 5 and 6 concerning the rankings of risk emitters and receivers. 2.7Figure 12 is obtained by running ''Evolution of Cross-Sector Tail Risk Spillovers and Spill-Ins.py''. 2.8Figure 13 is obtained by running ''Tail risk spillovers between any financial sector of a financial institution and any other financial sector.py''. 2.9Figure 14 is obtained by running ''Tail Risk Spillovers within the Financial Sector.py''.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data consists of transaction data for 10 equities from the Johannesburg Stock Exchange, covering five trading days from 2019-06-24 to 2019-06-28. The data has been processed to contain only transactions. Furthermore, transactions with the same time stamp have been aggregated using a volume-weighted average, so that there is only one trade per time stamp. Missing data is indicated with NaNs.
The 10 equities included are: FirstRand Limited (FSR), Shoprite Holdings Ltd (SHP), Absa Group Ltd (ABG), Nedbank Group Ltd (NED), Standard Bank Group Ltd (SBK), Sasol Ltd (SOL), Mondi Plc (MNP), Anglo American Plc (AGL), Naspers Ltd (NPN) and British American Tobacco Plc (BTI).
The data structure in each CSV file is 10 columns containing the trading information for the assets traded, with transactions in chronological order. The three files have exactly the same structure, each containing information for the transaction tuple: price, time and volume.
The data should only be used to aid the reproducibility of the paper "The Epps effect under alternative sampling schemes". The steps to reproduce the results can be found on our GitHub site: https://github.com/CHNPAT005/PCRBTG-VT. The research focuses on investigating the Epps effect under different definitions of time. The work is funded by the South African Statistical Association. The original data was sourced from Bloomberg Pro. The code for the research is written in Julia Pro.
List of companies listed on the NYSE and other exchanges.
Data and documentation are available on NASDAQ's official webpage. Data is updated regularly on the FTP site.
The file used in this repository: ...
This dataset provides allometrically-estimated carbon stocks of 9,947,310,221 tree crowns derived from 50-cm resolution satellite images within the 0 to 1000 mm/year precipitation zone of Africa north of the equator and south of the Sahara Desert. These data are presented in GeoPackage (.gpkg) format and are summarized in Cloud-Optimized GeoTIFF (COG) format. An interactive viewer application developed to display these carbon estimates at the individual tree level across the study area is available at: https://trees.pgc.umn.edu/app. The analysis utilized 326,523 Maxar multispectral satellite images collected between 2002 to 2021 for the early dry season months of November to March to identify tree crowns. Metadata from satellite image processing across the study area are presented in Shapefile (.shp) format. Additionally, field measurements from destructive harvests used to derive allometry equations are contained in comma-separated values (*.csv) files. These data demonstrate a new tool for studying discrete semi-arid carbon stocks at the tree level with immediate applications provided by the viewer application. Uncertainty of carbon estimates are +/- 19.8%.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets used in the paper "Production scheduling with stock- and staff-related restrictions". The folder "Instances/" stores all instance files for both low- and high-demand instances as per the description in the associated paper. The folder "Solutions/" stores 10 solution files per instance obtained by means of a special-purpose Late Acceptance Hill Climbing Metaheuristic. Meanwhile, the folder "Validator/" contains a ".jar" file which can be executed to validate solutions to the instances in this dataset. All folders also contain an associated "README.txt" file explaining how to use the files inside them.
The file "table_avgs.txt" is a CSV containing the complete average results per instance which were summarized in the corresponding paper. Meanwhile, the file "table_costs.txt" is a CSV with the cost of each solution in the "Solutions/" folder for each execution.
Instance names are formatted as T_D_R_B, where T is either the letter "L" or "H", standing for "low-" and "high-demand" instances, respectively; D is the number of days in the time horizon of the instance; R is the number of requests to be served within the time horizon (it is not necessarily true that all R can be served in a feasible solution); and B is the length, in minutes, of a block (a micro-period within a day in which production of one item type at full capacity may take place). Each day is formed by a series of micro-periods.
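A small parsing helper for this naming convention might look like the following sketch (the example name is illustrative only):

```python
def parse_instance_name(name: str) -> dict:
    """Parse an instance name of the form T_D_R_B, e.g. 'H_10_200_15' (illustrative)."""
    t, d, r, b = name.split("_")
    return {
        "demand": "high" if t == "H" else "low",
        "days": int(d),
        "requests": int(r),
        "block_minutes": int(b),
    }

print(parse_instance_name("L_10_200_15"))
```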
For further details concerning the instances, the interested reader is referred to the paper "Production scheduling with stock- and staff-related restrictions".
Detailed data for long- and short-term debt stocks and service payments. Data are available for major country groups, individual countries, and territories. The Creditor Reporting System (CRS) is an information system comprised of data on Official Development Assistance (ODA) and Official Aid (OA). The system has been in existence since 1967 and is sponsored jointly by the OECD and the World Bank, and operated by the OECD. A subset of the CRS consists of individual grant and loan commitments (between 6,000 and 30,000 transactions a year) submitted by DAC donors (22 members) and multilateral institutions on a regular basis. Reporters are asked to supply (in their national currency) detailed financial information on the commitment (to the developing country), such as terms of repayment (for loans), tying status and sector allocation. The secretariat converts the amounts of the projects into US dollars using annual average exchange rates. 11 data files (number of logical records varies; CSV (comma-separated) format); accompanying documentation (1 PDF file).
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains the historical stock price data for Amazon.com, Inc. (AMZN), one of the largest and most influential technology companies in the world. The data has been sourced directly from Yahoo Finance, a widely trusted provider of financial market data. It spans a significant time range, enabling users to analyze Amazon’s market performance over the years, observe long-term trends, and identify key events in the company’s history.
The dataset is structured as a CSV file, with each row representing a single trading day. The following columns are included:
This dataset is suitable for a wide range of financial, academic, and data science projects, such as:
Open Access. Plant and soil data from the last year of the biodiversity experiment.
Data from: Wen-feng Cong, Jasper van Ruijven, Liesje Mommer, Gerlinde De Deyn, Frank Berendse and Ellis Hoffland. (2014) Plant species richness promotes soil carbon and nitrogen stocks in grasslands without legumes. Data were collected in the 11-year grassland biodiversity experiment in Wageningen, the Netherlands, in 2010 and 2011. Abbreviated column headings are as follows: "BLK" = block; "PT" = plot; "SR" = plant species richness; "MI" = monoculture identity (Ac = Agrostis capillaris; Ao = Anthoxanthum odoratum; Cj = Centaurea jacea; Fr = Festuca rubra; Hl = Holcus lanatus; Lv = Leucanthemum vulgare; Pl = Plantago lanceolata; Ra = Rumex acetosa); "AAB" = average aboveground biomass from 2000 to 2010 (g m-2); "RB" = standing root biomass (g fresh weight m-2) up to 50 cm depth in June 2010; "CS" = soil carbon stocks (g C m-2) in April 2011; "NS" = soil nitrogen stocks (g N m-2) in April 2011; "CD" = soil organic carbon decomposition (mg CO2-C kg-1 soil) measured in soil collected in April 2011; "NM" = potential net N mineralization rate (µg N kg-1 soil day-1) measured in soil collected in April 2011.
data file.csv