Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains bitcoin transfer transactions extracted from the Bitcoin Mainnet blockchain.
Part 1 is available at https://zenodo.org/deposit/7157356
Part 3 is available at https://zenodo.org/deposit/7158133
Part 4 is available at https://zenodo.org/deposit/7158328
Details of the datasets are given below:
FILENAME FORMAT:
The filenames have the following format:
btc-tx-
where
For example, the file btc-tx-100000-149999-aa.bz2 (together with the rest of its parts, if any) contains transactions from block 100000 to block 149999 inclusive.
The files are compressed with bzip2. They can be uncompressed using the bunzip2 command.
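For programmatic access, here is a minimal Python sketch. It assumes the -aa, -ab, ... parts are pieces of a single bzip2 stream that should be concatenated before decompression; that splitting scheme is an assumption, since it is not documented here.
```
# Minimal sketch: read a multi-part bzip2-compressed transaction file in Python.
# Assumption: the -aa, -ab, ... parts are consecutive pieces of one bzip2 stream.
import bz2, glob

parts = sorted(glob.glob("btc-tx-100000-149999-*.bz2"))
raw = b"".join(open(p, "rb").read() for p in parts)
text = bz2.decompress(raw).decode()

for line in text.splitlines()[:5]:   # each line is one transaction record
    print(line)
```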
TRANSACTION FORMAT:
Each line in a file corresponds to a transaction. The transaction has the following format:
BLOCK TIME FORMAT:
The block time file has the following format:
IMPORTANT NOTE:
Public Bitcoin Mainnet blockchain data is open and can be obtained by connecting as a node on the blockchain or by using block explorer web sites such as https://btcscan.org. Downloaders and users of this dataset accept full responsibility for using the data in a manner compliant with the GDPR and any other applicable regulations. We provide the data as is and cannot be held responsible for its use.
NOTE:
If you use this dataset, please do not forget to add the DOI number to the citation.
If you use our dataset in your research, please also cite our paper: https://link.springer.com/chapter/10.1007/978-3-030-94590-9_14
@incollection{kilicc2022analyzing,
  title={Analyzing Large-Scale Blockchain Transaction Graphs for Fraudulent Activities},
  author={K{\i}l{\i}{\c{c}}, Baran and {\"O}zturan, Can and {\c{S}}en, Alper},
  booktitle={Big Data and Artificial Intelligence in Digital Finance},
  pages={253--267},
  year={2022},
  publisher={Springer, Cham}
}
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process; the U.S. employer must submit a petition for an H1B visa to the US immigration department. This is the most common visa status applied for by international students once they complete college or higher education and begin working in a full-time position. The project provides essential information on job titles, preferred regions of settlement, and trends among foreign applicants and employers for H1B visa applications. Because locations, employers, job titles, and salary ranges make up most of the H1B petitions, different visualization tools are used to analyze and interpret H1B visa trends and provide recommendations to applicants. This report is the basis of the project for the Visualization of Complex Data class at the George Washington University. Some examples in this project analyze the relevant variables (Case Status, Employer Name, SOC Name, Job Title, Prevailing Wage, Worksite, and Latitude and Longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) in order to see how the H1B visa has changed over the past several decades.
Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js
Dataset
The dataset contains 10 columns and covers a total of 3 million records spanning 2011-2016. The relevant columns include case status, employer name, SOC name, job title, full-time position, prevailing wage, year, worksite, and latitude and longitude information.
Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm
Running the code
Open Index.html
Data Processing (a minimal preprocessing sketch in pandas follows this entry)
- Perform data preprocessing to transform the raw data into an understandable format.
- Find and combine other external datasets to enrich the analysis, such as the FY2017 dataset.
- Develop the required variables and compile them into visualization programs to produce appropriate visualizations.
- Draw a geo map and scatter plot to compare the fastest growth in fixed value and in percentages.
- Extract aspects of the data and analyze changes in employers' preferences as well as forecasts of future trends.
Visualizations
- Combo chart: shows the overall volume of receipts and the approval rate.
- Scatter plot: shows the beneficiary country of birth.
- Geo map: shows H1B petitions filed across all states.
- Line chart: shows the top 10 states for H1B petitions filed.
- Pie chart: compares education level and occupations for petitions, FY2011 vs FY2017.
- Tree map: shows the top employers who submit the greatest number of applications.
- Side-by-side bar chart: overall comparison of Data Scientist and Data Analyst.
- Highlight table: mean wage of a Data Scientist and a Data Analyst with case status certified.
- Bubble chart: top 10 companies for Data Scientist and Data Analyst.
Related Research
- The H-1B Visa Debate, Explained - Harvard Business Review: https://hbr.org/2017/05/the-h-1b-visa-debate-explained
- Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
- Key facts about the U.S. H-1B visa program: http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
- H1B visa News and Updates from The Economic Times: https://economictimes.indiatimes.com/topic/H1B-visa/news
- H-1B visa - Wikipedia: https://en.wikipedia.org/wiki/H-1B_visa
Key Findings
- From the analysis, the government cut down the number of H1B approvals in 2017.
- In the past decade, owing to the demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from Asian countries such as China and India.
- Technical jobs such as Computer Systems Analyst and Software Developer make up the majority of the top 10 jobs among foreign workers.
- Employers located in metro areas strive to find foreign workers who can fill the technical positions in their organizations.
- States like California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are prime locations for foreign workers and provide many job opportunities.
- Top companies such as Infosys, Tata, and IBM India that submit the most H1B visa applications are companies based in India associated with software and IT services.
- The Data Scientist position has experienced exponential growth in H1B visa applications, with jobs clustered most heavily in the West region.
Visualization programs
HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau
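As an illustration of the data processing step above, the following pandas sketch loads the Kaggle CSV and computes a simple aggregation. The file name h1b_kaggle.csv and the exact column labels are assumptions based on the column list above; adjust them to match the actual download.
```
# Illustrative preprocessing sketch; file name and column labels are assumed.
import pandas as pd

df = pd.read_csv("h1b_kaggle.csv")

# Keep the relevant variables described above.
cols = ["CASE_STATUS", "EMPLOYER_NAME", "SOC_NAME", "JOB_TITLE",
        "PREVAILING_WAGE", "YEAR", "WORKSITE", "lon", "lat"]
df = df[cols].dropna(subset=["YEAR", "CASE_STATUS"])

# Example aggregation: top 10 employers by number of petitions per year.
top_employers = (df.groupby(["YEAR", "EMPLOYER_NAME"]).size()
                   .rename("petitions")
                   .sort_values(ascending=False)
                   .groupby(level="YEAR").head(10))
print(top_employers.head(20))
```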
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this project, we aimed to map the design space of visualisations embedded in right-to-left (RTL) scripts, expanding our knowledge of visualisation design beyond the dominance of research based on left-to-right (LTR) scripts. Through this project, we identify common design practices regarding the chart structure, the text, and the source. We also identify ambiguity, particularly regarding the axis position and direction, suggesting that the community may benefit from unified standards similar to those found in web design for RTL scripts. To achieve this goal, we curated a dataset covering 128 visualisations found in Arabic news media and coded these visualisations based on chart composition (e.g., chart type, x-axis direction, y-axis position, legend position, interaction, embellishment type), text (e.g., availability of text, availability of caption, annotation type), and source (source position, attribution to designer, ownership of the visualisation design). Links are also provided to the articles and the visualisations. This dataset is limited to stand-alone visualisations, whether single-panelled or including small multiples. We did not consider infographics in this project, nor any visualisation without an identifiable chart type (e.g., bar chart, line chart). The attached documents also include some graphs from our analysis of the dataset, illustrating common design patterns and their popularity within our sample.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Transparency in data visualization is an essential ingredient for scientific communication. The traditional approach of visualizing continuous quantitative data solely in the form of summary statistics (i.e., measures of central tendency and dispersion) has repeatedly been criticized for not revealing the underlying raw data distribution. Remarkably, however, systematic and easy-to-use solutions for raw data visualization using the most commonly reported statistical software package for data analysis, IBM SPSS Statistics, are missing. Here, a comprehensive collection of more than 100 SPSS syntax files and an SPSS dataset template is presented and made freely available that allow the creation of transparent graphs for one-sample designs, for one- and two-factorial between-subject designs, for selected one- and two-factorial within-subject designs as well as for selected two-factorial mixed designs and, with some creativity, even beyond (e.g., three-factorial mixed-designs). Depending on graph type (e.g., pure dot plot, box plot, and line plot), raw data can be displayed along with standard measures of central tendency (arithmetic mean and median) and dispersion (95% CI and SD). The free-to-use syntax can also be modified to match with individual needs. A variety of example applications of syntax are illustrated in a tutorial-like fashion along with fictitious datasets accompanying this contribution. The syntax collection is hoped to provide researchers, students, teachers, and others working with SPSS a valuable tool to move towards more transparency in data visualization.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a range of directed signed networks (signed digraphs) from the social domain. The data come from 9 different sources, and in total there are 29 network files. There are two temporal networks and one multilayer network in this dataset. Each network is provided in two formats: edgelist (.csv) and .gml.
This dataset is provided under a CC BY-NC-SA Creative Commons v4.0 license (Attribution-NonCommercial-ShareAlike). This means that other individuals may remix, tweak, and build upon these data non-commercially, as long as they provide citations to this data repository (https://doi.org/10.6084/m9.figshare.12152628) and the reference article listed below (https://doi.org/10.1038/s41598-020-71838-6), and license the new creations under the identical terms.
For more information about the data, one may refer to the article below:
Samin Aref, Ly Dinh, Rezvaneh Rezapour, and Jana Diesner. "Multilevel Structural Evaluation of Signed Directed Social Networks based on Balance Theory." Scientific Reports (2020). https://doi.org/10.1038/s41598-020-71838-6
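For readers who want to load one of the networks, the following is a minimal sketch using networkx; the file name and the edge attribute holding the sign are illustrative assumptions, not the actual names used in the repository.
```
# Minimal loading sketch; file name and "sign" attribute name are assumptions.
import networkx as nx

G = nx.read_gml("some_signed_network.gml")     # one of the .gml network files
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Signed networks typically store the sign as an edge attribute; check the file
# for the actual attribute name used.
signs = nx.get_edge_attributes(G, "sign")
print(list(signs.items())[:5])
```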
This dataset contains ether as well as popular ERC20 token transfer transactions extracted from the Ethereum Mainnet blockchain.
Only ether transfer, contract function call, and contract deployment transactions are present in the dataset. Miner reward transactions are not currently included.
Details of the datasets are given below:
FILENAME FORMAT:
The filenames have the following format:
eth-tx-
where
For example, the file eth-tx-1000000-1099999.txt.bz2 contains transactions from block 1000000 to block 1099999 inclusive.
The files are compressed with bzip2. They can be uncompressed using the bunzip2 command.
TRANSACTION FORMAT:
Each line in a file corresponds to a transaction. The transaction has the following format:
units. ERC20 token transfers (transfer and transferFrom function calls in the ERC20 contract) are indicated by the token symbol. For example, GUSD is the Gemini USD stablecoin. The JSON file erc20tokens.json, given below, contains the details of these ERC20 tokens.
decoder-error.txt FILE:
This file lists, one per line, the transactions (block no, tx no, tx hash) that produced an error while decoding calldata. These transactions are not present in the data files.
erc20tokens.json FILE:
This file contains the list of popular ERC20 token contracts whose transfer/transferFrom
transactions appear in the data files.
-------------------------------------------------------------------------------------------
[
{
"address": "0xdac17f958d2ee523a2206206994597c13d831ec7",
"decdigits": 6,
"symbol": "USDT",
"name": "Tether-USD"
},
{
"address": "0xB8c77482e45F1F44dE1745F52C74426C631bDD52",
"decdigits": 18,
"symbol": "BNB",
"name": "Binance"
},
{
"address": "0x2af5d2ad76741191d15dfe7bf6ac92d4bd912ca3",
"decdigits": 18,
"symbol": "LEO",
"name": "Bitfinex-LEO"
},
{
"address": "0x514910771af9ca656af840dff83e8264ecf986ca",
"decdigits": 18,
"symbol": "LNK",
"name": "Chainlink"
},
{
"address": "0x6f259637dcd74c767781e37bc6133cd6a68aa161",
"decdigits": 18,
"symbol": "HT",
"name": "HuobiToken"
},
{
"address": "0xf1290473e210b2108a85237fbcd7b6eb42cc654f",
"decdigits": 18,
"symbol": "HEDG",
"name": "HedgeTrade"
},
{
"address": "0x9f8f72aa9304c8b593d555f12ef6589cc3a579a2",
"decdigits": 18,
"symbol": "MKR",
"name": "Maker"
},
{
"address": "0xa0b73e1ff0b80914ab6fe0444e65848c4c34450b",
"decdigits": 8,
"symbol": "CRO",
"name": "Crypto.com"
},
{
"address": "0xd850942ef8811f2a866692a623011bde52a462c1",
"decdigits": 18,
"symbol": "VEN",
"name": "VeChain"
},
{
"address": "0x0d8775f648430679a709e98d2b0cb6250d2887ef",
"decdigits": 18,
"symbol": "BAT",
"name": "Basic-Attention"
},
{
"address": "0xc9859fccc876e6b4b3c749c5d29ea04f48acb74f",
"decdigits": 0,
"symbol": "INO",
"name": "INO-Coin"
},
{
"address": "0x8e870d67f660d95d5be530380d0ec0bd388289e1",
"decdigits": 18,
"symbol": "PAX",
"name": "Paxos-Standard"
},
{
"address": "0x17aa18a4b64a55abed7fa543f2ba4e91f2dce482",
"decdigits": 18,
"symbol": "INB",
"name": "Insight-Chain"
},
{
"address": "0xc011a72400e58ecd99ee497cf89e3775d4bd732f",
"decdigits": 18,
"symbol": "SNX",
"name": "Synthetix-Network"
},
{
"address": "0x1985365e9f78359a9B6AD760e32412f4a445E862",
"decdigits": 18,
"symbol": "REP",
"name": "Reputation"
},
{
"address": "0x653430560be843c4a3d143d0110e896c2ab8ac0d",
"decdigits": 16,
"symbol": "MOF",
"name": "Molecular-Future"
},
{
"address": "0x0000000000085d4780B73119b644AE5ecd22b376",
"decdigits": 18,
"symbol": "TUSD",
"name": "True-USD"
},
{
"address": "0xe41d2489571d322189246dafa5ebde1f4699f498",
"decdigits": 18,
"symbol": "ZRX",
"name": "ZRX"
},
{
"address": "0x8ce9137d39326ad0cd6491fb5cc0cba0e089b6a9",
"decdigits": 18,
"symbol": "SXP",
"name": "Swipe"
},
{
"address": "0x75231f58b43240c9718dd58b4967c5114342a86c",
"decdigits": 18,
"symbol": "OKB",
"name": "Okex"
},
{
"address": "0xa974c709cfb4566686553a20790685a47aceaa33",
"decdigits": 18,
"symbol": "XIN",
"name": "Mixin"
},
{
"address": "0xd26114cd6EE289AccF82350c8d8487fedB8A0C07",
"decdigits": 18,
"symbol": "OMG",
"name": "OmiseGO"
},
{
"address": "0x89d24a6b4ccb1b6faa2625fe562bdd9a23260359",
"decdigits": 18,
"symbol": "SAI",
"name": "Sai Stablecoin v1.0"
},
{
"address": "0x6c6ee5e31d828de241282b9606c8e98ea48526e2",
"decdigits": 18,
"symbol": "HOT",
"name": "HoloToken"
},
{
"address": "0x6b175474e89094c44da98b954eedeac495271d0f",
"decdigits": 18,
"symbol": "DAI",
"name": "Dai Stablecoin"
},
{
"address": "0xdb25f211ab05b1c97d595516f45794528a807ad8",
"decdigits": 2,
"symbol": "EURS",
"name": "Statis-EURS"
},
{
"address": "0xa66daa57432024023db65477ba87d4e7f5f95213",
"decdigits": 18,
"symbol": "HPT",
"name": "HuobiPoolToken"
},
{
"address": "0x4fabb145d64652a948d72533023f6e7a623c7c53",
"decdigits": 18,
"symbol": "BUSD",
"name": "Binance-USD"
},
{
"address": "0x056fd409e1d7a124bd7017459dfea2f387b6d5cd",
"decdigits": 2,
"symbol": "GUSD",
"name": "Gemini-USD"
},
{
"address": "0x2c537e5624e4af88a7ae4060c022609376c8d0eb",
"decdigits": 6,
"symbol": "TRYB",
"name": "BiLira"
},
{
"address": "0x4922a015c4407f87432b179bb209e125432e4a2a",
"decdigits": 6,
"symbol": "XAUT",
"name": "Tether-Gold"
},
{
"address": "0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48",
"decdigits": 6,
"symbol": "USDC",
"name": "USD-Coin"
},
{
"address": "0xa5b55e6448197db434b92a0595389562513336ff",
"decdigits": 16,
"symbol": "SUSD",
"name": "Santender"
},
{
"address": "0xffe8196bc259e8dedc544d935786aa4709ec3e64",
"decdigits": 18,
"symbol": "HDG",
"name": "HedgeTrade"
},
{
"address": "0x4a16baf414b8e637ed12019fad5dd705735db2e0",
"decdigits": 2,
"symbol": "QCAD",
"name": "QCAD"
}
]
-------------------------------------------------------------------------------------------
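As a usage illustration, the following Python sketch loads erc20tokens.json (shown above) and converts a raw integer transfer amount into whole-token units using the decdigits field; the example amount is made up.
```
# Minimal sketch: scale a raw ERC20 amount by the token's decimal digits.
import json

with open("erc20tokens.json") as f:
    tokens = {t["symbol"]: t for t in json.load(f)}

raw_amount = 2500000                 # hypothetical raw USDT transfer amount
usdt = tokens["USDT"]
value = raw_amount / 10 ** usdt["decdigits"]
print(f'{value} {usdt["symbol"]} ({usdt["name"]}, contract {usdt["address"]})')
```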
This is the set of graphs used in the PACE 2022 challenge for computing the Directed Feedback Vertex Set, from the Heuristic track. It consists of 200 labelled directed graphs. The graphs are mostly not symmetric (an edge from u->v does not imply an edge from v->u), although some are symmetric. The vertex labels are integers ranging from 1 to N.
There is also a related PACE 2022 Exact dataset, which was for exact computation; those graphs are generally smaller and sparser, as only exact solutions were accepted.
The data format begins with one line N E 0, where N is the number of vertices, E is the number of edges, and 0 is the literal integer zero. The N subsequent lines are each a space-separated list of integers, such as 2 5 11 19. If that appeared on line number 1 (the first after N E 0), it would indicate that there are edges from vertex 1 to each of the vertices 2, 5, 11, and 19. Some lines are blank, and these indicate vertices with outdegree zero. An example graph would be
```
4 4 0
2 3
3

1
```
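The following is a minimal Python parser sketch for this format, based solely on the description above; the file name in the commented call is illustrative.
```
# Minimal parser sketch for the "N E 0" adjacency-list format described above.
def read_digraph(path):
    with open(path) as f:
        n, e, _zero = map(int, f.readline().split())
        adj = {}
        for u in range(1, n + 1):
            line = f.readline()
            # Blank lines mean the vertex has outdegree zero.
            adj[u] = [int(v) for v in line.split()] if line.strip() else []
    assert sum(len(vs) for vs in adj.values()) == e
    return adj

# adj = read_digraph("heuristic_001.gr")   # file name is illustrative only
```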
The dataset can be downloaded here. The 100 instances that were available for public testing are precisely the odd-numbered ones in that link; the public instances can be downloaded on their own here.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A diverse selection of 1000 empirical time series, along with the results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.
The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat, for use in Matlab with v1.06 of hctsa.
The same data are also provided in .csv format: hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and the corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv); the data of the individual time series (one time series per line, described in hctsa_timeseries-info.csv) are in hctsa_timeseries-data.csv. These .csv files were produced by running >> OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.
The input file, INP_Empirical1000.mat, is for use with hctsa and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as >> TS_Init('INP_Empirical1000.mat');
Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be produced using TS_PlotTimeSeries from the hctsa package.
See the links in the references for more comprehensive documentation on performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.
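For users working outside Matlab, the following pandas sketch (an illustration, not part of the hctsa distribution) loads the .csv exports named above; whether hctsa_datamatrix.csv carries a header row is an assumption to check against the file.
```
# Minimal sketch for loading the .csv exports described above with pandas.
import pandas as pd

X = pd.read_csv("hctsa_datamatrix.csv", header=None)    # rows: time series, cols: features
ts_info = pd.read_csv("hctsa_timeseries-info.csv")      # metadata for each time series
features = pd.read_csv("hctsa_features.csv")            # metadata for each feature

print(X.shape)          # check header handling if the shape looks off by one row
print(ts_info.head())
print(features.head())
```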
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset defines the Mean High Water coastline of New Zealand and offshore islands at a scale of 1:50,000, and describes the type of coast along the coastline, for example, steep coast, mangrove, or stony shore.
Purpose
The NZ Coastline – Mean High Water dataset is the first step towards improving the national coastline data for New Zealand. LINZ is currently working on a long-term project, "Coastal Mapping", to capture a range of national coastlines derived from LiDAR and bathymetry. This will enable us to generate coastlines, for example, for Mean High Water Springs, Chart Datum and Highest Astronomical Tide. The project is currently focused on capturing LiDAR and bathymetry data, and the timeframe for delivering the new coastlines will be established once the data capture has progressed.
Status
This dataset was created and is maintained from LINZ Hydrographic and Topographic sources. Originally created in August 2020, this dataset will be replaced by a more accurate dataset once data becomes available through the Coastal Mapping project.
Data sources and preparation
The spatial coastline data (1:50,000 scale) is sourced from the Topo50 series, where it is described as a line forming the boundary between the land and sea, defined by mean high water. The source polygon data has been broken up into line segments to enable a coastal classification to be attributed to each segment of coast. Coastal classification data is based on official Electronic Navigational Charts published by the New Zealand Hydrographic Authority. Not all segments have been assigned a coastal category.
APIs and web services
This dataset is available via ArcGIS Online and ArcGIS REST services, as well as our standard APIs.
LDS APIs and OGC web services
ArcGIS Online map services
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Business process event data modeled as labeled property graphs
Data Format
-----------
The dataset comprises one labeled property graph in two different file formats.
#1) Neo4j .dump format
A neo4j (https://neo4j.com) database dump that contains the entire graph and can be imported into a fresh neo4j database instance using the following command, see also the neo4j documentation: https://neo4j.com/docs/
<neo4j-home>/bin/neo4j-admin.(bat|sh) load --database=graph.db --from=<path-to-dump-file>
The .dump was created with Neo4j v3.5.
#2) .graphml format
A .zip file containing a .graphml file of the entire graph
Data Schema
-----------
The graph is a labeled property graph over business process event data. Each graph uses the following concepts
:Event nodes - each event node describes a discrete event, i.e., an atomic observation described by attribute "Activity" that occurred at the given "timestamp"
:Entity nodes - each entity node describes an entity (e.g., an object or a user), it has an EntityType and an identifier (attribute "ID")
:Log nodes - describes a collection of events that were recorded together, most graphs only contain one log node
:Class nodes - each class node describes a type of observation that has been recorded, e.g., the different types of activities that can be observed, :Class nodes group events into sets of identical observations
:CORR relationships - from :Event to :Entity nodes, describes whether an event is correlated to a specific entity; an event can be correlated to multiple entities
:DF relationships - "directly-followed by" between two :Event nodes describes which event is directly-followed by which other event; both events in a :DF relationship must be correlated to the same entity node. All :DF relationships form a directed acyclic graph.
:HAS relationship - from a :Log to an :Event node, describes which events had been recorded in which event log
:OBSERVES relationship - from an :Event to a :Class node, describes to which event class an event belongs, i.e., which activity was observed in the graph
:REL relationship - placeholder for any structural relationship between two :Entity nodes
The concepts are further defined in Stefan Esser, Dirk Fahland: Multi-Dimensional Event Data in Graph Databases. CoRR abs/2005.14552 (2020) https://arxiv.org/abs/2005.14552
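As an illustration of how this schema can be queried once the dump is loaded, the following sketch uses the official neo4j Python driver; the connection details are placeholders, and the Cypher query only assumes the :Event nodes, the "Activity" attribute, and the :DF relationships described above.
```
# Illustrative query sketch; URI and credentials are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (e1:Event)-[:DF]->(e2:Event)
RETURN e1.Activity AS from_activity, e2.Activity AS to_activity, count(*) AS freq
ORDER BY freq DESC LIMIT 10
"""

with driver.session() as session:
    # Print the ten most frequent directly-follows pairs of activities.
    for record in session.run(query):
        print(record["from_activity"], "->", record["to_activity"], record["freq"])

driver.close()
```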
Data Contents
-------------
neo4j-bpic19-2021-02-17 (.dump|.graphml.zip)
An integrated graph describing the raw event data of the entire BPI Challenge 2019 dataset.
van Dongen, B.F. (Boudewijn) (2019): BPI Challenge 2019. 4TU.ResearchData. Collection. https://doi.org/10.4121/uuid:d06aff4b-79f0-45e6-8ec8-e19730c248f1
This data originated from a large multinational company operating from The Netherlands in the area of coatings and paints, and we ask participants to investigate the purchase order handling process for some of its 60 subsidiaries. In particular, the process owner has compliance questions. In the data, each purchase order (or purchase document) contains one or more line items. For each line item, there are roughly four types of flows in the data:
(1) 3-way matching, invoice after goods receipt: For these items, the value of the goods receipt message should be matched against the value of an invoice receipt message and the value put during creation of the item (indicated by both the GR-based flag and the Goods Receipt flag set to true).
(2) 3-way matching, invoice before goods receipt: Purchase items that do require a goods receipt message, while they do not require GR-based invoicing (indicated by the GR-based IV flag set to false and the Goods Receipt flag set to true). For such purchase items, invoices can be entered before the goods are received, but they are blocked until the goods are received. This unblocking can be done by a user, or by a batch process at regular intervals. Invoices should only be cleared if goods are received and the value matches the invoice and the value at creation of the item.
(3) 2-way matching (no goods receipt needed): For these items, the value of the invoice should match the value at creation (in full or partially until the PO value is consumed), but there is no separate goods receipt message required (indicated by both the GR-based flag and the Goods Receipt flag set to false).
(4) Consignment: For these items, there are no invoices on PO level as this is handled fully in a separate process. Here we see the GR indicator is set to true but the GR IV flag is set to false, and we also know by item type (consignment) that we do not expect an invoice against this item.
Unfortunately, the complexity of the data goes further than just this division into four categories. For each purchase item, there can be many goods receipt messages and corresponding invoices which are subsequently paid. Consider for example the process of paying rent. There is a purchase document with one item for paying rent, but a total of 12 goods receipt messages with (cleared) invoices with a value equal to 1/12 of the total amount. For logistical services, there may even be hundreds of goods receipt messages for one line item. Overall, for each line item, the amounts of the line item, the goods receipt messages (if applicable), and the invoices have to match for the process to be compliant.
Of course, the log is anonymized, but some semantics are left in the data, for example: The resources are split between batch users and normal users, indicated by their names. The batch users are automated processes executed by different systems. The normal users refer to human actors in the process. The monetary values of each event are anonymized from the original data using a linear translation respecting 0, i.e., addition of multiple invoices for a single item should still lead to the original item worth (although there may be small rounding errors for numerical reasons). Company, vendor, system and document names and IDs are anonymized in a consistent way throughout the log. The company has the key, so any result can be translated by them to business insights about real customers and real purchase documents.
The case ID is a combination of the purchase document and the purchase item. There is a total of 76,349 purchase documents containing in total 251,734 items, i.e., there are 251,734 cases. In these cases, there are 1,595,923 events relating to 42 activities performed by 627 users (607 human users and 20 batch users). Sometimes the user field is empty, or NONE, which indicates no user was recorded in the source system. For each purchase item (or case) the following attributes are recorded:
- concept:name: A combination of the purchase document id and the item id
- Purchasing Document: The purchasing document ID
- Item: The item ID
- Item Type: The type of the item
- GR-Based Inv. Verif.: Flag indicating if GR-based invoicing is required (see above)
- Goods Receipt: Flag indicating if 3-way matching is required (see above)
- Source: The source system of this item
- Doc. Category name: The name of the category of the purchasing document
- Company: The subsidiary of the company from where the purchase originated
- Spend classification text: A text explaining the class of purchase item
- Spend area text: A text explaining the area for the purchase item
- Sub spend area text: Another text explaining the area for the purchase item
- Vendor: The vendor to which the purchase document was sent
- Name: The name of the vendor
- Document Type: The document type
- Item Category: The category as explained above (3-way with GR-based invoicing, 3-way without, 2-way, consignment)
The data contains the following entities and their events
- PO - Purchase Order documents handled at a large multinational company operating from The Netherlands
- POItem - an item in a Purchase Order document describing a specific item to be purchased
- Resource - the user or worker handling the document or a specific item
- Vendor - the external organization from which an item is to be purchased
Data Size
---------
BPIC19, nodes: 1926651, relationships: 15082099
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The main stock market index in Japan (JP225) decreased 2147 points or 5.38% since the beginning of 2025, according to trading on a contract for difference (CFD) that tracks this benchmark index from Japan. Japan Stock Market Index (JP225) - values, historical data, forecasts and news - updated on March of 2025.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of two parts: (1) Data on the patterns of nutrient release from wetland plant carbon sources. After the experiment began, water samples were collected at the same intervals; the original and average concentrations of TOC and TN of each sample were measured and recorded, and line charts were drawn. (2) Data on the influence of carbon source materials on the nitrogen removal performance of Argento, Canna and corncob. From December 8 to April 27, 2019, water samples of each treatment were collected at the same time; the original concentration, average concentration, carbon source utilization rate and nitrogen removal efficiency of TOC, NO3--N, NH4+-N and TN of each sample were measured and recorded, and line charts were drawn.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.7910/DVN/P0RROU
The Value Line Investment Survey is one of the oldest, continuously running investment advisory publications. Since 1955, the Survey has been published in multiple formats including print, loose-leaf, microfilm and microfiche. Data from 1997 to present is now available online. The Survey tracks 1700 stocks across 92 industry groups. It provides reported and projected measures of firm performance, proprietary rankings and analysis for each stock on a quarterly basis.
DATA AVAILABLE FOR YEARS: 1980-1989
This dataset, a subset of the Survey covering the years 1980-1989, has been digitized from the microfiche collection available at the Dewey Library (FICHE HG 4501.V26). It is only available to MIT students and faculty for academic research.
Published weekly, each edition of the Survey has the following three parts:
Summary & Index: includes an alphabetical listing of all industries with their relative ranking and the page number for detailed industry analysis. It also includes an alphabetical listing of all stocks in the publication with references to their location in Part 3, Ratings & Reports.
Selection & Opinion: contains the latest economic and stock market commentary and advice along with one or more pages of research on interesting stocks or industries, and a variety of pertinent economic and stock market statistics. It also includes three model stock portfolios.
Ratings & Reports: This is the core of the Value Line Investment Survey. Preceded by an industry report, each one-page stock report within that industry includes Timeliness, Safety and Technical rankings, 3- to 5-year analyst forecasts for stock prices, income and balance sheet items, up to 17 years of historical data, and Value Line analysts' commentaries. The report also contains stock price charts, quarterly sales, earnings, and dividend information.
Publication Schedule: Each edition of the Survey covers around 130 stocks in seven to eight industries on a preset sequential schedule so that all 1700 stocks are analyzed once every 13 weeks or each quarter. All editions are numbered 1-13 within each quarter. For example, in 1980, reports for Chrysler appear in edition 1 of each quarter on the following dates:
January 4, 1980 – page 132
April 4, 1980 – page 133
July 4, 1980 – page 133
October 1, 1980 – page 133
Reports for Coca-Cola were published in edition 10 of each quarter on:
March 7, 1980 – page 1514
June 6, 1980 – page 1518
Sept. 5, 1980 – page 1517
Dec. 5, 1980 – page 1548
Any significant news affecting a stock between quarters is covered in the supplementary reports that appear at the end of Part 3, Ratings & Reports.
File format: Digitized files within this dataset are in PDF format and are arranged by publication date within each compressed annual folder.
How to Consult the Value Line Investment Survey: To find reports on a particular stock, consult the alphabetical listing of stocks in the Summary & Index part of the relevant weekly edition. Look for the page number just to the left of the company name and then use the table below to identify the edition where that page number appears. All editions within a given quarter are numbered 1-13 and follow equally sized page ranges for stock reports. The table provides page ranges for stock reports within editions 1-13 of 1980 Q1. It can be used to identify edition and page numbers for any quarter within a given year.
Ratings & Reports
Edition  Pub. Date   Pages
1        04-Jan-80   100-242
2        11-Jan-80   250-392
3        18-Jan-80   400-542
4        25-Jan-80   550-692
5        01-Feb-80   700-842
6        08-Feb-80   850-992
7        15-Feb-80   1000-1142
8        22-Feb-80   1150-1292
9        29-Feb-80   1300-1442
10       07-Mar-80   1450-1592
11       14-Mar-80   1600-1742
12       21-Mar-80   1750-1908
13       28-Mar-80   2000-2142
Another way to navigate to the Ratings & Reports part of an edition would be to look around page 50 within the PDF document. Note that the page numbers of the PDF will not match those within the publication.
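As a convenience, the following small Python helper encodes the 1980 Q1 table above and maps a Summary & Index page number to the edition containing its report; it is only a sketch based on that single quarter's ranges.
```
# Edition lookup based on the 1980 Q1 page ranges listed above.
EDITION_PAGES = [
    (1, 100, 242),   (2, 250, 392),   (3, 400, 542),    (4, 550, 692),
    (5, 700, 842),   (6, 850, 992),   (7, 1000, 1142),  (8, 1150, 1292),
    (9, 1300, 1442), (10, 1450, 1592), (11, 1600, 1742),
    (12, 1750, 1908), (13, 2000, 2142),
]

def edition_for_page(page):
    for edition, start, end in EDITION_PAGES:
        if start <= page <= end:
            return edition
    return None

print(edition_for_page(1514))   # Coca-Cola, page 1514 -> edition 10 (see above)
```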
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an Australian extract of Speedtest Open data available at Amazon WS (link below - opendata.aws).
The AWS data licence is "CC BY-NC-SA 4.0", so use of this data must be:
- non-commercial (NC)
- share-alike (SA) on reuse (add the same licence).
This restricts the standard CC-BY Figshare licence.
The world speedtest open data was downloaded (>400Mb, 7M lines of data). An extract of Australian locations (lat, long) revealed 88,000 lines of data (attached as csv). A Jupyter notebook of the extract process is attached. A link to a Twitter thread of outputs is provided. A link to a data tutorial is provided (GitHub), including a Jupyter Notebook to analyse World Speedtest data, selecting one US State.
Data shows (Q2):
- 3.1M speedtests
- 762,000 devices
- 88,000 grid locations (600m * 600m), summarised as a point
- average speed 33.7 Mbps (down), 12.4 Mbps (up)
- max speed 724 Mbps
- data is for 600m * 600m grids, showing average speed up/down, number of tests, and number of users (IP). Added centroid, and now lat/long.
See tweet of image of centroids, also attached.
Versions:
v15/16 - Add histogram comparing Q1-21 vs Q2-20. Inc ipynb (incHistQ121, v.1.3-Q121) to calc.
v14 - Add AUS Speedtest Q1 2021 geojson (79k lines, avg d/l 45.4Mbps).
v13 - Added three-colour MELB map (less than 20Mbps, over 90Mbps, 20-90Mbps).
v12 - Added AUS - Syd - Mel line chart Q320.
v11 - Add line chart comparing Q2, Q3, Q4 plus Melb - result virtually indistinguishable. Add line chart to compare Syd - Melb Q3. Also virtually indistinguishable. Add HIST comparing Syd - Melb Q3. Add new Jupyter with graph calcs (nbn-AUS-v1.3). Some ERRATA documented in Notebook: issue with resorting table and graphing only part of it; not an issue if all lines of the table are graphed.
v10 - Load AURIN sample pics. Speedtest data loaded to AURIN geo-analytic platform; requires edu.au login.
v9 - Add comparative Q2, Q3, Q4 hist pic.
v8 - Added Q4 data geojson. Add Q3, Q4 hist pic.
v7 - Rename to include Q2, Q3 in title.
v6 - Add Q3 20 data. Rename geojson AUS data as Q2. Add comparative histogram. Calc in International.ipynb.
v5 - Add Jupyter Notebook inc histograms. Hist is count of geo-locations' avg download speed (unweighted by tests).
v4 - Added Melb choropleth (png 50Mpix) inc legend. (To do - add Melb.geojson). Posted link to AURIN description of Speedtest data.
v3 - Add super fast data (>100Mbps), less than 1% of data - 697 lines. Includes png of superfast.plot(). Link below to Google Maps version of superfast data points. Also Google map of first 100 data points - sample data. Geojson format for loading into GeoPandas, per Jupyter Notebook (a minimal loading sketch follows this entry). New version of Jupyter Notebook, v.1.1.
v2 - Add centroids image.
v1 - Initial data load.
Future work:
- Combine Speedtest data with NBN Technology by location data (nationalmap.gov.au); https://www.data.gov.au/dataset/national-broadband-network-connections-by-technology-type
- Combine Speedtest data with SEIFA data - socioeconomic categories - to discuss with AURIN.
- Further international comparisons.
- Discussed collaboration with Assoc Prof Tooran Alizadeh, USyd.
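As a minimal loading sketch (not part of the published notebooks), the quarterly geojson files can be read with GeoPandas; the file and column names below are assumptions to be checked against the actual extract.
```
# Minimal GeoPandas sketch; file and column names are assumptions.
import geopandas as gpd

gdf = gpd.read_file("speedtest_australia_q2_2020.geojson")
print(gdf.shape)
print(gdf.columns.tolist())   # expect average down/up speed, tests, devices, geometry

# Quick look at the average download speed distribution, as in the histograms above.
if "avg_d_kbps" in gdf.columns:
    print(gdf["avg_d_kbps"].describe())
else:
    print("download speed column has a different name; inspect gdf.columns")
```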
Bluesky Social Dataset
Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.
The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.
Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.
This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.
Dataset
Here is a description of the dataset files.
- followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers u, v, representing a directed following relation (i.e., user u follows user v). A minimal reading sketch follows this list.
- posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in 100 files, each containing the full posts of up to 50,000 users. Each post is stored as a JSON-formatted line.
- interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers and represents a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
- graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.
- feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files containing posts from one feed each. Posts are stored as a JSON-formatted line. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
- feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values, namely the feed name, the user id, and the timestamp.
- feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.
- scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data and to perform experiments. These scripts are detailed in a document released within the folder.
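The following is a minimal sketch for reading the follower edge list described above; it assumes the decompressed CSV has no header row, which should be checked against the file.
```
# Minimal sketch: read the anonymized follower edge list.
import gzip, csv

edges = []
with gzip.open("followers.csv.gz", "rt") as f:
    # Assumption: no header row; if one is present, skip the first line.
    for u, v in csv.reader(f):          # each row: follower u, followed v
        edges.append((int(u), int(v)))

print(len(edges), "follow relations")
print(edges[:5])
```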
Citation
If used for research purposes, please cite the following paper describing the dataset details:
Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data. (2024) arXiv:2404.18984
Acknowledgments: This work is supported by:
the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu); SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021; EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
https://spdx.org/licenses/CC0-1.0.html
This dataset supports the research article "Forb composition gradients and intra-annual variation in a threatened Pacific Northwest Bunchgrass Prairie; Averett and Endress. In Print. Ecology and Evolution". This data includes: (1) perennial forb species composition data from Pacific Northwest Bunchgrass Prairie habitat in the Starkey Experimental Forest and Range, northeastern Oregon; (2) environmental, abiotic, and species trait variables measured from each sampling site; (3) plant species list; (4) densities of culturally important forb species; and (5) long-term sample dates for vegetation plots from the Starkey Experimental Forest and surrounding National Forest lands. Forb composition data was collected from 29 plots in the Starkey Experimental Forest and Range, northeastern Oregon, at three different times during 2016 (April; May; July).
Methods
We used a stratified random design to sample vegetation composition from 29 (154 m2) plots in Pacific Northwest Bunchgrass (PNB) habitat within the SEFR. Areas within the Main and Campbell study areas of the Starkey Experimental Forest and Range were stratified by percent tree cover, then 30 plots were randomly located within areas with < 5% tree cover (grassland plots). One plot was located within an open forest when visited and was therefore excluded, resulting in a total of 29 plots. We sampled each plot at three different times during the growing season in 2016: April (April 18th – May 2nd); May (May 23rd – June 1st); and July (July 11th – 18th). The first two sample periods coincided with growth of spring ephemerals. The third was within the traditional vegetation sampling window for PNB. One circular plot (radius = 7 m) was established at each sampling site and 12 quadrats (1 m2) were systematically located within each plot (Appendix A; Fig A1). Two transect lines were laid out perpendicular to each other and intersecting at the center of the plot, resulting in one line running 14 m in length from north (0 m) to south (14 m), and the other 14 m in length from west (0 m) to east (14 m). Four quadrats were centered at 0.5 m, 4 m, 10 m, and 13.5 m along each of the two transects for a total of eight quadrats along the north/south and west/east lines. Four additional quadrats were located 4 m from the plot center along each of the NE, SE, SW, and NW cardinal directions for a total of 12 quadrats per plot. Within each quadrat, presence/absence was recorded and canopy cover was estimated for all forb species during the early (April and May) sampling periods, and for all vascular plant species during the late (July) sampling period. Canopy cover was classified into one of eight cover categories (<1%; >1-5%; >5-25%; >25-50%; >50-75%; >75-95%; >95-99%; >99-100%). Plot-level abundance for each species was calculated as the frequency of quadrats occupied per plot. Plot-level cover was calculated as the average arithmetic midpoint of the cover classes. Graminoids (grasses, sedges, and rushes) were not identified to species during April and May due to their early phenological development. Plot-level cover of the soil surface, i.e., litter, rock, biotic crust (moss or lichen), and bare ground, along with cover of total vegetation and each functional group (i.e., perennial forbs, annual forbs, perennial graminoids, annual graminoids, and shrubs), was estimated the same as species cover.
A tile probe was used to measure depth to soil restrictive layer (average of nine samples per plot) within 80 cm of the mineral soil surface during April at the center of the plot and at 1.75 m and 12.25 m along the North/South and West/East lines as well as at 6.5 m from the plot center along each of the NE, SE, SW, and NW lines. Nine soil cores (< 25 cm deep depending on soil depth) were collected offset (0.25 m towards the plot center) from the tile probe measurement locations. The nine soil cores were then mixed for each plot, ground, and dried at 60 ˚C for 48 hours. Soil chemical and textural analyses were performed at AgSource Laboratory (Umatilla, OR, USA) for pH, cation exchange capacity (CEC), organic matter (%), Phosphorous (P; Olsen); Potassium (K; Ammonium Acetate), Magnesium (Mg; Ammonium Acetate), Calcium (Ca; Ammonium Acetate), Sodium (Na; Ammonium Acetate), and percent Sand, Silt, and Clay. Elevation, slope, and aspect were extracted from 30-m resolution digital elevation models (U.S. Geological Survey 2006) using ArcGIS 9.1. We transformed aspect by folding the aspect about the NE – SW lines to align with an expected heat load gradient (SW = maximum heat load and NE = minimum heat load). For more detail regarding methods please refer to Averett and Endress. In Print. Ecology and Evolution "Forb composition gradients and intra-annual variation in a threatened Pacific Northwest Bunchgrass Prairie".
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corn decreased 3.39 USd/BU or 0.74% since the beginning of 2025, according to trading on a contract for difference (CFD) that tracks the benchmark market for this commodity. Corn - values, historical data, forecasts and news - updated on March of 2025.
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.
Storage capacity also growing
Only a small percentage of this newly created data is kept though, as just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.
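A short worked example of what the quoted 19.2 percent compound annual growth rate implies, starting from the 6.7 zettabytes of installed storage capacity in 2020:
```
# Compound annual growth: capacity(year) = 6.7 ZB * (1 + 0.192) ** (year - 2020)
base_2020 = 6.7      # zettabytes installed in 2020
cagr = 0.192         # 19.2 percent per year

for year in range(2020, 2026):
    print(year, round(base_2020 * (1 + cagr) ** (year - 2020), 1), "ZB")
```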
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Crude Oil decreased 2.12 USD/BBL or 2.95% since the beginning of 2025, according to trading on a contract for difference (CFD) that tracks the benchmark market for this commodity. Crude Oil - values, historical data, forecasts and news - updated on March of 2025.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Existing Home Sales in the United States increased to 4260 Thousand in February from 4090 Thousand in January of 2025. This dataset provides the latest reported value for - United States Existing Home Sales - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.