Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Party List: A table containing countries, names of the parties (English and local), election dates, party abbreviations, election vote share, change in vote share from the previous election, number of news mentions, and the link to the Wikipedia page. (csv)
No license specified (source: https://academictorrents.com/)
Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally occurring cross-document reference data to create the Wikilinks dataset, comprising 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia in a web crawl and using the anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.
Introduction
The Wikipedia links (WikiLinks) data consists of web pages that satisfy the following two constraints: a. conta
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Upscaling results for the number of species of the four analysed datasets from local samples covering a fraction p* = 5% of the corresponding global dataset. For each human activity, we display the number of species (users, hashtags, words) and individuals (sent mails, posts, occurrences) at the global scale together with the average fitted RSA distribution parameters at the sampled scale and the relative percentage error (mean and standard deviation among 100 trials) between the true number of species and the one predicted by our framework. See S1 Fig in S1 Appendix for the corresponding fitting curves and predicted global RSA patterns.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Introduction
Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.
We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity for correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia's full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history, spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.
For more details, please refer to the description below and to the dataset paper:
Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.
https://arxiv.org/abs/2001.10256
When using the dataset, please cite the above paper.
Dataset summary
The dataset consists of three parts: (1) the HTML revision history, (2) page creation times, and (3) the redirect history.
Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.
Getting the data
Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7 TB in size, too large for Zenodo, and is therefore hosted externally on the Internet Archive. For downloading part 1, you have multiple options:
Dataset details
Part 1: HTML revision history
The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):
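For orientation, the following is a minimal sketch of how one such directory could be read, assuming the gzip-compressed JSON files store one revision per line (as the row-based description above suggests); the directory name is illustrative, and the key names are left generic because they follow the original wikitext dump.

```python
import gzip
import json
from pathlib import Path

# Illustrative directory name following the naming scheme described above.
part1_dir = Path("enwiki-20190301-pages-meta-history1.xml-p10p2062")

for gz_path in sorted(part1_dir.glob("*.gz")):
    with gzip.open(gz_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            revision = json.loads(line)  # one article revision per row (assumed JSON Lines)
            # The keys mirror the original wikitext dump, with the wikitext
            # content replaced by parsed HTML; inspect them before relying on names.
            print(sorted(revision.keys()))
            break  # remove to process all revisions
    break  # remove to process all files
```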
Part 2: Page creation times (page_creation_times.json.gz)
This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:
Part 3: Redirect history (redirect_history.json.gz)
This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:
The repository also contains two additional files, metadata.zip and mysql_database.zip. These two files are not part of WikiHist.html per se, and most users will not need to download them manually. The file metadata.zip is required by the download script (and will be fetched by the script automatically), and mysql_database.zip is required by the code used to produce WikiHist.html. The code that uses these files is hosted at GitHub, but the files are too big for GitHub and are therefore hosted here.
WikiHist.html was produced by parsing the 1 March 2019 dump (https://dumps.wikimedia.org/enwiki/20190301) from wikitext to HTML. That old dump is no longer available on Wikimedia's servers, so we make a copy available at https://archive.org/details/enwiki-20190301-original-full-history-dump_dlab.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two large-scale, automatically-created datasets of medical concept mentions, linked to the Unified Medical Language System (UMLS).
WikiMed
Derived from Wikipedia data. Mappings of Wikipedia page identifiers to UMLS Concept Unique Identifiers (CUIs) were extracted by crosswalking Wikipedia, Wikidata, Freebase, and the NCBI Taxonomy to reach existing mappings to UMLS CUIs. This created a 1:1 mapping of approximately 60,500 Wikipedia pages to UMLS CUIs. Links to these pages were then extracted as mentions of the corresponding UMLS CUIs.
WikiMed contains:
Manual evaluation of 100 random samples of WikiMed found 91% accuracy in the automatic annotations at the level of UMLS CUIs, and 95% accuracy in terms of semantic type.
PubMedDS
Derived from biomedical literature abstracts from PubMed. Mentions were automatically identified using distant supervision, based on the Medical Subject Heading (MeSH) headers assigned to the papers in PubMed, combined with recognition of medical concept mentions using the high-performance scispaCy model. MeSH header codes are included, as well as their mappings to UMLS CUIs.
PubMedDS contains:
Comparison with existing manually-annotated datasets (NCBI Disease Corpus, BioCDR, and MedMentions) found 75-90% precision in automatic annotations. Please note this dataset is not a comprehensive annotation of medical concept mentions in these abstracts (only mentions located through distant supervision from MeSH headers were included), but is intended as data for concept normalization research.
Due to its size, PubMedDS is distributed as 30 individual files of approximately 1.5 million mentions each.
Data format
Both datasets use JSON format with one document per line. Each document has the following structure:
{
  "_id": "A unique identifier of each document",
  "text": "Contains text over which mentions are ...",
  "title": "Title of the Wikipedia/PubMed article",
  "split": "[Not in PubMedDS] Dataset split: ...",
  ...
}
The Wikipedia-based Image Text (WIT) dataset is a large multimodal, multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
Key Advantages
A few unique advantages of WIT:
It is the largest multimodal dataset (at the time of writing) by the number of image-text examples. It is massively multilingual (the first of its kind), with coverage for over 100 languages. It offers a diverse set of concepts and real-world entities, and it brings forth challenging real-world test sets.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Nowadays, a multitude of tracking systems produce massive amounts of maritime data on a daily basis. The most commonly used is the Automatic Identification System (AIS), a collaborative, self-reporting system that allows vessels to broadcast their identification information, characteristics, and destination, along with other information originating from on-board devices and sensors, such as location, speed, and heading. AIS messages are broadcast periodically and can be received by other vessels equipped with AIS transceivers, as well as by ground-based or satellite-based sensors.
Since the International Maritime Organisation (IMO) made it obligatory for vessels above 300 gross tonnage to carry AIS transponders, large datasets have gradually become available and are now considered a valid source for maritime intelligence [4]. There is now a growing body of literature on methods of exploiting AIS data for the safety and optimisation of seafaring, namely traffic analysis, anomaly detection, route extraction and prediction, collision detection, path planning, weather routing, etc. [5].
As the amount of available AIS data grows to massive scales, researchers are realising that computational techniques must contend with the difficulties of acquiring, storing, and processing the data. Traditional information systems are incapable of dealing with such firehoses of spatiotemporal data, where thousands of data units must be ingested per second while sub-second query response times are maintained.
Processing streaming data exhibits characteristics similar to other big data challenges, such as handling high data volumes and complex data types. While big data batch processing techniques are sufficient for many applications, for applications such as navigation, timeliness is a top priority: making the right decision to steer a vessel away from danger is only useful if the decision is made in due time. The true challenge lies in the fact that, to satisfy real-time application needs, high-velocity data of unbounded size must be processed under constraints on processing time (relative to the data size) and with finite memory. Research on data streams is gaining attention as a subset of the more generic big data research field.
Research on such topics requires an uncompressed, uncleaned dataset similar to what would be collected in real-world conditions. This dataset contains all decoded messages collected within a 24-hour period (starting from 29/02/2020 10 PM UTC) from a single receiver located near the port of Piraeus (Greece). All vessel identifiers, such as IMO and MMSI, have been anonymised, and no down-sampling, filtering, or cleaning has been applied.
The schema of the dataset is provided below:
· t: the time at which the message was received (UTC)
· shipid: the anonymized id of the ship
· lon: the longitude of the current ship position
· lat: the latitude of the current ship position
· heading: the direction in which the ship's bow points (see: https://en.wikipedia.org/wiki/Course_(navigation))
· course: the direction in which the ship moves (see: https://en.wikipedia.org/wiki/Course_(navigation))
· speed: the speed of the ship (measured in knots)
· shiptype: AIS reported ship-type
· destination: AIS reported destination
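As an illustration only, the sketch below assumes the decoded messages are distributed as a CSV file whose columns match the schema above; the file name is hypothetical.

```python
import pandas as pd

# Hypothetical file name; column names are assumed to match the schema above.
ais = pd.read_csv("piraeus_ais_20200229.csv")

# Example inspection: number of distinct (anonymised) vessels and the speed distribution in knots.
print(ais["shipid"].nunique())
print(ais["speed"].describe())
```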
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘List of Top Data Breaches (2004 - 2021)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/hishaamarmghan/list-of-top-data-breaches-2004-2021 on 14 February 2022.
--- Dataset description provided by original source is as follows ---
This is a dataset containing all the major data breaches in the world from 2004 to 2021.
As we know, there is a big issue related to the privacy of our data. Many major companies around the world still face this issue every single day, and many suffer breaches even with great security teams. To tackle this situation, we must study the issue in depth, so I pulled this data from Wikipedia to conduct data analysis. I would encourage others to take a look at it as well and find as many insights as possible.
This data contains 5 columns:
1. Entity: the name of the company, organization, or institute
2. Year: the year in which the data breach took place
3. Records: how many records were compromised (can include information such as emails and passwords)
4. Organization type: the sector to which the organization belongs
5. Method: was it hacked, were the files lost, or was it an inside job?
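A small, hedged sketch of how the five columns above could be explored with pandas; the file name is hypothetical, and the exact column capitalisation may differ in the CSV.

```python
import pandas as pd

# Hypothetical file name; the five columns are those described above.
breaches = pd.read_csv("data_breaches_2004_2021.csv")

# Example questions the columns support: breaches per year and breaches per method.
print(breaches.groupby("Year").size())
print(breaches["Method"].value_counts())
```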
Here is the source for the dataset: https://en.wikipedia.org/wiki/List_of_data_breaches
Here is the GitHub link for a guide on how it was scraped: https://github.com/hishaamarmghan/Data-Breaches-Scraping-Cleaning
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks containing hash pointers to a previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) that leverages the blockchain to store transactions in a distributed manner in order to mitigate flaws in the financial industry.
Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.
In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset, which is updated every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that the methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
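A minimal query sketch with the BigQuery Python client library; the aggregation and the block_timestamp column are illustrative assumptions rather than documented fields.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative query against the transactions table; block_timestamp is an assumption.
query = """
    SELECT DATE(block_timestamp) AS day, COUNT(*) AS tx_count
    FROM `bigquery-public-data.crypto_bitcoin.transactions`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
"""
for row in client.query(query).result():
    print(row.day, row.tx_count)
```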
Allen Day (Google Cloud Developer Advocate) and Colin Bookman (Google Cloud Customer Engineer) retrieve data from the Bitcoin network using a custom client, available on GitHub, that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk into two BigQuery tables, blocks_raw and transactions. These tables stay fresh, as new data are appended when new blocks are broadcast to the Bitcoin network. For additional information, visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".
This data package includes the underlying data and files to replicate the calculations, charts, and tables presented in Large Depreciations: Recent Experience in Historical Perspective, PIIE Working Paper 16-8. If you use the data, please cite as: De Gregorio, José. (2016). Large Depreciations: Recent Experience in Historical Perspective. PIIE Working Paper 16-8. Peterson Institute for International Economics.
These land parcels, managed by the BLM, have been designated as "closed to fluid mineral leasing" per the individual field office plan for the area. This data was compiled for the Big Game Corridor Planning effort, but may be used as a statewide representation of areas closed to leasing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article is an orphan, as no other article links to it. Please try to add links to this page from related articles.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The proposed AIS dataset covers a substantial temporal span of 20 months, from April 2021 to December 2022. This extensive coverage period empowers analysts to examine long-term trends and variations in vessel activities. Moreover, it facilitates researchers in comprehending the potential influence of external factors, including weather patterns, seasonal variations, and economic conditions, on vessel traffic and behavior within the Finnish waters.
This dataset contains an extensive array of data pertaining to vessel movements and activities across seas, rivers, and lakes. It covers a diverse range of ship types, such as cargo ships, tankers, fishing vessels, passenger ships, and various other categories.
A prominent attribute of the AIS dataset is its exceptional granularity, with a total of 2 293 129 345 data points. Such granular information can help analysts comprehend vessel dynamics and operations within the Finnish waters. It enables the identification of patterns and anomalies in vessel behavior and facilitates an assessment of the potential environmental implications associated with maritime activities.
Please cite the following publication when using the dataset:
TBD
The publication is available at: TBD
A preprint version of the publication is available at TBD
This file contains the received AIS position reports. The structure of the logged parameters is the following: [timestamp, timestampExternal, mmsi, lon, lat, sog, cog, navStat, rot, posAcc, raim, heading]
timestamp
I believe this is the UTC second when the report was generated by the electronic position fixing system (EPFS) (0-59; 60 if the time stamp is not available, which should also be the default value; 61 if the positioning system is in manual input mode; 62 if the electronic position fixing system operates in estimated (dead reckoning) mode; 63 if the positioning system is inoperative).
timestampExternal
The timestamp associated with the MQTT message received from www.digitraffic.fi. It is assumed this timestamp is the Epoch time corresponding to when the AIS message was received by digitraffic.fi.
mmsi
MMSI number: the Maritime Mobile Service Identity (MMSI) is a unique 9-digit number assigned to a Digital Selective Calling (DSC) radio or an AIS unit. Check https://en.wikipedia.org/wiki/Maritime_Mobile_Service_Identity
lon
Longitude, in 1/10 000 min (±180°; East = positive, West = negative, as per two's complement); 181° (6791AC0h) = not available = default
lat
Latitude, in 1/10 000 min (±90°; North = positive, South = negative, as per two's complement); 91° (3412140h) = not available = default
sog
Speed over ground in 1/10 knot steps (0-102.2 knots); 1 023 = not available; 1 022 = 102.2 knots or higher
cog
Course over ground in 1/10° (0-3599); 3 600 (E10h) = not available = default; 3 601-4 095 should not be used
navStat
Navigational status, 0 = under way using engine, 1 = at anchor, 2 = not under command, 3 = restricted maneuverability, 4 = constrained by her draught, 5 = moored, 6 = aground, 7 = engaged in fishing, 8 = under way sailing, 9 = reserved for future amendment of navigational status for ships carrying DG, HS, or MP, or IMO hazard or pollutant category C, high speed craft (HSC), 10 = reserved for future amendment of navigational status for ships carrying dangerous goods (DG), harmful substances (HS) or marine pollutants (MP), or IMO hazard or pollutant category A, wing in ground (WIG); 11 = power-driven vessel towing astern (regional use); 12 = power-driven vessel pushing ahead or towing alongside (regional use); 13 = reserved for future use, 14 = AIS-SART (active), MOB-AIS, EPIRB-AIS 15 = undefined = default (also used by AIS-SART, MOB-AIS and EPIRB-AIS under test)
rot
Rate of turn (ROTAIS)
ROT data should not be derived from COG information.
posAcc
Position accuracy: the position accuracy (PA) flag should be determined in accordance with the table at the reference below:
See https://www.navcen.uscg.gov/?pageName=AISMessagesA#RAIM
raim
RAIM flag: receiver autonomous integrity monitoring (RAIM) flag of the electronic position fixing device; 0 = RAIM not in use = default; 1 = RAIM in use. See the table at https://www.navcen.uscg.gov/?pageName=AISMessagesA#RAIM
Check https://en.wikipedia.org/wiki/Receiver_autonomous_integrity_monitoring
heading
True heading, in degrees (0-359); 511 = not available = default
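To make the encodings above concrete, here is a hedged decoding sketch. It assumes the logged values use the raw AIS units quoted above (1/10 000 min for lon/lat, 1/10 knot for sog, 1/10 degree for cog); if the logger already stores decoded values, the scaling steps are unnecessary.

```python
def decode_position(rec):
    """Decode one logged position report, assuming the raw AIS units described above.

    rec = [timestamp, timestampExternal, mmsi, lon, lat, sog, cog,
           navStat, rot, posAcc, raim, heading]
    """
    (ts, ts_ext, mmsi, lon, lat, sog, cog,
     nav_stat, rot, pos_acc, raim, heading) = rec
    return {
        "mmsi": mmsi,
        # lon/lat are in 1/10 000 min: divide by 600 000 to obtain degrees.
        "lon_deg": None if lon == 181 * 600000 else lon / 600000.0,
        "lat_deg": None if lat == 91 * 600000 else lat / 600000.0,
        "sog_knots": None if sog == 1023 else sog / 10.0,   # 1022 = 102.2 kn or higher
        "cog_deg": None if cog == 3600 else cog / 10.0,
        "heading_deg": None if heading == 511 else heading,
        "nav_status": nav_stat,
        "utc_second": None if ts >= 60 else ts,             # 60-63 are special codes
    }
```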
This file contains the received AIS metadata: the ship static and voyage related data. The structure of the logged parameters is the following: [timestamp, destination, mmsi, callSign, imo, shipType, draught, eta, posType, pointA, pointB, pointC, pointD, name]
timestamp
The timestamp associated with the MQTT message received from www.digitraffic.fi. It is assumed this timestamp is the Epoch time corresponding to when the AIS message was received by digitraffic.fi.
destination
Maximum 20 characters using 6-bit ASCII; @@@@@@@@@@@@@@@@@@@@ = not available. For SAR aircraft, the use of this field may be decided by the responsible administration.
mmsi
MMSI number: the Maritime Mobile Service Identity (MMSI) is a unique 9-digit number assigned to a Digital Selective Calling (DSC) radio or an AIS unit. Check https://en.wikipedia.org/wiki/Maritime_Mobile_Service_Identity
callSign
7 six-bit ASCII characters; @@@@@@@ = not available = default. Craft associated with a parent vessel should use "A" followed by the last 6 digits of the MMSI of the parent vessel. Examples of such craft include towed vessels, rescue boats, tenders, lifeboats, and liferafts.
imo
0 = not available = default – Not applicable to SAR aircraft
Check: https://en.wikipedia.org/wiki/IMO_number
shipType
Check https://www.navcen.uscg.gov/pdf/AIS/AISGuide.pdf and https://www.navcen.uscg.gov/?pageName=AISMessagesAStatic
draught
In 1/10 m; 255 = draught of 25.5 m or greater; 0 = not available = default; in accordance with IMO Resolution A.851. Not applicable to SAR aircraft; should be set to 0.
eta
Estimated time of arrival; MMDDHHMM UTC
For SAR aircraft, the use of this field may be decided by the responsible administration
posType
Type of electronic position fixing device
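A hedged sketch for the two static/voyage fields whose numeric encodings are easiest to misread, eta and draught; it assumes eta is logged as a decimal MMDDHHMM integer (as the field description suggests) and draught in units of 1/10 m.

```python
def decode_eta(eta):
    """Split a decimal MMDDHHMM ETA into (month, day, hour, minute).

    Per the AIS specification, month 0, day 0, hour 24 and minute 60
    indicate 'not available'.
    """
    month = eta // 1000000
    day = (eta // 10000) % 100
    hour = (eta // 100) % 100
    minute = eta % 100
    return month, day, hour, minute


def decode_draught(draught):
    """Draught is reported in 1/10 m; 0 = not available, 255 = 25.5 m or greater."""
    return None if draught == 0 else draught / 10.0
```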
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Novel data streams used for capturing public reaction to the Zika epidemic outbreak.
Cases within HMRC Local Compliance that have yielded in excess of £1m (Indirect Tax) or £5m (Direct Tax). Updated: monthly.
This data package includes the underlying data files to replicate the calculations and charts presented in The online gig economy’s impact is not as big as many thought, PIIE Policy Brief 22-9.
If you use the data, please cite as: Branstetter, Lee (2022). The online gig economy’s impact is not as big as many thought, PIIE Policy Brief 22-9. Peterson Institute for International Economics.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We constructed the publication records for almost all Nobel laureates in physics, chemistry, and physiology or medicine from 1900 to 2016 (545 out of 590, 92.4%). We first collected information manually from the official Nobel Prize website, the laureates' university websites, and Wikipedia. We then matched it algorithmically with big data, tracing publication records from the MAG database.
This data package includes the underlying data and files to replicate the calculations, charts, and tables presented in Aggregate Effects of Budget Stimulus: Evidence from the Large Fiscal Expansions Database. PIIE Working Paper 19-12.
If you use the data, please cite as: Cohen-Setton, Jeremie, Egor Gornostay, and Colombe Ladreit de Lacharrière (2019). Aggregate Effects of Budget Stimulus: Evidence from the Large Fiscal Expansions Database. PIIE Working Paper 19-12. Peterson Institute for International Economics.
This data package includes the underlying data and files to replicate the calculations, charts, and tables presented in United States Is Outlier in Tax Trends in Advanced and Large Emerging Economies, PIIE Policy Brief 17-29. If you use the data, please cite as: Djankov, Simeon. (2017). United States Is Outlier in Tax Trends in Advanced and Large Emerging Economies. PIIE Policy Brief 17-29. Peterson Institute for International Economics.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Zika- and Zika virus-related Google Trends search queries at the global level, November 2015 – October 2016.