Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Party List: A table containing countries, names of the parties (English and local), election dates, party abbreviations, election vote share, change in vote share from the previous election, number of news mentions, and the link to the Wikipedia page. (csv)
No license specified (source: https://academictorrents.com/)
Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally occurring cross-document reference data to create the Wikilinks dataset, comprising 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia in a web crawl and using the anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.
Introduction
The Wikipedia links (WikiLinks) data consists of web pages that satisfy the following two constraints: a. conta
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Upscaling results for the number of species of the four analysed datasets from local samples covering a fraction p* = 5% of the corresponding global dataset. For each human activity, we display the number of species (users, hashtags, words) and individuals (sent mails, posts, occurrences) at the global scale together with the average fitted RSA distribution parameters at the sampled scale and the relative percentage error (mean and standard deviation among 100 trials) between the true number of species and the one predicted by our framework. See S1 Fig in S1 Appendix for the corresponding fitting curves and predicted global RSA patterns.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Introduction
Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.
We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity for correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia's full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history, spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.
For more details, please refer to the description below and to the dataset paper:
Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.
https://arxiv.org/abs/2001.10256
When using the dataset, please cite the above paper.
Dataset summary
The dataset consists of three parts: (1) the HTML revision history, (2) page creation times, and (3) the redirect history.
Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.
Getting the data
Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7 TB in size, too large for Zenodo, and is therefore hosted externally on the Internet Archive. For downloading part 1, you have multiple options:
Dataset details
Part 1: HTML revision history
The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):
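For orientation, the following is a minimal sketch of how one such directory could be read, assuming the gzip-compressed JSON files store one revision per line (as the row-based description above suggests); the directory name is illustrative, and the key names are left generic because they follow the original wikitext dump.

```python
import gzip
import json
from pathlib import Path

# Illustrative directory name following the naming scheme described above.
part1_dir = Path("enwiki-20190301-pages-meta-history1.xml-p10p2062")

for gz_path in sorted(part1_dir.glob("*.gz")):
    with gzip.open(gz_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            revision = json.loads(line)  # one article revision per row (assumed JSON Lines)
            # The keys mirror the original wikitext dump, with the wikitext
            # content replaced by parsed HTML; inspect them before relying on names.
            print(sorted(revision.keys()))
            break  # remove to process all revisions
    break  # remove to process all files
```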
Part 2: Page creation times (page_creation_times.json.gz)
This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:
Part 3: Redirect history (redirect_history.json.gz)
This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:
The repository also contains two additional files, metadata.zip and mysql_database.zip. These two files are not part of WikiHist.html per se, and most users will not need to download them manually. The file metadata.zip is required by the download script (and will be fetched by the script automatically), and mysql_database.zip is required by the code used to produce WikiHist.html. The code that uses these files is hosted at GitHub, but the files are too big for GitHub and are therefore hosted here.
WikiHist.html was produced by parsing the 1 March 2019 dump (https://dumps.wikimedia.org/enwiki/20190301) from wikitext to HTML. That old dump is no longer available on Wikimedia's servers, so we make a copy available at https://archive.org/details/enwiki-20190301-original-full-history-dump_dlab.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two large-scale, automatically-created datasets of medical concept mentions, linked to the Unified Medical Language System (UMLS).
WikiMed
Derived from Wikipedia data. Mappings of Wikipedia page identifiers to UMLS Concept Unique Identifiers (CUIs) were extracted by crosswalking Wikipedia, Wikidata, Freebase, and the NCBI Taxonomy to reach existing mappings to UMLS CUIs. This created a 1:1 mapping of approximately 60,500 Wikipedia pages to UMLS CUIs. Links to these pages were then extracted as mentions of the corresponding UMLS CUIs.
WikiMed contains:
Manual evaluation of 100 random samples of WikiMed found 91% accuracy in the automatic annotations at the level of UMLS CUIs, and 95% accuracy in terms of semantic type.
PubMedDS
Derived from biomedical literature abstracts from PubMed. Mentions were automatically identified using distant supervision, based on the Medical Subject Heading (MeSH) headers assigned to the papers in PubMed, combined with recognition of medical concept mentions using the high-performance scispaCy model. MeSH header codes are included, as well as their mappings to UMLS CUIs.
PubMedDS contains:
Comparison with existing manually-annotated datasets (NCBI Disease Corpus, BioCDR, and MedMentions) found 75-90% precision in automatic annotations. Please note this dataset is not a comprehensive annotation of medical concept mentions in these abstracts (only mentions located through distant supervision from MeSH headers were included), but is intended as data for concept normalization research.
Due to its size, PubMedDS is distributed as 30 individual files of approximately 1.5 million mentions each.
Data format
Both datasets use JSON format with one document per line. Each document has the following structure:
{
  "_id": "A unique identifier of each document",
  "text": "Contains text over which mentions are ...",
  "title": "Title of the Wikipedia/PubMed article",
  "split": "[Not in PubMedDS] Dataset split: ...",
  ...
}
The Wikipedia-based Image Text (WIT) dataset is a large multimodal, multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
Key Advantages
A few unique advantages of WIT:
It is the largest multimodal dataset (at the time of writing) by the number of image-text examples. It is massively multilingual (the first of its kind), with coverage for over 100 languages. It offers a diverse set of concepts and real-world entities, and it brings forth challenging real-world test sets.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Nowadays, a multitude of tracking systems produce massive amounts of maritime data on a daily basis. The most commonly used is the Automatic Identification System (AIS), a collaborative, self-reporting system that allows vessels to broadcast their identification information, characteristics, and destination, along with other information originating from on-board devices and sensors, such as location, speed, and heading. AIS messages are broadcast periodically and can be received by other vessels equipped with AIS transceivers, as well as by ground-based or satellite-based sensors.
Since the International Maritime Organisation (IMO) made it obligatory for vessels above 300 gross tonnage to carry AIS transponders, large datasets have gradually become available and are now considered a valid source for maritime intelligence [4]. There is now a growing body of literature on methods of exploiting AIS data for the safety and optimisation of seafaring, namely traffic analysis, anomaly detection, route extraction and prediction, collision detection, path planning, weather routing, etc. [5].
As the amount of available AIS data grows to massive scales, researchers are realising that computational techniques must contend with the difficulties of acquiring, storing, and processing the data. Traditional information systems are incapable of dealing with such firehoses of spatiotemporal data, where thousands of data units must be ingested per second while sub-second query response times are maintained.
Processing streaming data exhibits characteristics similar to other big data challenges, such as handling high data volumes and complex data types. While big data batch processing techniques are sufficient for many applications, for applications such as navigation, timeliness is a top priority: making the right decision to steer a vessel away from danger is only useful if the decision is made in due time. The true challenge lies in the fact that, to satisfy real-time application needs, high-velocity data of unbounded size must be processed under constraints on processing time (relative to the data size) and with finite memory. Research on data streams is gaining attention as a subset of the more generic big data research field.
Research on such topics requires an uncompressed, uncleaned dataset similar to what would be collected in real-world conditions. This dataset contains all decoded messages collected within a 24-hour period (starting from 29/02/2020 10 PM UTC) from a single receiver located near the port of Piraeus (Greece). All vessel identifiers, such as IMO and MMSI, have been anonymised, and no down-sampling, filtering, or cleaning has been applied.
The schema of the dataset is provided below:
· t: the time at which the message was received (UTC)
· shipid: the anonymized id of the ship
· lon: the longitude of the current ship position
· lat: the latitude of the current ship position
· heading: the direction in which the ship's bow points (see: https://en.wikipedia.org/wiki/Course_(navigation))
· course: the direction in which the ship moves (see: https://en.wikipedia.org/wiki/Course_(navigation))
· speed: the speed of the ship (measured in knots)
· shiptype: AIS reported ship-type
· destination: AIS reported destination
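As an illustration only, the sketch below assumes the decoded messages are distributed as a CSV file whose columns match the schema above; the file name is hypothetical.

```python
import pandas as pd

# Hypothetical file name; column names are assumed to match the schema above.
ais = pd.read_csv("piraeus_ais_20200229.csv")

# Example inspection: number of distinct (anonymised) vessels and the speed distribution in knots.
print(ais["shipid"].nunique())
print(ais["speed"].describe())
```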
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘List of Top Data Breaches (2004 - 2021)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/hishaamarmghan/list-of-top-data-breaches-2004-2021 on 14 February 2022.
--- Dataset description provided by original source is as follows ---
This is a dataset containing all the major data breaches in the world from 2004 to 2021.
As we know, there is a big issue related to the privacy of our data. Many major companies around the world still face this issue every single day, and many suffer breaches even with great security teams. To tackle this situation, we must study the issue in depth, so I pulled this data from Wikipedia to conduct data analysis. I would encourage others to take a look at it as well and find as many insights as possible.
This data contains 5 columns:
1. Entity: the name of the company, organization, or institute
2. Year: the year in which the data breach took place
3. Records: how many records were compromised (can include information such as emails and passwords)
4. Organization type: the sector to which the organization belongs
5. Method: was it hacked, were the files lost, or was it an inside job?
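A small, hedged sketch of how the five columns above could be explored with pandas; the file name is hypothetical, and the exact column capitalisation may differ in the CSV.

```python
import pandas as pd

# Hypothetical file name; the five columns are those described above.
breaches = pd.read_csv("data_breaches_2004_2021.csv")

# Example questions the columns support: breaches per year and breaches per method.
print(breaches.groupby("Year").size())
print(breaches["Method"].value_counts())
```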
Here is the source for the dataset: https://en.wikipedia.org/wiki/List_of_data_breaches
Here is the GitHub link for a guide on how it was scraped: https://github.com/hishaamarmghan/Data-Breaches-Scraping-Cleaning
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks containing hash pointers to a previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) that leverages the blockchain to store transactions in a distributed manner in order to mitigate flaws in the financial industry.
Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.
In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset, which is updated every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that the methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
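A minimal query sketch with the BigQuery Python client library; the aggregation and the block_timestamp column are illustrative assumptions rather than documented fields.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative query against the transactions table; block_timestamp is an assumption.
query = """
    SELECT DATE(block_timestamp) AS day, COUNT(*) AS tx_count
    FROM `bigquery-public-data.crypto_bitcoin.transactions`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
"""
for row in client.query(query).result():
    print(row.day, row.tx_count)
```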
Allen Day (Google Cloud Developer Advocate) and Colin Bookman (Google Cloud Customer Engineer) retrieve data from the Bitcoin network using a custom client, available on GitHub, that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk into two BigQuery tables, blocks_raw and transactions. These tables stay fresh, as new data are appended when new blocks are broadcast to the Bitcoin network. For additional information, visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".
This data package includes the underlying data and files to replicate the calculations, charts, and tables presented in Large Depreciations: Recent Experience in Historical Perspective, PIIE Working Paper 16-8. If you use the data, please cite as: De Gregorio, José. (2016). Large Depreciations: Recent Experience in Historical Perspective. PIIE Working Paper 16-8. Peterson Institute for International Economics.
These land parcels, managed by the BLM, have been designated as "closed to fluid mineral leasing" per the individual field office plan for the area. This data was compiled for the Big Game Corridor Planning effort, but may be used as a statewide representation of areas closed to leasing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article is an orphan, as no other article links to it. Please try to add links to this page from related articles.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The proposed AIS dataset covers a substantial temporal span of 20 months, from April 2021 to December 2022. This extensive coverage period empowers analysts to examine long-term trends and variations in vessel activities. Moreover, it facilitates researchers in comprehending the potential influence of external factors, including weather patterns, seasonal variations, and economic conditions, on vessel traffic and behavior within the Finnish waters.
This dataset contains an extensive array of data pertaining to vessel movements and activities across seas, rivers, and lakes. It covers a diverse range of ship types, such as cargo ships, tankers, fishing vessels, passenger ships, and various other categories.
A prominent attribute of the AIS dataset is its exceptional granularity, with a total of 2 293 129 345 data points. Such granular information can help analysts comprehend vessel dynamics and operations within the Finnish waters. It enables the identification of patterns and anomalies in vessel behavior and facilitates an assessment of the potential environmental implications associated with maritime activities.
Please cite the following publication when using the dataset:
TBD
The publication is available at: TBD
A preprint version of the publication is available at TBD
This file contains the received AIS position reports. The structure of the logged parameters is the following: [timestamp, timestampExternal, mmsi, lon, lat, sog, cog, navStat, rot, posAcc, raim, heading]
timestamp
I believe this is the UTC second when the report was generated by the electronic position fixing system (EPFS) (0-59; 60 if the time stamp is not available, which should also be the default value; 61 if the positioning system is in manual input mode; 62 if the electronic position fixing system operates in estimated (dead reckoning) mode; 63 if the positioning system is inoperative).
timestampExternal
The timestamp associated with the MQTT message received from www.digitraffic.fi. It is assumed this timestamp is the Epoch time corresponding to when the AIS message was received by digitraffic.fi.
mmsi
MMSI number: the Maritime Mobile Service Identity (MMSI) is a unique 9-digit number assigned to a Digital Selective Calling (DSC) radio or an AIS unit. Check https://en.wikipedia.org/wiki/Maritime_Mobile_Service_Identity
lon
Longitude, in 1/10 000 min (±180°; East = positive, West = negative, as per two's complement); 181° (6791AC0h) = not available = default
lat
Latitude, in 1/10 000 min (±90°; North = positive, South = negative, as per two's complement); 91° (3412140h) = not available = default
sog
Speed over ground in 1/10 knot steps (0-102.2 knots); 1 023 = not available; 1 022 = 102.2 knots or higher
cog
Course over ground in 1/10° (0-3599); 3 600 (E10h) = not available = default; 3 601-4 095 should not be used
navStat
Navigational status, 0 = under way using engine, 1 = at anchor, 2 = not under command, 3 = restricted maneuverability, 4 = constrained by her draught, 5 = moored, 6 = aground, 7 = engaged in fishing, 8 = under way sailing, 9 = reserved for future amendment of navigational status for ships carrying DG, HS, or MP, or IMO hazard or pollutant category C, high speed craft (HSC), 10 = reserved for future amendment of navigational status for ships carrying dangerous goods (DG), harmful substances (HS) or marine pollutants (MP), or IMO hazard or pollutant category A, wing in ground (WIG); 11 = power-driven vessel towing astern (regional use); 12 = power-driven vessel pushing ahead or towing alongside (regional use); 13 = reserved for future use, 14 = AIS-SART (active), MOB-AIS, EPIRB-AIS 15 = undefined = default (also used by AIS-SART, MOB-AIS and EPIRB-AIS under test)
rot
Rate of turn (ROTAIS)
ROT data should not be derived from COG information.
posAcc
Position accuracy: the position accuracy (PA) flag should be determined in accordance with the table at the reference below:
See https://www.navcen.uscg.gov/?pageName=AISMessagesA#RAIM
raim
RAIM flag: receiver autonomous integrity monitoring (RAIM) flag of the electronic position fixing device; 0 = RAIM not in use = default; 1 = RAIM in use. See the table at https://www.navcen.uscg.gov/?pageName=AISMessagesA#RAIM
Check https://en.wikipedia.org/wiki/Receiver_autonomous_integrity_monitoring
heading
True heading, in degrees (0-359); 511 = not available = default
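To make the encodings above concrete, here is a hedged decoding sketch. It assumes the logged values use the raw AIS units quoted above (1/10 000 min for lon/lat, 1/10 knot for sog, 1/10 degree for cog); if the logger already stores decoded values, the scaling steps are unnecessary.

```python
def decode_position(rec):
    """Decode one logged position report, assuming the raw AIS units described above.

    rec = [timestamp, timestampExternal, mmsi, lon, lat, sog, cog,
           navStat, rot, posAcc, raim, heading]
    """
    (ts, ts_ext, mmsi, lon, lat, sog, cog,
     nav_stat, rot, pos_acc, raim, heading) = rec
    return {
        "mmsi": mmsi,
        # lon/lat are in 1/10 000 min: divide by 600 000 to obtain degrees.
        "lon_deg": None if lon == 181 * 600000 else lon / 600000.0,
        "lat_deg": None if lat == 91 * 600000 else lat / 600000.0,
        "sog_knots": None if sog == 1023 else sog / 10.0,   # 1022 = 102.2 kn or higher
        "cog_deg": None if cog == 3600 else cog / 10.0,
        "heading_deg": None if heading == 511 else heading,
        "nav_status": nav_stat,
        "utc_second": None if ts >= 60 else ts,             # 60-63 are special codes
    }
```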
This file contains the received AIS metadata: the ship static and voyage related data. The structure of the logged parameters is the following: [timestamp, destination, mmsi, callSign, imo, shipType, draught, eta, posType, pointA, pointB, pointC, pointD, name]
timestamp
The timestamp associated with the MQTT message received from www.digitraffic.fi. It is assumed this timestamp is the Epoch time corresponding to when the AIS message was received by digitraffic.fi.
destination
Maximum 20 characters using 6-bit ASCII; @@@@@@@@@@@@@@@@@@@@ = not available. For SAR aircraft, the use of this field may be decided by the responsible administration.
mmsi
MMSI number: the Maritime Mobile Service Identity (MMSI) is a unique 9-digit number assigned to a Digital Selective Calling (DSC) radio or an AIS unit. Check https://en.wikipedia.org/wiki/Maritime_Mobile_Service_Identity
callSign
7 six-bit ASCII characters; @@@@@@@ = not available = default. Craft associated with a parent vessel should use "A" followed by the last 6 digits of the MMSI of the parent vessel. Examples of such craft include towed vessels, rescue boats, tenders, lifeboats, and liferafts.
imo
0 = not available = default – Not applicable to SAR aircraft
Check: https://en.wikipedia.org/wiki/IMO_number
shipType
Check https://www.navcen.uscg.gov/pdf/AIS/AISGuide.pdf and https://www.navcen.uscg.gov/?pageName=AISMessagesAStatic
draught
In 1/10 m; 255 = draught of 25.5 m or greater; 0 = not available = default; in accordance with IMO Resolution A.851. Not applicable to SAR aircraft; should be set to 0.
eta
Estimated time of arrival; MMDDHHMM UTC
For SAR aircraft, the use of this field may be decided by the responsible administration
posType
Type of electronic position fixing device
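A hedged sketch for the two static/voyage fields whose numeric encodings are easiest to misread, eta and draught; it assumes eta is logged as a decimal MMDDHHMM integer (as the field description suggests) and draught in units of 1/10 m.

```python
def decode_eta(eta):
    """Split a decimal MMDDHHMM ETA into (month, day, hour, minute).

    Per the AIS specification, month 0, day 0, hour 24 and minute 60
    indicate 'not available'.
    """
    month = eta // 1000000
    day = (eta // 10000) % 100
    hour = (eta // 100) % 100
    minute = eta % 100
    return month, day, hour, minute


def decode_draught(draught):
    """Draught is reported in 1/10 m; 0 = not available, 255 = 25.5 m or greater."""
    return None if draught == 0 else draught / 10.0
```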
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Novel data streams used for capturing public reaction to the Zika epidemic outbreak.
Cases within HMRC Local Compliance that have yielded in excess of £1m (Indirect Tax) or £5m (Direct Tax). Updated: monthly.
This data package includes the underlying data files to replicate the calculations and charts presented in The online gig economy’s impact is not as big as many thought, PIIE Policy Brief 22-9.
If you use the data, please cite as: Branstetter, Lee (2022). The online gig economy’s impact is not as big as many thought, PIIE Policy Brief 22-9. Peterson Institute for International Economics.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We constructed the publication records for almost all Nobel laureates in physics, chemistry, and physiology or medicine from 1900 to 2016 (545 out of 590, 92.4%). We first collected information manually from the official Nobel Prize website, the laureates' university websites, and Wikipedia. We then matched it algorithmically with big data, tracing publication records from the MAG database.
This data package includes the underlying data and files to replicate the calculations, charts, and tables presented in Aggregate Effects of Budget Stimulus: Evidence from the Large Fiscal Expansions Database. PIIE Working Paper 19-12.
If you use the data, please cite as: Cohen-Setton, Jeremie, Egor Gornostay, and Colombe Ladreit de Lacharrière (2019). Aggregate Effects of Budget Stimulus: Evidence from the Large Fiscal Expansions Database. PIIE Working Paper 19-12. Peterson Institute for International Economics.
This data package includes the underlying data and files to replicate the calculations, charts, and tables presented in United States Is Outlier in Tax Trends in Advanced and Large Emerging Economies, PIIE Policy Brief 17-29. If you use the data, please cite as: Djankov, Simeon. (2017). United States Is Outlier in Tax Trends in Advanced and Large Emerging Economies. PIIE Policy Brief 17-29. Peterson Institute for International Economics.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Zika- and Zika virus-related Google Trends search queries at the global level, November 2015 – October 2016.