The MapR Distribution including Apache Hadoop provides an enterprise-grade distributed data platform to reliably store and process big data. This CentOS 6.6-based image includes MapR 4.0.2 and is built for use with the Sahara MapR plugin. The image is meant to be used with the Sahara project (https://wiki.openstack.org/wiki/Sahara); more details on how to use it are available at https://wiki.openstack.org/wiki/Sahara/AppsCatalogHowTo
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EuroCrops is a dataset collection combining all publicly available self-declared crop reporting datasets from countries of the European Union.
The raw data obtained from the countries does not come in a unified, machine-readable taxonomy. We therefore developed a new Hierarchical Crop and Agriculture Taxonomy (HCAT) that harmonises all declared crops across the European Union. In the shapefiles you'll find these as additional attributes (a short reading example follows the table):
| Attribute Name | Explanation |
|---|---|
| EC_trans_n | The original crop name translated into English |
| EC_hcat_n | The machine-readable HCAT name of the crop |
| EC_hcat_c | The 10-digit HCAT code indicating the hierarchy of the crop |
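As a quick illustration, these harmonised attributes can be read like any other shapefile column, for example with geopandas (a minimal sketch; the file name below is a placeholder, not part of the dataset):

```python
# Minimal sketch: reading the HCAT attributes from one country's EuroCrops shapefile.
# Assumption: geopandas is installed and "country_parcels.shp" is a placeholder path.
import geopandas as gpd

gdf = gpd.read_file("country_parcels.shp")

# The harmonised taxonomy fields are ordinary attribute columns.
print(gdf[["EC_trans_n", "EC_hcat_n", "EC_hcat_c"]].head())

# Example: number of parcels per harmonised crop name.
print(gdf["EC_hcat_n"].value_counts().head(10))
```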
Participating countries
Find detailed information for all countries of the European Union, especially the countries represented in EuroCrops, in our GitHub Wiki.
Please also reference the respective country's original source if you use their data.
Changelog
https://academictorrents.com/nolicensespecified
[Coursera] Web Intelligence and Big Data by Dr. Gautam Shroff
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘List of Top Data Breaches (2004 - 2021)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/hishaamarmghan/list-of-top-data-breaches-2004-2021 on 14 February 2022.
--- Dataset description provided by original source is as follows ---
This is a dataset containing all the major data breaches in the world from 2004 to 2021.
As we know, data privacy is a major issue: even large companies with strong security teams still suffer breaches to this day. To tackle this problem it is worth studying it in depth, so I pulled this data from Wikipedia to conduct data analysis. I would encourage others to take a look as well and find as many insights as possible.
The data contains 5 columns (a short loading sketch follows the list):
1. Entity: the name of the company, organization or institute
2. Year: the year in which the data breach took place
3. Records: how many records were compromised (can include information like emails, passwords etc.)
4. Organization type: the sector to which the organization belongs
5. Method: was it hacked? Were the files lost? Was it an inside job?
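```python
# Minimal sketch: loading the breach list with pandas and summarising it.
# Assumptions: the CSV has been downloaded from Kaggle, the file name below is a
# placeholder, and the column names match the five fields listed above.
import pandas as pd

df = pd.read_csv("data_breaches_2004_2021.csv")

# Records may be stored as text (e.g. with commas), so coerce to numbers first.
df["Records"] = pd.to_numeric(df["Records"], errors="coerce")

# Largest breaches by number of compromised records.
print(df.sort_values("Records", ascending=False)[["Entity", "Year", "Records"]].head())

# Breach counts per method (hacked, lost files, inside job, ...).
print(df["Method"].value_counts())
```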
Here is the source for the dataset: https://en.wikipedia.org/wiki/List_of_data_breaches
Here is the GitHub link for a guide on how it was scraped: https://github.com/hishaamarmghan/Data-Breaches-Scraping-Cleaning
--- Original source retains full ownership of the source dataset ---
https://www.sci-tech-today.com/privacy-policy
Digital Transformation Statistics: Today, businesses are embracing innovative technologies that are rapidly reshaping the digital environment worldwide, a shift that brings challenges of its own. Digital transformation means integrating these technologies to boost productivity, efficiency, and sustainability in operations.
The concept rose to prominence during the COVID-19 pandemic, which ushered in an avalanche of more agile and intelligent ways of doing business. The main technologies driving this transformation include artificial intelligence (AI), big data, and cloud computing, which have diverse applications across different sectors. A key trend in 2024 is for companies to adopt new technologies to remain competitive in their respective fields of business.
With the global digital transformation market projected to reach $3.7 trillion by the end of this year, it is clear that the adoption of cloud computing, automation, and AI has become a major driver of business growth. As more companies adopt digital strategies, market researchers must understand the current trends and statistics that will inform future strategies.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the Julia code package for the Bayesian SVM algorithm described in the ECML PKDD 2017 paper, Wenzel et al.: Bayesian Nonlinear Support Vector Machines for Big Data. Files are provided in .jl format, containing Julia code (Julia is a high-performance dynamic programming language for numerical computing); they can be opened with any freely available text editor. To run the code, please see the description below or the more detailed wiki.
- BSVM.jl: contains the module to run the Bayesian SVM algorithm.
- AFKMC2.jl: file for the Assumption Free K-MC2 algorithm (KMeans).
- KernelFunctions.jl: module for the kernel type.
- DataAccess.jl: module for either generating data or exporting from an existing dataset.
- run_test.jl and paper_experiments.jl: modules to run on a file and compute accuracy on an n-fold cross-validation, and to compute the Brier score and the log score.
- test_functions.jl and paper_experiment_functions.jl: sets of data types and functions for efficient testing.
- ECM.jl: module for expectation conditional maximization (ECM) for the nonlinear Bayesian SVM.
For the datasets used in the related experiments, please see https://doi.org/10.6084/m9.figshare.5443621
Requirements
The BayesianSVM package only works for versions of Julia > 0.5. Other necessary packages will be added automatically during installation. It is also possible to run the package from Python; to do so, please check PyJulia. If you prefer to use R, you can use RJulia. All of these are a bit technical, as Julia is still a young language.
Installation
To install the latest version of the package in Julia, run
Pkg.clone("git://github.com/theogf/BayesianSVM.jl.git")
Running the Algorithm
Here are the basic steps for using the algorithm:
using BayesianSVM
Model = BSVM(X_training,y_training)
Model.Train()
y_predic = sign(Model.Predict(X_test))
y_uncertaintypredic = Model.PredictProb(X_test)
where X_training should be a matrix of size NSamples x NFeatures, and y_training should be a vector of 1 and -1. You can find a more complete description in the wiki.
Background
We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors, such as accurate predictive uncertainty estimates and automatic hyperparameter search. Please also check out our GitHub repository: github.com/theogf/BayesianSVM.jl
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All cities with a population > 1000 or seats of administrative divisions (ca. 80,000).
Sources and Contributions
Sources: GeoNames aggregates over a hundred different data sources.
Ambassadors: GeoNames ambassadors help in many countries.
Wiki: A wiki allows users to view the data and to quickly fix errors and add missing places.
Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.
Enrichment: add country name.
The cleaned bench-testing reconstructions for the gold166 datasets have been put online on GitHub: https://github.com/BigNeuron/Events-and-News/wiki/BigNeuron-Events-and-News and https://github.com/BigNeuron/Data/releases/tag/gold166_bt_v1.0. The respective image datasets were released a while ago from other sites (the main pointer is also available on GitHub: https://github.com/BigNeuron/Data/releases/tag/Gold166_v1), but since the files were large, the actual downloads were distributed across three continents.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Nowadays, a multitude of tracking systems produce massive amounts of maritime data on a daily basis. The most commonly used is the Automatic Identification System (AIS), a collaborative, self-reporting system that allows vessels to broadcast their identification information, characteristics and destination, along with other information originating from on-board devices and sensors, such as location, speed and heading. AIS messages are broadcast periodically and can be received by other vessels equipped with AIS transceivers, as well as by ground-based or satellite-based sensors.
Since the International Maritime Organisation (IMO) made it obligatory for vessels above 300 gross tonnage to carry AIS transponders, large datasets have gradually become available and are now considered a valid source for maritime intelligence [4]. There is now a growing body of literature on methods of exploiting AIS data for the safety and optimisation of seafaring, namely traffic analysis, anomaly detection, route extraction and prediction, collision detection, path planning, weather routing, etc. [5].
As the amount of available AIS data grows to massive scales, researchers are realising that computational techniques must contend with the difficulties of acquiring, storing, and processing the data. Traditional information systems are incapable of dealing with such firehoses of spatiotemporal data, where thousands of data units per second must be ingested while maintaining sub-second query response times.
Processing streaming data exhibits characteristics similar to other big data challenges, such as handling high data volumes and complex data types. While big data batch-processing techniques are sufficient for many applications, for applications such as navigation, timeliness is a top priority: the right decision to steer a vessel away from danger is only useful if it is made in due time. The true challenge lies in the fact that, in order to satisfy real-time application needs, high-velocity data of unbounded size must be processed within constraints on processing time relative to the data size and with finite memory. Research on data streams is gaining attention as a subset of the more general Big Data research field.
Research on such topics requires an uncompressed, uncleaned dataset similar to what would be collected under real-world conditions. This dataset contains all decoded messages collected within a 24h period (starting from 29/02/2020 10 PM UTC) from a single receiver located near the port of Piraeus (Greece). All vessel identifiers, such as IMO and MMSI, have been anonymised, and no down-sampling, filtering or cleaning has been applied.
The schema of the dataset is provided below (a minimal loading sketch follows the field list):
· t: the time at which the message was received (UTC)
· shipid: the anonymized id of the ship
· lon: the longitude of the current ship position
· lat: the latitude of the current ship position
· heading: the direction in which the ship's bow is pointing (see: https://en.wikipedia.org/wiki/Course_(navigation))
· course: the direction in which the ship moves (see: https://en.wikipedia.org/wiki/Course_(navigation))
· speed: the speed of the ship (measured in knots)
· shiptype: AIS reported ship-type
· destination: AIS reported destination
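A minimal loading sketch, assuming the decoded messages are distributed as a CSV file with the column names above (the file name below is a placeholder):

```python
# Minimal sketch: loading the 24h AIS dump and inspecting it with pandas.
# Assumptions: CSV format, the placeholder file name "piraeus_ais_24h.csv", and a
# timestamp column "t" that pandas can parse as a datetime.
import pandas as pd

ais = pd.read_csv("piraeus_ais_24h.csv", parse_dates=["t"])

# Number of messages received per (anonymised) ship over the 24h window.
print(ais.groupby("shipid").size().sort_values(ascending=False).head())

# Moving vessels only: positions with reported course and speed.
moving = ais[ais["speed"] > 0.5]
print(moving[["t", "shipid", "lon", "lat", "course", "speed"]].head())
```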
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Detailed collections data of bumble bee samples used in analysis
https://creativecommons.org/publicdomain/zero/1.0/
Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. It allows secure peer-to-peer communication by linking blocks that contain hash pointers to the previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) which leverages the blockchain to store transactions in a distributed manner in order to mitigate flaws in the financial industry.
Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.
In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset, which is updated every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
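For example, a minimal query with the BigQuery Python client library (a sketch only; the block_timestamp column referenced in the query is an assumption to verify against the actual table schema):

```python
# Minimal sketch: querying the public crypto_bitcoin dataset with the BigQuery
# Python client library. Assumptions: google-cloud-bigquery is installed and
# credentials are available (as they are in Kaggle Kernels); the block_timestamp
# column is assumed and should be checked against the transactions table schema.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  DATE(block_timestamp) AS day,
  COUNT(*) AS tx_count
FROM `bigquery-public-data.crypto_bitcoin.transactions`
WHERE block_timestamp >= TIMESTAMP '2018-01-01'
GROUP BY day
ORDER BY day
LIMIT 10
"""

# Print the number of transactions per day for the first ten days returned.
for row in client.query(query).result():
    print(row.day, row.tx_count)
```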
Allen Day (Twitter | Medium), Google Cloud Developer Advocate, and Colin Bookman, Google Cloud Customer Engineer, retrieve data from the Bitcoin network using a custom client, available on GitHub, that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk into two BigQuery tables, blocks_raw and transactions. These tables remain fresh, as new data are appended whenever new blocks are broadcast to the Bitcoin network. For additional information, visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".
Surface water grab samples collected periodically and analyzed for a broad spectrum of physical and chemical constituents
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cluster expansion (CE) has gained an increasing level of popularity in recent years, and many strategies have been proposed for training and fitting CE models to first-principles calculation results. The paper reports a new strategy for constructing a training set by selecting structures based on their relevance in Monte Carlo sampling, for statistical analysis and reduction of the expected error. We call the new strategy a "bootstrapping uncertainty structure selection" (BUSS) scheme and compare its performance against a popular scheme that uses a combination of random structures and ground-state search (referred to as RGS). The provided dataset contains the training sets generated using BUSS and RGS for constructing a CE model for the disordered Cu2ZnSnS4 material. The files are in the format of the Atomic Simulation Environment (ASE) database (please refer to the ASE documentation for more information: https://wiki.fysik.dtu.dk/ase/index.html). Each .db file contains 100 DFT calculations, which were generated over iteration cycles. Each iteration cycle is referred to as a generation (marked with the gen key in the database), and each database contains 10 generations, where each generation consists of 10 training structures. See the paper for more details.
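For instance, the structures can be browsed with ASE's database interface (a minimal sketch; the file name below is a placeholder, and the gen key is used as described above):

```python
# Minimal sketch: reading one of the provided ASE .db files and grouping rows by
# generation. Assumption: "buss_training.db" is a placeholder file name.
from ase.db import connect

db = connect("buss_training.db")

# Count the training structures in each generation (the 'gen' key described above).
counts = {}
for row in db.select():
    counts[row.gen] = counts.get(row.gen, 0) + 1

for gen in sorted(counts):
    print(f"generation {gen}: {counts[gen]} structures")

# Convert one row from the earliest generation into an ASE Atoms object for analysis.
first_gen = min(counts)
atoms = next(db.select(gen=first_gen)).toatoms()
print(atoms)
```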
This data release consists of flux tower measurements of the exchange of energy and mass between the surface and the atmospheric boundary layer in semi-arid eucalypt woodland, using eddy covariance techniques. It has been processed using PyFluxPro (v3.3.3) as described in Isaac et al. (2017), https://doi.org/10.5194/bg-14-2903-2017. PyFluxPro takes data recorded at the flux tower and processes them into a final, gap-filled product with Net Ecosystem Exchange (NEE) partitioned into Gross Primary Productivity (GPP) and Ecosystem Respiration (ER). For more information about the processing levels, see https://github.com/OzFlux/PyFluxPro/wiki.
The Great Western Woodlands (GWW) comprise a 16 million hectare mosaic of temperate woodland, shrubland and mallee vegetation in south-west Western Australia. The region has remained relatively intact since European settlement, owing to the variable rainfall and lack of readily accessible groundwater. The woodland component is globally unique in that nowhere else do woodlands occur at as little as 220 mm mean annual rainfall. Furthermore, other temperate woodlands around the world have typically become highly fragmented and degraded through agricultural use. The Great Western Woodlands Site was established in 2012 in the Credo Conservation Reserve. The site is in semi-arid woodland and was operated as a pastoral lease from 1907 to 2007. The core 1 ha plot is characterised by Eucalyptus salmonophloia (salmon gum), with Eucalyptus salubris and Eucalyptus clelandii dominating other research plots. The flux station is located in salmon gum woodland. For additional site information, see https://www.tern.org.au/tern-observatory/tern-ecosystem-processes/great-western-woodlands-supersite/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides values for GOLD RESERVES reported in several countries. The data includes current values, previous releases, historical highs and record lows, release frequency, reported unit and currency.