37 datasets found
1. Big Data Statistics By Market Size, Usage, Adoption and Facts (2025)

    • sci-tech-today.com
    Updated Nov 3, 2025
    Cite
    Sci-Tech Today (2025). Big Data Statistics By Market Size, Usage, Adoption and Facts (2025) [Dataset]. https://www.sci-tech-today.com/stats/big-data-statistics/
    Explore at:
    Dataset updated
    Nov 3, 2025
    Dataset authored and provided by
    Sci-Tech Today
    License

https://www.sci-tech-today.com/privacy-policy

    Time period covered
    2022 - 2032
    Area covered
    Global
    Description

    Introduction

Big Data Statistics: By 2025, the world is expected to generate 181 zettabytes of data, reflecting an average annual growth rate of 23.13%, with about 2.5 quintillion bytes created every day. That works out to an influx of roughly 29 terabytes every second.
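As a quick sanity check on that conversion, a few lines of Python reproduce the figure (2.5 quintillion taken as 2.5 × 10^18 bytes, and 1 TB as 10^12 bytes):

    # Convert 2.5 quintillion bytes per day into terabytes per second.
    BYTES_PER_DAY = 2.5e18          # 2.5 quintillion bytes
    SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
    tb_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 1e12
    print(round(tb_per_second, 1))  # ~28.9, i.e. roughly 29 TB per second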

    While over 97% of businesses are now investing in Big Data, only about 40% are truly effective at leveraging analytics. The global Big Data analytics market, valued at approximately $348.21 billion in 2024, is on a trajectory to reach over $961.89 billion by 2032, exhibiting a robust CAGR of 13.5%.

The dominance of unstructured data, which constitutes about 90% of all global data, underscores the scale of the statistical challenge. Industries are being fundamentally reshaped by Big Data, from healthcare, which could save up to $300 billion annually in the US alone through data initiatives, to entertainment, where Netflix saves an estimated $1 billion per year through data-driven recommendation algorithms. The rest of this article walks through the key Big Data statistics, from current insights to future forecasts.

  2. Wiki Words

    • kaggle.com
    zip
    Updated Mar 7, 2017
    Cite
    Samarth Bharadwaj (2017). Wiki Words [Dataset]. https://www.kaggle.com/dataistic/wiki-words
    Explore at:
zip (427584 bytes)
    Dataset updated
    Mar 7, 2017
    Authors
    Samarth Bharadwaj
    Description

    Context

Human communication has evolved greatly over time. Speech, text, images, and videos are the channels we most often use to communicate and to store and share information. Text is one of the primary modes of formal communication and will likely continue to be so for quite some time.

I wonder how many words a person types in a lifetime across emails, text messages, and documents. The count might run into the millions. We are accustomed to keying in words without worrying much about the effort involved in typing them, the origin of a word, or the correlation between its meaning and its textual representation: 'Big' is actually shorter than 'Small' going by word length alone.

I had some questions that, I thought, could best be answered by analyzing the big data we are surrounded with today. With data volumes growing at such high rates, can we bring about some kind of optimization or restructuring in word usage so that we benefit in terms of data storage, transmission, and processing? Would scanning more documents provide better automated suggestions in email and chat, based on which word usually follows a particular one, and assist in quicker sentence completion?

1. Which set of words in the globally available text content, if identified and condensed, would reduce the overall storage space required?

2. Which set of words in regular usage (email, text messages, documents), if condensed, would reduce the total effort involved in typing (keying in the text) and reduce the overall size of the text content, which in turn could mean less transmission time, less storage space, and less processing time for applications that consume these data for analysis and decision making?

To answer these, we might have to parse the entire web and almost every email, message, blog post, tweet, and piece of machine-generated content that exists or will be generated on every phone, laptop, computer, and server, by every person and bot. Considering the tons of text lying around in databases across the world (webpages, Wikipedia, text archives, digital libraries) and the multiple versions and copies of this content, parsing it all would be a humongous task. Fresh data is continually generated from various sources; the plate is never empty if data is produced faster than the available processing capability can consume it.

Here is an attempt to analyze a tiny chunk of that data, to see whether the outcome is significant enough to take note of when the finding is generalized and extrapolated to larger databases.

    Content

Looking for a reliable source, I could not think of anything better than the Wikipedia database of webpages. Wiki articles are available for download as HTML dumps for offline processing (https://dumps.wikimedia.org/other/static_html_dumps/). The dump I downloaded is a ~40 GB compressed file, which turned into a ~208 GB folder containing ~15 million files upon extraction.

With my newly acquired R skills, I parsed the HTML pages and extracted the distinct words with their total count in the page paragraphs. I consolidated the output from the first million of the 15 million available HTML files. The attached dataset "WikiWords_FirstMillion.csv" is a comma-separated file listing the words and their counts. There are two columns, word and count: the "word" column contains distinct words as extracted from the paragraphs of the wiki pages, and the "count" column holds the number of occurrences across the one million wiki pages. Non-alphanumeric characters were removed at the time of text extraction.

Any sequence of characters separated by spaces is included in the list of words, and the counts are presented as-is, without any filters. To get better estimates, it is reasonable to make suitable assumptions, such as reducing words to their roots or ignoring words that are specific to Wikipedia pages (Welcome, Wikipedia, Articles, Pages, Edit, Contribution, ...).
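For readers who want to explore the file, here is a minimal Python/pandas sketch; the file name and the word/count columns come from the description above, while the stop list of Wikipedia-specific words is only illustrative:

    import pandas as pd

    # Load the word/count table extracted from the first million Wikipedia pages.
    df = pd.read_csv("WikiWords_FirstMillion.csv")

    # Drop words that look specific to Wikipedia itself (illustrative stop list).
    wiki_specific = {"Welcome", "Wikipedia", "Articles", "Pages", "Edit", "Contribution"}
    df = df[~df["word"].isin(wiki_specific)]

    # Rank words by total characters typed (length x frequency): condensing the
    # top-ranked words would save the most typing and storage.
    df["chars_typed"] = df["word"].str.len() * df["count"]
    print(df.sort_values("chars_typed", ascending=False).head(20))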

    Acknowledgements

Wikimedia, for providing the offline dumps. The R community, for the software, packages, blog posts, articles, suggestions, and solutions on the Q&A sites.

    Inspiration

1. Suppose the entire English-language community across the world decided to designate every letter of the alphabet as a word (apart from 'A' and 'I', all other letters seem to be potential candidates to become one-lettered words):

(a) Which 24 words from the dataset are the most eligible to be upgraded to one-letter words, assuming it is decided to replace the existing words with the newly designated one-lettered words to achieve storage efficiency?

(b) Assuming the word counts in the dataset are a fair estimate of the composition of the words available in the global text content (say we do a "Find" and "Replace" on the global text content): if the current big data size is 3 exabytes (10^18 bytes), and say 30% of i...

  3. Data from: WikiHist.html: English Wikipedia's Full Revision History in HTML...

    • zenodo.org
    application/gzip, zip
    Updated Jun 8, 2020
    Cite
Blagoj Mitrevski; Tiziano Piccardi; Robert West (2020). WikiHist.html: English Wikipedia's Full Revision History in HTML Format [Dataset]. http://doi.org/10.5281/zenodo.3605388
    Explore at:
application/gzip, zip
    Dataset updated
    Jun 8, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
Blagoj Mitrevski; Tiziano Piccardi; Robert West
    License

    Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Introduction

    Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.

We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity for correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia's full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history spanning 18 years, from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.

    For more details, please refer to the description below and to the dataset paper:
    Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.
    https://arxiv.org/abs/2001.10256

    When using the dataset, please cite the above paper.

    Dataset summary

    The dataset consists of three parts:

    1. English Wikipedia’s full revision history parsed to HTML,
    2. a table of the creation times of all Wikipedia pages (page_creation_times.json.gz),
    3. a table that allows for resolving redirects for any point in time (redirect_history.json.gz).

    Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.

    Getting the data

Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7 TB in size -- too large for Zenodo -- and is therefore hosted externally on the Internet Archive. For downloading part 1, you have multiple options:

    Dataset details

    Part 1: HTML revision history
    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format
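As a minimal reading sketch in Python: this assumes each gzip-compressed file holds one JSON object per line (consistent with "each row ... represents one article revision" above), and the file name inside the directory is purely illustrative:

    import gzip
    import json

    # Iterate over the revisions stored in one gzipped JSON file (1,000 revisions per file).
    path = "enwiki-20190301-pages-meta-history1.xml-p1p123/revisions-000.json.gz"  # illustrative
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            rev = json.loads(line)
            # Fields as listed above; 'html' holds the parsed revision content.
            print(rev["page_id"], rev["id"], rev["timestamp"], len(rev["html"]))
            break  # inspect only the first revision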

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created

    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns: namespace (0 for articles)
    • revision_id: revision id of redirect source
    • timestamp: time at which redirect became active
    • redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)

    The repository also contains two additional files, metadata.zip and mysql_database.zip. These two files are not part of WikiHist.html per se, and most users will not need to download them manually. The file metadata.zip is required by the download script (and will be fetched by the script automatically), and mysql_database.zip is required by the code used to produce WikiHist.html. The code that uses these files is hosted at GitHub, but the files are too big for GitHub and are therefore hosted here.

    WikiHist.html was produced by parsing the 1 March 2019 dump of https://dumps.wikimedia.org/enwiki/20190301 from wikitext to HTML. That old dump is not available anymore on Wikimedia's servers, so we make a copy available at https://archive.org/details/enwiki-20190301-original-full-history-dump_dlab .

  4. Namu Wiki

    • kaggle.com
    zip
    Updated Oct 9, 2021
    Cite
    brainer (2021). Namu Wiki [Dataset]. https://www.kaggle.com/brainer3220/namu-wiki
    Explore at:
zip (4100527485 bytes)
    Dataset updated
    Oct 9, 2021
    Authors
    brainer
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

Namu Wiki provides a dump file for crawling, big data EDA, and other uses.

I downloaded the official JSON dump file and converted it to CSV.

    Content

Namu_{year, month, day}.csv

• title: Title
• text: Content

    Inspiration

Namu Wiki is the biggest Korean wiki.

That means this is a large Korean dataset with great potential.

    License

Originally, the Namu Wiki (나무위키) license is CC BY-NC-SA 2.0 KR, but Kaggle only offers CC BY-NC-SA 4.0, so I changed the license accordingly.

  5. Wikipedia Knowledge Graph dataset

    • zenodo.org
    • produccioncientifica.ugr.es
    • +2more
    pdf, tsv
    Updated Jul 17, 2024
    Cite
Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas (2024). Wikipedia Knowledge Graph dataset [Dataset]. http://doi.org/10.5281/zenodo.6346900
    Explore at:
tsv, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
Wenceslao Arroyo-Machado; Daniel Torres-Salinas; Rodrigo Costas
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

Wikipedia is the largest and most read free online encyclopedia currently in existence. As such, Wikipedia offers a large amount of data on all of its contents and the interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to get an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. In order to reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, we have collected data from various sources, processed it, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful to a wide range of researchers, such as informetricians, sociologists, and data scientists.

There are a total of 9 files, all in tsv format, built under a relational structure. The main file, which acts as the core of the dataset, is the page file; alongside it there are 4 files with different entities related to the Wikipedia pages (the category, url, pub, and page_property files) and 4 further files that act as "intermediate tables", making it possible to connect the pages both with the former entities and with other pages (the page_category, page_url, page_pub, and page_link files).

    The document Dataset_summary includes a detailed description of the dataset.
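A minimal pandas sketch for a first look at the relational structure; the exact file names and join keys below are assumptions, so consult Dataset_summary for the authoritative schema:

    import pandas as pd

    # Core table plus one intermediate table (file names assumed from the description).
    pages = pd.read_csv("page.tsv", sep="\t")
    page_category = pd.read_csv("page_category.tsv", sep="\t")

    # Inspect the actual column names before joining.
    print(pages.columns.tolist())
    print(page_category.columns.tolist())

    # Hypothetical join via the intermediate table, using whatever page-id key
    # the Dataset_summary documents:
    # linked = page_category.merge(pages, on="page_id", how="left")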

    Thanks to Nees Jan van Eck and the Centre for Science and Technology Studies (CWTS) for the valuable comments and suggestions.

  6. Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their...

    • zenodo.org
    bin, zip
    Updated Feb 12, 2025
    Cite
Jan Göpfert; Patrick Kuckertz; Jann M. Weinand; Detlef Stolten (2025). Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia [Dataset]. http://doi.org/10.5281/zenodo.14858280
    Explore at:
zip, bin
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
Jan Göpfert; Patrick Kuckertz; Jann M. Weinand; Detlef Stolten
    Description

    The task of measurement extraction is typically approached in a pipeline manner, where 1) quantities are identified before 2) their individual measurement context is extracted (see our review paper). To support the development and evaluation of systems for measurement extraction, we present two large datasets that correspond to the two tasks:

    • Wiki-Quantities, a dataset for identifying quantities, and
    • Wiki-Measurements, a dataset for extracting measurement context for given quantities.

    The datasets are heuristically generated from Wikipedia articles and Wikidata facts. For a detailed description of the datasets, please refer to the upcoming corresponding paper:

    Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia. 2025. Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, and Detlef Stolten.

    Versions

    The datasets are released in different versions:

    • Processing level: the pre-processed versions can be used directly for training and evaluating models, while the raw versions can be used to create custom pre-processed versions or for other purposes. Wiki-Quantities is pre-processed for IOB sequence labeling, while Wiki-Measurements is pre-processed for SQuAD-style generative question answering.
    • Filtering level:
      • Wiki-Quantities is available in a raw, large, small, and tiny version: The raw version is the original version, which includes all the samples originally obtained. In the large version, all duplicates and near duplicates present in the raw version are removed. The small and tiny versions are subsets of the large version which are additionally filtered to balance the data with respect to units, properties, and topics.
• Wiki-Measurements is available in a large, small, large_strict, small_strict, small_context, and large_strict_context version: The large version contains all examples minus a few duplicates. The small version is a subset of the large version with very similar examples removed. In the context versions, additional sentences are added around the annotated sentence. In the strict versions, the quantitative facts are more strictly aligned with the text.
    • Quality: all data has been automatically annotated using heuristics. In contrast to the silver data, the gold data has been manually curated.

    Format

The datasets are stored in JSON format. The pre-processed versions are formatted for direct use in IOB sequence labeling or SQuAD-style generative question answering in NLP frameworks such as Huggingface Transformers. In the versions that are not pre-processed, annotations are visualized using emojis to facilitate curation. For example:

    • Wiki-Quantities (only quantities annotated):
      • "In a 🍏100-gram🍏 reference amount, almonds supply 🍏579 kilocalories🍏 of food energy."
      • "Extreme heat waves can raise readings to around and slightly above 🍏38 °C🍏, and arctic blasts can drop lows to 🍏−23 °C to 0 °F🍏."
      • "This sail added another 🍏0.5 kn🍏."
    • Wiki-Measurements (measurement context for a single quantity; qualifiers and quantity modifiers are only sparsely annotated):
      • "The 🔭French national census🔭 of 📆2018📆 estimated the 🍊population🍊 of 🌶️Metz🌶️ to be 🍐116,581🍐, while the population of Metz metropolitan area was about 368,000."
      • "The 🍊surface temperature🍊 of 🌶️Triton🌶️ was 🔭recorded by Voyager 2🔭 as 🍐-235🍐 🍓°C🍓 (-391 °F)."
      • "🙋The Babylonians🙋 were able to find that the 🍊value🍊 of 🌶️pi🌶️ was ☎️slightly greater than☎️ 🍐3🍐, by simply 🔭making a big circle and then sticking a piece of rope onto the circumference and the diameter, taking note of their distances, and then dividing the circumference by the diameter🔭."

    The mapping of annotation types to emojis is as follows:

    • Basic quantitative statement:
      • Entity: 🌶️
      • Property: 🍊
      • Quantity: 🍏
      • Value: 🍐
      • Unit: 🍓
      • Quantity modifier: ☎️
    • Qualifier:
      • Temporal scope: 📆
      • Start time: ⏱️
      • End time: ⏰️
      • Location: 📍
      • Reference: 🙋
      • Determination method: 🔭
      • Criterion used: 📏
      • Applies to part: 🦵
      • Scope: 🔎
      • Some qualifier: 🛁

    Note that for each version of Wiki-Measurements sample IDs are randomly assigned. Therefore, they are not consistent, e.g., between silver small and silver large. The proportions of train, dev, and test sets are unusual because Wiki-Quantities and Wiki-Measurements are intended to be used in conjunction with other non-heuristically generated data.
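To illustrate how the emoji-annotated raw format can be consumed, the following small Python sketch pulls the quantity spans (wrapped in 🍏 ... 🍏) out of a raw Wiki-Quantities sentence; the regex approach is my own illustration, not part of the dataset tooling:

    import re

    sentence = ("In a 🍏100-gram🍏 reference amount, almonds supply "
                "🍏579 kilocalories🍏 of food energy.")

    # Quantities are wrapped in pairs of 🍏 markers in the raw Wiki-Quantities format.
    quantities = re.findall(r"🍏(.*?)🍏", sentence)
    print(quantities)  # ['100-gram', '579 kilocalories']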

    Evaluation

    The evaluation directories contain the manually validated random samples used for evaluation. The evaluation is based on the large versions of the datasets. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements showed that 100% of the Wiki-Quantities samples and 94% (or 84% if strictly scored) of the Wiki-Measurements samples were correct.

    License

    In accordance with Wikipedia's and Wikidata's licensing terms, the datasets are released under the CC BY-SA 4.0 license, except for Wikidata facts in ./Wiki-Measurements/raw/additional_data.json, which are released under the CC0 1.0 license (the texts are still CC BY-SA 4.0).

    About Us

We are the Institute of Climate and Energy Systems (ICE) - Jülich Systems Analysis, belonging to the Forschungszentrum Jülich. Our interdisciplinary department's research focuses on energy-related process and systems analyses. Data searches and system simulations are used to determine energy and mass balances, as well as to evaluate the performance, emissions, and costs of energy systems. The results are used for performing comparative assessment studies between the various systems. Our current priorities include the development of energy strategies, in accordance with the German Federal Government's greenhouse gas reduction targets, by designing new infrastructures for sustainable and secure energy supply chains and by conducting cost analysis studies for integrating new technologies into future energy market frameworks.

    Acknowledgements

    The authors would like to thank the German Federal Government, the German State Governments, and the Joint Science Conference (GWK) for their funding and support as part of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) – project number: 442146713. Furthermore, this work was supported by the Helmholtz Association under the program "Energy System Design".

  7. Code for Fast and Scalable Implementation of the Bayesian SVM

    • springernature.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Florian Wenzel; Théo Galy-Fajou; Matthäus Deutsch; Marius Kloft (2023). Code for Fast and Scalable Implementation of the Bayesian SVM [Dataset]. http://doi.org/10.6084/m9.figshare.5443627.v1
    Explore at:
txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Florian Wenzel; Théo Galy-Fajou; Matthäus Deutsch; Marius Kloft
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

This dataset contains the Julia code package for the Bayesian SVM algorithm described in the ECML PKDD 2017 paper, Wenzel et al.: Bayesian Nonlinear Support Vector Machines for Big Data. Files are provided in .jl format, containing Julia language code (Julia is a high-performance dynamic programming language for numerical computing). These files can be opened with any openly available text editor. To run the code, please see the description below or the more detailed wiki.

• BSVM.jl - contains the module to run the Bayesian SVM algorithm.
• AFKMC2.jl - file for the Assumption Free K-MC2 algorithm (KMeans).
• KernelFunctions.jl - module for the kernel type.
• DataAccess.jl - module for either generating data or exporting from an existing dataset.
• run_test.jl and paper_experiments.jl - modules to run on a file and compute accuracy on an n-fold cross validation, as well as the Brier score and the log score.
• test_functions.jl and paper_experiment_functions.jl - sets of datatypes and functions for efficient testing.
• ECM.jl - module for expectation conditional maximization (ECM) for the nonlinear Bayesian SVM.

For the datasets used in the related experiments, please see https://doi.org/10.6084/m9.figshare.5443621

Requirements

BayesianSVM only works with versions of Julia > 0.5. Other necessary packages will be added automatically during installation. It is also possible to run the package from Python (see PyJulia) or from R (see RJulia), although these routes are a bit technical because Julia is still a young language.

Installation

To install the latest version of the package in Julia, run Pkg.clone("git://github.com/theogf/BayesianSVM.jl.git")

Running the Algorithm

Here are the basic steps for using the algorithm:

    using BayesianSVM
    Model = BSVM(X_training, y_training)
    Model.Train()
    y_predic = sign(Model.Predict(X_test))
    y_uncertaintypredic = Model.PredictProb(X_test)

where X_training should be a matrix of size NSamples x NFeatures, and y_training should be a vector of 1 and -1. You can find a more complete description in the wiki.

Background

We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors such as accurate predictive uncertainty estimates and automatic hyperparameter search. Please also check out our GitHub repository: github.com/theogf/BayesianSVM.jl

  8. EuroCrops

    • zenodo.org
    zip
    Updated Aug 10, 2023
    Cite
Maja Schneider; Marco Körner (2023). EuroCrops [Dataset]. http://doi.org/10.5281/zenodo.6866847
    Explore at:
zip
    Dataset updated
    Aug 10, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
Maja Schneider; Marco Körner
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EuroCrops is a dataset collection combining all publicly available self-declared crop reporting datasets from countries of the European Union.

    The raw data obtained from the countries does not come in a unified, machine-readable taxonomy. We, therefore, developed a new Hierarchical Crop and Agriculture Taxonomy (HCAT) that harmonises all declared crops across the European Union. In the shapefiles you'll find these as additional attributes:

Harmonisation with HCAT

• EC_trans_n: the original crop name translated into English
• EC_hcat_n: the machine-readable HCAT name of the crop
• EC_hcat_c: the 10-digit HCAT code indicating the hierarchy of the crop
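As a minimal GeoPandas sketch for inspecting these attributes (the shapefile path below is a placeholder; the attribute names are the ones listed above):

    import geopandas as gpd

    # Load one country's harmonised crop parcels (placeholder file name).
    parcels = gpd.read_file("eurocrops_country.shp")

    # The EuroCrops harmonisation attributes described above.
    print(parcels[["EC_trans_n", "EC_hcat_n", "EC_hcat_c"]].head())

    # Count parcels per harmonised HCAT crop name.
    print(parcels["EC_hcat_n"].value_counts().head(10))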

    Participating countries

Find detailed information for all countries of the European Union, especially those represented in EuroCrops, in our GitHub Wiki.

Please also reference each country's original data source in case you are using their data.

  9. Single Ground Based AIS Receiver Vessel Tracking Dataset

    • data.europa.eu
    • data.niaid.nih.gov
    • +2more
    unknown
    Updated Apr 18, 2021
    Cite
    Zenodo (2021). Single Ground Based AIS Receiver Vessel Tracking Dataset [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3754481?locale=hr
    Explore at:
unknown (8174664)
    Dataset updated
    Apr 18, 2021
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

Nowadays, a multitude of tracking systems produce massive amounts of maritime data on a daily basis. The most commonly used is the Automatic Identification System (AIS), a collaborative, self-reporting system that allows vessels to broadcast their identification information, characteristics, and destination, along with other information originating from on-board devices and sensors, such as location, speed, and heading. AIS messages are broadcast periodically and can be received by other vessels equipped with AIS transceivers, as well as by ground-based or satellite-based sensors. Since the International Maritime Organisation (IMO) made it obligatory for vessels above 300 gross tonnage to carry AIS transponders, large datasets have gradually become available and are now considered a valid source for maritime intelligence [4].

There is now a growing body of literature on methods of exploiting AIS data for the safety and optimisation of seafaring, namely traffic analysis, anomaly detection, route extraction and prediction, collision detection, path planning, weather routing, etc. [5]. As the amount of available AIS data grows to massive scales, researchers are realising that computational techniques must contend with difficulties faced when acquiring, storing, and processing the data. Traditional information systems are incapable of dealing with such firehoses of spatiotemporal data, where they are required to ingest thousands of data units per second while performing sub-second query response times. Processing streaming data exhibits similar characteristics to other big data challenges, such as handling high data volumes and complex data types. While for many applications big data batch processing techniques are sufficient, for applications such as navigation, timeliness is a top priority: making the right decision to steer a vessel away from danger is only useful if the decision is made in due time. The true challenge lies in the fact that, in order to satisfy real-time application needs, high-velocity data of unbounded size must be processed with finite memory relative to the data size. Research on data streams is gaining attention as a subset of the more generic big data research field, and research on such topics requires an uncompressed, uncleaned dataset similar to what would be collected in real-world conditions.

This dataset contains all decoded messages collected within a 24-hour period (starting from 29/02/2020 10 PM UTC) from a single receiver located near the port of Piraeus (Greece). All vessel identifiers such as IMO and MMSI have been anonymised, and no down-sampling, filtering, or cleaning has been applied. The schema of the dataset is provided below:

• t: the time at which the message was received (UTC)
• shipid: the anonymized id of the ship
• lon: the longitude of the current ship position
• lat: the latitude of the current ship position
• heading: the heading of the ship (see: https://en.wikipedia.org/wiki/Course_(navigation))
• course: the direction in which the ship moves (see: https://en.wikipedia.org/wiki/Course_(navigation))
• speed: the speed of the ship (measured in knots)
• shiptype: AIS reported ship-type
• destination: AIS reported destination
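A minimal loading sketch in Python/pandas, assuming the decoded messages are distributed as a single comma-separated file without a header row (both assumptions; adjust to the actual download):

    import pandas as pd

    cols = ["t", "shipid", "lon", "lat", "heading", "course",
            "speed", "shiptype", "destination"]

    # File name and CSV layout are assumptions; adjust to the actual file.
    ais = pd.read_csv("piraeus_ais_24h.csv", names=cols, header=None)

    # Example: number of messages received per ship over the 24-hour window.
    print(ais.groupby("shipid").size().sort_values(ascending=False).head())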

  10. fluTwitterData.csv: Data file containing weekly ILI and tweet counts from A...

    • rs.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Lewis Mitchell; Joshua V. Ross (2023). fluTwitterData.csv: Data file containing weekly ILI and tweet counts from A data-driven model for influenza transmission incorporating media effects [Dataset]. http://doi.org/10.6084/m9.figshare.4021752.v1
    Explore at:
txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Royal Societyhttp://royalsociety.org/
    Authors
    Lewis Mitchell; Joshua V. Ross
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Numerous studies have attempted to model the effect of mass media on the transmission of diseases such as influenza; however, quantitative data on media engagement has until recently been difficult to obtain. With the recent explosion of 'big data' coming from online social media and the like, large volumes of data on a population's engagement with mass media during an epidemic are becoming available to researchers. In this study, we combine an online dataset comprising millions of shared messages relating to influenza with traditional surveillance data on flu activity to suggest a functional form for the relationship between the two. Using these data, we present a simple deterministic model for influenza dynamics incorporating media effects, and show that such a model helps explain the dynamics of historical influenza outbreaks. Furthermore, through model selection we show that the proposed media function fits historical data better than other media functions proposed in earlier studies.

  11. Table S2 from An integrative machine learning approach to discovering...

    • rs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated May 31, 2023
    Cite
    Milla Kibble; Suleiman A. Khan; Muhammad Ammad-ud-din; Sailalitha Bollepalli; Teemu Palviainen; Jaakko Kaprio; Kirsi H. Pietiläinen; Miina Ollikainen (2023). Table S2 from An integrative machine learning approach to discovering multi-level molecular mechanisms of obesity using data from monozygotic twin pairs [Dataset]. http://doi.org/10.6084/m9.figshare.13102744.v1
    Explore at:
xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Royal Societyhttp://royalsociety.org/
    Authors
    Milla Kibble; Suleiman A. Khan; Muhammad Ammad-ud-din; Sailalitha Bollepalli; Teemu Palviainen; Jaakko Kaprio; Kirsi H. Pietiläinen; Miina Ollikainen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Table of CpGs used in the analysis and the sources from which they were chosen. CpG sites were selected from the recent meta-analysis of Wahl et al. (110), which shows that BMI is associated with widespread changes in DNA methylation and whose genetic association analyses demonstrate that the alterations in DNA methylation are predominantly the consequence of adiposity, rather than the cause. We also include CpGs associated with elevated liver fat (19), CpGs whose methylation has been previously shown to differ in the adipose tissue of BMI-discordant MZ twin pairs (111), smoking-associated CpGs (112), and CpG sites that have been associated with weight loss (113). The number between dollar signs links to the source from which each CpG was chosen, and is also included in the component diagrams.

  12. Table S3 from An integrative machine learning approach to discovering...

    • rs.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Milla Kibble; Suleiman A. Khan; Muhammad Ammad-ud-din; Sailalitha Bollepalli; Teemu Palviainen; Jaakko Kaprio; Kirsi H. Pietiläinen; Miina Ollikainen (2023). Table S3 from An integrative machine learning approach to discovering multi-level molecular mechanisms of obesity using data from monozygotic twin pairs [Dataset]. http://doi.org/10.6084/m9.figshare.13102756.v1
    Explore at:
xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Royal Societyhttp://royalsociety.org/
    Authors
    Milla Kibble; Suleiman A. Khan; Muhammad Ammad-ud-din; Sailalitha Bollepalli; Teemu Palviainen; Jaakko Kaprio; Kirsi H. Pietiläinen; Miina Ollikainen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Table of the 63 dietary variables used in the analysis

13. large_spanish_corpus

    • huggingface.co
    Updated Apr 20, 2019
    Cite
    José Cañete (2019). large_spanish_corpus [Dataset]. https://huggingface.co/datasets/josecannete/large_spanish_corpus
    Explore at:
    Dataset updated
    Apr 20, 2019
    Authors
    José Cañete
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Large Spanish Corpus is a compilation of 15 unlabelled Spanish corpora spanning Wikipedia to European parliament notes. Each config contains the data corresponding to a different corpus. For example, "all_wiki" only includes examples from Spanish Wikipedia. By default, the config is set to "combined" which loads all the corpora; with this setting you can also specify the number of samples to return per corpus by configuring the "split" argument.
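A minimal sketch of loading one config with the Hugging Face datasets library; the repository id is taken from the citation above, and details may differ from the current hub version:

    from datasets import load_dataset

    # Load only the Spanish Wikipedia portion of the corpus.
    wiki = load_dataset("josecannete/large_spanish_corpus", "all_wiki")
    print(wiki)

    # The default "combined" config loads all 15 corpora.
    # combined = load_dataset("josecannete/large_spanish_corpus", "combined")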

14. A Large-Scale AIS Datset from Finnish Water

    • data.niaid.nih.gov
    Updated Sep 25, 2024
    Cite
    Debayan Bhattacharya; Ikram Ul Haq; Carlos Pichardo Vicencio; Lafond, Sebastien (2024). A Large-Scale AIS Datset from Finnish Water [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8112335
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Åbo Akademi University
    Authors
    Debayan Bhattacharya; Ikram Ul Haq; Carlos Pichardo Vicencio; Lafond, Sebastien
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The proposed AIS dataset encompasses a substantial temporal span of 20 months, spanning from April 2021 to December 2022. This extensive coverage period empowers analysts to examine long-term trends and variations in vessel activities. Moreover, it facilitates researchers in comprehending the potential influence of external factors, including weather patterns, seasonal variations, and economic conditions, on vessel traffic and behavior within the Finnish waters.

    This dataset encompasses an extensive array of data pertaining to vessel movements and activities encompassing seas, rivers, and lakes. Anticipated to be comprehensive in nature, the dataset encompasses a diverse range of ship types, such as cargo ships, tankers, fishing vessels, passenger ships, and various other categories.

The AIS dataset exhibits a prominent attribute in the form of its exceptional granularity, with a total of 2 293 129 345 data points. Such granular information helps analysts to comprehend vessel dynamics and operations within Finnish waters. It enables the identification of patterns and anomalies in vessel behavior and facilitates an assessment of the potential environmental implications associated with maritime activities.

    Please cite the following publication when using the dataset:

    TBD

    The publication is available at: TBD

    A preprint version of the publication is available at TBD

    csv file structure

    YYYY-MM-DD-location.csv

    This file contains the received AIS position reports. The structure of the logged parameters is the following: [timestamp, timestampExternal, mmsi, lon, lat, sog, cog, navStat, rot, posAcc, raim, heading]

timestamp I believe this is the UTC second when the report was generated by the electronic position fixing system (EPFS) (0-59, or 60 if time stamp is not available, which should also be the default value, or 61 if positioning system is in manual input mode, or 62 if electronic position fixing system operates in estimated (dead reckoning) mode, or 63 if the positioning system is inoperative).

    timestampExternal The timestamp associated with the MQTT message received from www.digitraffic.fi. It is assumed this timestamp is the Epoch time corresponding to when the AIS message was received by digitraffic.fi.

    mmsi MMSI number, Maritime Mobile Service Identity (MMSI) is a unique 9 digit number that is assigned to a (Digital Selective Calling) DSC radio or an AIS unit. Check https://en.wikipedia.org/wiki/Maritime_Mobile_Service_Identity

    lon Longitude, Longitude in 1/10 000 min (+/-180 deg, East = positive (as per 2's complement), West = negative (as per 2's complement). 181= (6791AC0h) = not available = default)

    lat Latitude, Latitude in 1/10 000 min (+/-90 deg, North = positive (as per 2's complement), South = negative (as per 2's complement). 91deg (3412140h) = not available = default)

    sog Speed over ground in 1/10 knot steps (0-102.2 knots) 1 023 = not available, 1 022 = 102.2 knots or higher

cog Course over ground in 1/10 degree (0-3599). 3600 (E10h) = not available = default. 3 601-4 095 should not be used

    navStat Navigational status, 0 = under way using engine, 1 = at anchor, 2 = not under command, 3 = restricted maneuverability, 4 = constrained by her draught, 5 = moored, 6 = aground, 7 = engaged in fishing, 8 = under way sailing, 9 = reserved for future amendment of navigational status for ships carrying DG, HS, or MP, or IMO hazard or pollutant category C, high speed craft (HSC), 10 = reserved for future amendment of navigational status for ships carrying dangerous goods (DG), harmful substances (HS) or marine pollutants (MP), or IMO hazard or pollutant category A, wing in ground (WIG); 11 = power-driven vessel towing astern (regional use); 12 = power-driven vessel pushing ahead or towing alongside (regional use); 13 = reserved for future use, 14 = AIS-SART (active), MOB-AIS, EPIRB-AIS 15 = undefined = default (also used by AIS-SART, MOB-AIS and EPIRB-AIS under test)

    rot ROTAIS Rate of turn

    0 to +126 = turning right at up to 708 deg per min or higher

    0 to -126 = turning left at up to 708 deg per min or higher

    Values between 0 and 708 deg per min coded by ROTAIS = 4.733 SQRT(ROTsensor) degrees per min where ROTsensor is the Rate of Turn as input by an external Rate of Turn Indicator (TI). ROTAIS is rounded to the nearest integer value.

    +127 = turning right at more than 5 deg per 30 s (No TI available)

    -127 = turning left at more than 5 deg per 30 s (No TI available)

    -128 (80 hex) indicates no turn information available (default).

    ROT data should not be derived from COG information.

    posAcc Position accuracy, The position accuracy (PA) flag should be determined in accordance with the table below:

    1 = high (<= 10 m)

    0 = low (> 10 m)

    0 = default

    See https://www.navcen.uscg.gov/?pageName=AISMessagesA#RAIM

    raim RAIM-flag Receiver autonomous integrity monitoring (RAIM) flag of electronic position fixing device; 0 = RAIM not in use = default; 1 = RAIM in use. See Table https://www.navcen.uscg.gov/?pageName=AISMessagesA#RAIM

    Check https://en.wikipedia.org/wiki/Receiver_autonomous_integrity_monitoring

    heading True heading, Degrees (0-359) (511 indicates not available = default)
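A minimal Python/pandas sketch for reading one day's position reports and applying the 1/10 scalings and "not available" sentinels described above; whether the CSV files carry a header row, and whether the columns hold the raw AIS-coded values, are assumptions to verify against the actual files:

    import pandas as pd

    cols = ["timestamp", "timestampExternal", "mmsi", "lon", "lat", "sog",
            "cog", "navStat", "rot", "posAcc", "raim", "heading"]

    # Assumes no header row; drop names=/header= if the files include one.
    df = pd.read_csv("2021-04-01-location.csv", names=cols, header=None)

    # Mask the "not available" sentinel values, then apply the 1/10 scalings.
    df["sog"] = df["sog"].mask(df["sog"] == 1023)              # 1023 = speed not available
    df["cog"] = df["cog"].mask(df["cog"] == 3600)              # 3600 = course not available
    df["heading"] = df["heading"].mask(df["heading"] == 511)   # 511 = heading not available
    df["sog_knots"] = df["sog"] / 10.0                         # sog is in 1/10 knot steps
    df["cog_deg"] = df["cog"] / 10.0                           # cog is in 1/10 degree steps

    print(df[["mmsi", "lat", "lon", "sog_knots", "cog_deg"]].head())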

    YYYY-MM-DD-metadata.csv

    This file contains the received AIS metadata: the ship static and voyage related data. The structure of the logged parameters is the following: [timestamp, destination, mmsi, callSign, imo, shipType, draught, eta, posType, pointA, pointB, pointC, pointD, name]

    timestamp The timestamp associated with the MQTT message received from www.digitraffic.fi. It is assumed this timestamp is the Epoch time corresponding to when the AIS message was received by digitraffic.fi.

    destination Maximum 20 characters using 6-bit ASCII; @@@@@@@@@@@@@@@@@@@@ = not available For SAR aircraft, the use of this field may be decided by the responsible administration

    mmsi MMSI number, Maritime Mobile Service Identity (MMSI) is a unique 9 digit number that is assigned to a (Digital Selective Calling) DSC radio or an AIS unit. Check https://en.wikipedia.org/wiki/Maritime_Mobile_Service_Identity

callSign 7 six-bit ASCII characters, @@@@@@@ = not available = default. Craft associated with a parent vessel should use "A" followed by the last 6 digits of the MMSI of the parent vessel. Examples of these craft include towed vessels, rescue boats, tenders, lifeboats and liferafts.

    imo 0 = not available = default – Not applicable to SAR aircraft

    0000000001-0000999999 not used

    0001000000-0009999999 = valid IMO number;

    0010000000-1073741823 = official flag state number.

    Check: https://en.wikipedia.org/wiki/IMO_number

    shipType

    0 = not available or no ship = default

    1-99 = as defined below

    100-199 = reserved, for regional use

    200-255 = reserved, for future use Not applicable to SAR aircraft

    Check https://www.navcen.uscg.gov/pdf/AIS/AISGuide.pdf and https://www.navcen.uscg.gov/?pageName=AISMessagesAStatic

    draught In 1/10 m, 255 = draught 25.5 m or greater, 0 = not available = default; in accordance with IMO Resolution A.851 Not applicable to SAR aircraft, should be set to 0

    eta Estimated time of arrival; MMDDHHMM UTC

    Bits 19-16: month; 1-12; 0 = not available = default

    Bits 15-11: day; 1-31; 0 = not available = default

    Bits 10-6: hour; 0-23; 24 = not available = default

    Bits 5-0: minute; 0-59; 60 = not available = default

    For SAR aircraft, the use of this field may be decided by the responsible administration

    posType Type of electronic position fixing device

    0 = undefined (default)

    1 = GPS

    2 = GLONASS

    3 = combined GPS/GLONASS

    4 = Loran-C

    5 = Chayka

    6 = integrated navigation system

    7 = surveyed

    8 = Galileo,

    9-14 = not used

    15 = internal GNSS

    pointA Reference point for reported position.

Also indicates the dimension of the ship (m). For SAR aircraft, the use of this field may be decided by the responsible administration. If used, it should indicate the maximum dimensions of the craft. By default, A = B = C = D should be set to "0".

    Check: https://www.navcen.uscg.gov/?pageName=AISMessagesAStatic#_Reference_point_for

    pointB See above

    pointC See above

    pointD See above

    name Maximum 20 characters 6 bit ASCII "@@@@@@@@@@@@@@@@@@@@" = not available = default The Name should be as shown on the station radio license. For SAR aircraft, it should be set to “SAR AIRCRAFT NNNNNNN” where NNNNNNN equals the aircraft registration number.
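If the eta field is delivered as the raw packed integer described above (an assumption; it may already be decoded in the CSV), the bit layout can be unpacked in Python as follows:

    def decode_eta(eta: int):
        # Unpack the AIS ETA bits: 19-16 month, 15-11 day, 10-6 hour, 5-0 minute.
        month = (eta >> 16) & 0xF
        day = (eta >> 11) & 0x1F
        hour = (eta >> 6) & 0x1F
        minute = eta & 0x3F
        return month, day, hour, minute

    # Example: month=3, day=15, hour=10, minute=30 packed into one integer.
    packed = (3 << 16) | (15 << 11) | (10 << 6) | 30
    print(decode_eta(packed))  # (3, 15, 10, 30)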

15. Arthropod Kraken2 Database v1

    • demo.researchdata.se
    • researchdata.se
    Updated Aug 18, 2025
    + more versions
    Cite
    Samantha López Clinton; Tom van der Valk (2025). Arthropod Kraken2 Database v1 [Dataset]. http://doi.org/10.17044/SCILIFELAB.29666605
    Explore at:
    Dataset updated
    Aug 18, 2025
    Dataset provided by
    Swedish Museum of Natural History
    Authors
    Samantha López Clinton; Tom van der Valk
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Kraken2 Arthropod Reference Database v1: a Kraken2 (v2.1.2) database containing all 2,593 reference assemblies for Arthropoda available on NCBI as of March 2023.

    This database was built for and used in the analysis of shotgun sequencing data of bulkDNA from Malaise trap samples collected by the Insect Biome Atlas, in the context of the manuscript "Small Bugs, Big Data: Metagenomics for arthropod biodiversity monitoring" by authors: López Clinton Samantha, Iwaszkiewicz-Eggebrecht Ela, Miraldo Andreia, Goodsell Robert, Webster Mathew T, Ronquist Fredrik, van der Valk Tom (for submission to Ecology and Evolution).

For custom database building, Kraken2 requires all headers in reference assembly fasta files to be annotated with "kraken:taxid|XXX" at the end of each header, where "XXX" is the corresponding National Center for Biotechnology Information (NCBI) taxID of the species. The code used to add the taxID information to each fasta file header, and to update the accession2taxid.map file required by Kraken2 for database building, is available in this GitHub repository (https://github.com/SamanthaLop/Small_Bugs_Big_Data) (also linked under "Related Materials" below).
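As an illustration of that header convention, a few lines of Python can tag one assembly's headers; this is a simplified sketch of the idea, not the repository's actual script, and the file names and taxID are placeholders:

    # Append the Kraken2 taxid tag to every FASTA header in one assembly (sketch).
    taxid = "7227"  # placeholder NCBI taxID for the assembly's species

    with open("assembly.fna") as src, open("assembly.tagged.fna", "w") as dst:
        for line in src:
            if line.startswith(">"):
                line = line.rstrip("\n") + f"|kraken:taxid|{taxid}\n"
            dst.write(line)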

Content

Below is a list of the files in this item (in addition to the README and MANIFEST files), and their description. The first three files (marked with a *) are required to run Kraken2 classifications using the database.

    • * hash.k2d.gz - A hash file with all minimiser to taxon mappings (855 GB).
    • * opts.k2d - A file containing all options used when building the Kraken2 database (64 B).
    • * taxo.k2d - A file containing the taxonomy information used to build the database (385.9 KB).
    • seqid2taxid.map.gz - A file containing contig accession numbers and their corresponding taxids (810.6 MB). Note that this file is needed by Kraken2 when building the database, and as it was updated during custom building, it has been included for reference, but it is not required to use the database for classification.
• genome_assembly_metadata.tsv - NCBI-generated table (tsv format, gzipped) of all reference assemblies for Arthropoda as of March 2023, which were used in the database construction. This includes the columns: Assembly Accession, Assembly Name, Organism Name, Organism Infraspecific Names Breed, Organism Infraspecific Names Strain, Organism Infraspecific Names Cultivar, Organism Infraspecific Names Ecotype, Organism Infraspecific Names Isolate, Organism Infraspecific Names Sex, Annotation Name, Assembly Stats Total Sequence Length, Assembly Level, Assembly Submission, and WGS project accession.

How to use the database

• Download the hash.k2d.gz, opts.k2d, and taxo.k2d files to the same directory (e.g. /PATH/TO/DATABASE/).
• Unzip the hash.k2d.gz file.
• Install or load Kraken2 to run classification on sequencing data using the database.
• When running Kraken2, indicate the path to the directory (not the individual files) with the --db flag (e.g. kraken2 --db /PATH/TO/DATABASE/ ...). Note that the whole database must be loaded into memory by Kraken2 to be able to classify any sequencing reads, so ensure you have access to enough memory before running (the uncompressed hash file is around 1.1 TB).

    We also recommend using the Kraken2 option --memory-mapping, as it ensures the database is loaded once for all samples, instead of once for each individual sample, saving considerable time and resources.
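As a sketch of the classification step, invoking Kraken2 from Python with only the options mentioned above (the database path and the reads file are placeholders):

    import subprocess

    # Run Kraken2 against the downloaded database directory.
    subprocess.run(
        [
            "kraken2",
            "--db", "/PATH/TO/DATABASE/",
            "--memory-mapping",       # per the note above: load the database once for all samples
            "sample_reads.fastq",     # placeholder input reads
        ],
        check=True,
    )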

    For more information on using Kraken2, see the Kraken2 wiki manual (https://github.com/DerrickWood/kraken2/wiki/Manual) .

    This database was built by Samantha López Clinton (samantha.lopezclinton@nrm) and Tom van der Valk (tom.vandervalk@nrm.se).

16. Global IT Information Technology Market Report 2025 Edition, Market Size,...

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Cite
    Cognitive Market Research, Global IT Information Technology Market Report 2025 Edition, Market Size, Share, CAGR, Forecast, Revenue [Dataset]. https://www.cognitivemarketresearch.com/it-information-technology-market-report
    Explore at:
pdf, excel, csv, ppt
    Dataset authored and provided by
    Cognitive Market Research
    License

https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

According to Cognitive Market Research, the global Information Technology market size was USD XX Million in 2024 and is set to reach USD XX Million by the end of 2033, growing at a CAGR of XX% from 2025 to 2033.

North America held the largest share of xx% in 2024
Europe held a share of xx% in 2024
Asia-Pacific held a significant share of xx% in 2024
South America held a significant share of xx% in 2024
Middle East and Africa held a significant share of xx% in 2024
    

    Market Dynamics of IT Information Technology Market

    Key Drivers of IT Information Technology Market

    The Growing Adoption of Cloud Computing, Artificial Intelligence, and Big Data

    The extensive incorporation of cutting-edge digital technologies—cloud computing, AI, and big data—serves as a key catalyst for the growth of the IT market. Cloud computing provides businesses with scalable and adaptable infrastructure, AI enhances operational efficiency through automation and predictive analytics, and big data supports informed decision-making. For example, Atera’s collaboration with Azure OpenAI facilitates predictive issue resolution and significantly enhances IT productivity. These technologies are transforming workflows across various industries and driving innovation, ensuring that the IT sector remains at the forefront of global digital transformation.

    Source:https://www.microsoft.com/en/customers/story/1662731177894407321-atera-professional-services-azure-en-israel

    The Transformative Influence of IoT is Enhancing the Global IT Sector

    The rapid proliferation of Internet of Things (IoT) devices—projected to exceed 16.6 billion by the close of 2023—has intensified the demand for IT infrastructure, services, and analytics. IoT fosters real-time data gathering, automation, and predictive maintenance in sectors such as healthcare, manufacturing, and smart cities. The immense data produced by interconnected devices is propelling advancements in AI, cloud computing, and edge computing. With increasing investments in 5G and digital infrastructure, IoT continues to serve as a vital enabler of IT market growth on a global scale.

    (Source: https://iot-analytics.com/product/state-of-iot-summer-2024/)

    Key Restraints in IT Information Technology Market

    Growing Concerns Regarding Data Privacy are Impeding IT Market Expansion

    High-profile cyber incidents, such as the 2021 Microsoft Exchange Server breach, have triggered considerable anxiety regarding data security. Consumer apprehensions about surveillance, unauthorized access, and the corporate misuse of personal data are on the rise. According to Deloitte, almost 60% of consumers express concerns about security breaches, with trust in corporate data management notably diminished. This situation has prompted demands for more stringent privacy regulations and may hinder digital adoption due to heightened compliance requirements and public skepticism.

    (Sources: https://www2.deloitte.com/us/en/insights/industry/telecommunications/connectivity-mobile-trends-survey/2023/data-privacy-and-security.html; https://en.wikipedia.org/wiki/WannaCry_ransomware_attack)

    Cybersecurity Threats and the Escalation of Attack Complexity

    The emergence of intricate cyber threats, such as ransomware (e.g., WannaCry), poses a persistent challenge for the IT industry. Cybercriminals take advantage of weaknesses in essential systems, leading to financial losses, data breaches, and damage to reputation. Tackling cybersecurity necessitates ongoing investment in threat detection, endpoint security, and adherence to regulations. These evolving threats not only increase operational expenses but also discourage smaller enterprises from adopting advanced IT solutions due to the fear of vulnerability.

    Key Trends of IT Information Technology Market

    Expansion of Edge Computing to Facilitate Real-Time Applications

    As IoT and smart devices become more prevalent, edge computing is gaining traction by processing data nearer to its source. This approach minimizes latency and enhances response times, making it particularly suitable for real-time applications such as autonomous vehicles, smart manufacturing, and augmented reality. The shift towards edge infrastructure is transforming IT architectures to more effectively balance cloud and on-premise computing requirements.

    Increase i...

  17. Commercial Reference Building: Large Hotel

    • data.openei.org
    • s.cnmilf.com
    • +2more
    data, image_document +1
    Updated Nov 25, 2014
    + more versions
    Cite
    Michael Deru; Kristin Field; Daniel Studer; Kyle Benne; Brent Griffith; Paul Torcellini; Bing Liu; Mark Halverson; Dave Winiarski; Michael Rosenberg; Mehry Yazdanian; Joe Huang; Drury Crawley (2014). Commercial Reference Building: Large Hotel [Dataset]. https://data.openei.org/submissions/158
    Explore at:
    Available download formats: website, data, image_document
    Dataset updated
    Nov 25, 2014
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Office of Energy Efficiency and Renewable Energy (http://energy.gov/eere)
    Open Energy Data Initiative (OEDI)
    National Renewable Energy Laboratory
    Authors
    Michael Deru; Kristin Field; Daniel Studer; Kyle Benne; Brent Griffith; Paul Torcellini; Bing Liu; Mark Halverson; Dave Winiarski; Michael Rosenberg; Mehry Yazdanian; Joe Huang; Drury Crawley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Commercial reference buildings provide complete descriptions for whole building energy analysis using EnergyPlus (see "About EnergyPlus" resource link) simulation software. Included here is data pertaining to the reference building type "Large Hotel" for each of the 16 climate zones described on the Wiki page (see "OpenEI Wiki Page for Commercial Reference Buildings" resource link), and each of three construction categories: new (2004) construction, post-1980 construction existing buildings, and pre-1980 construction existing buildings.

    The dataset includes four key components: building summary, zone summary, location summary and a picture. Building summary includes details about: form, fabric, and HVAC. Zone summary includes details such as: area, volume, lighting, and occupants for all types of zones in the building. Location summary includes key building information as it pertains to each climate zone, including: fabric and HVAC details, utility costs, energy end use, and peak energy demand.

    In total, DOE developed 16 reference building types that represent approximately 70% of commercial buildings in the U.S.; for each type, building models are available for each of the three construction categories. The commercial reference buildings (formerly known as commercial building benchmark models) were developed by the U.S. Department of Energy (DOE), in conjunction with three of its national laboratories.

    Additional data is available directly from DOE's Energy Efficiency & Renewable Energy (EERE) website (see "About Commercial Buildings" resource link), including EnergyPlus software input files (.idf) and results of the EnergyPlus simulations (.html).
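    Because the models are distributed as plain-text EnergyPlus input files (.idf), a quick way to get an overview of a downloaded model is to tally its object classes. The sketch below is only a rough illustration and not part of the dataset: it assumes the usual IDF conventions ('!' starts a comment, fields are comma-separated, each object ends with ';'), and the example filename is hypothetical; for real work, a dedicated IDF parser or EnergyPlus itself is the better tool.

    from collections import Counter

    def count_idf_objects(path):
        # Tally EnergyPlus object classes (Zone, Lights, People, ...) in a plain-text .idf file,
        # assuming '!' starts a comment, fields are comma-separated, and ';' ends an object.
        counts = Counter()
        buffer = ""
        with open(path) as fh:
            for raw in fh:
                line = raw.split("!")[0].strip()   # drop comments and surrounding whitespace
                if not line:
                    continue
                buffer += line
                if buffer.endswith(";"):
                    class_name = buffer.split(",")[0].rstrip(";").strip()
                    if class_name:
                        counts[class_name.upper()] += 1
                    buffer = ""
        return counts

    # Hypothetical filename for one of the Large Hotel models:
    # print(count_idf_objects("RefBldgLargeHotelNew2004_Chicago.idf").most_common(10))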

    Note: There have been many changes and improvements since this dataset was released. The models have been revised several times, and DOE has since moved to a different approach to representing typical building energy consumption. For current data on building energy consumption, please see the ComStock resource below.

  18. Data from: gray zone lymphoma

    • alliancegenome.org
    Updated Jul 22, 2010
    Cite
    Alliance of Genome Resources (2010). gray zone lymphoma [Dataset]. http://identifiers.org/DOID:5822
    Explore at:
    Dataset updated
    Jul 22, 2010
    Dataset authored and provided by
    Alliance of Genome Resources
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A lymphoma that is characterized by having cellular features of both classic Hodgkin's lymphomas and large B-cell lymphomas. (URL: http://en.wikipedia.org/wiki/Gray_zone_lymphoma)

  19. Data from: LamaH-Ice: LArge-SaMple DAta for Hydrology and Environmental...

    • dataone.org
    • beta.hydroshare.org
    • +2more
    Updated Jun 1, 2024
    Cite
    Hordur Bragi Helgason; Bart Nijssen (2024). LamaH-Ice: LArge-SaMple DAta for Hydrology and Environmental Sciences for Iceland [Dataset]. https://dataone.org/datasets/sha256%3A27e862974ff92ab05b3e6961a09ddfee03655c2839936bb5c3dca3eb520d65f7
    Explore at:
    Dataset updated
    Jun 1, 2024
    Dataset provided by
    Hydroshare
    Authors
    Hordur Bragi Helgason; Bart Nijssen
    Time period covered
    Jan 1, 1950 - Dec 31, 2021
    Area covered
    Description

    LamaH-Ice (LArge-SaMple DAta for Hydrology and Environmental Sciences for Iceland) is a large-sample hydro-meteorological dataset for Iceland. The dataset includes daily and hourly hydro-meteorological timeseries, including observed streamflow, and catchment characteristics for 107 river basins in Iceland. The catchment characteristics describe the topographic, hydroclimatic, land cover, vegetation, soils, geological and glaciological attributes of the river catchments, as well as the human influence on streamflow in the catchments. LamaH-Ice conforms to the structure of existing LSH datasets and includes most variables offered in these datasets, as well as additional information relevant to cold-region hydrology, e.g., timeseries of snow cover, glacier mass balance and albedo. A large majority of the watersheds in LamaH-Ice are not subject to human activities, such as diversions and flow regulations.

    LamaH-Ice contains meteorological forcings from three different reanalysis datasets (ERA5-Land, RAV-II and CARRA). Streamflow measurements cover only part of the 1950-2021 period: the average streamflow series length is 33 years for daily data (with gaps) and 11 years for hourly data. The dataset is described in detail in a paper published in the journal "Earth System Science Data" (ESSD). The code used to assemble the dataset is available in the folder "F_appendix" in the dataset as well as on GitHub (https://github.com/hhelgason/LamaH-Ice). We offer two downloadable files for the LamaH-Ice dataset: 1) hydrometeorological time series at both daily and hourly resolution (30 GB after decompression) and 2) hydrometeorological time series at daily resolution only (2 GB). Other than the temporal resolution, there are no differences between the two downloadable files; a minimal loading sketch is shown after the update notes below.

    This HydroShare resource also hosts the "LamaH-Ice Caravan extension", which complements the "Caravan - A global community dataset for large-sample hydrology" Caravan dataset (Kratzert et al., 2023). The data is formatted in the same manner as the data currently existing in Caravan. To process the Caravan extension, the following guide was used: https://github.com/kratzert/Caravan/wiki/Extending-Caravan-with-new-basins. Some features, e.g. hourly atmospheric and streamflow series, glacier mass balance and MODIS timeseries data are thus only available in the LamaH-Ice dataset.

    Data disclaimer: The Icelandic Meteorological Office (IMO) and the National Power Company of Iceland (NPC) own the data from most streamflow gauges in the dataset. The streamflow data is published on Hydroshare with permission of all data owners. Neither we nor the provider of the streamflow dataset can be liable for the data provided. The IMO and the NPC reserve the rights to retrospectively check and update the streamflow timeseries at any time, and these changes will most likely not be reflected in the published dataset. If up-to-date data is needed, users are encouraged to contact the IMO and the NPC.

    License: The streamflow data is subject to the CC BY-NC 4.0 license (creativecommons.org/licenses/by-nc/4.0/) and cannot be used for commercial purposes. All data except for the streamflow measurements are subject to the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/). Users can share and adapt the dataset only if appropriate credit is given (the ESSD data description paper is cited, and the dataset version and all data sources declared in the folder "Info" are listed), any changes are clearly indicated, and a link to the original license is provided.

    Updates since the HydroShare repository was first created on August 18, 2023:

    April 30, 2024:
    • Streamflow series were corrected (replaced) for gauges with IDs 31, 70 and 72, and hydrological signatures and water balance files were recalculated using the corrected streamflow series.

    March 12, 2024 (dataset revision): In line with the ESSD manuscript revision, significant updates have been made. For a detailed list, visit https://doi.org/10.5194/essd-2023-349-AC1. Key changes include:
    • A timeseries for reference ET has been computed using RAV-II reanalysis meteorological timeseries.
    • Climate indices recalculated with RAV-II reanalysis; ERA5-Land indices remain under an "_ERA5L" suffix.
    • Hydrological signatures are now derived from RAV-II reanalysis precipitation.
    • Standardized .csv column separators to semicolons.
    • Enhanced metadata for all shapefiles.
    • Added attributes (g_lon, g_lat, g_frac_dyn, g_area_dyn) to the dataset.
    • Reordered catchment attributes table columns for consistency with the LamaH-Ice paper.
    • Corrected ERA5-Land reanalysis errors for shortwave and longwave flux timeseries.

    February 22, 2024:
    • Caravan Extension Fix: Corrected latitude and longitude mix-up.

    October 1, 2023:
    • GeoPackages added as an alternative to shapefiles; readme files added in all subfolders for guidance.
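    For orientation only, the snippet below shows how one might read a single daily time series file with pandas, using the semicolon separator noted in the March 2024 revision. The folder layout and file name are assumptions modelled on earlier LamaH releases, not guaranteed paths; check the readme files shipped in each subfolder for the actual structure.

    import pandas as pd

    gauge_id = 31  # one of the gauges whose series was corrected in the April 2024 update
    ts_path = f"D_gauges/2_timeseries/daily/ID_{gauge_id}.csv"  # hypothetical path -- verify locally

    # The 2024 revision standardized .csv column separators to semicolons.
    df = pd.read_csv(ts_path, sep=";")
    print(df.columns.tolist())
    print(df.head())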

  20. MEDICINA-corpus_reducido+MIR+wiki

    • kaggle.com
    Updated May 8, 2023
    Cite
    Manuel González Martínez (2023). MEDICINA-corpus_reducido+MIR+wiki [Dataset]. https://www.kaggle.com/datasets/manuelgonzlezmartnez/medicina-corpus-reducido-mir-wiki
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Manuel González Martínez
    Description

    This dataset contains the tokenized version of a corpus made up of 60% of the Spanish OSCAR corpus, wiki data from multiple countries, and medicine books. Because the full corpus is so large, I had to cut down the OSCAR portion to make it a bit smaller; for the same reason I uploaded the tokenized version, since if you want or need to work with this dataset inside Kaggle you do not have enough space to tokenize it yourself.

    I have also uploaded the code used to tokenize the dataset.
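    For readers unfamiliar with that workflow, a comparable tokenization pipeline (a sketch only, not the author's uploaded script) might look like the following; the tokenizer checkpoint, file name and column name are assumptions.

    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Any Spanish-capable checkpoint works; this one is just an example choice.
    tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

    # Load one plain-text shard of the corpus (hypothetical file name).
    raw = load_dataset("text", data_files={"train": "corpus_reducido.txt"})

    def tokenize(batch):
        # Truncate long documents so every example fits the model's context window.
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
    tokenized.save_to_disk("medicina_corpus_tokenized")  # the pre-tokenized copy is what gets uploaded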

    If you want me to upload the entire dataset divided into 4 parts, just ask. :)
