100+ datasets found
  1. Data from: A large-scale comparative analysis of Coding Standard conformance...

    • figshare.com
    application/x-gzip
    Updated Oct 4, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa (2021). A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects [Dataset]. http://doi.org/10.6084/m9.figshare.12377237.v3
    Explore at:
    application/x-gzipAvailable download formats
    Dataset updated
    Oct 4, 2021
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ to traditional software projects? We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity.results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.notebooks_out.tar.gz: Tables and figures generated by notebooks.source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes Jupyter notebooks used to produce figures in the paper.The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositoriesPublished in ESEM 2020: https://doi.org/10.1145/3382494.3410680Preprint: https://arxiv.org/abs/2007.08978

  2. Importance of data sources for analytics vs access among U.S. businesses...

    • statista.com
    Updated Mar 25, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Importance of data sources for analytics vs access among U.S. businesses 2015 [Dataset]. https://www.statista.com/statistics/562625/united-states-data-analytics-importance-vs-access/
    Explore at:
    Dataset updated
    Mar 25, 2016
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    United States
    Description

    This statistic illustrates the importance of various data sources for business analytics, compared to the level of access businesses have to those data sources, according to a marketing survey of C-level executives, conducted in ************* by Black Ink. As of *************, product and service usage data was listed as important by ** percent of respondents, but the degree of access to that data was put at ** percent.

  3. Data sources for anti-fraud data analytics initiatives in global...

    • statista.com
    Updated May 23, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2022). Data sources for anti-fraud data analytics initiatives in global organizations 2019 [Dataset]. https://www.statista.com/statistics/1043542/worldwide-fraud-fight-data-analytics-data-sources/
    Explore at:
    Dataset updated
    May 23, 2022
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Feb 2019
    Area covered
    Worldwide
    Description

    Internal structured data is the most commonly used data source for anti-fraud data analytics initiatives in organizations, according to a global company survey in 2019. Almost three quarters of the respondents said that internal structured data was used in their companies for anti-fraud analytics tests.

  4. Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    Updated Feb 15, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
    Explore at:
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Global, United States, Canada
    Description

    Snapshot img

    Data Science Platform Market Size 2025-2029

    The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.

    The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.

    What will be the Size of the Data Science Platform Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    Request Free SampleThe market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with APIs integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection. Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.

    How is this Data Science Platform Industry segmented?

    The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments. DeploymentOn-premisesCloudComponentPlatformServicesEnd-userBFSIRetail and e-commerceManufacturingMedia and entertainmentOthersSectorLarge enterprisesSMEsApplicationData PreparationData VisualizationMachine LearningPredictive AnalyticsData GovernanceOthersGeographyNorth AmericaUSCanadaEuropeFranceGermanyUKMiddle East and AfricaUAEAPACChinaIndiaJapanSouth AmericaBrazilRest of World (ROW)

    By Deployment Insights

    The on-premises segment is estimated to witness significant growth during the forecast period.In the dynamic the market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sen

  5. Q

    Catalog of Ocean Data Science Initiatives

    • data.qdr.syr.edu
    pdf, txt, xlsx
    Updated May 26, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lauren Alexandra Drakopulos; Lauren Alexandra Drakopulos; Elizabeth Havice; Elizabeth Havice; Katie Crisp; Ana Zurita Posas; Lisa M. Campbell; Katie Crisp; Ana Zurita Posas; Lisa M. Campbell (2022). Catalog of Ocean Data Science Initiatives [Dataset]. http://doi.org/10.5064/F6ZQWQJS
    Explore at:
    xlsx(344302), pdf(81722), txt(4514), pdf(222143)Available download formats
    Dataset updated
    May 26, 2022
    Dataset provided by
    Qualitative Data Repository
    Authors
    Lauren Alexandra Drakopulos; Lauren Alexandra Drakopulos; Elizabeth Havice; Elizabeth Havice; Katie Crisp; Ana Zurita Posas; Lisa M. Campbell; Katie Crisp; Ana Zurita Posas; Lisa M. Campbell
    License

    https://qdr.syr.edu/policies/qdr-standard-access-conditionshttps://qdr.syr.edu/policies/qdr-standard-access-conditions

    Dataset funded by
    National Science Foundation, Human-Environment and Geographical Sciences Program
    Description

    Project Overview This dataset is a catalog of oceans data science initiatives (ODSIs). We define an ODSI as an initiative that mobilizes (often geospatial and temporal) big data and/or novel data sources about the oceans with an express goal of informing or improving conditions in the oceans. ODSI identification began in Jan 2020. Additional ODSIs will continue to be added. We identified more than 150 ODSIs and populated the catalog with data gathered from ODSI websites describing key features of their work including 1) the data infrastructure 2) their organizational structure, 3) the ocean worlds, or ontologies, they create, and 4) the (explicit or implicit) policy and governance ‘solutions’ and relations they promote. The ODSIs in the catalog are global and regional in scope and aim to enhance understanding around three topical concerns: fisheries extraction, biodiversity conservation, and enhancing basic scientific knowledge. Data overview For 100 ODSIs, we created metadata about the data architecture, organizational governance, and world-making practices such as their stated purpose, theory of change, and problem/solution framing. For a subset of 30 ODSIs, we created metadata about their policy and governance stances and practices. All metadata was created based on a textual analysis of their websites and public communications. Data collection overview Sampling strategy: We began with a purposive sample of ODSIs based on the research team’s prior knowledge of and participation in global and regional ODSIs. This sample allowed us to pilot and refine our metadata catalog approach. We then used a combination of keyword searches on Google using search terms such as ‘ocean data’ ‘marine data’ and ‘fisheries data’. Adopting a snowball sampling method, we reviewed the websites of ODSIs that came up in our initial search to find references to additional ODSIs. To determine if an entity was an ODSI, we reviewed web pages for information on purpose, goals, objectives, mission, values (usually in tabs labeled ‘About’ ‘Goals’ or ‘Objectives’) and we looked for links to ‘data’ or ‘data products.’ Entities were selected for our catalog based on two criteria: 1) their stated purpose, goals, objectives, mission, values indicated a commitment to advancing ocean science and data and 2) if they focused on regional or global scales. We selected and categorized ODSIs according to three broad focal areas in global and regional oceans governance: fisheries extraction, biodiversity conservation, and basic ocean science development. Shared data organization This catalog is comprised of three files. 'Havice_ODSIC.pdf' provides a list of each ODSI included in the catalog, and a permalink to the webpage used to populate catalog metadata categories. 'Havice_ODSIC-CodingScheme.pdf' provides a list of code description for the catalog metadata. 'Havice_ODSIC-Metadata.xlsx' is the full catalog with populated metadata.

  6. Z

    Data from: Large-scale comparison of bibliographic data sources: Scopus, Web...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Aug 17, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Van Eck, Nees Jan (2020). Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3837427
    Explore at:
    Dataset updated
    Aug 17, 2020
    Dataset provided by
    Waltman, Ludo
    Van Eck, Nees Jan
    Visser, Martijn
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains supplementary material for the paper 'Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic' by Martijn Visser, Nees Jan van Eck, and Ludo Waltman. The data set provides the statistics presented in the figures in the paper.

  7. Map of articles about "Teaching Open Science"

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Isabel Steinhardt; Isabel Steinhardt (2020). Map of articles about "Teaching Open Science" [Dataset]. http://doi.org/10.5281/zenodo.3371415
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Isabel Steinhardt; Isabel Steinhardt
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This description is part of the blog post "Systematic Literature Review of teaching Open Science" https://sozmethode.hypotheses.org/839

    According to my opinion, we do not pay enough attention to teaching Open Science in higher education. Therefore, I designed a seminar to teach students the practices of Open Science by doing qualitative research.About this seminar, I wrote the article ”Teaching Open Science and qualitative methods“. For the article ”Teaching Open Science and qualitative methods“, I started to review the literature on ”Teaching Open Science“. The result of my literature review is that certain aspects of Open Science are used for teaching. However, Open Science with all its aspects (Open Access, Open Data, Open Methodology, Open Science Evaluation and Open Science Tools) is not an issue in publications about teaching.

    Based on this insight, I have started a systematic literature review. I realized quickly that I need help to analyse and interpret the articles and to evaluate my preliminary findings. Especially different disciplinary cultures of teaching different aspects of Open Science are challenging, as I myself, as a social scientist, do not have enough insight to be able to interpret the results correctly. Therefore, I would like to invite you to participate in this research project!

    I am now looking for people who would like to join a collaborative process to further explore and write the systematic literature review on “Teaching Open Science“. Because I want to turn this project into a Massive Open Online Paper (MOOP). According to the 10 rules of Tennant et al (2019) on MOOPs, it is crucial to find a core group that is enthusiastic about the topic. Therefore, I am looking for people who are interested in creating the structure of the paper and writing the paper together with me. I am also looking for people who want to search for and review literature or evaluate the literature I have already found. Together with the interested persons I would then define, the rules for the project (cf. Tennant et al. 2019). So if you are interested to contribute to the further search for articles and / or to enhance the interpretation and writing of results, please get in touch. For everyone interested to contribute, the list of articles collected so far is freely accessible at Zotero: https://www.zotero.org/groups/2359061/teaching_open_science. The figure shown below provides a first overview of my ongoing work. I created the figure with the free software yEd and uploaded the file to zenodo, so everyone can download and work with it:

    To make transparent what I have done so far, I will first introduce what a systematic literature review is. Secondly, I describe the decisions I made to start with the systematic literature review. Third, I present the preliminary results.

    Systematic literature review – an Introduction

    Systematic literature reviews “are a method of mapping out areas of uncertainty, and identifying where little or no relevant research has been done.” (Petticrew/Roberts 2008: 2). Fink defines the systematic literature review as a “systemic, explicit, and reproducible method for identifying, evaluating, and synthesizing the existing body of completed and recorded work produced by researchers, scholars, and practitioners.” (Fink 2019: 6). The aim of a systematic literature reviews is to surpass the subjectivity of a researchers’ search for literature. However, there can never be an objective selection of articles. This is because the researcher has for example already made a preselection by deciding about search strings, for example “Teaching Open Science”. In this respect, transparency is the core criteria for a high-quality review.

    In order to achieve high quality and transparency, Fink (2019: 6-7) proposes the following seven steps:

    1. Selecting a research question.
    2. Selecting the bibliographic database.
    3. Choosing the search terms.
    4. Applying practical screening criteria.
    5. Applying methodological screening criteria.
    6. Doing the review.
    7. Synthesizing the results.

    I have adapted these steps for the “Teaching Open Science” systematic literature review. In the following, I will present the decisions I have made.

    Systematic literature review – decisions I made

    1. Research question: I am interested in the following research questions: How is Open Science taught in higher education? Is Open Science taught in its full range with all aspects like Open Access, Open Data, Open Methodology, Open Science Evaluation and Open Science Tools? Which aspects are taught? Are there disciplinary differences as to which aspects are taught and, if so, why are there such differences?
    2. Databases: I started my search at the Directory of Open Science (DOAJ). “DOAJ is a community-curated online directory that indexes and provides access to high quality, open access, peer-reviewed journals.” (https://doaj.org/) Secondly, I used the Bielefeld Academic Search Engine (base). Base is operated by Bielefeld University Library and “one of the world’s most voluminous search engines especially for academic web resources” (base-search.net). Both platforms are non-commercial and focus on Open Access publications and thus differ from the commercial publication databases, such as Web of Science and Scopus. For this project, I deliberately decided against commercial providers and the restriction of search in indexed journals. Thus, because my explicit aim was to find articles that are open in the context of Open Science.
    3. Search terms: To identify articles about teaching Open Science I used the following search strings: “teaching open science” OR teaching “open science” OR teach „open science“. The topic search looked for the search strings in title, abstract and keywords of articles. Since these are very narrow search terms, I decided to broaden the method. I searched in the reference lists of all articles that appear from this search for further relevant literature. Using Google Scholar I checked which other authors cited the articles in the sample. If the so checked articles met my methodological criteria, I included them in the sample and looked through the reference lists and citations at Google Scholar. This process has not yet been completed.
    4. Practical screening criteria: I have included English and German articles in the sample, as I speak these languages (articles in other languages are very welcome, if there are people who can interpret them!). In the sample only journal articles, articles in edited volumes, working papers and conference papers from proceedings were included. I checked whether the journals were predatory journals – such articles were not included. I did not include blogposts, books or articles from newspapers. I only included articles that fulltexts are accessible via my institution (University of Kassel). As a result, recently published articles at Elsevier could not be included because of the special situation in Germany regarding the Project DEAL (https://www.projekt-deal.de/about-deal/). For articles that are not freely accessible, I have checked whether there is an accessible version in a repository or whether preprint is available. If this was not the case, the article was not included. I started the analysis in May 2019.
    5. Methodological criteria: The method described above to check the reference lists has the problem of subjectivity. Therefore, I hope that other people will be interested in this project and evaluate my decisions. I have used the following criteria as the basis for my decisions: First, the articles must focus on teaching. For example, this means that articles must describe how a course was designed and carried out. Second, at least one aspect of Open Science has to be addressed. The aspects can be very diverse (FOSS, repositories, wiki, data management, etc.) but have to comply with the principles of openness. This means, for example, I included an article when it deals with the use of FOSS in class and addresses the aspects of openness of FOSS. I did not include articles when the authors describe the use of a particular free and open source software for teaching but did not address the principles of openness or re-use.
    6. Doing the review: Due to the methodical approach of going through the reference lists, it is possible to create a map of how the articles relate to each other. This results in thematic clusters and connections between clusters. The starting point for the map were four articles (Cook et al. 2018; Marsden, Thompson, and Plonsky 2017; Petras et al. 2015; Toelch and Ostwald 2018) that I found using the databases and criteria described above. I used yEd to generate the network. „yEd is a powerful desktop application that can be used to quickly and effectively generate high-quality diagrams.” (https://www.yworks.com/products/yed) In the network, arrows show, which articles are cited in an article and which articles are cited by others as well. In addition, I made an initial rough classification of the content using colours. This classification is based on the contents mentioned in the articles’ title and abstract. This rough content classification requires a more exact, i.e., content-based subdivision and

  8. f

    Library Data Services Landscape Scan

    • arizona.figshare.com
    txt
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jeffrey C Oliver; Fernando Rios; Kiriann Carini; Chun Ly (2023). Library Data Services Landscape Scan [Dataset]. http://doi.org/10.25422/azu.data.22297177.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    University of Arizona Research Data Repository
    Authors
    Jeffrey C Oliver; Fernando Rios; Kiriann Carini; Chun Ly
    License

    https://opensource.org/licenses/BSD-3-Clausehttps://opensource.org/licenses/BSD-3-Clause

    Description

    R code and data for a landscape scan of data services at academic libraries. Original data is licensed CC By 4.0, data obtained from other sources is licensed according to the original licensing terms. R scripts are licensed under the BSD 3-clause license. Summary This work generally focuses on four questions:

    Which research data services does an academic library provide? For a subset of those services, what form does the support come in? i.e. consulting, instruction, or web resources? Are there differences in support between three categories of services: data management, geospatial, and data science? How does library resourcing (i.e. salaries) affect the number of research data services?

    Approach Using direct survey of web resources, we investigated the services offered at 25 Research 1 universities in the United States of America. Please refer to the included README.md files for more information.

    For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to data-management@arizona.edu

  9. Data from: Inventory of online public databases and repositories holding...

    • catalog.data.gov
    • s.cnmilf.com
    • +4more
    Updated Apr 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agricultural Research Service (2025). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. https://catalog.data.gov/dataset/inventory-of-online-public-databases-and-repositories-holding-agricultural-data-in-2017-d4c81
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Servicehttps://www.ars.usda.gov/
    Description

    United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data. Purpose As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to establish where agricultural researchers in the United States-- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals compare how much data is in institutional vs. domain-specific vs. federal platforms determine which repositories are recommended by top journals that require or recommend the publication of supporting data ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data Approach The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered. Search methods We first compiled a list of known domain specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” /“ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals were compiled, in which USDA published in 2012 and 2016, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required? and Recommended data repositories, as provided in the online author guidelines for each the top 50 journals. Evaluation We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind. Results A summary of the major findings from our data review: Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors. There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection. Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation. See included README file for descriptions of each individual data file in this dataset. Resources in this dataset:Resource Title: Journals. File Name: Journals.csvResource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csvResource Title: TDWG presentation. File Name: TDWG_Presentation.pptxResource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csvResource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csvResource Title: General repositories containing ag data. File Name: general_repos_1.csvResource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt

  10. Data from: Multi-Source Distributed System Data for AI-powered Analytics

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Nov 10, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao (2022). Multi-Source Distributed System Data for AI-powered Analytics [Dataset]. http://doi.org/10.5281/zenodo.3549604
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 10, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao; Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract:

    In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
    The major contributions have been materialized in the form of novel algorithms.
    Typically, researchers took the challenge of exploring one specific type of observability data sources, such as application logs, metrics, and distributed traces, to create new algorithms.
    Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
    Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
    Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.

    General Information:

    This repository contains the simple scripts for data statistics, and link to the multi-source distributed system dataset.

    You may find details of this dataset from the original paper:

    Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".

    If you use the data, implementation, or any details of the paper, please cite!

    BIBTEX:

    _

    @inproceedings{nedelkoski2020multi,
     title={Multi-source Distributed System Data for AI-Powered Analytics},
     author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej},
     booktitle={European Conference on Service-Oriented and Cloud Computing},
     pages={161--176},
     year={2020},
     organization={Springer}
    }
    

    _

    The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced from running a complex distributed system (Openstack). In addition, we also provide the workload and fault scripts together with the Rally report which can serve as ground truth. We provide two datasets, which differ on how the workload is executed. The sequential_data is generated via executing workload of sequential user requests. The concurrent_data is generated via executing workload of concurrent user requests.

    The raw logs in both datasets contain the same files. If the user wants the logs filetered by time with respect to the two datasets, should refer to the timestamps at the metrics (they provide the time window). In addition, we suggest to use the provided aggregated time ranged logs for both datasets in CSV format.

    Important: The logs and the metrics are synchronized with respect time and they are both recorded on CEST (central european standard time). The traces are on UTC (Coordinated Universal Time -2 hours). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.

    Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/

  11. d

    Real Estate Data Sources & Analytics

    • datarade.ai
    .json, .csv, .xls
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Forloop.ai, Real Estate Data Sources & Analytics [Dataset]. https://datarade.ai/data-products/real-estate-data-sources-analytics-forloop-ai
    Explore at:
    .json, .csv, .xlsAvailable download formats
    Dataset provided by
    Forloop.ai
    Area covered
    El Salvador, Bolivia (Plurinational State of), United States of America, Saint Barthélemy, Pitcairn, Martinique, South Sudan, Northern Mariana Islands, Austria, Bahamas
    Description

    With our analytics tools, businesses can make data-driven decisions to enhance their operations and stay ahead of the competition. Our analytics tools help businesses gain insights into market trends, identify investment opportunities, and optimize their marketing efforts.

    Whether you are a real estate developer, a property investor, or a financial institution, our real estate data sources and analytics can help you gain a competitive edge in the market.

    Sources: Idealista IT KnightFrank rightmove Biura Inmuebles24 Sreality propestar habitaclia.com Iroda Findboliger.dk Immoweb Homegate Zimmo Funda.nl (RENT) fotocasa Funda.nl Comparis.ch Google Maps Flexioffices leboncoin idealista ES Remax.pl Realting Kyero parisattitude Finn.no immoscout24.de Immobilier.ch Jaap.nl Immo-vlan SeLoger Booli.se Immowelt Realla idealista (rent) Hemnet.se Home.ch Boliga.dk Instant Office UK idealista PT pisos.com domy.pl

  12. o

    Indeed Data Science & ML Job Postings

    • opendatabay.com
    .undefined
    Updated Jul 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Datasimple (2025). Indeed Data Science & ML Job Postings [Dataset]. https://www.opendatabay.com/data/ai-ml/cc486027-ff62-4396-a1d5-b98c3aa7a223
    Explore at:
    .undefinedAvailable download formats
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset offers insights into job postings, primarily focusing on roles in Data Engineering, Data Analysis, Data Science, and Machine Learning Engineering. It contains approximately 1583 records of job information, providing a snapshot of the employment landscape in these fields. The dataset is ideal for understanding market demands and trends.

    Columns

    • job_title: The specific title of the job post.
    • company: The name of the hiring company.
    • job_location: The city and state where the job is located.
    • job_summary: A detailed description outlining the purpose of the hiring.
    • post_date: The date when the job was posted on Indeed.
    • today: The date when the data was collected.
    • job_salary: The expected salary range for the position.
    • job_url: A direct link to the job posting for further details.

    Distribution

    The dataset is provided as a single CSV file, named 'job_dataset.csv'. It comprises 1583 rows and 8 columns, representing the structure of the collected job information. The data collection occurred around 26th July 2022.

    Usage

    This dataset is well-suited for various analytical tasks: * Cleaning and refining job data. * Identifying the most in-demand skills within the data and machine learning sectors. * Analysing the geographical distribution of jobs. * Conducting Natural Language Processing (NLP) and research on job descriptions. * Market analysis for job seekers, recruiters, and educational institutions.

    Coverage

    The dataset has a global scope, with notable concentrations of job postings in locations such as Bengaluru, Karnataka (30%) and Gurgaon, Haryana (7%). The records primarily cover job postings for data-related roles, including Data Engineer, Data Analyst, Data Scientist, and ML Engineer, with data collected around July 2022. Some postings were listed over 30 days prior to the collection date.

    License

    CC0

    Who Can Use It

    This dataset is valuable for: * Data Scientists and Analysts: For market research, trend analysis, and skill demand assessment. * Machine Learning Engineers: To understand job requirements and role distributions. * Researchers: For academic studies on labour markets and skill development. * Job Seekers: To identify popular roles, required skills, and geographical opportunities. * Companies and Recruiters: For talent acquisition strategies and competitor analysis.

    Dataset Name Suggestions

    • Indeed Data Science & ML Job Postings
    • Global Data Roles Dataset
    • Job Market Insights: Data Careers
    • Data Analytics & AI Job Data
    • UK Data Professional Vacancies

    Attributes

    Original Data Source: Indeed job (Data science /data analyst/ ML)

  13. n

    Data from: Designing data science workshops for data-intensive environmental...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 8, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 8, 2020
    Dataset provided by
    California State Polytechnic University
    Montana State University
    Authors
    Allison Theobold; Stacey Hancock; Sara Mannheimer
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum,'' the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.

    Methods Surveys from Carpentries style workshops the results of which are presented in the accompanying manuscript.

    Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

    The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. 
    The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw.
    
      The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey.
    
    
    The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively. 
    The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean. 
    The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
    
  14. Alternative Data Market Analysis North America, Europe, APAC, South America,...

    • technavio.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Technavio, Alternative Data Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, Canada, China, UK, Mexico, Germany, Japan, India, Italy, France - Size and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/alternative-data-market-industry-analysis
    Explore at:
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Global, United States, Canada, Mexico
    Description

    Snapshot img

    Alternative Data Market Size 2025-2029

    The alternative data market size is forecast to increase by USD 60.32 billion, at a CAGR of 52.5% between 2024 and 2029.

    The market is experiencing significant growth, driven by the increased availability and diversity of data sources. This expanding data landscape is fueling the rise of alternative data-driven investment strategies across various industries. However, the market faces challenges related to data quality and standardization. As companies increasingly rely on alternative data to inform business decisions, ensuring data accuracy and consistency becomes paramount. Addressing these challenges requires robust data management systems and collaboration between data providers and consumers to establish industry-wide standards. Companies that effectively navigate these dynamics can capitalize on the wealth of opportunities presented by alternative data, driving innovation and competitive advantage.

    What will be the Size of the Alternative Data Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    Request Free SampleThe market continues to evolve, with new applications and technologies shaping its dynamics. Predictive analytics and deep learning are increasingly being integrated into business intelligence systems, enabling more accurate risk management and sales forecasting. Data aggregation from various sources, including social media and web scraping, enriches datasets for more comprehensive quantitative analysis. Data governance and metadata management are crucial for maintaining data accuracy and ensuring data security. Real-time analytics and cloud computing facilitate decision support systems, while data lineage and data timeliness are essential for effective portfolio management. Unstructured data, such as sentiment analysis and natural language processing, provide valuable insights for various sectors. Machine learning algorithms and execution algorithms are revolutionizing trading strategies, from proprietary trading to high-frequency trading. Data cleansing and data validation are essential for maintaining data quality and relevance. Standard deviation and regression analysis are essential tools for financial modeling and risk management. Data enrichment and data warehousing are crucial for data consistency and completeness, allowing for more effective customer segmentation and sales forecasting. Data security and fraud detection are ongoing concerns, with advancements in technology continually addressing new threats. The market's continuous dynamism is reflected in its integration of various technologies and applications. From data mining and data visualization to supply chain optimization and pricing optimization, the market's evolution is driven by the ongoing unfolding of market activities and evolving patterns.

    How is this Alternative Data Industry segmented?

    The alternative data industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments. TypeCredit and debit card transactionsSocial mediaMobile application usageWeb scrapped dataOthersEnd-userBFSIIT and telecommunicationRetailOthersGeographyNorth AmericaUSCanadaMexicoEuropeFranceGermanyItalyUKAPACChinaIndiaJapanRest of World (ROW)

    By Type Insights

    The credit and debit card transactions segment is estimated to witness significant growth during the forecast period.Alternative data derived from card and debit card transactions plays a pivotal role in business intelligence, offering valuable insights into consumer spending behaviors. This data is essential for market analysts, financial institutions, and businesses aiming to optimize strategies and enhance customer experiences. Two primary categories exist within this data segment: credit card transactions and debit card transactions. Credit card transactions reveal consumers' discretionary spending patterns, luxury purchases, and credit management abilities. By analyzing this data through quantitative methods, such as regression analysis and time series analysis, businesses can gain a deeper understanding of consumer preferences and trends. Debit card transactions, on the other hand, provide insights into essential spending habits, budgeting strategies, and daily expenses. This data is crucial for understanding consumers' practical needs and lifestyle choices. Machine learning algorithms, such as deep learning and predictive analytics, can be employed to uncover patterns and trends in debit card transactions, enabling businesses to tailor their offerings and services accordingly. Data governance, data security, and data accuracy are critical considerations when dealing with sensitive financial d

  15. d

    Gaming industry data sources & analytics

    • datarade.ai
    .json, .csv, .xls
    Updated Mar 20, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Forloop.ai (2023). Gaming industry data sources & analytics [Dataset]. https://datarade.ai/data-products/gaming-industry-data-sources-analytics-forloop-ai
    Explore at:
    .json, .csv, .xlsAvailable download formats
    Dataset updated
    Mar 20, 2023
    Dataset provided by
    Forloop.ai
    Area covered
    India, Papua New Guinea, Nicaragua, New Caledonia, United States Minor Outlying Islands, Nauru, Comoros, Trinidad and Tobago, Paraguay, Burkina Faso
    Description

    Our gaming industry data solutions leverage data from major gaming platforms like Playstation, Xbox, and Steam to provide businesses with insights into the global gaming market. We collect data on the number of hours played for specific games, as well as monthly trends in player behavior and preferences.

  16. Most useful Data Science resources - Part 3

    • kaggle.com
    Updated Feb 1, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    PAVAN KUMAR D (2021). Most useful Data Science resources - Part 3 [Dataset]. https://www.kaggle.com/mragpavank/most-useful-data-science-resources-part-3
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 1, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    PAVAN KUMAR D
    Description

    These resources were shared publicly in LinkedIn by other users.

  17. O

    Open Source Big Data Tools Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 15, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Archive Market Research (2025). Open Source Big Data Tools Report [Dataset]. https://www.archivemarketresearch.com/reports/open-source-big-data-tools-58978
    Explore at:
    doc, pdf, pptAvailable download formats
    Dataset updated
    Mar 15, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The open-source big data tools market is experiencing robust growth, driven by the increasing need for scalable, cost-effective data management and analysis solutions across diverse sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 18% from 2025 to 2033. This expansion is fueled by several key factors. Firstly, the rising volume and velocity of data generated across industries, from banking and finance to manufacturing and government, necessitate powerful and adaptable tools. Secondly, the cost-effectiveness and flexibility of open-source solutions compared to proprietary alternatives are major drawcards, especially for smaller organizations and startups. The ease of customization and community support further enhance their appeal. Growth is also being propelled by technological advancements such as the development of more sophisticated data analytics tools, improved cloud integration, and increased adoption of containerization technologies like Docker and Kubernetes for deployment and management. The market's segmentation across application (banking, manufacturing, etc.) and tool type (data collection, storage, analysis) reflects the diverse range of uses and specialized tools available. Key restraints to market growth include the complexity associated with implementing and managing open-source solutions, requiring skilled personnel and ongoing maintenance. Security concerns and the need for robust data governance frameworks also pose challenges. However, the growing maturity of the open-source ecosystem, coupled with the emergence of managed services providers offering support and expertise, is mitigating these limitations. The continued advancements in artificial intelligence (AI) and machine learning (ML) are further integrating with open-source big data tools, creating synergistic opportunities for growth in predictive analytics and advanced data processing. This integration, alongside the ever-increasing volume of data needing analysis, will undoubtedly drive continued market expansion over the forecast period.

  18. Most useful Data Science resources - Part 4

    • kaggle.com
    Updated Feb 8, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    PAVAN KUMAR D (2021). Most useful Data Science resources - Part 4 [Dataset]. https://www.kaggle.com/mragpavank/most-useful-data-science-resources-part-4/metadata
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 8, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    PAVAN KUMAR D
    Description

    These resources were shared publicly in LinkedIn by other users.

  19. d

    Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1more
    Updated Apr 11, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).

  20. Global advanced analytics and data science software market share 2025

    • statista.com
    Updated Jun 30, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Global advanced analytics and data science software market share 2025 [Dataset]. https://www.statista.com/statistics/1258535/advanced-analytics-data-science-market-share-technology-worldwide/
    Explore at:
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2025
    Area covered
    Worldwide
    Description

    MATLAB led the global advanced analytics and data science software industry in 2025 with a market share of ***** percent. First launched in 1984, MATLAB is developed by the U.S. firm MathWorks.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa (2021). A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects [Dataset]. http://doi.org/10.6084/m9.figshare.12377237.v3
Organization logo

Data from: A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

Related Article
Explore at:
application/x-gzipAvailable download formats
Dataset updated
Oct 4, 2021
Dataset provided by
Figsharehttp://figshare.com/
Authors
Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ to traditional software projects? We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity.results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.notebooks_out.tar.gz: Tables and figures generated by notebooks.source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes Jupyter notebooks used to produce figures in the paper.The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositoriesPublished in ESEM 2020: https://doi.org/10.1145/3382494.3410680Preprint: https://arxiv.org/abs/2007.08978

Search
Clear search
Close search
Google apps
Main menu