Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? We compare a corpus of 1,048 open-source data science projects to a reference group of 1,099 non-data-science projects with a similar level of quality and maturity.

results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.
notebooks_out.tar.gz: Tables and figures generated by notebooks.
source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes the Jupyter notebooks used to produce the figures in the paper.

The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositories
Published in ESEM 2020: https://doi.org/10.1145/3382494.3410680
Preprint: https://arxiv.org/abs/2007.08978
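The raw violation logs mentioned above come from static analysis tools. As a minimal sketch of how such logs can be tallied, the snippet below parses pylint-style output and counts violations per message code; the log format and file names here are assumptions for illustration, not the actual format shipped in results.tar.gz.

```python
import re
from collections import Counter

# Hypothetical pylint-style log lines; the actual format in
# results.tar.gz may differ.
RAW_LOG = """\
src/train.py:12:0: C0103 (invalid-name) Variable name "X" doesn't conform to snake_case
src/train.py:40:4: W0612 (unused-variable) Unused variable 'tmp'
notebooks/eda.py:3:0: C0114 (missing-module-docstring) Missing module docstring
src/utils.py:7:0: C0103 (invalid-name) Variable name "df" doesn't conform to snake_case
"""

LINE_RE = re.compile(r"^(?P<path>[^:]+):\d+:\d+: (?P<code>[A-Z]\d{4}) \((?P<symbol>[^)]+)\)")

def count_violations(log_text):
    """Tally detected violations by message code."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            counts[m.group("code")] += 1
    return counts

print(count_violations(RAW_LOG))  # e.g. Counter({'C0103': 2, 'W0612': 1, 'C0114': 1})
```

Aggregating such per-code counts across the two project corpora is one way the comparison of followed versus ignored standards could be carried out.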
This statistic illustrates the importance of various data sources for business analytics, compared to the level of access businesses have to those data sources, according to a marketing survey of C-level executives, conducted in ************* by Black Ink. As of *************, product and service usage data was listed as important by ** percent of respondents, but the degree of access to that data was put at ** percent.
Internal structured data is the most commonly used data source for anti-fraud data analytics initiatives in organizations, according to a global company survey in 2019. Almost three quarters of the respondents said that internal structured data was used in their companies for anti-fraud analytics tests.
Data Science Platform Market Size 2025-2029
The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.
The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.
What will be the Size of the Data Science Platform Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
Request Free Sample

The market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with API integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection.
Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.
How is this Data Science Platform Industry segmented?
The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023, for the following segments.

Deployment: On-premises, Cloud
Component: Platform, Services
End-user: BFSI, Retail and e-commerce, Manufacturing, Media and entertainment, Others
Sector: Large enterprises, SMEs
Application: Data Preparation, Data Visualization, Machine Learning, Predictive Analytics, Data Governance, Others
Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan), South America (Brazil), Rest of World (ROW)
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period. In this dynamic market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sentiment.
https://qdr.syr.edu/policies/qdr-standard-access-conditions
Project Overview
This dataset is a catalog of oceans data science initiatives (ODSIs). We define an ODSI as an initiative that mobilizes (often geospatial and temporal) big data and/or novel data sources about the oceans with an express goal of informing or improving conditions in the oceans. ODSI identification began in January 2020, and additional ODSIs will continue to be added. We identified more than 150 ODSIs and populated the catalog with data gathered from ODSI websites describing key features of their work, including 1) the data infrastructure, 2) their organizational structure, 3) the ocean worlds, or ontologies, they create, and 4) the (explicit or implicit) policy and governance ‘solutions’ and relations they promote. The ODSIs in the catalog are global and regional in scope and aim to enhance understanding around three topical concerns: fisheries extraction, biodiversity conservation, and enhancing basic scientific knowledge.

Data overview
For 100 ODSIs, we created metadata about the data architecture, organizational governance, and world-making practices such as their stated purpose, theory of change, and problem/solution framing. For a subset of 30 ODSIs, we created metadata about their policy and governance stances and practices. All metadata was created based on a textual analysis of their websites and public communications.

Data collection overview
Sampling strategy: We began with a purposive sample of ODSIs based on the research team’s prior knowledge of and participation in global and regional ODSIs. This sample allowed us to pilot and refine our metadata catalog approach. We then used a combination of keyword searches on Google using search terms such as ‘ocean data’, ‘marine data’, and ‘fisheries data’. Adopting a snowball sampling method, we reviewed the websites of ODSIs that came up in our initial search to find references to additional ODSIs.
To determine if an entity was an ODSI, we reviewed web pages for information on purpose, goals, objectives, mission, and values (usually in tabs labeled ‘About’, ‘Goals’, or ‘Objectives’), and we looked for links to ‘data’ or ‘data products.’ Entities were selected for our catalog based on two criteria: 1) their stated purpose, goals, objectives, mission, and values indicated a commitment to advancing ocean science and data, and 2) they focused on regional or global scales. We selected and categorized ODSIs according to three broad focal areas in global and regional oceans governance: fisheries extraction, biodiversity conservation, and basic ocean science development.

Shared data organization
This catalog comprises three files. 'Havice_ODSIC.pdf' provides a list of each ODSI included in the catalog and a permalink to the webpage used to populate catalog metadata categories. 'Havice_ODSIC-CodingScheme.pdf' provides a list of code descriptions for the catalog metadata. 'Havice_ODSIC-Metadata.xlsx' is the full catalog with populated metadata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains supplementary material for the paper 'Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic' by Martijn Visser, Nees Jan van Eck, and Ludo Waltman. The data set provides the statistics presented in the figures in the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This description is part of the blog post "Systematic Literature Review of teaching Open Science" https://sozmethode.hypotheses.org/839
In my opinion, we do not pay enough attention to teaching Open Science in higher education. Therefore, I designed a seminar to teach students the practices of Open Science by doing qualitative research. About this seminar, I wrote the article ”Teaching Open Science and qualitative methods“. For that article, I started to review the literature on ”Teaching Open Science“. The result of my literature review is that certain aspects of Open Science are used for teaching. However, Open Science with all its aspects (Open Access, Open Data, Open Methodology, Open Science Evaluation and Open Science Tools) is not an issue in publications about teaching.
Based on this insight, I have started a systematic literature review. I realized quickly that I need help to analyse and interpret the articles and to evaluate my preliminary findings. The different disciplinary cultures of teaching different aspects of Open Science are especially challenging, as I, a social scientist, do not have enough insight to interpret the results correctly. Therefore, I would like to invite you to participate in this research project!
I am now looking for people who would like to join a collaborative process to further explore and write the systematic literature review on “Teaching Open Science“, because I want to turn this project into a Massive Open Online Paper (MOOP). According to the 10 rules of Tennant et al. (2019) on MOOPs, it is crucial to find a core group that is enthusiastic about the topic. Therefore, I am looking for people who are interested in creating the structure of the paper and writing the paper together with me. I am also looking for people who want to search for and review literature or evaluate the literature I have already found. Together with the interested persons I would then define the rules for the project (cf. Tennant et al. 2019). So if you are interested in contributing to the further search for articles and/or in enhancing the interpretation and writing of results, please get in touch. For everyone interested in contributing, the list of articles collected so far is freely accessible at Zotero: https://www.zotero.org/groups/2359061/teaching_open_science. The figure shown below provides a first overview of my ongoing work. I created the figure with the free software yEd and uploaded the file to Zenodo, so everyone can download and work with it:
To make transparent what I have done so far, I will first introduce what a systematic literature review is. Secondly, I describe the decisions I made to start with the systematic literature review. Third, I present the preliminary results.
Systematic literature review – an Introduction
Systematic literature reviews “are a method of mapping out areas of uncertainty, and identifying where little or no relevant research has been done.” (Petticrew/Roberts 2008: 2). Fink defines the systematic literature review as a “systemic, explicit, and reproducible method for identifying, evaluating, and synthesizing the existing body of completed and recorded work produced by researchers, scholars, and practitioners.” (Fink 2019: 6). The aim of a systematic literature review is to overcome the subjectivity of a researcher’s search for literature. However, there can never be an objective selection of articles, because the researcher has already made a preselection by deciding on search strings, for example “Teaching Open Science”. In this respect, transparency is the core criterion for a high-quality review.
In order to achieve high quality and transparency, Fink (2019: 6-7) proposes the following seven steps:
I have adapted these steps for the “Teaching Open Science” systematic literature review. In the following, I will present the decisions I have made.
Systematic literature review – decisions I made
https://opensource.org/licenses/BSD-3-Clause
R code and data for a landscape scan of data services at academic libraries. Original data is licensed CC BY 4.0; data obtained from other sources is licensed according to the original licensing terms. R scripts are licensed under the BSD 3-clause license.

Summary
This work generally focuses on four questions:
1. Which research data services does an academic library provide?
2. For a subset of those services, what form does the support come in (i.e., consulting, instruction, or web resources)?
3. Are there differences in support between three categories of services: data management, geospatial, and data science?
4. How does library resourcing (i.e., salaries) affect the number of research data services?
Approach
Using a direct survey of web resources, we investigated the services offered at 25 Research 1 universities in the United States of America. Please refer to the included README.md files for more information.
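The actual analysis for this dataset is in the included R scripts; purely as an illustration of the kind of tabulation the questions above call for, here is a small Python sketch that counts services per category across libraries. The records, field names, and category labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (library, service, category, support_form);
# the real data uses its own fields -- see the repository's README.md.
records = [
    ("Library A", "data management planning", "data management", "consulting"),
    ("Library A", "GIS support",              "geospatial",      "instruction"),
    ("Library B", "R programming help",       "data science",    "web resource"),
    ("Library B", "metadata support",         "data management", "consulting"),
]

def services_per_category(rows):
    """Count distinct (library, service) offerings in each category."""
    offerings = defaultdict(set)
    for library, service, category, _form in rows:
        offerings[category].add((library, service))
    return {category: len(items) for category, items in offerings.items()}

print(services_per_category(records))
```

Comparing such per-category counts across institutions is one way to probe question 3 above.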
For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to data-management@arizona.edu
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and a baseline for future studies of ag research data.

Purpose
As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to: 1) establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals; 2) compare how much data is in institutional vs. domain-specific vs. federal platforms; 3) determine which repositories are recommended by top journals that require or recommend the publication of supporting data; and 4) ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data.

Approach
The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data.
To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

Search methods
We first compiled a list of known domain-specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if the institution had a repository for its unique, independent research data, if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied.
General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next, we searched the internet for open general data repositories using a variety of search engines; repositories containing a mix of data, journals, books, and other types of records were tested to determine whether they could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. We compiled extensive lists of journals in which USDA published in 2012 and 2016, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites of the Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for journals, based on the 2012 and 2016 studies of where USDA employees publish their research, ranked by number of articles, including: 2015/2016 Impact Factor; author guidelines; Supplemental Data?; Supplemental Data reviewed?; Open Data (Supplemental or in Repository) Required?; and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.
Evaluation
We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, the type of resource searched (datasets, data, images, components, etc.), the percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

Results
A summary of the major findings from our data review: * Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors. * There are few general repositories that are both large and contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection. * Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.

See the included README file for descriptions of each individual data file in this dataset.

Resources in this dataset:
Resource Title: Journals. File Name: Journals.csv
Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
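The evaluation step above flags each search term against two kinds of thresholds: share of the total collection (1% and 5%) and absolute result counts (over 100 and over 500). A minimal sketch of that flagging logic, using made-up counts rather than the study's actual search results:

```python
# Hypothetical result counts for one general-subject repository;
# the real values come from the searches described above.
total_datasets = 50_000
term_results = {"agriculture": 3100, "crop": 900, "soil": 420, "livestock": 80}

def flag_terms(results, total):
    """Apply the evaluation thresholds: share of the total collection
    (at least 1% and 5%) and absolute result counts (>100 and >500)."""
    flags = {}
    for term, n in results.items():
        share = n / total
        flags[term] = {
            "share_pct": round(100 * share, 2),
            "at_least_1pct": share >= 0.01,
            "at_least_5pct": share >= 0.05,
            "over_100": n > 100,
            "over_500": n > 500,
        }
    return flags

for term, f in flag_terms(term_results, total_datasets).items():
    print(term, f)
```

With these sample numbers, "agriculture" would clear every threshold while "livestock" clears none, mirroring how GBIF, ICPSR, and ORNL DAAC stood out in the actual review.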
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract:
In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
The major contributions have materialized in the form of novel algorithms.
Typically, researchers have taken on the challenge of exploring one specific type of observability data source, such as application logs, metrics, or distributed traces, to create new algorithms.
Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.
General Information:
This repository contains simple scripts for data statistics and a link to the multi-source distributed system dataset.
You may find details of this dataset from the original paper:
Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".
If you use the data, implementation, or any details of the paper, please cite!
BIBTEX:
@inproceedings{nedelkoski2020multi,
  title={Multi-source Distributed System Data for AI-Powered Analytics},
  author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej},
  booktitle={European Conference on Service-Oriented and Cloud Computing},
  pages={161--176},
  year={2020},
  organization={Springer}
}
The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced by running a complex distributed system (OpenStack). In addition, we also provide the workload and fault scripts together with the Rally report, which can serve as ground truth. We provide two datasets, which differ in how the workload is executed. The sequential_data is generated by executing a workload of sequential user requests. The concurrent_data is generated by executing a workload of concurrent user requests.
The raw logs in both datasets contain the same files. If the user wants the logs filtered by time with respect to the two datasets, they should refer to the timestamps in the metrics (these provide the time window). In addition, we suggest using the provided aggregated, time-ranged logs for both datasets in CSV format.
Important: The logs and the metrics are synchronized with respect to time, and both are recorded in CEST (Central European Summer Time, UTC+2). The traces are recorded in UTC (Coordinated Universal Time), i.e., 2 hours behind the logs and metrics. They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.
Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/
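Aligning the traces with the logs and metrics amounts to shifting the trace timestamps from UTC to CEST (UTC+2). A minimal stdlib sketch of that conversion; the 'YYYY-MM-DD HH:MM:SS' timestamp format here is an assumption, so adjust the format string to match the actual files.

```python
from datetime import datetime, timedelta, timezone

# Logs and metrics are recorded in CEST, a fixed UTC+2 offset.
CEST = timezone(timedelta(hours=2), name="CEST")

def trace_to_cest(ts_utc: str) -> str:
    """Convert a trace timestamp (assumed 'YYYY-MM-DD HH:MM:SS', UTC)
    to CEST so it lines up with the log/metric timestamps."""
    dt = datetime.strptime(ts_utc, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return dt.astimezone(CEST).strftime("%Y-%m-%d %H:%M:%S")

print(trace_to_cest("2019-11-25 14:00:00"))  # -> 2019-11-25 16:00:00
```

Using a fixed offset rather than a named zone is safe here because the dataset was recorded entirely within the CEST period.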
With our analytics tools, businesses can make data-driven decisions to enhance their operations and stay ahead of the competition. Our analytics tools help businesses gain insights into market trends, identify investment opportunities, and optimize their marketing efforts.
Whether you are a real estate developer, a property investor, or a financial institution, our real estate data sources and analytics can help you gain a competitive edge in the market.
Sources: Idealista IT KnightFrank rightmove Biura Inmuebles24 Sreality propestar habitaclia.com Iroda Findboliger.dk Immoweb Homegate Zimmo Funda.nl (RENT) fotocasa Funda.nl Comparis.ch Google Maps Flexioffices leboncoin idealista ES Remax.pl Realting Kyero parisattitude Finn.no immoscout24.de Immobilier.ch Jaap.nl Immo-vlan SeLoger Booli.se Immowelt Realla idealista (rent) Hemnet.se Home.ch Boliga.dk Instant Office UK idealista PT pisos.com domy.pl
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset offers insights into job postings, primarily focusing on roles in Data Engineering, Data Analysis, Data Science, and Machine Learning Engineering. It contains approximately 1583 records of job information, providing a snapshot of the employment landscape in these fields. The dataset is ideal for understanding market demands and trends.
The dataset is provided as a single CSV file, named 'job_dataset.csv'. It comprises 1583 rows and 8 columns, representing the structure of the collected job information. The data collection occurred around 26th July 2022.
This dataset is well-suited for various analytical tasks: * Cleaning and refining job data. * Identifying the most in-demand skills within the data and machine learning sectors. * Analysing the geographical distribution of jobs. * Conducting Natural Language Processing (NLP) and research on job descriptions. * Market analysis for job seekers, recruiters, and educational institutions.
The dataset has a global scope, with notable concentrations of job postings in locations such as Bengaluru, Karnataka (30%) and Gurgaon, Haryana (7%). The records primarily cover job postings for data-related roles, including Data Engineer, Data Analyst, Data Scientist, and ML Engineer, with data collected around July 2022. Some postings were listed over 30 days prior to the collection date.
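The geographical-distribution analysis described above boils down to tallying the location column and expressing each location as a share of all postings. A small sketch of that computation; the sample values and the notion of a single 'Location' column are assumptions about job_dataset.csv, not its documented schema.

```python
from collections import Counter

# Hypothetical sample of a location column from job_dataset.csv;
# the actual column names and values may differ.
locations = [
    "Bengaluru, Karnataka", "Bengaluru, Karnataka", "Gurgaon, Haryana",
    "Remote", "Bengaluru, Karnataka", "Hyderabad, Telangana",
]

def location_share(values):
    """Percentage of postings per location, rounded to one decimal place."""
    counts = Counter(values)
    total = len(values)
    return {loc: round(100 * n / total, 1) for loc, n in counts.most_common()}

print(location_share(locations))
```

Applied to the full 1583-row file, this kind of tally is what produces figures like the 30% Bengaluru and 7% Gurgaon shares quoted above.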
CC0
This dataset is valuable for: * Data Scientists and Analysts: For market research, trend analysis, and skill demand assessment. * Machine Learning Engineers: To understand job requirements and role distributions. * Researchers: For academic studies on labour markets and skill development. * Job Seekers: To identify popular roles, required skills, and geographical opportunities. * Companies and Recruiters: For talent acquisition strategies and competitor analysis.
Original Data Source: Indeed job (Data science /data analyst/ ML)
https://spdx.org/licenses/CC0-1.0.html
Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum," the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.
Methods
Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.
Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.
The surveys administered during the fall 2018 / spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files.
The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw.
The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey.
The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively.
The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean.
The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
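The raw-to-clean step described above is performed in annotated RMarkdown files. As an illustration only, a minimal Python/pandas equivalent of that kind of survey cleaning (the column names and responses below are hypothetical, not the actual survey data) might look like:

```python
import pandas as pd

def clean_survey(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the kind of cleaning described for the workshop surveys:
    normalize column names, strip stray whitespace, drop empty responses."""
    df = raw.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    return df.dropna(how="all").reset_index(drop=True)

# Tiny illustrative raw response table (invented data).
raw = pd.DataFrame({
    "Workshop Name": ["Introduction to R ", "Data Wrangling in R"],
    "Comfort Level": [" low", "high"],
})
clean = clean_survey(raw)
print(clean.columns.tolist())  # ['workshop_name', 'comfort_level']
```

In practice the raw files are Excel workbooks, so the entry point would be `pd.read_excel` rather than an inline DataFrame; the cleaning logic itself is the same.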
Alternative Data Market Size 2025-2029
The alternative data market size is forecast to increase by USD 60.32 billion, at a CAGR of 52.5% between 2024 and 2029.
The market is experiencing significant growth, driven by the increased availability and diversity of data sources. This expanding data landscape is fueling the rise of alternative data-driven investment strategies across various industries. However, the market faces challenges related to data quality and standardization. As companies increasingly rely on alternative data to inform business decisions, ensuring data accuracy and consistency becomes paramount. Addressing these challenges requires robust data management systems and collaboration between data providers and consumers to establish industry-wide standards. Companies that effectively navigate these dynamics can capitalize on the wealth of opportunities presented by alternative data, driving innovation and competitive advantage.
What will be the Size of the Alternative Data Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
The market continues to evolve, with new applications and technologies shaping its dynamics. Predictive analytics and deep learning are increasingly being integrated into business intelligence systems, enabling more accurate risk management and sales forecasting. Data aggregation from various sources, including social media and web scraping, enriches datasets for more comprehensive quantitative analysis. Data governance and metadata management are crucial for maintaining data accuracy and ensuring data security. Real-time analytics and cloud computing facilitate decision support systems, while data lineage and data timeliness are essential for effective portfolio management. Unstructured data, such as sentiment analysis and natural language processing, provide valuable insights for various sectors.
Machine learning algorithms and execution algorithms are revolutionizing trading strategies, from proprietary trading to high-frequency trading. Data cleansing and data validation are essential for maintaining data quality and relevance. Standard deviation and regression analysis are essential tools for financial modeling and risk management. Data enrichment and data warehousing are crucial for data consistency and completeness, allowing for more effective customer segmentation and sales forecasting. Data security and fraud detection are ongoing concerns, with advancements in technology continually addressing new threats. The market's continuous dynamism is reflected in its integration of various technologies and applications. From data mining and data visualization to supply chain optimization and pricing optimization, the market's evolution is driven by the ongoing unfolding of market activities and evolving patterns.
How is this Alternative Data Industry segmented?
The alternative data industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Type: Credit and debit card transactions; Social media; Mobile application usage; Web scraped data; Others
End-user: BFSI; IT and telecommunication; Retail; Others
Geography: North America (US, Canada, Mexico); Europe (France, Germany, Italy, UK); APAC (China, India, Japan); Rest of World (ROW)
By Type Insights
The credit and debit card transactions segment is estimated to witness significant growth during the forecast period. Alternative data derived from credit and debit card transactions plays a pivotal role in business intelligence, offering valuable insights into consumer spending behaviors. This data is essential for market analysts, financial institutions, and businesses aiming to optimize strategies and enhance customer experiences. Two primary categories exist within this data segment: credit card transactions and debit card transactions. Credit card transactions reveal consumers' discretionary spending patterns, luxury purchases, and credit management abilities. By analyzing this data through quantitative methods, such as regression analysis and time series analysis, businesses can gain a deeper understanding of consumer preferences and trends. Debit card transactions, on the other hand, provide insights into essential spending habits, budgeting strategies, and daily expenses. This data is crucial for understanding consumers' practical needs and lifestyle choices. Machine learning algorithms, such as deep learning and predictive analytics, can be employed to uncover patterns and trends in debit card transactions, enabling businesses to tailor their offerings and services accordingly. Data governance, data security, and data accuracy are critical considerations when dealing with sensitive financial data.
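As a hedged illustration of the kind of quantitative analysis described, the sketch below splits spend by card type and fits a linear trend with least squares. The records, amounts, and column names are invented for the example, not actual card-transaction data:

```python
import pandas as pd
import numpy as np

# Hypothetical transaction records; real card data would come from a provider feed.
tx = pd.DataFrame({
    "month": [1, 1, 2, 2, 3, 3],
    "card_type": ["credit", "debit"] * 3,
    "amount": [120.0, 40.0, 150.0, 42.0, 180.0, 45.0],
})

# Monthly spend per card type: discretionary (credit) vs essential (debit).
monthly = tx.pivot_table(index="month", columns="card_type",
                         values="amount", aggfunc="sum")

# Simple linear trend (least-squares slope) of credit-card spend per month.
slope, intercept = np.polyfit(monthly.index, monthly["credit"], 1)
print(round(slope, 1))  # 30.0
```

Time-series methods (seasonal decomposition, autoregressive models) would replace the plain linear fit in a realistic pipeline; the pivot-then-model shape of the analysis stays the same.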
Our gaming industry data solutions leverage data from major gaming platforms like Playstation, Xbox, and Steam to provide businesses with insights into the global gaming market. We collect data on the number of hours played for specific games, as well as monthly trends in player behavior and preferences.
The open-source big data tools market is experiencing robust growth, driven by the increasing need for scalable, cost-effective data management and analysis solutions across diverse sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 18% from 2025 to 2033. This expansion is fueled by several key factors. Firstly, the rising volume and velocity of data generated across industries, from banking and finance to manufacturing and government, necessitate powerful and adaptable tools. Secondly, the cost-effectiveness and flexibility of open-source solutions compared to proprietary alternatives are major draws, especially for smaller organizations and startups. The ease of customization and community support further enhance their appeal. Growth is also being propelled by technological advancements such as the development of more sophisticated data analytics tools, improved cloud integration, and increased adoption of containerization technologies like Docker and Kubernetes for deployment and management. The market's segmentation across application (banking, manufacturing, etc.) and tool type (data collection, storage, analysis) reflects the diverse range of uses and specialized tools available.
Key restraints to market growth include the complexity associated with implementing and managing open-source solutions, which requires skilled personnel and ongoing maintenance. Security concerns and the need for robust data governance frameworks also pose challenges. However, the growing maturity of the open-source ecosystem, coupled with the emergence of managed service providers offering support and expertise, is mitigating these limitations. The continued advancements in artificial intelligence (AI) and machine learning (ML) are further integrating with open-source big data tools, creating synergistic opportunities for growth in predictive analytics and advanced data processing. This integration, alongside the ever-increasing volume of data needing analysis, will undoubtedly drive continued market expansion over the forecast period.
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
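The paper's exact algorithm is not reproduced here; the sketch below illustrates only the core idea under simplifying assumptions (one-dimensional synthetic data, a mean/std "model", and an arbitrary z-score cutoff): each site forwards a small random sample, the coordinator fits a global model from the pooled samples, and the model is broadcast back so each site can flag its outliers locally without centralizing the full data.

```python
import random

random.seed(0)

# Three "sites", each holding a partition of 1-D sensor readings (synthetic).
sites = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(3)]
sites[2][0] = 12.0  # inject one obvious outlier at site 2

# Step 1: each site sends only a small random sample to the coordinator.
samples = [random.sample(data, 30) for data in sites]
pooled = [x for s in samples for x in s]

# Step 2: the coordinator estimates a global model (mean/std) from the samples.
n = len(pooled)
mu = sum(pooled) / n
sd = (sum((x - mu) ** 2 for x in pooled) / n) ** 0.5

# Step 3: the model is broadcast back; each site flags its own outliers locally.
threshold = 4.0  # z-score cutoff (an assumption, not from the paper)
outliers = [(site_id, x) for site_id, data in enumerate(sites)
            for x in data if abs(x - mu) / sd > threshold]
print(any(x == 12.0 for _, x in outliers))  # True
```

The communication cost here is 30 values per site instead of 1000, which is the trade-off the paper quantifies: accuracy close to full centralization at a fraction of the bandwidth.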
MATLAB led the global advanced analytics and data science software industry in 2025 with a market share of ***** percent. First launched in 1984, MATLAB is developed by the U.S. firm MathWorks.