Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of notebooks in this benchmark also include data dependencies, so this benchmark not only can test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant for more details about state of the art results and other properties of the dataset.
https://www.databridgemarketresearch.com/privacy-policyhttps://www.databridgemarketresearch.com/privacy-policy
Report Metric |
Details |
Forecast Period |
2023 to 2030 |
Base Year |
2022 |
Historic Years |
2021 (Customizable to 2015-2020) |
Quantitative Units |
Revenue in USD Billion, Volumes in Units, Pricing in USD |
Segments Covered |
Component Type (Platform, Services), Function Division (Marketing, Sales, Logistics, Finance and Accounting, Customer Support, Business Operations, Others), Deployment Model (On-Premises, Cloud based), Organization Size (Small and Medium-sized Enterprises (SMEs), Large Enterprises), End User Application (Banking, Financial Services, and Insurance (BFSI), Telecom and IT, Retail and E-commerce, Healthcare and Life sciences, Manufacturing, Energy and Utilities, Media and Entertainment, Transportation and Logistics, Government, Others) |
Countries Covered |
U.S., Canada and Mexico in North America, Germany, France, U.K., Netherlands, Switzerland, Belgium, Russia, Italy, Spain, Turkey, Rest of Europe in Europe, China, Japan, India, South Korea, Singapore, Malaysia, Australia, Thailand, Indonesia, Philippines, Rest of Asia-Pacific (APAC) in the Asia-Pacific (APAC), Saudi Arabia, U.A.E, South Africa, Egypt, Israel, Rest of Middle East and Africa (MEA) as a part of Middle East and Africa (MEA), Brazil, Argentina and Rest of South America as part of South America. |
Market Players Covered |
IBM (U.S.), DataRobot Inc., (U.S.), apheris AI GmbH (Germany), The Digital Talent Ecosystem (U.S.), Databand (Israel), dotData (U.S.), Explorium Inc., (U.S.), Noogata (Israel), Tecton Inc., (U.S.), Spell Designs Pty Ltd (U.S.), Arrikto Inc., (U.S.), Iterative (U.S.), Google Inc (U.S.), Microsoft (U.S.), SAS Institute Inc., (U.S.), Amazon Web Services, Inc. (U.S.), The MathWorks, Inc. (U.S.), Cloudera Inc.,(U.S.), Teradata (U.S.), TIBCO Software Inc. (U.S.), ALTERYX, INC. (U.S.), RapidMiner (U.S.), Databricks (U.S.), Snowflake Inc., (U.S.), H2O.ai (U.S.), Altair Inc., (U.S.), Anaconda Inc., (U.S.), SAP SE (U.S.), Domino Data Lab Inc., (U.S.) and Dataiku (U.S.) |
Market Opportunities |
|
A tech stack represents a combination of technologies a company uses in order to build and run an application or project. The most popular technology skill in the data science tech stack in 2023 was Python 3.x, chosen by 65 percent of respondents. PySpark ranked second, being preferred by 13 percent of respondents.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical foundations of data science is a book. Explore Statistical foundations of data science through unique data from The British Library.
In 2022, over 139 thousand of the data science job positions were available in multi-national corporation IT and KPO service provider companies in the south Asian country of India. An increase in the availability of the data science jobs was seen over the years from 2019.
espejelomar/data-science-job-salaries dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It comprehends three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset contributes to reducing the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical field problems in the scope of data science. Due to its unit of analysis, it is a dataset especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but also be used in classification and regression problems.
The Biological and Environmental Data Education Network (BEDE Network) develops and shares teacher-training workshops, curricular designs, teaching modules, and best practices to help integrate computational data science skills into all levels of the biological and environmental sciences curriculum.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Quran Inspired us best optimization algorithms within our universe
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Dataintelo published a new report titled “Data Science and Machine-Learning Platforms Market research report which is segmented by Type (Open Source Data Integration Tools, Cloud-based Data Integration Tools, Hybrid Data Integration Tools), by Application (Small-Sized Enterprises, Medium-Sized Enterprises, Large Enterprises), by Industry (Healthcare, Finance, Retail, Manufacturing, IT & Telecommunication, Government, Energy & Utilities, Transportation), by Deployment (On-Premise, Cloud), by Players/Companies SAS, Alteryx, IBM, RapidMiner, KNIME, Microsoft, Dataiku, Databricks, TIBCO Software, MathWorks, H20.ai, Anaconda, SAP, Google, Domino Data Lab, Angoss, Lexalytics, Rapid Insight”. As per the study the market is expected to grow at a CAGR of XX% in the forecast period.
Report Attributes | Report Details |
Report Title | Data Science and Machine-Learning Platforms Market Research Report |
By Type | Open Source Data Integration Tools, Cloud-based Data Integration Tools, Hybrid Data Integration Tools |
By Application | Small-Sized Enterprises, Medium-Sized Enterprises, Large Enterprises |
By Industry | Healthcare, Finance, Retail, Manufacturing, IT & Telecommunication, Government, Energy & Utilities, Transportation |
By Deployment | On-Premise, Cloud |
By Companies | SAS, Alteryx, IBM, RapidMiner, KNIME, Microsoft, Dataiku, Databricks, TIBCO Software, MathWorks, H20.ai, Anaconda, SAP, Google, Domino Data Lab, Angoss, Lexalytics, Rapid Insight |
Regions Covered | North America, Europe, APAC, Latin America, MEA |
Base Year | 2023 |
Historical Year | 2017 to 2022 (Data from 2010 can be provided as per availability) |
Forecast Year | 2032 |
Number of Pages | 127 |
Number of Tables & Figures | 236 |
Customization Available | Yes, the report can be customized as per your need. |
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A DataSet of Supply Chains used by the company DataCo Global was used for the analysis. Dataset of Supply Chain , which allows the use of Machine Learning Algorithms and R Software. Areas of important registered activities : Provisioning , Production , Sales , Commercial Distribution.It also allows the correlation of Structured Data with Unstructured Data for knowledge generation.
Type Data : Structured Data : DataCoSupplyChainDataset.csv Unstructured Data : tokenized_access_logs.csv (Clickstream)
Types of Products : Clothing , Sports , and Electronic Supplies
Additionally it is attached in another file called DescriptionDataCoSupplyChain.csv, the description of each of the variables of the DataCoSupplyChainDatasetc.csv.
This dataset was created by Elroy
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Greetings , fellow analyst !
REC corp LTD. is small-scaled business venture established in India. They have been selling FOUR PRODUCTS for OVER TEN YEARS . The products are P1, P2, P3 and P4.
They have collected data from their retail centers and organized it into a small csv file , which has been given to you. **The excel file contains about 8 numerical parameters : **
Q1- Total unit sales of product 1
Q2- Total unit sales of product 2
Q3- Total unit sales of product 3
Q4- Total unit sales of product 4
S1- Total revenue from product 1
S2- Total revenue from product 2
S3- Total revenue from product 3
S4- Total revenue from product 4
Example :
On 13-06-2010 , product 1 had been brought by 5422 people and INR 17187.74 had been generated in revenue from product 1.
**Now , REC corp needs you to solve the following questions : **
1) Is there any trend in the sales of all four products during certain months?
2) Out of all four products , which product has seen the highest sales in all the given years?
3) The company has all it's retail centers closed on the 31st of December every year. Mr: Hariharan , the CEO , would love to get an estimate on no: of units of each product that could be sold on 31st of Dec , every year , if all their retail centers were kept open.
4) The CEO is considering an idea to drop the production of any one of the products. He wants you to analyze this data and suggest whether his idea would result in a massive setback for the company.
5) The CEO would also like to predict the sales and revenues for the year 2024. He wants you to give a yearly estimate with the best possible accuracy.
Can you help REC corp with your analytical and data science skills ?
NOTE: This is a hypothetical dataset generated using python for educational purposes. It bears no resemblance to any real firm. Any similarity is a matter of coincidence.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data files for "Python for Data Science" (AY250; UC Berkeley)
Course website: https://github.com/profjsb/python-seminar
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data science reseach & publication dataset, which was indexed by Scopus from 1983 to 2019. The dataset contains data authors, authors ID Scopus, title, year, source title, volume, issue, article number in Scopus, DOI, link, affiliation, abstract, index keywords, references, Correspondence Address, editors, publisher, conference name, conference date, conference code, ISSN, language, document type, access type, and EID.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store , a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data Learn more about the data The data includes The data is typical of what an ecommerce website would see and includes the following information:Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display trafficContent data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions on the Google Merchandise Store website.Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated such as fullVisitorId, or removed such as clientId, adWordsClickInfo and geoNetwork. “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying the fields containing no data.This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data files for lecture material.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2
Abstract The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.
Data Description The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files. • Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022) • Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022) • Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022) • Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022) • Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022) • Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)
The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
The journals’ author guidelines and/or editorial policies were examined on whether they take a stance with regard to the availability of the underlying data of the submitted article. The mere explicated possibility of providing supplementary material along with the submitted article was not considered as a research data policy in the present study. Furthermore, the present article excluded source codes or algorithms from the scope of the paper and thus policies related to them are not included in the analysis of the present article.
For selection of journals within the field of neurosciences, Clarivate Analytics’ InCites Journal Citation Reports database was searched using categories of neurosciences and neuroimaging. From the results, journals with the 40 highest Impact Factor (for the year 2017) indicators were extracted for scrutiny of research data policies. Respectively, the selection journals within the field of physics was created by performing a similar search with the categories of physics, applied; physics, atomic, molecular & chemical; physics, condensed matter; physics, fluids & plasmas; physics, mathematical; physics, multidisciplinary; physics, nuclear and physics, particles & fields. From the results, journals with the 40 highest Impact Factor indicators were again extracted for scrutiny. Similarly, the 40 journals representing the field of operations research were extracted by using the search category of operations research and management.
Journal-specific data policies were sought from journal specific websites providing journal specific author guidelines or editorial policies. Within the present study, the examination of journal data policies was done in May 2019. The primary data source was journal-specific author guidelines. If journal guidelines explicitly linked to the publisher’s general policy with regard to research data, these were used in the analyses of the present article. If journal-specific research data policy, or lack of, was inconsistent with the publisher’s general policies, the journal-specific policies and guidelines were prioritized and used in the present article’s data. If journals’ author guidelines were not openly available online due to, e.g., accepting submissions on an invite-only basis, the journal was not included in the data of the present article. Also journals that exclusively publish review articles were excluded and replaced with the journal having the next highest Impact Factor indicator so that each set representing the three field of sciences consisted of 40 journals. The final data thus consisted of 120 journals in total.
‘Public deposition’ refers to a scenario where researcher deposits data to a public repository and thus gives the administrative role of the data to the receiving repository. ‘Scientific sharing’ refers to a scenario where researcher administers his or her data locally and by request provides it to interested reader. Note that none of the journals examined in the present article required that all data types underlying a submitted work should be deposited into a public data repositories. However, some journals required public deposition of data of specific types. Within the journal research data policies examined in the present article, these data types are well presented by the Springer Nature policy on “Availability of data, materials, code and protocols” (Springer Nature, 2018), that is, DNA and RNA data; protein sequences and DNA and RNA sequencing data; genetic polymorphisms data; linked phenotype and genotype data; gene expression microarray data; proteomics data; macromolecular structures and crystallographic data for small molecules. Furthermore, the registration of clinical trials in a public repository was also considered as a data type in this study. The term specific data types used in the custom coding framework of the present study thus refers to both life sciences data and public registration of clinical trials. These data types have community-endorsed public repositories where deposition was most often mandated within the journals’ research data policies.
The term ‘location’ refers to whether the journal’s data policy provides suggestions or requirements for the repositories or services used to share the underlying data of the submitted works. A mere general reference to ‘public repositories’ was not considered a location suggestion, but only references to individual repositories and services. The category of ‘immediate release of data’ examines whether the journals’ research data policy addresses the timing of publication of the underlying data of submitted works. Note that even though the journals may only encourage public deposition of the data, the editorial processes could be set up so that it leads to either publication of the research data or the research data metadata in conjunction to publishing of the submitted work.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PhishRepo is implemented to fill the data gap in the anti-phishing domain, and it is still at an experimental level. PhishRepo collects the data available here during its testing stage, and the dataset includes verified phishing webpages. Therefore, it contains few data points only. The provided dataset contains diverse information sources collected related to the latest phishing pages. The diverse feature-rich data present in the dataset is a current need in the machine learning-based anti-phishing domain to overcome inept learning models in phishing detection. The dataset can be used to analyse significant phishing features, experiment with different feature extraction techniques, effectively try out some representation learning techniques such as deep learning from these raw data at a practical level. The dataset contains an index.csv file, and it will be the main file that should be used when mapping index file content with available folders. Generally, a folder should contain a webpage.html, alexa.xml, response.csv, screenshot.png and fullview.png files and src folder, which carries offline webpage resources. If something is missing in the folder level, that indicates in the index.csv file.
Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of notebooks in this benchmark also include data dependencies, so this benchmark not only can test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant for more details about state of the art results and other properties of the dataset.