100+ datasets found

Data from: Data Science Problems Dataset
paperswithcode.com
Updated Nov 17, 2022
Shubham Chandel; Colin B. Clement; Guillermo Serrato; Neel Sundaresan (2022). Data Science Problems Dataset [Dataset]. https://paperswithcode.com/dataset/data-science-problems
Dataset updated
Nov 17, 2022
Authors
Shubham Chandel; Colin B. Clement; Guillermo Serrato; Neel Sundaresan
Description
Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of the notebooks in this benchmark also include data dependencies, so the benchmark can not only test a model's ability to chain together complex tasks but also evaluate the solutions on real data! See our paper, "Training and Evaluating a Jupyter Notebook Data Science Assistant", for more details about state-of-the-art results and other properties of the dataset.
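The idea of verifying a generated solution with unit tests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `passes_unit_test`, the sample problem, and the test are all hypothetical and are not the benchmark's own harness, which runs inside the Docker environment mentioned above.

```python
def passes_unit_test(candidate_code: str, test_code: str) -> bool:
    """Return True if test_code runs without error after candidate_code."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate solution
        exec(test_code, namespace)       # failing assertions raise here
    except Exception:
        return False
    return True


# Hypothetical problem: "Write mean(xs) returning the arithmetic mean."
good = "def mean(xs):\n    return sum(xs) / len(xs)\n"
bad = "def mean(xs):\n    return sum(xs)\n"
test = "assert mean([1, 2, 3]) == 2\n"

results = {
    "good": passes_unit_test(good, test),
    "bad": passes_unit_test(bad, test),
}
print(results)
```

Running untrusted generated code with `exec` is unsafe outside a sandbox, which is one reason an isolated, reproducible environment matters for this kind of evaluation.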

Global Data Science Platform Market – Industry Trends and Forecast to 2030

databridgemarketresearch.com

Updated May 2023

Data Bridge Market Research (2023). Global Data Science Platform Market – Industry Trends and Forecast to 2030 [Dataset]. https://www.databridgemarketresearch.com/reports/global-data-science-platform-market

Dataset updated

May 2023

Dataset authored and provided by

Data Bridge Market Research

License

https://www.databridgemarketresearch.com/privacy-policy

Time period covered

2023 - 2030

Area covered

Global

Description

Report Metric	Details
Forecast Period	2023 to 2030
Base Year	2022
Historic Years	2021 (Customizable to 2015-2020)
Quantitative Units	Revenue in USD Billion, Volumes in Units, Pricing in USD
Segments Covered	Component Type (Platform, Services), Function Division (Marketing, Sales, Logistics, Finance and Accounting, Customer Support, Business Operations, Others), Deployment Model (On-Premises, Cloud based), Organization Size (Small and Medium-sized Enterprises (SMEs), Large Enterprises), End User Application (Banking, Financial Services, and Insurance (BFSI), Telecom and IT, Retail and E-commerce, Healthcare and Life sciences, Manufacturing, Energy and Utilities, Media and Entertainment, Transportation and Logistics, Government, Others)
Countries Covered	U.S., Canada and Mexico in North America; Germany, France, U.K., Netherlands, Switzerland, Belgium, Russia, Italy, Spain, Turkey and Rest of Europe in Europe; China, Japan, India, South Korea, Singapore, Malaysia, Australia, Thailand, Indonesia, Philippines and Rest of Asia-Pacific (APAC) in Asia-Pacific (APAC); Saudi Arabia, U.A.E., South Africa, Egypt, Israel and Rest of Middle East and Africa (MEA) in Middle East and Africa (MEA); Brazil, Argentina and Rest of South America in South America
Market Players Covered	IBM (U.S.), DataRobot Inc., (U.S.), apheris AI GmbH (Germany), The Digital Talent Ecosystem (U.S.), Databand (Israel), dotData (U.S.), Explorium Inc., (U.S.), Noogata (Israel), Tecton Inc., (U.S.), Spell Designs Pty Ltd (U.S.), Arrikto Inc., (U.S.), Iterative (U.S.), Google Inc (U.S.), Microsoft (U.S.), SAS Institute Inc., (U.S.), Amazon Web Services, Inc. (U.S.), The MathWorks, Inc. (U.S.), Cloudera Inc.,(U.S.), Teradata (U.S.), TIBCO Software Inc. (U.S.), ALTERYX, INC. (U.S.), RapidMiner (U.S.), Databricks (U.S.), Snowflake Inc., (U.S.), H2O.ai (U.S.), Altair Inc., (U.S.), Anaconda Inc., (U.S.), SAP SE (U.S.), Domino Data Lab Inc., (U.S.) and Dataiku (U.S.)
Market Opportunities	Rapid advancements in technologies such as artificial intelligence (AI), machine learning (ML), and internet of things (IoT); increasing investment in research and development

Most used technologies in the data science tech stack worldwide 2023
statista.com
teosuisse.net
Updated Mar 22, 2024
Statista (2024). Most used technologies in the data science tech stack worldwide 2023 [Dataset]. https://www.statista.com/statistics/1292394/popular-technologies-in-the-data-science-tech-stack/
Dataset updated
Mar 22, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Dec 1, 2022 - Dec 1, 2023
Area covered
Worldwide
Description
A tech stack represents a combination of technologies a company uses in order to build and run an application or project. The most popular technology skill in the data science tech stack in 2023 was Python 3.x, chosen by 65 percent of respondents. PySpark ranked second, being preferred by 13 percent of respondents.
Data from: Statistical foundations of data science
workwithdata.com
Updated May 27, 2024
Work With Data (2024). Statistical foundations of data science [Dataset]. https://www.workwithdata.com/object/statistical-foundations-data-science-book-by-jianqing-fan-0000
Dataset updated
May 27, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Statistical foundations of data science is a book. Explore Statistical foundations of data science through unique data from The British Library.
Number of open data science jobs India 2019-2022, by company type
statista.com
Updated Mar 13, 2024
Statista (2024). Number of open data science jobs India 2019-2022, by company type [Dataset]. https://www.statista.com/statistics/1320198/india-number-of-available-data-science-jobs-by-company-type/
Dataset updated
Mar 13, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
India
Description
In 2022, over 139 thousand of the open data science positions in India were at multinational-corporation IT and KPO service provider companies. The availability of data science jobs has increased over the years since 2019.
data-science-job-salaries
huggingface.co
Updated Aug 8, 2023
Omar Espejel (2023). data-science-job-salaries [Dataset]. https://huggingface.co/datasets/espejelomar/data-science-job-salaries
Explore at:
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 8, 2023
Authors
Omar Espejel
Description
espejelomar/data-science-job-salaries dataset hosted on Hugging Face and contributed by the HF Datasets community
Lisbon, Portugal, hotel’s customer dataset with three years of personal,...
data.mendeley.com
b2find.dkrz.de
Updated Nov 18, 2020
Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
Explore at:
Unique identifier
https://doi.org/10.17632/j83f5fsh6c.1
Dataset updated
Nov 18, 2020
Authors
Nuno Antonio
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Portugal, Lisbon
Description
Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It covers three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset helps reduce the shortage of real-world business data that can be used for educational and research purposes. It can be used for data mining, machine learning, and other analytical problems within the scope of data science. Because of its unit of analysis, the dataset is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
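The RFM idea mentioned in the description can be sketched as follows. This is a minimal illustration, assuming transaction-level records of the form (customer, date, amount); the function `rfm_table` and the sample records are hypothetical and do not reflect the dataset's actual 31 variables.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, as_of):
    """Compute (recency_days, frequency, monetary) per customer.

    transactions: iterable of (customer_id, purchase_date, amount) tuples.
    as_of: reference date used to measure recency.
    """
    last_seen = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, purchase_date, amount in transactions:
        # Track the most recent purchase date per customer.
        if customer_id not in last_seen or purchase_date > last_seen[customer_id]:
            last_seen[customer_id] = purchase_date
        frequency[customer_id] += 1
        monetary[customer_id] += amount
    return {
        cid: ((as_of - last_seen[cid]).days, frequency[cid], monetary[cid])
        for cid in last_seen
    }


# Made-up records, not taken from the hotel dataset.
sample = [
    ("A", date(2018, 1, 10), 120.0),
    ("A", date(2018, 6, 1), 80.0),
    ("B", date(2017, 12, 25), 300.0),
]
rfm = rfm_table(sample, as_of=date(2018, 12, 31))
print(rfm)
```

The three values per customer can then be binned into scores or passed to a clustering algorithm to form segments.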
Data from: BEDE - Biological and Environmental Data Education Network:...
qubeshub.org
Updated May 11, 2023
Matthew Aiello-Lammens; Sarah Supp; Erika Crispo; Kelly O'Donnell; Nate Emery (2023). BEDE - Biological and Environmental Data Education Network: Preparing Instructors to Integrate Data Science into Undergraduate Biology and Environmental Science Curricula (RCN-UBE Introduction) [Dataset]. http://doi.org/10.25334/1T2P-NK24
Explore at:
Unique identifier
https://doi.org/10.25334/1T2P-NK24
Dataset updated
May 11, 2023
Dataset provided by
QUBES
Authors
Matthew Aiello-Lammens; Sarah Supp; Erika Crispo; Kelly O'Donnell; Nate Emery
Description
The Biological and Environmental Data Education Network (BEDE Network) develops and shares teacher-training workshops, curricular designs, teaching modules, and best practices to help integrate computational data science skills into all levels of the biological and environmental sciences curriculum.
Quranic Data Science
osf.io
Updated Oct 3, 2018
Ahmed Aly Yahia (2018). Quranic Data Science [Dataset]. http://doi.org/10.17605/OSF.IO/7BAEG
Explore at:
Unique identifier
https://doi.org/10.17605/OSF.IO/7BAEG
Dataset updated
Oct 3, 2018
Dataset provided by
Center for Open Science (https://cos.io/)
Authors
Ahmed Aly Yahia
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The Quran inspired us with the best optimization algorithms within our universe.

Data Science and Machine-Learning Platforms Market Research Report 2023-2032...

dataintelo.com

csv, pdf, pptx

Updated Sep 8, 2023

Dataintelo (2023). Data Science and Machine-Learning Platforms Market Research Report 2023-2032 [Dataset]. https://dataintelo.com/report/data-science-and-machine-learning-platforms-market-report

Explore at:

Available download formats: pdf, pptx, csv

Dataset updated

Sep 8, 2023

Dataset authored and provided by

Dataintelo

License

https://dataintelo.com/privacy-and-policy

Time period covered

2024 - 2032

Area covered

Global

Description

Dataintelo published a new report titled “Data Science and Machine-Learning Platforms Market Research Report”, which is segmented by Type (Open Source Data Integration Tools, Cloud-based Data Integration Tools, Hybrid Data Integration Tools), by Application (Small-Sized Enterprises, Medium-Sized Enterprises, Large Enterprises), by Industry (Healthcare, Finance, Retail, Manufacturing, IT & Telecommunication, Government, Energy & Utilities, Transportation), by Deployment (On-Premise, Cloud), and by Players/Companies (SAS, Alteryx, IBM, RapidMiner, KNIME, Microsoft, Dataiku, Databricks, TIBCO Software, MathWorks, H2O.ai, Anaconda, SAP, Google, Domino Data Lab, Angoss, Lexalytics, Rapid Insight). According to the study, the market is expected to grow at a CAGR of XX% in the forecast period.

Report Scope

Report Attributes	Report Details
Report Title	Data Science and Machine-Learning Platforms Market Research Report
By Type	Open Source Data Integration Tools, Cloud-based Data Integration Tools, Hybrid Data Integration Tools
By Application	Small-Sized Enterprises, Medium-Sized Enterprises, Large Enterprises
By Industry	Healthcare, Finance, Retail, Manufacturing, IT & Telecommunication, Government, Energy & Utilities, Transportation
By Deployment	On-Premise, Cloud
By Companies	SAS, Alteryx, IBM, RapidMiner, KNIME, Microsoft, Dataiku, Databricks, TIBCO Software, MathWorks, H2O.ai, Anaconda, SAP, Google, Domino Data Lab, Angoss, Lexalytics, Rapid Insight
Regions Covered	North America, Europe, APAC, Latin America, MEA
Base Year	2023
Historical Year	2017 to 2022 (Data from 2010 can be provided as per availability)
Forecast Year	2032
Number of Pages	127
Number of Tables & Figures	236
Customization Available	Yes, the report can be customized as per your need.

DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS
data.mendeley.com
narcis.nl
Updated Mar 12, 2019
Fabian Constante (2019). DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS [Dataset]. http://doi.org/10.17632/8gx2fvg2k6.3
Explore at:
Unique identifier
https://doi.org/10.17632/8gx2fvg2k6.3
Dataset updated
Mar 12, 2019
Authors
Fabian Constante
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The analysis used a dataset of supply chains from the company DataCo Global. The dataset allows the use of machine learning algorithms and R software. The important registered activity areas are provisioning, production, sales, and commercial distribution. It also allows structured data to be correlated with unstructured data for knowledge generation.

Data types: Structured data: DataCoSupplyChainDataset.csv. Unstructured data: tokenized_access_logs.csv (clickstream).

Types of products: clothing, sports, and electronic supplies.

A separate file, DescriptionDataCoSupplyChain.csv, describes each of the variables in DataCoSupplyChainDataset.csv.
Indeed Dataset - Data Scientist/Analyst/Engineer)
kaggle.com
zip
Updated Nov 2, 2018
Elroy (2018). Indeed Dataset - Data Scientist/Analyst/Engineer) [Dataset]. https://www.kaggle.com/elroyggj/indeed-dataset-data-scientistanalystengineer
Explore at:
Available download formats: zip (5298676 bytes)
Dataset updated
Nov 2, 2018
Authors
Elroy
Description
Dataset

This dataset was created by Elroy

Contents
Product Sales Data
kaggle.com
Updated Feb 11, 2023
K S ABISHEK (2023). Product Sales Data [Dataset]. http://doi.org/10.34740/kaggle/dsv/4980479
Explore at:
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/4980479
Dataset updated
Feb 11, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
K S ABISHEK
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Greetings, fellow analyst!

REC corp LTD. is a small-scale business venture established in India. They have been selling FOUR PRODUCTS for OVER TEN YEARS. The products are P1, P2, P3 and P4.

They have collected data from their retail centers and organized it into a small CSV file, which has been given to you. **The file contains 8 numerical parameters:**

Q1- Total unit sales of product 1
Q2- Total unit sales of product 2
Q3- Total unit sales of product 3
Q4- Total unit sales of product 4

S1- Total revenue from product 1
S2- Total revenue from product 2
S3- Total revenue from product 3
S4- Total revenue from product 4

Example:
On 13-06-2010, product 1 was bought by 5422 people, generating INR 17187.74 in revenue from product 1.
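As a quick arithmetic check on the example row, the implied revenue per unit can be worked out directly. The variable names are illustrative only; the two figures are the ones quoted in the example.

```python
units_sold = 5422        # Q1 on 13-06-2010, from the example
revenue_inr = 17187.74   # S1 on the same date, from the example

# Revenue divided by units gives the implied price per unit of product 1.
revenue_per_unit = round(revenue_inr / units_sold, 2)
print(revenue_per_unit)
```

Comparing this ratio across dates is one simple way to see whether a product's unit price is constant in the data.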

**Now, REC corp needs you to answer the following questions:**

1) Is there any trend in the sales of all four products during certain months?
2) Out of all four products, which product has seen the highest sales in all the given years?
3) The company has all its retail centers closed on the 31st of December every year. Mr. Hariharan, the CEO, would love to get an estimate of the number of units of each product that could be sold on the 31st of December every year if all their retail centers were kept open.
4) The CEO is considering an idea to drop the production of any one of the products. He wants you to analyze this data and suggest whether his idea would result in a massive setback for the company.
5) The CEO would also like to predict the sales and revenues for the year 2024. He wants you to give a yearly estimate with the best possible accuracy.
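A first step toward the monthly-trend question could look like the sketch below. It is only an illustration under stated assumptions: the column name `Date` and the `dd-mm-yyyy` format are guesses based on the example date, the function `monthly_unit_totals` is hypothetical, and all sample rows except the quoted Q1 figure are made up.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def monthly_unit_totals(csv_text, date_column="Date"):
    """Sum Q1..Q4 unit sales per calendar month (1-12) across all years."""
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed date format, matching the "13-06-2010" example.
        month = datetime.strptime(row[date_column], "%d-%m-%Y").month
        for i, col in enumerate(("Q1", "Q2", "Q3", "Q4")):
            totals[month][i] += int(row[col])
    return dict(totals)


# Made-up rows for illustration; only the 5422 figure comes from the example.
sample_csv = """Date,Q1,Q2,Q3,Q4
13-06-2010,5422,100,200,300
14-06-2010,10,20,30,40
01-07-2010,1,2,3,4
"""
totals = monthly_unit_totals(sample_csv)
print(totals)
```

Comparing the per-month totals across products is a reasonable starting point before fitting any seasonal or forecasting model for the later questions.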

Can you help REC corp with your analytical and data science skills?

NOTE: This is a hypothetical dataset generated using Python for educational purposes. It bears no resemblance to any real firm. Any similarity is a matter of coincidence.
"Python for Data Science" (AY250; UC Berkeley) Data files
zenodo.org
application/gzip, bin
Updated Jan 22, 2022
Joshua; Joshua (2022). "Python for Data Science" (AY250; UC Berkeley) Data files [Dataset]. http://doi.org/10.5281/zenodo.5889322
Explore at:
Available download formats: bin, application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.5889322
Dataset updated
Jan 22, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Joshua; Joshua
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Berkeley
Description
Data files for "Python for Data Science" (AY250; UC Berkeley)

homework1_data.tgz - Data for HW1

Course website: https://github.com/profjsb/python-seminar
Data Science Publication (1983-2019)
data.mendeley.com
commons.datacite.org
Updated Apr 13, 2020
Agung Purnomo (2020). Data Science Publication (1983-2019) [Dataset]. http://doi.org/10.17632/4c3mpmwk74.1
Explore at:
Unique identifier
https://doi.org/10.17632/4c3mpmwk74.1
Dataset updated
Apr 13, 2020
Authors
Agung Purnomo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset covers data science research and publications indexed by Scopus from 1983 to 2019. It contains authors, Scopus author IDs, title, year, source title, volume, issue, article number in Scopus, DOI, link, affiliation, abstract, index keywords, references, correspondence address, editors, publisher, conference name, conference date, conference code, ISSN, language, document type, access type, and EID.
Google Analytics Sample
console.cloud.google.com
Updated Jul 15, 2017
https://console.cloud.google.com/marketplace/browse?filter=partner:Obfuscated%20Google%20Analytics%20360%20data (2017). Google Analytics Sample [Dataset]. https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data
Dataset updated
Jul 15, 2017
Dataset provided by
Google (http://google.com/)
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way to analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data.

The data is typical of what an ecommerce website would see and includes the following information:
Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display traffic.
Content data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc.
Transactional data: information about the transactions on the Google Merchandise Store website.

Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated, such as fullVisitorId, or removed, such as clientId, adWordsClickInfo and geoNetwork. “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying the fields containing no data.

This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
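When working with query results, the placeholder values described in the limitations can be normalized before analysis. The sketch below assumes rows arrive as plain Python dictionaries; the helper names `clean_row` and `missing_fields` and the sample row are hypothetical, and only the placeholder string and field names come from the description.

```python
PLACEHOLDER = "Not available in demo dataset"


def clean_row(row):
    """Map the demo dataset's placeholder string to None; keep other values."""
    return {
        key: (None if value == PLACEHOLDER else value)
        for key, value in row.items()
    }


def missing_fields(row):
    """Names of fields that carry no data (placeholder string or null)."""
    return sorted(key for key, value in clean_row(row).items() if value is None)


# Made-up row; the field names are ones the description mentions.
row = {
    "fullVisitorId": "1234567890",
    "clientId": None,
    "geoNetwork": "Not available in demo dataset",
}
print(missing_fields(row))
```

Treating both cases as missing avoids counting the placeholder string as a genuine category when aggregating.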
Files for lectures on R for Public Health Data Science Research.
figshare.com
txt
Updated Nov 2, 2023
Juan Klopper (2023). Files for lectures on R for Public Health Data Science Research. [Dataset]. http://doi.org/10.6084/m9.figshare.24492109.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.24492109.v1
Dataset updated
Nov 2, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Juan Klopper
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data files for lecture material.
Data from: MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022...
data.mendeley.com
commons.datacite.org
Updated Jul 25, 2022
Nirmalya Thakur (2022). MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak [Dataset]. http://doi.org/10.17632/xmcg82mx9k.3
Explore at:
Unique identifier
https://doi.org/10.17632/xmcg82mx9k.3
Dataset updated
Jul 25, 2022
Authors
Nirmalya Thakur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2

Abstract: The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.

Data Description: The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files.
• Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022)
• Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022)
• Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022)
• Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022)
• Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022)
• Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)

The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
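Preparing the Tweet ID files for hydration might be sketched as follows. The file names and counts are those listed in the data description; the one-ID-per-line layout, the batch size of 100, and the helper names are assumptions, and the actual hydration call against a Twitter API client is deliberately left out.

```python
from pathlib import Path

# File names and Tweet ID counts as listed in the data description.
EXPECTED_COUNTS = {
    "TweetIDs_Part1.txt": 13926,
    "TweetIDs_Part2.txt": 17705,
    "TweetIDs_Part3.txt": 17585,
    "TweetIDs_Part4.txt": 19718,
    "TweetIDs_Part5.txt": 47718,
    "TweetIDs_Part6.txt": 138711,
}


def read_tweet_ids(path):
    """Read Tweet IDs, assuming one ID per line; blank lines are skipped."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def batches(ids, size=100):
    """Yield consecutive chunks of at most `size` IDs (assumed request limit)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Sanity check: the per-file counts add up to the stated total.
total_expected = sum(EXPECTED_COUNTS.values())
print(total_expected)
```

Each batch would then be passed to whichever hydration tool or API client is in use, subject to that service's current terms and rate limits.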
Data of the submitted article "Journal research data sharing policies: a...
zenodo.org
Updated May 26, 2021
(under review); (under review) (2021). Data of the submitted article "Journal research data sharing policies: a study of highly-cited journals in neuroscience, physics, and operations research" [Dataset]. http://doi.org/10.5281/zenodo.3268352
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3268352
Dataset updated
May 26, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
(under review); (under review)
Description
The journals’ author guidelines and/or editorial policies were examined on whether they take a stance with regard to the availability of the underlying data of the submitted article. The mere explicated possibility of providing supplementary material along with the submitted article was not considered as a research data policy in the present study. Furthermore, the present article excluded source codes or algorithms from the scope of the paper and thus policies related to them are not included in the analysis of the present article.

For the selection of journals within the field of neurosciences, Clarivate Analytics’ InCites Journal Citation Reports database was searched using the categories of neurosciences and neuroimaging. From the results, the 40 journals with the highest Impact Factor indicators (for the year 2017) were extracted for scrutiny of research data policies. Likewise, the selection of journals within the field of physics was created by performing a similar search with the categories of physics, applied; physics, atomic, molecular & chemical; physics, condensed matter; physics, fluids & plasmas; physics, mathematical; physics, multidisciplinary; physics, nuclear; and physics, particles & fields. From the results, the 40 journals with the highest Impact Factor indicators were again extracted for scrutiny. Similarly, the 40 journals representing the field of operations research were extracted by using the search category of operations research and management.

Journal-specific data policies were sought from journal-specific websites providing journal-specific author guidelines or editorial policies. Within the present study, the examination of journal data policies was done in May 2019. The primary data source was journal-specific author guidelines. If journal guidelines explicitly linked to the publisher’s general policy with regard to research data, these were used in the analyses of the present article. If a journal-specific research data policy, or the lack of one, was inconsistent with the publisher’s general policies, the journal-specific policies and guidelines were prioritized and used in the present article’s data. If a journal’s author guidelines were not openly available online due to, e.g., accepting submissions on an invite-only basis, the journal was not included in the data of the present article. Journals that exclusively publish review articles were also excluded and replaced with the journal having the next highest Impact Factor indicator, so that each set representing the three fields of science consisted of 40 journals. The final data thus consisted of 120 journals in total.

‘Public deposition’ refers to a scenario where a researcher deposits data in a public repository and thus gives the administrative role of the data to the receiving repository. ‘Scientific sharing’ refers to a scenario where a researcher administers his or her data locally and provides it to interested readers on request. Note that none of the journals examined in the present article required that all data types underlying a submitted work be deposited in a public data repository. However, some journals required public deposition of data of specific types. Within the journal research data policies examined in the present article, these data types are well represented by the Springer Nature policy on “Availability of data, materials, code and protocols” (Springer Nature, 2018), that is, DNA and RNA data; protein sequences and DNA and RNA sequencing data; genetic polymorphisms data; linked phenotype and genotype data; gene expression microarray data; proteomics data; macromolecular structures and crystallographic data for small molecules. Furthermore, the registration of clinical trials in a public repository was also considered as a data type in this study. The term ‘specific data types’ used in the custom coding framework of the present study thus refers to both life sciences data and public registration of clinical trials. These data types have community-endorsed public repositories where deposition was most often mandated within the journals’ research data policies.

The term ‘location’ refers to whether the journal’s data policy provides suggestions or requirements for the repositories or services used to share the underlying data of the submitted works. A mere general reference to ‘public repositories’ was not considered a location suggestion; only references to individual repositories and services were. The category of ‘immediate release of data’ examines whether the journal’s research data policy addresses the timing of publication of the underlying data of submitted works. Note that even though a journal may only encourage public deposition of the data, its editorial processes could be set up so that they lead to publication of either the research data or the research data metadata in conjunction with the publication of the submitted work.
phishrepo-dataset
data.mendeley.com
Updated Oct 5, 2021
Subhash Ariyadasa (2021). phishrepo-dataset [Dataset]. http://doi.org/10.17632/ttmmtsgbs8.1
Explore at:
Unique identifier
https://doi.org/10.17632/ttmmtsgbs8.1
Dataset updated
Oct 5, 2021
Authors
Subhash Ariyadasa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PhishRepo is implemented to fill the data gap in the anti-phishing domain, and it is still at an experimental level. PhishRepo collected the data available here during its testing stage, and the dataset includes verified phishing webpages. Therefore, it contains only a few data points. The provided dataset contains diverse information sources collected for the latest phishing pages. The diverse, feature-rich data in the dataset addresses a current need in the machine learning-based anti-phishing domain: overcoming inept learning models in phishing detection. The dataset can be used to analyse significant phishing features, experiment with different feature extraction techniques, and try out representation learning techniques such as deep learning on these raw data at a practical level. The dataset contains an index.csv file, which is the main file to use when mapping index content to the available folders. Generally, a folder should contain webpage.html, alexa.xml, response.csv, screenshot.png and fullview.png files and a src folder, which carries offline webpage resources. If something is missing at the folder level, this is indicated in the index.csv file.
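A folder-level completeness check based on the layout described above could be sketched like this. The expected file names and the src folder come from the description; the function `missing_items` is hypothetical, and how index.csv itself records missing items is not specified here, so the sketch inspects the folder directly.

```python
from pathlib import Path

# Files each page folder should generally contain, per the description.
EXPECTED_FILES = (
    "webpage.html",
    "alexa.xml",
    "response.csv",
    "screenshot.png",
    "fullview.png",
)


def missing_items(folder):
    """List expected files (and the src folder) absent from a page folder."""
    folder = Path(folder)
    missing = [name for name in EXPECTED_FILES if not (folder / name).is_file()]
    if not (folder / "src").is_dir():
        missing.append("src")
    return missing
```

Calling `missing_items` on each folder referenced by index.csv gives a quick way to cross-check what the index reports as missing.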
