100+ datasets found
  1. Data from: Data Science Problems Dataset

    • paperswithcode.com
    Updated Nov 17, 2022
    + more versions
    Cite
    Shubham Chandel; Colin B. Clement; Guillermo Serrato; Neel Sundaresan (2022). Data Science Problems Dataset [Dataset]. https://paperswithcode.com/dataset/data-science-problems
    Explore at:
    Dataset updated
    Nov 17, 2022
    Authors
    Shubham Chandel; Colin B. Clement; Guillermo Serrato; Neel Sundaresan
    Description

    Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of notebooks in this benchmark also include data dependencies, so this benchmark not only can test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant for more details about state of the art results and other properties of the dataset.

  2. Global Data Science Platform Market – Industry Trends and Forecast to 2030

    • databridgemarketresearch.com
    Updated May 2023
    Cite
    Data Bridge Market Research (2023). Global Data Science Platform Market – Industry Trends and Forecast to 2030 [Dataset]. https://www.databridgemarketresearch.com/reports/global-data-science-platform-market
    Explore at:
    Dataset updated
    May 2023
    Dataset authored and provided by
    Data Bridge Market Research
    License

    https://www.databridgemarketresearch.com/privacy-policy

    Time period covered
    2023 - 2030
    Area covered
    Global
    Description

    Report Metric: Details
    Forecast Period: 2023 to 2030
    Base Year: 2022
    Historic Years: 2021 (Customizable to 2015-2020)
    Quantitative Units: Revenue in USD Billion, Volumes in Units, Pricing in USD
    Segments Covered: Component Type (Platform, Services), Function Division (Marketing, Sales, Logistics, Finance and Accounting, Customer Support, Business Operations, Others), Deployment Model (On-Premises, Cloud based), Organization Size (Small and Medium-sized Enterprises (SMEs), Large Enterprises), End User Application (Banking, Financial Services, and Insurance (BFSI), Telecom and IT, Retail and E-commerce, Healthcare and Life sciences, Manufacturing, Energy and Utilities, Media and Entertainment, Transportation and Logistics, Government, Others)
    Countries Covered: U.S., Canada and Mexico in North America; Germany, France, U.K., Netherlands, Switzerland, Belgium, Russia, Italy, Spain, Turkey, Rest of Europe in Europe; China, Japan, India, South Korea, Singapore, Malaysia, Australia, Thailand, Indonesia, Philippines, Rest of Asia-Pacific (APAC) in Asia-Pacific (APAC); Saudi Arabia, U.A.E, South Africa, Egypt, Israel, Rest of Middle East and Africa (MEA) in Middle East and Africa (MEA); Brazil, Argentina and Rest of South America in South America
    Market Players Covered: IBM (U.S.), DataRobot Inc. (U.S.), apheris AI GmbH (Germany), The Digital Talent Ecosystem (U.S.), Databand (Israel), dotData (U.S.), Explorium Inc. (U.S.), Noogata (Israel), Tecton Inc. (U.S.), Spell Designs Pty Ltd (U.S.), Arrikto Inc. (U.S.), Iterative (U.S.), Google Inc (U.S.), Microsoft (U.S.), SAS Institute Inc. (U.S.), Amazon Web Services, Inc. (U.S.), The MathWorks, Inc. (U.S.), Cloudera Inc. (U.S.), Teradata (U.S.), TIBCO Software Inc. (U.S.), ALTERYX, INC. (U.S.), RapidMiner (U.S.), Databricks (U.S.), Snowflake Inc. (U.S.), H2O.ai (U.S.), Altair Inc. (U.S.), Anaconda Inc. (U.S.), SAP SE (U.S.), Domino Data Lab Inc. (U.S.) and Dataiku (U.S.)

    Market Opportunities

    • Rapid advancements in technologies such as artificial intelligence (AI), machine learning (ML), and internet of things (IoT)
    • Increasing investment in research and development
  3. Most used technologies in the data science tech stack worldwide 2023

    • statista.com
    • teosuisse.net
    • +3more
    Updated Mar 22, 2024
    + more versions
    Cite
    Statista (2024). Most used technologies in the data science tech stack worldwide 2023 [Dataset]. https://www.statista.com/statistics/1292394/popular-technologies-in-the-data-science-tech-stack/
    Explore at:
    Dataset updated
    Mar 22, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Dec 1, 2022 - Dec 1, 2023
    Area covered
    Worldwide
    Description

    A tech stack represents a combination of technologies a company uses in order to build and run an application or project. The most popular technology skill in the data science tech stack in 2023 was Python 3.x, chosen by 65 percent of respondents. PySpark ranked second, being preferred by 13 percent of respondents.

  4. Data from: Statistical foundations of data science

    • workwithdata.com
    Updated May 27, 2024
    Cite
    Work With Data (2024). Statistical foundations of data science [Dataset]. https://www.workwithdata.com/object/statistical-foundations-data-science-book-by-jianqing-fan-0000
    Explore at:
    Dataset updated
    May 27, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistical foundations of data science is a book. Explore Statistical foundations of data science through unique data from The British Library.

  5. Number of open data science jobs India 2019-2022, by company type

    • statista.com
    Updated Mar 13, 2024
    Cite
    Statista (2024). Number of open data science jobs India 2019-2022, by company type [Dataset]. https://www.statista.com/statistics/1320198/india-number-of-available-data-science-jobs-by-company-type/
    Explore at:
    Dataset updated
    Mar 13, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    India
    Description

    In 2022, over 139 thousand data science job positions were available at multinational-corporation IT and KPO service provider companies in India. The availability of data science jobs has increased over the years since 2019.

  6. data-science-job-salaries

    • huggingface.co
    Updated Aug 8, 2023
    Cite
    Omar Espejel (2023). data-science-job-salaries [Dataset]. https://huggingface.co/datasets/espejelomar/data-science-job-salaries
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 8, 2023
    Authors
    Omar Espejel
    Description

    espejelomar/data-science-job-salaries dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Lisbon, Portugal, hotel’s customer dataset with three years of personal,...

    • data.mendeley.com
    • b2find.dkrz.de
    Updated Nov 18, 2020
    Cite
    Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
    Explore at:
    Dataset updated
    Nov 18, 2020
    Authors
    Nuno Antonio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Portugal, Lisbon
    Description

    Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It covers three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset helps address the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical problems in the scope of data science. Due to its unit of analysis, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
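
    As a rough illustration of the RFM idea mentioned above, the sketch below computes recency (days since the last purchase), frequency (number of purchases), and monetary value (total spend) per customer. The transaction rows, the tuple layout, and the function name `rfm_table` are all hypothetical; they are not taken from the hotel dataset, whose actual columns are not listed here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
# These rows are made up for illustration; they are NOT from the hotel dataset.
transactions = [
    ("c1", date(2018, 1, 10), 120.0),
    ("c1", date(2018, 6, 2), 80.0),
    ("c2", date(2017, 3, 15), 300.0),
    ("c3", date(2018, 12, 1), 50.0),
    ("c3", date(2018, 12, 20), 70.0),
    ("c3", date(2018, 12, 28), 60.0),
]


def rfm_table(rows, reference_date):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in rows:
        # Track the most recent purchase date per customer.
        if customer not in last_purchase or day > last_purchase[customer]:
            last_purchase[customer] = day
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        c: ((reference_date - last_purchase[c]).days, frequency[c], monetary[c])
        for c in last_purchase
    }


rfm = rfm_table(transactions, reference_date=date(2018, 12, 31))
```

    A segmentation model would typically bin or cluster these three values; that step is left out here because the choice of bins depends on the data.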

  8. Data from: BEDE - Biological and Environmental Data Education Network:...

    • qubeshub.org
    Updated May 11, 2023
    Cite
    Matthew Aiello-Lammens; Sarah Supp; Erika Crispo; Kelly O'Donnell; Nate Emery (2023). BEDE - Biological and Environmental Data Education Network: Preparing Instructors to Integrate Data Science into Undergraduate Biology and Environmental Science Curricula (RCN-UBE Introduction) [Dataset]. http://doi.org/10.25334/1T2P-NK24
    Explore at:
    Dataset updated
    May 11, 2023
    Dataset provided by
    QUBES
    Authors
    Matthew Aiello-Lammens; Sarah Supp; Erika Crispo; Kelly O'Donnell; Nate Emery
    Description

    The Biological and Environmental Data Education Network (BEDE Network) develops and shares teacher-training workshops, curricular designs, teaching modules, and best practices to help integrate computational data science skills into all levels of the biological and environmental sciences curriculum.

  9. Quranic Data Science

    • osf.io
    Updated Oct 3, 2018
    Cite
    Ahmed Aly Yahia (2018). Quranic Data Science [Dataset]. http://doi.org/10.17605/OSF.IO/7BAEG
    Explore at:
    Dataset updated
    Oct 3, 2018
    Dataset provided by
    Center for Open Science (https://cos.io/)
    Authors
    Ahmed Aly Yahia
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Quran Inspired us best optimization algorithms within our universe

  10. Data Science and Machine-Learning Platforms Market Research Report 2023-2032...

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 8, 2023
    Cite
    Dataintelo (2023). Data Science and Machine-Learning Platforms Market Research Report 2023-2032 [Dataset]. https://dataintelo.com/report/data-science-and-machine-learning-platforms-market-report
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 8, 2023
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Dataintelo published a new report titled “Data Science and Machine-Learning Platforms Market Research Report”, which is segmented by Type (Open Source Data Integration Tools, Cloud-based Data Integration Tools, Hybrid Data Integration Tools), by Application (Small-Sized Enterprises, Medium-Sized Enterprises, Large Enterprises), by Industry (Healthcare, Finance, Retail, Manufacturing, IT & Telecommunication, Government, Energy & Utilities, Transportation), by Deployment (On-Premise, Cloud), and by Players/Companies (SAS, Alteryx, IBM, RapidMiner, KNIME, Microsoft, Dataiku, Databricks, TIBCO Software, MathWorks, H2O.ai, Anaconda, SAP, Google, Domino Data Lab, Angoss, Lexalytics, Rapid Insight). As per the study, the market is expected to grow at a CAGR of XX% in the forecast period.

    Report Scope

    Report Attributes: Report Details
    Report Title: Data Science and Machine-Learning Platforms Market Research Report
    By Type: Open Source Data Integration Tools, Cloud-based Data Integration Tools, Hybrid Data Integration Tools
    By Application: Small-Sized Enterprises, Medium-Sized Enterprises, Large Enterprises
    By Industry: Healthcare, Finance, Retail, Manufacturing, IT & Telecommunication, Government, Energy & Utilities, Transportation
    By Deployment: On-Premise, Cloud
    By Companies: SAS, Alteryx, IBM, RapidMiner, KNIME, Microsoft, Dataiku, Databricks, TIBCO Software, MathWorks, H2O.ai, Anaconda, SAP, Google, Domino Data Lab, Angoss, Lexalytics, Rapid Insight
    Regions Covered: North America, Europe, APAC, Latin America, MEA
    Base Year: 2023
    Historical Year: 2017 to 2022 (Data from 2010 can be provided as per availability)
    Forecast Year: 2032
    Number of Pages: 127
    Number of Tables & Figures: 236
    Customization Available: Yes, the report can be customized as per your need.

  11. DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS

    • data.mendeley.com
    • narcis.nl
    Updated Mar 12, 2019
    + more versions
    Cite
    Fabian Constante (2019). DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS [Dataset]. http://doi.org/10.17632/8gx2fvg2k6.3
    Explore at:
    Dataset updated
    Mar 12, 2019
    Authors
    Fabian Constante
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of supply chains used by the company DataCo Global was used for the analysis. The supply chain dataset allows the use of machine learning algorithms and R software. Areas of important registered activities: Provisioning, Production, Sales, Commercial Distribution. It also allows the correlation of structured data with unstructured data for knowledge generation.

    Types of data: Structured data: DataCoSupplyChainDataset.csv. Unstructured data: tokenized_access_logs.csv (clickstream).

    Types of products: Clothing, Sports, and Electronic Supplies

    Additionally, a separate file called DescriptionDataCoSupplyChain.csv describes each of the variables in DataCoSupplyChainDatasetc.csv.

  12. Indeed Dataset - Data Scientist/Analyst/Engineer)

    • kaggle.com
    zip
    Updated Nov 2, 2018
    Cite
    Elroy (2018). Indeed Dataset - Data Scientist/Analyst/Engineer) [Dataset]. https://www.kaggle.com/elroyggj/indeed-dataset-data-scientistanalystengineer
    Explore at:
    Available download formats: zip (5298676 bytes)
    Dataset updated
    Nov 2, 2018
    Authors
    Elroy
    Description

    Dataset

    This dataset was created by Elroy

    Contents

  13. Product Sales Data

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    K S ABISHEK (2023). Product Sales Data [Dataset]. http://doi.org/10.34740/kaggle/dsv/4980479
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    K S ABISHEK
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Greetings, fellow analyst!

    REC corp LTD. is a small-scale business venture established in India. They have been selling FOUR PRODUCTS for OVER TEN YEARS. The products are P1, P2, P3 and P4.

    They have collected data from their retail centers and organized it into a small csv file, which has been given to you. The file contains 8 numerical parameters:

    Q1- Total unit sales of product 1
    Q2- Total unit sales of product 2
    Q3- Total unit sales of product 3
    Q4- Total unit sales of product 4

    S1- Total revenue from product 1
    S2- Total revenue from product 2
    S3- Total revenue from product 3
    S4- Total revenue from product 4

    Example:
    On 13-06-2010, product 1 was bought by 5422 people and INR 17187.74 was generated in revenue from product 1.

    Now, REC corp needs you to solve the following questions:

    1) Is there any trend in the sales of all four products during certain months?
    2) Out of all four products, which product has seen the highest sales in all the given years?
    3) The company has all its retail centers closed on the 31st of December every year. Mr. Hariharan, the CEO, would love to get an estimate of the number of units of each product that could be sold on the 31st of December every year if all their retail centers were kept open.
    4) The CEO is considering an idea to drop the production of any one of the products. He wants you to analyze this data and suggest whether his idea would result in a massive setback for the company.
    5) The CEO would also like to predict the sales and revenues for the year 2024. He wants you to give a yearly estimate with the best possible accuracy.

    Can you help REC corp with your analytical and data science skills?

    NOTE: This is a hypothetical dataset generated using Python for educational purposes. It bears no resemblance to any real firm. Any similarity is a matter of coincidence.
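
    One possible starting point for questions 1 and 2 is to total unit sales per calendar month and per product. The sketch below uses only the Python standard library. The sample rows are invented, and the `Date` column name and the DD-MM-YYYY format are assumptions inferred from the example date in the description; the real file may use different headers.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Made-up rows in the layout the description suggests. The "Date" header and
# the DD-MM-YYYY format are assumptions, not confirmed by the dataset page.
SAMPLE_CSV = """Date,Q1,Q2,Q3,Q4
13-06-2010,5422,3725,576,907
14-06-2010,7047,779,3578,1574
01-07-2010,1572,2082,595,1145
13-06-2011,4000,1000,2000,500
"""

UNIT_COLUMNS = ("Q1", "Q2", "Q3", "Q4")


def monthly_unit_totals(csv_text, columns=UNIT_COLUMNS):
    """Sum unit sales per calendar month (1-12), pooled across all years."""
    totals = defaultdict(lambda: dict.fromkeys(columns, 0))
    for row in csv.DictReader(io.StringIO(csv_text)):
        month = datetime.strptime(row["Date"], "%d-%m-%Y").month
        for col in columns:
            totals[month][col] += int(row[col])
    return {month: dict(values) for month, values in totals.items()}


totals = monthly_unit_totals(SAMPLE_CSV)

# Product column with the highest unit sales over the whole sample.
best_product = max(UNIT_COLUMNS, key=lambda c: sum(m[c] for m in totals.values()))
```

    Pooling months across years hides year-over-year growth, so a real analysis would likely keep the year as a second grouping key.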

  14. "Python for Data Science" (AY250; UC Berkeley) Data files

    • zenodo.org
    application/gzip, bin
    Updated Jan 22, 2022
    Cite
    Joshua; Joshua (2022). "Python for Data Science" (AY250; UC Berkeley) Data files [Dataset]. http://doi.org/10.5281/zenodo.5889322
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    Jan 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joshua; Joshua
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Berkeley
    Description

    Data files for "Python for Data Science" (AY250; UC Berkeley)

    Course website: https://github.com/profjsb/python-seminar

  15. Data Science Publication (1983-2019)

    • data.mendeley.com
    • commons.datacite.org
    Updated Apr 13, 2020
    Cite
    Agung Purnomo (2020). Data Science Publication (1983-2019) [Dataset]. http://doi.org/10.17632/4c3mpmwk74.1
    Explore at:
    Dataset updated
    Apr 13, 2020
    Authors
    Agung Purnomo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data science research and publication dataset covers publications indexed by Scopus from 1983 to 2019. The dataset contains authors, Scopus author IDs, title, year, source title, volume, issue, article number in Scopus, DOI, link, affiliation, abstract, index keywords, references, correspondence address, editors, publisher, conference name, conference date, conference code, ISSN, language, document type, access type, and EID.

  16. Google Analytics Sample

    • console.cloud.google.com
    Updated Jul 15, 2017
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Obfuscated%20Google%20Analytics%20360%20data (2017). Google Analytics Sample [Dataset]. https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data
    Explore at:
    Dataset updated
    Jul 15, 2017
    Dataset provided by
    Google (http://google.com/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way to analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data.

    The data is typical of what an ecommerce website would see and includes the following information:
    • Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display traffic
    • Content data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc.
    • Transactional data: information about the transactions on the Google Merchandise Store website.

    Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated, such as fullVisitorId, or removed, such as clientId, adWordsClickInfo and geoNetwork. “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying the fields containing no data.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.

  17. Files for lectures on R for Public Health Data Science Research.

    • figshare.com
    txt
    Updated Nov 2, 2023
    Cite
    Juan Klopper (2023). Files for lectures on R for Public Health Data Science Research. [Dataset]. http://doi.org/10.6084/m9.figshare.24492109.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 2, 2023
    Dataset provided by
    figshare
    Figshare (http://figshare.com/)
    Authors
    Juan Klopper
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data files for lecture material.

  18. Data from: MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022...

    • data.mendeley.com
    • commons.datacite.org
    Updated Jul 25, 2022
    + more versions
    Cite
    Nirmalya Thakur (2022). MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak [Dataset]. http://doi.org/10.17632/xmcg82mx9k.3
    Explore at:
    Dataset updated
    Jul 25, 2022
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2

    Abstract: The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that have been posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.

    Data Description: The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files.
    • Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022)
    • Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022)
    • Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022)
    • Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022)
    • Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022)
    • Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)

    The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
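
    A quick sanity check on the file listing above: the six per-file counts should add up to the reported total of 255,363. The sketch below verifies that arithmetic and includes a small helper for reading IDs from text lines. The helper name `read_tweet_ids` and the one-ID-per-line layout are assumptions; the description does not state the exact file format, and hydrating the IDs through the Twitter API is not shown.

```python
# Per-file counts copied from the data description above.
part_counts = {
    "TweetIDs_Part1.txt": 13926,
    "TweetIDs_Part2.txt": 17705,
    "TweetIDs_Part3.txt": 17585,
    "TweetIDs_Part4.txt": 19718,
    "TweetIDs_Part5.txt": 47718,
    "TweetIDs_Part6.txt": 138711,
}

REPORTED_TOTAL = 255_363  # total stated in the description

total = sum(part_counts.values())
counts_match = total == REPORTED_TOTAL


def read_tweet_ids(lines):
    """Collect numeric IDs from an iterable of text lines.

    Assumes one Tweet ID per line (hypothetical layout); blank or
    non-numeric lines are skipped.
    """
    return [line.strip() for line in lines if line.strip().isdigit()]
```

    With a downloaded file, `read_tweet_ids(open("TweetIDs_Part1.txt"))` would return the IDs as strings, which avoids precision issues in tools that treat large integers as floats.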

  19. Data of the submitted article "Journal research data sharing policies: a...

    • zenodo.org
    Updated May 26, 2021
    + more versions
    Cite
    (under review); (under review) (2021). Data of the submitted article "Journal research data sharing policies: a study of highly-cited journals in neuroscience, physics, and operations research" [Dataset]. http://doi.org/10.5281/zenodo.3268352
    Explore at:
    Dataset updated
    May 26, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    (under review); (under review)
    Description

    The journals’ author guidelines and/or editorial policies were examined on whether they take a stance with regard to the availability of the underlying data of the submitted article. The mere explicated possibility of providing supplementary material along with the submitted article was not considered as a research data policy in the present study. Furthermore, the present article excluded source codes or algorithms from the scope of the paper and thus policies related to them are not included in the analysis of the present article.

    For the selection of journals within the field of neurosciences, Clarivate Analytics’ InCites Journal Citation Reports database was searched using the categories of neurosciences and neuroimaging. From the results, the journals with the 40 highest Impact Factor indicators (for the year 2017) were extracted for scrutiny of research data policies. Similarly, the selection of journals within the field of physics was created by performing a similar search with the categories of physics, applied; physics, atomic, molecular & chemical; physics, condensed matter; physics, fluids & plasmas; physics, mathematical; physics, multidisciplinary; physics, nuclear; and physics, particles & fields. From the results, the journals with the 40 highest Impact Factor indicators were again extracted for scrutiny. Likewise, the 40 journals representing the field of operations research were extracted by using the search category of operations research and management.

    Journal-specific data policies were sought from journal-specific websites providing journal-specific author guidelines or editorial policies. Within the present study, the examination of journal data policies was done in May 2019. The primary data source was journal-specific author guidelines. If journal guidelines explicitly linked to the publisher’s general policy with regard to research data, these were used in the analyses of the present article. If a journal-specific research data policy, or the lack of one, was inconsistent with the publisher’s general policies, the journal-specific policies and guidelines were prioritized and used in the present article’s data. If a journal’s author guidelines were not openly available online due to, e.g., accepting submissions on an invite-only basis, the journal was not included in the data of the present article. Journals that exclusively publish review articles were also excluded and replaced with the journal having the next highest Impact Factor indicator, so that each set representing the three fields of science consisted of 40 journals. The final data thus consisted of 120 journals in total.

    ‘Public deposition’ refers to a scenario where a researcher deposits data to a public repository and thus gives the administrative role of the data to the receiving repository. ‘Scientific sharing’ refers to a scenario where a researcher administers his or her data locally and provides it to interested readers on request. Note that none of the journals examined in the present article required that all data types underlying a submitted work be deposited into a public data repository. However, some journals required public deposition of data of specific types. Within the journal research data policies examined in the present article, these data types are well represented by the Springer Nature policy on “Availability of data, materials, code and protocols” (Springer Nature, 2018), that is, DNA and RNA data; protein sequences and DNA and RNA sequencing data; genetic polymorphisms data; linked phenotype and genotype data; gene expression microarray data; proteomics data; macromolecular structures and crystallographic data for small molecules. Furthermore, the registration of clinical trials in a public repository was also considered as a data type in this study. The term ‘specific data types’ used in the custom coding framework of the present study thus refers to both life sciences data and public registration of clinical trials. These data types have community-endorsed public repositories where deposition was most often mandated within the journals’ research data policies.

    The term ‘location’ refers to whether the journal’s data policy provides suggestions or requirements for the repositories or services used to share the underlying data of the submitted works. A mere general reference to ‘public repositories’ was not considered a location suggestion; only references to individual repositories and services were. The category of ‘immediate release of data’ examines whether the journal’s research data policy addresses the timing of publication of the underlying data of submitted works. Note that even though the journals may only encourage public deposition of the data, the editorial processes could be set up so that they lead to publication of either the research data or the research data metadata in conjunction with the publishing of the submitted work.

  20. phishrepo-dataset

    • data.mendeley.com
    Updated Oct 5, 2021
    + more versions
    Cite
    Subhash Ariyadasa (2021). phishrepo-dataset [Dataset]. http://doi.org/10.17632/ttmmtsgbs8.1
    Explore at:
    Dataset updated
    Oct 5, 2021
    Authors
    Subhash Ariyadasa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PhishRepo is implemented to fill the data gap in the anti-phishing domain, and it is still at an experimental level. PhishRepo collected the data available here during its testing stage, and the dataset includes verified phishing webpages. Therefore, it contains only a few data points. The provided dataset contains diverse information sources collected from the latest phishing pages. The diverse, feature-rich data present in the dataset is a current need in the machine learning-based anti-phishing domain to overcome inept learning models in phishing detection. The dataset can be used to analyse significant phishing features, experiment with different feature extraction techniques, and try out representation learning techniques such as deep learning on these raw data at a practical level. The dataset contains an index.csv file, which is the main file and should be used when mapping index file content to the available folders. Generally, a folder should contain webpage.html, alexa.xml, response.csv, screenshot.png and fullview.png files and a src folder, which carries offline webpage resources. If something is missing at the folder level, this is indicated in the index.csv file.
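
    The expected folder layout described above can be checked with a few lines of standard-library Python. This is only a sketch: the file and folder names come from the description, while the function name `missing_entries` and the temporary demo folder are hypothetical, and the columns of index.csv are not documented here, so the index file is not parsed.

```python
import tempfile
from pathlib import Path

# File and folder names taken from the dataset description above.
EXPECTED_FILES = (
    "webpage.html",
    "alexa.xml",
    "response.csv",
    "screenshot.png",
    "fullview.png",
)
EXPECTED_DIRS = ("src",)


def missing_entries(folder):
    """Return the expected files/folders that are absent from `folder`."""
    folder = Path(folder)
    missing = [name for name in EXPECTED_FILES if not (folder / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (folder / name).is_dir()]
    return missing


# Demo on a throwaway folder that only has webpage.html and src/.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_page"
    (sample / "src").mkdir(parents=True)
    (sample / "webpage.html").write_text("<html></html>")
    demo_missing = missing_entries(sample)
```

    Comparing the result of such a check against what index.csv reports as missing would be a natural next step once the index columns are known.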
