100+ datasets found
  1. Top 2500 Kaggle Datasets

    • kaggle.com
    Updated Feb 16, 2024
    Cite
    Saket Kumar (2024). Top 2500 Kaggle Datasets [Dataset]. http://doi.org/10.34740/kaggle/dsv/7637365
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Saket Kumar
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset compiles the top 2500 datasets from Kaggle, encompassing a diverse range of topics and contributors. It provides insights into dataset creation, usability, popularity, and more, offering valuable information for researchers, analysts, and data enthusiasts.

    Research Analysis: Researchers can utilize this dataset to analyze trends in dataset creation, popularity, and usability scores across various categories.

    Contributor Insights: Kaggle contributors can explore the dataset to gain insights into factors influencing the success and engagement of their datasets, aiding in optimizing future submissions.

    Machine Learning Training: Data scientists and machine learning enthusiasts can use this dataset to train models for predicting dataset popularity or usability based on features such as creator, category, and file types.

    Market Analysis: Analysts can leverage the dataset to conduct market analysis, identifying emerging trends and popular topics within the data science community on Kaggle.

    Educational Purposes: Educators and students can use this dataset to teach and learn about data analysis, visualization, and interpretation within the context of real-world datasets and community-driven platforms like Kaggle.

    Column Definitions:

    • Dataset Name: Name of the dataset.
    • Created By: Creator(s) of the dataset.
    • Last Updated in number of days: Time elapsed since the last update.
    • Usability Score: Score indicating the ease of use.
    • Number of File: Quantity of files included.
    • Type of file: Format of the files (e.g., CSV, JSON).
    • Size: Size of the dataset.
    • Total Votes: Number of votes received.
    • Category: Categorization of the dataset's subject matter.
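A minimal sketch of the Machine Learning Training use case above: predicting Total Votes from the numeric columns with scikit-learn. The tiny inline sample (and any file name) is an illustrative assumption, not part of the dataset; in practice you would `read_csv` the downloaded file.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# In practice: df = pd.read_csv("top_2500_kaggle_datasets.csv")
# Tiny made-up sample using the column names defined above:
df = pd.DataFrame({
    "Usability Score": [10.0, 8.8, 7.5, 9.4],
    "Number of File": [1, 3, 12, 2],
    "Total Votes": [450, 120, 60, 300],
})
X = df.drop(columns="Total Votes")
y = df["Total Votes"]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
predictions = model.predict(X)
```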

  2. Political Analysis Using R: Example Code and Data, Plus Data for Practice...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Apr 28, 2020
    Cite
    Jamie Monogan (2020). Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems [Dataset]. http://doi.org/10.7910/DVN/ARKOTI
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Jamie Monogan
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.

  3. Powerful Data for Power BI

    • kaggle.com
    zip
    Updated Aug 28, 2023
    Cite
    Shiv_D24Coder (2023). Powerful Data for Power BI [Dataset]. https://www.kaggle.com/datasets/shivd24coder/powerful-data-for-power-bi
    Explore at:
    Available download formats: zip (907404 bytes)
    Dataset updated
    Aug 28, 2023
    Authors
    Shiv_D24Coder
    Description

    Explore the world of data visualization with this Power BI dataset containing HR Analytics and Sales Analytics datasets. Gain insights, create impactful reports, and craft engaging dashboards using real-world data from HR and sales domains. Sharpen your Power BI skills and uncover valuable data-driven insights with this powerful dataset. Happy analyzing!

  4. All Seaborn Built-in Datasets 📊✨

    • kaggle.com
    zip
    Updated Aug 27, 2024
    Cite
    Abdelrahman Mohamed (2024). All Seaborn Built-in Datasets 📊✨ [Dataset]. https://www.kaggle.com/datasets/abdoomoh/all-seaborn-built-in-datasets
    Explore at:
    Available download formats: zip (1383218 bytes)
    Dataset updated
    Aug 27, 2024
    Authors
    Abdelrahman Mohamed
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset includes all 22 built-in datasets from the Seaborn library, a widely used Python data visualization tool. Seaborn's built-in datasets are essential resources for anyone interested in practicing data analysis, visualization, and machine learning. They span a wide range of topics, from classic datasets like the Iris flower classification to real-world data such as Titanic survival records and diamond characteristics.

    • Included Datasets:
      • Anagrams: Analysis of word anagram patterns.
      • Anscombe: Anscombe's quartet demonstrating the importance of data visualization.
      • Attention: Data on attention span variations in different scenarios.
      • Brain Networks: Connectivity data within brain networks.
      • Car Crashes: US car crash statistics.
      • Diamonds: Data on diamond properties including price, cut, and clarity.
      • Dots: Randomly generated data for scatter plot visualization.
      • Dow Jones: Historical records of the Dow Jones Industrial Average.
      • Exercise: The relationship between exercise and health metrics.
      • Flights: Monthly passenger numbers on flights.
      • FMRI: Functional MRI data capturing brain activity.
      • Geyser: Eruption times of the Old Faithful geyser.
      • Glue: Strength of glue under different conditions.
      • Health Expenditure: Health expenditure statistics across countries.
      • Iris: Famous dataset for classifying Iris species.
      • MPG: Miles per gallon for various vehicles.
      • Penguins: Data on penguin species and their features.
      • Planets: Characteristics of discovered exoplanets.
      • Sea Ice: Measurements of sea ice extent.
      • Taxis: Taxi trips data in a city.
      • Tips: Tipping data collected from a restaurant.
      • Titanic: Survival data from the Titanic disaster.

    This complete collection serves as an excellent starting point for anyone looking to improve their data science skills, offering a wide array of datasets suitable for both beginners and advanced users.
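The datasets above load by name with seaborn's `load_dataset`. The names below follow seaborn's own naming, which differs slightly from the display titles (e.g. `healthexp` for Health Expenditure); fetching requires network access to the seaborn-data GitHub repository, so the call is guarded here.

```python
# seaborn names for the 22 built-in datasets listed above
names = ["anagrams", "anscombe", "attention", "brain_networks", "car_crashes",
         "diamonds", "dots", "dowjones", "exercise", "flights", "fmri",
         "geyser", "glue", "healthexp", "iris", "mpg", "penguins", "planets",
         "seaice", "taxis", "tips", "titanic"]

try:
    import seaborn as sns
    tips = sns.load_dataset("tips")   # returns a pandas DataFrame
    print(tips.head())
except Exception:
    pass  # seaborn not installed, or no network access to seaborn-data
```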

  5. Advance, IN Annual Population and Growth Analysis Dataset: A Comprehensive...

    • neilsberg.com
    csv, json
    Updated Jul 30, 2024
    + more versions
    Cite
    Neilsberg Research (2024). Advance, IN Annual Population and Growth Analysis Dataset: A Comprehensive Overview of Population Changes and Yearly Growth Rates in Advance from 2000 to 2023 // 2024 Edition [Dataset]. https://www.neilsberg.com/insights/advance-in-population-by-year/
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Jul 30, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    IN, Advance
    Variables measured
    Annual Population Growth Rate, Population Between 2000 and 2023, Annual Population Growth Rate Percent
    Measurement technique
    The data presented in this dataset is derived from 20 years of data from the U.S. Census Bureau Population Estimates Program (PEP), 2000 to 2023. To measure the variables, namely (a) population and (b) population change (in absolute terms and as a percentage), we initially analyzed and tabulated the data for each of the years between 2000 and 2023. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Advance over the last 20-plus years. It lists the population for each year, along with the year-on-year change in population and the change in percentage terms for each year. The dataset can be utilized to understand the population change of Advance across the last two decades. For example, using this dataset, we can identify whether the population is declining or increasing; if there is a change, when the population peaked; and whether it is still growing and has not yet reached its peak. We can also compare the trend with the overall trend of the United States population over the same period.

    Key observations

    In 2023, the population of Advance was 505, a 0.40% year-over-year increase from 2022. Previously, in 2022, the population of Advance was 503, a decline of 0.59% compared to a population of 506 in 2021. Over the last 20-plus years, between 2000 and 2023, the population of Advance decreased by 54. In this period, the peak population was 598, in the year 2009. The numbers suggest that the population has already peaked and is showing a trend of decline. Source: U.S. Census Bureau Population Estimates Program (PEP).

    Content

    When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

    Data Coverage:

    • From 2000 to 2023

    Variables / Data Columns

    • Year: This column displays the data year (Measured annually and for years 2000 to 2023)
    • Population: The population of Advance for the specific year is shown in this column.
    • Year on Year Change: This column displays the change in Advance population for each year compared to the previous year.
    • Change in Percent: This column displays the year-on-year change as a percentage. Please note that the values are rounded and may not be exact.
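The two derived columns can be reproduced from the Population column alone. A minimal pandas sketch using the figures from the Key observations above (pandas usage is an illustration, not Neilsberg's tooling):

```python
import pandas as pd

# Population values for 2021-2023 taken from the Key observations above.
df = pd.DataFrame({"Year": [2021, 2022, 2023],
                   "Population": [506, 503, 505]})
df["Year on Year Change"] = df["Population"].diff()
df["Change in Percent"] = (df["Population"].pct_change() * 100).round(2)
print(df)
```

This reproduces the -0.59% (2022) and 0.40% (2023) figures quoted above.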

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are therefore subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Advance Population by Year. You can refer to the same here

  6. Optimization and Evaluation Datasets for PiMine

    • fdr.uni-hamburg.de
    md, zip
    Updated Jan 22, 2024
    + more versions
    Cite
    Graef, Joel; Ehrt, Christiane; Reim, Thorben; Rarey, Matthias (2024). Optimization and Evaluation Datasets for PiMine [Dataset]. http://doi.org/10.25592/uhhfdm.13972
    Explore at:
    Available download formats: md, zip
    Dataset updated
    Jan 22, 2024
    Dataset provided by
    ZBH Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany
    Authors
    Graef, Joel; Ehrt, Christiane; Reim, Thorben; Rarey, Matthias
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions.[1]

    The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities.

    In addition, we added the results of the case studies analyzed in [1] to enable readers to follow the discussion and investigate the results individually.

    Data Set description:

    The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine.

    The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated.

    The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes of which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of the performance when the interfaces of apparently unrelated chains are available only. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains which can be used for alignment performance assessments.

    Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and represents also an interesting dataset to screen for interface similarities.

    References:

    [1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted)
    [2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012.
    [3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265.
    [4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055.

  7. Data from: Clinical Dataset

    • kaggle.com
    zip
    Updated Oct 5, 2023
    + more versions
    Cite
    Mohamadreza Momeni (2023). Clinical Dataset [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/clinical-dataset
    Explore at:
    Available download formats: zip (16220 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    Mohamadreza Momeni
    Description

    This is the purest type of electronic clinical data, obtained at the point of care at a medical facility, hospital, clinic, or practice. Often referred to as the electronic medical record (EMR), it is generally not available to outside researchers. The data collected include administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.

    Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories by eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.

    About Dataset:

    333 scholarly articles cite this dataset.

    Unique identifier: DOI

    Dataset updated: 2023

    Authors: Haoyang Mi

    This dataset contains two tables:

    1. Clinical Data_Discovery_Cohort. Columns: Patient ID, Specimen date, Dead or Alive, Date of Death, Date of last Follow, Sex, Race, Stage, Event, Time.

    2. Clinical_Data_Validation_Cohort. Columns: Patient ID, Survival time (days), Event, Tumor size, Grade, Stage, Age, Sex, Cigarette Pack per year, Type, Adjuvant, Batch, EGFR, KRAS.

    Feel free to put your thoughts and analysis in a notebook for these datasets; you can create some interesting and valuable ML projects with them. Thanks for your attention.

  8. 30 years of synoptic observations from Neumayer Station with links to...

    • service.tib.eu
    Updated Nov 29, 2024
    + more versions
    Cite
    (2024). 30 years of synoptic observations from Neumayer Station with links to datasets - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/png-doi-10-1594-pangaea-150017
    Explore at:
    Dataset updated
    Nov 29, 2024
    Description

    The analysis of research data plays a key role in data-driven areas of science. Varieties of mixed research data sets exist and scientists aim to derive or validate hypotheses to find undiscovered knowledge. Many analysis techniques identify relations of an entire dataset only. This may level the characteristic behavior of different subgroups in the data. Like automatic subspace clustering, we aim at identifying interesting subgroups and attribute sets. We present a visual-interactive system that supports scientists to explore interesting relations between aggregated bins of multivariate attributes in mixed data sets. The abstraction of data to bins enables the application of statistical dependency tests as the measure of interestingness. An overview matrix view shows all attributes, ranked with respect to the interestingness of bins. Complementary, a node-link view reveals multivariate bin relations by positioning dependent bins close to each other. The system supports information drill-down based on both expert knowledge and algorithmic support. Finally, visual-interactive subset clustering assigns multivariate bin relations to groups. A list-based cluster result representation enables the scientist to communicate multivariate findings at a glance. We demonstrate the applicability of the system with two case studies from the earth observation domain and the prostate cancer research domain. In both cases, the system enabled us to identify the most interesting multivariate bin relations, to validate already published results, and, moreover, to discover unexpected relations.

  9. Disentangling Multidimensional Spatio-Temporal Data into Their Common and...

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Young Hwan Chang; James Korkola; Dhara N. Amin; Mark M. Moasser; Jose M. Carmena; Joe W. Gray; Claire J. Tomlin (2023). Disentangling Multidimensional Spatio-Temporal Data into Their Common and Aberrant Responses [Dataset]. http://doi.org/10.1371/journal.pone.0121607
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Young Hwan Chang; James Korkola; Dhara N. Amin; Mark M. Moasser; Jose M. Carmena; Joe W. Gray; Claire J. Tomlin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    With the advent of high-throughput measurement techniques, scientists and engineers are starting to grapple with massive data sets and encountering challenges with how to organize, process and extract information into meaningful structures. Multidimensional spatio-temporal biological data sets such as time series gene expression with various perturbations over different cell lines, or neural spike trains across many experimental trials, have the potential to acquire insight about the dynamic behavior of the system. For this potential to be realized, we need a suitable representation to understand the data. A general question is how to organize the observed data into meaningful structures and how to find an appropriate similarity measure. A natural way of viewing these complex high dimensional data sets is to examine and analyze the large-scale features and then to focus on the interesting details. Since the wide range of experiments and unknown complexity of the underlying system contribute to the heterogeneity of biological data, we develop a new method by proposing an extension of Robust Principal Component Analysis (RPCA), which models common variations across multiple experiments as the lowrank component and anomalies across these experiments as the sparse component. We show that the proposed method is able to find distinct subtypes and classify data sets in a robust way without any prior knowledge by separating these common responses and abnormal responses. Thus, the proposed method provides us a new representation of these data sets which has the potential to help users acquire new insight from data.
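The low-rank-plus-sparse decomposition the authors build on can be illustrated with a standard principal component pursuit ADMM loop. This is a generic RPCA sketch on synthetic data, not the authors' extension: the rank-1 matrix stands in for the "common responses" and the isolated spikes for the "aberrant responses".

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca(M, max_iter=500, tol=1e-7):
    """Decompose M into a low-rank L plus a sparse S via ADMM."""
    mu = M.size / (4.0 * np.abs(M).sum())
    lam = 1.0 / np.sqrt(max(M.shape))
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        resid = M - L - S
        Y = Y + mu * resid
        if np.linalg.norm(resid) / np.linalg.norm(M) < tol:
            break
    return L, S

rng = np.random.default_rng(0)
common = np.outer(rng.normal(size=50), rng.normal(size=40))  # rank-1 "common response"
aberrant = np.zeros((50, 40))
aberrant[5, 7], aberrant[20, 3] = 10.0, -8.0                 # sparse "aberrant responses"
L, S = rpca(common + aberrant)
```

On this synthetic example the sparse component S isolates the two injected spikes while L recovers the shared rank-1 structure.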

  10. Area Resource File (ARF)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Area Resource File (ARF) [Dataset]. http://doi.org/10.7910/DVN/8NMSFV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Analyze the Area Resource File (ARF) with R. The ARF is fun to say out loud. It's also a single county-level data table with about 6,000 variables, produced by the United States Health Resources and Services Administration (HRSA). The file contains health information and statistics for over 3,000 US counties. Like many government agencies, HRSA provides only a SAS importation script and an ASCII file. This new GitHub repository contains two scripts:

    • 2011-2012 arf - download.R: downloads the zipped Area Resource File directly onto your local computer, loads the entire table into a temporary SQL database, and saves the condensed file as an R data file (.rda), comma-separated value file (.csv), and/or Stata-readable file (.dta).
    • 2011-2012 arf - analysis examples.R: limits the ARF to the variables necessary for your analysis, sums up a few county-level statistics, merges the ARF onto other data sets using both FIPS and SSA county codes, and creates a sweet county-level map.

    Click here to view these two scripts. For more detail about the Area Resource File (ARF), visit the ARF home page and the HRSA data warehouse. Notes: the ARF may not be a survey data set itself, but it's particularly useful to merge onto other survey data. Confidential to SAS, SPSS, Stata, and SUDAAN users: time to put down the abacus. Time to transition to R. :D

  11. 🖼️ Famous Paintings

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Cite
    mexwell (2023). 🖼️ Famous Paintings [Dataset]. https://www.kaggle.com/datasets/mexwell/famous-paintings
    Explore at:
    Available download formats: zip (6681482 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    mexwell
    Description

    Famous paintings and their artists. This data set is published to give students interesting data for practicing SQL.

    Original Data

    Acknowledgement

    Photo by Steve Johnson on Unsplash

  12. Empirical overall power of the CTOT, MC, and CO methods with analysis on...

    • plos.figshare.com
    xls
    Updated Jun 7, 2023
    + more versions
    Cite
    Wei Zhuang; Luísa Camacho; Camila S. Silva; Michael Thomson; Kevin Snyder (2023). Empirical overall power of the CTOT, MC, and CO methods with analysis on benchmark data. [Dataset]. http://doi.org/10.1371/journal.pone.0263070.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Wei Zhuang; Luísa Camacho; Camila S. Silva; Michael Thomson; Kevin Snyder
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    BFD stands for the benchmark with full data analyzed with the current standard method, which includes t-tests for two-group comparisons. CTOT stands for the cycle-to-threshold method, while CO denotes the complete-observation method and MC denotes the method that sets uncertain observations equal to the assay-specific maximum cycle threshold C1.
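The two-group t-test mentioned as part of the current standard method can be illustrated with scipy. The synthetic cycle-threshold values and the group means below are assumptions for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(24.0, 1.0, size=12)   # e.g. cycle-threshold values, group A
group_b = rng.normal(26.0, 1.0, size=12)   # group B, with an assumed shifted mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)
```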

  13. Data from: Implications of Peak Selection in the Interpretation of...

    • acs.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Teresa Murta; Rory T. Steven; Chelsea J. Nikula; Spencer A. Thomas; Lucas B. Zeiger; Alex Dexter; Efstathios A. Elia; Bin Yan; Andrew D. Campbell; Richard J. A. Goodwin; Zoltan Takáts; Owen J. Sansom; Josephine Bunch (2023). Implications of Peak Selection in the Interpretation of Unsupervised Mass Spectrometry Imaging Data Analyses [Dataset]. http://doi.org/10.1021/acs.analchem.0c04179.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Teresa Murta; Rory T. Steven; Chelsea J. Nikula; Spencer A. Thomas; Lucas B. Zeiger; Alex Dexter; Efstathios A. Elia; Bin Yan; Andrew D. Campbell; Richard J. A. Goodwin; Zoltan Takáts; Owen J. Sansom; Josephine Bunch
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Mass spectrometry imaging can produce large amounts of complex spectral and spatial data. Such data sets are often analyzed with unsupervised machine learning approaches, which aim at reducing their complexity and facilitating their interpretation. However, choices made during data processing can impact the overall interpretation of these analyses. This work investigates the impact of the choices made at the peak selection step, which often occurs early in the data processing pipeline. The discussion is done in terms of visualization and interpretation of the results of two commonly used unsupervised approaches: t-distributed stochastic neighbor embedding and k-means clustering, which differ in nature and complexity. Criteria considered for peak selection include those based on hypotheses (exemplified herein in the analysis of metabolic alterations in genetically engineered mouse models of human colorectal cancer), particular molecular classes, and ion intensity. The results suggest that the choices made at the peak selection step have a significant impact in the visual interpretation of the results of either dimensionality reduction or clustering techniques and consequently in any downstream analysis that relies on these. Of particular significance, the results of this work show that while using the most abundant ions can result in interesting structure-related segmentation patterns that correlate well with histological features, using a smaller number of ions specifically selected based on prior knowledge about the biochemistry of the tissues under investigation can result in an easier-to-interpret, potentially more valuable, hypothesis-confirming result. Findings presented will help researchers understand and better utilize unsupervised machine learning approaches to mine high-dimensionality data.
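One intensity-based peak-selection criterion discussed above, keeping only the N most abundant ions before clustering, can be sketched as follows. The data here are synthetic stand-ins for a pixels-by-peaks intensity matrix, not the authors' pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
spectra = rng.gamma(2.0, 1.0, size=(200, 500))      # pixels x m/z peaks (synthetic)
top_n = 50
keep = np.argsort(spectra.mean(axis=0))[-top_n:]    # indices of the most abundant ions
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(spectra[:, keep])
```

Swapping `keep` for a list of ions chosen from prior biochemical knowledge is the alternative criterion the text contrasts with.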

  14. MCAM: Multiple Clustering Analysis Methodology for Deriving Hypotheses and...

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Kristen M. Naegle; Roy E. Welsch; Michael B. Yaffe; Forest M. White; Douglas A. Lauffenburger (2023). MCAM: Multiple Clustering Analysis Methodology for Deriving Hypotheses and Insights from High-Throughput Proteomic Datasets [Dataset]. http://doi.org/10.1371/journal.pcbi.1002119
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Kristen M. Naegle; Roy E. Welsch; Michael B. Yaffe; Forest M. White; Douglas A. Lauffenburger
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Advances in proteomic technologies continue to substantially accelerate capability for generating experimental data on protein levels, states, and activities in biological samples. For example, studies on receptor tyrosine kinase signaling networks can now capture the phosphorylation state of hundreds to thousands of proteins across multiple conditions. However, little is known about the function of many of these protein modifications, or the enzymes responsible for modifying them. To address this challenge, we have developed an approach that enhances the power of clustering techniques to infer functional and regulatory meaning of protein states in cell signaling networks. We have created a new computational framework for applying clustering to biological data in order to overcome the typical dependence on specific a priori assumptions and expert knowledge concerning the technical aspects of clustering. Multiple clustering analysis methodology (‘MCAM’) employs an array of diverse data transformations, distance metrics, set sizes, and clustering algorithms, in a combinatorial fashion, to create a suite of clustering sets. These sets are then evaluated based on their ability to produce biological insights through statistical enrichment of metadata relating to knowledge concerning protein functions, kinase substrates, and sequence motifs. We applied MCAM to a set of dynamic phosphorylation measurements of the ERRB network to explore the relationships between algorithmic parameters and the biological meaning that could be inferred and report on interesting biological predictions. Further, we applied MCAM to multiple phosphoproteomic datasets for the ERBB network, which allowed us to compare independent and incomplete overlapping measurements of phosphorylation sites in the network. We report specific and global differences of the ERBB network stimulated with different ligands and with changes in HER2 expression. 
Overall, we offer MCAM as a broadly-applicable approach for analysis of proteomic data which may help increase the current understanding of molecular networks in a variety of biological problems.
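    The combinatorial core of MCAM — enumerating every combination of transformation, distance metric, set size, and algorithm to build a suite of clusterings — can be sketched in a few lines. The parameter values below are illustrative stand-ins, not the actual options evaluated in the paper.

```python
from itertools import product

# Hypothetical parameter grid illustrating MCAM's combinatorial design;
# the real transformations, metrics, and algorithms are those described
# in the publication, not these stand-ins.
transforms = ["log", "z-score", "rank"]
metrics = ["euclidean", "correlation"]
set_sizes = [5, 10, 20]
algorithms = ["k-means", "hierarchical"]

# Each combination defines one clustering run; MCAM evaluates the whole
# suite by metadata enrichment rather than committing to a single choice.
suite = [
    {"transform": t, "metric": m, "k": k, "algorithm": a}
    for t, m, k, a in product(transforms, metrics, set_sizes, algorithms)
]

print(len(suite))  # 3 * 2 * 3 * 2 = 36 clustering configurations
```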

  15. Human Resources Data Set

    • kaggle.com
    zip
    Updated Oct 19, 2020
    Cite
    Dr. Rich (2020). Human Resources Data Set [Dataset]. https://www.kaggle.com/datasets/rhuebner/human-resources-data-set/discussion
    Explore at:
    zip (17041 bytes). Available download formats
    Dataset updated
    Oct 19, 2020
    Authors
    Dr. Rich
    Description

    Updated 30 January 2023

    Version 14 of Dataset

    License Update:

    There has been some confusion around licensing for this data set. Dr. Carla Patalano and Dr. Rich Huebner are the original authors of this dataset.

    We provide a license to anyone who wishes to use this dataset for learning or teaching. For the purposes of sharing, please follow this license:

    CC BY-NC-ND: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

    Codebook

    https://rpubs.com/rhuebner/hrd_cb_v14

    PLEASE NOTE -- I recently updated the codebook - please use the above link. A few minor discrepancies were identified between the codebook and the dataset. Please feel free to contact me through LinkedIn (www.linkedin.com/in/RichHuebner) to report discrepancies and make requests.

    Context

    HR data can be hard to come by, and HR professionals generally lag behind with respect to analytics and data visualization competency. Thus, Dr. Carla Patalano and I set out to create our own HR-related dataset, which is used in one of our graduate MSHRM courses called HR Metrics and Analytics, at New England College of Business. We created this data set ourselves. We use the data set to teach HR students how to use and analyze the data in Tableau Desktop - a data visualization tool that's easy to learn.

    This version provides a variety of features that are useful for both data visualization AND creating machine learning / predictive analytics models. We are working on expanding the data set even further by generating even more records and a few additional features. We will be keeping this as one file/one data set for now. There is a possibility of creating a second file perhaps down the road where you can join the files together to practice SQL/joins, etc.

    Note that this dataset isn't perfect. By design, there are some issues that are present. It is primarily designed as a teaching data set - to teach human resources professionals how to work with data and analytics.

    Content

    We have reduced the complexity of the dataset down to a single data file (v14). The CSV revolves around a fictitious company and the core data set contains names, DOBs, age, gender, marital status, date of hire, reasons for termination, department, whether they are active or terminated, position title, pay rate, manager name, and performance score.

    Recent additions to the data include: - Absences - Most Recent Performance Review Date - Employee Engagement Score

    Acknowledgements

    Dr. Carla Patalano provided the baseline idea for creating this synthetic data set, which has been used now by over 200 Human Resource Management students at the college. Students in the course learn data visualization techniques with Tableau Desktop and use this data set to complete a series of assignments.

    Inspiration

    We've included some open-ended questions that you can explore and try to address through creating Tableau visualizations, or R or Python analyses. Good luck and enjoy the learning!

    • Is there any relationship between who a person works for and their performance score?
    • What is the overall diversity profile of the organization?
    • What are our best recruiting sources if we want to ensure a diverse organization?
    • Can we predict who is going to terminate and who isn't? What level of accuracy can we achieve on this?
    • Are there areas of the company where pay is not equitable?

    There are so many other interesting questions that could be addressed through this interesting data set. Dr. Patalano and I look forward to seeing what we can come up with.
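    As one starting point, the pay-equity question above can be sketched with pandas. The column names here (Department, Gender, PayRate) are guesses at the dataset's schema — check the codebook for the real field names — and the toy rows are invented purely for illustration.

```python
import pandas as pd

# Toy stand-in for the HR CSV; the real column names may differ
# (consult the codebook linked above).
hr = pd.DataFrame({
    "Department": ["IT", "IT", "Sales", "Sales"],
    "Gender": ["F", "M", "F", "M"],
    "PayRate": [52.0, 55.0, 30.0, 31.5],
})

# Median pay by department and gender: large within-department gaps
# would flag areas where pay may not be equitable.
pay_gap = hr.groupby(["Department", "Gender"])["PayRate"].median().unstack()
print(pay_gap)
```

    The same groupby pattern extends to the other questions, e.g. grouping performance scores by manager name.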

    If you have any questions or comments about the dataset, please do not hesitate to reach out to me on LinkedIn: http://www.linkedin.com/in/RichHuebner

    You can also reach me via email at: Richard.Huebner@go.cambridgecollege.edu

  16. Association of Protein Translation and Extracellular Matrix Gene Sets with...

    • plos.figshare.com
    tiff
    Updated Jun 2, 2023
    Cite
    Nilotpal Chowdhury; Shantanu Sapru (2023). Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach [Dataset]. http://doi.org/10.1371/journal.pone.0129610
    Explore at:
    tiff. Available download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Nilotpal Chowdhury; Shantanu Sapru
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IntroductionMicroarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis.AimThe aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets.MethodsFour microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate – adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA).ResultsGene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed.ConclusionTo the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. 
This methodology seems to yield new and interesting results and may be used as a tool to guide new research.

  17. Data from: InterHub: A Naturalistic Trajectory Dataset with Dense...

    • figshare.com
    csv
    Updated May 24, 2025
    Cite
    Xiyan Jiang; Xiaocong Zhao; Yiru Liu; Zirui Li; Peng Hang; Lu Xiong; Jian Sun (2025). InterHub: A Naturalistic Trajectory Dataset with Dense Interaction for Autonomous Driving [Dataset]. http://doi.org/10.6084/m9.figshare.27899754.v6
    Explore at:
    csv. Available download formats
    Dataset updated
    May 24, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Xiyan Jiang; Xiaocong Zhao; Yiru Liu; Zirui Li; Peng Hang; Lu Xiong; Jian Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We provide a dense interaction dataset, InterHub, derived from extensive naturalistic driving records to address the scarcity of real-world datasets capturing rich interaction events.

    The dataset provided on this page includes:

    • A CSV file (Interactive_Segments_Index.csv) containing the indexed list of the extracted interaction events. In addition to indexing and tracing information about interaction scenarios, we also provide labels to facilitate more targeted retrieval and utilization of interaction scenarios. (For detailed information, please refer to https://github.com/zxc-tju/InterHub.)
    • The unified data cache files (InterHub_cache_files.zip, which includes cache files of lyft_train_full and nuplan_train).

    The Python code used to process and analyze the dataset can be found at https://github.com/zxc-tju/InterHub. The tools for navigating InterHub involve the following three parts:

    • 0_data_unify.py converts various data resources into a unified format for seamless interaction event extraction.
    • 1_interaction_extract.py extracts interactive segments from unified driving records.
    • 2_case_visualize.py showcases interaction scenarios in InterHub.

    Refer to the data structure of cache files presented in dataset.md, and after extracting InterHub_cache_files.zip, put it in the corresponding folder.

    Statement: All third-party data redistributions included in the InterHub_cache_files.zip repository are carried out in full compliance with the original licensing terms of the respective source datasets, as required by their mandatory licensing conditions. This portion of the data remains subject to its original licenses, and users are required to comply with these original licensing terms in any subsequent use or redistribution.
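    A typical first step is filtering Interactive_Segments_Index.csv for segments of interest. The column names and values below are hypothetical stand-ins (the real schema is documented in the project's GitHub repository), and the two rows are invented for illustration.

```python
import csv
import io

# Toy stand-in for Interactive_Segments_Index.csv; the real file's
# columns and labels may differ (see the InterHub GitHub repo).
toy_index = io.StringIO(
    "segment_id,source_dataset,duration_s\n"
    "0001,lyft_train_full,8.2\n"
    "0002,nuplan_train,12.5\n"
)

# Retrieve only the segments drawn from one of the cached source datasets.
rows = [r for r in csv.DictReader(toy_index)
        if r["source_dataset"] == "nuplan_train"]
print(len(rows), rows[0]["segment_id"])
```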

  18. Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    + more versions
    Cite
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 3, 2003
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
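    To illustrate the binary sentiment task this dataset supports, here is a toy word-counting scorer. This is invented purely for demonstration — it is not the benchmark method associated with the dataset, and the word lists are arbitrary.

```python
# Toy polarity scorer illustrating binary sentiment classification;
# not the dataset's baseline, just a minimal sketch.
POSITIVE = {"great", "wonderful", "excellent"}
NEGATIVE = {"awful", "boring", "terrible"}

def polarity(review: str) -> str:
    """Label a review 'pos' or 'neg' by counting polar words."""
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "pos" if score >= 0 else "neg"

print(polarity("a truly wonderful film"))      # pos
print(polarity("boring and terrible plot"))    # neg
```

    Real use of the dataset would of course train on the 25,000 labeled reviews rather than a fixed word list.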
    
  19. MIND: Multilingual Imaging Neuro Dataset

    • openneuro.org
    Updated Aug 6, 2025
    Cite
    Xuanyi Jessica Chen; Maxwell Salvadore; Esti Blanco-Elorrieta (2025). MIND: Multilingual Imaging Neuro Dataset [Dataset]. http://doi.org/10.18112/openneuro.ds006391.v2.0.0
    Explore at:
    Dataset updated
    Aug 6, 2025
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Xuanyi Jessica Chen; Maxwell Salvadore; Esti Blanco-Elorrieta
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    MIND: Multilingual Imaging Neuro Dataset

    This repository contains structural and functional MRI data of 126 monolingual and bilingual participants with varying language backgrounds and proficiencies.

    This README is organized into two sections:

    1. Usage describes how one can go about recreating data derivatives and brain measures from start to finish.
    2. Directories gives information on the file structure of the dataset.

    If you just want access to the processed brain and language data, go to Quick Start.

    Usage

    There are two ways to approach this dataset. If you want to jump immediately into analyzing participants and their language profiles, go to Quick Start. If instead you want to go from low-level MRI data to cleaned CSVs with various brain measure types, either to learn the process or to double-check our work, go to Data Replication.

    Quick Start

    If you just want access to cleaned brain measure and language history data of 126 participants, they can be found in the following folders:

    Each folder has a metadata.xlsx file that gives more information on the files and their fields. Have fun, go nuts.

    Data Replication

    If you are looking to go through the steps required to create the data from start to finish, we first start with the raw structural and functional MRI data, which can be found in ./sub-EBE{XXXX}. Information on the data in this folder, which follows BIDS, can be found here.

    The data in ./sub-EBE{XXXX} is then inputted into various processing pipelines, the versions for which can be found at Dependency versions. The following processing pipelines are used:

    • fMRIprep

      fMRIprep is a neuroimaging processing tool used for task-based and resting-state fMRI data. fMRIprep is not used directly to create brain measure CSVs used in analysis, but instead creates processed fMRI data used in the CONN toolbox. For more information on fMRIprep and how to use it, click here.

    • CAT12

      We use the CAT12 toolbox, which stands for Computational Anatomy Toolbox, to calculate brain region volumes using voxel-based morphometry (VBM). CAT12 works through SPM12 and Matlab, and requires that both be installed. We have included the Matlab scripts used to create the files in ./derivatives/CAT12 in preprocessing_scripts/cat12. To use it, install the necessary dependencies (CAT12, SPM12, and Matlab) and run preprocessing_scripts/cat12/CAT12_segmentation_n2.m in Matlab. You will also need to update the script with your local Matlab path on lines 12, 24, and 41. For more information on CAT12 and how to use it to calculate brain region volumes using VBM, click here.

    • CONN

      CONN is a functional connectivity toolbox, which we used to generate participant brain connectivity measures. CONN requires first that you run the fMRIprep pipeline, as it uses some of fMRIprep's outputs as input. Like CAT12, CONN works through SPM12 and Matlab and requires that both be installed. For more information on CONN and how to use it, click here.

    • FDT

      We used FMRIB's Diffusion Toolbox (FDT) for extracting values from diffusion weighted images. For more information on FDT and how to use it, click here.

    • Freesurfer

      FreeSurfer is a software package for the analysis and visualization of structural and functional neuroimaging data, which we use to extract region volumes and cortical thickness through surface-based morphometry (SBM). For more information on Freesurfer and how to use it, click here.


    The results from these pipelines, which use the data in ./sub-EBE{XXXX} as input, are then outputted into folders in ./derivatives. For information on which folder stores each pipeline result, see Directories.

    After running these pipelines, we can take their outputs and convert them into CSVs for analysis. To do this, we use preprocessing_scripts/brain_data_preprocessing.ipynb. This Python notebook takes the data in ./derivatives as input and outputs CSVs to processing_output. Outputted from this notebook are CSVs containing brain volumes, cortical thicknesses, fractional anisotropy values, and connectivity measures. Information on the outputted CSVs can be found at processing_output/metadata.xlsx.
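    Once the CSVs exist, a typical analysis joins a brain-measure file with the language data. The sketch below uses toy stand-ins for processing_output/sbm_thickness.csv and language_background.csv; the participant-ID column name and the region/measure columns are guesses (check metadata.xlsx for the real fields).

```python
import pandas as pd

# Toy stand-ins for the processed CSVs; real column names may differ
# (see processing_output/metadata.xlsx).
thickness = pd.DataFrame({
    "participant_id": ["EBE0001", "EBE0002"],
    "lh_superiorfrontal": [2.71, 2.64],  # hypothetical region column
})
language = pd.DataFrame({
    "participant_id": ["EBE0001", "EBE0002"],
    "n_languages": [1, 3],  # hypothetical language-history field
})

# Join brain measures with language history for downstream analysis.
merged = thickness.merge(language, on="participant_id")
print(merged.shape)  # (2, 3)
```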

    Dependency versions

    1. MATLAB v. R2023a
    2. SPM12
    3. CAT12 v8.2
    4. CONN v22a
    5. FSL v6.0.2
    6. Freesurfer v7.4.1
    7. fMRIprep v23.0.2

    Chen, Salvadore, & Blanco-Elorrieta Paper Replication

    Also included in this dataset is code used in the analyses of Chen, Salvadore, & Blanco-Elorrieta (submitted). If you are interested in running analyses used in that paper, see the README in chen_salvadore_elorrieta/code.


    Directories

    • participants.tsv: Subject demographic information.
    • participants.json: Describes participants.tsv.

    • sub-EBE

      Each of these directories contains the BIDS-formatted anatomical and functional MRI data, with the directory name corresponding to the subject's unique identifier. For more information on the subfolders, see BIDS information here.

    • derivatives

      This directory contains outputs of common processing pipelines run on the raw MRI data from ./sub-EBE{XXXX}.

      • CAT12

        Results of the CAT12 toolbox, which stands for Computational Anatomy Toolbox, and is used to calculate brain region volumes using voxel-based morphometry (VBM).

      • conn

        Results of the CONN toolbox, used to generate data on functional connectivity from brain fMRI sequences.

      • fdt

        Results of the FMRIB's Diffusion Toolbox (FDT), used for extracting values from diffusion weighted images.

      • fMRIprep

        Results from fMRIprep, a preprocessing pipeline for task-based and resting-state functional MRI data.

      • freesurfer

        Results from FreeSurfer, a software package for the analysis and visualization of structural and functional neuroimaging data.

    • language_background

      Participant information is kept on the first level of the dataset and includes information on language history, demographics, and their composite multilingualism score. Below is a list of all participant information files.

      • language_background.csv: Full subject language information and history.

      • metadata.xlsx: Metadata on each file in this directory.

      • multilingual_measure.csv: Each participant’s composite multilingualism score specified in Chen & Blanco-Elorrieta (in review).

    • processing_output

      This directory contains processed brain measure data for brain volumes, cortical thickness, FA, and connectivity. The CSVs are created from scripts in the directory processing_scripts using files in the derivatives directory as input. Descriptions of each file can be found below.

      • connectivity_network.csv: Contains 36 Network-to-Network connectivity values for each participant.

      • connectivity_roi.csv: Contains 13,336 ROI-to-ROI connectivity values for each participant.

      • dti.csv: Contains averaged white matter FA values for 76 brain regions for each participant based on Diffusion tensor imaging.

      • metadata.xlsx: Metadata on each file in this directory.

      • sbm_thickness.csv: Contains cortical thickness values for 68 brain regions for each participant based on Surface-based morphometry.

      • sbm_volume.csv: Contains volume values for 165 brain regions for each participant based on Surface-based morphometry.

      • tiv.csv: Contains two total intracranial volumes for each subject, calculated using SBM and VBM respectively.

      • vbm_volume.csv: Contains volume values for 153 brain regions for each participant based on Voxel-based morphometry.

    • preprocessing_scripts

      Code involved in processing raw MRI data.

      • brain_data_preprocessing.ipynb Python notebook used to create CSVs with brain measure values used in analyses. For more information on the code and how to use it, read Data Replication.
      • raw_mri_preprocessing Scripts used to create some files in the ./derivatives folder from raw MRI data in ./sub-EBE{XXXX}. For more information on the scripts, read Data Replication.
      • toolbox_outputs Intermediary files created and used by analysis/processing_scripts/brain_data_preprocessing.ipynb.

  20. Data from: Plant Expression Database

    • agdatacommons.nal.usda.gov
    • datasetcatalog.nlm.nih.gov
    • +2more
    bin
    Updated Feb 9, 2024
    + more versions
    Cite
    Sudhansu S. Dash; John Van Hemert; Lu Hong; Roger P. Wise; Julie A. Dickerson (2024). Plant Expression Database [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Plant_Expression_Database/24661179
    Explore at:
    bin. Available download formats
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    PLEXdb
    Authors
    Sudhansu S. Dash; John Van Hemert; Lu Hong; Roger P. Wise; Julie A. Dickerson
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    [NOTE: PLEXdb is no longer available online. Oct 2019.] PLEXdb (Plant Expression Database) is a unified gene expression resource for plants and plant pathogens. PLEXdb is a genotype-to-phenotype, hypothesis-building information warehouse, leveraging highly parallel expression data with seamless portals to related genetic, physical, and pathway data. PLEXdb (http://www.plexdb.org), in partnership with community databases, supports comparisons of gene expression across multiple plant and pathogen species, promoting individuals and/or consortia to upload genome-scale data sets to contrast them with previously archived data. These analyses facilitate the interpretation of structure, function and regulation of genes in economically important plants. A list of Gene Atlas experiments highlights data sets that give responses across different developmental stages, conditions and tissues. Tools at PLEXdb allow users to perform complex analyses quickly and easily. The Model Genome Interrogator (MGI) tool supports mapping gene lists onto corresponding genes from model plant organisms, including rice and Arabidopsis. MGI predicts homologies, displays gene structures and supporting information for annotated genes and full-length cDNAs. The gene list-processing wizard guides users through PLEXdb functions for creating, analyzing, annotating and managing gene lists. Users can upload their own lists or create them from the output of PLEXdb tools, and then apply diverse higher-level analyses, such as ANOVA and clustering. PLEXdb also provides methods for users to track how gene expression changes across many different experiments using the Gene OscilloScope. This tool can identify interesting expression patterns, such as up-regulation under diverse conditions or checking any gene’s suitability as a steady-state control.

    Resources in this dataset:
    Resource Title: Website Pointer for Plant Expression Database, Iowa State University.
    File Name: Web Page, url: https://www.bcb.iastate.edu/plant-expression-database. Project description for the Plant Expression Database (PLEXdb) and integrated tools.
