100+ datasets found

Top 2500 Kaggle Datasets
kaggle.com
Updated Feb 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Saket Kumar (2024). Top 2500 Kaggle Datasets [Dataset]. http://doi.org/10.34740/kaggle/dsv/7637365
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/7637365
Dataset updated
Feb 16, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Saket Kumar
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
This dataset compiles the top 2500 datasets from Kaggle, encompassing a diverse range of topics and contributors. It provides insights into dataset creation, usability, popularity, and more, offering valuable information for researchers, analysts, and data enthusiasts.

Research Analysis: Researchers can utilize this dataset to analyze trends in dataset creation, popularity, and usability scores across various categories.

Contributor Insights: Kaggle contributors can explore the dataset to gain insights into factors influencing the success and engagement of their datasets, aiding in optimizing future submissions.

Machine Learning Training: Data scientists and machine learning enthusiasts can use this dataset to train models for predicting dataset popularity or usability based on features such as creator, category, and file types.

Market Analysis: Analysts can leverage the dataset to conduct market analysis, identifying emerging trends and popular topics within the data science community on Kaggle.

Educational Purposes: Educators and students can use this dataset to teach and learn about data analysis, visualization, and interpretation within the context of real-world datasets and community-driven platforms like Kaggle.

Column Definitions:

Dataset Name: Name of the dataset. Created By: Creator(s) of the dataset. Last Updated in number of days: Time elapsed since last update. Usability Score: Score indicating the ease of use. Number of File: Quantity of files included. Type of file: Format of files (e.g., CSV, JSON). Size: Size of the dataset. Total Votes: Number of votes received. Category: Categorization of the dataset's subject matter.
H
Political Analysis Using R: Example Code and Data, Plus Data for Practice...
dataverse.harvard.edu
search.dataone.org
Updated Apr 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jamie Monogan (2020). Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems [Dataset]. http://doi.org/10.7910/DVN/ARKOTI
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/ARKOTI
Dataset updated
Apr 28, 2020
Dataset provided by
Harvard Dataverse
Authors
Jamie Monogan
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
Powerful Data for Power BI
kaggle.com
zip
Updated Aug 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shiv_D24Coder (2023). Powerful Data for Power BI [Dataset]. https://www.kaggle.com/datasets/shivd24coder/powerful-data-for-power-bi
Explore at:
zip(907404 bytes)Available download formats
Dataset updated
Aug 28, 2023
Authors
Shiv_D24Coder
Description
Explore the world of data visualization with this Power BI dataset containing HR Analytics and Sales Analytics datasets. Gain insights, create impactful reports, and craft engaging dashboards using real-world data from HR and sales domains. Sharpen your Power BI skills and uncover valuable data-driven insights with this powerful dataset. Happy analyzing!
All Seaborn Built-in Datasets 📊✨
kaggle.com
zip
Updated Aug 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abdelrahman Mohamed (2024). All Seaborn Built-in Datasets 📊✨ [Dataset]. https://www.kaggle.com/datasets/abdoomoh/all-seaborn-built-in-datasets
Explore at:
zip(1383218 bytes)Available download formats
Dataset updated
Aug 27, 2024
Authors
Abdelrahman Mohamed
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Description: - This dataset includes all 22 built-in datasets from the Seaborn library, a widely used Python data visualization tool. Seaborn's built-in datasets are essential resources for anyone interested in practicing data analysis, visualization, and machine learning. They span a wide range of topics, from classic datasets like the Iris flower classification to real-world data such as Titanic survival records and diamond characteristics.

Included Datasets:

Anagrams: Analysis of word anagram patterns.

Anscombe: Anscombe's quartet demonstrating the importance of data visualization.

Attention: Data on attention span variations in different scenarios.

Brain Networks: Connectivity data within brain networks.

Car Crashes: US car crash statistics.

Diamonds: Data on diamond properties including price, cut, and clarity.

Dots: Randomly generated data for scatter plot visualization.

Dow Jones: Historical records of the Dow Jones Industrial Average.

Exercise: The relationship between exercise and health metrics.

Flights: Monthly passenger numbers on flights.

FMRI: Functional MRI data capturing brain activity.

Geyser: Eruption times of the Old Faithful geyser.

Glue: Strength of glue under different conditions.

Health Expenditure: Health expenditure statistics across countries.

Iris: Famous dataset for classifying Iris species.

MPG: Miles per gallon for various vehicles.

Penguins: Data on penguin species and their features.

Planets: Characteristics of discovered exoplanets.

Sea Ice: Measurements of sea ice extent.

Taxis: Taxi trips data in a city.

Tips: Tipping data collected from a restaurant.

Titanic: Survival data from the Titanic disaster.

This complete collection serves as an excellent starting point for anyone looking to improve their data science skills, offering a wide array of datasets suitable for both beginners and advanced users.
N
Advance, IN Annual Population and Growth Analysis Dataset: A Comprehensive...
neilsberg.com
csv, json
Updated Jul 30, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neilsberg Research (2024). Advance, IN Annual Population and Growth Analysis Dataset: A Comprehensive Overview of Population Changes and Yearly Growth Rates in Advance from 2000 to 2023 // 2024 Edition [Dataset]. https://www.neilsberg.com/insights/advance-in-population-by-year/
Explore at:
csv, jsonAvailable download formats
Dataset updated
Jul 30, 2024
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
IN, Advance
Variables measured
Annual Population Growth Rate, Population Between 2000 and 2023, Annual Population Growth Rate Percent
Measurement technique
The data presented in this dataset is derived from the 20 years data of U.S. Census Bureau Population Estimates Program (PEP) 2000 - 2023. To measure the variables, namely (a) population and (b) population change in ( absolute and as a percentage ), we initially analyzed and tabulated the data for each of the years between 2000 and 2023. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the Advance population over the last 20 plus years. It lists the population for each year, along with the year on year change in population, as well as the change in percentage terms for each year. The dataset can be utilized to understand the population change of Advance across the last two decades. For example, using this dataset, we can identify if the population is declining or increasing. If there is a change, when the population peaked, or if it is still growing and has not reached its peak. We can also compare the trend with the overall trend of United States population over the same period of time.

Key observations

In 2023, the population of Advance was 505, a 0.40% increase year-by-year from 2022. Previously, in 2022, Advance population was 503, a decline of 0.59% compared to a population of 506 in 2021. Over the last 20 plus years, between 2000 and 2023, population of Advance decreased by 54. In this period, the peak population was 598 in the year 2009. The numbers suggest that the population has already reached its peak and is showing a trend of decline. Source: U.S. Census Bureau Population Estimates Program (PEP).

Content

When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

Data Coverage:

From 2000 to 2023

Variables / Data Columns

Year: This column displays the data year (Measured annually and for years 2000 to 2023)

Population: The population for the specific year for the Advance is shown in this column.

Year on Year Change: This column displays the change in Advance population for each year compared to the previous year.

Change in Percent: This column displays the year on year change as a percentage. Please note that the sum of all percentages may not equal one due to rounding of values.

Good to know

Margin of Error

Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.

Custom data

If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for Advance Population by Year. You can refer the same here
u
Optimization and Evaluation Datasets for PiMine
fdr.uni-hamburg.de
md, zip
Updated Jan 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Graef, Joel; Ehrt, Christiane; Reim, Thorben; Rarey, Matthias (2024). Optimization and Evaluation Datasets for PiMine [Dataset]. http://doi.org/10.25592/uhhfdm.13972
Explore at:
md, zipAvailable download formats
Unique identifier
https://doi.org/10.25592/uhhfdm.13972
Dataset updated
Jan 22, 2024
Dataset provided by
ZBH Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany
Authors
Graef, Joel; Ehrt, Christiane; Reim, Thorben; Rarey, Matthias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions.[1]

The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities.

In addition, we added the results of the case studies analyzed in [1] to enable readers to follow the discussion and investigate the results individually.

Data Set description:

The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine.

The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated.

The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes of which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of the performance when the interfaces of apparently unrelated chains are available only. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains which can be used for alignment performance assessments.

Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and represents also an interesting dataset to screen for interface similarities.

References:

[1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted)
[2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012.
[3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265.
[4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055.
Data from: Clinical Dataset
kaggle.com
zip
Updated Oct 5, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohamadreza Momeni (2023). Clinical Dataset [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/clinical-dataset
Explore at:
zip(16220 bytes)Available download formats
Dataset updated
Oct 5, 2023
Authors
Mohamadreza Momeni
Description
The purest type of electronic clinical data which is obtained at the point of care at a medical facility, hospital, clinic or practice. Often referred to as the electronic medical record (EMR), the EMR is generally not available to outside researchers. The data collected includes administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.

Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network provides mediated or collaborative access to clinical data repositories by eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.

About Dataset:

333 scholarly articles cite this dataset.

Unique identifier: DOI

Dataset updated: 2023

Authors: Haoyang Mi

In this dataset, we have two dataset:

1- Clinical Data_Discovery_Cohort: Name of columns: Patient ID Specimen date Dead or Alive Date of Death Date of last Follow Sex Race Stage Event Time

2- Clinical_Data_Validation_Cohort Name of columns: Patient ID Survival time (days) Event Tumor size Grade Stage Age Sex Cigarette Pack per year Type Adjuvant Batch EGFR KRAS

Feel free to put your thought and analysis in a notebook for this datasets. And you can create some interesting and valuable ML projects for this case. Thanks for your attention.
t
30 years of synoptic observations from Neumayer Station with links to...
service.tib.eu
Updated Nov 29, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). 30 years of synoptic observations from Neumayer Station with links to datasets - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/png-doi-10-1594-pangaea-150017
Explore at:
Dataset updated
Nov 29, 2024
Description
The analysis of research data plays a key role in data-driven areas of science. Varieties of mixed research data sets exist and scientists aim to derive or validate hypotheses to find undiscovered knowledge. Many analysis techniques identify relations of an entire dataset only. This may level the characteristic behavior of different subgroups in the data. Like automatic subspace clustering, we aim at identifying interesting subgroups and attribute sets. We present a visual-interactive system that supports scientists to explore interesting relations between aggregated bins of multivariate attributes in mixed data sets. The abstraction of data to bins enables the application of statistical dependency tests as the measure of interestingness. An overview matrix view shows all attributes, ranked with respect to the interestingness of bins. Complementary, a node-link view reveals multivariate bin relations by positioning dependent bins close to each other. The system supports information drill-down based on both expert knowledge and algorithmic support. Finally, visual-interactive subset clustering assigns multivariate bin relations to groups. A list-based cluster result representation enables the scientist to communicate multivariate findings at a glance. We demonstrate the applicability of the system with two case studies from the earth observation domain and the prostate cancer research domain. In both cases, the system enabled us to identify the most interesting multivariate bin relations, to validate already published results, and, moreover, to discover unexpected relations.
Disentangling Multidimensional Spatio-Temporal Data into Their Common and...
plos.figshare.com
pdf
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Young Hwan Chang; James Korkola; Dhara N. Amin; Mark M. Moasser; Jose M. Carmena; Joe W. Gray; Claire J. Tomlin (2023). Disentangling Multidimensional Spatio-Temporal Data into Their Common and Aberrant Responses [Dataset]. http://doi.org/10.1371/journal.pone.0121607
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0121607
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Young Hwan Chang; James Korkola; Dhara N. Amin; Mark M. Moasser; Jose M. Carmena; Joe W. Gray; Claire J. Tomlin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the advent of high-throughput measurement techniques, scientists and engineers are starting to grapple with massive data sets and encountering challenges with how to organize, process and extract information into meaningful structures. Multidimensional spatio-temporal biological data sets such as time series gene expression with various perturbations over different cell lines, or neural spike trains across many experimental trials, have the potential to acquire insight about the dynamic behavior of the system. For this potential to be realized, we need a suitable representation to understand the data. A general question is how to organize the observed data into meaningful structures and how to find an appropriate similarity measure. A natural way of viewing these complex high dimensional data sets is to examine and analyze the large-scale features and then to focus on the interesting details. Since the wide range of experiments and unknown complexity of the underlying system contribute to the heterogeneity of biological data, we develop a new method by proposing an extension of Robust Principal Component Analysis (RPCA), which models common variations across multiple experiments as the lowrank component and anomalies across these experiments as the sparse component. We show that the proposed method is able to find distinct subtypes and classify data sets in a robust way without any prior knowledge by separating these common responses and abnormal responses. Thus, the proposed method provides us a new representation of these data sets which has the potential to help users acquire new insight from data.
H
Area Resource File (ARF)
dataverse.harvard.edu
Updated May 30, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anthony Damico (2013). Area Resource File (ARF) [Dataset]. http://doi.org/10.7910/DVN/8NMSFV
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/8NMSFV
Dataset updated
May 30, 2013
Dataset provided by
Harvard Dataverse
Authors
Anthony Damico
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
analyze the area resource file (arf) with r the arf is fun to say out loud. it's also a single county-level data table with about 6,000 variables, produced by the united states health services and resources administration (hrsa). the file contains health information and statistics for over 3,000 us counties. like many government agencies, hrsa provides only a sas importation script and an as cii file. this new github repository contains two scripts: 2011-2012 arf - download.R download the zipped area resource file directly onto your local computer load the entire table into a temporary sql database save the condensed file as an R data file (.rda), comma-separated value file (.csv), and/or stata-readable file (.dta). 2011-2012 arf - analysis examples.R limit the arf to the variables necessary for your analysis sum up a few county-level statistics merge the arf onto other data sets, using both fips and ssa county codes create a sweet county-level map click here to view these two scripts for mo re detail about the area resource file (arf), visit: the arf home page the hrsa data warehouse notes: the arf may not be a survey data set itself, but it's particularly useful to merge onto other survey data. confidential to sas, spss, stata, and sudaan users: time to put down the abacus. time to transition to r. :D
🖼️ Famous Paintings
kaggle.com
zip
Updated Oct 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
mexwell (2023). 🖼️ Famous Paintings [Dataset]. https://www.kaggle.com/datasets/mexwell/famous-paintings
Explore at:
zip(6681482 bytes)Available download formats
Dataset updated
Oct 5, 2023
Authors
mexwell
Description
Famous paintings and their artists. This data set is published to help students have interesting data to practice SQL

Original Data

Acknowlegement

Foto von Steve Johnson auf Unsplash
Empirical overall power of the CTOT, MC, and CO methods with analysis on...
plos.figshare.com
xls
Updated Jun 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wei Zhuang; Luísa Camacho; Camila S. Silva; Michael Thomson; Kevin Snyder (2023). Empirical overall power of the CTOT, MC, and CO methods with analysis on benchmark data. [Dataset]. http://doi.org/10.1371/journal.pone.0263070.t006
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0263070.t006
Dataset updated
Jun 7, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Wei Zhuang; Luísa Camacho; Camila S. Silva; Michael Thomson; Kevin Snyder
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
BFD stands for the benchmark with full data analyzed with the current standard method, which includes t-tests for two-group comparisons. CTOT stands for the cycle-to-threshold method, while CO denotes the complete-observation method and MC denotes the method that sets uncertain observations equal to the assay-specific maximum cycle threshold C1.
f
Data from: Implications of Peak Selection in the Interpretation of...
acs.figshare.com
xlsx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Teresa Murta; Rory T. Steven; Chelsea J. Nikula; Spencer A. Thomas; Lucas B. Zeiger; Alex Dexter; Efstathios A. Elia; Bin Yan; Andrew D. Campbell; Richard J. A. Goodwin; Zoltan Takáts; Owen J. Sansom; Josephine Bunch (2023). Implications of Peak Selection in the Interpretation of Unsupervised Mass Spectrometry Imaging Data Analyses [Dataset]. http://doi.org/10.1021/acs.analchem.0c04179.s002
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.analchem.0c04179.s002
Dataset updated
Jun 1, 2023
Dataset provided by
ACS Publications
Authors
Teresa Murta; Rory T. Steven; Chelsea J. Nikula; Spencer A. Thomas; Lucas B. Zeiger; Alex Dexter; Efstathios A. Elia; Bin Yan; Andrew D. Campbell; Richard J. A. Goodwin; Zoltan Takáts; Owen J. Sansom; Josephine Bunch
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Mass spectrometry imaging can produce large amounts of complex spectral and spatial data. Such data sets are often analyzed with unsupervised machine learning approaches, which aim at reducing their complexity and facilitating their interpretation. However, choices made during data processing can impact the overall interpretation of these analyses. This work investigates the impact of the choices made at the peak selection step, which often occurs early in the data processing pipeline. The discussion is done in terms of visualization and interpretation of the results of two commonly used unsupervised approaches: t-distributed stochastic neighbor embedding and k-means clustering, which differ in nature and complexity. Criteria considered for peak selection include those based on hypotheses (exemplified herein in the analysis of metabolic alterations in genetically engineered mouse models of human colorectal cancer), particular molecular classes, and ion intensity. The results suggest that the choices made at the peak selection step have a significant impact in the visual interpretation of the results of either dimensionality reduction or clustering techniques and consequently in any downstream analysis that relies on these. Of particular significance, the results of this work show that while using the most abundant ions can result in interesting structure-related segmentation patterns that correlate well with histological features, using a smaller number of ions specifically selected based on prior knowledge about the biochemistry of the tissues under investigation can result in an easier-to-interpret, potentially more valuable, hypothesis-confirming result. Findings presented will help researchers understand and better utilize unsupervised machine learning approaches to mine high-dimensionality data.
MCAM: Multiple Clustering Analysis Methodology for Deriving Hypotheses and...
plos.figshare.com
pdf
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kristen M. Naegle; Roy E. Welsch; Michael B. Yaffe; Forest M. White; Douglas A. Lauffenburger (2023). MCAM: Multiple Clustering Analysis Methodology for Deriving Hypotheses and Insights from High-Throughput Proteomic Datasets [Dataset]. http://doi.org/10.1371/journal.pcbi.1002119
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pcbi.1002119
Dataset updated
Jun 1, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Kristen M. Naegle; Roy E. Welsch; Michael B. Yaffe; Forest M. White; Douglas A. Lauffenburger
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advances in proteomic technologies continue to substantially accelerate capability for generating experimental data on protein levels, states, and activities in biological samples. For example, studies on receptor tyrosine kinase signaling networks can now capture the phosphorylation state of hundreds to thousands of proteins across multiple conditions. However, little is known about the function of many of these protein modifications, or the enzymes responsible for modifying them. To address this challenge, we have developed an approach that enhances the power of clustering techniques to infer functional and regulatory meaning of protein states in cell signaling networks. We have created a new computational framework for applying clustering to biological data in order to overcome the typical dependence on specific a priori assumptions and expert knowledge concerning the technical aspects of clustering. Multiple clustering analysis methodology (‘MCAM’) employs an array of diverse data transformations, distance metrics, set sizes, and clustering algorithms, in a combinatorial fashion, to create a suite of clustering sets. These sets are then evaluated based on their ability to produce biological insights through statistical enrichment of metadata relating to knowledge concerning protein functions, kinase substrates, and sequence motifs. We applied MCAM to a set of dynamic phosphorylation measurements of the ERRB network to explore the relationships between algorithmic parameters and the biological meaning that could be inferred and report on interesting biological predictions. Further, we applied MCAM to multiple phosphoproteomic datasets for the ERBB network, which allowed us to compare independent and incomplete overlapping measurements of phosphorylation sites in the network. We report specific and global differences of the ERBB network stimulated with different ligands and with changes in HER2 expression. Overall, we offer MCAM as a broadly-applicable approach for analysis of proteomic data which may help increase the current understanding of molecular networks in a variety of biological problems.
Human Resources Data Set
kaggle.com
zip
Updated Oct 19, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dr. Rich (2020). Human Resources Data Set [Dataset]. https://www.kaggle.com/datasets/rhuebner/human-resources-data-set/discussion
Explore at:
zip(17041 bytes)Available download formats
Dataset updated
Oct 19, 2020
Authors
Dr. Rich
Description
Updated 30 January 2023

Version 14 of Dataset

License Update:

There has been some confusion around licensing for this data set. Dr. Carla Patalano and Dr. Rich Huebner are the original authors of this dataset.

We provide a license to anyone who wishes to use this dataset for learning or teaching. For the purposes of sharing, please follow this license:

CC-BY-NC-ND This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Codebook

https://rpubs.com/rhuebner/hrd_cb_v14

PLEASE NOTE -- I recently updated the codebook - please use the above link. A few minor discrepancies were identified between the codebook and the dataset. Please feel free to contact me through LinkedIn (www.linkedin.com/in/RichHuebner) to report discrepancies and make requests.

Context

HR data can be hard to come by, and HR professionals generally lag behind with respect to analytics and data visualization competency. Thus, Dr. Carla Patalano and I set out to create our own HR-related dataset, which is used in one of our graduate MSHRM courses called HR Metrics and Analytics, at New England College of Business. We created this data set ourselves. We use the data set to teach HR students how to use and analyze the data in Tableau Desktop - a data visualization tool that's easy to learn.

This version provides a variety of features that are useful for both data visualization AND creating machine learning / predictive analytics models. We are working on expanding the data set even further by generating even more records and a few additional features. We will be keeping this as one file/one data set for now. There is a possibility of creating a second file perhaps down the road where you can join the files together to practice SQL/joins, etc.

Note that this dataset isn't perfect. By design, there are some issues that are present. It is primarily designed as a teaching data set - to teach human resources professionals how to work with data and analytics.

Content

We have reduced the complexity of the dataset down to a single data file (v14). The CSV revolves around a fictitious company and the core data set contains names, DOBs, age, gender, marital status, date of hire, reasons for termination, department, whether they are active or terminated, position title, pay rate, manager name, and performance score.

Recent additions to the data include: - Absences - Most Recent Performance Review Date - Employee Engagement Score

Acknowledgements

Dr. Carla Patalano provided the baseline idea for creating this synthetic data set, which has been used now by over 200 Human Resource Management students at the college. Students in the course learn data visualization techniques with Tableau Desktop and use this data set to complete a series of assignments.

Inspiration

We've included some open-ended questions that you can explore and try to address through creating Tableau visualizations, or R or Python analyses. Good luck and enjoy the learning!

Is there any relationship between who a person works for and their performance score?

What is the overall diversity profile of the organization?

What are our best recruiting sources if we want to ensure a diverse organization?

Can we predict who is going to terminate and who isn't? What level of accuracy can we achieve on this?

Are there areas of the company where pay is not equitable?

There are so many other interesting questions that could be addressed through this interesting data set. Dr. Patalano and I look forward to seeing what we can come up with.

If you have any questions or comments about the dataset, please do not hesitate to reach out to me on LinkedIn: http://www.linkedin.com/in/RichHuebner

You can also reach me via email at: Richard.Huebner@go.cambridgecollege.edu
Association of Protein Translation and Extracellular Matrix Gene Sets with...
plos.figshare.com
tiff
Updated Jun 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nilotpal Chowdhury; Shantanu Sapru (2023). Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach [Dataset]. http://doi.org/10.1371/journal.pone.0129610
Explore at:
tiffAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0129610
Dataset updated
Jun 2, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Nilotpal Chowdhury; Shantanu Sapru
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionMicroarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis.AimThe aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets.MethodsFour microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate – adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA).ResultsGene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed.ConclusionTo the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. This methodology seems to yield new and interesting results and may be used as a tool to guide new research.
Data from: InterHub: A Naturalistic Trajectory Dataset with Dense...
figshare.com
csv
Updated May 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xiyan Jiang; Xiaocong Zhao; Yiru Liu; Zirui Li; Peng Hang; Lu Xiong; Jian Sun (2025). InterHub: A Naturalistic Trajectory Dataset with Dense Interaction for Autonomous Driving [Dataset]. http://doi.org/10.6084/m9.figshare.27899754.v6
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27899754.v6
Dataset updated
May 24, 2025
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Xiyan Jiang; Xiaocong Zhao; Yiru Liu; Zirui Li; Peng Hang; Lu Xiong; Jian Sun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We provide a dense interaction dataset, InterHub, derived from extensive naturalistic driving records to address the scarcity of real-world datasets capturing rich interaction events.The dataset provided on this page include:A CSV file (Interactive_Segments_Index.csv) containing the indexed list of the extracted interaction events. In addition to indexing and tracing information about interaction scenarios, we also provide some interesting labels to facilitate more targeted retrieval and utilization of interaction scenarios.(For detailed information, please refer to https://github.com/zxc-tju/InterHub.)Relevant unified data cache files (InterHub_cache_files.zip that includes cache files of lyft_train_full, nuplan_train).The Python codes used to process and analyze the dataset can be found at https://github.com/zxc-tju/InterHub. The tools for navigating InterHub involve the following three parts:0_data_unify.py converts various data resources into a unified format for seamless interaction event extraction.1_interaction_extract.py extracts interactive segments from unified driving records.2_case_visualize.py showcases interaction scenarios in InterHub.You can refer to the data structure of cache files presented in dataset.md, and after extracting the InterHub_cache_files.zip file, put it in the corresponding folder.Statement: All third-party data redistributions included in the interhub_cache_files.zip repository are carried out in full compliance with the original licensing terms of the respective source datasets, as required by their mandatory licensing conditions. This portion of the data remains subject to its original licenses, and users of the data are required to comply with these original licensing terms in any subsequent use or redistribution.
h
Data from: imdb
huggingface.co
Updated Aug 3, 2003
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 3, 2003
Dataset authored and provided by
Stanford NLP
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
Dataset Card for "imdb"

Dataset Summary

Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
MIND: Multilingual Imaging Neuro Dataset
openneuro.org
Updated Aug 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xuanyi Jessica Chen; Maxwell Salvadore; Esti Blanco-Elorrieta (2025). MIND: Multilingual Imaging Neuro Dataset [Dataset]. http://doi.org/10.18112/openneuro.ds006391.v2.0.0
Explore at:
Unique identifier
https://doi.org/10.18112/openneuro.ds006391.v2.0.0
Dataset updated
Aug 6, 2025
Dataset provided by
OpenNeurohttps://openneuro.org/
Authors
Xuanyi Jessica Chen; Maxwell Salvadore; Esti Blanco-Elorrieta
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
MIND: Multilingual Imaging Neuro Dataset

This repository contains structural and functional MRI data of 126 monolingual and bilingual participants with varying language backgrounds and proficiencies.

This README is organized into two sections:

Usage describes how one can go about recreating data derivatives and brain measures from start to finish.

Directories gives information on the file structure of the dataset.

If you just want access to the processed brain and language data, go to Quick Start.

Usage

There are two ways that one can go about this dataset. If you want to jump immediately into analyzing participants and their language profiles, then go to Quick Start. If instead you are looking to go from low-level MRI data to cleaned CSVs with various brain measure types, either to learn the process or double check our work, then go to Data Replication.

Quick Start

If you just want access to cleaned brain measure and language history data of 126 participants, they can be found in the following folders:

Brain Data: processing_output

Language Data: language_background

Each folder has a metadata.xlsx file that gives more information on the files and their fields. Have fun, go nuts.

Data Replication

If you are looking to go through the steps required to create the data from start to finish, we first start with the raw structural and functional MRI data, which can be found in ./sub-EBE{XXXX}. Information on the data in this folder, which follows BIDS, can be found here.

The data in ./sub-EBE{XXXX} is then inputted into various processing pipelines, the versions for which can be found at Dependency versions. The following processing pipelines are used:

fMRIprep

fMRIprep is a neuroimaging procesing tool used for task-based and resting-state fMRI data. fMRIprep is not used directly to create brain measure CSVs used in analysis, but instead creates processed fMRI data used in the CONN toolbox. For more information on fMRIprep and how to use it, click here.

CAT12

We use the CAT12 toolbox, which stands for Computational Anatomy Toolbox, to calculate brain region volumes using voxel-based morphometry (VBM). CAT12 works through SPM12 and Matlab, and requires that both be installed. We have included the Matlab scripts used to create the files in ./derivatives/CAT12 in preprocessing_scripts/cat12. To use it, install necessary dependencies (CAT12, SPM12, and Matlab) and run preprocessing_scripts/cat12/CAT12_segmentation_n2.m in Matlab. You will also need to update for your local path to Matlab on lines 12, 24, and 41. For more information on CAT12 and how to use it to calculate brain region volumes using VBM, click here.

CONN

CONN is a functional connectivity toolbox, which we used to generate participant brain connectivity measures. CONN requires first that you run the fMRIprep pipeline, as it uses some of fMRIprep's outputs as input. Like CAT12, CONN works through SPM12 and Matlab and requires that both be installed. For more information on CONN and how to use it, click here.

FDT

We used FMRIB's Diffusion Toolbox (FDT) for extracting values from diffusion weighted images. For more information on FDT and how to use it, click here.

Freesurfer

FreeSurfer is a software package for the analysis and visualization of structural and functional neuroimaging data, which we use to extract region volumes and cortical thickness through surface-based morphometry (SBM). For more information on Freesurfer and how to use it, click here.

The results from these pipelines, which use the data in ./sub-EBE{XXXX} as input, are then outputted into folders in ./derivatives. For information on which folder stores each pipeline result, see Directories.

After running these pipelines, we can take their outputs and convert them into CSVs for analysis. To do this, we use preprocessing_scripts/brain_data_preprocessing.ipynb. This Python notebook takes the data in ./derivatives as input and outputs CSVs to processing_output. Outputted from this notebook are CSVs containing brain volumes, cortical thicknesses, fractional anisotropy values, and connectivity measures. Information on the outputted CSVs can be found at processing_output/metadata.xlsx.

Dependency versions

MATLAB v. R2023a

SPM12

CAT12 v8.2

CONN v22a

FSL v6.0.2

Freesurfer v7.4.1

fMRIprep v23.0.2

Chen, Salvadore, & Blanco-Elorrieta Paper Replication

Also included in this dataset is code used in the analyses of Chen, Salvaore, & Blanco-Elorrieta (submitted). If you are interested in running analyses used in that paper, see the README in chen_salvadore_elorrieta/code.

Directories

participants.tsv: Subject demographic information.

participants.json: Describes participants.tsv.

sub-EBE

Each of these directories contain the BIDS formatted anatomical and functional MRI data, with the name of the directory corresponding to the subject's unique identifier. For more information on the subfolders, see BIDS information here.

derivatives

This directory contains outputs of common processing pipelines run on the raw MRI data from ./sub-EBE{XXXX}.

CAT12

Results of the CAT12 toolbox, which stands for Computational Anatomy Toolbox, and is used to calculate brain region volumes using voxel-based morphometry (VBM).

conn

Results of the CONN toolbox, used to generate data on functional connectivity from brain fMRI sequences.

fdt

Results of the FMRIB's Diffusion Toolbox (FDT), used for extracting values from diffusion weighted images.

fMRIprep

Results from fMRIprep, a preprocessing pipeline for task-based and resting-state functional MRI data.

freesurfer

Results from FreeSurfer, a software package for the analysis and visualization of structural and functional neuroimaging data.

language_background

Participant information is kept on the first level of the dataset and includes information on language history, demographics, and their composite multilingualism score. Below is a list of all participant information files.

language_background.csv: Full subject language information and history.

metadata.xlsx: Metadata on each file in this directory.

multilingual_measure.csv: Each participant’s composite multilingualism score specified in Chen & Blanco-Elorrieta (in review).

processing_output

This directory contains processed brain measure data for brain volumes, cortical thickness, FA, and connectivity. The CSVs are created from scripts in the directory processing_scripts using files in the derivatives directory as input. Descriptions of each file can be found below.

connectivity_network.csv: Contains 36 Network-to-Network connectivity values for each participant.

connectivity_roi.csv: Contains 13,336 ROI-to-ROI connectivity values for each participant.

dti.csv: Contains averaged white matter FA values for 76 brain regions for each participant based on Diffusion tensor imaging.

metadata.xlsx: Metadata on each file in this directory.

sbm_thickness.csv: Contains cortical thickness values for 68 brain regions for each participant based on Surface-based morphometry.

sbm_volume.csv: Contains volume values for 165 brain regions for each participant based on Surface-based morphometry.

tiv.csv: Contains two total intracranial volumes for each subject, calculated using SBM and VBM respectively

vbm_volume.csv: Contains volume values for 153 brain regions for each participant based on Voxel-based morphometry. `

preprocessing_scripts

Code involved in processing raw MRI data.

brain_data_preprocessing.ipynb Python notebook used to create CSVs with brain measure values used in analyses. For more information on the code and how to use it, read Data Replication.

### raw_mri_preprocessing Scripts used to create files some files in ./derviatives folder from raw MRI data in ./sub-EBE{XXXX}. For more information on the scripts, read Data Replication.

### toolbox_outputs Intermediary files created and used by analysis/processing_scripts/brain_data_preprocessing.ipynb.

##
u
Data from: Plant Expression Database
agdatacommons.nal.usda.gov
datasetcatalog.nlm.nih.gov
+2more
bin
Updated Feb 9, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sudhansu S. Dash; John Van Hemert; Lu Hong; Roger P. Wise; Julie A. Dickerson (2024). Plant Expression Database [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Plant_Expression_Database/24661179
Explore at:
binAvailable download formats
Dataset updated
Feb 9, 2024
Dataset provided by
PLEXdb
Authors
Sudhansu S. Dash; John Van Hemert; Lu Hong; Roger P. Wise; Julie A. Dickerson
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
[NOTE: PLEXdb is no longer available online. Oct 2019.] PLEXdb (Plant Expression Database) is a unified gene expression resource for plants and plant pathogens. PLEXdb is a genotype to phenotype, hypothesis building information warehouse, leveraging highly parallel expression data with seamless portals to related genetic, physical, and pathway data. PLEXdb (http://www.plexdb.org), in partnership with community databases, supports comparisons of gene expression across multiple plant and pathogen species, promoting individuals and/or consortia to upload genome-scale data sets to contrast them to previously archived data. These analyses facilitate the interpretation of structure, function and regulation of genes in economically important plants. A list of Gene Atlas experiments highlights data sets that give responses across different developmental stages, conditions and tissues. Tools at PLEXdb allow users to perform complex analyses quickly and easily. The Model Genome Interrogator (MGI) tool supports mapping gene lists onto corresponding genes from model plant organisms, including rice and Arabidopsis. MGI predicts homologies, displays gene structures and supporting information for annotated genes and full-length cDNAs. The gene list-processing wizard guides users through PLEXdb functions for creating, analyzing, annotating and managing gene lists. Users can upload their own lists or create them from the output of PLEXdb tools, and then apply diverse higher level analyses, such as ANOVA and clustering. PLEXdb also provides methods for users to track how gene expression changes across many different experiments using the Gene OscilloScope. This tool can identify interesting expression patterns, such as up-regulation under diverse conditions or checking any gene’s suitability as a steady-state control. Resources in this dataset:Resource Title: Website Pointer for Plant Expression Database, Iowa State University. File Name: Web Page, url: https://www.bcb.iastate.edu/plant-expression-database [NOTE: PLEXdb is no longer available online. Oct 2019.] Project description for the Plant Expression Database (PLEXdb) and integrated tools.

Facebook

Twitter

Click to copy link

Link copied

Cite

Saket Kumar (2024). Top 2500 Kaggle Datasets [Dataset]. http://doi.org/10.34740/kaggle/dsv/7637365

Top 2500 Kaggle Datasets

Explore, Analyze, Innovate: The Best of Kaggle's Data at Your Fingertips

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.34740/kaggle/dsv/7637365

Dataset updated

Feb 16, 2024

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Saket Kumar

License

http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

Description

This dataset compiles the top 2500 datasets from Kaggle, encompassing a diverse range of topics and contributors. It provides insights into dataset creation, usability, popularity, and more, offering valuable information for researchers, analysts, and data enthusiasts.

Research Analysis: Researchers can utilize this dataset to analyze trends in dataset creation, popularity, and usability scores across various categories.

Contributor Insights: Kaggle contributors can explore the dataset to gain insights into factors influencing the success and engagement of their datasets, aiding in optimizing future submissions.

Machine Learning Training: Data scientists and machine learning enthusiasts can use this dataset to train models for predicting dataset popularity or usability based on features such as creator, category, and file types.

Market Analysis: Analysts can leverage the dataset to conduct market analysis, identifying emerging trends and popular topics within the data science community on Kaggle.

Educational Purposes: Educators and students can use this dataset to teach and learn about data analysis, visualization, and interpretation within the context of real-world datasets and community-driven platforms like Kaggle.

Column Definitions:

Dataset Name: Name of the dataset. Created By: Creator(s) of the dataset. Last Updated in number of days: Time elapsed since last update. Usability Score: Score indicating the ease of use. Number of File: Quantity of files included. Type of file: Format of files (e.g., CSV, JSON). Size: Size of the dataset. Total Votes: Number of votes received. Category: Categorization of the dataset's subject matter.

Clear search

Close search

Google apps

Main menu

Top 2500 Kaggle Datasets

Political Analysis Using R: Example Code and Data, Plus Data for Practice...

Powerful Data for Power BI

All Seaborn Built-in Datasets 📊✨

Advance, IN Annual Population and Growth Analysis Dataset: A Comprehensive...

About this dataset

Content

Inspiration

Recommended for further research

Optimization and Evaluation Datasets for PiMine

Data from: Clinical Dataset

30 years of synoptic observations from Neumayer Station with links to...

Disentangling Multidimensional Spatio-Temporal Data into Their Common and...

Area Resource File (ARF)

🖼️ Famous Paintings

Acknowlegement

Empirical overall power of the CTOT, MC, and CO methods with analysis on...

Data from: Implications of Peak Selection in the Interpretation of...

MCAM: Multiple Clustering Analysis Methodology for Deriving Hypotheses and...

Human Resources Data Set

Version 14 of Dataset

License Update:

Codebook

Context

Content

Acknowledgements

Inspiration

Association of Protein Translation and Extracellular Matrix Gene Sets with...

Data from: InterHub: A Naturalistic Trajectory Dataset with Dense...

Data from: imdb

MIND: Multilingual Imaging Neuro Dataset

MIND: Multilingual Imaging Neuro Dataset

Usage

Quick Start

Data Replication

fMRIprep

CAT12

CONN

FDT

Freesurfer

Dependency versions

Chen, Salvadore, & Blanco-Elorrieta Paper Replication

Directories

sub-EBE

derivatives

CAT12

conn

fdt

fMRIprep

freesurfer

language_background

processing_output

preprocessing_scripts

Data from: Plant Expression Database

Top 2500 Kaggle Datasets

Explore, Analyze, Innovate: The Best of Kaggle's Data at Your Fingertips