100+ datasets found

o
Data Manipulation on Heart Disease Dataset Using Pandas Library.
explore.openaire.eu
Updated Jul 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alaa Saif; Janat Alkhuld M. (2023). Data Manipulation on Heart Disease Dataset Using Pandas Library. [Dataset]. http://doi.org/10.5281/zenodo.8113014
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.8113014
Dataset updated
Jul 4, 2023
Authors
Alaa Saif; Janat Alkhuld M.
Description
With the constant development our world is facing, new diseases and dangers are marked down in human history as "Modern Day Diseases". In the developing world, the risk of heart diseas and related cardiovascular diseases are on the rise. This dataset aquired contains a dataset that is considered a stepping stone in the work to be done ahead in order to prevent the development or the occurance of a heart attack or stroke.
Sample data files for Python Course
figshare.com
txt
Updated Nov 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Peter Verhaar (2022). Sample data files for Python Course [Dataset]. http://doi.org/10.6084/m9.figshare.21501549.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21501549.v1
Dataset updated
Nov 4, 2022
Dataset provided by
Figsharehttp://figshare.com/
Authors
Peter Verhaar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sample data set used in an introductory course on Programming in Python
Learn Pandas
kaggle.com
Updated Oct 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vaidik Patel (2023). Learn Pandas [Dataset]. https://www.kaggle.com/datasets/js1js2js3js4js5/learn-pandas/discussion?sort=undefined
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 5, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Vaidik Patel
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
It is a dataset with notebook kind of learning. Download the whole package and you will find everything to learn basics to advanced pandas which is exactly what you will need in machine learning and in data science. 😄

This will gives you the overview and data analysis tools in pandas that is mostly required in the data manipulation and extraction important data.

Use this notebook as notes for pandas. whenever you forget the code or syntax open it and scroll through it and you will find the solution. 🥳
Python Codes for Data Analysis of The Impact of COVID-19 on Technical...
figshare.com
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.6084/m9.figshare.20416092.v1
Explore at:
Unique identifier
https://doi.org/10.6084/m9.figshare.20416092.v1
Dataset updated
Aug 1, 2022
Dataset provided by
Figsharehttp://figshare.com/
Authors
Elizabeth Szkirpan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).
Datasets for manuscript "A data engineering framework for chemical flow...
catalog.data.gov
gimi9.com
Updated Nov 7, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-a-data-engineering-framework-for-chemical-flow-analysis-of-industr
Explore at:
Dataset updated
Nov 7, 2021
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
The EPA GitHub repository PAU4ChemAs as described in the README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flows transfers, releases estimation, and identification of potential occupation exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly-available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly-available sources: EPA, NOAA, OSHA, and the institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). Also, all GitHub repositories describe the Python libraries required for running their code, how to use them, the obtained outputs files after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
Keith Galli's Sales Analysis Exercise
kaggle.com
Updated Jan 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zulkhairee Sulaiman (2022). Keith Galli's Sales Analysis Exercise [Dataset]. https://www.kaggle.com/datasets/zulkhaireesulaiman/sales-analysis-2019-excercise/suggestions
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 28, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Zulkhairee Sulaiman
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This is the dataset required for Keith Galli's 'Solving real world data science tasks with Python Pandas!' video. Where he analyzes and answers business questions for 12 months worth of business data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc.

I decided to upload the data here so that I can carry out the exercise straight on Kaggle Notebooks. Making it ready for viewing as a portfolio project.

Content

12 .csv files containing sales data for each month of 2019.

Acknowledgements

Of course, all thanks goes to Keith Galli and the great work he does with his tutorials. He has several other amazing tutorials that you can follow and subscribe at his channel.
Storage and Transit Time Data and Code
zenodo.org
zip
Updated Oct 29, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14009758
Dataset updated
Oct 29, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrew Felton; Andrew Felton
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Author: Andrew J. Felton
Date: 10/29/2024

This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis, and figure production for the study entitled:

"Global estimates of the storage and transit time of water through vegetation"

Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has updated.

Data information:

The data folder contains key data sets used for analysis. In particular:

"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

#Code information

Python scripts can be found in the "supporting_code" folder.

Each R script in this project has a role:

"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.

"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.

"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.

"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
o
Introduction to Machine Learning using Python: Classification
explore.openaire.eu
Updated Jan 1, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Khuong Tran; Dr Ghulam Murtaza; Dr Anastasios Papaioannou (2020). Introduction to Machine Learning using Python: Classification [Dataset]. http://doi.org/10.5281/zenodo.6423725
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6423725
Dataset updated
Jan 1, 2020
Authors
Khuong Tran; Dr Ghulam Murtaza; Dr Anastasios Papaioannou
Description
About this course Machine Learning (ML) is a new way to program computers to solve real world problems. It has gained popularity over the last few years by achieving tremendous success in tasks that we believed only humans could solve, from recognising images to self-driving cars. In this course, we will explore the fundamentals of Machine Learning from a practical perspective with the help of the Python programming language and its scientific computing libraries. Learning Outcomes Comprehensive introduction to Machine Learning models and techniques such as Logistic Regression, Decision Trees and Ensemble Learning. Know the differences between various core Machine Learning models. Understand the Machine Learning modelling workflows. Use Python and scikit-learn to process real datasets, train and apply Machine Learning models Prerequisites Either Learn to Program: Python, Data Manipulation in Python and Introduction to ML using Python: Introduction & Linear Regression or Learn to Program: Python, Data Manipulation and Visualisation in Python and Introduction to ML using Python: Introduction & Linear Regression needed to attend this course. If you already have experience with programming, please check the topics covered in the Learn to Program: Python, Data Manipulation in Python, Data Manipulation and Visualisation in Python and Introduction to ML using Python: Introduction & Linear Regression courses to ensure that you are familiar with the knowledge needed for this course, such as good understanding of Python syntax, basic programming concepts and familiarity with Pandas, Numpy and Seaborn libraries, and basic understanding of Machine Learning and Model Training. Maths knowledge is not required. There are only a few Math formula that you are going to see in this course, however references to Mathematics required for learning about Machine Learning will be provided. Having an understanding of the Mathematics behind each Machine Learning algorithms is going to make you appreciate the behaviour of the model and know its pros/cons when using them. Why do this course? Useful for anyone who wants to learn about Machine Learning but are overwhelmed with the tremendous amount of resources. It does not go in depth into mathematical concepts and formula, however formal intuitions and references are provided to guide the participants for further learning. We do have applications on real datasets! Machine Learning models are introduced in this course together with important feature engineering techniques that are guaranteed to be useful in your own projects. Give you enough background to kickstart your own Machine Learning journey, or transition yourself into Deep Learning. For a better and more complete understanding of the most popular Machine Learning models and techniques please consider attending all three Introduction to Machine Learning using Python workshops: Introduction to Machine Learning using Python: Introduction & Linear Regression Introduction to Machine Learning using Python: Classification Introduction to Machine Learning using Python: SVM & Unsupervised Learning Licence Copyright © 2021 Intersect Australia Ltd. All rights reserved.
Explore data formats and ingestion methods
kaggle.com
Updated Feb 12, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabriel Preda (2021). Explore data formats and ingestion methods [Dataset]. https://www.kaggle.com/datasets/gpreda/iris-dataset/discussion?sort=undefined
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 12, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Gabriel Preda
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Why this Dataset

This dataset brings to you Iris Dataset in several data formats (see more details in the next sections).

You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared Python Jupyter Notebook and R Markdown report that input all these formats:

Test Data Formats in Python

Test Data Formats in R

Iris Dataset

Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.

Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris

Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/

The file downloaded is iris.data and is formatted as a comma delimited file.

This small data collection was created to help you test your skills with ingesting various data formats.

Content

This file was processed to convert the data in the following formats: * csv - comma separated values format * tsv - tab separated values format * parquet - parquet format
* feather - feather format * parquet.gzip - compressed parquet format * h5 - hdf5 format * pickle - Python binary object file - pickle format * xslx - Excel format
* npy - Numpy (Python library) binary format * npz - Numpy (Python library) binary compressed format * rds - Rds (R specific data format) binary format

Acknowledgements

I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.

Inspiration

Use these data formats to test your skills in ingesting data in various formats.
d
(HS 2) Automate Workflows using Jupyter notebook to create Large Extent...
search.dataone.org
hydroshare.org
Updated Oct 19, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Young-Don Choi (2024). (HS 2) Automate Workflows using Jupyter notebook to create Large Extent Spatial Datasets [Dataset]. http://doi.org/10.4211/hs.a52df87347ef47c388d9633925cde9ad
Explore at:
Unique identifier
https://doi.org/10.4211/hs.a52df87347ef47c388d9633925cde9ad
Dataset updated
Oct 19, 2024
Dataset provided by
Hydroshare
Authors
Young-Don Choi
Description
We implemented automated workflows using Jupyter notebooks for each state. The GIS processing, crucial for merging, extracting, and projecting GeoTIFF data, was performed using ArcPy—a Python package for geographic data analysis, conversion, and management within ArcGIS (Toms, 2015). After generating state-scale LES (large extent spatial) datasets in GeoTIFF format, we utilized the xarray and rioxarray Python packages to convert GeoTIFF to NetCDF. Xarray is a Python package to work with multi-dimensional arrays and rioxarray is rasterio xarray extension. Rasterio is a Python library to read and write GeoTIFF and other raster formats. Xarray facilitated data manipulation and metadata addition in the NetCDF file, while rioxarray was used to save GeoTIFF as NetCDF. These procedures resulted in the creation of three HydroShare resources (HS 3, HS 4 and HS 5) for sharing state-scale LES datasets. Notably, due to licensing constraints with ArcGIS Pro, a commercial GIS software, the Jupyter notebook development was undertaken on a Windows OS.
Multi-Dimensional Data Viewer (MDV) user manual for data exploration:...
zenodo.org
pdf, zip
Updated Jul 12, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maria Kiourlappou; Maria Kiourlappou; Martin Sergeant; Martin Sergeant; Joshua S. Titlow; Joshua S. Titlow; Jeffrey Y. Lee; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Stephen Taylor; Ilan Davis; Ilan Davis; Darragh Ennis (2024). Multi-Dimensional Data Viewer (MDV) user manual for data exploration: "Systematic analysis of YFP traps reveals common discordance between mRNA and protein across the nervous system" [Dataset]. http://doi.org/10.5281/zenodo.7875495
Explore at:
zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7875495
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Maria Kiourlappou; Maria Kiourlappou; Martin Sergeant; Martin Sergeant; Joshua S. Titlow; Joshua S. Titlow; Jeffrey Y. Lee; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Stephen Taylor; Ilan Davis; Ilan Davis; Darragh Ennis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

Please also see the latest version of the repository:
https://doi.org/10.5281/zenodo.6374011 and
our website: https://ilandavis.com/jcb2023-yfp

The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time consuming and not accessible to all. Here, we provide a “user manual” to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV) -link, a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published systematic analysis of 200 YFP traps reveals common discordance between mRNA and protein across the nervous system (eprint link). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.
u
Data from: dblp XML dataset as CSV for Python Data Analysis Library
observatorio-cientifico.ua.es
Updated 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carrasco, Rafael C.; Candela, Gustavo; Carrasco, Rafael C.; Candela, Gustavo (2021). dblp XML dataset as CSV for Python Data Analysis Library [Dataset]. https://observatorio-cientifico.ua.es/documentos/668fc45db9e7c03b01bdb2d0
Explore at:
Dataset updated
2021
Authors
Carrasco, Rafael C.; Candela, Gustavo; Carrasco, Rafael C.; Candela, Gustavo
Description
Based on the dblp XML file, this dataset consists on a CSV file that has been extracted using a python script. The dataset can be easily loaded in a Python Data Analysis Library dataframe.
m
Data from: pyerrors: A python framework for error analysis of Monte Carlo...
data.mendeley.com
Updated May 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fabian Joswig (2023). pyerrors: A python framework for error analysis of Monte Carlo data [Dataset]. http://doi.org/10.17632/7ncw242ymh.1
Explore at:
Unique identifier
https://doi.org/10.17632/7ncw242ymh.1
Dataset updated
May 8, 2023
Authors
Fabian Joswig
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
We present the pyerrors python package for statistical error analysis of Monte Carlo data. Linear error propagation using automatic differentiation in an object oriented framework is combined with the Γ-method for a reliable estimation of autocorrelation times. Data from different sources can easily be combined, keeping the information on the origin of error components intact throughout the analysis. pyerrors can be smoothly integrated into the existing scientific python ecosystem which allows for efficient and compact analyses.
H
Using Python Packages and HydroShare to Advance Open Data Science and...
beta.hydroshare.org
hydroshare.org
zip
Updated Sep 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeffery S. Horsburgh; Amber Spackman Jones; Anthony M. Castronova; Scott Black (2023). Using Python Packages and HydroShare to Advance Open Data Science and Analytics for Water [Dataset]. https://beta.hydroshare.org/resource/4f4acbab5a8c4c55aa06c52a62a1d1fb/
Explore at:
zip(31.0 MB)Available download formats
Dataset updated
Sep 28, 2023
Dataset provided by
HydroShare
Authors
Jeffery S. Horsburgh; Amber Spackman Jones; Anthony M. Castronova; Scott Black
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Scientific and management challenges in the water domain require synthesis of diverse data. Many data analysis tasks are difficult because datasets are large and complex; standard data formats are not always agreed upon or mapped to efficient structures for analysis; scientists may lack training for tackling large and complex datasets; and it can be difficult to share, collaborate around, and reproduce scientific work. Overcoming barriers to accessing, organizing, and preparing datasets for analyses can transform the way water scientists work. Building on the HydroShare repository’s cyberinfrastructure, we have advanced two Python packages that make data loading, organization, and curation for analysis easier, reducing time spent in choosing appropriate data structures and writing code to ingest data. These packages enable automated retrieval of data from HydroShare and the USGS’s National Water Information System (NWIS) (i.e., a Python equivalent of USGS’ R dataRetrieval package), loading data into performant structures that integrate with existing visualization, analysis, and data science capabilities available in Python, and writing analysis results back to HydroShare for sharing and publication. While these Python packages can be installed for use within any Python environment, we will demonstrate how the technical burden for scientists associated with creating a computational environment for executing analyses can be reduced and how sharing and reproducibility of analyses can be enhanced through the use of these packages within CUAHSI’s HydroShare-linked JupyterHub server.

This HydroShare resource includes all of the materials presented in a workshop at the 2023 CUAHSI Biennial Colloquium.
Benchmark data set for MSPypeline, a python package for streamlined mass...
data.niaid.nih.gov
xml
Updated Jul 22, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander Held; Ursula Klingmüller (2021). Benchmark data set for MSPypeline, a python package for streamlined mass spectrometry-based proteomics data analysis [Dataset]. https://data.niaid.nih.gov/resources?id=pxd025792
Explore at:
xmlAvailable download formats
Dataset updated
Jul 22, 2021
Dataset provided by
Division Systems Biology of Signal Transduction, German Cancer Research Center (DKFZ), Heidelberg, 69120, Germany
DKFZ Heidelberg
Authors
Alexander Held; Ursula Klingmüller
Variables measured
Proteomics
Description
Mass spectrometry-based proteomics is increasingly employed in biology and medicine. To generate reliable information from large data sets and ensure comparability of results it is crucial to implement and standardize the quality control of the raw data, the data processing steps and the statistical analyses. The MSPypeline provides a platform for the import of MaxQuant output tables, the generation of quality control reports, the preprocessing of data including normalization and exploratory analyses by statistical inference plots. These standardized steps assess data quality, provide customizable figures and enable the identification of differentially expressed proteins to reach biologically relevant conclusions.
d
cmomy: A python package to calculate and manipulate Central (co)moments.
datasets.ai
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
+1more
0, 33
Updated Aug 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2024). cmomy: A python package to calculate and manipulate Central (co)moments. [Dataset]. https://datasets.ai/datasets/cmomy-a-python-package-to-calculate-and-manipulate-central-comoments-dcd00
Explore at:
0, 33Available download formats
Dataset updated
Aug 27, 2024
Dataset authored and provided by
National Institute of Standards and Technology
Description
cmomy is a python package to calculate central moments and co-moments in a numerical stable and direct way. Behind the scenes, cmomy makes use of Numba to rapidly calculate moments. cmomy provides utilities to calculate central moments from individual samples, precomputed central moments, and precomputed raw moments. It also provides routines to perform bootstrap resampling based on raw data, or precomputed moments. cmomy has numpy array and xarray DataArray interfaces.
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Jul 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()" julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Z
MASCDB, a database of images, descriptors and microphysical properties of...
data.niaid.nih.gov
Updated Jul 5, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Grazioli, Jacopo (2023). MASCDB, a database of images, descriptors and microphysical properties of individual snowflakes in free fall [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5578920
Explore at:
Dataset updated
Jul 5, 2023
Dataset provided by
Berne, Alexis
Ghiggi, Gionata
Grazioli, Jacopo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset overview

This dataset provides data and images of snowflakes in free fall collected with a Multi-Angle Snowflake Camera (MASC) The dataset includes, for each recorded snowflakes:

A triplet of gray-scale images corresponding to the three cameras of the MASC

A large quantity of geometrical, textural descriptors and the pre-compiled output of published retrieval algorithms as well as basic environmental information at the location and time of each measurement.

The pre-computed descriptors and retrievals are available either individually for each camera view or, some of them, available as descriptors of the triplet as a whole. A non exhaustive list of precomputed quantities includes for example:

Textural and geometrical descriptors as in Praz et al 2017

Hydrometeor classification, riming degree estimation, melting identification, as in Praz et al 2017

Blowing snow identification, as in Schaer et al 2020

Mass, volume, gyration estimation, as in Leinonen et al 2021

Data format and structure

The dataset is divided into four .parquet file (for scalar descriptors) and a Zarr database (for the images). A detailed description of the data content and of the data records is available here.

Supporting code

A python-based API is available to manipulate, display and organize the data of our dataset. It can be found on GitHub. See also the code documentation on ReadTheDocs.

Download notes

All files available here for download should be stored in the same folder, if the python-based API is used

MASCdb.zarr.zip must be unzipped after download

Field campaigns

A list of campaigns included in the dataset, with a minimal description is given in the following table

Campaign_name Information

Shielded / Not shielded

DFIR = Double Fence Intercomparison Reference

APRES3-2016 & APRES3-2017

Instrument installed in Antarctica in the context of the APRES3 project. See for example Genthon et al, 2018 or Grazioli et al 2017 Not shielded Davos-2015 Instrument installed in the Swiss Alps within the context of SPICE (Solid Precipitation InterComparison Experiment) Shielded (DFIR) Davos-2019 Instrument installed in the Swiss Alps within the context of RACLETS (Role of Aerosols and CLouds Enhanced by Topography on Snow) Not shielded ICEGENESIS-2021 Instrument installed in the Swiss Jura in a MeteoSwiss ground measurement site, within the context of ICE-GENESIS. See for example Billault-Roux et al, 2023 Not shielded ICEPOP-2018 Instrument installed in Korea, in the context of ICEPOP. See for example Gehring et al 2021. Shielded (DFIR) Jura-2019 & Jura-2023 Instrument installed in the Swiss Jura within a MeteoSwiss measurement site Not shielded Norway-2016 Instrument installed in Norway during the High-Latitude Measurement of Snowfall (HiLaMS). See for example Cooper et al, 2022. Not shielded PLATO-2019 Instrument installed in the "Davis" Antarctic base during the PLATO field campaign Not shielded POPE-2020 Instrument installed in the "Princess Elizabeth Antarctica" base during the POPE campaign. See for example Ferrone et al, 2023. Not shielded Remoray-2022 Instrument installed in the French Jura. Not shielded Valais-2016 Instrument installed in the Swiss Alps in a ski resort. Not shielded

Version

1.0 - Two new campaigns ("Jura-2023", "Norway-2016") added. Added references and list of campaigns.

0.3 - a new campaign is added to the dataset ("Remoray-2022")

0.2 - rename of variables. Variable precision (digits) standardized

0.1 - first upload
Avokado gelişim
kaggle.com
Updated May 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ABD�LKAD�R UY�UR (2025). Avokado gelişim [Dataset]. https://www.kaggle.com/datasets/abdlkadruyur/avokado-geliim/discussion?sort=undefined
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 22, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
ABD�LKAD�R UY�UR
Description
This dataset has been created for educational purposes, specifically to help learners practice SQL-like operations using Python’s pandas library. It is ideal for beginners who want to improve their data manipulation, querying, and transformation skills in a notebook environment such as Kaggle.

The dataset simulates a simple personnel and department system. It includes two tables:

personel: Contains employee data such as names, ages, salaries, and department IDs. departman: Contains department IDs and corresponding department names. Throughout this project, key SQL operations have been demonstrated with their pandas equivalents. These include:

Basic commands like SELECT, INSERT, UPDATE, DELETE Table structure operations: ALTER, DROP, TRUNCATE, COPY Filtering and logical expressions: WHERE, AND, OR, IN, IS NULL, BETWEEN, LIKE Aggregations and sorting: COUNT(), ORDER BY, LIMIT, DISTINCT String functions: LOWER, TRIM, REPLACE, SPLIT, LENGTH Joins: INNER JOIN, LEFT JOIN Comparison operators: =, !=, <, > The goal is to provide a hands-on, interactive environment for practicing SQL logic using real Python code. This dataset does not represent real individuals or businesses — it is entirely fictional and meant for training, teaching, and experimentation purposes only.

Facebook

Twitter

Click to copy link

Link copied

Cite

Alaa Saif; Janat Alkhuld M. (2023). Data Manipulation on Heart Disease Dataset Using Pandas Library. [Dataset]. http://doi.org/10.5281/zenodo.8113014

Data Manipulation on Heart Disease Dataset Using Pandas Library.

Explore at:

21 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://doi.org/10.5281/zenodo.8113014

Dataset updated

Jul 4, 2023

Authors

Alaa Saif; Janat Alkhuld M.

Description

With the constant development our world is facing, new diseases and dangers are marked down in human history as "Modern Day Diseases". In the developing world, the risk of heart diseas and related cardiovascular diseases are on the rise. This dataset aquired contains a dataset that is considered a stepping stone in the work to be done ahead in order to prevent the development or the occurance of a heart attack or stroke.

Clear search

Close search

Google apps

Main menu

Data Manipulation on Heart Disease Dataset Using Pandas Library.

Sample data files for Python Course

Learn Pandas

Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

Datasets for manuscript "A data engineering framework for chemical flow...

Keith Galli's Sales Analysis Exercise

Context

Content

Acknowledgements

Storage and Transit Time Data and Code

Introduction to Machine Learning using Python: Classification

Explore data formats and ingestion methods

Why this Dataset

Iris Dataset

Content

Acknowledgements

Inspiration

(HS 2) Automate Workflows using Jupyter notebook to create Large Extent...

Multi-Dimensional Data Viewer (MDV) user manual for data exploration:...

Data from: dblp XML dataset as CSV for Python Data Analysis Library

Data from: pyerrors: A python framework for error analysis of Monte Carlo...

Using Python Packages and HydroShare to Advance Open Data Science and...

Benchmark data set for MSPypeline, a python package for streamlined mass...

cmomy: A python package to calculate and manipulate Central (co)moments.

Reddit r/AskScience Flair Dataset

UCI and OpenML Data Sets for Ordinal Quantification

MASCDB, a database of images, descriptors and microphysical properties of...

Avokado gelişim

Data Manipulation on Heart Disease Dataset Using Pandas Library.