https://creativecommons.org/publicdomain/zero/1.0/
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was created by Mohamed56668999898
Released under Database: Open Database, Contents: © Original Authors
Python is one of the most popular programming languages among data scientists, partly due to its wide range of packages and capabilities. In 2021, NumPy and pandas were the most used Python frameworks for data science, with shares of ** percent and ** percent, respectively.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was constructed to compare the performance of various neural network architectures learning the flow maps of Hamiltonian systems. It was created for the paper: A Generalized Framework of Neural Networks for Hamiltonian Systems.
The dataset consists of trajectory data from three Hamiltonian systems: the single pendulum, the double pendulum, and the 3-body problem. The data was generated using numerical integrators. For the single pendulum, the symplectic Euler method with a step size of 0.01 was used. The double pendulum data was also computed with the symplectic Euler method, but with an adaptive step size. The trajectories of the 3-body problem were calculated with the arbitrarily high-precision code Brutus.
For each Hamiltonian system, there is one file containing the entire trajectory information (*_all_runs.h5.1). In these files, the states along all trajectories are recorded with a step size of 0.01. Each file is composed of several pandas DataFrames: one per trajectory, called "run0", "run1", ..., and one large DataFrame in which all the trajectories are combined, called "all_runs". Additionally, each file contains a pandas Series called "constants", which lists several parameters of the data.
There is also a second file per Hamiltonian system in which the data is prepared as features and labels, ready for training neural networks (*_training.h5.1). Like the first type of file, these contain a Series called "constants". The features and labels are separated into six DataFrames called "features", "labels", "val_features", "val_labels", "test_features" and "test_labels". The data is split into 80% training data, 10% validation data and 10% test data.
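The HDF5 files can be read directly with pandas. A minimal sketch (the file names below are placeholders; the key names are those listed above):

import pandas as pd

# Placeholder file name; substitute the actual *_all_runs.h5.1 file.
path = "single_pendulum_all_runs.h5.1"
constants = pd.read_hdf(path, key="constants")  # Series of data parameters
run0 = pd.read_hdf(path, key="run0")            # one trajectory
all_runs = pd.read_hdf(path, key="all_runs")    # all trajectories combined

# The *_training.h5.1 files expose the train/validation/test split the same way:
features = pd.read_hdf("single_pendulum_training.h5.1", key="features")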
The code used to train various neural network architectures on this data can be found on GitHub at: https://github.com/AELITTEN/GHNN.
Pre-trained neural networks can be found on GitHub at: https://github.com/AELITTEN/NeuralNets_GHNN.
Parameter                   | Single pendulum                | Double pendulum | 3-body problem
Number of trajectories      | 500                            | 2000            | 5000
Final time in all_runs      | T (one period of the pendulum) | 10              | 10
Final time in training data | 0.25*T                         | 5               | 5
Step size in training data  | 0.1                            | 0.1             | 0.5
In 2020, the most popular data science libraries/development environments among AI experts in Poland were MapReduce - used to process large data sets (big data), followed by NumPy, Hive, and pandas.
This dataset was created by Vinay Shaw
It contains the following files:
https://www.marketreportanalytics.com/privacy-policy
The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across various industries. The market, estimated at $1.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $5 billion by 2033.

This expansion is fueled by several key factors. Firstly, the rising adoption of big data analytics and business intelligence initiatives across large enterprises and SMEs is creating significant demand for efficient EDA tools. Secondly, the growing need for faster, more insightful data analysis to support better decision-making is driving the preference for user-friendly graphical EDA tools over traditional non-graphical methods. Furthermore, advancements in artificial intelligence and machine learning are being integrated into EDA tools, enhancing their capabilities and broadening their appeal.

The market segmentation reveals a significant portion held by large enterprises, reflecting their greater resources and data-handling needs. However, the SME segment is rapidly gaining traction, driven by the increasing affordability and accessibility of cloud-based EDA solutions. Geographically, North America currently dominates the market, but regions like Asia-Pacific are exhibiting high growth potential due to increasing digitalization and technological advancements.

Despite this positive outlook, certain restraints remain. The high initial investment cost associated with implementing advanced EDA solutions can be a barrier for some SMEs. Additionally, the need for skilled professionals to effectively utilize these tools can create a challenge for organizations. However, the ongoing development of user-friendly interfaces and the availability of training resources are actively mitigating these limitations. The competitive landscape is characterized by a mix of established players like IBM and emerging innovative companies offering specialized solutions. Continuous innovation in areas like automated data preparation and advanced visualization techniques will further shape the future of the EDA tools market, ensuring its sustained growth trajectory.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit is a social news, content-rating and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million monthly active users. Reddit is organized into subreddits; here we use the r/AskScience subreddit.
The dataset is extracted from the subreddit r/AskScience on Reddit. The data was collected between 2016-01-01 and 2022-05-20. It contains 612,668 datapoints and 25 columns. The dataset contains information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, and some cleaning was done with NumPy and pandas (see the descriptions of the individual columns below).
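As an illustrative sketch of this kind of extraction (the endpoint and parameters below are assumptions based on Pushshift's public API, not the exact collection script, and the service's availability has varied over time):

import requests
import pandas as pd

# Hypothetical Pushshift query for r/AskScience submissions.
resp = requests.get(
    "https://api.pushshift.io/reddit/search/submission",
    params={"subreddit": "askscience", "after": 1451606400, "size": 100},  # epoch for 2016-01-01
)
df = pd.DataFrame(resp.json()["data"])
# Derive the 'year' column from the Unix creation time, as in the dataset.
df["year"] = pd.to_datetime(df["created_utc"], unit="s").dt.year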
The dataset contains the following columns and descriptions:
author - Redditor name.
author_fullname - Redditor full name.
contest_mode - Contest mode (implements obscured scores and randomized sorting).
created_utc - Time the submission was created, represented in Unix time.
domain - Domain of the submission.
edited - Whether or not the post was edited.
full_link - Link to the post on the subreddit.
id - ID of the submission.
is_self - Whether or not the submission is a self post (text-only).
link_flair_css_class - CSS class used to identify the flair.
link_flair_text - Flair on the post, i.e. the link flair's text content.
locked - Whether or not the submission has been locked.
num_comments - The number of comments on the submission.
over_18 - Whether or not the submission has been marked as NSFW.
permalink - A permalink for the submission.
retrieved_on - Time the record was ingested.
score - The number of upvotes for the submission.
description - Description of the submission.
spoiler - Whether or not the submission has been marked as a spoiler.
stickied - Whether or not the submission is stickied.
thumbnail - Thumbnail of the submission.
question - Question asked in the submission.
url - The URL the submission links to, or the permalink if a self post.
year - Year of the submission.
banned - Whether or not the submission was banned by a moderator.
This dataset can be used for flair prediction, NSFW classification, and various text mining/NLP tasks. Exploratory data analysis can also be done to gain insights and observe trends and patterns over the years.
This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however, we put specific emphasis on speed and big-data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.
The wheel file for installing datatable v0.11.0
# Install the datatable wheel shipped with this dataset (Kaggle notebook syntax):
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl > /dev/null

# Read a file with datatable's fast reader, then convert to a pandas DataFrame:
import datatable as dt
data = dt.fread("filename").to_pandas()
https://github.com/h2oai/datatable
https://datatable.readthedocs.io/en/latest/index.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
295,416 RiPP BGCs from BiG-FAM version 1 (https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa812/5917658) to be used for downstream analyses. Data includes:
all antiSMASH5 region GenBank (.gbk) files in one folder
GCF assignment (t=900)
BiG-SLiCE's BGC feature matrix (see https://www.biorxiv.org/content/10.1101/2020.08.17.240838v2; a pandas DataFrame pickled in Python 3.6.7 with pandas 1.0.4; a loading sketch follows this list)
taxonomy information (GTDB)
RiPP subclass information
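A minimal loading sketch for the pickled feature matrix (the file name below is a placeholder; since the pickle was written with pandas 1.0.4 under Python 3.6.7, a compatible pandas version may be required to unpickle it):

import pandas as pd

# Placeholder file name for the pickled BiG-SLiCE feature matrix.
features = pd.read_pickle("bigslice_bgc_features.pkl")
print(features.shape)  # rows: BGCs, columns: BiG-SLiCE features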
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Please cite our paper if you publish material based on these datasets:
G. Khodabandelou, V. Gauthier, M. El-Yacoubi, M. Fiore, "Estimation of Static and Dynamic Urban Populations with Mobile Network Metadata," IEEE Transactions on Mobile Computing, 2018 (in press). DOI: 10.1109/TMC.2018.2871156
Abstract
Communication-enabled devices that are physically carried by individuals are today pervasive, which opens unprecedented opportunities for collecting digital metadata about the mobility of large populations. In this paper, we propose a novel methodology for the estimation of people density at metropolitan scales, using subscriber presence metadata collected by a mobile operator. We show that our approach suits the estimation of static population densities, i.e., of the distribution of dwelling units per urban area contained in traditional censuses. Specifically, it achieves higher accuracy than that granted by previous equivalent solutions. In addition, our approach enables the estimation of dynamic population densities, i.e., the time-varying distributions of people in a conurbation. Our results build on significant real-world mobile network metadata and relevant ground-truth information in multiple urban scenarios.
Dataset Columns
This dataset covers one month of data, taken during April 2015, for three Italian cities: Rome, Milan, and Turin. The raw data was provided during the Telecom Italia Big Data Challenge (http://www.telecomitalia.com/tit/en/innovazione/archivio/big-data-challenge-2015.html)
1. grid_id: the coordinates of the grid cell can be retrieved with the shapefile of a given city
2. date: format Y-M-D H:M:S
3. landuse_label: the land-use label computed through the method described in [2]
4. presence: presence data for a given grid id, as provided by the Telecom Italia Big Data Challenge
5. population: census population of a given grid block, as defined by the Istituto Nazionale di Statistica (ISTAT, https://www.istat.it/en/censuses) in 2011
6. estimation: dynamic population density estimate (in persons), resulting from the method described in [1]
7. area: surface of the grid cell considered, in km^2
8. geometry: the shape of the area considered, in the EPSG:3003 coordinate system (only with quilt)
Note
Due to legal constraints, we cannot directly share the original Telecom Italia Big Data Challenge data used to build this dataset.
Easy access to this dataset with quilt
Install the dataset repository:
$ quilt install vgauthier/DynamicPopEstimate
Use the dataset with a pandas DataFrame
>>> from quilt.data.vgauthier import DynamicPopEstimate
>>> import pandas as pd
>>> df = pd.DataFrame(DynamicPopEstimate.rome())
Use the dataset with a GeoPandas GeoDataFrame
>>> from quilt.data.vgauthier import DynamicPopEstimate
>>> import geopandas as gpd
>>> df = gpd.GeoDataFrame(DynamicPopEstimate.rome())
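With the DataFrame loaded as above, a dynamic population density per grid cell can be derived directly from the columns described earlier (an illustrative computation, not part of the original dataset):

>>> df['density'] = df['estimation'] / df['area']  # persons per km^2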
References
[1] G. Khodabandelou, V. Gauthier, M. El-Yacoubi, M. Fiore, "Population estimation from mobile network traffic metadata," in Proc. of the 17th IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), pp. 1-9, 2016.
[2] A. Furno, M. Fiore, R. Stanica, C. Ziemlicki, and Z. Smoreda, "A tale of ten cities: Characterizing signatures of mobile traffic in urban areas," IEEE Transactions on Mobile Computing, vol. 16, no. 10, 2017.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COVID-19 has been recognized as a global threat, and several studies are being conducted to contribute to the fight against and prevention of this pandemic. This work presents a scholarly-production dataset focused on COVID-19, providing an overview of scientific research activities and making it possible to identify the countries, scientists and research groups most active in this task force to combat the coronavirus disease. The dataset is composed of 40,212 records of article metadata collected from the Scopus, PubMed, arXiv and bioRxiv databases from January 2019 to July 2020. The data were extracted using Python web-scraping techniques and preprocessed with pandas data-wrangling tools.
https://www.marketreportanalytics.com/privacy-policy
The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across industries. The rising need for data-driven decision-making, coupled with the expanding adoption of cloud-based analytics solutions, is fueling market expansion. While precise figures for market size and CAGR are not provided, a reasonable estimation, based on the prevalent growth in the broader analytics market and the crucial role of EDA in the data science workflow, would place the 2025 market size at approximately $3 billion, with a projected Compound Annual Growth Rate (CAGR) of 15% through 2033.

This growth is segmented across various applications, with large enterprises leading adoption due to their higher investment capacity and complex data needs. However, SMEs are witnessing rapid growth in EDA tool adoption, driven by the increasing availability of user-friendly and cost-effective solutions. Further segmentation by tool type reveals a strong preference for graphical EDA tools, which offer intuitive visualizations facilitating better data understanding and communication of findings. Geographic regions such as North America and Europe currently hold a significant market share, but the Asia-Pacific region shows promising potential for future growth owing to increasing digitalization and data generation. Key restraints to market growth include the need for specialized skills to effectively utilize these tools and the potential for data bias if not handled appropriately.

The competitive landscape is dynamic, with both established players like IBM and emerging companies specializing in niche areas vying for market share. Established players benefit from brand recognition and comprehensive enterprise solutions, while specialized vendors provide innovative features and agile development cycles. Open-source options like KNIME, R's Rattle, and Python's pandas-profiling offer cost-effective alternatives, particularly attracting academic institutions and smaller businesses. The ongoing development of advanced analytics functionalities, such as automated machine learning integration within EDA platforms, will be a significant driver of future market growth. Further, the integration of EDA tools within broader data science platforms is streamlining the overall analytical workflow, contributing to increased adoption and reduced complexity. The market's evolution hinges on enhanced user experience, more robust automation features, and seamless integration with other data management and analytics tools.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Large-scale fMRI dataset for the design of motor-based Brain-Computer Interfaces
Full description of the data in our dataset paper: []
This dataset is part of the PANDA project. PANDA aims to assess the feasibility of implanted communication Brain-Computer Interface (cBCI) technology for establishing communication in children with severe physical impairments, such as those caused by cerebral palsy (CP). This work was supported by the Dutch Technology Foundation STW.
Utrecht-BCI and NeuroSafari teams: https://www.nick-ramsey.eu/ and https://www.neurosafari.nl/
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Sebastian Gębala
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study was taken from the preliminary competition dataset of the 2018 Guangdong Industrial Intelligent Manufacturing Big Data Intelligent Algorithm Competition organized by Tianchi Feiyue Cloud (https://tianchi.aliyun.com/competition/entrance/231682/introduction). We curated the dataset, removing images that do not meet the requirements of our experiment, and split all data for training and testing. The images are all 2560×1920 pixels. Before training, all defects are labeled using labelImg and saved as JSON files; all JSON files are then converted to TXT files. Finally, defects in the organized dataset are detected and classified.

Description of the data and file structure
This is a project based on a YOLOv8-enhanced algorithm for aluminum defect classification and detection tasks. All code has been tested on Windows computers with Anaconda and CUDA-enabled GPUs. The following instructions allow users to run the code in this repository on a Windows system with a CUDA GPU.

Files and variables
File: defeat_dataset.zip

Setup
Please follow the steps below to set up the project.
Download the project repository:
1. Download the project repository defeat_dataset.zip.
2. Unzip it and navigate to the project folder; it should contain a subfolder: quexian_dataset.
Download the data:
1. Download defeat_dataset.zip.
2. Unzip the downloaded data and move the 'defeat_dataset' folder into the project's main folder.
3. Make sure that your defeat_dataset folder now contains a subfolder: quexian_dataset.
4. Within the folder you should find various subfolders such as addquexian-13, quexian_dataset, new_dataset-13, etc.

Software
Set up the Python environment:
1. Download and install Anaconda.
2. Once Anaconda is installed, open the Anaconda Prompt. On Windows, click Start, search for Anaconda Prompt, and open it.
3. Create a new conda environment with Python 3.8. You can name it whatever you like, for example yolov8: conda create -n yolov8 python=3.8
4. Activate the created environment. If the name is yolov8, enter: conda activate yolov8
5. Download and install Visual Studio Code.
6. Install PyTorch based on your system. For Windows/Linux users with a CUDA GPU: conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
7. Install the remaining libraries:
conda install -c anaconda scikit-learn=0.24.1
conda install astropy=4.2.1
conda install -c anaconda pandas=1.2.4
conda install -c conda-forge matplotlib=3.5.3
conda install scipy=1.10.1

Repeatability
For PyTorch, it is a well-known fact that there is no guarantee of fully reproducible results between PyTorch versions, individual commits, or different platforms. In addition, results may not be reproducible between CPU and GPU executions, even if the same seed is used. All results in the Analysis Notebook that involve only model evaluation are fully reproducible. However, when it comes to training the model on a GPU, results vary between machines.

Access information
Other publicly accessible locations of the data: https://tianchi.aliyun.com/dataset/public/
Data was derived from the following sources: https://tianchi.aliyun.com/dataset/140666

Data availability statement
The ten defect classes used in this study come from the Guangdong Industrial Wisdom Big Data Innovation Competition - Intelligent Algorithm Competition rematch; the dataset download link is https://tianchi.aliyun.com/competition/entrance/231682/information?lang=en-us. Officially, there are 4,356 images, including single-defect images, multiple-defect images and defect-free images. We selected only the single-defect and multiple-defect images, 3,233 images in total. The ten defects are: non-conductive, effacement, miss bottom corner, orange peel, varicolored, jet, lacquer bubble, jump into a pit, divulge the bottom, and blotch. Each image contains one or more defects, and the resolution of the defect images is 2560×1920.

By investigating the literature, we found that most experiments were done with these 10 defect types, so we chose three additional defect types that differ more from these ten and are greater in number, making them suitable for the experiments. The three newly added classes come from the preliminary dataset of the Guangdong Industrial Wisdom Big Data Intelligent Algorithm Competition, which can be downloaded from https://tianchi.aliyun.com/dataset/140666. It contains 3,000 images in total, among which 109, 73 and 43 images show the defects bruise, camouflage and coating cracking, respectively. Finally, the 10 defect types from the rematch and the 3 defect types selected from the preliminary round were merged into the new dataset examined in this work.

In processing the dataset, we tried different split ratios, such as 8:2, 7:3 and 7:2:1. After testing, we found that the experimental results did not differ much across split ratios. Therefore, we split the dataset in the ratio 7:2:1: the training set accounts for 70%, the validation set for 20%, and the test set for 10%. The random seed is set to 0 to ensure that the results are consistent every time the model is trained.

Finally, the mean Average Precision (mAP) metric was measured on the dataset three times. The results differed very little each time; for the accuracy of the experimental results, we took the average of the highest and lowest results. The highest was 71.5% and the lowest 71.1%, giving an average detection accuracy of 71.3% for the final experiment. All data and images utilized in this research are from publicly available sources, and the original creators have given their consent for these materials to be published in open-access formats.

The settings for the other parameters are as follows: epochs: 200, patience: 50, batch: 16, imgsz: 640, pretrained: true, optimizer: SGD, close_mosaic: 10, iou: 0.7, momentum: 0.937, weight_decay: 0.0005, box: 7.5, cls: 0.5, dfl: 1.5, pose: 12.0, kobj: 1.0, save_dir: runs/train

The defeat_dataset.zip is mentioned in the Supporting information section of our manuscript. The underlying data are held at Figshare, DOI: 10.6084/m9.figshare.27922929. The results_images.zip in the system contains the experimental results graphs. The images_1.zip and images_2.zip in the system contain all the images needed to generate the manuscript.tex manuscript.
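For illustration, a training call with these settings might look as follows with the Ultralytics YOLOv8 API (a hedged sketch: the base checkpoint and dataset YAML path are placeholders, not the exact code of this project):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder base checkpoint
model.train(
    data="quexian_dataset.yaml",  # placeholder dataset config
    epochs=200, patience=50, batch=16, imgsz=640, pretrained=True,
    optimizer="SGD", close_mosaic=10, iou=0.7, momentum=0.937,
    weight_decay=0.0005, box=7.5, cls=0.5, dfl=1.5, seed=0,
)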
https://www.marketresearchforecast.com/privacy-policy
The Python Package Software market is experiencing robust growth, driven by the increasing adoption of Python in various industries and the rising demand for efficient and specialized software solutions. The market's expansion is fueled by the large and active Python community constantly developing and refining packages for diverse applications, from web development and data science to machine learning and automation. While precise market sizing is unavailable, considering the widespread use of Python and the significant contribution of open-source packages, a reasonable estimate for the 2025 market size could be around $5 billion, projecting a Compound Annual Growth Rate (CAGR) of 15% over the forecast period (2025-2033).

This growth is primarily driven by the increasing complexity of software projects demanding specialized functionality readily available through packages, the need for faster development cycles, and the cost-effectiveness of leveraging pre-built components. Key trends include the rise of cloud-based Python package management, the growing importance of security and maintainability in package selection, and the increasing specialization of packages for niche applications. Constraints on market growth might include challenges in ensuring package quality and security, as well as the learning curve associated with integrating and managing diverse packages within large projects. The market is segmented into cloud-based and web-based solutions, catering to large enterprises and SMEs, with North America and Europe currently holding the largest market shares.

The diverse range of packages, from those focusing on data manipulation (pandas, NumPy) and web frameworks (Django, Flask) to machine learning libraries (scikit-learn, TensorFlow) and GUI development (Tkinter, PyQt), underscores the market's versatility. The significant contribution of open-source packages fosters a collaborative environment and continuous improvement. However, challenges remain in effectively managing the vast ecosystem of packages, addressing security vulnerabilities, and ensuring interoperability. Future growth will hinge on addressing these challenges, fostering standardization, and further improving the accessibility and user experience of Python package management systems. Continued innovation within the Python ecosystem and broader industry trends such as the rise of AI and big data will further propel the market's expansion.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limits, under increasingly volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine-learning methods have been advocated to solve this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MATPOWER and compatible with pandapower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
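For instance, converting the per-unit values to physical units is a single multiplication by the 100 MW base (a minimal sketch, assuming one of the load tables described below):

import pandas as pd

loads_pu = pd.read_csv('loads_2016_1.csv')  # values in per-unit (multiples of 100 MW)
loads_mw = loads_pu * 100.0                 # the same table expressed in MW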
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (MATLAB, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, using instead gens_by_country.csv, which contains a list of all generators for each country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with:
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
There was a definite lack of experts with knowledge of data science libraries/software environments in 2020. The greatest need was for specialists in the pandas and Matplotlib libraries.
The Pan-Andromeda Archaeological Survey is a survey of >400 deg^2 centered on the Andromeda (M31) and Triangulum (M33) galaxies that has provided the most extensive panorama of an L* galaxy group to large projected galactocentric radii. Here, we collate and summarize the current status of our knowledge of the substructures in the stellar halo of M31, and discuss connections between these features. We estimate that the 13 most distinctive substructures were produced by at least 5 different accretion events, all in the last 3 or 4 Gyr. We suggest that a few of the substructures farthest from M31 may be shells from a single accretion event. We calculate the luminosities of some prominent substructures for which previous estimates were not available, and we estimate the stellar mass budget of the outer halo of M31. We revisit the problem of quantifying the properties of a highly structured data set; specifically, we use the OPTICS clustering algorithm to quantify the hierarchical structure of M31's stellar halo and identify three new faint structures. M31's halo, in projection, appears to be dominated by two "mega-structures", which can be considered as the two most significant branches of a merger tree produced by breaking M31's stellar halo into increasingly smaller structures based on the stellar spatial clustering. We conclude that OPTICS is a powerful algorithm that could be used in any astronomical application involving the hierarchical clustering of points.
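As a rough illustration of the clustering approach (a sketch using scikit-learn's OPTICS on synthetic points, not the authors' exact pipeline or parameters):

import numpy as np
from sklearn.cluster import OPTICS

# Placeholder for projected stellar positions (e.g., tangent-plane coordinates).
xy = np.random.default_rng(0).random((1000, 2))
clustering = OPTICS(min_samples=20).fit(xy)
labels = clustering.labels_  # -1 marks points not assigned to any cluster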
https://creativecommons.org/publicdomain/zero/1.0/