MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Saubhagya Mishra
Released under MIT
Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.
What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.
Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling us to uncover hidden patterns, trends, and themes within the realm of cinema.
Note: The data was collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.
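As a rough illustration of the selection criteria described above, the sketch below filters a scraped IMDb table with pandas. The file name and the column names (`rating`, `votes`) are assumptions for illustration, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to the actual schema.
movies = pd.read_csv("imdb_movies.csv")

# Keep titles rated above 7 with more than 10,000 votes,
# mirroring the selection criteria described above.
gems = movies[(movies["rating"] > 7) & (movies["votes"] > 10_000)]
print(f"{len(gems)} movies match the criteria")
```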
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises over 20 years of geotechnical laboratory testing data collected primarily from Vienna, Lower Austria, and Burgenland. It includes 24 features documenting critical soil properties derived from particle size distributions, Atterberg limits, Proctor tests, permeability tests, and direct shear tests. Locations for a subset of samples are provided, enabling spatial analysis.
The dataset is a valuable resource for geotechnical research and education, allowing users to explore correlations among soil parameters and develop predictive models. Examples of such correlations include liquidity index with undrained shear strength, particle size distribution with friction angle, and liquid limit and plasticity index with residual friction angle.
Python-based exploratory data analysis and machine learning applications have demonstrated the dataset's potential for predictive modeling, achieving moderate accuracy for parameters such as cohesion and friction angle. Its temporal and spatial breadth, combined with repeated testing, enhances its reliability and applicability for benchmarking and validating analytical and computational geotechnical methods.
This dataset is intended for researchers, educators, and practitioners in geotechnical engineering. Potential use cases include refining empirical correlations, training machine learning models, and advancing soil mechanics understanding. Users should note that preprocessing steps, such as imputation for missing values and outlier detection, may be necessary for specific applications.
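A minimal sketch of the preprocessing and predictive-modeling workflow mentioned above is given below. The file name and column names (`liquid_limit`, `plasticity_index`, `fines_content`, `friction_angle`) are assumptions for illustration; the actual dataset schema may differ.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical file and column names -- adapt to the real schema.
df = pd.read_csv("geotechnical_lab_tests.csv")
features = ["liquid_limit", "plasticity_index", "fines_content"]
target = "friction_angle"

# Drop rows without a target value, keep features with gaps for imputation.
data = df.dropna(subset=[target])
X, y = data[features], data[target]

# Impute remaining missing feature values, then fit a tree-based regressor.
model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestRegressor(n_estimators=200, random_state=0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Cross-validated R^2:", scores.mean())
```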
Key Features:
Technical Details:
Acknowledgments:
The dataset was compiled with support from the European Union's MSCA Staff Exchanges project 101182689 Geotechnical Resilience through Intelligent Design (GRID).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and these methods are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered a separate exercise from exploratory data analysis, but it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is implemented in Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
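The abstract does not expose ImputEHR's own API, so the following is only a generic sketch of the gradient-boosted, tree-based imputation family it refers to, using scikit-learn's IterativeImputer. The file name and the assumption that the relevant columns are numeric are illustrative only.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

# Placeholder EHR extract with missing values.
ehr = pd.read_csv("ehr_extract.csv")
numeric = ehr.select_dtypes("number")

# Iterative imputation: each column with gaps is modeled from the others
# using a gradient-boosted tree regressor.
imputer = IterativeImputer(estimator=HistGradientBoostingRegressor(),
                           max_iter=10, random_state=0)
completed = pd.DataFrame(imputer.fit_transform(numeric),
                         columns=numeric.columns, index=numeric.index)
print(completed.isna().sum().sum(), "missing values remain")
```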
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Kunal Khurana
Released under MIT
Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
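A minimal sketch of the kind of EDA described above is shown below, assuming hypothetical columns `started_at`, `ended_at`, and `member_casual`; the actual Cyclistic export may use different names.

```python
import pandas as pd

# Hypothetical file and column names for a Cyclistic trip export.
trips = pd.read_csv("cyclistic_trips.csv", parse_dates=["started_at", "ended_at"])

# Trip duration in minutes, plus peak start hour by rider type.
trips["duration_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
peak_hours = (trips.assign(hour=trips["started_at"].dt.hour)
                   .groupby(["member_casual", "hour"]).size()
                   .groupby(level=0).idxmax())
print(peak_hours)
print(trips.groupby("member_casual")["duration_min"].median())
```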
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This is the dataset required for Keith Galli's 'Solving real world data science tasks with Python Pandas!' video, in which he analyzes and answers business questions for 12 months' worth of business data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc.
I decided to upload the data here so that I can carry out the exercise straight in Kaggle Notebooks, making it ready for viewing as a portfolio project.
12 .csv files containing sales data for each month of 2019.
Of course, all thanks go to Keith Galli and the great work he does with his tutorials. He has several other amazing tutorials that you can follow, and you can subscribe to his channel.
To explore and learn more about Multiple Linear Regression.
The dataset consists of house prices across the USA. It has the following columns:
- Avg. Area Income: Average income of the area where the house is located.
- House Age: Age of the house in years.
- Number of Rooms
- Number of Bedrooms
- Area Population: Population of the area where the house is located.
- Price
- Address: The only textual column in the dataset, consisting of the address of the house.
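Given the columns listed above, a multiple linear regression can be sketched as follows. The file name `USA_Housing.csv` is an assumption, and the column labels are taken to match the CSV header exactly; adjust as needed.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumed file name -- replace with the actual CSV path.
housing = pd.read_csv("USA_Housing.csv")

# Numerical predictors only; 'Address' is textual and excluded.
X = housing.drop(columns=["Price", "Address"])
y = housing["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```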
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Python scripts with instructions for the extraction and transformation of the original datasets; transformed datasets; dataset FA/LCB constraints.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
- `Data_Analysis.ipynb`: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the `eda_plots/` directory.
- `Dataset_Extension.ipynb`: A Jupyter Notebook used for the data enrichment process. It takes the raw `Inference_data.csv` and produces `Inference_data_Extended.csv` by adding detailed hardware specifications, cost estimates, and derived energy metrics.
- `Optimization_Model.ipynb`: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
- `Inference_data.csv`: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
- `Inference_data_Extended.csv`: The final, enriched dataset used for all analysis and modeling. This is the output of the `Dataset_Extension.ipynb` notebook.
- `eda_log.txt`: A text log file containing summary statistics generated during the exploratory data analysis.
- `requirements.txt`: A list of all necessary Python libraries and their versions required to run the code in this repository.
- `eda_plots/`: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
- `optimization_models_final/`: A directory where the trained and saved final model files (`.joblib`) are stored after running the optimization notebook.
- `pareto_validation_plot_fold_0.png`: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
- `shap_waterfall_final_model.png`: The SHAP plot used for the model interpretability analysis, as presented in the thesis.
```bash
git clone <repository-url>
cd <repository-directory>
```

```bash
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

```bash
pip install -r requirements.txt
```
The enriched dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the **`Dataset_Extension.ipynb`** notebook. It will take `Inference_data.csv` as input and generate the extended version.
The EDA plots are already available in the `eda_plots/` directory. To regenerate them, run the **`Data_Analysis.ipynb`** notebook. This will overwrite the existing plots and the `eda_log.txt` file.
Running the **`Optimization_Model.ipynb`** notebook will execute the entire pipeline described in the paper: the trained models are saved to the `optimization_models_final/` directory, and the result figures `pareto_validation_plot_fold_0.png` and `shap_waterfall_final_model.png` are generated.
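For orientation only, the 5-fold cross-validation step could look roughly like the sketch below; the feature and target column names are placeholders, not the actual schema of `Inference_data_Extended.csv` (consult the notebook for the real ones).

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

df = pd.read_csv("Inference_data_Extended.csv")

# Placeholder column names -- replace with the columns used in the notebook.
features = ["accelerator_count", "cost_usd", "power_watts"]
target = "throughput"

# 5-fold cross-validation of a gradient-boosted predictive model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingRegressor(), df[features], df[target],
                         cv=cv, scoring="r2")
print("Fold R^2 scores:", scores)
```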
This dataset was created by Rahul Sharma
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The present dataset includes the SonarQube issues uncovered as part of our exploratory research targeting code complexity issues in junior developer code written in the Python or Java programming languages. The dataset also includes the actual rule configurations and thresholds used for the Python and Java languages during source code analysis.
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset brings you the Iris Dataset in several data formats (see more details in the next sections).
You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that read all these formats:
Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.
Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris
Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
The file downloaded is iris.data and is formatted as a comma delimited file.
This small data collection was created to help you test your skills with ingesting various data formats.
This file was processed to convert the data in the following formats:
* csv - comma separated values format
* tsv - tab separated values format
* parquet - parquet format
* feather - feather format
* parquet.gzip - compressed parquet format
* h5 - hdf5 format
* pickle - Python binary object file - pickle format
* xlsx - Excel format
* npy - Numpy (Python library) binary format
* npz - Numpy (Python library) binary compressed format
* rds - Rds (R specific data format) binary format
I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.
Use these data formats to test your skills in ingesting data in various formats.
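As a starting point, a Python sketch for ingesting several of these formats is shown below; the file names are assumed to follow an iris.<extension> pattern and may differ from the actual uploads.

```python
import numpy as np
import pandas as pd

# Each reader targets one of the provided formats; adjust paths as needed.
iris_csv = pd.read_csv("iris.csv")
iris_tsv = pd.read_csv("iris.tsv", sep="\t")
iris_parquet = pd.read_parquet("iris.parquet")
iris_feather = pd.read_feather("iris.feather")
iris_pickle = pd.read_pickle("iris.pickle")
iris_xlsx = pd.read_excel("iris.xlsx")
iris_npy = np.load("iris.npy", allow_pickle=True)

print(iris_csv.head())
```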
An exploratory analysis of multiple machine learning models for predicting end-of-treatment acute pain intensity, opioid doses (represented as the total morphine equivalent daily dose (MEDD)), and analgesic efficacy in a large-scale retrospective cohort of oral cavity and oropharyngeal cancer patients who received radiation therapy (RT).
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This is the first public release of the RICardo dataset under the ODbL v1.0 licence. This dataset is precisely described under the data package format.
This release includes 368,871 bilateral or total trade flows from 1787 to 1938 for 373 reporting entities. It also contains the Python scripts used to compile and filter the flows that feed our exploratory data analysis online tool.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)
Important notice: Zenodo appears to compress gzipped files a second time without notice, so they are "double compressed". When you download them, they will be named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.
2024-10-25 update: updated the repository list and observation period. The filters relying on dates were also updated.
2024-07-09 update: fixed an occasionally invalid valid_yaml flag.
The dataset was created as follows:
First, we used GitHub SEART (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits and at least 100 stars, where at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)
We checked whether a .github/workflows folder existed. We filtered out the repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).
We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)
We concatenated every file in /ourDataFolder/output into a CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.
We added the column uid via a script available on GitHub.
Finally, we archived the /ourDataFolder/workflows folder with pigz (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).
Using the extracted data, the following files were created:
workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.
workflows_auxiliaries.tar.gz is a similar file containing also auxiliary files.
workflows.csv.gz contains the metadata for the extracted workflow files.
workflows_auxiliaries.csv.gz is a similar file containing also metadata for auxiliary files.
repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.
The metadata is separated into different columns:
repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" allows distinguishing between the author and the repository name
commit_hash: The commit hash returned by git
author_name: The name of the author that changed this file
author_email: The email of the author that changed this file
committer_name: The name of the committer
committer_email: The email of the committer
committed_date: The committed date of the commit
authored_date: The authored date of the commit
file_path: The path to this file in the repository
previous_file_path: The path to this file before it has been touched
file_hash: The name of the related workflow file in the dataset
previous_file_hash: The name of the related workflow file in the dataset, before it has been touched
git_change_type: A single letter (A, D, M, or R) representing the type of change made to the workflow (Added, Deleted, Modified, or Renamed). This letter is given by gitpython and provided as is.
valid_yaml: A boolean indicating if the file is a valid YAML file.
probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)
valid_workflow: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
uid: A unique identifier for a given file that survives modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renaming does not change the identifier.
Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
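A minimal sketch of loading the metadata and keeping only syntactically valid workflow files is given below (pandas reads gzipped CSVs directly; remember the double-compression caveat above). It assumes the boolean column parses as True/False values.

```python
import pandas as pd

# pandas infers gzip compression from the .gz extension.
meta = pd.read_csv("workflows.csv.gz")

# Keep entries flagged as valid GitHub Actions workflows
# (assuming valid_workflow is parsed as a boolean column).
valid = meta[meta["valid_workflow"] == True]

# Number of distinct commits touching workflows, per repository.
changes_per_repo = valid.groupby("repository")["commit_hash"].nunique()
print(changes_per_repo.sort_values(ascending=False).head(10))
```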
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is the cleaned-up version of the Google Play Store Data dataset, available on Kaggle. The EDA and data cleaning were performed using Python.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the study “Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database”, authored by Edmundo Camacho, Xavier Fresquet, and Frédéric Billiet.
It contains structured descriptions of musical performances, performers, and instruments extracted from the Musiconis database (December 2024 version). This dataset does not include organological descriptions, which are available in a separate dataset.
The Musiconis database provides a structured and interoperable framework for studying medieval music iconography. It enables investigations into:
• The evolution and spread of musical instruments across Europe and the Mediterranean.
• Performer typologies and their representation in medieval art.
• The relationships between musical practices and social or religious contexts.
Contents:
• Musiconis Dataset (JSON format, December 2024 version):
• Musical scenes and their descriptions
• Performer metadata (roles, social status, gender, interactions)
• Instrument classifications (without detailed organological descriptions)
• Colab Notebook (Python):
• Data processing and structuring
• Visualization of performer distributions and instrument usage
• Exploratory statistics and mapping
Tools Used:
• Python (Pandas, Seaborn, Matplotlib, Plotly)
• Statistical and exploratory data analysis
• Visualization of instrument distributions, performer interactions, and musical context
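As an illustration only (the JSON schema is not described here, so the file and field names below are hypothetical), a Colab-style cell might count instrument occurrences like this:

```python
import json
import pandas as pd
import seaborn as sns

# Hypothetical structure: a list of scene records, each with an "instruments" list.
with open("musiconis_2024-12.json", encoding="utf-8") as fh:
    scenes = json.load(fh)

records = pd.json_normalize(scenes)
instrument_counts = records.explode("instruments")["instruments"].value_counts()

# Bar plot of the 15 most frequent instrument labels.
top = instrument_counts.head(15)
sns.barplot(x=top.values, y=top.index)
```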
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
During the first half of 2020, the COVID-19 pandemic shifted social gatherings to online business and social interaction. The travel bans imposed worldwide and national lockdowns prevented social gatherings, forcing learning institutions and businesses to adopt online platforms for learning and business transactions. This development led to the incorporation of video conferencing into daily activities. This data article presents broadband data usage measurements collected using Glasswire software on various conference calls made between July and August. The services considered in this work are Google Meet, Zoom, Mixir, and Hangout. The data were recorded in Microsoft Excel 2016, running on a personal computer. The data was cleaned and processed using Google Colaboratory, which runs Python scripts in the browser. Exploratory data analysis is conducted on the dataset using linear regression to build a predictive model and assess which service offers the best quality of service for online video and voice conferencing. The data is useful to learning institutions running online programs and to learners accessing online programs in smart cities and developing countries. The data is presented in tables and graphs.
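A rough sketch of the linear-regression step described above is shown below. The spreadsheet name and the columns `service`, `duration_min`, and `data_used_mb` are assumptions for illustration; the actual layout of the recorded data may differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical Excel sheet of per-call measurements.
usage = pd.read_excel("conference_call_usage.xlsx")

# Model data consumed as a function of call duration, separately per service.
for service, group in usage.groupby("service"):
    model = LinearRegression().fit(group[["duration_min"]], group["data_used_mb"])
    print(service, "MB per minute:", round(model.coef_[0], 2))
```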
Non-Commercial Government Licence (version 2): http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/
This is the 4.2.0.2019f version of the HadISDH (Integrated Surface Database Humidity) land data. These data are provided by the Met Office Hadley Centre. This version spans 1/1/1973 to 31/12/2019.
The data are monthly gridded (5 degree by 5 degree) fields. Products are available for temperature and six humidity variables: specific humidity (q), relative humidity (RH), dew point temperature (Td), wet bulb temperature (Tw), vapour pressure (e), dew point depression (DPD). Data are provided in either NetCDF or ASCII format.
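For the NetCDF product, a gridded field can be read with xarray, for example as sketched below; the file name and variable name are illustrative only, so consult the HadISDH product documentation for the exact ones.

```python
import xarray as xr

# Illustrative file name -- see the HadISDH download pages for the real products.
ds = xr.open_dataset("HadISDH.landq.4.2.0.2019f.nc")
print(ds)  # lists the gridded variables, coordinates, and time span

# Select one month and plot, assuming a variable named "q_anoms" exists
# on the 5 degree by 5 degree grid.
ds["q_anoms"].sel(time="2019-12").plot()
```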
This version extends the 4.1.0.2018f version to the end of 2019 and constitutes a minor update to HadISDH due to changing some of the code base from IDL to Python 3 and detecting and fixing various bugs in the process. These have led to small changes in regional and global average values and coverage. All other processing steps for HadISDH remain identical. Users are advised to read the update document in the Docs section for full details.
As in previous years, the annual scrape of NOAA’s Integrated Surface Dataset for HadISD.3.1.0.2019f, which is the basis of HadISDH.land, has pulled through some historical changes to stations. This, and the additional year of data, results in small changes to station selection. There has been an issue with data for April 2015 whereby it is missing for most of the globe. This will hopefully be resolved by next year’s update. The homogeneity adjustments differ slightly due to sensitivity to the addition and loss of stations, historical changes to stations previously included and the additional 12 months of data.
To keep informed about updates, news and announcements, follow the HadOBS team on Twitter: @metofficeHadOBS.
For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISDH blog: http://hadisdh.blogspot.co.uk/
References:
When using the dataset in a paper please cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference):
Willett, K. M., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Parker, D. E., Jones, P. D., and Williams Jr., C. N.: HadISDH land surface multi-variable humidity and temperature record for climate monitoring, Clim. Past, 10, 1983-2006, doi:10.5194/cp-10-1983-2014, 2014.
Dunn, R. J. H., et al. 2016: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geoscientific Instrumentation, Methods and Data Systems, 5, 473-491.
Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704–708, doi:10.1175/2011BAMS3015.1
We strongly recommend that you read these papers before making use of the data, more detail on the dataset can be found in an earlier publication:
Willett, K. M., Williams Jr., C. N., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Jones, P. D., and Parker D. E., 2013: HadISDH: An updated land surface specific humidity product for climate monitoring. Climate of the Past, 9, 657-677, doi:10.5194/cp-9-657-2013.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Saubhagya Mishra
Released under MIT