41 datasets found
  1. Exploratory Data Analysis

    • kaggle.com
    Updated Feb 26, 2025
    Cite
    Saubhagya Mishra (2025). Exploratory Data Analysis [Dataset]. https://www.kaggle.com/datasets/saubhagyamishra1992/exploratory-data-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Saubhagya Mishra
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Saubhagya Mishra

    Released under MIT

    Contents

  2. IMDb Top 4070: Explore the Cinema Data

    • kaggle.com
    Updated Aug 15, 2023
    Cite
    K.T.S. Prabhu (2023). IMDb Top 4070: Explore the Cinema Data [Dataset]. https://www.kaggle.com/datasets/ktsprabhu/imdb-top-4070-explore-the-cinema-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    K.T.S. Prabhu
    Description

    Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is the result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.
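
    As a quick illustration of the stated selection criteria, the sketch below filters a movie table with pandas; the file name and column names (rating, votes) are assumptions, not the dataset's documented schema.

      import pandas as pd

      # Hypothetical file and column names; adjust to the actual CSV in this dataset.
      movies = pd.read_csv("imdb_movies.csv")

      # The stated selection criteria: rating above 7 and more than 10,000 votes.
      gems = movies[(movies["rating"] > 7) & (movies["votes"] > 10_000)]
      print(len(gems), "movies selected")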

    What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.

    Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling us to uncover hidden patterns, trends, and themes within the realm of cinema.

    Note: The data was collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.

  3. Dataset for "Machine learning predictions on an extensive geotechnical...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Dec 5, 2024
    Cite
    Enrico Soranzo; Enrico Soranzo (2024). Dataset for "Machine learning predictions on an extensive geotechnical dataset of laboratory tests in Austria" [Dataset]. http://doi.org/10.5281/zenodo.14251191
    Explore at:
    csv (available download format)
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Enrico Soranzo; Enrico Soranzo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 30, 2024
    Description

    This dataset comprises over 20 years of geotechnical laboratory testing data collected primarily from Vienna, Lower Austria, and Burgenland. It includes 24 features documenting critical soil properties derived from particle size distributions, Atterberg limits, Proctor tests, permeability tests, and direct shear tests. Locations for a subset of samples are provided, enabling spatial analysis.

    The dataset is a valuable resource for geotechnical research and education, allowing users to explore correlations among soil parameters and develop predictive models. Examples of such correlations include liquidity index with undrained shear strength, particle size distribution with friction angle, and liquid limit and plasticity index with residual friction angle.

    Python-based exploratory data analysis and machine learning applications have demonstrated the dataset's potential for predictive modeling, achieving moderate accuracy for parameters such as cohesion and friction angle. Its temporal and spatial breadth, combined with repeated testing, enhances its reliability and applicability for benchmarking and validating analytical and computational geotechnical methods.

    This dataset is intended for researchers, educators, and practitioners in geotechnical engineering. Potential use cases include refining empirical correlations, training machine learning models, and advancing soil mechanics understanding. Users should note that preprocessing steps, such as imputation for missing values and outlier detection, may be necessary for specific applications.

    Key Features:

    • Temporal Coverage: Over 20 years of data.
    • Geographical Coverage: Vienna, Lower Austria, and Burgenland.
    • Tests Included:
      • Particle Size Distribution
      • Atterberg Limits
      • Proctor Tests
      • Permeability Tests
      • Direct Shear Tests
    • Number of Variables: 24
    • Potential Applications: Correlation analysis, predictive modeling, and geotechnical design.

    Technical Details:

    • Missing values have been addressed using K-Nearest Neighbors (KNN) imputation, and anomalies identified using Local Outlier Factor (LOF) methods in previous studies.
    • Data normalization and standardization steps are recommended for specific analyses (see the preprocessing sketch below).
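
    A minimal scikit-learn sketch of the preprocessing described above (KNN imputation, LOF outlier flagging, standardization); the file name is an assumption and only the numeric columns are used for illustration.

      import pandas as pd
      from sklearn.impute import KNNImputer
      from sklearn.neighbors import LocalOutlierFactor
      from sklearn.preprocessing import StandardScaler

      # Hypothetical file name; adjust to the actual CSV layout of this dataset.
      soil = pd.read_csv("geotechnical_lab_tests.csv")
      numeric = soil.select_dtypes("number")

      # Impute missing values from the 5 nearest neighbours
      imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(numeric),
                             columns=numeric.columns)

      # Local Outlier Factor marks inliers as 1 and outliers as -1
      flags = LocalOutlierFactor(n_neighbors=20).fit_predict(imputed)
      clean = imputed[flags == 1]

      # Standardize before correlation analysis or model training
      scaled = StandardScaler().fit_transform(clean)
      print(scaled.shape)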

    Acknowledgments:
    The dataset was compiled with support from the European Union's MSCA Staff Exchanges project 101182689 Geotechnical Resilience through Intelligent Design (GRID).

  4. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    pdf (available download format)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and they are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered as a separate exercise from exploratory data analysis, but it should be considered as part of the data exploration process. We have created a new graphical tool, ImputEHR, that is implemented in Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
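
    ImputEHR itself is a point-and-click tool, but the gradient-boosted tree-based imputation it offers can be sketched in plain scikit-learn as below; this is an illustrative stand-in, not the tool's own code, and the toy table is invented.

      import numpy as np
      import pandas as pd
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
      from sklearn.impute import IterativeImputer
      from sklearn.ensemble import HistGradientBoostingRegressor

      # Toy EHR-like table with missing values (illustrative only)
      df = pd.DataFrame({
          "age": [34, 51, np.nan, 62, 45],
          "sbp": [120, np.nan, 135, 150, np.nan],
          "glucose": [np.nan, 98, 110, 140, 105],
      })

      # Iteratively impute each column using a gradient-boosted tree regressor
      imputer = IterativeImputer(estimator=HistGradientBoostingRegressor(),
                                 max_iter=10, random_state=0)
      completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
      print(completed)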

  5. Files Python

    • kaggle.com
    Updated Jan 13, 2024
    Cite
    Kunal Khurana (2024). Files Python [Dataset]. https://www.kaggle.com/kunalkhurana007/files-python/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kunal Khurana
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Kunal Khurana

    Released under MIT

    Contents

  6. Cyclistic Bike - Data Analysis (Python)

    • kaggle.com
    Updated Sep 25, 2024
    Cite
    Amirthavarshini (2024). Cyclistic Bike - Data Analysis (Python) [Dataset]. https://www.kaggle.com/datasets/amirthavarshini12/cyclistic-bike-data-analysis-python/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Amirthavarshini
    Description

    Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
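
    A minimal sketch of the kind of EDA described above; the file name and column names (started_at, ended_at) are assumptions based on typical Cyclistic trip exports.

      import pandas as pd
      import matplotlib.pyplot as plt

      # Hypothetical file and column names; adjust to the actual Cyclistic export.
      trips = pd.read_csv("cyclistic_trips.csv", parse_dates=["started_at", "ended_at"])
      trips["duration_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60

      # Trips by hour of day to spot peak usage times
      hourly = trips.groupby(trips["started_at"].dt.hour).size()
      hourly.plot(kind="bar")
      plt.xlabel("Hour of day")
      plt.ylabel("Number of trips")
      plt.tight_layout()
      plt.show()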

  7. Keith Galli's Sales Analysis Exercise

    • kaggle.com
    Updated Jan 28, 2022
    Cite
    Zulkhairee Sulaiman (2022). Keith Galli's Sales Analysis Exercise [Dataset]. https://www.kaggle.com/datasets/zulkhaireesulaiman/sales-analysis-2019-excercise/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zulkhairee Sulaiman
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This is the dataset required for Keith Galli's 'Solving real world data science tasks with Python Pandas!' video, in which he analyzes and answers business questions using 12 months' worth of business data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc.

    I decided to upload the data here so that I can carry out the exercise directly in Kaggle Notebooks, making it ready for viewing as a portfolio project.

    Content

    12 .csv files containing sales data for each month of 2019.
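
    A typical first step with these files is concatenating the twelve monthly CSVs into one frame, roughly as below; the directory and file-name pattern are assumptions.

      import glob
      import pandas as pd

      # Hypothetical directory and naming pattern; adjust to the uploaded files.
      files = sorted(glob.glob("sales_data/Sales_*_2019.csv"))
      sales = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
      print(sales.shape)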

    Acknowledgements

    Of course, all thanks goes to Keith Galli and the great work he does with his tutorials. He has several other amazing tutorials that you can follow, and you can subscribe to his channel.

  8. House Prices

    • kaggle.com
    Updated May 13, 2021
    Cite
    Tanya Chawla (2021). House Prices [Dataset]. https://www.kaggle.com/tanyachawla412/house-prices/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 13, 2021
    Dataset provided by
    Kaggle
    Authors
    Tanya Chawla
    Description

    Context

    To explore and learn more about multiple linear regression.

    Content

    The dataset consists of house prices across the USA. It has the following columns (a short regression sketch follows the list):
    • Avg. Area Income: Average income of the area where the house is located.
    • House Age: Age of the house in years.
    • Number of Rooms
    • Number of Bedrooms
    • Area Population: Population of the area where the house is located.
    • Price
    • Address: The only textual column, containing the address of the house.
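
    A minimal multiple-linear-regression sketch over these columns; the file name is an assumption and the column names may differ slightly in the actual CSV.

      import pandas as pd
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import train_test_split

      # Hypothetical file name; column names follow the description above.
      df = pd.read_csv("USA_Housing.csv")
      X = df[["Avg. Area Income", "House Age", "Number of Rooms",
              "Number of Bedrooms", "Area Population"]]
      y = df["Price"]

      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
      model = LinearRegression().fit(X_train, y_train)
      print("Held-out R^2:", model.score(X_test, y_test))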

  9. Python scripts with instructions for the extraction and transformation of...

    • plos.figshare.com
    zip
    Updated May 7, 2025
    Cite
    Timur Olzhabaev; Lukas Müller; Daniel Krause; Dominik Schwudke; Andrew Ernest Torda (2025). Python scripts with instructions for the extraction and transformation of original datasets; Transformed datasets; Dataset FA/ LCB constraints. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012892.s006
    Explore at:
    zip (available download format)
    Dataset updated
    May 7, 2025
    Dataset provided by
    PLOS Computational Biology
    Authors
    Timur Olzhabaev; Lukas Müller; Daniel Krause; Dominik Schwudke; Andrew Ernest Torda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Python scripts with instructions for the extraction and transformation of original datasets; Transformed datasets; Dataset FA/ LCB constraints.

  10. Replication Package for 'Data-Driven Analysis and Optimization of Machine...

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Cite
    Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
    Explore at:
    zip (available download format)
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño; Joel Castaño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

    This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.
    The framework considers the trade-offs between three key objectives (a toy Pareto-filtering sketch follows the list):
    1. Performance (maximizing throughput)
    2. Energy Efficiency (minimizing estimated energy per unit)
    3. Cost (minimizing estimated hardware cost)
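
    To illustrate the trade-off idea only (not the thesis's actual implementation), the toy sketch below filters made-up hardware configurations down to the Pareto-optimal set over throughput (maximize), energy per unit (minimize), and cost (minimize).

      # Toy configurations: (throughput, energy per unit, cost); all values are invented.
      configs = {
          "A": (1200.0, 0.8, 15000.0),
          "B": (1500.0, 1.1, 22000.0),
          "C": (1100.0, 0.7, 14000.0),
          "D": (1400.0, 1.2, 26000.0),
      }

      def dominates(a, b):
          # a dominates b if it is no worse on every objective and strictly better on at least one
          no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
          strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
          return no_worse and strictly_better

      pareto = [name for name, obj in configs.items()
                if not any(dominates(other, obj) for other in configs.values() if other != obj)]
      print("Pareto-optimal configurations:", pareto)  # expected: A, B, C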

    Repository Structure

    This repository is organized as follows:
    • Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.
    • Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.
    • Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
    • Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
    • Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.
    • eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.
    • requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.
    • eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
    • optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.
    • pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
    • shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

    Requirements and Installation

    To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.
    1. Clone the repository:
       git clone
       cd
    2. Create and activate a virtual environment (optional but recommended):
       python -m venv venv
       source venv/bin/activate   # On Windows, use venv\Scripts\activate
    3. Install the required packages. All dependencies are listed in the requirements.txt file. Install them using pip:
       pip install -r requirements.txt

    Step-by-Step Reproduction Workflow

    The notebooks are designed to be run in a logical sequence.

    Step 1: Data Enrichment (Optional)

    The final enriched dataset (Inference_data_Extended.csv) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the Dataset_Extension.ipynb notebook. It will take Inference_data.csv as input and generate the extended version.

    Step 2: Exploratory Data Analysis (Optional)

    All plots from the EDA are pre-generated and available in the eda_plots/ directory. To regenerate them, run the Data_Analysis.ipynb notebook. This will overwrite the existing plots and the eda_log.txt file.

    Step 3: Main Model Training, Validation, and Recommendation

    This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:
    1. It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.
    2. It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.
    3. It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.
    4. It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
  11. Sales Data (Project1 IIITD)

    • kaggle.com
    Updated Jan 16, 2022
    Cite
    Rahul Sharma (2022). Sales Data (Project1 IIITD) [Dataset]. https://www.kaggle.com/rahultheogre/iiitd-project1/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rahul Sharma
    Description

    Dataset

    This dataset was created by Rahul Sharma

    Contents

  12. Open Data Package for Article "Exploring Complexity Issues in Junior...

    • figshare.com
    xlsx
    Updated Jul 9, 2024
    Cite
    Arthur-Jozsef Molnar (2024). Open Data Package for Article "Exploring Complexity Issues in Junior Developer Code using Static Analysis and FCA" [Dataset]. http://doi.org/10.6084/m9.figshare.25729587.v1
    Explore at:
    xlsx (available download format)
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Arthur-Jozsef Molnar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The present dataset includes the SonarQube issues uncovered as part of our exploratory research targeting code complexity issues in junior developer code written in the Python or Java programming languages. The dataset also includes the actual rule configurations and thresholds used for the Python and Java languages during source code analysis.

  13. Explore data formats and ingestion methods

    • kaggle.com
    Updated Feb 12, 2021
    Cite
    Gabriel Preda (2021). Explore data formats and ingestion methods [Dataset]. https://www.kaggle.com/datasets/gpreda/iris-dataset/discussion?sort=undefined
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gabriel Preda
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Why this Dataset

    This dataset brings you the Iris Dataset in several data formats (see more details in the next sections).

    You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that read all these formats.

    Iris Dataset

    Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.

    Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris

    Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/

    The downloaded file is iris.data, formatted as a comma-delimited file.

    This small data collection was created to help you test your skills with ingesting various data formats.

    Content

    This file was processed to convert the data into the following formats (a short ingestion sketch follows the list):
    • csv - comma separated values format
    • tsv - tab separated values format
    • parquet - parquet format
    • feather - feather format
    • parquet.gzip - compressed parquet format
    • h5 - hdf5 format
    • pickle - Python binary object file (pickle format)
    • xlsx - Excel format
    • npy - NumPy (Python library) binary format
    • npz - NumPy (Python library) binary compressed format
    • rds - Rds (R specific data format) binary format
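
    A minimal ingestion sketch for most of these formats; the file names are assumptions, and several readers need optional dependencies (pyarrow for parquet/feather, openpyxl for xlsx, pytables for h5).

      import numpy as np
      import pandas as pd

      # Hypothetical file names; adjust to the files shipped with this dataset.
      df_csv     = pd.read_csv("iris.csv")
      df_tsv     = pd.read_csv("iris.tsv", sep="\t")
      df_parquet = pd.read_parquet("iris.parquet")
      df_feather = pd.read_feather("iris.feather")
      df_excel   = pd.read_excel("iris.xlsx")
      df_hdf     = pd.read_hdf("iris.h5")
      df_pickle  = pd.read_pickle("iris.pickle")
      arr_npy    = np.load("iris.npy", allow_pickle=True)
      arr_npz    = np.load("iris.npz", allow_pickle=True)
      # .rds is R-specific: read it with readRDS() in R, or the pyreadr package in Python.

      print(df_csv.head())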

    Acknowledgements

    I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.

    Inspiration

    Use these data formats to test your skills in ingesting data in various formats.

  14. Python codes for ML-Pain-MEDD

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Apr 29, 2024
    Cite
    Salama, Vivian (2024). Python codes for ML-Pain-MEDD [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001412595
    Explore at:
    Dataset updated
    Apr 29, 2024
    Authors
    Salama, Vivian
    Description

    An exploratory analysis of multiple machine learning models for predicting end-of-treatment acute pain intensity, opioid doses (represented as the total morphine equivalent daily dose (MEDD)), and analgesic efficacy in a large-scale retrospective cohort of oral cavity and oropharyngeal cancer patients who received radiation therapy (RT).

  15. RICardo dataset 2017.12

    • zenodo.org
    zip
    Updated Jan 21, 2020
    Cite
    Béatrice Dedinger; Paul Girard; Paul Girard; Béatrice Dedinger (2020). RICardo dataset 2017.12 [Dataset]. http://doi.org/10.5281/zenodo.1119592
    Explore at:
    zip (available download format)
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Béatrice Dedinger; Paul Girard; Paul Girard; Béatrice Dedinger
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This is the first public release of the RICardo dataset under the ODbL v1.0 licence. The dataset is precisely described under the data package format.

    This release includes 368,871 bilateral or total trade flows from 1787 to 1938 for 373 reporting entities. It also contains python scripts used to compile and filter the flows to fuel our exploratory data analysis online tool.
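
    A minimal sketch of the kind of exploratory aggregation the online tool supports; the file name and column names (year, flow) are assumptions, so check the data package descriptor for the real schema.

      import pandas as pd

      # Hypothetical file and column names; see the data package descriptor for the real schema.
      flows = pd.read_csv("RICardo_flows.csv")

      # Total reported trade value per year across all reporting entities
      yearly = flows.groupby("year")["flow"].sum()
      print(yearly.head())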

  16. Data from: A dataset of GitHub Actions workflow histories

    • data.niaid.nih.gov
    Updated Oct 25, 2024
    + more versions
    Cite
    Cardoen, Guillaume (2024). A dataset of GitHub Actions workflow histories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10259013
    Explore at:
    Dataset updated
    Oct 25, 2024
    Dataset authored and provided by
    Cardoen, Guillaume
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories" published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)

    Important notice: It looks like Zenodo compresses gzipped files a second time without notice, so they are "double compressed": when you download them they are named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.

    2024-10-25 update : updated repositories list and observation period. The filters relying on date were also updated.

    2024-07-09 update : fix sometimes invalid valid_yaml flag.

    The dataset was created as follows:

    First, we used GitHub SEART (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits and at least 100 stars, where at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)

    We checked whether a .github/workflows folder existed. We filtered out the repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).

    We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)

    We concatenated every file in /ourDataFolder/output into a csv (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.

    We added the column uid via a script available on GitHub.

    Finally, we archived the folder with pigz /ourDataFolder/workflows (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows)

    Using the extracted data, the following files were created:

    workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.

    workflows_auxiliaries.tar.gz is a similar file containing also auxiliary files.

    workflows.csv.gz contains the metadata for the extracted workflow files.

    workflows_auxiliaries.csv.gz is a similar file containing also metadata for auxiliary files.

    repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.

    The metadata is separated into different columns:

    repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" separates the author from the repository name

    commit_hash: The commit hash returned by git

    author_name: The name of the author that changed this file

    author_email: The email of the author that changed this file

    committer_name: The name of the committer

    committer_email: The email of the committer

    committed_date: The committed date of the commit

    authored_date: The authored date of the commit

    file_path: The path to this file in the repository

    previous_file_path: The path to this file before it has been touched

    file_hash: The name of the related workflow file in the dataset

    previous_file_hash: The name of the related workflow file in the dataset, before it has been touched

    git_change_type: A single letter (A, D, M or R) representing the type of change made to the workflow (Added, Deleted, Modified or Renamed). This letter is given by gitpython and provided as is.

    valid_yaml: A boolean indicating if the file is a valid YAML file.

    probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)

    valid_workflow: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.

    uid: Unique identifier for a given file surviving modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renaming does not change the identifier.

    Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
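
    A minimal sketch of loading the metadata and using the documented columns; note the Zenodo double-compression caveat above, and that the boolean columns may need explicit conversion depending on how they were serialized.

      import pandas as pd

      # Metadata for the extracted workflow files (rename to .gz if it arrived as .gz.gz)
      meta = pd.read_csv("workflows.csv.gz")

      # Keep only entries that respect the GitHub Actions workflow syntax
      valid = meta[meta["valid_workflow"] == True]  # column may arrive as bool or as string

      # Changes per repository and change type (A, D, M, R)
      print(valid.groupby(["repository", "git_change_type"]).size().head())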

  17. Google Play Store_Cleaned

    • kaggle.com
    Updated Mar 26, 2023
    Cite
    Yash (2023). Google Play Store_Cleaned [Dataset]. https://www.kaggle.com/datasets/yash16jr/google-play-store-cleaned
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yash
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is the cleaned-up version of the Google Play Store Data dataset available on Kaggle. The EDA and data cleaning were performed using Python.

  18. Representations of Sound and Music in the Middle Ages: Analysis and...

    • zenodo.org
    json
    Updated Mar 17, 2025
    Cite
    Xavier Fresquet; Xavier Fresquet; Frederic BILLIET; Frederic BILLIET; Edmundo Camacho; Edmundo Camacho (2025). Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database (Records and Performances) [Dataset]. http://doi.org/10.5281/zenodo.15037823
    Explore at:
    json (available download format)
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xavier Fresquet; Xavier Fresquet; Frederic BILLIET; Frederic BILLIET; Edmundo Camacho; Edmundo Camacho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of the study “Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database”, authored by Edmundo Camacho, Xavier Fresquet, and Frédéric Billiet.

    It contains structured descriptions of musical performances, performers, and instruments extracted from the Musiconis database (December 2024 version). This dataset does not include organological descriptions, which are available in a separate dataset.

    The Musiconis database provides a structured and interoperable framework for studying medieval music iconography. It enables investigations into:

    • The evolution and spread of musical instruments across Europe and the Mediterranean.

    • Performer typologies and their representation in medieval art.

    • The relationships between musical practices and social or religious contexts.

    Contents:

    Musiconis Dataset (JSON format, December 2024 version):

    • Musical scenes and their descriptions

    • Performer metadata (roles, social status, gender, interactions)

    • Instrument classifications (without detailed organological descriptions)

    Colab Notebook (Python):

    • Data processing and structuring

    • Visualization of performer distributions and instrument usage

    • Exploratory statistics and mapping

    Tools Used:

    • Python (Pandas, Seaborn, Matplotlib, Plotly)

    • Statistical and exploratory data analysis

    • Visualization of instrument distributions, performer interactions, and musical context

  19. Bandwidth Measurement for Conference Call Data Usage

    • data.mendeley.com
    Updated Aug 25, 2020
    Cite
    Dikamsiyochi Young UDOCHI (2020). Bandwidth Measurement for Conference Call Data Usage [Dataset]. http://doi.org/10.17632/8sp4nxj8m3.1
    Explore at:
    Dataset updated
    Aug 25, 2020
    Authors
    Dikamsiyochi Young UDOCHI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    During the first half of 2020, the COVID-19 pandemic shifted social gatherings to online business and social interaction. The worldwide travel bans and national lockdowns prevented social gatherings, prompting learning institutions and businesses to adopt online platforms for learning and business transactions. This development led to the incorporation of video conferencing into daily activities. This data article presents broadband data usage measurements collected using the Glasswire software on various conference calls made between July and August. The services considered in this work are Google Meet, Zoom, Mixir, and Hangout. The data were recorded in Microsoft Excel 2016, running on a personal computer, then cleaned and processed using Google Colaboratory, which runs Python scripts in the browser. Exploratory data analysis is conducted on the dataset using linear regression to build a predictive model and assess which service offers the best quality of service for online video and voice conferencing. The data are useful to learning institutions running online programmes and to learners accessing online programmes in smart cities and developing countries. The data are presented in tables and graphs.
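
    A minimal sketch of the regression idea described above; the file name and column names (service, duration_min, data_used_mb) are assumptions about how the Glasswire measurements were recorded.

      import pandas as pd
      from sklearn.linear_model import LinearRegression

      # Hypothetical file and column names; adjust to the actual spreadsheet layout.
      usage = pd.read_excel("conference_call_usage.xlsx")

      # Fit one simple model per service: data consumed (MB) as a function of call duration (min)
      mb_per_minute = {}
      for service, grp in usage.groupby("service"):
          model = LinearRegression().fit(grp[["duration_min"]], grp["data_used_mb"])
          mb_per_minute[service] = model.coef_[0]

      print(mb_per_minute)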

  20. Data from: HadISDH land: gridded global monthly land surface humidity data...

    • catalogue.ceda.ac.uk
    • data-search.nerc.ac.uk
    Updated Aug 4, 2020
    + more versions
    Cite
    Kate M. Willett; Robert J. H. Dunn; Peter W. Thorne; Stephanie Bell; Michael de Podesta; David E. Parker; Philip D. Jones; Claude N. Williams Jr. (2020). HadISDH land: gridded global monthly land surface humidity data version 4.2.0.2019f [Dataset]. https://catalogue.ceda.ac.uk/uuid/3e9f387293294f3b8a850524fcfc0c9c
    Explore at:
    Dataset updated
    Aug 4, 2020
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    Kate M. Willett; Robert J. H. Dunn; Peter W. Thorne; Stephanie Bell; Michael de Podesta; David E. Parker; Philip D. Jones; Claude N. Williams Jr.
    License

    Non-Commercial Government Licence v2.0: http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/

    Time period covered
    Jan 1, 1973 - Dec 31, 2019
    Area covered
    Earth
    Variables measured
    time, latitude, longitude, month of year, air_temperature, relative_humidity, dew_point_depression, wet_bulb_temperature, dew_point_temperature, time period boundaries, and 40 more
    Description

    This is the 4.2.0.2019f version of the HadISDH (Integrated Surface Database Humidity) land data. These data are provided by the Met Office Hadley Centre. This version spans 1/1/1973 to 31/12/2019.

    The data are monthly gridded (5 degree by 5 degree) fields. Products are available for temperature and six humidity variables: specific humidity (q), relative humidity (RH), dew point temperature (Td), wet bulb temperature (Tw), vapour pressure (e), dew point depression (DPD). Data are provided in either NetCDF or ASCII format.
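
    A minimal xarray sketch for the NetCDF product; the file name, variable name, and coordinate names below are assumptions, so check ds.data_vars and ds.coords against the downloaded file.

      import numpy as np
      import xarray as xr

      # Hypothetical file and variable names for the specific-humidity (q) product.
      ds = xr.open_dataset("HadISDH.landq.4.2.0.2019f.nc")

      # Cosine-of-latitude weighted global mean of the monthly anomaly field
      weights = np.cos(np.deg2rad(ds["latitude"]))
      global_mean = ds["q_anoms"].weighted(weights).mean(dim=("latitude", "longitude"))
      print(global_mean)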

    This version extends the 4.1.0.2018f version to the end of 2019 and constitutes a minor update to HadISDH due to changing some of the code base from IDL to Python 3 and detecting and fixing various bugs in the process. These have led to small changes in regional and global average values and coverage. All other processing steps for HadISDH remain identical. Users are advised to read the update document in the Docs section for full details.

    As in previous years, the annual scrape of NOAA’s Integrated Surface Dataset for HadISD.3.1.0.2019f, which is the basis of HadISDH.land, has pulled through some historical changes to stations. This, and the additional year of data, results in small changes to station selection. There has been an issue with data for April 2015 whereby it is missing for most of the globe. This will hopefully be resolved by next year’s update. The homogeneity adjustments differ slightly due to sensitivity to the addition and loss of stations, historical changes to stations previously included and the additional 12 months of data.

    To keep informed about updates, news and announcements follow the HadOBS team on twitter @metofficeHadOBS.

    For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISDH blog: http://hadisdh.blogspot.co.uk/

    References:

    When using the dataset in a paper please cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference):

    Willett, K. M., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Parker, D. E., Jones, P. D., and Williams Jr., C. N.: HadISDH land surface multi-variable humidity and temperature record for climate monitoring, Clim. Past, 10, 1983-2006, doi:10.5194/cp-10-1983-2014, 2014.

    Dunn, R. J. H., et al. 2016: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geoscientific Instrumentation, Methods and Data Systems, 5, 473-491.

    Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704–708, doi:10.1175/2011BAMS3015.1

    We strongly recommend that you read these papers before making use of the data; more detail on the dataset can be found in an earlier publication:

    Willett, K. M., Williams Jr., C. N., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Jones, P. D., and Parker D. E., 2013: HadISDH: An updated land surface specific humidity product for climate monitoring. Climate of the Past, 9, 657-677, doi:10.5194/cp-9-657-2013.
