41 datasets found
  1. IMDb Top 4070: Explore the Cinema Data

    • kaggle.com
    Updated Aug 15, 2023
    Cite
    K.T.S. Prabhu (2023). IMDb Top 4070: Explore the Cinema Data [Dataset]. https://www.kaggle.com/datasets/ktsprabhu/imdb-top-4070-explore-the-cinema-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    K.T.S. Prabhu
    Description

    Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.

    What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.

    Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling us to uncover hidden patterns, trends, and themes within the realm of cinema.

    Note: The data was collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.
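
    The description mentions analysis with Pandas and Matplotlib; a minimal sketch of how such an exploration might start, assuming the dataset has been downloaded as a CSV (the file name and the Rating column name are assumptions, not documented here):

    ```python
    # Minimal EDA sketch; file name and "Rating" column name are assumptions.
    import pandas as pd
    import matplotlib.pyplot as plt

    movies = pd.read_csv("imdb_top_4070.csv")        # hypothetical file name
    print(movies.shape)                              # expect roughly 4070 rows, 20+ columns
    print(movies.describe(include="all").T.head())

    # Distribution of IMDb ratings (all titles were selected with rating > 7)
    movies["Rating"].plot.hist(bins=20)
    plt.xlabel("IMDb rating")
    plt.ylabel("Number of movies")
    plt.show()
    ```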

  2. Files Python

    • kaggle.com
    Updated Jan 13, 2024
    Cite
    Kunal Khurana (2024). Files Python [Dataset]. https://www.kaggle.com/kunalkhurana007/files-python/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kunal Khurana
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Kunal Khurana

    Released under MIT

    Contents

  3. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Complete datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and they are implemented using various packages and syntaxes. Thus, the implementation of a full suite of methods is generally out of reach of all except experienced data scientists. Moreover, imputation is often treated as a separate exercise from exploratory data analysis, but it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, that is Python-based and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
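
    ImputEHR itself is a graphical tool, but a rough sketch of the kind of imputation step it wraps, here using scikit-learn's KNN imputer on a small hypothetical EHR-style table, might look like this:

    ```python
    # Sketch of a simple imputation step on a hypothetical EHR-style table;
    # this only illustrates the underlying idea, not ImputEHR's own code.
    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer

    ehr = pd.DataFrame({
        "age": [54, 61, np.nan, 47],
        "bmi": [27.1, np.nan, 31.4, 22.8],
        "systolic_bp": [np.nan, 142, 128, 117],
    })

    imputer = KNNImputer(n_neighbors=2)
    ehr_imputed = pd.DataFrame(imputer.fit_transform(ehr), columns=ehr.columns)
    print(ehr_imputed)
    ```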

  4. Global iPhone Reviews Dataset

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Global iPhone Reviews Dataset [Dataset]. https://www.opendatabay.com/data/consumer/42533232-0299-4752-8408-4579f2251a34
    Explore at:
    Available download formats
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Reviews & Ratings
    Description

    This dataset provides customer reviews for Apple iPhones, sourced from Amazon. It is designed to facilitate in-depth analysis of user feedback, enabling insights into product sentiment, feature performance, and underlying discussion themes. The dataset is ideal for understanding customer satisfaction and market trends related to iPhone products.

    Columns

    • productAsin: Amazon's unique identifier for a product.
    • country: The country where the review was submitted.
    • date: The date the review was submitted.
    • isVerified: A boolean indicator showing if the reviewer is a verified purchaser. Approximately 93% of reviewers are verified.
    • ratingScore: The numerical rating given to the product, typically on a scale from 1 to 5.
    • reviewTitle: The title of the customer's review.
    • reviewDescription: The detailed text content of the review.
    • reviewUrl: The specific URL of the individual review.
    • reviewedIn: The particular product or category for which the review was left.
    • variant: If applicable, details of the specific product variant or version reviewed, such as 'Colour: Blue Size: 128 GB'.

    Distribution

    The dataset is typically provided in a CSV file format. While specific record counts are not available, data points related to verified purchasers indicate over 3,000 entries. The dataset's quality is rated as 5 out of 5.

    Usage

    This dataset is well-suited for various analytical projects, including:

    • Sentiment analysis: To determine overall sentiment and identify trends in customer opinions.
    • Feature analysis: To analyse user satisfaction with specific iPhone features.
    • Topic modelling: To discover underlying themes and common discussion points within customer reviews.
    • Exploratory Data Analysis (EDA): For initial investigations and pattern discovery.
    • Natural Language Processing (NLP) tasks: For text analysis and understanding.
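
    As a starting point for the sentiment-analysis use case, a small sketch using NLTK's VADER analyzer on the reviewDescription column listed above (the CSV file name is a placeholder):

    ```python
    # Sketch of a sentiment pass over the review text; file name is hypothetical,
    # column names follow the listing (reviewDescription, ratingScore).
    import pandas as pd
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")                   # one-time download
    reviews = pd.read_csv("iphone_reviews.csv")      # hypothetical file name

    sia = SentimentIntensityAnalyzer()
    reviews["sentiment"] = (
        reviews["reviewDescription"].fillna("").map(lambda t: sia.polarity_scores(t)["compound"])
    )

    # Compare text sentiment against the numeric rating
    print(reviews.groupby("ratingScore")["sentiment"].mean())
    ```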

    Coverage

    The dataset has a global regional coverage. While a specific time range for the reviews is not detailed, the dataset itself was listed on 08/06/2025.

    License

    CC0

    Who Can Use It

    • Data Scientists: For developing and applying machine learning models for sentiment analysis and topic modelling.
    • Product Managers: To gain insights into customer satisfaction and identify areas for product improvement.
    • Market Researchers: To understand market trends, competitor analysis, and consumer preferences for electronics.
    • Academics and Students: For research projects focused on consumer behaviour, text analysis, and data science.

    Dataset Name Suggestions

    • iPhone Customer Review Data
    • Apple iPhone Review Dataset
    • Smartphone User Feedback Data
    • Global iPhone Reviews
    • Amazon iPhone Review Data

    Attributes

    Original Data Source: Apple IPhone Customer Reviews

  5. Mobile Device Customer Feedback

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Mobile Device Customer Feedback [Dataset]. https://www.opendatabay.com/data/ai-ml/8496ac33-2bc1-4401-868d-3cc6c5369f16
    Explore at:
    Available download formats
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset is a valuable resource for conducting sentiment analysis, feature analysis, and topic modelling on customer reviews. It includes essential details such as product ASIN, country, and date, which aid in assessing customer trust and engagement. Each review features a numerical rating score, a concise review title, and a detailed description, offering insight into customer emotions and preferences. Additionally, the review URL, language/region of review, and product variant information enrich the analysis, enabling a deeper understanding of how different product versions resonate with consumers across various markets. This approach not only highlights customer sentiments but also reveals key insights that can inform product development and marketing strategies.

    • Columns

      • productAsin: A unique identifier for the product.
      • country: The location where the review was submitted.
      • date: The date when the review was submitted.
      • isVerified: A boolean flag indicating whether the reviewer is a verified purchaser.
      • ratingScore: The numerical score given by the reviewer, typically ranging from 1 to 5.
      • reviewTitle: A brief summary or headline for the review.
      • reviewDescription: The detailed feedback provided by the reviewer.
      • reviewUrl: A link to the full online review.
      • reviewedIn: The language or region in which the review was written.
      • variant: The specific version of the product that was reviewed.
    • Distribution

      The dataset supports sentiment and feature analysis of customer reviews. It contains 2,850 instances where the reviewer is a verified purchaser (93%), and 212 instances where they are not (7%).

      Rating scores show the following distribution:

      • 1.00 - 1.40: 587 reviews
      • 1.80 - 2.20: 171 reviews
      • 3.00 - 3.40: 239 reviews
      • 3.80 - 4.20: 461 reviews
      • 4.60 - 5.00: 1,604 reviews

      Regarding product variants, some notable examples include B09G9D8KRQ (31%) and B0BN72MLT2 (19%), with 50% falling into other variants. There are 789 unique product ASIN values and 1,255 unique review titles. Specific colour and size variants are also detailed, such as 'Colour: Blue Size: 128 GB' (10%) and 'Colour: Midnight Size: 128 GB' (8%), with 82% distributed among other variants. The dataset contains 2,461 unique values for the detailed review description.
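
      The summary statistics above (verified share, rating distribution, top variants) can be reproduced directly from the raw table; a small sketch, assuming the file has been downloaded as a CSV with the column names listed above:

      ```python
      # Sketch reproducing the summary statistics above; file name is a placeholder.
      import pandas as pd

      df = pd.read_csv("mobile_device_feedback.csv")             # hypothetical file name

      print(df["isVerified"].value_counts(normalize=True))       # ~93% verified vs ~7% not
      print(df["ratingScore"].value_counts().sort_index())       # distribution of 1-5 ratings
      print(df["variant"].value_counts(normalize=True).head(5))  # most common product variants
      ```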

    • Usage

      This dataset is ideal for conducting sentiment analysis, feature analysis, and topic modelling on customer reviews. It can be used to gauge customer trust and engagement, provide insights into customer emotions and preferences, and understand how different product versions resonate with consumers in various markets. The insights derived can directly inform and drive product development and marketing strategies.

    • Coverage

      The dataset offers global coverage and was listed on 17th June 2025. It is indicated to be of high quality (5/5) and is available as version 1.0.

    • License

      CC0

    • Who Can Use It

      • Data Scientists and Analysts: For sentiment analysis, feature analysis, and topic modelling on customer feedback.
      • Product Developers: To understand customer preferences and drive product improvements.
      • Marketing Strategists: To tailor marketing campaigns based on customer sentiments and engagement.
      • Researchers: For academic studies on consumer behaviour and text analytics (NLP).
    • Dataset Name Suggestions

      • Customer Product Review Insights
      • Mobile Device Customer Feedback
      • NLP Product Sentiment Data
      • Global Customer Review Analysis
      • Verified Purchaser Reviews
    • Attributes

    Original Data Source: IPhone Customer Survey | NLP

  6. Regional YouTube Viral Content Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Regional YouTube Viral Content Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/34cfa60b-afac-4753-9409-bc00f9e8fbec
    Explore at:
    Available download formats
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    YouTube, Data Science and Analytics
    Description

    This dataset contains YouTube trending video statistics for various Mediterranean countries. Its primary purpose is to provide insights into popular video content, channels, and viewer engagement across the region over specific periods. It is valuable for analysing content trends, understanding regional audience preferences, and assessing video performance metrics on the YouTube platform.

    Columns

    • country: The nation where the video was published.
    • video_id: A unique identification number assigned to each video.
    • title: The name of the video.
    • publishedAt: The publication date of the video.
    • channelId: The unique identification number for the channel that published the video.
    • channelTitle: The name of the channel that published the video.
    • categoryId: The category identification number of the video (e.g., '10' for 'music').
    • trending_date: The date on which the video was observed to be trending.
    • tags: Keywords or phrases associated with the video.
    • view_count: The total number of views the video has accumulated.
    • comment_count: The total number of comments received on the video.
    • thumbnail_link: The URL for the image displayed before the video is played.
    • comments_disabled: A boolean indicator showing if comments are disabled for the video.
    • ratings_disabled: A boolean indicator showing if ratings (likes/dislikes) are disabled for the video.
    • description: The explanatory text provided below the video.

    Distribution

    The dataset is structured in a tabular format, typically provided as a CSV file. It consists of 15 distinct columns detailing various aspects of YouTube trending videos. While the exact total number of rows or records is not specified, the data includes trending video counts for several date ranges in 2022:

    • 06/04/2022 - 06/08/2022: 31 records
    • 06/08/2022 - 06/11/2022: 56 records
    • 06/11/2022 - 06/15/2022: 57 records
    • 06/15/2022 - 06/19/2022: 111 records
    • 06/19/2022 - 06/22/2022: 130 records
    • 06/22/2022 - 06/26/2022: 207 records
    • 06/26/2022 - 06/29/2022: 321 records
    • 06/29/2022 - 07/03/2022: 523 records
    • 07/03/2022 - 07/07/2022: 924 records
    • 07/07/2022 - 07/10/2022: 861 records

    The dataset features 19 unique countries and 1347 unique video IDs. View counts for videos in the dataset range from approximately 20.9 thousand to 123 million.

    Usage

    This dataset is well-suited for a variety of analytical applications and use cases:

    • Exploratory Data Analysis (EDA): Discovering patterns, anomalies, and relationships within YouTube trending content.
    • Data Manipulation and Querying: Practising data handling using libraries such as Pandas or Numpy in Python, or executing queries with SQL.
    • Natural Language Processing (NLP): Analysing video titles, tags, and descriptions to extract key themes, sentiment, and trending topics.
    • Trend Prediction: Developing models to forecast future trending videos or content categories.
    • Cross-Country Comparison: Examining how trending content varies across different Mediterranean nations.
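
    For the cross-country comparison use case, a minimal Pandas sketch (the CSV file name is a placeholder; column names follow the listing above):

    ```python
    # Minimal cross-country comparison sketch; file name is hypothetical.
    import pandas as pd

    videos = pd.read_csv("mediterranean_trending.csv",
                         parse_dates=["publishedAt", "trending_date"])

    # Which countries contribute the most trending videos, and how do views compare?
    summary = (
        videos.groupby("country")
              .agg(videos=("video_id", "nunique"), median_views=("view_count", "median"))
              .sort_values("videos", ascending=False)
    )
    print(summary.head(10))
    ```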

    Coverage

    • Geographic Scope: The dataset covers YouTube trending video statistics for 19 specific Mediterranean countries. These include Italy (IT), Spain (ES), Greece (GR), Croatia (HR), Turkey (TR), Albania (AL), Algeria (DZ), Egypt (EG), Libya (LY), Tunisia (TN), Morocco (MA), Israel (IL), Montenegro (ME), Lebanon (LB), France (FR), Bosnia and Herzegovina (BA), Malta (MT), Slovenia (SI), Cyprus (CY), and Syria (SY).
    • Time Range: The data primarily spans from 2022-06-04 to 2022-07-10, providing detailed daily trending information. A specific snapshot of the dataset is also available for 2022-11-07.

    License

    CC0

    Who Can Use It

    • Data Scientists and Analysts: For conducting in-depth research, building predictive models, and generating insights on social media trends.
    • Researchers: Those studying online content consumption patterns, regional cultural influences, and digital media behaviour.
    • Marketing Professionals: To identify popular content types, inform content strategy, and understand audience engagement on YouTube.
    • Students: For academic projects focusing on web data analysis, natural language processing, and statistical modelling.

    Dataset Name Suggestions

    • Mediterranean YouTube Trends 2022
    • YouTube Trending Videos: Mediterranean Insights
    • Regional YouTube Viral Content
    • Mediterranean Social Media Video Data
    • YouTube Trends in Southern Europe & North Africa

    Attributes

    Original Data Source: YouTube Trending Videos of the Day

  7. singapore

    • kaggle.com
    Updated Jul 30, 2020
    Cite
    saibharath (2020). singapore [Dataset]. https://www.kaggle.com/datasets/saibharath12/singapore/discussion?sort=undefined
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 30, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    saibharath
    Area covered
    Singapore
    Description

    This dataset contains the total population of Singapore broken down by ethnicity and gender. It is raw data with mixed entities in its columns. Population data is given for the years 1957 to 2018. The main aim in uploading this data is to get skilled in Python pandas for exploratory data analysis.

  8. Cyclistic Bike - Data Analysis (Python)

    • kaggle.com
    Updated Sep 25, 2024
    Cite
    Amirthavarshini (2024). Cyclistic Bike - Data Analysis (Python) [Dataset]. https://www.kaggle.com/datasets/amirthavarshini12/cyclistic-bike-data-analysis-python/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Amirthavarshini
    Description

    Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
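
    A brief sketch of the kind of EDA described above; the file name and column names (started_at, ended_at, member_casual) are assumptions about the Cyclistic export, not documented here:

    ```python
    # Sketch of peak-usage analysis; file and column names are assumptions.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    trips = pd.read_csv("cyclistic_trips.csv", parse_dates=["started_at", "ended_at"])
    trips["duration_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
    trips["start_hour"] = trips["started_at"].dt.hour

    # Peak usage times by rider type
    sns.countplot(data=trips, x="start_hour", hue="member_casual")
    plt.title("Trips by start hour")
    plt.show()
    ```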

  9. Replication Package for 'Data-Driven Analysis and Optimization of Machine...

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Cite
    Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño; Joel Castaño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

    This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.
    The framework considers the trade-offs between three key objectives:
    1. Performance (maximizing throughput)
    2. Energy Efficiency (minimizing estimated energy per unit)
    3. Cost (minimizing estimated hardware cost)

    Repository Structure

    This repository is organized as follows:
    • Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.
    • Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.
    • Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
    • Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
    • Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.
    • eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.
    • requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.
    • eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
    • optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.
    • pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
    • shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

    Requirements and Installation

    To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.
    1. Clone the repository:
    ```bash
    git clone
    cd
    ```
    2. Create and activate a virtual environment (optional but recommended):
    ```bash
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    ```
    3. Install the required packages. All dependencies are listed in the `requirements.txt` file; install them using pip:
    ```bash
    pip install -r requirements.txt
    ```

    Step-by-Step Reproduction Workflow

    The notebooks are designed to be run in a logical sequence.

    Step 1: Data Enrichment (Optional)

    The final enriched dataset (Inference_data_Extended.csv) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the Dataset_Extension.ipynb notebook. It will take Inference_data.csv as input and generate the extended version.

    Step 2: Exploratory Data Analysis (Optional)

    All plots from the EDA are pre-generated and available in the eda_plots/ directory. To regenerate them, run the Data_Analysis.ipynb notebook. This will overwrite the existing plots and the eda_log.txt file.

    Step 3: Main Model Training, Validation, and Recommendation

    This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:
    1. It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.
    2. It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.
    3. It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.
    4. It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
  10. Invoices Dataset

    • kaggle.com
    Updated Jan 18, 2022
    Cite
    Cankat Saraç (2022). Invoices Dataset [Dataset]. https://www.kaggle.com/datasets/cankatsrc/invoices/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 18, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Cankat Saraç
    License

    Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    The invoice dataset provided is a mock dataset generated using the Python Faker library. It has been designed to mimic the format of data collected from an online store. The dataset contains various fields, including first name, last name, email, product ID, quantity, amount, invoice date, address, city, and stock code. All of the data in the dataset is randomly generated and does not represent actual individuals or products. The dataset can be used for various purposes, including testing algorithms or models related to invoice management, e-commerce, or customer behavior analysis. The data in this dataset can be used to identify trends, patterns, or anomalies in online shopping behavior, which can help businesses to optimize their online sales strategies.
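
    A short sketch of how a mock invoice table like this can be generated with the Faker library; the field names below mirror those listed in the description, but the exact schema of the published file is an assumption:

    ```python
    # Sketch of generating mock invoice rows with Faker; field names are illustrative.
    import random
    import pandas as pd
    from faker import Faker

    fake = Faker()
    rows = [{
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "email": fake.email(),
        "product_id": fake.uuid4(),
        "quantity": random.randint(1, 5),
        "amount": round(random.uniform(5, 500), 2),
        "invoice_date": fake.date_between(start_date="-1y", end_date="today"),
        "address": fake.street_address(),
        "city": fake.city(),
        "stock_code": fake.bothify(text="??-#####"),
    } for _ in range(1000)]

    invoices = pd.DataFrame(rows)
    print(invoices.head())
    ```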

  11. Replication package for "An Exploratory Study on the Predominant Programming...

    • explore.openaire.eu
    Updated Mar 17, 2022
    Cite
    Robert Dyer; Jigyasa Chauhan (2022). Replication package for "An Exploratory Study on the Predominant Programming Paradigms in Python Code" [Dataset]. http://doi.org/10.5281/zenodo.6975558
    Explore at:
    Dataset updated
    Mar 17, 2022
    Authors
    Robert Dyer; Jigyasa Chauhan
    Description

    This dataset includes scripts and data files used to generate all analysis and results from the paper. A README.md file is included with details on using the scripts, though all of the data the scripts generate should already be cached and none of the scripts actually need to be run. It also includes a spreadsheet containing the human judgements from Table 4 of the paper. The most up-to-date source for the scripts is available on GitHub: https://github.com/psybers/python-paradigms

  12. Open Data Package for Article "Exploring Complexity Issues in Junior...

    • figshare.com
    xlsx
    Updated Jul 9, 2024
    Cite
    Arthur-Jozsef Molnar (2024). Open Data Package for Article "Exploring Complexity Issues in Junior Developer Code using Static Analysis and FCA" [Dataset]. http://doi.org/10.6084/m9.figshare.25729587.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Arthur-Jozsef Molnar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The present dataset includes the SonarQube issues uncovered as part of our exploratory research targeting code complexity issues in junior developer code written in the Python or Java programming languages. The dataset also includes the actual rule configurations and thresholds used for the Python and Java languages during source code analysis.

  13. Supplementary data on journal quartiles and citation indicators across...

    • zenodo.org
    png
    Updated Apr 13, 2025
    Cite
    Serhii Nazarovets; Serhii Nazarovets (2025). Supplementary data on journal quartiles and citation indicators across disciplines [Dataset]. http://doi.org/10.5281/zenodo.15206056
    Explore at:
    Available download formats: png
    Dataset updated
    Apr 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Serhii Nazarovets; Serhii Nazarovets
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides supplementary data extracted and processed from the SCImago Journal Rank portal (2023) and the Scopus Discontinued Titles list (February 2025). It includes journal-level metrics such as SJR and h-index, quartile assignments, and subject category information. The files are intended to support exploratory analysis of citation patterns, disciplinary variations, and structural characteristics of journal evaluation systems. The dataset also contains Python code and visual materials used to examine relationships between prestige metrics and cumulative citation indicators.

    Contents:

    • Scimago Journal Rank 2023.xlsx – full SJR dataset with quartile and h-index data.
    • Q1 journals with h-index below 5 (SJR 2023).xlsx – filtered subset of Q1 journals with low citation impact.
    • Relationship between journal h-index and SJR 2023.png – visualization of SJR vs h-index by quartile.
    • Scopus Discontinued Titles (Feb 2025) – list of discontinued sources from Scopus used for consistency checks.
    • Python script for data processing and visualization.
  14. Explore data formats and ingestion methods

    • kaggle.com
    Updated Feb 12, 2021
    Cite
    Gabriel Preda (2021). Explore data formats and ingestion methods [Dataset]. https://www.kaggle.com/datasets/gpreda/iris-dataset/discussion?sort=undefined
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gabriel Preda
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Why this Dataset

    This dataset brings to you Iris Dataset in several data formats (see more details in the next sections).

    You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that ingest all these formats.

    Iris Dataset

    Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.

    Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris

    Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/

    The file downloaded is iris.data and is formatted as a comma delimited file.

    This small data collection was created to help you test your skills with ingesting various data formats.

    Content

    This file was processed to convert the data into the following formats:

    • csv - comma separated values format
    • tsv - tab separated values format
    • parquet - parquet format
    • feather - feather format
    • parquet.gzip - compressed parquet format
    • h5 - hdf5 format
    • pickle - Python binary object file - pickle format
    • xlsx - Excel format
    • npy - Numpy (Python library) binary format
    • npz - Numpy (Python library) binary compressed format
    • rds - Rds (R specific data format) binary format
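
    A small sketch of ingesting a few of these formats with pandas; the exact file names on Kaggle are assumptions:

    ```python
    # Sketch of ingesting several formats with pandas; file names are assumptions.
    import pandas as pd

    iris_csv     = pd.read_csv("iris.csv")
    iris_tsv     = pd.read_csv("iris.tsv", sep="\t")
    iris_parquet = pd.read_parquet("iris.parquet")   # needs pyarrow or fastparquet
    iris_feather = pd.read_feather("iris.feather")
    iris_excel   = pd.read_excel("iris.xlsx")
    iris_pickle  = pd.read_pickle("iris.pickle")

    # All readers should return the same 150-row table
    for name, df in [("csv", iris_csv), ("parquet", iris_parquet), ("feather", iris_feather)]:
        print(name, df.shape)
    ```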

    Acknowledgements

    I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.

    Inspiration

    Use these data formats to test your skills in ingesting data in various formats.

  15. News Mining Dataset for Sentiment and Topic Analysis: 300K Articles...

    • zenodo.org
    zip
    Updated Apr 16, 2025
    Cite
    Hugo Veríssimo; Hugo Veríssimo (2025). News Mining Dataset for Sentiment and Topic Analysis: 300K Articles Extracted across 20 News Sources [Dataset]. http://doi.org/10.5281/zenodo.15231163
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hugo Veríssimo; Hugo Veríssimo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    This dataset was created using Arquivo.pt, the Portuguese web archive, as the primary source for extracting and analyzing news-related links. Over 3 million archived URLs were collected and processed, resulting in a curated collection of approximately 300,000 high-quality news articles from 20 different news sources.

    Each article has been processed to extract key information such as:

    • Publication date

    • News source

    • Mentioned topics

    • Sentiment analysis

    The dataset was built as part of a web-based application for relationship detection and exploratory analysis of news content. It can support research in areas such as natural language processing (NLP), computational journalism, network analysis, topic modeling, and sentiment tracking.

    All articles are in Portuguese, and the dataset is structured for easy use with tools like Python (e.g., Pandas, Spark) and machine learning workflows.

    Dataset Structure

    The dataset consists of two main folders:

    1. news/
      Contains all ~3 million processed URLs, organized by folders based on processing status:

      • success/ — articles successfully extracted

      • duplicated/ — duplicate content detected

      • not_news/ — filtered out as non-news

      • error/ — extraction or parsing failures
        Each subfolder contains JSON files, partitioned as outputted by Spark. These represent the raw extracted content.

    2. news_processed/
      Contains 8 Parquet files, which are partitions of a cleaned and enriched dataset with approximately 300,000 high-quality news articles. These include structured fields ready for analysis.
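
    Since the enriched articles are distributed as Parquet partitions, a minimal loading sketch with pandas (the folder name follows the structure above; column names are not documented here and would need to be inspected):

    ```python
    # Sketch of loading the enriched Parquet partitions; assumes the pyarrow engine,
    # which can read a directory of part-files as a single table.
    import pandas as pd

    news = pd.read_parquet("news_processed/")
    print(news.shape)                 # ~300,000 rows expected
    print(news.columns.tolist())      # inspect the available fields
    ```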

  16. Dataset for "Machine learning predictions on an extensive geotechnical...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 5, 2024
    Cite
    Soranzo, Enrico (2024). Dataset for "Machine learning predictions on an extensive geotechnical dataset of laboratory tests in Austria" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14251190
    Explore at:
    Dataset updated
    Dec 5, 2024
    Dataset authored and provided by
    Soranzo, Enrico
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Austria
    Description

    This dataset comprises over 20 years of geotechnical laboratory testing data collected primarily from Vienna, Lower Austria, and Burgenland. It includes 24 features documenting critical soil properties derived from particle size distributions, Atterberg limits, Proctor tests, permeability tests, and direct shear tests. Locations for a subset of samples are provided, enabling spatial analysis.

    The dataset is a valuable resource for geotechnical research and education, allowing users to explore correlations among soil parameters and develop predictive models. Examples of such correlations include liquidity index with undrained shear strength, particle size distribution with friction angle, and liquid limit and plasticity index with residual friction angle.

    Python-based exploratory data analysis and machine learning applications have demonstrated the dataset's potential for predictive modeling, achieving moderate accuracy for parameters such as cohesion and friction angle. Its temporal and spatial breadth, combined with repeated testing, enhances its reliability and applicability for benchmarking and validating analytical and computational geotechnical methods.

    This dataset is intended for researchers, educators, and practitioners in geotechnical engineering. Potential use cases include refining empirical correlations, training machine learning models, and advancing soil mechanics understanding. Users should note that preprocessing steps, such as imputation for missing values and outlier detection, may be necessary for specific applications.

    Key Features:

    Temporal Coverage: Over 20 years of data.

    Geographical Coverage: Vienna, Lower Austria, and Burgenland.

    Tests Included:

    Particle Size Distribution

    Atterberg Limits

    Proctor Tests

    Permeability Tests

    Direct Shear Tests

    Number of Variables: 24

    Potential Applications: Correlation analysis, predictive modeling, and geotechnical design.

    Technical Details:

    Missing values have been addressed using K-Nearest Neighbors (KNN) imputation, and anomalies identified using Local Outlier Factor (LOF) methods in previous studies.

    Data normalization and standardization steps are recommended for specific analyses.
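
    A rough sketch of the preprocessing mentioned above (KNN imputation followed by LOF outlier detection) with scikit-learn; the file name and column names are assumptions for illustration only:

    ```python
    # Sketch of KNN imputation + LOF outlier flagging; file and column names are assumptions.
    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.neighbors import LocalOutlierFactor

    soil = pd.read_csv("geotechnical_lab_tests.csv")                          # hypothetical file
    features = soil[["liquid_limit", "plasticity_index", "friction_angle"]]   # assumed columns

    imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(features),
                           columns=features.columns)

    lof = LocalOutlierFactor(n_neighbors=20)
    imputed["is_outlier"] = lof.fit_predict(imputed) == -1    # -1 marks detected outliers
    print(imputed["is_outlier"].sum(), "potential outliers flagged")
    ```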

    Acknowledgments: The dataset was compiled with support from the European Union's MSCA Staff Exchanges project 101182689 Geotechnical Resilience through Intelligent Design (GRID).

  17. Salaries of developers in Ukraine

    • kaggle.com
    Updated Nov 17, 2022
    Cite
    Mysha Rysh (2022). Salaries of developers in Ukraine [Dataset]. https://www.kaggle.com/datasets/mysha1rysh/salaries-of-developers-in-ukraine
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 17, 2022
    Dataset provided by
    Kaggle
    Authors
    Mysha Rysh
    Area covered
    Ukraine
    Description

    This data was collected by the team at https://dou.ua/. This resource is very popular in Ukraine: it provides salary statistics, shows current vacancies, and publishes useful articles related to the life of an IT specialist. This dataset was taken from the public repository https://github.com/devua/csv/tree/master/salaries. The dataset includes the following data for each developer: salary, position (e.g. Junior, Middle), experience, city, and tech (e.g. C#/.NET, JavaScript, Python). I think this dataset will be useful to our community. Thank you.

  18. Data from: A dataset of GitHub Actions workflow histories

    • data.niaid.nih.gov
    Updated Oct 25, 2024
    Cite
    Cardoen, Guillaume (2024). A dataset of GitHub Actions workflow histories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10259013
    Explore at:
    Dataset updated
    Oct 25, 2024
    Dataset authored and provided by
    Cardoen, Guillaume
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories" published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)

    Important notice: It appears that Zenodo compresses gzipped files a second time without notice, so they are "double compressed". When you download them they will be named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.

    2024-10-25 update : updated repositories list and observation period. The filters relying on date were also updated.

    2024-07-09 update : fix sometimes invalid valid_yaml flag.

    The dataset was created as follows:

    First, we used GitHub SEART (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits and at least 100 stars, where at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)

    We checked whether a .github/workflows folder existed. We filtered out repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).

    We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)

    We concatenated every file in /ourDataFolder/output into a CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.

    We added the column uid via a script available on GitHub.

    Finally, we archived the folder with pigz /ourDataFolder/workflows (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows)

    Using the extracted data, the following files were created :

    workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.

    workflows_auxiliaries.tar.gz is a similar file containing also auxiliary files.

    workflows.csv.gz contains the metadata for the extracted workflow files.

    workflows_auxiliaries.csv.gz is a similar file containing also metadata for auxiliary files.

    repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.

    The metadata is separated in different columns:

    repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" distinguishes the author from the repository name

    commit_hash: The commit hash returned by git

    author_name: The name of the author that changed this file

    author_email: The email of the author that changed this file

    committer_name: The name of the committer

    committer_email: The email of the committer

    committed_date: The committed date of the commit

    authored_date: The authored date of the commit

    file_path: The path to this file in the repository

    previous_file_path: The path to this file before it has been touched

    file_hash: The name of the related workflow file in the dataset

    previous_file_hash: The name of the related workflow file in the dataset, before it has been touched

    git_change_type: A single letter (A, D, M or R) representing the type of change made to the workflow (Added, Deleted, Modified or Renamed). This letter is given by gitpython and provided as is.

    valid_yaml: A boolean indicating if the file is a valid YAML file.

    probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)

    valid_workflow: A boolean indicating whether the file respects the GitHub Actions workflow syntax. A freely available JSON Schema (used by gigawork) was used for this check.

    uid: Unique identifier for a given file surviving modifications and renames. It is generated on the addition of the file and stays the same until the file is deleted. Renaming does not change the identifier.

    Both workflows.csv.gz and workflows_auxiliaries.csv.gz are following this format.
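
    A small sketch of loading the metadata CSV with pandas and splitting the repository column described above (rename any x.gz.gz files first, per the notice above):

    ```python
    # Sketch of loading the workflow metadata; column names follow the description above.
    import pandas as pd

    meta = pd.read_csv("workflows.csv.gz", compression="gzip")
    meta[["author", "repo_name"]] = meta["repository"].str.split("/", n=1, expand=True)

    # Share of valid workflow files and breakdown of change types (A / D / M / R)
    print(meta["valid_workflow"].value_counts(normalize=True))
    print(meta["git_change_type"].value_counts())
    ```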

  19. Real State Website Data

    • kaggle.com
    Updated Jun 11, 2023
    Cite
    M. Mazhar (2023). Real State Website Data [Dataset]. https://www.kaggle.com/datasets/mazhar01/real-state-website-data/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M. Mazhar
    License

    GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Check: End-to-End Regression Model Pipeline Development with FastAPI: From Data Scraping to Deployment with CI/CD Integration

    This CSV dataset provides comprehensive information about house prices. It consists of 9,819 entries and 54 columns, offering a wealth of features for analysis. The dataset includes various numerical and categorical variables, providing insights into factors that influence house prices.

    The key columns in the dataset are as follows:

    • Location1: The location of the house (the Location2 column is an identical or shorter version of Location1).
    • Year: The year of construction.
    • Type: The type of the house.
    • Bedrooms: The number of bedrooms in the house.
    • Bathrooms: The number of bathrooms in the house.
    • Size_in_SqYds: The size of the house in square yards.
    • Price: The price of the house.
    • Parking_Spaces: The number of parking spaces available.
    • Floors_in_Building: The number of floors in the building.
    • Elevators: The presence of elevators in the building.
    • Lobby_in_Building: The presence of a lobby in the building.

    In addition to these, the dataset contains several other features related to various amenities and facilities available in the houses, such as double-glazed windows, central air conditioning, central heating, waste disposal, furnished status, service elevators, and more.

    By performing exploratory data analysis on this dataset using Python and the Pandas library, valuable insights can be gained regarding the relationships between different variables and the impact they have on house prices. Descriptive statistics, data visualization, and feature engineering techniques can be applied to uncover patterns and trends in the housing market.
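
    A brief sketch of such an exploration with Pandas; the file name is a placeholder and the column names follow the list above:

    ```python
    # Brief EDA sketch; file name is hypothetical, columns follow the description.
    import pandas as pd

    houses = pd.read_csv("real_estate_listings.csv")

    # Median price by location and correlations among the main numeric features
    print(houses.groupby("Location1")["Price"].median().sort_values(ascending=False).head())
    print(houses[["Bedrooms", "Bathrooms", "Size_in_SqYds", "Price"]].corr())
    ```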

    This dataset serves as a valuable resource for real estate professionals, analysts, and researchers interested in understanding the factors that contribute to house prices and making informed decisions in the real estate market.

  20. Data from: HadISDH land: gridded global monthly land surface humidity data...

    • catalogue.ceda.ac.uk
    • data-search.nerc.ac.uk
    Updated Jun 29, 2020
    Cite
    Kate M. Willett; Robert J. H. Dunn; Peter W. Thorne; Stephanie Bell; Michael de Podesta; David E. Parker; Philip D. Jones; Claude N. Williams Jr. (2020). HadISDH land: gridded global monthly land surface humidity data version 4.2.0.2019f [Dataset]. https://catalogue.ceda.ac.uk/uuid/3e9f387293294f3b8a850524fcfc0c9c
    Explore at:
    Dataset updated
    Jun 29, 2020
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    Kate M. Willett; Robert J. H. Dunn; Peter W. Thorne; Stephanie Bell; Michael de Podesta; David E. Parker; Philip D. Jones; Claude N. Williams Jr.
    License

    Non-Commercial Government Licence (version 2): http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/

    Time period covered
    Jan 1, 1973 - Dec 31, 2019
    Area covered
    Earth
    Variables measured
    time, latitude, longitude, month of year, air_temperature, relative_humidity, dew_point_depression, wet_bulb_temperature, dew_point_temperature, time period boundaries, and 40 more
    Description

    This is the 4.2.0.2019f version of the HadISDH (Integrated Surface Database Humidity) land data. These data are provided by the Met Office Hadley Centre. This version spans 1/1/1973 to 31/12/2019.

    The data are monthly gridded (5 degree by 5 degree) fields. Products are available for temperature and six humidity variables: specific humidity (q), relative humidity (RH), dew point temperature (Td), wet bulb temperature (Tw), vapour pressure (e), dew point depression (DPD). Data are provided in either NetCDF or ASCII format.
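
    For the NetCDF products, a minimal sketch of opening one file with xarray; the file name, variable layout, and dimension names (latitude, longitude) are assumptions, not taken from the dataset documentation:

    ```python
    # Sketch of reading a gridded NetCDF product; file name and dimension names are assumptions.
    import xarray as xr

    ds = xr.open_dataset("HadISDH.landq.4.2.0.2019f.nc")   # hypothetical file name
    print(ds)                                              # 5 degree monthly grids, 1973-2019

    # Area-unweighted global monthly mean of the first data variable in the file
    var = list(ds.data_vars)[0]
    print(ds[var].mean(dim=["latitude", "longitude"]))
    ```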

    This version extends the 4.1.0.2018f version to the end of 2019 and constitutes a minor update to HadISDH due to changing some of the code base from IDL to Python 3 and detecting and fixing various bugs in the process. These have led to small changes in regional and global average values and coverage. All other processing steps for HadISDH remain identical. Users are advised to read the update document in the Docs section for full details.

    As in previous years, the annual scrape of NOAA’s Integrated Surface Dataset for HadISD.3.1.0.2019f, which is the basis of HadISDH.land, has pulled through some historical changes to stations. This, and the additional year of data, results in small changes to station selection. There has been an issue with data for April 2015 whereby it is missing for most of the globe. This will hopefully be resolved by next year’s update. The homogeneity adjustments differ slightly due to sensitivity to the addition and loss of stations, historical changes to stations previously included and the additional 12 months of data.

    To keep informed about updates, news and announcements follow the HadOBS team on twitter @metofficeHadOBS.

    For more detailed information e.g bug fixes, routine updates and other exploratory analysis, see the HadISDH blog: http://hadisdh.blogspot.co.uk/

    References:

    When using the dataset in a paper please cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference):

    Willett, K. M., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Parker, D. E., Jones, P. D., and Williams Jr., C. N.: HadISDH land surface multi-variable humidity and temperature record for climate monitoring, Clim. Past, 10, 1983-2006, doi:10.5194/cp-10-1983-2014, 2014.

    Dunn, R. J. H., et al. 2016: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geoscientific Instrumentation, Methods and Data Systems, 5, 473-491.

    Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704–708, doi:10.1175/2011BAMS3015.1

    We strongly recommend that you read these papers before making use of the data, more detail on the dataset can be found in an earlier publication:

    Willett, K. M., Williams Jr., C. N., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Jones, P. D., and Parker D. E., 2013: HadISDH: An updated land surface specific humidity product for climate monitoring. Climate of the Past, 9, 657-677, doi:10.5194/cp-9-657-2013.
