MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Saubhagya Mishra
Released under MIT
Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.
What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.
Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling us to uncover hidden patterns, trends, and themes within the realm of cinema.
Note: The data was collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.
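As a rough illustration of the selection criteria described above, the sketch below filters a scraped IMDb table with pandas. The file name and the column names (`rating`, `votes`) are assumptions for illustration, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to the actual schema.
movies = pd.read_csv("imdb_movies.csv")

# Keep titles rated above 7 with more than 10,000 votes,
# mirroring the selection criteria described above.
gems = movies[(movies["rating"] > 7) & (movies["votes"] > 10_000)]
print(f"{len(gems)} movies match the criteria")
```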
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises over 20 years of geotechnical laboratory testing data collected primarily from Vienna, Lower Austria, and Burgenland. It includes 24 features documenting critical soil properties derived from particle size distributions, Atterberg limits, Proctor tests, permeability tests, and direct shear tests. Locations for a subset of samples are provided, enabling spatial analysis.
The dataset is a valuable resource for geotechnical research and education, allowing users to explore correlations among soil parameters and develop predictive models. Examples of such correlations include liquidity index with undrained shear strength, particle size distribution with friction angle, and liquid limit and plasticity index with residual friction angle.
Python-based exploratory data analysis and machine learning applications have demonstrated the dataset's potential for predictive modeling, achieving moderate accuracy for parameters such as cohesion and friction angle. Its temporal and spatial breadth, combined with repeated testing, enhances its reliability and applicability for benchmarking and validating analytical and computational geotechnical methods.
This dataset is intended for researchers, educators, and practitioners in geotechnical engineering. Potential use cases include refining empirical correlations, training machine learning models, and advancing soil mechanics understanding. Users should note that preprocessing steps, such as imputation for missing values and outlier detection, may be necessary for specific applications.
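A minimal sketch of the preprocessing and predictive-modeling workflow mentioned above is given below. The file name and column names (`liquid_limit`, `plasticity_index`, `fines_content`, `friction_angle`) are assumptions for illustration; the actual dataset schema may differ.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical file and column names -- adapt to the real schema.
df = pd.read_csv("geotechnical_lab_tests.csv")
features = ["liquid_limit", "plasticity_index", "fines_content"]
target = "friction_angle"

# Drop rows without a target value, keep features with gaps for imputation.
data = df.dropna(subset=[target])
X, y = data[features], data[target]

# Impute remaining missing feature values, then fit a tree-based regressor.
model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestRegressor(n_estimators=200, random_state=0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Cross-validated R^2:", scores.mean())
```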
Key Features:
Technical Details:
Acknowledgments:
The dataset was compiled with support from the European Union's MSCA Staff Exchanges project 101182689 Geotechnical Resilience through Intelligent Design (GRID).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and these methods are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered a separate exercise from exploratory data analysis, but it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is implemented in Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
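The abstract does not expose ImputEHR's own API, so the following is only a generic sketch of the gradient-boosted, tree-based imputation family it refers to, using scikit-learn's IterativeImputer. The file name and the assumption that the relevant columns are numeric are illustrative only.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

# Placeholder EHR extract with missing values.
ehr = pd.read_csv("ehr_extract.csv")
numeric = ehr.select_dtypes("number")

# Iterative imputation: each column with gaps is modeled from the others
# using a gradient-boosted tree regressor.
imputer = IterativeImputer(estimator=HistGradientBoostingRegressor(),
                           max_iter=10, random_state=0)
completed = pd.DataFrame(imputer.fit_transform(numeric),
                         columns=numeric.columns, index=numeric.index)
print(completed.isna().sum().sum(), "missing values remain")
```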
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Kunal Khurana
Released under MIT
Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
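A minimal sketch of the kind of EDA described above is shown below, assuming hypothetical columns `started_at`, `ended_at`, and `member_casual`; the actual Cyclistic export may use different names.

```python
import pandas as pd

# Hypothetical file and column names for a Cyclistic trip export.
trips = pd.read_csv("cyclistic_trips.csv", parse_dates=["started_at", "ended_at"])

# Trip duration in minutes, plus peak start hour by rider type.
trips["duration_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
peak_hours = (trips.assign(hour=trips["started_at"].dt.hour)
                   .groupby(["member_casual", "hour"]).size()
                   .groupby(level=0).idxmax())
print(peak_hours)
print(trips.groupby("member_casual")["duration_min"].median())
```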
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This is the dataset required for Keith Galli's 'Solving real world data science tasks with Python Pandas!' video, in which he analyzes and answers business questions for 12 months' worth of business data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc.
I decided to upload the data here so that I can carry out the exercise straight in Kaggle Notebooks, making it ready for viewing as a portfolio project.
12 .csv files containing sales data for each month of 2019.
Of course, all thanks go to Keith Galli and the great work he does with his tutorials. He has several other amazing tutorials that you can follow, and you can subscribe to his channel.
To explore and learn more about Multiple Linear Regression.
The dataset consists of house prices across the USA. It has the following columns:
- Avg. Area Income: Average income of the area where the house is located.
- House Age: Age of the house in years.
- Number of Rooms
- Number of Bedrooms
- Area Population: Population of the area where the house is located.
- Price
- Address: The only textual column in the dataset, consisting of the address of the house.
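Given the columns listed above, a multiple linear regression can be sketched as follows. The file name `USA_Housing.csv` is an assumption, and the column labels are taken to match the CSV header exactly; adjust as needed.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumed file name -- replace with the actual CSV path.
housing = pd.read_csv("USA_Housing.csv")

# Numerical predictors only; 'Address' is textual and excluded.
X = housing.drop(columns=["Price", "Address"])
y = housing["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```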
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Python scripts with instructions for the extraction and transformation of the original datasets; transformed datasets; dataset FA/LCB constraints.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
- `Data_Analysis.ipynb`: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the `eda_plots/` directory.
- `Dataset_Extension.ipynb`: A Jupyter Notebook used for the data enrichment process. It takes the raw `Inference_data.csv` and produces `Inference_data_Extended.csv` by adding detailed hardware specifications, cost estimates, and derived energy metrics.
- `Optimization_Model.ipynb`: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
- `Inference_data.csv`: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
- `Inference_data_Extended.csv`: The final, enriched dataset used for all analysis and modeling. This is the output of the `Dataset_Extension.ipynb` notebook.
- `eda_log.txt`: A text log file containing summary statistics generated during the exploratory data analysis.
- `requirements.txt`: A list of all necessary Python libraries and their versions required to run the code in this repository.
- `eda_plots/`: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
- `optimization_models_final/`: A directory where the trained and saved final model files (`.joblib`) are stored after running the optimization notebook.
- `pareto_validation_plot_fold_0.png`: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
- `shap_waterfall_final_model.png`: The SHAP plot used for the model interpretability analysis, as presented in the thesis.
```bash
git clone <repository-url>
cd <repository-directory>
```

```bash
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

```bash
pip install -r requirements.txt
```
The enriched dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the **`Dataset_Extension.ipynb`** notebook. It will take `Inference_data.csv` as input and generate the extended version.
The EDA plots are already available in the `eda_plots/` directory. To regenerate them, run the **`Data_Analysis.ipynb`** notebook. This will overwrite the existing plots and the `eda_log.txt` file.
Running the **`Optimization_Model.ipynb`** notebook will execute the entire pipeline described in the paper: the trained models are saved to the `optimization_models_final/` directory, and the result figures `pareto_validation_plot_fold_0.png` and `shap_waterfall_final_model.png` are generated.
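For orientation only, the 5-fold cross-validation step could look roughly like the sketch below; the feature and target column names are placeholders, not the actual schema of `Inference_data_Extended.csv` (consult the notebook for the real ones).

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

df = pd.read_csv("Inference_data_Extended.csv")

# Placeholder column names -- replace with the columns used in the notebook.
features = ["accelerator_count", "cost_usd", "power_watts"]
target = "throughput"

# 5-fold cross-validation of a gradient-boosted predictive model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingRegressor(), df[features], df[target],
                         cv=cv, scoring="r2")
print("Fold R^2 scores:", scores)
```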
This dataset was created by Rahul Sharma
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The present dataset includes the SonarQube issues uncovered as part of our exploratory research targeting code complexity issues in junior developer code written in the Python or Java programming languages. The dataset also includes the actual rule configurations and thresholds used for the Python and Java languages during source code analysis.
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset brings you the Iris Dataset in several data formats (see more details in the next sections).
You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that read all these formats:
Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.
Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris
Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
The file downloaded is iris.data and is formatted as a comma delimited file.
This small data collection was created to help you test your skills with ingesting various data formats.
This file was processed to convert the data in the following formats:
* csv - comma separated values format
* tsv - tab separated values format
* parquet - parquet format
* feather - feather format
* parquet.gzip - compressed parquet format
* h5 - hdf5 format
* pickle - Python binary object file - pickle format
* xlsx - Excel format
* npy - Numpy (Python library) binary format
* npz - Numpy (Python library) binary compressed format
* rds - Rds (R specific data format) binary format
I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.
Use these data formats to test your skills in ingesting data in various formats.
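As a starting point, a Python sketch for ingesting several of these formats is shown below; the file names are assumed to follow an iris.<extension> pattern and may differ from the actual uploads.

```python
import numpy as np
import pandas as pd

# Each reader targets one of the provided formats; adjust paths as needed.
iris_csv = pd.read_csv("iris.csv")
iris_tsv = pd.read_csv("iris.tsv", sep="\t")
iris_parquet = pd.read_parquet("iris.parquet")
iris_feather = pd.read_feather("iris.feather")
iris_pickle = pd.read_pickle("iris.pickle")
iris_xlsx = pd.read_excel("iris.xlsx")
iris_npy = np.load("iris.npy", allow_pickle=True)

print(iris_csv.head())
```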
An exploratory analysis of multiple machine learning models for predicting end-of-treatment acute pain intensity, opioid doses (represented as the total morphine equivalent daily dose (MEDD)), and analgesic efficacy in a large-scale retrospective cohort of oral cavity and oropharyngeal cancer patients who received radiation therapy (RT).
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This is the first public release of the RICardo dataset under the ODbL v1.0 licence. This dataset is precisely described under the data package format.
This release includes 368,871 bilateral or total trade flows from 1787 to 1938 for 373 reporting entities. It also contains the Python scripts used to compile and filter the flows that feed our exploratory data analysis online tool.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)
Important notice: Zenodo appears to compress gzipped files a second time without notice, so they are "double compressed". When you download them, they will be named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.
2024-10-25 update: updated the repository list and observation period. The filters relying on dates were also updated.
2024-07-09 update: fixed an occasionally invalid valid_yaml flag.
The dataset was created as follows:
First, we used GitHub SEART (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits and at least 100 stars, where at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)
We checked whether a .github/workflows folder existed. We filtered out the repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).
We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)
We concatenated every file in /ourDataFolder/output into a CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.
We added the column uid via a script available on GitHub.
Finally, we archived the /ourDataFolder/workflows folder with pigz (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).
Using the extracted data, the following files were created:
workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.
workflows_auxiliaries.tar.gz is a similar file containing also auxiliary files.
workflows.csv.gz contains the metadata for the extracted workflow files.
workflows_auxiliaries.csv.gz is a similar file containing also metadata for auxiliary files.
repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.
The metadata is separated into different columns:
repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" allows distinguishing between the author and the repository name
commit_hash: The commit hash returned by git
author_name: The name of the author that changed this file
author_email: The email of the author that changed this file
committer_name: The name of the committer
committer_email: The email of the committer
committed_date: The committed date of the commit
authored_date: The authored date of the commit
file_path: The path to this file in the repository
previous_file_path: The path to this file before it has been touched
file_hash: The name of the related workflow file in the dataset
previous_file_hash: The name of the related workflow file in the dataset, before it has been touched
git_change_type: A single letter (A, D, M, or R) representing the type of change made to the workflow (Added, Deleted, Modified, or Renamed). This letter is given by gitpython and provided as is.
valid_yaml: A boolean indicating if the file is a valid YAML file.
probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)
valid_workflow: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
uid: A unique identifier for a given file that survives modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renaming does not change the identifier.
Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
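A minimal sketch of loading the metadata and keeping only syntactically valid workflow files is given below (pandas reads gzipped CSVs directly; remember the double-compression caveat above). It assumes the boolean column parses as True/False values.

```python
import pandas as pd

# pandas infers gzip compression from the .gz extension.
meta = pd.read_csv("workflows.csv.gz")

# Keep entries flagged as valid GitHub Actions workflows
# (assuming valid_workflow is parsed as a boolean column).
valid = meta[meta["valid_workflow"] == True]

# Number of distinct commits touching workflows, per repository.
changes_per_repo = valid.groupby("repository")["commit_hash"].nunique()
print(changes_per_repo.sort_values(ascending=False).head(10))
```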
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is the cleaned-up version of the Google Play Store Data dataset, available on Kaggle. The EDA and data cleaning were performed using Python.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the study “Representations of Sound and Music in the Middle Ages: Analysis and Visualization of the Musiconis Database”, authored by Edmundo Camacho, Xavier Fresquet, and Frédéric Billiet.
It contains structured descriptions of musical performances, performers, and instruments extracted from the Musiconis database (December 2024 version). This dataset does not include organological descriptions, which are available in a separate dataset.
The Musiconis database provides a structured and interoperable framework for studying medieval music iconography. It enables investigations into:
• The evolution and spread of musical instruments across Europe and the Mediterranean.
• Performer typologies and their representation in medieval art.
• The relationships between musical practices and social or religious contexts.
Contents:
• Musiconis Dataset (JSON format, December 2024 version):
• Musical scenes and their descriptions
• Performer metadata (roles, social status, gender, interactions)
• Instrument classifications (without detailed organological descriptions)
• Colab Notebook (Python):
• Data processing and structuring
• Visualization of performer distributions and instrument usage
• Exploratory statistics and mapping
Tools Used:
• Python (Pandas, Seaborn, Matplotlib, Plotly)
• Statistical and exploratory data analysis
• Visualization of instrument distributions, performer interactions, and musical context
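As an illustration only (the JSON schema is not described here, so the file and field names below are hypothetical), a Colab-style cell might count instrument occurrences like this:

```python
import json
import pandas as pd
import seaborn as sns

# Hypothetical structure: a list of scene records, each with an "instruments" list.
with open("musiconis_2024-12.json", encoding="utf-8") as fh:
    scenes = json.load(fh)

records = pd.json_normalize(scenes)
instrument_counts = records.explode("instruments")["instruments"].value_counts()

# Bar plot of the 15 most frequent instrument labels.
top = instrument_counts.head(15)
sns.barplot(x=top.values, y=top.index)
```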
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
During the first half of 2020, the COVID-19 pandemic shifted social gatherings to online business and social interaction. The travel bans imposed worldwide and national lockdowns prevented social gatherings, forcing learning institutions and businesses to adopt online platforms for learning and business transactions. This development led to the incorporation of video conferencing into daily activities. This data article presents broadband data usage measurements collected using Glasswire software on various conference calls made between July and August. The services considered in this work are Google Meet, Zoom, Mixir, and Hangout. The data were recorded in Microsoft Excel 2016, running on a personal computer. The data was cleaned and processed using Google Colaboratory, which runs Python scripts in the browser. Exploratory data analysis is conducted on the dataset using linear regression to build a predictive model and assess which service offers the best quality of service for online video and voice conferencing. The data is useful to learning institutions running online programs and to learners accessing online programs in smart cities and developing countries. The data is presented in tables and graphs.
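A rough sketch of the linear-regression step described above is shown below. The spreadsheet name and the columns `service`, `duration_min`, and `data_used_mb` are assumptions for illustration; the actual layout of the recorded data may differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical Excel sheet of per-call measurements.
usage = pd.read_excel("conference_call_usage.xlsx")

# Model data consumed as a function of call duration, separately per service.
for service, group in usage.groupby("service"):
    model = LinearRegression().fit(group[["duration_min"]], group["data_used_mb"])
    print(service, "MB per minute:", round(model.coef_[0], 2))
```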
Non-Commercial Government Licence (version 2): http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/
This is the 4.2.0.2019f version of the HadISDH (Integrated Surface Database Humidity) land data. These data are provided by the Met Office Hadley Centre. This version spans 1/1/1973 to 31/12/2019.
The data are monthly gridded (5 degree by 5 degree) fields. Products are available for temperature and six humidity variables: specific humidity (q), relative humidity (RH), dew point temperature (Td), wet bulb temperature (Tw), vapour pressure (e), dew point depression (DPD). Data are provided in either NetCDF or ASCII format.
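For the NetCDF product, a gridded field can be read with xarray, for example as sketched below; the file name and variable name are illustrative only, so consult the HadISDH product documentation for the exact ones.

```python
import xarray as xr

# Illustrative file name -- see the HadISDH download pages for the real products.
ds = xr.open_dataset("HadISDH.landq.4.2.0.2019f.nc")
print(ds)  # lists the gridded variables, coordinates, and time span

# Select one month and plot, assuming a variable named "q_anoms" exists
# on the 5 degree by 5 degree grid.
ds["q_anoms"].sel(time="2019-12").plot()
```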
This version extends the 4.1.0.2018f version to the end of 2019 and constitutes a minor update to HadISDH due to changing some of the code base from IDL to Python 3 and detecting and fixing various bugs in the process. These have led to small changes in regional and global average values and coverage. All other processing steps for HadISDH remain identical. Users are advised to read the update document in the Docs section for full details.
As in previous years, the annual scrape of NOAA’s Integrated Surface Dataset for HadISD.3.1.0.2019f, which is the basis of HadISDH.land, has pulled through some historical changes to stations. This, and the additional year of data, results in small changes to station selection. There has been an issue with data for April 2015 whereby it is missing for most of the globe. This will hopefully be resolved by next year’s update. The homogeneity adjustments differ slightly due to sensitivity to the addition and loss of stations, historical changes to stations previously included and the additional 12 months of data.
To keep informed about updates, news and announcements, follow the HadOBS team on Twitter: @metofficeHadOBS.
For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISDH blog: http://hadisdh.blogspot.co.uk/
References:
When using the dataset in a paper please cite the following papers (see Docs for link to the publications) and this dataset (using the "citable as" reference):
Willett, K. M., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Parker, D. E., Jones, P. D., and Williams Jr., C. N.: HadISDH land surface multi-variable humidity and temperature record for climate monitoring, Clim. Past, 10, 1983-2006, doi:10.5194/cp-10-1983-2014, 2014.
Dunn, R. J. H., et al. 2016: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geoscientific Instrumentation, Methods and Data Systems, 5, 473-491.
Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704–708, doi:10.1175/2011BAMS3015.1
We strongly recommend that you read these papers before making use of the data, more detail on the dataset can be found in an earlier publication:
Willett, K. M., Williams Jr., C. N., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Jones, P. D., and Parker D. E., 2013: HadISDH: An updated land surface specific humidity product for climate monitoring. Climate of the Past, 9, 657-677, doi:10.5194/cp-9-657-2013.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Saubhagya Mishra
Released under MIT