7 datasets found
  1. Assessing Computational Notebook Understandability through Code Metrics Analysis

    • zenodo.org
    zip
    Updated Oct 12, 2023
    Cite
    Mojtaba Mostafavi (2023). Assessing Computational Notebook Understandability through Code Metrics Analysis [Dataset]. http://doi.org/10.5281/zenodo.8435192
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mojtaba Mostafavi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Computational notebooks have become the primary coding environment for data scientists. Despite their popularity, research on the code quality of these notebooks is still in its infancy, and the code shared in these notebooks is often of poor quality. Considering the importance of maintenance and reusability, it is crucial to pay attention to the comprehension of notebook code and to identify the notebook metrics that play a significant role in its comprehension. The level of code comprehension is a qualitative variable closely associated with the user's opinion about the code. Previous studies have typically employed two approaches to measure it. One approach involves using limited questionnaire methods to review a small number of code pieces. The other relies solely on metadata, such as the number of likes and user votes for a project in the software repository. In our approach, we enhanced the measurement of the understandability level of notebook code by leveraging user comments within a software repository. As a case study, we started with 248,761 Kaggle Jupyter notebooks introduced in previous studies and their relevant metadata. To identify user comments associated with code comprehension within the notebooks, we utilized a fine-tuned DistilBERT transformer. We established a user-comment-based criterion for measuring code understandability by considering the number of code-understandability-related comments, the upvotes on those comments, the total views of the notebook, and the total upvotes received by the notebook. This criterion has proven to be more effective than alternative methods, making it the ground truth for evaluating the code comprehension of our notebook set. In addition, we collected a total of 34 metrics for 10,857 notebooks, categorized as script-based and notebook-based metrics. These metrics were utilized as features in our dataset. Using the Random Forest classifier, our predictive model achieved 85% accuracy in predicting code comprehension levels in computational notebooks, identifying developer expertise and markdown-based metrics as key factors.
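
    A rough sketch of how the released metrics could feed the Random Forest step described above might look as follows; the file name notebook_metrics.csv and the label column understandability are illustrative assumptions, not names taken from the archive.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hypothetical file and column names; check the actual archive contents after downloading.
    df = pd.read_csv("notebook_metrics.csv")
    X = df.drop(columns=["understandability"])   # the 34 script- and notebook-based metrics
    y = df["understandability"]                   # label derived from the comment-based criterion

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model = RandomForestClassifier(n_estimators=300, random_state=42)
    model.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))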

  2. House Prices: Advanced Regression 'solution' file

    • kaggle.com
    Updated Sep 11, 2020
    Cite
    Carl McBride Ellis (2020). House Prices: Advanced Regression 'solution' file [Dataset]. https://www.kaggle.com/carlmcbrideellis/house-prices-advanced-regression-solution-file/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 11, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    One of the most popular competitions on Kaggle is House Prices: Advanced Regression Techniques. The original data comes from the publication by Dean De Cock, "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project", Journal of Statistics Education, Volume 19, Number 3 (2011). Recently a 'demonstration' notebook, "First place is meaningless in this way!", was published that extracts the 'solution' from the full dataset. Now that the 'solution' is readily available, people can reproduce the competition at home without any daily submission limit. This opens up the possibility of experimenting with advanced techniques such as pipelines with various estimators/models in the same notebook, extensive hyper-parameter tuning, etc., all without the risk of 'upsetting' the public leaderboard. Simply download this solution.csv file, import it into your script or notebook, and evaluate the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the data in this file.
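
    A minimal sketch of that evaluation, assuming the usual competition column names ("Id", "SalePrice") and a hypothetical local prediction file my_submission.csv:

    import numpy as np
    import pandas as pd

    # solution.csv comes from this dataset; my_submission.csv is your own (hypothetical) prediction file.
    # Column names follow the competition convention ("Id", "SalePrice"); verify them in the files.
    solution = pd.read_csv("solution.csv")
    submission = pd.read_csv("my_submission.csv")

    merged = submission.merge(solution, on="Id", suffixes=("_pred", "_true"))
    rmse = np.sqrt(np.mean((np.log(merged["SalePrice_pred"]) - np.log(merged["SalePrice_true"])) ** 2))
    print(f"log RMSE: {rmse:.5f}")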

    Content

    This dataset is the submission.csv file that will produce a public leaderboard score of 0.00000.

    Acknowledgements

  3. Bland-Altman Analysis

    • kaggle.com
    Updated Oct 6, 2020
    Cite
    Marília Prata (2020). Bland-Altman Analysis [Dataset]. https://www.kaggle.com/mpwolke/cusersmarildownloadsaltmancsv/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Kaggle
    Authors
    Marília Prata
    Description

    Context

    In 1983 Altman and Bland (B&A) proposed an alternative analysis, based on the quantification of the agreement between two quantitative measurements by studying the mean difference and constructing limits of agreement. https://www.kaggle.com/mpwolke/snip-test-bland-altman-analysis/notebook

    Vanessa Resqueti, Gulherme Fregonezi, Layana Marques, Ana Lista-Paz, Ana Aline Marcelino and Rodrigo Torres-Castro, “Reliability of SNIP test in healthy children.” Kaggle, doi: 10.34740/KAGGLE/DSV/1539628.

    Content

    The B&A plot analysis is a simple way to evaluate a bias between the mean differences, and to estimate an agreement interval, within which 95% of the differences of the second method, compared to the first one, fall. Data can be analyzed both as unit differences plot and as percentage differences plot. The B&A plot method only defines the intervals of agreements, it does not say whether those limits are acceptable or not. Acceptable limits must be defined a priori, based on clinical necessity, biological considerations or other goals (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4470095/).
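
    A minimal sketch of such an analysis, using placeholder numbers rather than the SNIP measurements:

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative placeholder measurements from two methods (not the SNIP data).
    method_a = np.array([100.0, 102.0, 98.0, 105.0, 97.0, 103.0, 99.0, 101.0])
    method_b = np.array([ 98.0, 104.0, 99.0, 102.0, 95.0, 101.0, 100.0, 99.0])

    mean = (method_a + method_b) / 2          # x-axis: mean of the two methods
    diff = method_a - method_b                # y-axis: difference between methods
    bias = diff.mean()                        # mean difference (bias)
    sd = diff.std(ddof=1)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd   # 95% limits of agreement

    plt.scatter(mean, diff)
    plt.axhline(bias, linestyle="--", label="bias")
    plt.axhline(lower, linestyle=":", label="lower LoA")
    plt.axhline(upper, linestyle=":", label="upper LoA")
    plt.xlabel("Mean of method A and method B")
    plt.ylabel("Difference (A - B)")
    plt.title("Bland-Altman plot")
    plt.legend()
    plt.show()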

    Acknowledgements

    Altman and Bland (B&A)

    Vanessa Resqueti, Gulherme Fregonezi, Layana Marques, Ana Lista-Paz, Ana Aline Marcelino and Rodrigo Torres-Castro, “Reliability of SNIP test in healthy children.” Kaggle, doi: 10.34740/KAGGLE/DSV/1539628.

    https://www.kaggle.com/anaalinemarcelino/reiability-of-snip-test-in-healthy-children/metadata

    Photo by Chromatograph on Unsplash

    Inspiration

    Bland Altman analysis.

  4. Fine-tuned Llama2 for financial sentiment analysis

    • kaggle.com
    Updated Mar 23, 2024
    Cite
    Luca Massaron (2024). Fine-tuned Llama2 for financial sentiment analysis [Dataset]. https://www.kaggle.com/datasets/lucamassaron/fine-tuned-llama2-for-financial-sentiment-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 23, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luca Massaron
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Based on the notebook https://www.kaggle.com/code/lucamassaron/fine-tune-llama-2-for-sentiment-analysis, this dataset contains the fine-tuned Llama 2 model trained on the annotated dataset of approximately 5,000 sentences from the Aalto University School of Business (Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P., 2014, "Good debt or bad debt: Detecting semantic orientations in economic texts," Journal of the Association for Information Science and Technology, 65[4], 782–796; https://arxiv.org/abs/1307.5336). This collection aimed to establish human-annotated benchmarks, serving as a standard for evaluating alternative modeling techniques. The annotators involved (16 people with adequate background knowledge of financial markets) were instructed to assess the sentences solely from an investor's perspective, evaluating whether the news potentially holds a positive, negative, or neutral impact on the stock price.
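
    A hedged sketch of loading the saved weights for inference, assuming they are stored in Hugging Face transformers format; the local path and the prompt template below are assumptions, so consult the linked notebook for the exact setup:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical Kaggle input path; adjust to wherever the dataset is mounted, and match the
    # prompt template to the one used in the original fine-tuning notebook.
    model_dir = "/kaggle/input/fine-tuned-llama2-for-financial-sentiment-analysis"

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.float16, device_map="auto")

    prompt = ("Analyze the sentiment of the news headline enclosed in square brackets, "
              "and determine if it is positive, neutral, or negative.\n"
              "[Operating profit rose compared with the previous year.] = ")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=3, do_sample=False)
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))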

  5. U.S. Tobacco Use Data

    • kaggle.com
    Updated Jan 24, 2023
    Cite
    The Devastator (2023). U.S. Tobacco Use Data [Dataset]. https://www.kaggle.com/datasets/thedevastator/u-s-tobacco-use-data-1995-2010
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    U.S. Tobacco Use Data

    Prevalence and Trends by State

    By Health [source]

    About this dataset

    This dataset provides insight into the prevalence and trends of tobacco use across the United States. By breaking down this data by state, you can see how tobacco has been used and how usage has changed over time. Smoking is a major contributor to premature deaths and health complications, so understanding historic usage rates can help us analyze and hopefully reduce those negative impacts. Drawing from the Behavioral Risk Factor Surveillance System, this dataset gives us an unparalleled look at both current and historical smoking habits in each of our states. With this data, we can identify high-risk areas and track changes throughout the years for better health outcomes overall.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset contains information on the prevalence and trends of tobacco use in the United States. The data is broken down by state, and includes percentages of smokers, former smokers, and those who have never smoked. With this dataset you can explore how smoking habits have changed over time as well as what regions of the country have seen more or less consistent smoking trends.

    To begin using this dataset, you will first want to familiarize yourself with the columns included within it and their associated values. There is a “State” column that indicates the US state to which each row refers; there are also columns detailing the percentages of those who smoke every day (Smoke Everyday), smoke some days (Smoke Some Days), previously smoked (Former Smoker) and have never smoked (Never Smoked). The “Location 1” column indicates the geographic region each state falls into, based on either the four US census divisions or eight regions defined by where the states lie in relation to one another.

    Once you understand the data presented within these columns, there are a few different ways to begin exploring how tobacco use has changed over time: plotting prevalence data over different periods such as decades or specific years; compiling descriptive statistics such as percentiles or mean values; contrasting states based on relevant factors such as urban/rural population size or economic/political standing; and looking at patterns that develop across multiple years via visualisations such as box-and-whisker plots.
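
    As a hedged example of the first of those approaches (plotting prevalence over time), assuming the CSV carries a “Year” column alongside the columns described above and that the prevalence columns are numeric percentages:

    import matplotlib.pyplot as plt
    import pandas as pd

    # File name taken from the "Columns" section below; the "Year" column and the exact
    # spelling of the prevalence columns are assumptions to verify against the CSV header.
    csv = "BRFSS_Prevalence_and_Trends_Data_Tobacco_Use_-_Four_Level_Smoking_Data_for_1995-2010.csv"
    df = pd.read_csv(csv)

    daily = df.pivot_table(index="Year", columns="State", values="Smoke Everyday")
    daily[["Illinois", "California", "Kentucky"]].plot(marker="o")   # illustrative state picks
    plt.ylabel("% smoking every day")
    plt.title("Daily smoking prevalence, 1995-2010")
    plt.show()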

    This wide set of possibilities makes the dataset interesting whether you are looking at regional differences at single points in time or at long-term changes tied to national strategies for reducing nicotine consumption. With all its nuances uncovered, hopefully your results can lead to further research into whatever aspect of smoking culture you find fascinating!

    Research Ideas

    • Comparing regional and state-level smoking rates and trends over time.
    • Analyzing how different demographics are affected by state-level smoking trends, such as comparing gender or age-based differences in prevalence and/or decreasing or increasing rates of tobacco use at the regional level over time.
    • Developing visualization maps that show changes in tobacco consumption prevalence (and related health risk factors) by location, on an interactive website or tool, for public consumption of data insights from this dataset.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: Open Database License (ODbL) v1.0

    • You are free to: Share (copy and redistribute the material in any medium or format) and Adapt (remix, transform, and build upon the material for any purpose, even commercially).
    • You must: Give appropriate credit (provide a link to the license and indicate if changes were made), ShareAlike (distribute your contributions under the same license as the original), and keep intact all notices that refer to this license, including copyright notices.
    • No additional restrictions: you may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

    Columns

    File: BRFSS_Prevalence_and_Trends_Data_Tobacco_Use_-_Four_Level_Smoking_Data_for_1995-2010.csv | Column name | ...

  6. AI-Based Job Site Matching

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). AI-Based Job Site Matching [Dataset]. https://www.kaggle.com/datasets/thedevastator/ai-based-job-site-matching/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    AI-Based Job Site Matching

    Leveraging 400k+ Hours of Resource & Performance Data

    By [source]

    About this dataset

    As you savvy job-seekers know, selecting an optimal site for GlideinWMS jobs is no small feat: weighing so many critical variables and performing the highly sophisticated calculations needed to maximize the gains can be a tall order. Our dataset offers a valuable helping hand: with detailed insight into resource metrics and time-series analysis of over 400K hours of data, this treasure trove of information will hasten your journey towards finding just the right spot for all your job needs.

    Specifically, our dataset contains three files: dataset_classification.csv, which provides information on critical elements such as disk usage and CPU cache size; dataset_time_series_analysis.csv, featuring in-depth takeaways from careful time series analysis; and finally dataset_400k_hour.csv, gathering computation results from over 400K hours of testing! With columns such as Failure (indicating whether or not the job failed), TotalCpus (the total number of CPUs used by the job), CpuIsBusy (whether or not the CPU is busy), and SlotType (the type of slot used by the job), it's easier than ever to plot that perfect path to success!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset can be used to help identify the most suitable site for GlideinWMS jobs. It contains resource metrics and time-series analysis, which can provide useful insight into the suitability of each potential site. The dataset consists of three sets: dataset_classification.csv, dataset_time_series_analysis.csv and dataset_400k_hour.csv.

    The first set provides a high-level view of the critical resource metrics that are essential when matching a job and a site: DiskUsage, TotalCpus, TotalMemory, TotalDisk, CpuCacheSize, TotalVirtualMemory and TotalSlots, along with whether the CpuIsBusy and the SlotType for each job at each potential site. There is also a Failure column should an issue arise during this process, and finally a Site column so that users can ensure they are matching jobs to sites within their own specific environment if required by policy or business rules.
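
    A small sketch of a first look at that classification file, using the column names quoted above; the exact types and encodings in the CSV are assumptions, and this is not the authors' own tooling:

    import pandas as pd

    # Column names ("Failure", "Site", "SlotType", "CpuIsBusy", ...) are quoted from the
    # description above; their exact types and encodings in the CSV are assumptions.
    df = pd.read_csv("dataset_classification.csv")

    print(df[["TotalCpus", "TotalMemory", "TotalDisk", "DiskUsage"]].describe())
    print(df.groupby("Site")["Failure"].value_counts(normalize=True))       # failure share per site
    print(df.groupby("SlotType")["CpuIsBusy"].value_counts(normalize=True))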

    The second set provides detailed time-series analysis of these metrics over longer timeframes, together with LastUpdate (indicating when the analysis was generated), ydate (year of last update), mdate (month of last update) and hdate (hour of last update), so that data is refreshed on a regular basis and up-to-the-minute decisions can be made during busy times like peak workloads or reallocations caused by anomalies in usage patterns within existing systems/environments.

    Finally, our third set takes things one step further with detailed information from our 400k+ hours of analytical data collection, allowing you to maximize efficiency while selecting the best possible matches across multiple sites/criteria using only one tool (which we have conveniently packaged together in this Kaggle dataset).

    By taking advantage of our AI-driven approach, you will be able to benefit from optimal job selection across many different scenarios, such as maximum-efficiency setups that boost throughput through real-time scaling, along with better accountability and proper system governance when moving from static systems and static strategies towards ones that react to workload dynamics within new agile deployments, increasing stability while lowering maintenance costs over the longer run!

    Research Ideas

    • Use the total CPU, memory and disk usage metrics to identify jobs that need additional resources to complete quickly and suggest alternative sites with more optimal resource availability
    • Utilize the time-series analysis using failure rate, last update time series, as well as month/hour/year of last update metrics to create predictive models for job site matching and failure avoidance on future jobs
    • Identify inefficiencies in scheduling by cross-examining job types (slot type), CPU caching size requirements against historical data to find opportunities for optimization or new approaches to job organization

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

  7. Chicago Crime

    • kaggle.com
    Updated Sep 18, 2025
    Cite
    Ashkan Ranjbar (2025). Chicago Crime [Dataset]. https://www.kaggle.com/ashkanranjbar/chicago-crime/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 18, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ashkan Ranjbar
    License

    GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Area covered
    Chicago
    Description

    This dataset has gained popularity over time and is widely known. While Kaggle courses teach how to use Google BigQuery to extract a sample from it, this notebook provides a HOW-TO guide to access the dataset directly within your own notebook. Instead of uploading the entire dataset here, which is quite large, I offer several alternatives to work with a smaller portion of it. My main focus was to demonstrate various techniques to make the dataset more manageable on your own laptop, ensuring smoother operations. Additionally, I've included some interesting insights on basic descriptive statistics and even a modeling example, which can be further explored based on your preferences. I intend to revisit and refine it in the near future to enhance its rigor. Meanwhile, I welcome any suggestions to improve the notebook!
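
    A hedged sketch of that direct-access route inside a notebook; the public table name follows Kaggle's BigQuery courses, and the column names and quota behaviour should be verified before relying on them:

    import pandas as pd
    from google.cloud import bigquery

    # Table name follows Kaggle's BigQuery SQL courses; column names should be checked
    # against the table schema, and the LIMIT keeps the sample small enough for a laptop.
    client = bigquery.Client()
    query = """
        SELECT date, block, iucr, primary_type, description, location_description,
               arrest, domestic, beat, district, ward, community_area, fbi_code,
               latitude, longitude, year
        FROM `bigquery-public-data.chicago_crime.crime`
        WHERE year >= 2020
        LIMIT 100000
    """
    df = client.query(query).to_dataframe()

    # Derive the Month/Day/Hour features mentioned in the column list below.
    df["date"] = pd.to_datetime(df["date"])
    df["month"] = df["date"].dt.month
    df["day"] = df["date"].dt.day
    df["hour"] = df["date"].dt.hour
    print(df.shape)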

    Here are the columns that I have chosen to include (after carefully eliminating a few others):

    • Date: This column represents the timestamp of the incident. From this column, I have extracted the Month, Day, and Hour information. We can also add additional time-based columns such as Week and Day of the Week, among others.
    • Block: This column provides a partially redacted address where the incident occurred, indicating the same block as the actual address.
    • IUCR: The acronym stands for Illinois Uniform Crime Reporting. This code is directly linked to the Primary Type and Description. You can find more information about it in this link.
    • Primary Type: This column describes the primary category of the IUCR code mentioned above.
    • Description: This column provides a secondary description of the IUCR code, serving as a subcategory of the primary description.
    • Location Description: Here, you can find the description of the location where the incident took place.
    • Arrest: This column indicates whether an arrest was made in relation to the incident.
    • Domestic: It shows whether the incident was domestic-related, as defined by the Illinois Domestic Violence Act.
    • Beat: The beat refers to the smallest police geographic area, with each beat having a dedicated territory. You can find more information about it in this link.
    • District: This column represents the police district where the incident occurred.
    • Ward: It refers to the number that labels the City Council district where the incident took place.
    • Community Areas: This column indicates the community area where the incident occurred. Chicago has a total of 77 community areas.
    • FBI Code: The crime classification outlined in the FBI's National Incident-Based Reporting System (NIBRS).
    • X-Coordinate, Y-Coordinate, Latitude, Longitude, Location: These columns provide information about the geographical coordinates of the incident location, including latitude and longitude. The "Location" column contains just the latitude and longitude coordinates.
    • Year, Updated On: These columns represent the year of the incident and the date on which the dataset was last updated.

    Feel free to explore the notebook and provide any suggestions for improvement. Your feedback is highly appreciated!

