Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Computational notebooks have become the primary coding environment for data scientists. Despite their popularity, research on the code quality of these notebooks is still in its infancy, and the code shared in them is often of poor quality. Given the importance of maintenance and reusability, it is crucial to pay attention to notebook code comprehension and to identify the notebook metrics that play a significant role in it. The level of code comprehension is a qualitative variable closely associated with the user's opinion of the code. Previous studies have typically measured it in one of two ways: with limited questionnaires reviewing a small number of code snippets, or with repository metadata alone, such as the number of likes and user votes a project receives. In our approach, we improved the measurement of notebook code understandability by leveraging user comments in a software repository. As a case study, we started with 248,761 Kaggle Jupyter notebooks introduced in previous studies, together with their relevant metadata. To identify user comments associated with code comprehension, we used a fine-tuned DistilBERT transformer. We then established a *user-comment-based criterion* for measuring code understandability that considers the number of comprehension-related comments, the upvotes on those comments, the total views of the notebook, and the total upvotes the notebook received. This criterion proved more effective than the alternatives, so we used it as the ground truth for evaluating the code comprehension of our notebook set. In addition, we collected 34 metrics for 10,857 notebooks, categorized as script-based and notebook-based metrics, and used them as features in our dataset. Using a Random Forest classifier, our predictive model achieved 85% accuracy in predicting code comprehension levels in computational notebooks, identifying developer expertise and markdown-based metrics as key factors.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
However, to use another notebook's results you normally need to download them and then upload them to your own notebook. I created this dataset so anyone can use these notebook results directly, without the download/upload step. Please upvote if it helps you.
This dataset contains 5 results used as input for the hybrid approaches in these notebooks (a small blending sketch follows after the file list below):
* https://www.kaggle.com/titericz/h-m-ensembling-how-to/notebook
* https://www.kaggle.com/code/atulverma/h-m-ensembling-with-lstm
If you want to use these notebooks but can't access the private dataset, add my dataset to your notebook and then change the file paths.
It has 5 files:
* submissio_byfone_chris.csv: Submission result from: https://www.kaggle.com/lichtlab/0-0226-byfone-chris-combination-approach
* submission_exponential_decay.csv: Submission result from: https://www.kaggle.com/tarique7/hnm-exponential-decay-with-alternate-items/notebook
* submission_trending.csv: Submission result from: https://www.kaggle.com/lunapandachan/h-m-trending-products-weekly-add-test/notebook
* submission_sequential_model.csv: Submission result from: https://www.kaggle.com/code/astrung/sequential-model-fixed-missing-last-item/notebook
* submission_sequential_with_item_feature.csv: Submission result from: https://www.kaggle.com/code/astrung/lstm-model-with-item-infor-fix-missing-last-item/notebook
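As a rough illustration (not from the dataset author), the sketch below blends two of these submission files by interleaving their recommended items per customer. The column names (customer_id, and a prediction column of space-separated article ids) follow the usual H&M submission format and are an assumption here.

```python
# A minimal blending sketch: interleave the recommendations of two submissions.
import pandas as pd

sub_a = pd.read_csv("submission_trending.csv")
sub_b = pd.read_csv("submission_exponential_decay.csv")

merged = sub_a.merge(sub_b, on="customer_id", suffixes=("_a", "_b"))

def blend(row, k=12):
    # Interleave items from both submissions, drop duplicates, keep the top k.
    items_a = row["prediction_a"].split()
    items_b = row["prediction_b"].split()
    seen, out = set(), []
    for pair in zip(items_a, items_b):
        for item in pair:
            if item not in seen:
                seen.add(item)
                out.append(item)
    return " ".join(out[:k])

merged["prediction"] = merged.apply(blend, axis=1)
merged[["customer_id", "prediction"]].to_csv("submission_blend.csv", index=False)
```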
"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]
Each row represents a customer; each column contains a customer attribute, as described in the column Metadata.
The data set includes information about:
To explore this type of model and learn more about the subject.
New version from IBM: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113
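As a starting point for exploring this type of model, here is a minimal, hedged sketch of a churn classifier. The file name and the customerID/Churn column names are assumed from the standard IBM Telco schema and may need adjusting.

```python
# A minimal churn-model sketch; adjust file and column names to your copy.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")  # file name may differ

y = (df["Churn"] == "Yes").astype(int)
# Note: numeric-looking text columns (e.g. TotalCharges) may need pd.to_numeric first.
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))
```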
This dataset was modified from @iafoss's notebook to create full-sized 512x512 px images. It has been derived from HuBMAP's competition data. By using this dataset, you acknowledge and accept the rules of the competition, which are non-exhaustively summarized below:
DATA ACCESS AND USE: Open Source
Competitions are open to residents of the United States and worldwide, except that if you are a resident of Crimea, Cuba, Iran, Syria, North Korea, Sudan, or are subject to U.S. export controls or sanctions, you may not enter the Competition. Other local rules and regulations may apply to you, so please check your local laws to ensure that you are eligible to participate in skills-based competitions. The Competition Sponsor reserves the right to award alternative Prizes where needed to comply with local laws.
"Competition Data" means the data or datasets available from the Competition Website for the purpose of use in the Competition, including any prototype or executable code provided on the Competition Website. The Competition Data will contain private and public test sets. Which data belongs to which set will not be made available to participants.
A. Data Access and Use. You may access and use the Competition Data for any purpose, whether commercial or non-commercial, including for participating in the Competition and on Kaggle.com forums, and for academic research and education. The Competition Sponsor reserves the right to disqualify any participant who uses the Competition Data other than as permitted by the Competition Website and these Rules.
B. Data Security. You agree to use reasonable and suitable measures to prevent persons who have not formally agreed to these Rules from gaining access to the Competition Data. You agree not to transmit, duplicate, publish, redistribute or otherwise provide or make available the Competition Data to any party not participating in the Competition. You agree to notify Kaggle immediately upon learning of any possible unauthorized transmission of or unauthorized access to the Competition Data and agree to work with Kaggle to rectify any unauthorized transmission or access.
C. External Data. You may use data other than the Competition Data (“External Data”) to develop and test your models and Submissions. However, you will (i) ensure the External Data is available to use by all participants of the competition for purposes of the competition at no cost to the other participants and (ii) post such access to the External Data for the participants to the official competition forum prior to the Entry Deadline.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset provides a unique and comprehensive corpus for natural language processing, specifically for validating text summarization reward models from OpenAI. It contains summaries of text from the TL;DR, CNN, and Daily Mail datasets, together with the choices workers made when comparing summaries, batch information that differentiates summaries created by different workers, and the dataset split each record belongs to. All of this allows users to train summarization systems on real-world data to produce reliable, concise summaries of long-form text, and to benchmark model output directly against human-generated results.
This dataset provides a comprehensive corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets to help machine learning models understand and evaluate natural language processing. The dataset contains training and validation data to optimize machine learning tasks.
To use this dataset for summarization tasks:
- Look at the info column in the two .csv files (train and validation) to find the source text you would like to summarize.
- Use the choice column of either file to see which candidate summary was preferred.
- Review the corresponding summaries column for alternative summaries with similar content but different wording or style.
- Check the split and batch information for additional context on each comparison before selecting the summary you judge most accurate or clear.
- Training a natural language processing model to automatically generate summaries of text, using summary and choice data from this dataset.
- Evaluating OpenAI's reward model for natural language processing on the validation data in order to improve accuracy and performance.
- Analyzing the worker and batch information in order to assess different trends among workers or batches that could be indicative of bias or other issues affecting summarization accuracy.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: comparisons_validation.csv

| Column name | Description |
|:------------|:----------------------------------------------------------------------------|
| info | Text to be summarized. (String) |
| summaries | Summaries generated by workers. (String) |
| choice | The chosen summary. (String) |
| batch | Batch for which it was created. (Integer) |
| split | Split of the dataset between training and validation sets. (String) |
| extra | Additional information about the given source material available. (String) |
File: comparisons_train.csv

| Column name | Description |
|:------------|:----------------------------------------------------------------------------|
| info | Text to be summarized. (String) |
| summaries | Summaries generated by workers. (String) |
| choice | The chosen summary. (String) |
| batch | Batch for which it was created. (Integer) |
| split ...
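A quick way to get oriented is to load both files and inspect a single comparison; the sketch below assumes the files sit in the working directory.

```python
# Load the comparison files and look at one human judgment.
import pandas as pd

train = pd.read_csv("comparisons_train.csv")
valid = pd.read_csv("comparisons_validation.csv")

print(train.columns.tolist())

row = train.iloc[0]
print(row["info"])        # source text
print(row["summaries"])   # candidate summaries
print("chosen:", row["choice"])
```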
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
One of the most popular competitions on Kaggle is House Prices: Advanced Regression Techniques. The original data comes from Dean De Cock, "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project", Journal of Statistics Education, Volume 19, Number 3 (2011). Recently a 'demonstration' notebook, "First place is meaningless in this way!", was published that extracts the 'solution' from the full dataset. Now that the 'solution' is readily available, anyone can reproduce the competition at home without any daily submission limit. This opens up the possibility of experimenting with advanced techniques such as pipelines with various estimators and models in the same notebook, extensive hyper-parameter tuning, and so on, all without the risk of 'upsetting' the public leaderboard. Simply download this solution.csv file, import it into your script or notebook, and evaluate the root mean squared error (RMSE) between the logarithm of the predicted value and the logarithm of the value in this file.
This dataset is the submission.csv file that will produce a public leaderboard score of 0.00000.
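A minimal scoring sketch follows: it computes the RMSE between the logarithm of your predictions and the logarithm of the values in solution.csv. The Id and SalePrice column names follow the competition's submission format and are assumed here.

```python
# Score a local submission against solution.csv (log-space RMSE, i.e. RMSLE).
import numpy as np
import pandas as pd

solution = pd.read_csv("solution.csv")
submission = pd.read_csv("my_submission.csv")  # your own predictions (placeholder name)

merged = solution.merge(submission, on="Id", suffixes=("_true", "_pred"))
rmsle = np.sqrt(np.mean(
    (np.log(merged["SalePrice_true"]) - np.log(merged["SalePrice_pred"])) ** 2
))
print(f"RMSLE: {rmsle:.5f}")
```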
By State of New York [source]
The dataset includes more than twenty years of Workers' Compensation claim records starting from the year 2000, giving a comprehensive overview of this important segment of the labor market. From information on claimants' age, gender and zip code to details on claim type, injury type, injury source and event exposure, each record holds invaluable insight into the health of workers' compensation. Stay up to date with the WCB's constantly growing database of essential data that can help you make informed decisions on how to manage claims and look after your workforce. Learn what types of injuries lead to successful claims; understand which carrier types are most often involved in claims; research claim assembly process times; gain an understanding of slow or disputed claims; find out about average wages for claimants; and explore numerous other aspects of workers' compensation. With such insight available at your fingertips, make sure you capitalize on its potential as you work towards better management and protection of your workforce, with complete data from 2000 all the way up to today.
Welcome to the Workers’ Compensation Claims in New York State: 2000-Present dataset! This dataset contains information about workers’ compensation claims in New York State between the years 2000 and present.
This guide will provide you with an overview of the data included, as well as how to use this information for your own research and exploration.
Firstly, you should familiarize yourself with the data columns. This dataset includes a variety of fields related to workers' compensation claims such as claim type, injury type, district name, current claim status, age at injury, assembly date and ANCR date (claim acceptance or denial). It also contains details such as the Average Weekly Wage considered by the WCB, dispute resolution mechanisms (Alternative Dispute Resolution), the legal representative handling the case (Attorney/Representative), the insurance carrier involved (Carrier Name) and other fields useful for understanding workers' compensation cases, such as Hearing Count and Closed Count.
Once you understand all available fields/columns and their respective values/labels, you can start exploring them one by one and create custom queries based on specific parameters within each field. For example, some common analyses could include:
- Analyzing worker benefits based on salary ranges or specific professions
- Comparing survival rates of injured employees across different regions
- Seeing how injuries vary across gender lines
- Studying dispute resolution patterns over time
- Examining attorney or representative impact on settlement outcomes
- And much more!

With this guide, you should now have a basic understanding of how to use the Workers' Compensation Claims in New York State: 2000-Present Kaggle database, so that your explorations become more fruitful. A small exploration sketch follows below. Enjoy!
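This is a hedged exploration sketch of the kind of custom queries described above; the exact column names (for example 'District Name', 'Claim Type', 'Age at Injury') are assumptions based on the description and may differ in the file.

```python
# Explore the claims file with a couple of simple group-by queries.
import pandas as pd

claims = pd.read_csv("Assembled_Workers_Compensation_Claims_Beginning_2000.csv",
                     low_memory=False)

# Most common claim types per district (column names assumed).
print(claims.groupby(["District Name", "Claim Type"]).size()
            .sort_values(ascending=False).head(10))

# Average age at injury by claim type.
print(claims.groupby("Claim Type")["Age at Injury"].mean().round(1))
```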
- Identifying potential problem areas in the workers’ compensation system and illustrating how to best resolve those issues.
- Demonstrating potential correlations between types of injuries, claim types, and outcomes in order to inform better decision-making with regards to workplace safety.
- Estimating the financial impact of future claims based on current trends in workers' compensation data.
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors - You are free to: - Share - copy and redistribute the material in any medium or format for any purpose, even commercially. - Adapt - remix, transform, and build upon the material for any purpose, even commercially. - You must: - Give appropriate credit - Provide a link to the license, and indicate if changes were made. - ShareAlike - You must distribute your contributions under the same license as the original. - Keep intact - all notices that refer to this license, including copyright notices.
File: Assembled_Workers_Compensation_Claims_Beginning_2000.csv | Column ...
By US Open Data Portal, data.gov [source]
How to use this dataset
- Accessing the Data: To access this dataset, you can visit Data.cdc.gov, where it is publicly available, or download it directly from Kaggle at https://www.kaggle.com/cdc/us-national-cardiovascular-disease.
- Exploring the Data: There are 20 columns/variables in this dataset, including Year, LocationAbbr, LocationDesc, DataSource, PriorityArea1 through PriorityArea4, Category, Topic, Indicator, Data_Value_Type, Data_Value_Unit, Data_Value_Alt, Data_Value_Footnote_Symbol, Break_Out_Category, GeoLocation and others (see the column table below for the full list). You can explore one variable or several simultaneously to gain insight into cardiovascular diseases (CVDs) in America, such as their rates across different locations over the years, or the prevalence of certain risk factors among different age groups and genders.
- The Uses of This Dataset: This dataset can be used by researchers interested in improving our understanding of CVDs in America, for example by assessing disease burden and monitoring trends over time across population subgroups; by health authorities publicizing vital health knowledge through data dissemination tactics such as outreach programs; or by policy makers informing community-level interventions based on insights extracted from it. For example, someone may compare smoking prevalence between males and females within one state or countrywide, or extend that comparison into a time series analysis of smoking prevalence trends from 2001 onwards, across both genders, nationally, up to the present day.
- Creating a real-time cardiovascular disease surveillance system that can send updates and alert citizens about risks in their locale.
- Generating targeted public health campaigns for different demographic groups by drawing insights from the dataset to reach those most at risk of CVDs.
- Developing an app or software interface that allows users to visualize data trends around CVD prevalence and risk factors between different locations, age groups and ethnicities quickly, easily and accurately.
If you use this dataset in your research, please credit the original authors. Data Source
Unknown License - Please check the dataset description for more information.
File: csv-1.csv

| Column name | Description |
|:---------------------------|:-----------------------------------------------------------------|
| Year | Year of the survey. (Integer) |
| LocationAbbr | Abbreviation of the location. (String) |
| LocationDesc | Description of the location. (String) |
| DataSource | Source of the data. (String) |
| PriorityArea1 | Priority area 1. (String) |
| PriorityArea2 | Priority area 2. (String) |
| PriorityArea3 | Priority area 3. (String) |
| PriorityArea4 | Priority area 4. (String) |
| Category | Category of the data value type. (String) |
| Topic | Topic related to the indicator of the data value unit. (String) |
| Indicator | Indicator of the data value unit. (String) |
| Data_Value_Type | Type of data value. (String) |
| Data_Value_Unit | Unit of the data value. (String) |
| Data_Value_Alt | Alternative value of the data value. (Float) |
| Data_Value_Footnote_Symbol | Footnote symbol of the data value. (String) |
| Break_Out_Category | Break out category of the data value. (String) |
| GeoLocation | Geographic location associated with the survey d...
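As an illustrative (not authoritative) starting point, the sketch below plots a simple year-over-year trend using the columns listed above; the filter on LocationAbbr and the use of Data_Value_Alt as the measure are assumptions you should adapt to your question.

```python
# Plot a simple national trend over years for one measure.
import pandas as pd
import matplotlib.pyplot as plt

cvd = pd.read_csv("csv-1.csv")

subset = cvd[cvd["LocationAbbr"] == "US"]          # adjust filters to your question
trend = subset.groupby("Year")["Data_Value_Alt"].mean()

trend.plot(marker="o")
plt.ylabel("Mean reported value")
plt.title("Cardiovascular indicator trend over time (illustrative)")
plt.show()
```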
This data comes entirely from the TensorFlow - Help Protect the Great Barrier Reef competition and should not be used outside of the competition! I do not own these images and, to the extent possible, want to ensure this complies with the terms of the competition - I believe it does. All users/viewers of this dataset should adhere to the terms & conditions of the competition.
I wanted an easily accessible repository of the cots images and not cots images to help with data augmentation and possibly improving the models in other ways. In the spirit of the competition I thought it made the most sense to make this available to the other competitors.
This notebook was used to pre-process / create this dataset: Cropped Crown of Thorns Dataset Builder. It walks through the steps in a readable way.
About the dataset:
* This dataset contains an equal number (11,898 each) of COTS and not-COTS .jpg images.
* These images come from cropping out the bounding box regions from each video frame in the competition.
* Use this for data augmentation.
* Alternatively, if you're just getting started, try building a binary classifier for COTS vs. not-COTS (a sketch follows below) to build up the skill to create more complicated object detection models.
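Here is a minimal, hedged sketch of such a binary classifier; it assumes the images are arranged in two subfolders (for example cots/ and not_cots/) under one directory, which may differ from how this dataset is actually laid out.

```python
# A tiny COTS vs. not-COTS classifier sketch using tf.keras.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "cropped_cots_dataset", image_size=(128, 128), batch_size=32,
    validation_split=0.2, subset="training", seed=42)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "cropped_cots_dataset", image_size=(128, 128), batch_size=32,
    validation_split=0.2, subset="validation", seed=42)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=3)
```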
This comes directly from the [TensorFlow - Help Protect the Great Barrier Reef](https://www.kaggle.com/c/tensorflow-great-barrier-reef) competition. Alternative citations include:
Liu, J., Kusy, B., Marchant, R., Do, B., Merz, T., Crosswell, J., ... & Malpani, M. (2021). The CSIRO Crown-of-Thorn Starfish Detection Dataset. arXiv preprint arXiv:2111.14311.
See Notebook used to build this dataset here: Cropped Crown of Thorns Dataset Builder
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
By [source]
Do you love spending time in the kitchen whipping up delicious, homemade meals? If so, then you'll love this dataset! The Ultimate Recipe Recommendation Dataset contains a large variety of recipes that are delicious, nutritious, and easy to prepare. This dataset is perfect for researchers who are interested in exploring recipe recommendations, nutrition data, and cooking times. With this dataset, you can answer important questions such as: What are the most popular recipes? What are the most nutritional recipes? What are the quickest recipes to prepare? So whether you're a seasoned chef or a beginner cook, this dataset is sure to have something for everyone!
If you're interested in exploring recipe recommendations, nutrition data, and cooking times, this is the dataset for you! With over 100,000 recipes, there's something for everyone. Whether you're looking for a quick and easy meal or something more substantial, you'll find it here. You can use this dataset to answer important questions such as: What are the most popular recipes? What are the most nutritional recipes? What are the quickest recipes to prepare?
- Recipe recommendations: With this dataset, you can recommend recipes to users based on their preferences. For example, if a user likes quick and easy recipes, you can recommend recipes that have a short prep time and cook time.
- Nutrition data: This dataset contains nutrition information for each recipe. This data can be used to recommend recipes that are high in protein, low in fat, etc.
- Cooking times: With this dataset, you can find recipes that are quick and easy to prepare. This is perfect for busy home cooks who don't have a lot of time to spend in the kitchen! (A quick filtering sketch follows below.)
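As a small, hedged example of the time-based filtering described above (column names taken from the recipes.csv table further below; total_time is documented as an Integer, so parse it first if your copy stores it as text):

```python
# Surface quick, highly rated recipes.
import pandas as pd

recipes = pd.read_csv("recipes.csv")

quick_and_good = (recipes[(recipes["total_time"] <= 30) & (recipes["rating"] >= 4.5)]
                  .sort_values("rating", ascending=False))
print(quick_and_good[["recipe_name", "total_time", "rating"]].head(10))
```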
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: recipes.csv

| Column name | Description |
|:-------------|:------------------------------------------------------------------------------|
| recipe_name | The name of the recipe. (String) |
| prep_time | The amount of time required to prepare the recipe. (Integer) |
| cook_time | The amount of time required to cook the recipe. (Integer) |
| total_time | The total amount of time required to prepare and cook the recipe. (Integer) |
| servings | The number of servings the recipe yields. (Integer) |
| ingredients | A list of ingredients required to make the recipe. (List) |
| directions | A list of directions for preparing and cooking the recipe. (List) |
| rating | The recipe rating. (Float) |
| url | The recipe URL. (String) |
| cuisine_path | The recipe cuisine path. (String) |
| nutrition | The recipe nutrition information. (Dictionary) |
| timing | The recipe timing information. (Dictionary) |
File: test_recipes.csv

| Column name | Description |
|:------------|:----------------------------------|
| url | The recipe URL. (String) |
| Name | The recipe name. (String) |
| Prep Time | The recipe prep time. (String) |
| Cook Time | The recipe cook time. (String) |
| Total Time | The recipe total time. (String) |
| Servings | The recipe servings. (String) |
| Yield | The recipe yield. (String) |
| Ingredients | The recipe ingredients. (String) |
| Directions | The recipe directions. (String) |
If you use this dataset in your research, please credit the original authors.
In 1983 Altman and Bland (B&A) proposed an alternative analysis, based on the quantification of the agreement between two quantitative measurements by studying the mean difference and constructing limits of agreement. https://www.kaggle.com/mpwolke/snip-test-bland-altman-analysis/notebook
Vanessa Resqueti, Gulherme Fregonezi, Layana Marques, Ana Lista-Paz, Ana Aline Marcelino and Rodrigo Torres-Castro, “Reliability of SNIP test in healthy children.” Kaggle, doi: 10.34740/KAGGLE/DSV/1539628.
The B&A plot analysis is a simple way to evaluate the bias between the mean differences and to estimate an agreement interval within which 95% of the differences of the second method, compared to the first one, fall. Data can be analyzed both as a unit-differences plot and as a percentage-differences plot. The B&A plot method only defines the intervals of agreement; it does not say whether those limits are acceptable or not. Acceptable limits must be defined a priori, based on clinical necessity, biological considerations or other goals. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4470095/
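For readers who want to reproduce the analysis, here is a minimal Bland-Altman plot sketch with made-up paired measurements; replace m1 and m2 with your own data.

```python
# Bland-Altman (B&A) plot: bias and 95% limits of agreement for two methods.
import numpy as np
import matplotlib.pyplot as plt

m1 = np.array([102, 98, 110, 105, 99, 101, 107, 95], dtype=float)  # method 1 (example data)
m2 = np.array([100, 97, 112, 104, 98, 103, 106, 94], dtype=float)  # method 2 (example data)

mean = (m1 + m2) / 2
diff = m1 - m2
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)   # half-width of the 95% limits of agreement

plt.scatter(mean, diff)
plt.axhline(bias, color="k", label=f"bias = {bias:.2f}")
plt.axhline(bias + loa, color="r", linestyle="--", label="upper limit of agreement")
plt.axhline(bias - loa, color="r", linestyle="--", label="lower limit of agreement")
plt.xlabel("Mean of the two methods")
plt.ylabel("Difference (method 1 - method 2)")
plt.legend()
plt.show()
```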
Altman and Bland (B&A)
Vanessa Resqueti, Gulherme Fregonezi, Layana Marques, Ana Lista-Paz, Ana Aline Marcelino and Rodrigo Torres-Castro, “Reliability of SNIP test in healthy children.” Kaggle, doi: 10.34740/KAGGLE/DSV/1539628.
https://www.kaggle.com/anaalinemarcelino/reiability-of-snip-test-in-healthy-children/metadata
Photo by Chromatograph on Unsplash
Bland Altman analysis.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The project presents Sea Around Us Global Fisheries Catch Data aggregated at EEZ level. The data are computed from reconstructed catches from various official fisheries statistics, scientific, technical and policy reports about the fisheries, and include estimation of discards, unreported and illegal catch data from all maritime countries and major territories of the world.
This project was the result of work between Sea Around Us and the CIC programme, a collaborative programme between the University of British Columbia (UBC) and AWS.
This dataset was sourced from https://www.seaaroundus.org/ via Amazon AWS-ODS
Alternative AWS CLI access (no AWS account required):

aws s3 ls --no-sign-request s3://fisheries-catch-data/ --recursive

To download, replace ls (list files) with cp (copy files to your local system) and add a destination folder:

aws s3 cp --no-sign-request s3://fisheries-catch-data/ [local-data-folder eg: C:/] --recursive

Use --dryrun first to test before copying.
Sea Around Us Global Fisheries Catch Data was accessed on 20th October 2022 from https://registry.opendata.aws/sau-global-fisheries-catch-data .
Game simulator (basketball): NBA 2022-2023
The aim of this project is to generate simulations of basketball games between NBA teams for 2022-2023 for the purpose of modeling predicted outcomes from a player efficiency metric (the "r metric").
A champion will be determined for the simulated season using an elimination format, with teams eliminated from contention upon recording 20 losses until only one team remains.
Box score statistics for players (on a per 100 possessions basis) were gathered from https://www.basketball-reference.com/ from the 2021-2022 season. In some cases, for players for whom 2021-2022 stats are not available due to injury, stats were pulled from prior seasons. Rookie stats are projected, based on their final year college stats.
The player stats were filtered and transformed to reflect a focus on box score stats measuring playing efficiency, as opposed to measures of volume. For example, Real Shooting Percentage (True Shooting Percentage adjusted for volume, based on points generated above average) was incorporated into the metric instead of Points Per Game, and Assist to Turnover Ratio was incorporated instead of Assists Per Game. The complete list of stats used for the r metric is as follows:
The r metric efficiency rating was derived from performing a linear regression on the overall team stats for a selection of teams for NBA seasons from 1980 to the present against their Point Differential and then applying the resulting predicted values to individual players.
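The sketch below illustrates the general shape of that derivation, not the project's actual code: fit team-level efficiency stats against point differential, then apply the fitted coefficients to player-level stats. All file names and feature columns here are hypothetical placeholders.

```python
# Hedged sketch of deriving an efficiency rating from a team-level regression.
import pandas as pd
from sklearn.linear_model import LinearRegression

teams = pd.read_csv("team_stats_1980_present.csv")   # hypothetical file
players = pd.read_csv("player_stats_per100.csv")     # hypothetical file

# Placeholder feature names; the real r metric uses the stats listed above.
features = ["ts_adj", "ast_to_ratio", "reb_rate", "stl_rate", "blk_rate", "tov_rate"]
reg = LinearRegression().fit(teams[features], teams["point_differential"])

# Apply the fitted model to individual players to get an r-metric-style rating.
players["r_metric"] = reg.predict(players[features])
print(players[["player", "r_metric"]].sort_values("r_metric", ascending=False).head())
```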
An R function was created to generate simulated game outcomes from a Kaggle notebook. The output is produced as a ggplot visualizing the r metric (in pink) against the traditional box score stats (coded by team in blue/red), plus a csv file as a box score. The notebook is scheduled to run daily, randomly selecting teams to play against one another and generating outcomes based on the player stats and metric for each team, with an element of random variation.
The simulated games are based on fictional rotations reduced to an equal distribution of minutes to the teams' six most productive players.
The predicted point differentials populated in the standings file (mega_nba2023.csv) are, conversely, predictions of actual 2022-2023 results based on teams' expected actual minute distributions across their full rotations.
The difference between the two alternate methods of projection above is intended rhetorically, to highlight the argument implied by the metric as to how players' effectiveness should be judged, and, consequently, their playing time allocated.
[SPURS](https://www.kaggle.com/dat...
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset provides a collection of images with accompanying alternative texts and NSFW prediction labels for the purpose of accurate classification and prediction. Each image entry includes data points such as its SHA-256 hash, URL, automatically generated caption, predicted NSFW label, alternative-text similarity score, dimensions, and EXIF data, providing comprehensive details that can be utilized for a variety of image classification tasks. This dataset serves as an ideal resource for any project that relies on accurately classifying and detecting images.
This dataset provides image data with alt texts and NSFW predictions for the purpose of accurately classifying images. To use this dataset, first take a look at the columns provided and familiarize yourself with their contents. Key is a unique identifier for each image, sha256 provides the SHA-256 hash of the image, url provides a link to where the image can be accessed online, llava_caption is an automatically generated caption for each image based on its contents.
The NSFW prediction signals whether content in a photo may include unpleasant topics such as violence, or mature content such as nudity, that would make it unsuitable for certain audiences, while alt_txt contains the alternative text associated with the photo. alt_txt_similarity describes how closely the user-provided alternative text matches the automatically generated caption from Laion-Pop. height and original_height describe the image dimensions, which can differ between formats and resizing steps. Lastly, exif stands for exchangeable image file format, metadata attached to pictures by digital camera manufacturers.
With this information in mind you will be able to explore and examine your data efficiently in order to classify images according your own specifications!
- The dataset can be used for image recognition and classification, by running machine learning algorithms to build models that will predict the class of the images based on their alt texts and NSFW predictions.
- This dataset allows developers to create tools for filtering out NSFW images from content being produced by users (see the sketch below).
- This dataset can also be used for creating AI-assisted applications that enrich users' images with captions related to what they are seeing in the picture, providing a more immersive experience.
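A small filtering sketch for the second use case is shown below; it assumes nsfw_prediction is stored as a Boolean, as documented in the train.csv table further below, and will need adjusting if your copy stores a probability or a string label instead.

```python
# Drop rows predicted NSFW before building an image corpus.
import pandas as pd

images = pd.read_csv("train.csv")
safe = images[~images["nsfw_prediction"].astype(bool)]
print(f"Kept {len(safe)} of {len(images)} images")
safe[["url", "alt_txt"]].to_csv("safe_images.csv", index=False)
```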
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:-------------------|:-------------------------------------------------------------------------------------------------|
| key | Unique identifier for each image. (String) |
| sha256 | SHA-256 hash of the image. (String) |
| url | URL of the image. (String) |
| llava_caption | Automatically generated caption for the image. (String) |
| nsfw_prediction | Prediction of whether the image is NSFW or not. (Boolean) |
| alt_txt | Alternative text for the image. (String) |
| alt_txt_similarity | Similarity score between the automatically generated caption and the alternative text. (Float) |
| height | Height of the image. (Integer) |
| original_height | Original height of the image. (Integer) |
| exif | EX...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Based on the notebook https://www.kaggle.com/code/lucamassaron/fine-tune-llama-2-for-sentiment-analysis this dataset contains the fine-tuned Llama 2 model trained on the annotated dataset of approximately 5,000 sentences from the Aalto University School of Business (Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P., 2014, “Good debt or bad debt: Detecting semantic orientations in economic texts.” Journal of the Association for Information Science and Technology, 65[4], 782–796 - https://arxiv.org/abs/1307.5336). This collection aimed to establish human-annotated benchmarks, serving as a standard for evaluating alternative modeling techniques. The involved annotators (16 people with adequate background knowledge of financial markets) were instructed to assess the sentences solely from an investor's perspective, evaluating whether the news potentially holds a positive, negative, or neutral impact on the stock price.
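A hedged loading sketch is shown below. It assumes the dataset ships a merged fine-tuned model in Hugging Face format; if it instead contains LoRA adapters, load the base Llama 2 model first and attach the adapter with peft. The path and prompt text are placeholders and must match the format used in the fine-tuning notebook.

```python
# Load the fine-tuned model and run a single sentiment prompt (sketch only).
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_path = "/kaggle/input/<this-dataset>/fine-tuned-llama2"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = ("Analyze the sentiment of the news headline enclosed in square brackets "
          "and determine if it is positive, neutral, or negative.\n"
          "[Company X reports record quarterly profits] = ")
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```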
By data.world's Admin [source]
This folder contains datasets from The Pudding essay The Good, the Rad, and the Gnarly, published in June 2018, which provides an in-depth examination of skateboard music genre usage across multiple companies. Not only does this dataset provide insight into trends and patterns in terms of genre usage over time, but it also allows users to explore down to the artist level.
The folder contains two files:
time_series.tsv and waffle.csv. The former contains genre usage over time, while the latter consists of skateboard company genre usage percentages (multiplied by 1000) along with associated genres, including fake genres used for testing purposes. Both datasets can be used to gain a greater understanding of the inner workings of skateboard music taste and trends, while still allowing you to examine particular artists' usage across time and companies if desired. Detailed below are column descriptions for both files:
time_series.tsv: This file is made up of a number of columns, including 'genre', 'time', 'percentage used (p)', 'maximum percentage across all genres (maxp)', 'peak (p_peak)', and a moving-average percentage use (p_smooth). Each column is valuable when engaging with this dataset's layered approach to exploring skateboard music trends over time, alongside individual artists' growing popularity compared with others in similar styles or broader categories such as Hip Hop, Electronic Music, etc.
waffle.csv: This file consists of four columns - 'source', 'value', 'company' and 'fake genre' - each helping paint a picture of how specific companies use various aspects of broader genres like Classic Rock or Indie/Alternative Music, allowing viewers to drill down into specifics such as an exact artist or an 80's metal band. Using this dataset demands attention so as not to mix up which company uses which genre and what portion of the value each contributes relative to overall favorites amongst boardsports enthusiasts globally!
Both datasets cover the period up to December 2017. All data is available under the MIT License [link]!
In this guide, we will take a look at how you can use this data set to better understand music in the skateboarding community.
Step 1: Exploring the Data Set
The first step is to get familiar with all of the columns contained in this dataset. The following table provides an overview of what is included:
| Header | Description | Data Type |
|---------|-------------------------------------------------------------------------------------------|-----------|
| source | Genre of music from broad genre bins | text |
| value | Percentage of associated genres used for the corresponding company, multiplied by 1000 | number |
| company | Skateboard company | text |

Using these headers, you can examine which genres are most popular amongst different companies, allowing skaters to draw comparisons between them. This will help skaters form an understanding of why some companies might favor certain music more than others. Additionally, you can track certain trends over time using this dataset, giving insight into which genres may be becoming more or less popular on each touring team or within each brand's video output over time. Finally, if it becomes necessary due to licensing issues or other restrictions one brand places upon its releases or press materials you may ...
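As a quick, hedged illustration of the comparison described above, the sketch below tabulates genre share by company from waffle.csv, using the source/value/company columns listed in the table and treating value as a percentage multiplied by 1000.

```python
# Genre share by company from waffle.csv.
import pandas as pd

waffle = pd.read_csv("waffle.csv")
waffle["percent"] = waffle["value"] / 1000.0  # values are percentages x 1000

by_company = waffle.pivot_table(index="company", columns="source",
                                values="percent", aggfunc="sum")
print(by_company.round(1))
```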
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was automatically generated from this notebook. It provides 512x512 png tiles of the images and masks. It has been derived from HuBMAP's competition data. By using this dataset, you acknowledge and accept the rules of the competition, which are non-exhaustively summarized below:
DATA ACCESS AND USE: Open Source
Competitions are open to residents of the United States and worldwide, except that if you are a resident of Crimea, Cuba, Iran, Syria, North Korea, Sudan, or are subject to U.S. export controls or sanctions, you may not enter the Competition. Other local rules and regulations may apply to you, so please check your local laws to ensure that you are eligible to participate in skills-based competitions. The Competition Sponsor reserves the right to award alternative Prizes where needed to comply with local laws.
"Competition Data" means the data or datasets available from the Competition Website for the purpose of use in the Competition, including any prototype or executable code provided on the Competition Website. The Competition Data will contain private and public test sets. Which data belongs to which set will not be made available to participants.
A. Data Access and Use. You may access and use the Competition Data for any purpose, whether commercial or non-commercial, including for participating in the Competition and on Kaggle.com forums, and for academic research and education. The Competition Sponsor reserves the right to disqualify any participant who uses the Competition Data other than as permitted by the Competition Website and these Rules.
B. Data Security. You agree to use reasonable and suitable measures to prevent persons who have not formally agreed to these Rules from gaining access to the Competition Data. You agree not to transmit, duplicate, publish, redistribute or otherwise provide or make available the Competition Data to any party not participating in the Competition. You agree to notify Kaggle immediately upon learning of any possible unauthorized transmission of or unauthorized access to the Competition Data and agree to work with Kaggle to rectify any unauthorized transmission or access.
C. External Data. You may use data other than the Competition Data (“External Data”) to develop and test your models and Submissions. However, you will (i) ensure the External Data is available to use by all participants of the competition for purposes of the competition at no cost to the other participants and (ii) post such access to the External Data for the participants to the official competition forum prior to the Entry Deadline.
GNU GPL 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset has gained popularity over time and is widely known. While Kaggle courses teach how to use Google BigQuery to extract a sample from it, this notebook provides a HOW-TO guide to access the dataset directly within your own notebook. Instead of uploading the entire dataset here, which is quite large, I offer several alternatives to work with a smaller portion of it. My main focus was to demonstrate various techniques to make the dataset more manageable on your own laptop, ensuring smoother operations. Additionally, I've included some interesting insights on basic descriptive statistics and even a modeling example, which can be further explored based on your preferences. I intend to revisit and refine it in the near future to enhance its rigor. Meanwhile, I welcome any suggestions to improve the notebook!
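For reference, this is the generic pattern for pulling a manageable sample from a BigQuery public dataset inside a notebook; the table used below (bigquery-public-data.samples.shakespeare) is only a stand-in, so substitute the dataset and query used in the notebook.

```python
# Pull a small sample from a BigQuery public table into a DataFrame.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT *
    FROM `bigquery-public-data.samples.shakespeare`
    LIMIT 1000
"""
df = client.query(query).to_dataframe()
print(df.shape)
df.to_parquet("sample.parquet")  # keep a small local copy for laptop-friendly work
```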
Here are the columns that I have chosen to include (after carefully eliminating a few others):
Feel free to explore the notebook and provide any suggestions for improvement. Your feedback is highly appreciated!
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
By [source]
The PubChemLite Compound Collection for Exposomics is a comprehensive compilation of over 371,000 chemicals from a diverse range of areas and application domains. This invaluable library provides data on molecular structure and composition, annotation categories, chemical functionality, and useful information about associated disorders and diseases. It encompasses fields ranging from tumorology to drug discovery and nutrition to toxicology, all enriched with PubMed papers and patents related to each substance. Moreover, the collection includes safety information regarding the pharmacological effects of each compound as well as its toxicity profile when exposed in vitro or when metabolised by the liver. For food-related substances, the FoodRelated field provides further details on whether their use is suitable for human consumption. With its comprehensive range of annotation categories, this collection can provide invaluable insight into how the environment affects human health, giving researchers access to evidence-backed source data that helps them pursue important questions in exposomics.
This dataset provides an invaluable resource for research in a range of fields, including tumorology, drug-discovery, food and nutrition, toxicology, and many others. It can be used to explore the relationships between various chemicals and related biological effects.
In order to use the PubChemLite Compound Collection for Exposomics effectively and efficiently there are several key steps to follow:
Familiarize yourself with the columns in the dataset. There are 15 columns available in this dataset which provide information on a range of topics as well as relevant annotation types related to each chemical compound. By understanding which columns are most relevant you can better focus your investigations into specific areas of interest.
Analyze each column according to its type. Each column contains data elements that can have different formats or data types (e.g., integer values for PubMed_Count). Make sure you understand how these datatypes affect how you interpret the data or apply your analysis techniques to it. Additionally, check whether any filtering is necessary according to certain criteria before investigating individual rows further.
Use visualization tools to spot patterns within specific variables or relationships between them if needed. Plotting techniques such as box plots (e.g., with seaborn) may be used here where suitable.
- Developing a personalized nutrition plan by correlating individual food intake to the associated chemical compounds for better understanding of nutrient absorption and health effects.
- Understanding reproducibility in drug-discovery and drug safety with detailed analysis of PubMed, Patent and Toxicity information linked to each compound in the dataset.
- Identifying new opportunities for agrochemical research and product development through visibility into AgroChemInfo annotation data linked to key compounds found in the dataset.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: PubChemLite_31Oct2020_exposomics.csv

| Column name | Description |
|:--------------|:-----------------------------------------------------------------------------------|
| FirstBlock | A unique identifier for each chemical compound. (String) |
| PubMed_Count | The number of times the chemical compound has been mentioned in PubMed. (Integer) |
| Patent_Count | The number of times the chemical compound has been mentioned in patents. (Integer) |
| Synonym | A list of alternative names for the chemical ...
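As a small, hedged example of working with the columns listed above, the sketch below ranks compounds by how often they appear in the literature.

```python
# Rank compounds by literature and patent mentions.
import pandas as pd

pcl = pd.read_csv("PubChemLite_31Oct2020_exposomics.csv")
top_cited = pcl.sort_values("PubMed_Count", ascending=False)
print(top_cited[["FirstBlock", "PubMed_Count", "Patent_Count"]].head(10))
```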
By Elias Dabbas [source]
This dataset contains details about Hollywood's all-time domestic box office records. It includes data scraped from Box Office Mojo, which breaks down every movie's lifetime gross, ranking and production year. Domestic gross (adjusted for inflation) has been used as the benchmark to determine which movies were the most successful at the box office in America. This dataset allows you to explore an extensive, comprehensive list of Hollywood's all-time biggest hits, analyze examples of unprecedented blockbusters, and observe current market trends in domestic box office history.
This dataset contains comprehensive information about Hollywood movies and their domestic performance at the box office. It includes data on films' production year, lifetime gross, ranking and the studio that produced them. By using this dataset, you can analyze the financial successes and failures of films produced by different studios to gain insights into the Hollywood movie market over time.
The 'rank' column shows each film's ranking compared to other Hollywood movies released in its year of release, based on its box office revenue from theaters (not including other sources such as DVD sales or streaming services). According to this dataset's convention, a higher rank number indicates a film was more successful financially than other films released in the same window when ticket prices are taken into account; lower numbers equate to less success at that time frame's box office.
The 'title' column features all movies analyzed here, with links that direct users to articles giving background information about those projects - directorial credentials or management history - as well as full reviews with ratings given by critics while they screened theatrically across North America (U.S., Canada).
The 'studio' column outlines which media conglomerate is credited with distribution/marketing rights for each featured motion picture during its original domestic theatrical run. These name brands represent umbrella corporations comprising multiple divisions specializing in the creative development and financing of cinematic works, along with technical operations such as the visual effects shops used during post-production, and they often extend well beyond motion pictures into music and television domains under flags like Warner Bros., Disney (ABC) and NBCUniversal (Comcast). Only commercially released, feature-length theatrical films are included in the entries below.
- Creating a recommendation engine to suggest similar movies based on lifetime gross and year of release.
- Data analysis and visualization of box office trends over time for major Hollywood studios (see the sketch below).
- Utilizing the data to recommend alternative ways for movie marketers to invest their advertising budgets in order to maximize their return on investment.
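A small, hedged sketch for the second use case follows; the file name and the studio/lifetime_gross column names are assumptions based on the description and may differ in the actual files.

```python
# Total lifetime domestic gross by studio (illustrative).
import pandas as pd

movies = pd.read_csv("boxoffice.csv")  # placeholder file name
by_studio = (movies.groupby("studio")["lifetime_gross"]
                   .sum().sort_values(ascending=False))
print(by_studio.head(10))
```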
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors - You are free to: - Share - copy and redistribute the material in any medium or format for any purpose, even commercially. - Adapt - remix, transform, and build upon the material for any purpose, even commercially. - You must: - Give appropriate credit - Provide a link to the license, and indicate if changes were made. - ShareAlike - You must distribute your contributions under the same license as the original. - **Keep i...