This dataset provides salary data based on years of experience, education level, and job role. It can be used for salary prediction models, regression analysis, and workforce analytics. The dataset includes realistic salary variations based on industry trends.
The dataset was synthetically generated using a linear regression-based formula with added randomness and scaling factors based on job roles and education levels. While not real-world data, it closely mimics actual salary distributions in the tech and business industries.
This dataset is designed for research, learning, and data science practice. It is not collected from real-world surveys but follows statistical patterns observed in salary data.
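As an illustration of the generation approach described above, here is a minimal sketch that builds a synthetic salary table from a linear base formula with education and role scaling factors plus Gaussian noise. The multipliers, column names, and ranges are assumptions chosen for illustration, not the formula actually used to produce this dataset.

```python
# Illustrative only: synthetic salary generation with a linear base formula,
# role/education scaling factors, and added randomness (all values assumed).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

edu_factor = {"High School": 1.0, "Bachelor": 1.2, "Master": 1.35, "PhD": 1.5}
role_factor = {"Analyst": 1.0, "Engineer": 1.25, "Manager": 1.4}

experience = rng.uniform(0, 20, n)                       # years of experience
education = rng.choice(list(edu_factor), n)
role = rng.choice(list(role_factor), n)

base = 40_000 + 2_500 * experience                       # linear trend in experience
scale = (np.vectorize(edu_factor.get)(education)
         * np.vectorize(role_factor.get)(role))
salary = base * scale + rng.normal(0, 5_000, n)          # added randomness

df = pd.DataFrame({"YearsExperience": experience.round(1),
                   "Education": education,
                   "JobRole": role,
                   "Salary": salary.round(2)})
print(df.head())
```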
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
Column | Description |
code_blocks_index | Global index linking code blocks to markup_data.csv. |
kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
code_block_id | Position of the code block within the notebook. |
code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
Column | Description |
kernel_id | Identifier for the Kaggle Jupyter notebook. |
kaggle_score | Performance metric of the notebook. |
kaggle_comments | Number of comments on the notebook. |
kaggle_upvotes | Number of upvotes the notebook received. |
kernel_link | URL to the notebook. |
comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
Column | Description |
comp_name | Name of the Kaggle competition. |
description | Overview of the competition task. |
data_type | Type of data used in the competition. |
comp_type | Classification of the competition. |
subtitle | Short description of the task. |
EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
data_sources | Links to datasets used. |
metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
Column | Description |
code_block | Machine learning code block. |
too_long | Flag indicating whether the block spans multiple semantic types. |
marks | Confidence level of the annotation. |
graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code_blocks.csv can be joined to kernels_meta.csv via the kernel_id column, and kernels_meta.csv to competitions_meta.csv via the comp_name column. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
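As a minimal sketch of that mapping (assuming the CSV files have been downloaded to the working directory), the tables can be joined with pandas:

```python
# Join code blocks to notebook metadata (via kernel_id) and then to
# competition metadata (via comp_name).
import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels_meta = pd.read_csv("kernels_meta.csv")
competitions_meta = pd.read_csv("competitions_meta.csv")

blocks_with_meta = (code_blocks
                    .merge(kernels_meta, on="kernel_id", how="left")
                    .merge(competitions_meta, on="comp_name", how="left"))

print(blocks_with_meta[["code_blocks_index", "kernel_id",
                        "comp_name", "kaggle_score"]].head())
```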
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
The Code4ML 2.0 corpus is a versatile resource, enabling the training and evaluation of models across a range of tasks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Customer log dataset is a 12.5 GB JSON file and it contains 18 columns and 26,259,199 records. There are 12 string columns and 6 numeric columns, which may also contain null or NaN values. The columns include userId, artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song,status, ts and userAgent. As evident from the column names, the dataset contains various user-related information, such as user identifiers, demographic details (firstName, lastName, gender), interaction details (artist, song, length, itemInSession, sessionId, registration, lastinteraction) and technical details (userAgent, method, page, location, status, level, auth).
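Given its size, the log is easiest to process in chunks. The sketch below assumes the file is newline-delimited JSON and is named customer_log.json (both assumptions); it counts events per user without loading the whole file into memory.

```python
# Stream the large JSON log in chunks and aggregate per-user event counts.
import pandas as pd
from collections import Counter

user_counts = Counter()
reader = pd.read_json("customer_log.json", lines=True, chunksize=500_000)
for chunk in reader:
    valid = chunk.dropna(subset=["userId"])      # skip rows with a null userId
    user_counts.update(valid["userId"].value_counts().to_dict())

print("Distinct users:", len(user_counts))
```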
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19's Impact on Educational Stress’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/bsoyka3/educational-stress-due-to-the-coronavirus-pandemic on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The survey collecting this information is still open for responses here.
I just made this public survey because I want someone to be able to do something fun or insightful with the data that's been gathered. You can fill it out too!
Each row represents a response to the survey. A few things have been done to sanitize the raw responses: - Column names and options have been renamed to make them easier to work with without much loss of meaning. - Responses from non-students have been removed. - Responses with ages greater than or equal to 22 have been removed.
Take a look at the column description for each column to see what exactly it represents.
This dataset wouldn't exist without the help of others. I'd like to thank the following people for their contributions: - Every student who responded to the survey with valid responses - @radcliff on GitHub for providing the list of countries and abbreviations used in the survey and dataset - Giovanna de Vincenzo for providing the list of US states used in the survey and dataset - Simon Migaj for providing the image used for the survey and this dataset
--- Original source retains full ownership of the source dataset ---
About the MNAD Dataset The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.
Dataset Fields
Title: The title of the article.
Body: The body of the article.
Category: The category of the article.
Source: The electronic newspaper source of the article.
About Version 1 of the Dataset (MNAD.v1) Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.
The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been used as a benchmark in the research paper "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization", presented at the 2021 International Conference on Decision Aid Sciences and Application (DASA).
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv1 - Huggingface Datasets: MNADv1
About Version 2 of the Dataset (MNAD.v2) Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.
The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.
Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
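The sketch below illustrates that kind of cleaning on MNADv2.csv; the length bounds and the Arabic-script heuristic are assumptions for illustration, not the authors' exact procedure.

```python
# Illustrative cleaning pass over MNADv2.csv (thresholds are assumptions).
import re
import pandas as pd

df = pd.read_csv("MNADv2.csv")

df = df.drop_duplicates().dropna()                              # duplicates, NaN rows
df["Body"] = (df["Body"]
              .str.replace(r"\n+", " ", regex=True)             # new lines -> space
              .str.replace(r"\s+", " ", regex=True)             # collapse multiple spaces
              .str.strip())

lengths = df["Body"].str.split().str.len()
df = df[lengths.between(20, 2000)]                              # drop very short/long articles

arabic_ratio = df["Body"].apply(
    lambda t: len(re.findall(r"[\u0600-\u06FF]", t)) / max(len(t), 1))
df = df[arabic_ratio > 0.5]                                     # keep mostly-Arabic articles
print(len(df), "articles after cleaning")
```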
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv2 - Huggingface Datasets: MNADv2
Citation If you use our data, please cite the following paper:
```bibtex
@inproceedings{MNAD2021,
  author    = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
  title     = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
  year      = {2021},
  publisher = {{IEEE}},
  booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
  doi       = {10.1109/dasa53625.2021.9682402},
  url       = {https://doi.org/10.1109/dasa53625.2021.9682402},
}
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Kaggle Datasets Ranking’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vivovinco/kaggle-datasets-ranking on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains Kaggle ranking of datasets.
More than 800 rows and 8 columns. Column descriptions are listed below.
Data from Kaggle. Image from The Guardian.
If you're reading this, please upvote.
--- Original source retains full ownership of the source dataset ---
This dataset reflects incidents of crime in the City of Los Angeles dating back to 2020. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database. Please note questions or concerns in the comments.
This dataset contains information on reported crimes, with each row representing a specific crime incident. The meaning and typical content of each column are described below:
DR_NO: This column likely represents a unique identifier or reference number for each reported crime incident. It helps in tracking and referencing individual cases.
Date Rptd: This column stores the date when the crime was reported to law enforcement authorities. It marks the date when the incident came to their attention.
DATE OCC: This column indicates the date when the crime actually occurred or took place. It represents the day when the incident happened.
TIME OCC: This column records the time of day when the crime occurred. It provides a timestamp for the incident.
AREA: This column may represent a specific geographical area or jurisdiction within a larger region where the crime took place. It categorizes the incident's location.
AREA NAME: This column likely contains the name or label of the larger area or district that encompasses the specific area where the crime occurred.
Rpt Dist No: This column might represent a reporting district number or code within the specified area. It provides additional location details.
Part 1-2: This column could be related to the type or category of crime reported. "Part 1" crimes typically include serious offenses like homicide, robbery, etc., while "Part 2" crimes may include less serious offenses.
Crm Cd: This column may contain a numerical code representing the specific type of crime that was committed. Each code corresponds to a distinct category of criminal activity.
Crm Cd Desc: This column likely contains a textual description or label for the crime type identified by the "Crm Cd."
Mocodes: This column might store additional information or details related to the modus operandi (MO) of the crime, providing insights into how the crime was committed.
Vict Age: This column records the age of the victim involved in the crime.
Vict Sex: This column indicates the gender or sex of the victim.
Vict Descent: This column might represent the ethnic or racial background of the victim.
Premis Cd: This column could contain a numerical code representing the type of premises where the crime occurred, such as a residence, commercial establishment, or public place.
Premis Desc: This column likely contains a textual description or label for the type of premises identified by the "Premis Cd."
Weapon Used Cd: This column may indicate whether a weapon was used in the commission of the crime and, if so, it could provide a numerical code for the type of weapon.
Weapon Desc: This column likely contains a textual description or label for the type of weapon identified by the "Weapon Used Cd."
Status: This column could represent the current status or disposition of the reported crime, such as "open," "closed," "under investigation," etc.
Status Desc: This column likely contains a textual description or label for the status of the reported crime.
Crm Cd 1, Crm Cd 2, Crm Cd 3, Crm Cd 4: These columns might provide additional numerical codes for multiple crime categories associated with a single incident.
LOCATION: This column likely describes the specific location or address where the crime occurred, providing detailed location information.
Cross Street: This column might include the name of a cross street or intersection near the crime location, offering additional context.
LAT: This column stores the latitude coordinate of the crime location, allowing for precise geospatial mapping.
LON: This column contains the longitude coordinate of the crime location, complementing the latitude for accurate geolocation.
Overall, this dataset appears to be a comprehensive record of reported crimes, providing valuable information about the nature of each incident, the location, and various details related to the victims, perpetrators, and circumstances surrounding the crimes. It can be a valuable resource for crime analysis, law enforcement, and public safety research.
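A minimal preparation sketch, assuming the data is exported as a CSV with the columns described above (the file name is an assumption): it parses the two date columns and masks the (0°, 0°) placeholder coordinates.

```python
# Parse dates and treat the (0, 0) placeholder coordinates as missing values.
import numpy as np
import pandas as pd

crime = pd.read_csv("crime_data_2020_to_present.csv")

crime["Date Rptd"] = pd.to_datetime(crime["Date Rptd"], errors="coerce")
crime["DATE OCC"] = pd.to_datetime(crime["DATE OCC"], errors="coerce")

zero_loc = (crime["LAT"] == 0) & (crime["LON"] == 0)
crime.loc[zero_loc, ["LAT", "LON"]] = np.nan

print(crime[["DR_NO", "DATE OCC", "LAT", "LON"]].head())
```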
https://creativecommons.org/publicdomain/zero/1.0/
YouTube maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Note that they’re not the most-viewed videos overall for the calendar year”.
Note that this dataset is a structurally improved version of the original Trending YouTube Video Statistics dataset.
This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the IN, US, GB, DE, CA, FR, RU, BR, MX, KR, and JP regions (India, USA, Great Britain, Germany, Canada, France, Russia, Brazil, Mexico, South Korea, and Japan, respectively), with up to 200 listed trending videos per day.
Each region’s data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.
The data also includes a category_id field, which varies between regions. To retrieve the categories for a specific video, find it in the associated JSON. One such file is included for each of the 11 regions in the dataset.
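A minimal lookup sketch, assuming the US files are named US_youtube_trending_data.csv and US_category_id.json and that the column is called category_id (the exact names may differ between versions):

```python
# Map the numeric category_id to a readable category name using the region's JSON file.
import json
import pandas as pd

videos = pd.read_csv("US_youtube_trending_data.csv")

with open("US_category_id.json", encoding="utf-8") as f:
    categories = json.load(f)

id_to_name = {int(item["id"]): item["snippet"]["title"]
              for item in categories["items"]}

videos["category_name"] = videos["category_id"].map(id_to_name)
print(videos[["title", "category_id", "category_name"]].head())
```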
For more information on specific columns in the dataset refer to the column metadata.
This dataset was collected using the YouTube API. This dataset is the updated version of Trending YouTube Video Statistics.
Possible uses for this dataset could include: - Sentiment analysis in a variety of forms - Categorizing YouTube videos based on their comments and statistics. - Training ML algorithms like RNNs to generate their own YouTube comments. - Analyzing what factors affect how popular a YouTube video will be. - Statistical analysis over time.
For further inspiration, see the kernels on this dataset!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Phishing website Detector’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/eswarchandt/phishing-website-detector on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The data set is provided both as a text file and a CSV file, which provide the following resources that can be used as inputs for model building:
A collection of website URLs for 11000+ websites. Each sample has 30 website parameters and a class label identifying it as a phishing website or not (1 or -1).
A code template containing the following code blocks: (a) import modules (Part 1); (b) a load-data function plus input/output field descriptions.
The data set also serves as an input for project scoping, helping to specify the functional and non-functional requirements for the project.
You are expected to write the code for a binary classification model (phishing website or not) using Python Scikit-Learn that trains on the data and calculates the accuracy score on the test data. You have to use one or more of the classification algorithms to train a model on the phishing website data set.
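A minimal sketch of such a model, assuming the CSV file is named phishing.csv and that the last column holds the 1/-1 class label while the first 30 columns hold the website parameters:

```python
# Train a simple classifier on the phishing data and report test accuracy.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("phishing.csv")
X = data.iloc[:, :-1]          # 30 website parameters
y = data.iloc[:, -1]           # class label: phishing or not (1 or -1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```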
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process; the U.S. employer must submit a petition for an H1B visa to the US immigration department. This is the most common visa status applied for by international students once they complete college or higher education and begin working in a full-time position. The project provides essential information on job titles, preferred regions of settlement, and trends among foreign applicants and employers for H1B visa applications. Because locations, employers, job titles, and salary ranges make up most of the H1B petitions, different visualization tools are used to analyze and interpret H1B visa trends and provide recommendations to applicants. This report is the basis of the project for the Visualization of Complex Data class at the George Washington University; some examples in this project analyze the relevant variables (Case Status, Employer Name, SOC Name, Job Title, Prevailing Wage, Worksite, and latitude and longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) in order to see how the H1B visa has changed over the past several decades.
Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js
Dataset
The dataset contains 10 columns and covers a total of 3 million records spanning 2011-2016. The relevant columns include case status, employer name, SOC name, job title, full-time position, prevailing wage, year, worksite, and latitude and longitude information.
Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm
Running the code
Open Index.html.
Data Processing
- Do some data preprocessing to transform the raw data into an understandable format.
- Find and combine other external datasets to enrich the analysis, such as the FY2017 dataset.
- Develop the variables needed for the visualizations and compile them into the visualization programs.
- Draw a geo map and scatter plot to compare the fastest growth in fixed value and in percentages.
- Extract some aspects and analyze the changes in employers' preferences as well as forecasts for future trends.
Visualizations
- Combo chart: shows the overall volume of receipts and the approval rate.
- Scatter plot: shows the beneficiary country of birth.
- Geo map: shows H1B petitions filed for all states.
- Line chart: shows the top 10 states for H1B petitions filed.
- Pie chart: shows a comparison of education level and occupations for petitions, FY2011 vs FY2017.
- Tree map: shows the top employers who submit the greatest number of applications overall.
- Side-by-side bar chart: shows an overall comparison of Data Scientist and Data Analyst.
- Highlight table: shows the mean wage of a Data Scientist and Data Analyst with case status certified.
- Bubble chart: shows the top 10 companies for Data Scientist and Data Analyst.
Related Research
- The H-1B Visa Debate, Explained - Harvard Business Review: https://hbr.org/2017/05/the-h-1b-visa-debate-explained
- Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
- Key facts about the U.S. H-1B visa program: http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
- H1B visa News and Updates from The Economic Times: https://economictimes.indiatimes.com/topic/H1B-visa/news
- H-1B visa - Wikipedia: https://en.wikipedia.org/wiki/H-1B_visa
Key Findings
- From the analysis, the government cut down the number of H1B approvals in 2017.
- In the past decade, due to the nature of demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from countries in Asia such as China and India.
- Technical jobs such as Computer Systems Analyst and Software Developer make up the majority of the top 10 jobs among foreign workers.
- Employers located in metro areas strive to find a foreign workforce to fill the technical positions in their organizations.
- States like California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are the prime locations for foreign workers and provide many job opportunities.
- Top companies such as Infosys, Tata, and IBM India, which submit the most H1B visa applications, are companies based in India associated with software and IT services.
- The Data Scientist position has experienced exponential growth in H1B visa applications, and these jobs are clustered most heavily in the West region.
Visualization programs
HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau
Description:
This dataset is sourced from the "Roast-PC by Gemini" website, a platform that provides AI-powered roasting (critical feedback) on custom PC builds. Users input the components of their PC build, including CPU, GPU, motherboard, RAM, PSU, disk, and intended use case. The dataset captures the logs of these submissions, along with the roasting comments generated by Gemini AI, Google's AI model.
Dataset Overview:
Column Names and Descriptions:
Time: Date and time of the request.
cpu: The CPU model specified by the user (e.g., "AMD Ryzen 5 5500", "Intel i7 1200K").
gpu: The GPU model specified by the user (e.g., "NVIDIA RTX 3080", "AMD Radeon RX 6800").
motherboard: The motherboard model specified by the user (e.g., "ASUS ROG Strix B550-F", "MSI B450 TOMAHAWK").
ram: The RAM configuration specified by the user, including size and speed (e.g., "16GB DDR4 3200MHz").
psu: The PSU (power supply unit) model specified by the user, including wattage (e.g., "Corsair RM750x 750W").
disk: The storage devices specified by the user, including type and capacity (e.g., "1TB NVMe SSD", "500GB SATA HDD").
use_case: The intended use of the PC as specified by the user (e.g., "gaming", "video editing", "general use").
roast_comments: The AI-generated feedback or roasting comments provided by Gemini AI, critiquing the PC build based on the components and use case (in Indonesian).
Functionality:
This dataset serves multiple purposes:
This dataset is ideal for those interested in PC building, hardware analysis, AI-generated content, or anyone curious about trends in custom PC configurations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘🗳 Pollster Ratings’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/pollster-ratingse on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains the data behind FiveThirtyEight's pollster ratings.
- FiveThirtyEight's Pollster Ratings
- The State Of The Polls, 2019
- The Polls Are All Right
- The State Of The Polls, 2016
- How FiveThirtyEight Calculates Pollster Ratings
pollster-stats-full contains a spreadsheet with all of the summary data and calculations involved in determining the pollster ratings, as well as descriptions for each column.
pollster-ratings has ratings and calculations for each pollster. A copy of this data and descriptions for each column can also be found in pollster-stats-full.
raw-polls contains all of the polls analyzed to give each pollster a grade.
Source: https://github.com/fivethirtyeight/data
License: The data is available under the Creative Commons Attribution 4.0 International License. If you find it useful, please let us know.
Updated: Pollster-ratings and raw-polls synced from source weekly.
This dataset was created by FiveThirtyEight and contains around 10000 samples along with Cand2 Id, Pollster, technical information and other features such as: - Samplesize - Partisan - and more.
- Analyze Cand2 Party in relation to Race Id
- Study the influence of Margin Poll on Cand1 Actual
- More datasets
If you use this dataset in your research, please credit FiveThirtyEight
--- Original source retains full ownership of the source dataset ---
This dataset contains feature values and click feedback for millions of display ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction. It has been used for the Display Advertising Challenge hosted by Kaggle: https://www.kaggle.com/c/criteo-display-ad-challenge/
===================================================
Full description:
This dataset contains 2 files, train.txt and test.txt, corresponding to the training and test parts of the data.
====================================================
Dataset construction:
The training dataset consists of a portion of Criteo's traffic over a period of 7 days. Each row corresponds to a display ad served by Criteo, and the first column indicates whether this ad has been clicked or not. The positive (clicked) and negative (non-clicked) examples have both been subsampled (at different rates) in order to reduce the dataset size.
There are 13 features taking integer values (mostly count features) and 26 categorical features. The values of the categorical features have been hashed onto 32 bits for anonymization purposes. The semantics of these features are undisclosed. Some features may have missing values.
The rows are chronologically ordered.
The test set is computed in the same way as the training set but it corresponds to events on the day following the training period. The first column (label) has been removed.
====================================================
Format:
The columns are tab-separated with the following schema: <label> <integer feature 1> … <integer feature 13> <categorical feature 1> … <categorical feature 26>
When a value is missing, the field is just empty. There is no label field in the test set.
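A minimal reading sketch with pandas, assuming the layout above (label, 13 integer features, 26 categorical features) and loading only a slice of train.txt:

```python
# Read a slice of the tab-separated training file and impute missing values.
import pandas as pd

int_cols = [f"int_{i}" for i in range(1, 14)]
cat_cols = [f"cat_{i}" for i in range(1, 27)]
cols = ["label"] + int_cols + cat_cols

train = pd.read_csv("train.txt", sep="\t", names=cols, nrows=100_000)

train[int_cols] = train[int_cols].fillna(0)            # empty integer fields -> 0
train[cat_cols] = train[cat_cols].fillna("missing")    # empty categorical fields -> token

print("CTR in subsample:", train["label"].mean())
```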
====================================================
Dataset assembled by Olivier Chapelle (o.chapelle@criteo.com)
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset consists of timestamps for coughs contained in files extracted from the ESC-50 and FSDKaggle2018 datasets.
Citation
This dataset was generated and used in our paper:
Mahmoud Abdelkhalek, Jinyi Qiu, Michelle Hernandez, Alper Bozkurt, Edgar Lobaton, “Investigating the Relationship between Cough Detection and Sampling Frequency for Wearable Devices,” in the 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2021.
Please cite this paper if you use the timestamps.csv file in your work.
Generation
The cough timestamps given in the timestamps.csv file were generated using the cough templates given in figures 3 and 4 in the paper:
A. H. Morice, G. A. Fontana, M. G. Belvisi, S. S. Birring, K. F. Chung, P. V. Dicpinigaitis, J. A. Kastelik, L. P. McGarvey, J. A. Smith, M. Tatar, J. Widdicombe, "ERS guidelines on the assessment of cough", European Respiratory Journal 2007 29: 1256-1276; DOI: 10.1183/09031936.00101006
More precisely, 40 files labelled as "coughing" in the ESC-50 dataset and 273 files labelled as "Cough" in the FSDKaggle2018 dataset were manually searched using Audacity for segments of audio that closely matched the aforementioned templates, both visually and auditorily. Some files did not contain any coughs at all, while other files contained several coughs. Therefore, only the files that contained at least one cough are included in the coughs directory. In total, the timestamps of 768 cough segments with lengths ranging from 0.2 seconds to 0.9 seconds were extracted.
Description
The audio files are presented in wav format in the coughs directory. Files named in the general format of "*-*-*-24.wav" were extracted from the ESC-50 dataset, while all other files were extracted from the FSDKaggle2018 dataset.
The timestamps.csv file contains the timestamps for the coughs and it consists of four columns:
file_name,cough_number,start_time,end_time
Files in the file_name column can be found in the coughs directory. cough_number refers to the index of the cough in the corresponding file. For example, if the file X.wav contains 5 coughs, then X.wav will be repeated 5 times under the file_name column, and for each row, the cough_number will range from 1 to 5. start_time refers to the starting time of a cough segment measured in seconds, while end_time refers to the end time of a cough segment measured in seconds.
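A minimal extraction sketch, assuming the coughs directory sits next to timestamps.csv and using the soundfile library to cut each annotated segment out of its source file:

```python
# Cut each annotated cough segment out of its source wav file.
import os
import pandas as pd
import soundfile as sf

timestamps = pd.read_csv("timestamps.csv")
os.makedirs("segments", exist_ok=True)

for row in timestamps.itertuples(index=False):
    audio, sr = sf.read(os.path.join("coughs", row.file_name))
    segment = audio[int(row.start_time * sr):int(row.end_time * sr)]
    out_name = f"{os.path.splitext(row.file_name)[0]}_cough{row.cough_number}.wav"
    sf.write(os.path.join("segments", out_name), segment, sr)
```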
Licensing
The ESC-50 dataset as a whole is licensed under the Creative Commons Attribution-NonCommercial license. Individual files in the ESC-50 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see LICENSE. The ESC-50 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the ESC-50 dataset.
The FSDKaggle2018 dataset as a whole is licensed under the Creative Commons Attribution 4.0 International license. Individual files in the FSDKaggle2018 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see the License section in FSDKaggle2018. The FSDKaggle2018 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the FSDKaggle2018 dataset.
The timestamps.csv file is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset offers an extensive assortment of job postings, designed to support investigations and examinations within the realms of job market patterns, natural language processing (NLP), and machine learning. Developed for educational and research objectives, this dataset presents a varied array of job advertisements spanning diverse industries and job categories.
Category: The category of the job.
Workplace: Whether the job is remote, on-site, or hybrid.
Location: Location of the job posting.
Department: The department for which the job has been posted.
Type: Whether the job is full-time, part-time, or contractual in nature.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset provides the Young's Modulus values (in GPa) for 50 metals, covering a wide range of categories such as alkali metals, alkaline earth metals, transition metals, and rare earth elements. Young's Modulus is a fundamental mechanical property that measures a material's stiffness under tensile or compressive stress. It is critical for applications in materials science, physics, and engineering.
The dataset includes: - 50 metals with their chemical symbols and Young's Modulus values. - A wide range of stiffness values, from soft metals like cesium (1.7 GPa) to very stiff metals like ruthenium (447 GPa). - Clean and complete data with no missing or duplicate entries.
This dataset can be utilized in various data science and engineering applications, such as: 1. Material Property Prediction: Train machine learning models to predict mechanical properties based on elemental features. 2. Cluster Analysis: Group metals based on their mechanical properties or periodic trends. 3. Correlation Studies: Explore relationships between Young's Modulus and other physical/chemical properties (e.g., density, atomic radius). 4. Engineering Simulations: Use the data for simulations in structural analysis or material selection for design purposes. 5. Visualization and Education: Create visualizations to teach periodic trends and material property variations.
Column Name | Description |
---|---|
Metal | Name of the metal (e.g., Lithium, Beryllium). |
Symbol | Chemical symbol of the metal (e.g., Li, Be). |
Young's Modulus (GPa) | Young's Modulus value in gigapascals (GPa), indicating stiffness under stress. |
The dataset was ethically sourced from publicly available scientific references and academic resources. The data was verified for accuracy using multiple authoritative sources, ensuring reliability for research and educational purposes. No proprietary or sensitive information was included.
Key checks performed: - No missing values: The dataset contains complete entries for all 50 metals. - No duplicates: Each metal appears only once in the dataset. - Statistical analysis: The mean Young's Modulus is ~98.93 GPa, with a wide range from 1.7 GPa to 447 GPa.
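A minimal sketch reproducing these checks, assuming the file is named youngs_modulus.csv and uses the column names from the table above:

```python
# Verify completeness, uniqueness, and basic statistics of the dataset.
import pandas as pd

df = pd.read_csv("youngs_modulus.csv")

assert df.isna().sum().sum() == 0, "unexpected missing values"
assert not df["Metal"].duplicated().any(), "unexpected duplicate metals"

ym = df["Young's Modulus (GPa)"]
print(f"n = {len(df)}, mean = {ym.mean():.2f} GPa, "
      f"min = {ym.min()} GPa, max = {ym.max()} GPa")
```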
We would like to thank the following sources for their contributions to this dataset: - Academic references such as WebElements, Byju's Chemistry Resources, and Wikipedia for cross-verifying the data. - Scientific databases like MatWeb and ASM International for providing accurate material property data. - Special thanks to DALL·E 3 for generating the accompanying dataset image.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description
This dataset captures agricultural data for Karnataka, specifically focusing on crop yields in Mangalore. Key features include the year of production, geographic details, and environmental conditions such as rainfall (measured in mm), temperature (in degrees Celsius), and humidity (as a percentage). Soil type, irrigation method, and crop type are also recorded, along with crop yields, market price, and season of growth (e.g., Kharif).
The dataset includes several columns related to crop production conditions and outcomes. For example, coconut crop data reveals a pattern of yields over different area sizes, showing how factors like rainfall, temperature, and irrigation influence production. Prices also vary, offering insights into the economic aspects of agriculture in the region. This information could be used to study the impact of environmental conditions and farming techniques on crop productivity, assisting in the development of optimized agricultural practices tailored for specific soil types, climates, and crop needs.
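As an illustration of such a study, the sketch below fits a simple model that predicts yield from rainfall, temperature, and humidity; the file name and lower-case column names are assumptions based on the description above.

```python
# Fit a basic regression model relating environmental conditions to crop yield.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

crops = pd.read_csv("karnataka_crop_yield.csv")

X = crops[["rainfall", "temperature", "humidity"]]
y = crops["yield"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))
```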
Column Description
yield: The amount of crop produced per unit area of land.
In the season column:
Kharif Season: This is the monsoon crop season, where crops are sown at the beginning of the monsoon season (around June) and harvested at the end of the monsoon season (around October). Examples of Kharif crops include rice, maize, and pulses.
Rabi Season: This is the winter crop season, where crops are sown after the monsoon season (around November) and harvested in the spring (around April). Examples of Rabi crops include wheat, barley, and mustard.
Zaid Season: This is the summer crop season, which falls between the Kharif and Rabi seasons (around March to June). Zaid crops are usually short-duration crops and include vegetables, watermelons, and cucumbers.
Authors
rajesh naik
Area covered
Karnataka
Unique identifier
https://creativecommons.org/publicdomain/zero/1.0/
Dataset History
A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summaries of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city type, stay in the current city), product details (productid and product category) and Total purchase amount from last month.
Now, they want to build a model to predict the purchase amount of customers against various products which will help them to create a personalized offer for customers against different products.
Tasks to perform
The Purchase column is the target variable; perform univariate analysis and bivariate analysis with respect to Purchase.
"Masked" in the column description means the values have already been converted from categorical to numerical.
The points below are provided to get you started with the dataset; it is not mandatory to follow the same sequence.
DATA PREPROCESSING
Check the basic statistics of the dataset
Check for missing values in the data
Check for unique values in data
Perform EDA
Purchase Distribution
Check for outliers
Analysis by gender, marital status, occupation, occupation vs. purchase, purchase by city, purchase by age group, etc.
Drop unnecessary fields
Convert categorical data into integers using the map function (e.g., the 'Gender' column); a short illustrative sketch follows this list.
Missing value treatment
Rename columns
Fill nan values
Map range variables into integers (e.g., the 'Age' column)
Data Visualisation
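A minimal sketch of the mapping and imputation steps above, assuming the usual column names for this dataset (the file name, Gender, Age, Purchase, and the product-category fields are assumptions):

```python
# Map categorical and range columns to integers and fill missing values.
import pandas as pd

df = pd.read_csv("train.csv")

df["Gender"] = df["Gender"].map({"F": 0, "M": 1})          # categorical -> integer

age_map = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
           "46-50": 4, "51-55": 5, "55+": 6}
df["Age"] = df["Age"].map(age_map)                         # range variable -> integer

df = df.fillna(0)                                          # fill NaN values

print(df[["Gender", "Age", "Purchase"]].describe())
```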
All the Best!!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘prediction of facebook comment’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kiranraje/prediction-facebook-comment on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The dataset is uploaded in ZIP format and contains 5 variants of the dataset; for details about the variants and a detailed analysis, read and cite the research paper titled 'Comment Volume Prediction'.
This dataset contains 28 columns, including:
1] Describes the popularity of or support for the source.
2] Describes how many people have visited this place so far.
3] Defines the daily interest of individuals towards the source of the document/post.
4] Defines the daily interest of individuals towards the source of the document/post.
--- Original source retains full ownership of the source dataset ---
Excel spreadsheets by species (the 4-letter code is an abbreviation for the genus and species used in the study, the year 2010 or 2011 is the year the data were collected, SH indicates data for Science Hub, and the date is the date of file preparation). The data in a file are described in a read-me file, which is the first worksheet in each file. Each row in a species spreadsheet is for one plot (plant). The data themselves are in the data worksheet. One file includes a read-me description of the columns in the data set for chemical analysis; in this file, one row is an herbicide treatment and sample for chemical analysis (if taken). This dataset is associated with the following publication: Olszyk, D., T. Pfleeger, T. Shiroyama, M. Blakely-Smith, E. Lee, and M. Plocher. Plant reproduction is altered by simulated herbicide drift to constructed plant communities. Environmental Toxicology and Chemistry, Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 36(10): 2799-2813, (2017).