Since Persian datasets are really scarce, I scraped Twitter in order to make a new Persian dataset.
The tweets were pulled from Twitter using snscrape,
and manual tagging was done based on Ekman's six basic emotions.
For privacy's sake, I pre-processed the tweets and removed usernames, display names, and mentions. I also deleted the timestamps and tweet IDs.
Columns: 1) tweet 2) replyCount 3) retweetCount 4) likeCount 5) quoteCount 6) hashtags 7) sourceLabel 8) emotion
Please leave an upvote if you find this relevant. :)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.
from IPython.display import Markdown, display
display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))  # render the README file itself rather than the path string
In this notebook, I have processed the images with Roboflow, because the COCO-formatted dataset had images of different dimensions and was not split into the required subsets. To train a custom YOLOv7 model we need the objects in the dataset to be annotated. To do so I have taken the following steps:
Image Credit - jinfagang
!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display # to display images
print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!
try:
    user_secrets = UserSecretsClient()
    wandb_api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=wandb_api_key)
    anonymous = None
except:
    wandb.login(anonymous='must')
    print('To use your W&B account, '
          'go to Add-ons -> Secrets and provide your W&B access token. Use the label name WANDB. '
          'Get your W&B access token from here: https://wandb.ai/authorize')
wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")
In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
In Roboflow, we can choose between two paths:
Roboflow annotation workflow (screenshot): https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")
Here, I am able to pass a number of arguments:
- img: define the input image size
- batch: determine the batch size
A rough sketch of the resulting training command is shown below.
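As an illustration only (not necessarily the exact command used in this run), a training invocation with the Roboflow export above could look like the following; the flag names follow the YOLOv7 repository's train.py, and the values are placeholders to adjust for your GPU and dataset.
# Illustrative values; {dataset.location} comes from the Roboflow download cell above.
!python train.py --img-size 640 --batch-size 16 --epochs 30 \
    --data {dataset.location}/data.yaml \
    --weights yolov7.pt --device 0 \
    --name yolov7-car-person-custom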
https://creativecommons.org/publicdomain/zero/1.0/
Nonindigenous aquatic species introductions are widely recognized as major stressors to freshwater ecosystems, threatening native endemic biodiversity and causing negative impacts to ecosystem services as well as damaging local and regional economies.
It is therefore necessary to monitor spatial and temporal trends and spread in order to guide prevention and control efforts and to develop effective policy aimed at mitigating impacts.
You can also use this data to improve your skills in analyzing spatial and temporal patterns; for that I recommend reviewing this course first.
This Kaggle dataset contains nonindigenous aquatic species introductions in the United States of America from 1616 to 2016.
Two sources were used:
1. DAT_SPECIES: Information about the species.
Dataset belongs to the U.S. EPA Office of Research and Development, and can be downloaded from various open data platforms. See data.gov or data.world.
The provided data were merged and lightly cleaned. Its features are:
2. DAT_SPATIAL: Georeferenced information for the USA by state.
Imported and preprocessed from geopandas datasets and a brief web scraping. Its features are:
Thanks to Kaggle! The platform, its resources, and the community in general are great.
This data can provide important insights into the historical drivers of such events and aid in forecasting future patterns; and if you add the spatiotemporal information, the analysis becomes even more complete.
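As a loose illustration of combining the two files, one might count introductions per state and attach the georeferenced attributes. The file and column names below are hypothetical, since the feature lists above are not spelled out; adjust them to the real columns.
import pandas as pd

species = pd.read_csv("DAT_SPECIES.csv")   # hypothetical: one row per introduction, with State and Year
spatial = pd.read_csv("DAT_SPATIAL.csv")   # hypothetical: one row per state with georeferenced attributes

# Count introductions per state and join them to the spatial table.
counts = species.groupby("State").size().rename("n_introductions").reset_index()
merged = spatial.merge(counts, on="State", how="left")
print(merged.sort_values("n_introductions", ascending=False).head())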
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This is a modified version of Preet Viradiya's dataset "Brian Tumor Dataset", but all the tumor images have been preprocessed, normalized, and the tumor location metadata has been manually gathered into a separate dataset.
The full preprocess sequence is detailed in the first half of this notebook in the original dataset: Brain tumor image preprocessing & clasifier
DISCLAIMER: I am no neuroscientist, so this data should only be used for practice purposes, as some of the tumor location data is bound to be inaccurate or plainly wrong.
The data is split in two datasets:
1. image_df
contains 2500 separate 128x128 px images of brain cancer scans, one in each row. Reshaping a row into a 128x128 array is necessary in order to display it correctly.
2. data_df
contains four integers per entry: the first two are the coordinates of the top-left corner of the approximate rectangle containing the tumor in the same-index image, and the following two values contain the rectangle's width and height respectively.
An example of loading and displaying data from this dataset has been included into the notebooks section under the name Dataset Usage Basic Example.
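As a minimal sketch along the same lines (the file names and the assumption that each image row holds exactly the 128x128 = 16,384 pixel values are mine; the notebook above is the authoritative version):
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Hypothetical file names -- adjust to the actual CSVs in this dataset.
image_df = pd.read_csv("image_df.csv")
data_df = pd.read_csv("data_df.csv")

i = 0
img = image_df.iloc[i].to_numpy().reshape(128, 128)   # one row -> 128x128 scan
x, y, w, h = data_df.iloc[i].to_numpy()[:4]           # top-left corner, width, height

fig, ax = plt.subplots()
ax.imshow(img, cmap="gray")
ax.add_patch(patches.Rectangle((x, y), w, h, edgecolor="red", facecolor="none"))
plt.show()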
Thanks to Preet Viradiya for providing the original images
This dataset's goal is to find a way to improve a hypothetical brain scan classifier. The question of "does this brain have cancer?" has been answered in the original dataset; using regression on this modified dataset, not only can the classification question be answered, but a model can also be trained to point out exactly where the tumor is located.
https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset provides an in-depth look into the global CO2 emissions at the country-level, allowing for a better understanding of how much each country contributes to the global cumulative human impact on climate. It contains information on total emissions as well as from coal, oil, gas, cement production and flaring, and other sources. The data also provides a breakdown of per capita CO2 emission per country - showing which countries are leading in pollution levels and identifying potential areas where reduction efforts should be concentrated. This dataset is essential for anyone who wants to get informed about their own environmental footprint or conduct research on international development trends.
This dataset provides a country-level survey of global fossil CO2 emissions, including total emissions, emissions from coal, oil, gas, cement, flaring and other sources as well as per capita emissions.
For researchers looking to quantify global CO2 emission levels by country over time and understand the sources of these emissions this dataset can be a valuable resource.
The data is organized using the following columns:
- Country: the name of the country
- ISO 3166-1 alpha-3: the three-letter code for the country
- Year: the year of the survey data
- Total: the total amount of CO2 emitted by the country in that year
- Coal: the amount of CO2 emitted by coal in that year
- Oil: the amount emitted by oil
- Gas: the amount emitted by gas
- Cement: the amount emitted by cement production
- Flaring: flaring emission levels
- Other: other sources such as industrial processes
In addition there is one extra column, Per Capita, which provides an insight into how much carbon dioxide is emitted per individual in each country.
To make use of these columns you can sum up the Total column for a specific region, work out how much each source contributes to the Total column (for example, what percentage it accounts for), or construct dashboard visualizations to explore which sources are responsible for higher emissions across different countries or clusters of similar countries. You can also examine whether individual countries focusing on Flaring (emissions associated with burning off natural gas while drilling) can improve their overall fossil carbon emission profiles.
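A small pandas sketch of that kind of aggregation (the file name matches the one listed below, and the column names follow the description above):
import pandas as pd

df = pd.read_csv("GCB2022v27_MtCO2_flat.csv")

# Total emissions per year, summed over all countries.
yearly_total = df.groupby("Year")["Total"].sum()

# Share of each source in the latest year, as a percentage of Total per country.
sources = ["Coal", "Oil", "Gas", "Cement", "Flaring", "Other"]
latest = df[df["Year"] == df["Year"].max()].copy()
latest[sources] = latest[sources].div(latest["Total"], axis=0) * 100

print(yearly_total.tail())
print(latest[["Country"] + sources].head())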
The main purpose behind this dataset was to give government bodies, private organizations, universities, NGOs, and research agencies alike the detailed, comprehensive, and verified information needed to apply analytical techniques, track environmental changes across regions, and develop efficient, directed ways of managing emissions.
With insights gleaned from this dataset one can begin to identify strategies for pollutant mitigation and combating climate change, make decisions centered around sustainable development, continent-wide unified plans, and policy implementations, and keep an eye out for evidence of regional discrepancies. For anyone working on improving quality of life in this way, "Global Fossil Carbon Dioxide Emissions: Country Level Survey 2002-2022" could be exactly what is needed.
- Using the per capita emissions data, develop a reporting system to track countries' progress in meeting carbon emission targets and give policy recommendations for how countries can reach those targets more quickly.
- Analyze the correlation between different fossil fuel sources and CO2 emissions to understand how best to reduce CO2 emissions at a country-level.
- Create an interactive map showing global CO2 levels over time that allows users to visualize trends by country or region across all fossil fuel sources
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: GCB2022v27_MtCO2_flat.csv | Column name | Description ...
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Labeled datasets are useful in machine learning research.
This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.
Tables: 1) annotations_bbox 2) dict 3) images 4) labels
Update Frequency: Quarterly
Fork this kernel to get started.
https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images
https://cloud.google.com/bigquery/public-data/openimages
APA-style citation: Google Research (2016). The Open Images dataset [Image urls and labels]. Available from github: https://github.com/openimages/dataset.
Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.
The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.
Banner Photo by Mattias Diesel from Unsplash.
Which labels are in the dataset? Which labels have "bus" in their display names? How many images of a trolleybus are in the dataset? What are some landing pages of images with a trolleybus? Which images with cherries are in the training set?
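As one possible starting point for questions like these, the query below looks up label codes whose display name mentions "bus". The table and column names are my assumptions about the public Open Images BigQuery schema; verify them in the BigQuery console before relying on this.
from google.cloud import bigquery

client = bigquery.Client()

# Assumed schema: bigquery-public-data.open_images.dict with columns
# label_name and label_display_name -- check the actual table schema first.
query = """
SELECT label_name, label_display_name
FROM `bigquery-public-data.open_images.dict`
WHERE LOWER(label_display_name) LIKE '%bus%'
"""
for row in client.query(query).result():
    print(row.label_name, row.label_display_name)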
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
So what can you try building? Here are some suggestions:
- Start with an image classifier. Use the masterCategory column from styles.csv and train a convolutional neural network (a sketch follows below).
- Try adding more sophisticated classification by predicting the other category labels in styles.csv.
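A minimal sketch of such a classifier, assuming styles.csv has id and masterCategory columns and the images live in an images/ folder named <id>.jpg; paths, image size, and hyperparameters here are illustrative, not a tuned recipe.
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

class FashionDataset(Dataset):
    def __init__(self, csv_path="styles.csv", img_dir="images"):
        self.df = pd.read_csv(csv_path, on_bad_lines="skip")
        self.img_dir = img_dir
        self.classes = sorted(self.df["masterCategory"].unique())
        self.to_tensor = transforms.Compose([
            transforms.Resize((80, 60)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(f"{self.img_dir}/{row['id']}.jpg").convert("RGB")
        label = self.classes.index(row["masterCategory"])
        return self.to_tensor(img), label

dataset = FashionDataset()
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# A deliberately small CNN; swap in a pretrained backbone for better accuracy.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 20 * 15, len(dataset.classes)),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:   # one pass shown; loop over epochs in practice
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()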
By Ben Jones [source]
This remarkable dataset chronicles the world record progression of the men's mile run, containing detailed information on each athlete's time, their name, nationality, date of their accomplishment and the location of their event. It allows us to look back in history and get a comprehensive overview of how this track event has progressed over time. Analyzing this information can help us understand how training and technology have improved the event over the years, as well as study different athletes' performances and learn how some athletes have pushed beyond their limits or fallen short. This valuable resource is an essential source for anyone intrigued by the cutting edge achievements in men's mile running world records. Discovering powerful insights from this dataset can allow us to gain perspective into not only our own personal goals but also uncover ideas on how we could continue pushing our physical boundaries by watching past successes. Explore and comprehend for yourself what it means to be a true athlete at heart!
This guide provides an introduction on how best to use this dataset in order to analyze various aspects involving the men’s mile run world records. We will focus on analyzing specific fields such as date, athlete name & nationality, time taken for completion and auto status by using statistical methods and graphical displays of data.
In order to use this data effectively it is important that you understand what each field measures:
- Time: the amount of time it took for an athlete to finish a race, measured in minutes and seconds (example: 3:54).
- Auto: whether or not a pacemaker was used during a specific race (example: yes/no).
- Athlete Name & Nationality: the name and nationality of the athlete who set the record (example: Usain Bolt - Jamaica).
- Date: the year in which a specific record was set by an individual (example: 2021).
- Venue: the location at which the record was set (example: London Olympic Stadium).
Now that you understand which fields measure what, let's discuss various ways you can use these features. Analyzing trends in historical sporting performances has long been used as a means of understanding changes brought about by new training methods, technologies, and so on over time. With this dataset that can be done using basic statistical displays like bar graphs and average analysis, or more advanced methods such as regression analysis or even Bayesian approaches. The first thing anyone dealing with this sort of data should do is inspect it for wacky outliers before beginning more rigorous analysis; if you discover potentially unreasonable values, it is best to discard them before building models or readings on top of them (this sort of elimination is common practice). After cleaning your workspace, move on to building interactive visual displays by plotting different columns against one another: plotting time against date lets us see changes over time from 1861 until now, and plotting time against Auto lets us see any differences between paced and unpaced races.
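A small plotting sketch along those lines; the file name, column names, and time format ("m:ss") are assumptions about this dataset's CSV, so adjust them as needed.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names -- adjust to the actual CSV.
records = pd.read_csv("mile_world_records.csv", parse_dates=["Date"])

# Convert "m:ss" times to seconds so they can be plotted on a numeric axis.
records["Seconds"] = pd.to_timedelta("0:" + records["Time"]).dt.total_seconds()

records.plot(x="Date", y="Seconds", marker="o", legend=False)
plt.ylabel("World record time (seconds)")
plt.title("Men's mile world record progression")
plt.show()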
- Comparing individual athletes and identifying those who have consistently pushed the event to higher levels of performance.
- Analyzing national trends related to improvement in track records over time, based on differences in training and technology.
- Creating a heatmap to visualize the progression of track records around the world and locate regions with a particularly strong historical performance in this event
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors - You are free to: - Share - copy and redistribute the material in any medium or format for any purpose, even commercially. - Adapt - remix, transform, and build upon the material for any purpose, even commercially. - You must: - Give appropriate credit - Provide a link to the license, and indicate if changes were made. - ShareAlike - You must distribute your contributions under the same license as the original. -...
From Be Slavery Free watchdog group. Hand-transcribed into Python data frame in my Enhanced Cacao Data Gathering notebook, and saved out as csv. Data is as presented in the graphical PDF scorecard document.
I encoded the colored bunny and egg values as numbers:
- 1 = "Leading the industry on policy"
- 2 = "Starting to implement good policies"
- 3 = "Needs more work on policy and implementation"
- 4 = "Needs to catch up with the industry"
- 0 = "Did not respond to survey; lacking in transparency"
Note: for companies that did not respond to the industry survey but were listed on the scorecard with a black egg or bunny, a single 0 was carried across all ratings columns in my manual transcription of the data set.
The SubsidiaryIndustry column would probably be best parsed out (the delimiter is '-') and split into separate columns (e.g. "Subsidiary" and "Industry"), or even one-hot encoded (e.g. either generic "Subsidiary1", "Subsidiary2", etc., or specific "Chocolate", "Trader", etc., with binary values).
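A rough pandas sketch of that parsing; the CSV name is hypothetical, the column name matches the description above.
import pandas as pd

df = pd.read_csv("be_slavery_free_chocolate.csv")  # hypothetical file name

# Split the '-'-delimited SubsidiaryIndustry values into one-hot indicator columns.
dummies = df["SubsidiaryIndustry"].str.get_dummies(sep="-")
df = pd.concat([df, dummies], axis=1)
print(dummies.columns.tolist())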
Imported for use cross referencing with the Flavors of Cacao datasets scraped by the import script or analyzed by various cacao analytics exercises.
The 2021 scorecard is also available (though I haven't personally transcribed it yet).
The latest version adds two more csv files, transformations of the first file provided here.
- be_slavery_free_chocolate_normalized.csv takes the scale I transcribed from the original scorecard (1-6 expressed in green through red, plus 0 for missing values) and refactors and normalizes it to the 0 to 1 scale used, for example, by the stars() plot in R.
- be_slavery_free_chocolate_normalized_split.csv takes the normalized set and splits SubsidiaryIndustry into a separate row for each "-"-delimited value. I also manually went through the resulting data frame to remove duplicates, e.g. for traders/manufacturers/processors.
Both of these data sets can more easily be used with the stars() plotting function and with other older functions that require normalized data. For the stars() function specifically, be sure to use one of the text columns as row names (with row.names(df) = df$Company, for example), since the function implicitly expects to use the row name as the star plot label in a faceted display.
Photo by Ákos Helgert: https://www.pexels.com/photo/yellow-cacao-fruit-8900912/
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive overview of various Agentic AI (autonomous AI) applications across multiple industries in 2025. It contains detailed records of how AI is being utilized to automate complex tasks, improve efficiency, and generate measurable outcomes. The dataset is designed to help researchers, data scientists, and businesses understand the current state and potential of Agentic AI in different sectors.
Dataset Features:
Industry: The sector where Agentic AI is applied (e.g., Healthcare, Finance, Manufacturing).
Application Area: The specific task or function performed by the AI agent (e.g., Fraud Detection, Predictive Maintenance).
AI Agent Name: The name of the AI system or agent deployed (e.g., HealthAI Monitor, FinSecure Agent).
Task Description: A brief description of the AI's function or role.
Technology Stack: The technologies powering the AI (e.g., Machine Learning, NLP, Computer Vision).
Outcome Metrics: The measurable impact of the AI deployment (e.g., 30% reduction in ER visits).
Deployment Year: The year the AI system was deployed (ranging from 2023 to 2025).
Geographical Region: The region where the AI application is implemented (e.g., North America, Asia, Europe).
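A quick first look at the table could group over these columns; the file name below is an assumption, the column names follow the feature list above.
import pandas as pd

df = pd.read_csv("agentic_ai_applications_2025.csv")  # hypothetical file name

# How many recorded applications per industry and deployment year?
print(df.groupby(["Industry", "Deployment Year"]).size().unstack(fill_value=0))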
https://creativecommons.org/publicdomain/zero/1.0/
During the 2019 Australian election I noticed that almost everything I was seeing on Twitter was unusually left-wing. So I decided to scrape some data and investigate. Unfortunately my sentiment analysis has so far been too inaccurate to come to any useful conclusions. I decided to share the data so that others may be able to help with the sentiment or any other interesting analysis.
Over 180,000 tweets collected using Twitter API keyword search between 10.05.2019 and 20.05.2019. Columns are as follows:
The latitude and longitude of user_location is also available in location_geocode.csv. This information was retrieved using the Google Geocode API.
Thanks to Twitter for providing the free API.
There are a lot of interesting things that could be investigated with this data. Primarily I was interested to do sentiment analysis, before and after the election results were known, to determine whether Twitter users are indeed a left-leaning bunch. Did the tweets become more negative as the results were known?
Other ideas for investigation include:
Take into account retweets and favourites to weight overall sentiment analysis.
Which parts of the world are interested in (i.e. tweet about) the Australian elections, apart from Australia?
How do the users who tweet about this sort of thing tend to describe themselves?
Is there a correlation between when the user joined Twitter and their political views (this assumes the sentiment analysis is already working well)?
Predict gender from username/screen name and segment tweet count and sentiment by gender
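For the sentiment idea, here is a minimal starting point. The CSV name, column names, and election cutoff date of 18 May 2019 are assumptions on my part, and TextBlob is just one of several off-the-shelf sentiment options.
import pandas as pd
from textblob import TextBlob

tweets = pd.read_csv("auspol2019.csv")  # hypothetical file name; adjust to this dataset

# Polarity in [-1, 1]: negative to positive.
tweets["polarity"] = tweets["full_text"].astype(str).map(
    lambda t: TextBlob(t).sentiment.polarity
)

# Compare average sentiment before and after the results were known.
tweets["created_at"] = pd.to_datetime(tweets["created_at"], utc=True)
cutoff = pd.Timestamp("2019-05-18", tz="UTC")
print(tweets.groupby(tweets["created_at"] > cutoff)["polarity"].mean())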
The Gender Statistics database is a comprehensive source for the latest sex-disaggregated data and gender statistics covering demography, education, health, access to economic opportunities, public life and decision-making, and agency.
The data is split into several files, with the main one being Data.csv. The Data.csv contains all the variables of interest in this dataset, while the others are lists of references and general nation-by-nation information.
Data.csv contains the following fields:
I couldn't find any metadata for these, and I'm not qualified to guess at what each of the variables means. I'll list the variables for each file, and if anyone has any suggestions (or, even better, actual knowledge/citations) as to what they mean, please leave a note in the comments and I'll add your info to the data description.
Country-Series.csv
Country.csv
FootNote.csv
Series-Time.csv
Series.csv
This dataset was downloaded from The World Bank's Open Data project. The summary of the Terms of Use of this data is as follows:
You are free to copy, distribute, adapt, display or include the data in other products for commercial and noncommercial purposes at no cost subject to certain limitations summarized below.
You must include attribution for the data you use in the manner indicated in the metadata included with the data.
You must not claim or imply that The World Bank endorses your use of the data, or use The World Bank’s logo(s) or trademark(s) in conjunction with such use.
Other parties may have ownership interests in some of the materials contained on The World Bank Web site. For example, we maintain a list of some specific data within the Datasets that you may not redistribute or reuse without first contacting the original content provider, as well as information regarding how to contact the original content provider. Before incorporating any data in other products, please check the list: Terms of use: Restricted Data.
-- [ed. note: this last is not applicable to the Gender Statistics database]
The World Bank makes no warranties with respect to the data and you agree The World Bank shall not be liable to you in connection with your use of the data.
This is only a summary of the Terms of Use for Datasets Listed in The World Bank Data Catalogue. Please read the actual agreement that controls your use of the Datasets, which is available here: Terms of use for datasets. Also see World Bank Terms and Conditions.