Nowadays web portals play an essential role in searching and retrieving information in the several fields of knowledge: they are ever more technologically advanced and designed for supporting the storage of a huge amount of information in natural language originating from the queries launched by users worldwide. A good example is given by the WorldWideScience search engine:
The database is available at . It is based on a similar gateway, Science.gov, which is the major path to U.S. government science information, as it pulls together Web-based resources from various agencies. The information in the database is intended to be of high quality and authority, as well as the most current available from the participating countries in the Alliance, so users will find that the results will be more refined than those from a general search of Google. It covers the fields of medicine, agriculture, the environment, and energy, as well as basic sciences. Most of the information may be obtained free of charge (the database itself may be used free of charge) and is considered "open domain." As of this writing, there are about 60 countries participating in WorldWideScience.org, providing access to 50+ databases and information portals. Not all content is in English. (Bronson, 2009)
Given this scenario, we focused on building a corpus constituted by the query logs registered by the GreyGuide: Repository and Portal to Good Practices and Resources in Grey Literature and received by the WorldWideScience.org (The Global Science Gateway) portal: the aim is to retrieve information related to social media, which as of today represent a considerable source of data more and more widely used for research ends. This project includes eight months of query logs registered between July 2017 and February 2018 for a total of 445,827 queries.
The analysis mainly concentrates on the semantics of the queries received from the portal clients: it is a process of information retrieval from a rich digital catalogue whose language is dynamic, is evolving and follows – as well as reflects – the cultural changes of our modern society.
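As a rough illustration of the kind of term-level analysis described above, the following sketch counts how often social media terms occur in a list of queries. The sample queries, the term watch list, and the function name are all hypothetical; the actual log format and the authors' method are not described here.

```python
from collections import Counter
import re

# Hypothetical watch list of social media terms (not taken from the dataset).
SOCIAL_MEDIA_TERMS = {"twitter", "facebook", "instagram", "youtube"}


def count_social_media_terms(queries):
    """Count occurrences of watch-list terms across raw query strings."""
    counts = Counter()
    for query in queries:
        # Lowercase the query and keep runs of letters and digits as tokens.
        tokens = re.findall(r"[a-z0-9]+", query.lower())
        counts.update(t for t in tokens if t in SOCIAL_MEDIA_TERMS)
    return counts


# Hypothetical sample queries standing in for real log lines.
sample_queries = [
    "Twitter sentiment analysis",
    "facebook privacy and twitter bots",
    "solar cell efficiency",
]
term_counts = count_social_media_terms(sample_queries)
print(term_counts.most_common())
```

A real analysis of query semantics would likely need more than exact token matching (for example, handling multi-word names or misspellings), so this is only a starting point.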
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Ever wondered what people are saying about certain countries? Whether it's in a positive/negative light? What are the most commonly used phrases/words to describe the country? In this dataset I present tweets where a certain country gets mentioned in the hashtags (e.g. #HongKong, #NewZealand). It contains around 150 countries in the world. I've added an additional field called polarity which has the sentiment computed from the text field. Feel free to explore! Feedback is much appreciated!
Each row represents a tweet. Creation Dates of Tweets Range from 12/07/2020 to 25/07/2020. Will update on a Monthly cadence.
- The Country can be derived from the file_name field. (This field is very Tableau friendly when it comes to plotting maps.)
- The Date at which the tweet was created can be got from the created_at field.
- The Search Query used to query the Twitter Search Engine can be got from the search_query field.
- The Tweet Full Text can be got from the text field.
- The Sentiment can be got from the polarity field. (I've used the VADER Model from NLTK to compute this.)
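The fields listed above could be used roughly as follows to average sentiment per country. This is a minimal sketch under stated assumptions: it assumes file_name is the country name plus a file extension, that polarity is a VADER-style compound score in [-1, 1], and it uses made-up rows. The 0.05 cut-off is the threshold commonly used with VADER compound scores, not something stated in this dataset.

```python
from collections import defaultdict
from pathlib import Path
from statistics import mean


def label_polarity(score, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def mean_polarity_by_country(rows):
    """Average the polarity field per country derived from file_name."""
    scores = defaultdict(list)
    for row in rows:
        # Assumption: file_name looks like "<Country>.<ext>".
        country = Path(row["file_name"]).stem
        scores[country].append(float(row["polarity"]))
    return {country: mean(values) for country, values in scores.items()}


# Hypothetical rows; the real file would be read with csv.DictReader.
rows = [
    {"file_name": "HongKong.csv", "polarity": "0.6"},
    {"file_name": "HongKong.csv", "polarity": "-0.2"},
    {"file_name": "NewZealand.csv", "polarity": "0.0"},
]
averages = mean_polarity_by_country(rows)
labels = {country: label_polarity(score) for country, score in averages.items()}
print(labels)
```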
There may be slight duplication of tweet IDs before 22/07/2020. I have since fixed this bug.
Thanks to the tweepy package for making the data extraction via Twitter API so easy.
Feel free to check out my blog if you want to learn how I built the data lake via AWS or for other data shenanigans.
Here's an App I built using a live version of this data.
This dataset features over 1,000,000 high-quality images of cars, sourced globally from photographers, enthusiasts, and automotive content creators. Optimized for AI and machine learning applications, it provides richly annotated and visually diverse automotive imagery suitable for a wide array of use cases in mobility, computer vision, and retail.
Key Features:
1. Comprehensive Metadata: each image includes full EXIF data and detailed annotations such as car make, model, year, body type, view angle (front, rear, side, interior), and condition (e.g., showroom, on-road, vintage, damaged). Ideal for training in classification, detection, OCR for license plates, and damage assessment.
2. Unique Sourcing Capabilities: the dataset is built from images submitted through a proprietary gamified photography platform with auto-themed competitions. Custom datasets can be delivered within 72 hours targeting specific brands, regions, lighting conditions, or functional contexts (e.g., race cars, commercial vehicles, taxis).
3. Global Diversity: contributors from over 100 countries ensure broad coverage of car types, manufacturing regions, driving orientations, and environmental settings, from luxury sedans in urban Europe to pickups in rural America and tuk-tuks in Southeast Asia.
4. High-Quality Imagery: images range from standard to ultra-HD and include professional-grade automotive photography, dealership shots, roadside captures, and street-level scenes. A mix of static and dynamic compositions supports diverse model training.
5. Popularity Scores: each image includes a popularity score derived from GuruShots competition performance, offering valuable signals for consumer appeal, aesthetic evaluation, and trend modeling.
6. AI-Ready Design: this dataset is structured for use in applications like vehicle detection, make/model recognition, automated insurance assessment, smart parking systems, and visual search. It's compatible with all major ML frameworks and edge-device deployments.
7. Licensing & Compliance: fully compliant with privacy and automotive content use standards, offering transparent and flexible licensing for commercial and academic use.
Use Cases: 1. Training AI for vehicle recognition in smart city, surveillance, and autonomous driving systems. 2. Powering car search engines, automotive e-commerce platforms, and dealership inventory tools. 3. Supporting damage detection, condition grading, and automated insurance workflows. 4. Enhancing mobility research, traffic analytics, and vision-based safety systems.
This dataset delivers a large-scale, high-fidelity foundation for AI innovation in transportation, automotive tech, and intelligent infrastructure. Custom dataset curation and region-specific filters are available. Contact us to learn more!
https://crawlfeeds.com/privacy_policy
Unlock one of the most comprehensive movie datasets available—4.5 million structured IMDb movie records, extracted and enriched for data science, machine learning, and entertainment research.
This dataset includes a vast collection of global movie metadata, including details on title, release year, genre, country, language, runtime, cast, directors, IMDb ratings, reviews, and synopsis. Whether you're building a recommendation engine, benchmarking trends, or training AI models, this dataset is designed to give you deep and wide access to cinematic data across decades and continents.
Perfect for use in film analytics, OTT platforms, review sentiment analysis, knowledge graphs, and LLM fine-tuning, the dataset is cleaned, normalized, and exportable in multiple formats.
Genres: Drama, Comedy, Horror, Action, Sci-Fi, Documentary, and more
Train LLMs or chatbots on cinematic language and metadata
Build or enrich movie recommendation engines
Run cross-lingual or multi-region film analytics
Benchmark genre popularity across time periods
Power academic studies or entertainment dashboards
Feed into knowledge graphs, search engines, or NLP pipelines
This dataset aims to map the existing Bike Bus initiatives worldwide, identify their diversity, and understand their challenges and motivations. The dataset is divided into two files: the BB_database, which maps the name, location, and contact (when available) of the Bike Bus initiatives that we could find, and the BB_survey, which contains the data resulting from an online survey aimed at Bike Bus organizers identified through the BB_database. The data from the survey includes information on the name of the Bike Bus initiatives, location, organizing actors, starting year, contact details, route characteristics (travel time, distance, frequency, and space to cycle), participants (number of children and adults, age, and gender), and management (child supervision, goals, barriers, motivations and challenges). Most of the initiatives are in Catalonia and Spain, yet the database includes Bike Buses worldwide. Data were derived from an unpublished master's dissertation: Martín, S. (2022). BiciBús in Catalonia: Rutes and characteristics of the bike-train movement. Institute of Environmental Science and Technology at the Universitat Autònoma de Barcelona.
Description of methods used for collection-generation of data: The following data comes from a mapping exercise of Bike Buses and an online survey aimed at Bike Bus organizers. It builds on the online survey by Martín (2022) for her master's thesis. In this first phase of the project, led by Martín (2022), respondents were approached via social media and email, identified through an archival analysis in Google, Facebook, Instagram, and Twitter (now X) using the keyword "Bicibus". The rest of the respondents were approached using the snowball sampling method. The scope of this first survey was Spain, and it was available in Spanish and Catalan. The survey received 19 responses during this phase. The second phase of the data collection expanded the scope of the survey internationally.
The questions were translated into English, and the literature review was extended to the search engines Google Scholar and Scopus, using the keywords "Bike Bus" and "Bike Train". By the end of the data collection, the survey had received 143 responses.
This dataset features over 340,000 high-quality images of jewelry sourced from photographers worldwide. Designed to support AI and machine learning applications, it provides a richly detailed and carefully annotated collection of jewelry imagery across styles, materials, and contexts.
Key Features:
1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Each image is pre-annotated with object and scene detection metadata, including jewelry type, material, and context, ideal for tasks like object detection, style classification, and fine-grained visual analysis. Popularity metrics, derived from engagement on our proprietary platform, are also included.
2. Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions focused on jewelry photography ensure high-quality, well-lit, and visually appealing submissions. Custom datasets can be sourced on-demand within 72 hours to meet specific requirements such as jewelry category (rings, necklaces, bracelets, etc.), material type, or presentation style (worn vs. product shots).
3. Global Diversity: photographs have been submitted by contributors in over 100 countries, offering an extensive range of cultural styles, design traditions, and jewelry aesthetics. The dataset includes handcrafted and luxury items, traditional and contemporary pieces, and representations across diverse ethnic and regional fashions.
4. High-Quality Imagery: the dataset includes high-resolution images suitable for detailed product analysis. Both studio-lit commercial shots and lifestyle/editorial photography are included, allowing models to learn from various presentation styles and settings.
5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This metric offers insight into aesthetic appeal and global consumer preferences, aiding AI models focused on trend analysis or user engagement.
6. AI-Ready Design: this dataset is optimized for training AI in jewelry classification, attribute tagging, visual search, and recommendation systems. It integrates easily into retail AI workflows and supports model development for e-commerce and fashion platforms.
7. Licensing & Compliance: the dataset complies fully with data privacy and IP standards, offering transparent licensing for commercial and academic purposes.
Use Cases: 1. Training AI for visual search and recommendation engines in jewelry e-commerce. 2. Enhancing product recognition, classification, and tagging systems. 3. Powering AR/VR applications for virtual try-ons and 3D visualization. 4. Supporting fashion analytics, trend forecasting, and cultural design research.
This dataset offers a diverse, high-quality resource for training AI and ML models in the jewelry and fashion space. Customizations are available to meet specific product or market needs. Contact us to learn more!
Google's energy consumption has increased over the last few years, reaching 25.9 terawatt hours in 2023, up from 12.8 terawatt hours in 2019. The company has made efforts to make its data centers more efficient through customized high-performance servers, smart temperature and lighting controls, advanced cooling techniques, and machine learning.
Data centers and energy
Through its operations, Google pursues a more sustainable impact on the environment by creating efficient data centers that use less energy than the average, transitioning towards renewable energy, creating sustainable workplaces, and providing its users with the technological means towards a cleaner future for coming generations. Through its efficient data centers, Google has also managed to divert waste from its operations away from landfills.
Reducing Google's carbon footprint
Google's clean energy efforts are also related to its efforts to reduce its carbon footprint. Since committing to using 100 percent renewable energy, the company has met its targets largely through solar and wind energy power purchase agreements and by buying renewable power from utilities. Google is one of the largest corporate purchasers of renewable energy in the world.
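To put the two consumption figures quoted above in perspective, the short calculation below derives the total increase and the implied average annual growth rate. It is only an illustrative sketch: it assumes steady compound growth between 2019 and 2023, which the source does not claim, and the variable names are my own.

```python
# Figures quoted in the text (terawatt hours).
start_twh, end_twh = 12.8, 25.9
start_year, end_year = 2019, 2023

# Total relative increase over the whole period.
total_growth = end_twh / start_twh - 1

# Compound annual growth rate, assuming smooth year-over-year growth.
years = end_year - start_year
cagr = (end_twh / start_twh) ** (1 / years) - 1

print(f"Total increase: {total_growth:.1%}")
print(f"Implied annual growth: {cagr:.1%}")
```

Under that assumption, consumption roughly doubled over the four years (an increase of about 102 percent), which corresponds to growth of about 19 percent per year.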