https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains information about the world's biggest companies.
Among them you can find companies founded in the US, the UK, Europe, Asia, South America, South Africa, and Australia.
The dataset contains information about the year each company was founded, its revenue and net income for the years 2018-2020, and its industry.
I have included 2 CSV files: the raw CSV if you want to practice cleaning the data, and the clean CSV ready to be analyzed.
The third dataset includes the names of all the companies from the previous datasets and 2 additional columns: number of employees and name of the founder.
In addition, there is a tesla.csv file containing share prices for Tesla.
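For a quick start, here is a minimal pandas sketch for loading the files. Only tesla.csv is named above, so the companies filename and the Tesla date column used below are assumptions; adjust them to the actual files.

```python
# Minimal sketch, assuming hypothetical filenames/columns where not stated above.
import pandas as pd

companies = pd.read_csv("companies_clean.csv")          # assumed name of the clean CSV
tesla = pd.read_csv("tesla.csv", parse_dates=["Date"])  # "Date" column is an assumption

print(companies.head())                          # inspect founding year, revenue, net income, industry
print(tesla["Date"].min(), tesla["Date"].max())  # date range covered by the share prices
```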
https://crawlfeeds.com/privacy_policy
Explore the "Largest News Articles Dataset from CNBC," a comprehensive collection of news articles published by CNBC, one of the leading global news sources for business, finance, and current affairs.
This dataset includes thousands of articles covering a wide range of topics, such as financial markets, economic trends, technology, politics, health, and more. Each article in the dataset provides detailed information, including headlines, publication dates, authors, article content, and categories, offering valuable insights for researchers, data analysts, and media professionals.
Key Features:
Whether you're conducting research on financial markets, analyzing media trends, or developing new content, the "Largest News Articles Dataset from CNBC" is an invaluable resource that provides detailed insights and comprehensive coverage of the latest news.
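As a rough illustration of how such a dump might be explored, the sketch below filters articles by category and date with pandas; the file name and the column names ("headline", "published_at", "category") are assumptions based on the fields described above, not a documented schema.

```python
# Illustrative sketch only; the file name and column names are assumed, not documented.
import pandas as pd

articles = pd.read_csv("cnbc_articles.csv", parse_dates=["published_at"])

# Keep finance stories from a single year and show the most recent headlines.
finance = articles[
    (articles["category"].str.contains("finance", case=False, na=False))
    & (articles["published_at"].dt.year == 2023)
]
print(finance.sort_values("published_at", ascending=False)[["headline", "published_at"]].head())
```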
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains detailed, state-wise Covid-19 data for the whole world from 2/24/2020 to 6/29/2021. It can be used to analyze Covid-19 conditions around the world and is great for exploratory data analysis. There are 99k rows to work with.
If you find this dataset useful, please consider upvoting ✊
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All cities with a population > 1000 or that are seats of administrative divisions (ca. 80,000 entries). Sources and contributions: GeoNames aggregates over a hundred different data sources; GeoNames Ambassadors help in many countries; a wiki allows anyone to view the data and quickly fix errors and add missing places; the costs of running GeoNames are covered by donations and sponsoring. Enrichment: country name added.
https://creativecommons.org/publicdomain/zero/1.0/
In this dataset you can find hundreds of thousands of the largest cities in the world, with information about their latitude, longitude, timezone, location, and more.
This data comes from https://data.world/fiftin/cities/workspace/file?filename=RU.txt.
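Since the table carries latitude and longitude for each city, a common first exercise is computing great-circle distances. Below is a small haversine sketch; the file name and the "latitude"/"longitude" column names are assumptions.

```python
# Haversine distance sketch; the file name and column names are assumed.
import math
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

cities = pd.read_csv("cities.csv")  # hypothetical export of the dataset
a, b = cities.iloc[0], cities.iloc[1]
print(haversine_km(a["latitude"], a["longitude"], b["latitude"], b["longitude"]))
```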
This data set provides a list of the three largest glaciers and glacier complexes in each of the 19 glacial regions of the world as defined by the Global Terrestrial Network for Glaciers. The data are provided in shapefile format with an outline for each of the largest ice bodies along with a number of attributes such as area in km2.
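Since the outlines ship as a shapefile with area attributes, they can be read with geopandas; the sketch below is an example under assumptions (the file name and the "region"/"area_km2" attribute names may differ in the distributed attribute table).

```python
# Sketch for reading the shapefile; attribute names below are assumptions.
import geopandas as gpd

glaciers = gpd.read_file("largest_glaciers.shp")   # hypothetical file name
print(glaciers.columns.tolist())                   # inspect the real attribute names first
print(glaciers.sort_values("area_km2", ascending=False)[["region", "area_km2"]].head(10))
```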
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.
Even more than with other data sets that Kaggle has featured, there’s a huge amount of data cleaning and preparation that goes into putting together a long-time study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the visit time impacted measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.
Given this complexity, a range of organizations collate climate trends data. The three most cited land and ocean temperature data sets are NOAA's MLOST, NASA's GISTEMP and the UK's HadCRUT.
We have repackaged the data from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example, by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.
In this dataset, we have included several files:
Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv):
Other files include:
The raw data comes from the Berkeley Earth data page.
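As a quick example of slicing the packaged data, the sketch below computes annual mean land temperatures from GlobalTemperatures.csv; the "dt" and "LandAverageTemperature" column names are assumptions about the packaged file and should be checked against your copy.

```python
# Annual mean land temperature sketch; column names are assumptions.
import pandas as pd

temps = pd.read_csv("GlobalTemperatures.csv", parse_dates=["dt"])
annual = temps.groupby(temps["dt"].dt.year)["LandAverageTemperature"].mean()
print(annual.tail(10))  # most recent ten annual means
```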
https://creativecommons.org/publicdomain/zero/1.0/
All mosques from around the world, belonging to any Islamic school or branch, that can accommodate at least 15,000 worshippers in all available places of prayer such as prayer halls (musala), courtyards (ṣaḥn) and porticoes (riwāq), listed by available capacity. All the mosques in this list are congregational mosques: a type of mosque that hosts the Friday prayer (ṣalāt al-jumuʿa) in congregation (jamāʿa).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.
Key Features
Country: Name of the country.
Density (P/Km2): Population density measured in persons per square kilometer.
Abbreviation: Abbreviation or code representing the country.
Agricultural Land (%): Percentage of land area used for agricultural purposes.
Land Area (Km2): Total land area of the country in square kilometers.
Armed Forces Size: Size of the armed forces in the country.
Birth Rate: Number of births per 1,000 population per year.
Calling Code: International calling code for the country.
Capital/Major City: Name of the capital or major city.
CO2 Emissions: Carbon dioxide emissions in tons.
CPI: Consumer Price Index, a measure of inflation and purchasing power.
CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
Currency_Code: Currency code used in the country.
Fertility Rate: Average number of children born to a woman during her lifetime.
Forested Area (%): Percentage of land area covered by forests.
Gasoline_Price: Price of gasoline per liter in local currency.
GDP: Gross Domestic Product, the total value of goods and services produced in the country.
Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
Largest City: Name of the country's largest city.
Life Expectancy: Average number of years a newborn is expected to live.
Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
Minimum Wage: Minimum wage level in local currency.
Official Language: Official language(s) spoken in the country.
Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
Physicians per Thousand: Number of physicians per thousand people.
Population: Total population of the country.
Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
Tax Revenue (%): Tax revenue as a percentage of GDP.
Total Tax Rate: Overall tax burden as a percentage of commercial profits.
Unemployment Rate: Percentage of the labor force that is unemployed.
Urban Population: Percentage of the population living in urban areas.
Latitude: Latitude coordinate of the country's location.
Longitude: Longitude coordinate of the country's location.
Potential Use Cases
Analyze population density and land area to study spatial distribution patterns.
Investigate the relationship between agricultural land and food security.
Examine carbon dioxide emissions and their impact on climate change.
Explore correlations between economic indicators such as GDP and various socio-economic factors (see the pandas sketch after this list).
Investigate educational enrollment rates and their implications for human capital development.
Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
Study labor market dynamics through indicators such as labor force participation and unemployment rates.
Investigate the role of taxation and its impact on economic development.
Explore urbanization trends and their social and environmental consequences.
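A minimal pandas sketch of the correlation use case above is shown below; the file name and the exact column headers ("GDP", "Life Expectancy", "Infant Mortality", "Unemployment Rate") are assumptions and may contain currency symbols or percent signs that need cleaning first.

```python
# Correlation sketch; the file name and column headers are assumptions.
import pandas as pd

countries = pd.read_csv("world_countries.csv")
cols = ["GDP", "Life Expectancy", "Infant Mortality", "Unemployment Rate"]

# Strip $ , % formatting if present; unparseable cells become NaN.
numeric = (
    countries[cols]
    .replace(r"[\$,%]", "", regex=True)
    .apply(pd.to_numeric, errors="coerce")
)
print(numeric.corr(method="pearson"))
```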
//// 🌍 Avanteer Employee Data ////
The Largest Dataset of Active Global Profiles: 1B+ Records | Updated Daily | Built for Scale & Accuracy
Avanteer’s Employee Data offers unparalleled access to the world’s most comprehensive dataset of active professional profiles. Designed for companies building data-driven products or workflows, this resource supports recruitment, lead generation, enrichment, and investment intelligence — with unmatched scale and update frequency.
//// 🔧 What You Get ////
1B+ active profiles across industries, roles, and geographies
Work history, education history, languages, skills and multiple additional datapoints.
AI-enriched datapoints include: gender, age, normalized seniority, normalized department, normalized skillset, and MBTI assessment.
Daily updates, with change-tracking fields to capture job changes, promotions, and new entries.
Flexible delivery via API, S3, or flat file.
Choice of formats: raw, cleaned, or AI-enriched.
Built-in compliance aligned with GDPR and CCPA.
//// 💡 Key Use Cases ////
✅ Smarter Talent Acquisition Identify, enrich, and engage high-potential candidates using up-to-date global profiles.
✅ B2B Lead Generation at Scale Build prospecting lists with confidence using job-related and firmographic filters to target decision-makers across verticals.
✅ Data Enrichment for SaaS & Platforms Supercharge ATS, CRMs, or HR tech products by syncing enriched, structured employee data through real-time or batch delivery.
✅ Investor & Market Intelligence Analyze team structures, hiring trends, and senior leadership signals to discover early-stage investment opportunities or evaluate portfolio companies.
//// 🧰 Built for Top-Tier Teams Who Move Fast ////
Zero duplicates, by design
<300ms API response time
99.99% guaranteed API uptime
Onboarding support including data samples, test credits, and consultations
Advanced data quality checks
//// ✅ Why Companies Choose Avanteer ////
➔ The largest daily-updated dataset of global professional profiles
➔ Trusted by sales, HR, and data teams building at enterprise scale
➔ Transparent, compliant data collection with opt-out infrastructure baked in
➔ Dedicated support with fast onboarding and hands-on implementation help
////////////////////////////////
Empower your team with reliable, current, and scalable employee data — all from a single source.
Business-critical Data Types: We offer access to robust datasets sourced from over 13M job ads daily. Track companies' growth, market focus, technological shifts, planned geographic expansion, and more:
- Identify new business opportunities
- Identify and forecast industry & technological trends
- Help identify the jobs, teams, and business units that have the highest impact on corporate goals
- Identify the most in-demand skills and qualifications for key positions
Fresh Datasets: We regularly update our datasets, assuring you access to the latest data and allowing for timely analysis of rapidly evolving markets & dynamic businesses.
Historical Datasets: We maintain historical datasets at your disposal, allowing for comprehensive, reliable, and statistically sound historical analysis, trend identification, and forecasting.
Easy Access and Retrieval: Our job listing datasets are available in industry-standard, convenient JSON and CSV formats. These structured formats make our datasets compatible with machine learning, artificial intelligence training, and similar applications. The historical data retrieval process is quick and reliable thanks to our robust, easy-to-implement API integration.
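As a rough sketch of working with a JSON delivery, the example below counts postings per company; the file name and the "company" key are assumptions, since the provider's actual schema is not documented here.

```python
# Illustrative only; the file name and JSON keys are assumptions.
import json
from collections import Counter

with open("job_listings.json", encoding="utf-8") as fh:
    listings = json.load(fh)  # assumes a top-level JSON array of posting objects

postings_per_company = Counter(item.get("company", "unknown") for item in listings)
print(postings_per_company.most_common(10))
```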
Datasets for investors: Investment firms and hedge funds use our datasets to better inform their investment decisions by gaining up-to-date, reliable insights into workforce growth, geographic expansion, market focus, technology shifts, and other factors of start-ups and established companies.
Datasets for businesses: Our datasets are used by retailers, manufacturers, real estate agents, and many other types of B2B & B2C businesses to stay ahead of the curve. They can gain insights into the competitive landscape, technology, and product adoption trends, as well as power their lead generation processes with data-driven decision-making.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dimensions is the largest database of research insight in the world. It represents the most comprehensive collection of linked data related to the global research and innovation ecosystem available in a single platform. Because Dimensions maps the entire research lifecycle, you can follow academic and industry research from early-stage funding, through to output, and on to social and economic impact. Businesses, governments, universities, investors, funders and researchers around the world use Dimensions to inform their research strategy and make evidence-based decisions on the R&D and innovation landscape. With Dimensions on Google BigQuery, you can seamlessly combine Dimensions data with your own private and external datasets; integrate with Business Intelligence and data visualization tools; and analyze billions of data points in seconds to create the actionable insights your organization needs.
Examples of usage: competitive intelligence; horizon-scanning and emerging trends; innovation landscape mapping; academic and industry partnerships and collaboration networks; Key Opinion Leader (KOL) identification; recruitment and talent; performance and benchmarking; tracking funding dollar flows and citation patterns; literature gap analysis; marketing and communication strategy; social and economic impact of research.
About the data: Dimensions is updated daily and constantly growing. It contains over 112m linked research publications, 1.3bn+ citations, 5.6m+ grants worth $1.7 trillion+ in funding, 41m+ patents, 600k+ clinical trials, 100k+ organizations, 65m+ disambiguated researchers and more. The data is normalized, linked, and ready for analysis. Dimensions is available as a subscription offering. For more information, please visit www.dimensions.ai/bigquery and a member of our team will be in touch shortly. If you would like to try our data for free, please select "try sample" to see our openly available Covid-19 data.
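A minimal sketch of querying Dimensions from Python via the official BigQuery client is shown below; the table path is a placeholder, and the actual dataset, table, and field names come from your Dimensions on BigQuery subscription.

```python
# Placeholder query; swap in the table and field names from your subscription.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud credentials/project
sql = """
    SELECT year, COUNT(*) AS n_publications
    FROM `your-project.your_dimensions_dataset.publications`  -- hypothetical table path
    WHERE year >= 2015
    GROUP BY year
    ORDER BY year
"""
print(client.query(sql).to_dataframe())
```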
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods related to CVE fix commit gathering. As a consequence of our improvements, we have been able to gather the largest programming-language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. Our dataset, containing 29,203 unique CVEs coming from 7,238 unique GitHub projects, is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 35,276 unique commits stored as SQL and 39,931 patch commit files that fixed those vulnerabilities (some patch files could not be saved as SQL due to several technical reasons). Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used the NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources of our pipeline.
We release to the community a 16GB PostgreSQL database that contains information on CVEs up to 2024-09-26, the CWEs of each CVE, the files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we also make our dataset collection tool available to the community.
The cvedataset-patches.zip file contains the fix patches, and postgrescvedumper.sql.zip contains a PostgreSQL dump of the fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, methods changed, etc.
The MoreFixes data-storage strategy is based on CVEFixes to store CVE fix commits from open-source repositories, and uses a modified version of Prospector (part of Project KB from SAP) as a module to detect the fix commits of a CVE. Our full methodology is presented in the paper titled "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", which will be published at the PROMISE conference (2024).
For more information about usage and sample queries, visit the GitHub repository: https://github.com/JafarAkhondali/Morefixes
If you are using this dataset, please be aware that the repositories we mined carry different licenses and you are responsible for handling any licensing issues. The same also applies to CVEFixes.
This product uses the NVD API but is not endorsed or certified by the NVD.
This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).
To restore the dataset, you can use the docker-compose file available in the GitHub repository. Default database credentials after restoring the dump:
POSTGRES_USER=postgrescvedumper POSTGRES_DB=postgrescvedumper POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
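A small sketch for connecting to the restored dump and listing its tables is shown below; the host and port assume a local docker-compose setup and may need to be adjusted.

```python
# Connection sketch; host/port are assumptions for a local docker-compose setup.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgrescvedumper",
    user="postgrescvedumper",
    password="a42a18537d74c3b7e584c769152c3d",
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' ORDER BY table_name"
    )
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()
```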
Please use this for citation:
title={MoreFixes: A large-scale dataset of CVE fix commits mined through enhanced repository discovery},
author={Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga},
booktitle={Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering},
pages={42--51},
year={2024}
}
The British Geological Survey has one of the largest databases in the world on the production and trade of minerals. The dataset contains annual production statistics by mass for more than 70 mineral commodities covering the majority of economically important and internationally-traded minerals, metals and mineral-based materials. For each commodity the annual production statistics are recorded for individual countries, grouped by continent. Import and export statistics are also available for years up to 2002. Maintenance of the database is funded by the Science Budget and output is used by government, private industry and others in support of policy, economic analysis and commercial strategy. As far as possible the production data are compiled from primary, official sources. Quality assurance is maintained by participation in such groups as the International Consultative Group on Non-ferrous Metal Statistics. Individual commodity and country tables are available for sale on request.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This shapefile contains the major and largest basins of the world.
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of *******; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive. Database Management Systems As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world's growing focus on big data and data analytics, knowledge of SQL programming languages has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.
Cross-national research on the causes and consequences of income inequality has been hindered by the limitations of existing inequality datasets: greater coverage across countries and over time is available from these sources only at the cost of significantly reduced comparability across observations. The goal of the Standardized World Income Inequality Database (SWIID) is to overcome these limitations. A custom missing-data algorithm was used to standardize the United Nations University's World Income Inequality Database and data from other sources; data collected by the Luxembourg Income Study served as the standard. The SWIID provides comparable Gini indices of gross and net income inequality for 192 countries for as many years as possible from 1960 to the present along with estimates of uncertainty in these statistics. By maximizing comparability for the largest possible sample of countries and years, the SWIID is better suited to broadly cross-national research on income inequality than previously available sources: it offers coverage double that of the next largest income inequality dataset, and its record of comparability is three to eight times better than those of alternate datasets.
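For readers unfamiliar with the statistic the SWIID standardizes, the sketch below computes a Gini coefficient from a toy income vector; it is purely illustrative and does not read the SWIID files themselves.

```python
# Illustrative Gini computation on toy data (not on the SWIID files).
import numpy as np

def gini(incomes):
    """Gini coefficient for a 1-D array of non-negative incomes."""
    x = np.sort(np.asarray(incomes, dtype=float))
    n = x.size
    # Standard formula for sorted data with 1-based ranks i:
    # G = (2 * sum(i * x_i)) / (n * sum(x)) - (n + 1) / n
    return (2 * np.sum(np.arange(1, n + 1) * x)) / (n * x.sum()) - (n + 1) / n

print(round(gini([10_000, 20_000, 30_000, 40_000, 150_000]), 3))
```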
https://dataintelo.com/privacy-and-policy
The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.
One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.
Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.
The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.
As the demand for AI applications continues to grow, the role of AI Data Resource Services becomes increasingly vital. These services provide the necessary infrastructure and tools to manage, curate, and distribute datasets efficiently. By leveraging AI Data Resource Services, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. The service acts as a bridge between raw data and AI applications, streamlining the process of data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.
Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.
The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.
Image data is critical for computer vision application
Licence Ouverte / Open Licence 1.0: https://www.etalab.gouv.fr/wp-content/uploads/2014/05/Open_Licence.pdf
License information was derived automatically
There has been a marked revival of interest in the study of the distribution of top incomes using tax data. Beginning with the research by Thomas Piketty (2001, 2003) of the long-run distribution of top incomes in France, a succession of studies has constructed top income share time series over the long-run for more than twenty countries to date.
These projects have generated a large volume of data, which are intended as a research resource for further analysis. The World Top Incomes Database aims to provide convenient online access to all the existing series. This is an ongoing effort, and we will progressively update the database with new observations, as authors extend the series forwards and backwards. Despite the database's name, we will also add information on the distribution of earnings and the distribution of wealth. Around forty-five further countries are under study and will be incorporated at some point (see Work in Progress).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The input tsunami hazard data are based on the global hazard analysis of Davies et al. (2017), developed jointly by Geoscience Australia and NGI and formatted for use in ThinkHazard!. The data serve as input for the Global Tsunami Model (GTM, http://globaltsunamimodel.org/). The global tsunami dataset contains maximum inundation heights, calculated at offshore hazard points and projected to the shoreline by simple interpolation. Tsunami Maximum Inundation Height (MIH) is defined as the largest elevation the tsunami reaches above still water level, consistent with IOC-UNESCO terminology. The MIH hazard data are provided at the global level for return periods of 10, 50, 100, 200, 500, 1000, and 2500 years. Values above and below the extreme values are referred to as >=20 m and
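As a quick aid to reading the hazard layers, the sketch below converts the listed return periods into approximate annual exceedance probabilities (p ≈ 1/T); this conversion is standard practice and not something specific to this dataset.

```python
# Return period (years) -> approximate annual exceedance probability (1/T).
return_periods_years = [10, 50, 100, 200, 500, 1000, 2500]
for t in return_periods_years:
    print(f"{t:>5}-year return period -> annual exceedance probability ≈ {1 / t:.4f}")
```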