Executive Summary: Artificial intelligence (AI) is a transformative technology that holds promise for tremendous societal and economic benefit. AI has the potential to revolutionize how we live, work, learn, discover, and communicate. AI research can further our national priorities, including increased economic prosperity, improved educational opportunities and quality of life, and enhanced national and homeland security. Because of these potential benefits, the U.S. government has invested in AI research for many years. Yet, as with any significant technology in which the Federal government has an interest, there are not only tremendous opportunities but also a number of considerations that must be taken into account in guiding the overall direction of Federally-funded R&D in AI.

On May 3, 2016, the Administration announced the formation of a new NSTC Subcommittee on Machine Learning and Artificial Intelligence to help coordinate Federal activity in AI. On June 15, 2016, this Subcommittee directed the Subcommittee on Networking and Information Technology Research and Development (NITRD) to create a National Artificial Intelligence Research and Development Strategic Plan. A NITRD Task Force on Artificial Intelligence was then formed to define the Federal strategic priorities for AI R&D, with particular attention to areas that industry is unlikely to address.

This National Artificial Intelligence R&D Strategic Plan establishes a set of objectives for Federally-funded AI research, both research occurring within the government and Federally-funded research occurring outside of government, such as in academia. The ultimate goal of this research is to produce new AI knowledge and technologies that provide a range of positive benefits to society, while minimizing the negative impacts. To achieve this goal, this AI R&D Strategic Plan identifies the following priorities for Federally-funded AI research:

Strategy 1: Make long-term investments in AI research. Prioritize investments in the next generation of AI that will drive discovery and insight and enable the United States to remain a world leader in AI.

Strategy 2: Develop effective methods for human-AI collaboration. Rather than replace humans, most AI systems will collaborate with humans to achieve optimal performance. Research is needed to create effective interactions between humans and AI systems.

Strategy 3: Understand and address the ethical, legal, and societal implications of AI. We expect AI technologies to behave according to the formal and informal norms to which we hold our fellow humans. Research is needed to understand the ethical, legal, and social implications of AI, and to develop methods for designing AI systems that align with ethical, legal, and societal goals.

Strategy 4: Ensure the safety and security of AI systems. Before AI systems are in widespread use, assurance is needed that the systems will operate safely and securely, in a controlled, well-defined, and well-understood manner. Further progress in research is needed to address the challenge of creating AI systems that are reliable, dependable, and trustworthy.

Strategy 5: Develop shared public datasets and environments for AI training and testing. The depth, quality, and accuracy of training datasets and resources significantly affect AI performance. Researchers need to develop high-quality datasets and environments and enable responsible access to high-quality datasets as well as to testing and training resources.

Strategy 6: Measure and evaluate AI technologies through standards and benchmarks. Essential to advancements in AI are standards, benchmarks, testbeds, and community engagement that guide and evaluate progress in AI. Additional research is needed to develop a broad spectrum of evaluative techniques.

Strategy 7: Better understand the national AI R&D workforce needs. Advances in AI will require a strong community of AI researchers. An improved understanding of current and future R&D workforce demands in AI is needed to help ensure that sufficient AI experts are available to address the strategic R&D areas outlined in this plan.

The AI R&D Strategic Plan closes with two recommendations:

Recommendation 1: Develop an AI R&D implementation framework to identify S&T opportunities and support effective coordination of AI R&D investments, consistent with Strategies 1-6 of this plan.

Recommendation 2: Study the national landscape for creating and sustaining a healthy AI R&D workforce, consistent with Strategy 7 of this plan.
In 2022, global total corporate investment in artificial intelligence (AI) reached almost 92 billion U.S. dollars, a slight decrease from the previous year. Yearly AI investment saw a slight, temporary downturn in 2018. Private investments account for the bulk of total corporate AI investment. AI investment has increased more than sixfold since 2016, staggering growth for any market and a testament to the importance of AI development around the world.
What is Artificial Intelligence (AI)?
Artificial intelligence, for decades the subject of people’s imaginations and the main plot of science fiction movies, is no longer fiction but commonplace in people’s daily lives, whether they realize it or not. AI refers to the ability of a computer or machine to imitate the capacities of the human brain, often learning from previous experience to understand and respond to language, decisions, and problems. These AI capabilities, such as computer vision and conversational interfaces, have become embedded throughout various industries’ standard business processes.
AI investment and startups
The global AI market, valued at 142.3 billion U.S. dollars as of 2023, continues to grow, driven by the influx of investments it receives. This rapidly growing market is expected to expand from billions to trillions of U.S. dollars in size in the coming years. From 2020 to 2022, global investment in startups, and in AI startups in particular, increased by five billion U.S. dollars, nearly doubling previous levels, with much of the capital coming from private U.S. companies. The most recent top-funded AI businesses are all machine learning and chatbot companies, focusing on human-machine interaction.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accompanying material for the paper "The anatomy of Green AI technologies: structure, evolution, and impact" (2025).
The Green AI Patent Dataset comprises 63 326 unique U.S. patents that intersect environmental (“green”) technologies with artificial‐intelligence components, spanning from 1976 to 2023. It was assembled by combining:
PatentsView (USPTO) – U.S. patents (snapshot of January 2025) labelled under Cooperative Patent Classification classes Y02 and Y04S for climate‐change mitigation/adaptation and smart‐grid technologies.
Artificial Intelligence Patent Dataset (AIPD, 2023 update – most recent) – USPTO’s machine-learning–validated classification of AI-related patents (predict50_any_ai = 1). Reference: Pairolero, N. et al., "The Artificial Intelligence Patent Dataset (AIPD) 2023 Update," USPTO Economic Working Paper 2024-4, USPTO (2024). Available at https://www.uspto.gov/sites/default/files/documents/oce-aipd-2023.pdf.
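As a rough illustration of how such an intersection could be reproduced from the two sources, the sketch below joins a hypothetical PatentsView CPC extract with the AIPD predictions on patent_id and keeps green, AI-flagged patents; the file names and exact column layouts are assumptions for illustration, not the files used to build this dataset.

```python
import pandas as pd

# Hypothetical extracts from the two sources described above.
cpc = pd.read_csv("cpc_current.csv", usecols=["patent_id", "cpc_subclass"], dtype=str)
aipd = pd.read_csv("aipd_2023.csv", usecols=["patent_id", "predict50_any_ai"], dtype=str)

# "Green" patents: labelled under CPC classes Y02 (climate-change mitigation/adaptation)
# or Y04S (smart grids).
is_green = cpc["cpc_subclass"].str.startswith("Y02") | cpc["cpc_subclass"].str.startswith("Y04S")
green = cpc[is_green]

# AI patents: flagged by the AIPD machine-learning classifier.
ai = aipd[aipd["predict50_any_ai"] == "1"]

# Intersection of the two sets, one row per unique patent.
green_ai = green.merge(ai, on="patent_id", how="inner").drop_duplicates(subset="patent_id")
print(f"{len(green_ai)} unique green-AI patents")
```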
Variable | Description | Completeness (non-null count) |
---|---|---|
patent_id | Unique USPTO patent identifier. | 63 326 |
cpc_subclass | Subclasses of "green" CPC taxonomy Y02 / Y04S. Refer to the USPTO's website for more details: https://www.uspto.gov/web/patents/classification/cpc/html/cpc-Y.html | 63 326 |
patent_date | Grant date of the patent (YYYY-MM-DD). | 63 326 |
patent_title | Title of the patent. | 63 326 |
assignee | Disambiguated assignee organization name. | 59 479 |
country | Disambiguated assignee country. | 59 155 |
forward_citations | Number of times this patent is cited by later patents (forward citations). | 63 326 |
tech_domain | BERTopic-derived technology domain (integer 0–15; –1 marks outliers). | 62 337 |
real_value | Market‐value proxy associated with the patent, derived from the updated dataset of Kogan, L., Papanikolaou, D., Seru, A. & Stoffman, N. Technological innovation, resource allocation, and growth. The Q. J. Econ. 132, 665–712, DOI: 10.1093/qje/qjw040 (2017). | 26 306 |
Each patent was assigned to one of 16 topics (tech_domain), numbered 0–15 (with –1 for outliers). Below is the label, example keywords (with their topic cohesion scores), and the number of patents in each topic:
ID | Label | Top Keywords (score) | Count |
---|---|---|---|
0 | Data Processing & Memory Management | processing (0.516), computing (0.461), process (0.449), systems (0.443), memory (0.421) | 27 435 |
1 | Microgrid & Distributed Energy Systems | microgrid (0.487), electricity (0.421), utility (0.401), power (0.380), energy (0.370) | 5 378 |
2 | Vehicle Control & Autonomous Powertrains | vehicle (0.477), vehicles (0.468), control (0.416), driving (0.387), engine (0.386) | 3 747 |
3 | Irrigation & Agricultural Water Mgmt | irrigation (0.511), systems (0.431), flow (0.353), process (0.348), water (0.333) | 2 754 |
4 | Photovoltaic & Electrochemical Devices | semiconductor (0.518), photoelectric (0.509), electrodes (0.487), electrode (0.473), photovoltaic (0.470) | 2 599 |
5 | Clinical Microbiome & Therapeutics | microbiome (0.481), clinical (0.371), physiological (0.321), therapeutic (0.320), disease (0.314) | 2 286 |
6 | Combustion Engine Control | combustion (0.423), engine (0.373), control (0.342), fuel (0.338), ignition (0.318) | 2 179 |
7 | Battery Charging & Management | charging (0.485), charger (0.449), charge (0.425), battery (0.386), batteries (0.377) | 1 541 |
8 | HVAC & Thermal Regulation | hvac (0.515), heater (0.474), cooling (0.471), heating (0.464), evaporator (0.455) | 1 523 |
9 | Lighting & Illumination Systems | lighting (0.621), illumination (0.601), lights (0.545), brightness (0.526), light (0.488) | 1 219 |
10 | Exhaust & Emission Treatment | exhaust (0.464), catalytic (0.446), purification (0.444), catalyst (0.366), emissions (0.365) | 1 064 |
11 | Wind Turbine & Rotor Control | turbines (0.498), turbine (0.488), windmill (0.464), wind (0.418), rotor (0.300) | 988 |
12 | Aircraft Wing Aerodynamics & Control | wing (0.450), aircraft (0.448), wingtip (0.424), apparatus (0.423), aerodynamic (0.418) | 697 |
13 | Meteorological Radar & Weather Forecasting | radar (0.541), meteorological (0.511), weather (0.412), precipitation (0.391), systems (0.372) | 542 |
14 | Fuel Cell Systems & Electrodes | fuel (0.375), cell (0.313), systems (0.295), cells (0.291), controls (0.262) | 377 |
15 | Turbine Airfoils & Cooling | airfoils (0.584), airfoil (0.572), turbine (0.433), engine (0.333), axial (0.321) | 352 |
–1 | Outliers | – | 7 656 |
This Zenodo entry contains topic_modeling.ipynb, a fully documented Jupyter notebook containing Python code for uncovering latent themes in patent abstracts using BERTopic. It walks through text preprocessing (lowercasing, standard English stopwords plus “herein” and “invention,” tokenization, and boilerplate removal), embedding with the all-MiniLM-L6-v2 SentenceTransformer, dimensionality reduction via UMAP, clustering with HDBSCAN, and topic extraction through class-based TF-IDF. The script also executes a grid search over UMAP and HDBSCAN hyperparameters, computes UMass coherence and topic diversity for each configuration, and saves a CSV of evaluation metrics, enabling straightforward reproduction of our topic-modeling workflow.
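A minimal sketch of that pipeline, using the building blocks named above (all-MiniLM-L6-v2 embeddings, UMAP, HDBSCAN, and class-based TF-IDF), is shown below. The input file, column name, and hyperparameter values are placeholders, not the grid-searched settings from the notebook.

```python
import pandas as pd
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from umap import UMAP

# Placeholder input: a list of preprocessed patent abstracts (hypothetical file/column).
docs = pd.read_csv("green_ai_abstracts.csv")["abstract"].dropna().tolist()

# Standard English stopwords extended with patent boilerplate terms.
stop_words = list(ENGLISH_STOP_WORDS) + ["herein", "invention"]

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                    metric="cosine", random_state=42),
    hdbscan_model=HDBSCAN(min_cluster_size=50, metric="euclidean",
                          cluster_selection_method="eom", prediction_data=True),
    vectorizer_model=CountVectorizer(stop_words=stop_words),  # feeds class-based TF-IDF
)

topics, probs = topic_model.fit_transform(docs)    # topic -1 marks outlier documents
print(topic_model.get_topic_info().head())         # topic sizes and top keywords
```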
Additional analyses, such as data cleaning, merging, aggregation, and the generation of summary tables and plots, were also performed but are not included here by default, as they consist of straightforward operations using standard open-source libraries (e.g., pandas, NumPy, matplotlib, and seaborn). The full code for these steps can be made available upon request.
This dataset contains the results of a survey conducted on undergraduate students enrolled in the 2nd and 3rd year of study at the Faculty of Cybernetics, Statistics and Economic Informatics. The survey was conducted online and distributed through social media groups. The aim of the survey was to gather insights into students' perceptions of the role of artificial intelligence in education.
Question 1: On a scale of 1 to 10, how informed do you think you are about the concept of artificial intelligence? (1-not informed at all, 10-extremely informed)
Question 2: What sources do you use to learn about the concept of artificial intelligence? -Internet -Books/Scientific papers (physical/online format) -Social media -Discussions with family/friends -I don't inform myself about AI
Question 3: Express your agreement or disagreement with the following statements: (Strongly Disagree, Partially Disagree, Neutral, Partially Agree, Fully Agree) 1. AI encourages dehumanization 2. Robots will replace people at work 3. AI helps to solve many problems in society (education, agriculture, medicine), managing time and dangerous situations more efficiently 4. AI will rule society
Question 4: Express your agreement or disagreement with the following statements: (Strongly Disagree, Partially Disagree, Neutral, Partially Agree, Fully Agree) 1. Machinery using AI is very expensive and resource intensive to build and maintain 2. AI will lead to a global economic crisis 3. AI will help global economic growth 4. AI leads to job losses
Question 5: When you think about AI do you feel: o Curiosity o Fear o Indifference o Trust
Question 6: In which areas do you think AI would have a big impact? -Education -Medicine -Agriculture -Constructions -Marketing -Public administration -Art
Question 7: On a scale of 1 to 10, how useful do you think AI would be in the educational process? (1- not useful at all, 10-extremely useful)
Question 8: What do you think is the main advantage that AI would have in the teaching process? o Teachers can be assisted by a virtual assistant for teaching lessons and answering students' questions immediately o More efficient management of teachers' time o More interactive and engaging lessons for students o Other
Question 9: What do you think is the main advantage that AI would have in the learning process? o Personalized lessons according to students' needs o Universal access for all students eager to learn, including those with special needs o More interactive and engaging lessons for students o Other
Question 10: What do you think is the main advantage that AI would have in the evaluation process? o Automation of exam grading o Fewer errors in grading system o Constant feedback from virtual assistants for each student o Other
Question 11: What do you think is the main disadvantage that AI would have in the educational process? o Lack of a relationship between students and teacher o Internet addiction o Rarer interactions between students and teachers o Loss of information caused by possible system failure
Question 12: What is your gender? o Female o Male
Question 13: What is your year of study? o Year 2 o Year 3
Question 14: What is your major? o Economic Cybernetics o Statistics and Economic Forecasting o Economic Informatics
Question 15: Did you pass all your exams? o Yes o No
Question 16: What is your GPA for your last year of study? (Note that grades are from 1 to 10 in Romania) o 5.0-5.4 o 5.5-5.9 o 6.0-6.4 o 6.5-6.9 o 7.0-7.4 o 7.5-7.9 o 8.0-8.4 o 8.5-8.9 o 9.0-9.4 o 9.5-10
Turnover data by fiscal year for the City of Tempe compared to the seven market cities, which include Chandler, Gilbert, Glendale, Mesa, Phoenix, Peoria, and Scottsdale. There are two totals, one with and one without retirees. Please note that the Valley Benchmark Cities’ annual average is unavailable for FY 2020/2021 due to a gap in data collection during that year. Please note that corrections were made to the data, including historic data, due to additional review and research on the data on 10/2/2024. This page provides data for the Employee Turnover performance measure. The performance measure dashboard is available at 5.07 Employee Turnover.
Additional Information
Source: Department Reports
Contact: Lawrence La Victoire
Contact E-Mail: lawrence_lavictoire@tempe.gov
Data Source Type: Excel
Preparation Method: Extracted from PeopleSoft; requested data from other cities is entered manually into a spreadsheet and calculations are conducted to determine percent of turnover per fiscal year
Publish Frequency: Annually
Publish Method: Manual
Data Dictionary
This EnviroAtlas dataset shows the employment rate, or the percent of the population aged 16-64 who have worked in the past 12 months. The employment rate is a measure of the percent of the working-age population who are employed. It is an indicator of the prevalence of unemployment, which economists often use to assess labor market conditions. It is a widely used metric to evaluate the sustainable development of communities (NRC, 2011; UNECE, 2009). This dataset is based on the American Community Survey 5-year data for 2008-2012. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1. Introduction
Sales data collection is a crucial aspect of any manufacturing industry as it provides valuable insights about the performance of products, customer behaviour, and market trends. By gathering and analysing this data, manufacturers can make informed decisions about product development, pricing, and marketing strategies in Internet of Things (IoT) business environments like the dairy supply chain.
One of the most important benefits of the sales data collection process is that it allows manufacturers to identify their most successful products and target their efforts towards those areas. For example, if a manufacturer notices that a particular product is selling well in a certain region, this information can be used to develop new products or improve existing ones, and to optimise the supply chain, in order to meet the changing needs of customers.
This dataset includes information about 7 of MEVGAL’s products [1]. The published data will help researchers understand the dynamics of the dairy market and its consumption patterns, creating fertile ground for synergies between academia and industry and eventually helping the industry make informed decisions regarding product development, pricing, and market strategies in IoT business environments. The dataset can also be used to study the impact of various external factors on the dairy market, such as economic, environmental, and technological factors, and to help assess the current state of the dairy industry and identify potential opportunities for growth and development.
2. Citation
Please cite the following papers when using this dataset:
3. Dataset Modalities
The dataset includes data regarding the daily sales of a series of dairy product codes offered by MEVGAL. In particular, the dataset includes information gathered by the logistics division and agencies within the industrial infrastructures overseeing the production of each product code. The products included in this dataset represent the daily sales and logistics of a variety of yogurt-based stock. Each of the different files include the logistics for that product on a daily basis for three years, from 2020 to 2022.
3.1 Data Collection
The process of building this dataset involves several steps to ensure that the data is accurate, comprehensive and relevant.
The first step is to determine the specific data that is needed to support the business objectives of the industry, i.e., in this publication’s case the daily sales data.
Once the data requirements have been identified, the next step is to implement an effective sales data collection method. In MEVGAL’s case this is conducted through direct communication and reports generated each day by representatives & selling points.
It is also important for MEVGAL to ensure that the data collection process is conducted in an ethical and compliant manner, adhering to data privacy laws and regulations. The industry also has a data management plan in place to ensure that the data is securely stored and protected from unauthorised access.
The published dataset consists of 13 features providing information about the date and the number of products that have been sold. Finally, the dataset was anonymised in consideration of the privacy requirements of the data owner (MEVGAL).
File | Period | Number of Samples (days) |
---|---|---|
product 1 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
product 1 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
product 1 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
product 2 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
product 2 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
product 2 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
product 3 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
product 3 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
product 3 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
product 4 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
product 4 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
product 4 2022.xlsx | 01/01/2022–31/12/2022 | 364 |
product 5 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
product 5 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
product 5 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
product 6 2020.xlsx | 01/01/2020–31/12/2020 | 362 |
product 6 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
product 6 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
product 7 2020.xlsx | 01/01/2020–31/12/2020 | 362 |
product 7 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
product 7 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
3.2 Dataset Overview
The following table enumerates and explains the features included across all of the included files.
Feature | Description | Unit |
---|---|---|
Day | Day of the month | - |
Month | Month | - |
Year | Year | - |
daily_unit_sales | Daily sales: the number of products, measured in units, sold on that specific day | units |
previous_year_daily_unit_sales | Previous year's sales: the number of products, measured in units, sold on that specific day the previous year | units |
percentage_difference_daily_unit_sales | The percentage difference between the two values above | % |
daily_unit_sales_kg | The amount of products, measured in kilograms, sold on that specific day | kg |
previous_year_daily_unit_sales_kg | Previous year's sales: the amount of products, measured in kilograms, sold on that specific day the previous year | kg |
percentage_difference_daily_unit_sales_kg | The percentage difference between the two values above | % |
daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned | % |
previous_year_daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned, the previous year | % |
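As a rough illustration of how the percentage-difference columns relate to the unit-sales columns above, the sketch below recomputes one of them from a loaded file; the exact formula (current vs. previous year, relative to the previous year) is an assumption inferred from the feature names, and the file name is one of those listed in Section 3.1.

```python
import pandas as pd

# One of the published Excel files (requires openpyxl for .xlsx).
df = pd.read_excel("product 1 2020.xlsx")

# Assumed definition: (current year - previous year) / previous year, in percent.
recomputed = (
    (df["daily_unit_sales"] - df["previous_year_daily_unit_sales"])
    / df["previous_year_daily_unit_sales"] * 100
)

# Compare against the published column to validate the interpretation.
max_gap = (recomputed - df["percentage_difference_daily_unit_sales"]).abs().max()
print(f"Largest deviation from the published column: {max_gap:.4f} percentage points")
```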
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Startup Success Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/manishkc06/startup-success-prediction on 28 January 2022.
--- Dataset description provided by original source is as follows ---
A startup or start-up is a company or project begun by an entrepreneur to seek, develop, and validate a scalable economic model. While entrepreneurship refers to all new businesses, including self-employment and businesses that never intend to become registered, startups refer to new businesses that intend to grow large beyond the solo founder. Startups face high uncertainty and have high rates of failure, but a minority of them do go on to be successful and influential. Some startups become unicorns: privately held startup companies valued at over US$1 billion. [Source of information: Wikipedia]
Startups play a major role in economic growth. They bring new ideas, spur innovation, and create employment, thereby moving the economy forward. There has been exponential growth in startups over the past few years. Predicting the success of a startup allows investors to find companies that have the potential for rapid growth, thereby allowing them to be one step ahead of the competition.
The objective is to predict whether a startup which is currently operating turns into a success or a failure. The success of a company is defined as the event that gives the company's founders a large sum of money through the process of M&A (Merger and Acquisition) or an IPO (Initial Public Offering). A company would be considered as failed if it had to be shut down.
The data contains industry trends, investment insights and individual company information. There are 48 columns/features. Some of the features are:
--- Original source retains full ownership of the source dataset ---
This EnviroAtlas dataset portrays the commute time of workers to their workplace for each Census Block Group (CBG) during 2008-2012. Data were compiled from the Census ACS (American Community Survey) 5-year Summary Data. The commute time is the amount of travel time in minutes for workers to get from home to work. This value includes private vehicle use, carpooling, public transit, bicycling, or walking. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was compiled with the ultimate goal of developing non-invasive computer vision algorithms for assessing shrimp biometrics and biomass estimation. The main folder, labeled "DATASET," contains five sub-folders—DB1, DB2, DB3, DB4, and DB5—each filled with images of shrimps. Additionally, each sub-folder is accompanied by an Excel file that includes manually measured data for the shrimps pictured. The files are named respectively: DB1_INDUSTRIAL_FARM_1, DB2_INDUSTRIAL_FARM_2_C1, DB3_INDUSTRIAL_FARM_2_C2, DB4_ACADEMIC_POND_S1, and DB5_ACADEMIC_POND_S2.
Here’s a detailed description of the contents of each sub-folder and its corresponding Excel file:
1) DB1 includes 490 PNG images of 22 shrimps taken from one pond at an industrial farm. The associated Excel file, DB1_INDUSTRIAL_FARM_1, contains columns for: SAMPLE: Reflecting the number of individual shrimps (22 entries or rows). LENGTH (cm): Measuring from the rostrum (near the eyes) to the start of the tail. WEIGHT (g): Recorded using a scale. COMPLETE SHRIMP IMAGES: Indicates if at least one full-body image is available (1) or not (0).
2) DB2 consists of 2002 PNG images of 58 shrimps. The Excel file, DB2_INDUSTRIAL_FARM_2_C1, includes: SAMPLE: Number of shrimps (58 entries or rows). CEPHALOTHORAX (cm): Width measured at the middle. LENGTH (cm) and WEIGHT (g): Similar measurements as DB1. COMPLETE SHRIMP IMAGES: Presence (1) or absence (0) of full-body images.
3) DB3 contains 1719 PNG images of 50 shrimps, with its Excel file, DB3_INDUSTRIAL_FARM_2_C2, documenting: SAMPLE: Number of shrimps (50 entries or rows). Measurements and categories identical to DB2.
4) DB4 encompasses 635 PNG images of 20 shrimps, detailed in the Excel file DB4_ACADEMIC_POND_S1. This includes: SAMPLE: Number of shrimps (20 entries or rows). CEPHALOTHORAX (cm), LENGTH (cm), WEIGHT (g), and COMPLETE SHRIMP IMAGES: Documented as in other datasets.
5) DB5 includes 661 PNG images of 20 shrimps, with DB5_ACADEMIC_POND_S2 as the corresponding Excel file. The file mirrors the structure and measurements of DB4.
The images in each folder are named "sm_n", where m is the shrimp sample number and n is the picture number for that shrimp. This carefully structured dataset provides comprehensive biometric data on shrimps, facilitating the development of algorithms aimed at non-invasive measurement techniques. This will likely be pivotal in enhancing the precision of biomass estimation in aquaculture farming, utilizing advanced statistical morphology analysis and machine learning techniques.
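A small sketch of how the "sm_n" naming convention could be used to link each image back to its row in the corresponding Excel file is shown below; the folder layout and column name follow the description above, while the exact regular expression and the assumption that the SAMPLE column holds the sample number m are illustrative.

```python
import re
from pathlib import Path

import pandas as pd

# Hypothetical paths following the dataset description.
images_dir = Path("DATASET/DB1")
measurements = pd.read_excel("DB1_INDUSTRIAL_FARM_1.xlsx")  # columns: SAMPLE, LENGTH (cm), ...

# File names follow "sm_n": m = shrimp sample number, n = picture number, e.g. "s3_12.png".
pattern = re.compile(r"s(\d+)_(\d+)\.png$", re.IGNORECASE)

records = []
for img in sorted(images_dir.glob("*.png")):
    match = pattern.search(img.name)
    if match:
        records.append({"file": img.name,
                        "SAMPLE": int(match.group(1)),
                        "picture": int(match.group(2))})

images = pd.DataFrame(records)
# Join each image to the manual measurements of its shrimp (assumes SAMPLE holds m).
linked = images.merge(measurements, on="SAMPLE", how="left")
print(linked.head())
```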
This table contains data on the number of licensed day care center slots (facility capacity) per 1,000 children aged 0-5 years in California, its regions, counties, cities, towns, and census tracts. The table contains 2015 data, and includes type of facility (day care center or infant center). Access to child care has become a critical support for working families. Many working families find high-quality child care unaffordable, and the increasing cost of child care can be crippling for low-income families and single parents. These barriers can impact parental choices of child care. Increased availability of child care facilities can positively impact families by providing more choices of child care in terms of price and quality. Estimates for this indicator are provided for the total population, and are not available by race/ethnicity. More information on the data table and a data dictionary can be found in the Data and Resources section. The licensed day care centers table is part of a series of indicators in the Healthy Communities Data and Indicators Project (HCI) of the Office of Health Equity. The goal of HCI is to enhance public health by providing data, a standardized set of statistical measures, and tools that a broad array of sectors can use for planning healthy communities and evaluating the impact of plans, projects, policy, and environmental changes on community health. The creation of healthy social, economic, and physical environments that promote healthy behaviors and healthy outcomes requires coordination and collaboration across multiple sectors, including transportation, housing, education, agriculture and others. Statistical metrics, or indicators, are needed to help local, regional, and state public health and partner agencies assess community environments and plan for healthy communities that optimize public health. More information on HCI can be found here: https://www.cdph.ca.gov/Programs/OHE/CDPH%20Document%20Library/Accessible%202%20CDPH_Healthy_Community_Indicators1pager5-16-12.pdf
The format of the licensed day care centers table is based on the standardized data format for all HCI indicators. As a result, this data table contains certain variables used in the HCI project (e.g., indicator ID, and indicator definition). Some of these variables may contain the same value for all observations.
Investigator(s): Bureau of Justice Statistics The National Survey of Prosecutors is a survey of chief prosecutors in state court systems. A chief prosecutor is an official, usually locally elected and typically with the title of district attorney or county attorney, who is in charge of a prosecutorial district made up of one or more counties, and who conducts or supervises the prosecution of felony cases in a state court system. Prosecutors in courts of limited jurisdiction, such as municipal prosecutors, were not included in the survey. The survey's purpose was to obtain detailed descriptive information on prosecutors' offices, as well as information on their policies and practices. Years Produced: Every 4 to 5 years.
The global big data market is forecast to grow to 103 billion U.S. dollars by 2027, more than double its expected market size in 2018. With a share of 45 percent, the software segment would become the largest big data market segment by 2027.
What is Big data?
Big data is a term that refers to the kind of data sets that are too large or too complex for traditional data processing applications. It is defined as having one or some of the following characteristics: high volume, high velocity or high variety. Fast-growing mobile data traffic, cloud computing traffic, as well as the rapid development of technologies such as artificial intelligence (AI) and the Internet of Things (IoT) all contribute to the increasing volume and complexity of data sets.
Big data analytics
Advanced analytics tools, such as predictive analytics and data mining, help to extract value from the data and generate new business insights. The global big data and business analytics market was valued at 169 billion U.S. dollars in 2018 and is expected to grow to 274 billion U.S. dollars in 2022. As of November 2018, 45 percent of professionals in the market research industry reportedly used big data analytics as a research method.
We indicate how likely a piece of content is to be computer generated or human written. Content: any text in English or Spanish, from a single sentence to articles thousands of words long.
Data uniqueness: we use custom built and trained NLP algorithms to assess human effort metrics that are inherent in text content. We focus on what's in the text, not metadata such as publication or engagement. Our AI algorithms are co-created by NLP & journalism experts. Our datasets have all been human-reviewed and labeled.
Dataset: CSV containing URL and/or body text, with attributed scoring as an integer and model confidence as a percentage. We ignore metadata such as author, publication, date, word count, shares, and so on, to provide a clean and maximally unbiased assessment of how much human effort has been invested in content. Our data is provided in CSV/RSS/JSON format; one row = one scored article.
Integrity indicators provided as integers on a 1–5 scale. We also have custom models with 35 categories that can be added on request.
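As an illustration of the one-row-per-article format described above, the snippet below reads such a CSV and filters on the integrity score; the column names are assumptions, since the exact header layout is not specified here.

```python
import pandas as pd

# Hypothetical column names for the one-row-per-article CSV described above.
df = pd.read_csv("scored_articles.csv")  # e.g. columns: url, body_text, score, confidence

# Keep articles with a high integrity score (1-5 scale) and confident model output.
high_integrity = df[(df["score"] >= 4) & (df["confidence"] >= 80)]
print(f"{len(high_integrity)} of {len(df)} articles scored 4 or 5")
```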
Data sourcing: public websites, crawlers, scrapers, and other partnerships where available. We can generally assess content both behind and without paywalls. We source from ~4,000 news outlets; examples include Bloomberg, CNN, and the BBC. Countries: all English-speaking markets worldwide, including English-language content from non-English-majority regions such as Germany, Scandinavia, and Japan. Also available in Spanish on request.
Use cases: assessing the implicit integrity and reliability of an article. There is a correlation between integrity and human value: we have shown that articles scoring highly on our scales show increased, sustained, ongoing end-user engagement. Clients also use this to assess journalistic output and publication relevance, and to create datasets of 'quality' journalism.
Overtone provides a range of qualitative metrics for journalistic, newsworthy and long-form content. We find, highlight and synthesise content that shows added human effort and, by extension, added human value.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
UnFake - https://github.com/negativenagesh/UnFake. Check out the repo and give it a star.
Welcome to the UnFake Deepfake Detection Dataset, a meticulously curated collection of approximately 76,000 images designed to advance research in deepfake detection. This dataset is the backbone of the "UnFake" project, a pioneering platform aimed at identifying AI-generated or manipulated images, with a focus on protecting users of platforms like Unsplash from misinformation, legal risks, and reputational harm. With the rise of deepfake technology blurring the lines between reality and fabrication, this dataset provides a robust resource for training and evaluating models to distinguish real images from their AI-crafted counterparts.
The dataset comprises real images scraped from Unsplash (approximately 76,000). It primarily focuses on human faces and bodies, spanning a wide range of categories to ensure diversity and real-world applicability.
Dataset Composition
Total Images: ~76,000
Sources: Unsplash: ~76,000 real images scraped via the Unsplash API, representing high-quality, royalty-free photography.
Pose & Composition: Includes close-up shots and headshot/portrait-style images.
Purpose
This dataset was created to address the growing challenge of deepfake proliferation on platforms hosting billions of images, such as Unsplash. With over 5 million photos and 13 billion monthly impressions (source: Unsplash), Unsplash is a vital resource for designers, marketers, educators, and more. However, the lack of transparency about image authenticity poses risks: misinformation, copyright violations, defamation, and more. This dataset powers the "UnFake" platform, integrating deepfake detection into the image-downloading process, and is now shared with the Kaggle community to foster innovation in AI-driven media authentication.
Key Features
Size: ~76K images, offering a substantial volume for training and testing.
Diversity: Broad representation across ethnicity, age, and facial characteristics.
Real-World Relevance: Sourced from Unsplash, a widely-used platform, plus synthetic deepfakes mimicking modern AI techniques.
Potential Use Cases
Deepfake Detection Research: Train and benchmark convolutional neural networks (e.g., EfficientNet-B7, as used in UnFake) to classify real vs. fake images.
Media Authentication: Develop tools to verify image authenticity on stock photo platforms.
AI Ethics & Security: Study the implications of deepfake technology and build countermeasures.
Educational Projects: Use in academic settings to explore computer vision and AI.
Dataset Structure
Format: Images are stored as standard image files (e.g., JPEG/PNG).
Directories: Organized into real and fake subfolders for ease of use.
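Given the real/fake subfolder layout described above, one common way to consume the dataset is through torchvision's ImageFolder; the sketch below is a generic loading example with an assumed root path and standard EfficientNet-style preprocessing, not the actual UnFake training code.

```python
import torch
from torchvision import datasets, transforms

# Standard preprocessing for an ImageNet-pretrained classifier such as EfficientNet.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Assumed layout: <root>/real/*.jpg and <root>/fake/*.jpg, as described above.
dataset = datasets.ImageFolder(root="unfake_dataset", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

print(dataset.classes)            # class names inferred from folder names, e.g. ['fake', 'real']
images, labels = next(iter(loader))
print(images.shape, labels[:8])   # torch.Size([32, 3, 224, 224]) and a batch of 0/1 labels
```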
How It Was Built
Real Images: Scraped from Unsplash using its official API, focusing on human-centric photos.
Acknowledgments
Unsplash: For providing a rich source of real-world images via their API.
License
This dataset is released under the Apache License, Version 2.0, allowing free use, modification, and distribution. See the LICENSE file for details.
Get Started
Download the dataset, explore the diversity of real images, and join the fight against manipulated media! Check out the UnFake GitHub repo - https://github.com/negativenagesh/UnFake - for the full project, including code and model details.
A one-year seismic hazard forecast for the Central and Eastern United States, based on induced and natural earthquakes, has been produced by the U.S. Geological Survey. The model assumes that earthquake rates calculated from several different time windows will remain relatively stationary and can be used to forecast earthquake hazard and damage intensity for the year 2018. This assessment is the first step in developing an operational earthquake forecast for the CEUS, and the analysis could be revised with updated seismicity and model parameters. Consensus input models consider alternative earthquake catalog durations, smoothing parameters, maximum magnitudes, and ground motion estimates, and represent uncertainties in earthquake occurrence and diversity of opinion in the science community. Near some areas of active induced earthquakes, hazard is higher than in the 2014 USGS National Seismic Hazard Model (NSHM) by more than a factor of 3; the 2014 NSHM did not consider induced earthquakes. In some areas, previously observed induced earthquakes have stopped, so the seismic hazard reverts back to the 2014 NSHM. This data set represents the results of calculations of hazard curves for a grid of points with a spacing of 0.05 degrees in latitude and longitude. This particular data set is for horizontal spectral response acceleration for 1.0-second period with a 1 percent probability of exceedance in 1 year. The data are for the Western United States and are based on the long-term 2014 National Seismic Hazard Model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Traffic counter’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/61a025bea95e6e49a64ed7a7 on 17 January 2022.
--- Dataset description provided by original source is as follows ---
Bordeaux Métropole has installed a network of sensors to assess the state of traffic in its territory. This dataset represents, as point features, the geolocation of these sensors on the pavement. It provides information on the identifier and type of counter. There are two types of counters: all-vehicle counters connected to the central station of the Bordeaux Métropole Circulation service via the "Gertrude" system, and SIREDO stations that make it possible to distinguish light vehicles from heavy goods vehicles and to record speeds.
For "BOUCLE" (inductive loop) sensors, a "real time" data record is available at a 5-minute time step. For "SIREDO" sensors, the operating system does not currently allow real-time data to be retrieved on the open data portal.
The 5-minute data history has also been available since mid-October 2020.
These data correspond to the raw counts from the sensors. They do not take into account any a posteriori corrections made in the event of a specific failure or problem with a sensor. Publications produced by Bordeaux Métropole from reliable data sets may therefore differ from the raw 5-minute data.
The sensors are magnetic loops embedded in the roadway that count road vehicles passing over them, so they can be disrupted by external events or by works on public space.
Via the Bordeaux Métropole WebServices, it is possible to:
This dataset is available in an additional format:
Download in AutoCAD DWG format
This dataset is refreshed every 3 minutes. Be careful: for performance reasons, this dataset (Table, Map, Analysis and Export tabs) may be updated less frequently than the source, so a deviation may exist. We also invite you to use our web services (see the Webservices BM tab) to retrieve the freshest data.
--- Original source retains full ownership of the source dataset ---
On an island largely devoid of native vertebrate seed dispersers, we monitored forest succession for seven years following ungulate exclusion from a 5-hectare area and adjacent plots with ungulates still present. The study site was in northern Guam on Andersen Air Force Base (13°37’N, 144°51’E) and situated on a coralline limestone plateau. We established 22 plots and six 0.25-m2 subplots to measure trees and understory canopy. Data were collected in February or March, during the dry season from 2005-2011.
Caeli can provide this data through an API, dashboard, real-time geo map, or via datasets (.csv). In addition, all of this data is available in daily, monthly, and annual formats. The data can be delivered in various spatial resolutions starting from 0.001 degrees latitude and longitude (WGS 84), which roughly converts to 100 x 100 meters.
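As a rough check on the stated resolution, the snippet below converts 0.001 degrees into metres; note that the east-west distance shrinks with latitude, so the 100 x 100 m figure is an approximation that holds best near the equator.

```python
import math

def deg_to_metres(deg: float, latitude_deg: float) -> tuple[float, float]:
    """Approximate ground distance covered by `deg` degrees of latitude and longitude."""
    metres_per_deg_lat = 111_320.0                                       # roughly constant
    metres_per_deg_lon = 111_320.0 * math.cos(math.radians(latitude_deg))
    return deg * metres_per_deg_lat, deg * metres_per_deg_lon

print(deg_to_metres(0.001, latitude_deg=0))    # ~ (111.3 m, 111.3 m) at the equator
print(deg_to_metres(0.001, latitude_deg=52))   # ~ (111.3 m, 68.5 m) at 52 degrees north
```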
The Caeli datasets are often used for creating and validating various models and for training machine learning algorithms. We’ll allow you to specify your state or country, your preferred timeframe, resolution, and pollutant. Based on this information we’ll compile a reliable dataset. The measurements in the dataset can be used to determine the air quality of a region for a specific period of time. Additionally, your composite dataset can also serve strategy and reporting purposes, such as ESG strategy, TCFD, SFDR, and sustainable decision making. The price of the dataset is based on the size of the area, the resolution chosen, and the number of years.
Additional information about particulate matter (PM2.5, PM10): Particulate matter (PM) refers to tiny particles suspended in the air that can be inhaled into the respiratory system. PM is classified by size, with PM2.5 and PM10 referring to particles that are 2.5 micrometers and 10 micrometers in diameter, respectively. PM2.5 particles are particularly harmful because they are small enough to pass through the respiratory system and enter the bloodstream, where they can cause a variety of health problems. PM2.5 and PM10 are often used as indicators of air quality, with higher concentrations of these particles in the air being associated with increased risk of respiratory and cardiovascular diseases.
Are you interested in the pollutant particulate matter (PM2.5, PM10), or would you like to gather more information about our services? Please do not hesitate to contact us: www.caeli.space
Sector coverage: Financial | Energy | Government | Agricultural | Health | Shipping.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Selected Video Facsimile/Slot Machine Data from Foxwoods and Mohegan Sun Casinos’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/16616a80-923f-4991-90de-ed11bdf3d67f on 27 January 2022.
--- Dataset description provided by original source is as follows ---
Mohegan Sun Footnotes:
(1) Monthly contributions are due to the State by the 15th of the following month.
(2) Mohegan Sun did not include the value of eBonus credits redeemed by patrons at slot machines in its video facsimile devices Win amounts; however, the value of eBonus credits wagered was included in the reported Handle. In addition, please be advised that the Casino Hold % column amounts may be understated and the Payout % column amounts may be overstated as a result of this.
(3) From July 1, 2009 to June 30, 2012, if the aggregate amount of eBonus coupons or credits actually played on the Mohegan Tribe's Video Facsimiles during a particular month exceeded 5.5% of "gross operating revenues" for that month, the Mohegan Tribe paid to the State an amount equal to twenty-five percent (25%) of such excess face amount of eBonus coupons or credits used in such calendar month (the "eBonus Contribution"). Beginning on July 1, 2012, and for all months thereafter, the aggregate amount threshold for determining the eBonus Contribution increased from 5.5% to 11% of "gross operating revenues."
(4) The value of eBonus free slot play credits redeemed during February 2009 totaled $1,910,268; however, it was determined that eBonus credits redeemed were overstated by $1,460,390 for January 2008 through January 2009. February 2009 is adjusted by this amount. March 2009 was adjusted by an additional $8,139.
(5) During fiscal year 2010 the Mohegan Tribe and the State of Connecticut settled a dispute regarding the proper treatment of eBonus for the period November 2007 through June 2009. As a result of this settlement, the State of Connecticut received $5,727,731, including interest.
(6) For fiscal years 2007/2008 and 2008/2009, Poker Pro Electronic Table Rake Amounts of $401,309 and $42,188, respectively, were included in the calculation to determine the amount of Slot Machine Contributions to the State of Connecticut.
(7) The Mohegan Sun Casino officially opened on Saturday, October 12, 1996. On October 8-10, video facsimile/slot machines were available for actual play during pre-opening charitable gaming nights.
(8) Beginning with the month of May 2001, Mohegan Sun Casino reports video facsimile/slot machine win on an accrual basis, reflecting data captured and reported by an on-line slot accounting system. Reports were previously prepared on a cash basis, based on the coin and currency removed from the machines on each gaming day.
(9) The cumulative Win amount total should be reduced by $1,452,341.21 to correct for an over-reporting of slot revenues for prior periods related to errors in the accrual carry-forward of estimated cash on floor.
Foxwoods Footnotes:
(1) Monthly contributions are due to the State by the 15th of the following month.
(2) The operation of the video facsimile/slot machines began at Foxwoods on January 16, 1993.
(3) Foxwoods did not include the value of Free Play coupons redeemed by patrons at slot machines in its video facsimile devices Win amounts; however, the value of Free Play coupons wagered was included in the reported Handle. In addition, please be advised that the Casino Hold % column amounts may be understated and the Payout % column amounts may be overstated as a result of this.
(4) From July 1, 2009 to June 30, 2012, if the aggregate amount of Free Play coupons or credits actually played on the Mashantucket Pequot Tribe's Video Facsimiles during a particular month exceeded 5.5% of "gross operating revenues" for that month, the Mashantucket Pequot Tribe paid to the State an amount equal to twenty-five percent (25%) of such excess face amount of Free Play coupons or credits used in such calendar month (the "Free Play Contribution"). Beginning on July 1, 2012, and for all months thereafter, the aggregate amount threshold for determining the Free Play Contribution increased from 5.5% to 11% of "gross operating revenues."
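Footnote (3) for Mohegan Sun and footnote (4) for Foxwoods describe the same threshold formula; a small worked sketch of that calculation is below, using illustrative figures rather than reported ones.

```python
def coupon_contribution(coupons_played: float, gross_operating_revenues: float,
                        threshold: float = 0.055, rate: float = 0.25) -> float:
    """25% of the face amount of coupons/credits played above the monthly threshold."""
    allowance = threshold * gross_operating_revenues
    excess = max(0.0, coupons_played - allowance)
    return rate * excess

# Illustrative month (not reported figures): $60M gross revenues, $4M in coupons played.
# Allowance = 0.055 * 60,000,000 = 3,300,000; excess = 700,000; contribution = 175,000.
print(coupon_contribution(4_000_000, 60_000_000))   # 175000.0
```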
--- Original source retains full ownership of the source dataset ---