20 datasets found
  1. Mobile Detection Dataset

    • universe.roboflow.com
    zip
    Updated Jun 29, 2023
    Cite
    asu (2023). Mobile Detection Dataset [Dataset]. https://universe.roboflow.com/asu-b6mtv/mobile-detection-l2iov/model/7
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 29, 2023
    Dataset authored and provided by
    asu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Mobile Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Distracted Driver Detection: This model can be incorporated into vehicular systems to alert when drivers are distracted by activities such as mobile use, drinking, eating, or smoking, thereby improving road safety.

    2. Public Health Research: Researchers can use this model to anonymously collect data in public spaces to study the prevalence of unhealthy habits like smoking and high consumption of fast food, informing public health initiatives.

    3. Employee Productivity Monitoring: Companies could use this model to monitor employee focus and productivity, identifying when workers are distracted by mobile phones, eating, or smoking during work hours.

    4. Parental Control Applications: The model can be used to monitor children's activity, alerting parents if the child uses a mobile phone excessively or is detected smoking.

    5. Consumer Behaviour Analysis: Retail businesses can use the model to understand the habits of their customers better, such as the prevalence of mobile use in-store, food and drink consumption patterns, and smoking habits.
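
    As a hypothetical illustration, the hosted model at the citation URL above can typically be queried with the roboflow Python package; the API key and image path below are placeholders:

    from roboflow import Roboflow

    rf = Roboflow(api_key="YOUR_API_KEY")  # placeholder key
    project = rf.workspace("asu-b6mtv").project("mobile-detection-l2iov")
    model = project.version(7).model

    # Run inference on a local image; returns bounding boxes for detected phones.
    prediction = model.predict("driver_frame.jpg", confidence=40, overlap=30)
    print(prediction.json())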

  2. Requirements data sets (user stories)

    • zenodo.org
    • data.mendeley.com
    txt
    Updated Jan 13, 2025
    Cite
    Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Mendeley Ltd.
    Authors
    Fabiano Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 22 data sets of 50+ requirements each, expressed as user stories.

    The dataset has been created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].

    The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

    This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1
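
    Since each file is plain text, a minimal parsing sketch such as the following can split stories into role / goal / benefit parts; the one-story-per-line, "As a ..., I want ..., so that ..." layout is an assumption about the files, not something stated above:

    import re

    # Hypothetical loader: assumes one "As a ..., I want ..., so that ..."
    # story per line, the common layout for user-story collections.
    STORY = re.compile(
        r"As an? (?P<role>.+?),\s*I (?:want|need) (?P<goal>.+?)"
        r"(?:,?\s*so that (?P<benefit>.+))?$",
        re.IGNORECASE,
    )

    def parse_stories(path):
        stories = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                m = STORY.match(line.strip())
                if m:
                    stories.append(m.groupdict())
        return stories

    # Example: parse_stories("g03-loudoun.txt")
    # -> [{'role': ..., 'goal': ..., 'benefit': ...}, ...]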

    Overview of the datasets [data and links added in December 2024]

    The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

    Public administration and transparency

    g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.

    g03-loudoun.txt (2018) is a set of requirements extracted from a document by Loudoun County, Virginia, that describes the to-be user stories and use cases for a land management readiness assessment system called Loudoun County LandMARC. The source document can be found here; it is part of the Electronic Land Management System and EPlan Review Project RFP/RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

    g04-recycling.txt (2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub website and is the basis of a students' project on website design; the code is available (no license).

    g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge Foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and on how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown licenses.

    g11-nsf.txt (2018) is a collection of user stories for the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

    (Research) data and meta-data management

    g08-frictionless.txt (2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT licenses) and the web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.

    g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

    g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that they originate from some project related to the library of Duke University.

    g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, covering the scenarios, the user stories, and a design for their implementation. The raw data is available in the following environment.

    g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform that allows researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

    g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.

    g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its

  3. Dataplex: US Healthcare NPI Data | Access 8.5M B2B Contacts with Emails &...

    • datarade.ai
    .csv, .txt
    Updated Jul 13, 2024
    Cite
    Dataplex (2024). Dataplex: US Healthcare NPI Data | Access 8.5M B2B Contacts with Emails & Phones | Perfect for Outreach & Market Research [Dataset]. https://datarade.ai/data-products/dataplex-us-healthcare-npi-data-access-8-5m-b2b-contacts-w-dataplex
    Explore at:
    Available download formats: .csv, .txt
    Dataset updated
    Jul 13, 2024
    Dataset authored and provided by
    Dataplex
    Area covered
    United States
    Description

    US Healthcare NPI Data is a comprehensive resource offering detailed information on health providers registered in the United States.

    Dataset Highlights:

    • NPI Numbers: Unique identification numbers for health providers.
    • Contact Details: Includes addresses and phone numbers.
    • State License Numbers: State-specific licensing information.
    • Additional Identifiers: Other identifiers related to the providers.
    • Business Names: Names of the provider’s business entities.
    • Taxonomies: Classification of provider types and specialties.

    Taxonomy Data:

    • Includes codes, groupings, and classifications.
    • Facilitates detailed analysis and categorization of providers.

    Data Updates:

    • Weekly Delta Changes: Ensures the dataset is current with the latest changes.
    • Monthly Full Refresh: Comprehensive update to maintain accuracy.

    Use Cases:

    • Market Analysis: Understand the distribution and types of healthcare providers across the US. Analyze market trends and identify potential gaps in healthcare services.
    • Outreach: Create targeted marketing campaigns to reach specific types of healthcare providers. Use contact details for direct outreach and engagement with providers.
    • Research: Conduct in-depth research on healthcare providers and their specialties. Analyze provider attributes to support academic or commercial research projects.
    • Compliance and Verification: Verify provider credentials and compliance with state licensing requirements. Ensure accurate provider information for regulatory and compliance purposes.

    Data Quality and Reliability:

    • The dataset is meticulously curated to ensure high quality and reliability. Regular updates, both weekly and monthly, ensure that users have access to the most current information. The comprehensive nature of the data, combined with its regular updates, makes it a valuable tool for a wide range of applications in the healthcare sector.

    Access and Integration:

    • CSV Format: The dataset is provided in CSV format, making it easy to integrate with various data analysis tools and platforms.
    • Ease of Use: The structured format of the data ensures that it can be easily imported, analyzed, and utilized for various applications without extensive preprocessing.
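
    As a small illustration of that CSV workflow (all column names below are hypothetical; the actual header depends on the delivered file):

    import pandas as pd

    # Hypothetical file and column names; the real header depends on the
    # file Dataplex delivers.
    npi = pd.read_csv("us_healthcare_npi.csv", dtype={"NPI": str})

    # Example: individual providers in California whose taxonomy code
    # falls in the dentist range (codes starting with "1223").
    ca_dentists = npi[
        (npi["State"] == "CA") & (npi["Taxonomy Code"].str.startswith("1223"))
    ]
    print(len(ca_dentists))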

    Ideal for:

    • Healthcare Professionals: Physicians, nurses, and other healthcare providers who need to verify information about their peers.
    • Analysts: Data analysts and business analysts who require detailed and accurate healthcare provider data for their projects.
    • Businesses: Companies in the healthcare sector looking to understand market dynamics and reach out to providers.
    • Researchers: Academic and commercial researchers conducting studies on healthcare providers and services.

    Why Choose This Dataset?

    • Comprehensive Coverage: Detailed information on millions of healthcare providers across the US.
    • Regular Updates: Weekly and monthly updates ensure that the data remains current and reliable.
    • Ease of Integration: Provided in a user-friendly CSV format for easy integration with your existing systems.
    • Versatility: Suitable for a wide range of applications, from market analysis to compliance and research.

    By leveraging the US Healthcare NPI & Taxonomy Data, users can gain valuable insights into the healthcare landscape, enhance their outreach efforts, and conduct detailed research with confidence in the accuracy and comprehensiveness of the data.

    Summary:

    • This dataset is an invaluable resource for anyone needing detailed and up-to-date information on US healthcare providers. Whether for market analysis, research, outreach, or compliance, the US Healthcare NPI & Taxonomy Data offers the detailed, reliable information needed to achieve your goals.

  4. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...

    • test.researchdata.tuwien.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Explore at:
    Available download formats: bin, json, text/markdown, csv
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.
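
    For orientation, a minimal end-to-end sketch using those libraries is shown below. It assumes the train and store resources above have been downloaded locally as CSV files with the columns listed under Data Fields Description (plus a Date column, as in the Kaggle release); it is an illustration, not the project's actual notebook code.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_percentage_error
    from sklearn.model_selection import train_test_split

    # Assumes the train/store resources above were downloaded as CSV files.
    train = pd.read_csv("train.csv", parse_dates=["Date"])
    store = pd.read_csv("store.csv")

    # Join store metadata onto the daily sales records and drop closed days.
    df = train.merge(store, on="Store", how="left")
    df = df[df["Open"] == 1]

    # Simple features drawn from the field list above.
    df["Month"] = df["Date"].dt.month
    df["DayOfWeek"] = df["Date"].dt.dayofweek
    features = ["Store", "Promo", "SchoolHoliday", "Month", "DayOfWeek",
                "CompetitionDistance"]
    X = df[features].fillna(0)
    y = df["Sales"]

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                                random_state=0)
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
    model.fit(X_tr, y_tr)
    print("Validation MAPE:",
          mean_absolute_percentage_error(y_val, model.predict(X_val)))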

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.

  5. Enterprise Survey 2009-2019, Panel Data - Slovenia

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Aug 6, 2020
    Cite
    World Bank Group (WBG) (2020). Enterprise Survey 2009-2019, Panel Data - Slovenia [Dataset]. https://microdata.worldbank.org/index.php/catalog/3762
    Explore at:
    Dataset updated
    Aug 6, 2020
    Dataset provided by
    European Bank for Reconstruction and Development (http://ebrd.com/)
    World Bank Group (http://www.worldbank.org/)
    European Investment Bank (http://eib.org/)
    Time period covered
    2008 - 2019
    Area covered
    Slovenia
    Description

    Abstract

    The documentation covers Enterprise Survey panel datasets that were collected in Slovenia in 2009, 2013 and 2019.

    The Slovenia ES 2009 was conducted between 2008 and 2009. The Slovenia ES 2013 was conducted between March 2013 and September 2013. Finally, the Slovenia ES 2019 was conducted between December 2018 and November 2019. The objective of the Enterprise Survey is to gain an understanding of what firms experience in the private sector.

    As part of its strategic goal of building a climate for investment, job creation, and sustainable growth, the World Bank has promoted improving the business environment as a key strategy for development, which has led to a systematic effort in collecting enterprise data across countries. The Enterprise Surveys (ES) are an ongoing World Bank project in collecting both objective data based on firms' experiences and enterprises' perception of the environment in which they operate.

    Geographic coverage

    National

    Analysis unit

    The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment must take its own financial decisions and have its own financial statements separate from those of the firm. An establishment must also have its own management and control over its payroll.

    Universe

    As it is standard for the ES, the Slovenia ES was based on the following size stratification: small (5 to 19 employees), medium (20 to 99 employees), and large (100 or more employees).

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The samples for the Slovenia ES 2009, 2013, and 2019 were selected using stratified random sampling, following the methodology explained in the Sampling Manual for Slovenia 2009 ES and for Slovenia 2013 ES, and in the Sampling Note for 2019 Slovenia ES.

    Three levels of stratification were used in this country: industry, establishment size, and oblast (region). The original sample designs, with specific information on the industries and regions chosen, are included in the attached Excel file (Sampling Report.xls) for Slovenia 2009 ES. For Slovenia 2013 and 2019 ES, specific information on the industries and regions chosen is described in the "The Slovenia 2013 Enterprise Surveys Data Set" and "The Slovenia 2019 Enterprise Surveys Data Set" reports respectively, Appendix E.

    For the Slovenia 2009 ES, industry stratification was designed as follows: the universe was stratified into manufacturing industries, services industries, and one residual (core) sector as defined in the sampling manual. Each industry had a target of 90 interviews. For the manufacturing industries, sample sizes were inflated by about 17% to account for potential non-response cases when requesting sensitive financial data and also because of likely attrition in future surveys that would affect the construction of a panel. For the other industries (residuals), sample sizes were inflated by about 12% to account for under-sampling of firms in service industries.

    For Slovenia 2013 ES, industry stratification was designed as follows: the universe was stratified into one manufacturing industry and two service industries (retail, and other services).

    Finally, for Slovenia 2019 ES, three levels of stratification were used in this country: industry, establishment size, and region. The original sample design with specific information of the industries and regions chosen is described in "The Slovenia 2019 Enterprise Surveys Data Set" report, Appendix C. Industry stratification was done as follows: Manufacturing – combining all the relevant activities (ISIC Rev. 4.0 codes 10-33), Retail (ISIC 47), and Other Services (ISIC 41-43, 45, 46, 49-53, 55, 56, 58, 61, 62, 79, 95).

    For Slovenia 2009 and 2013 ES, size stratification was defined following the standardized definition for the rollout: small (5 to 19 employees), medium (20 to 99 employees), and large (more than 99 employees). For stratification purposes, the number of employees was defined on the basis of reported permanent full-time workers. This seems to be an appropriate definition of the labor force since seasonal/casual/part-time employment is not a common practice, except in the sectors of construction and agriculture.

    For Slovenia 2009 ES, regional stratification was defined in 2 regions. These regions are Vzhodna Slovenija and Zahodna Slovenija. The Slovenia sample contains panel data. The wave 1 panel “Investment Climate Private Enterprise Survey implemented in Slovenia” consisted of 223 establishments interviewed in 2005. A total of 57 establishments have been re-interviewed in the 2008 Business Environment and Enterprise Performance Survey.

    For Slovenia 2013 ES, regional stratification was defined in 2 regions (city and the surrounding business area) throughout Slovenia.

    Finally, for Slovenia 2019 ES, regional stratification was done across two regions: Eastern Slovenia (NUTS code SI03) and Western Slovenia (SI04).

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    Questionnaires have common questions (core module) and, respectively, additional manufacturing- and services-specific questions. The eligible manufacturing industries have been surveyed using the Manufacturing questionnaire (which includes the core module plus manufacturing-specific questions). Retail firms have been interviewed using the Services questionnaire (which includes the core module plus retail-specific questions), and the residual eligible services have been covered using the Services questionnaire (which includes the core module). Each variation of the questionnaire is identified by the index variable, a0.

    Response rate

    Survey non-response must be differentiated from item non-response. The former refers to refusals to participate in the survey altogether whereas the latter refers to the refusals to answer some specific questions. Enterprise Surveys suffer from both problems and different strategies were used to address these issues.

    Item non-response was addressed by two strategies: a- For sensitive questions that may generate negative reactions from the respondent, such as corruption or tax evasion, enumerators were instructed to collect the refusal to respond as (-8). b- Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary. However, there were clear cases of low response.

    For 2009 and 2013 Slovenia ES, the survey non-response was addressed by maximizing efforts to contact establishments that were initially selected for interview. Up to 4 attempts were made to contact the establishment for interview at different times/days of the week before a replacement establishment (with similar strata characteristics) was suggested for interview. Survey non-response did occur but substitutions were made in order to potentially achieve strata-specific goals. Further research is needed on survey non-response in the Enterprise Surveys regarding potential introduction of bias.

    For 2009, the number of contacted establishments per realized interview was 6.18. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The relatively low ratio of contacted establishments per realized interview (6.18) suggests that the main source of error in estimates in Slovenia may be selection bias and not frame inaccuracy.

    For 2013, the number of realized interviews per contacted establishment was 25%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The number of rejections per contact was 44%.

    Finally, for 2019, the number of interviews per contacted establishment was 9.7%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The share of rejections per contact was 75.2%.
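
    Note that the three waves report contact effort in different units (contacts per realized interview for 2009; realized interviews per contact for 2013 and 2019). A trivial conversion puts them on a common footing:

    # Convert the reported figures to a common metric:
    # contacted establishments per realized interview.
    ratios = {
        2009: 6.18,        # reported directly as contacts per interview
        2013: 1 / 0.25,    # 25% interviews per contact  -> 4.0
        2019: 1 / 0.097,   # 9.7% interviews per contact -> ~10.3
    }
    for year, ratio in sorted(ratios.items()):
        print(year, round(ratio, 1), "contacts per realized interview")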

  6. Residential Existing Homes (One to Four Units) Energy Efficiency Projects...

    • catalog.data.gov
    • data.ny.gov
    • +2 more
    Updated Jul 26, 2025
    Cite
    data.ny.gov (2025). Residential Existing Homes (One to Four Units) Energy Efficiency Projects with Income-based Incentives by Customer Type: Beginning 2010 [Dataset]. https://catalog.data.gov/dataset/residential-existing-homes-one-to-four-units-energy-efficiency-projects-with-income-based-
    Explore at:
    Dataset updated
    Jul 26, 2025
    Dataset provided by
    data.ny.gov
    Description

    IMPORTANT! PLEASE READ DISCLAIMER BEFORE USING DATA.

    The Residential Existing Homes Program is a market transformation program that uses Building Performance Institute (BPI) Goldstar contractors to install comprehensive energy-efficient improvements. The program is designed to use building science and a whole-house approach to reduce energy use in the State’s existing one-to-four family and low-rise multifamily residential buildings and to capture heating fuel and electricity-related savings. The Program provides income-based incentives, including an assisted subsidy for households with income up to 80% of the State or Median County Income, whichever is higher, to install eligible energy efficiency improvements including building shell measures, high-efficiency heating and cooling measures, and ENERGY STAR appliances and lighting.

    DISCLAIMER: Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, and First Year Energy Savings $ Estimate represent contractor-reported savings derived from energy modeling software calculations, not actual realized energy savings. The accuracy of the Estimated Annual kWh Savings and Estimated Annual MMBtu Savings for projects has been evaluated by an independent third party. The results of the impact analysis indicate that, on average, actual savings amount to 35 percent of the Estimated Annual kWh Savings and 65 percent of the Estimated Annual MMBtu Savings. The analysis did not evaluate every single project, but rather a sample of projects from 2007 and 2008, so the results are applicable to the population on average but not necessarily to any individual project, which could have over- or under-achieved in comparison to the evaluated savings. The results from the impact analysis will be updated when more recent information is available. Many factors influence the degree to which estimated savings are realized, including proper calibration of the savings model and the savings algorithms used in the modeling software. Some reasons individual households may realize savings different from those projected include, but are not limited to, changes in the number or needs of household members, changes in occupancy schedules, changes in energy usage behaviors, changes to appliances and electronics installed in the home, and beginning or ending a home business. Beginning November 2017, the Program requires the use of HPXML-compliant modeling software tools, and data quality protocols have been implemented to more accurately project savings. For more information, please refer to the Evaluation Report published on NYSERDA’s website at: http://www.nyserda.ny.gov/-/media/Files/Publications/PPSER/Program-Evaluation/2012ContractorReports/2012-HPwES-Impact-Report-with-Appendices.pdf.

    The New York Residential Existing Homes (One to Four Units) dataset includes the following data points for projects completed during Green Jobs Green-NY, beginning November 15, 2010: Home Performance Project ID, Home Performance Site ID, Project County, Project City, Project Zip, Gas Utility, Electric Utility, Project Completion Date, Customer Type, Low-Rise or Home Performance Indicator, Total Project Cost (USD), Total Incentives (USD), Type of Program Financing, Amount Financed Through Program (USD), Pre-Retrofit Home Heating Fuel Type, Year Home Built, Size of Home, Volume of Home, Number of Units, Measure Type, Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, First Year Energy Savings $ Estimate (USD), and Homeowner Received Green Jobs-Green NY Free/Reduced Cost Audit (Y/N).

    How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov.
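
    Since the disclaimer gives population-average realization rates (35 percent of estimated kWh and 65 percent of estimated MMBtu), a user of this dataset might apply them as an aggregate adjustment. A sketch follows, with column names taken from the data-point list above; the exact spelling in the published file may differ:

    import pandas as pd

    # File name is a placeholder; column names follow the data-point list above.
    df = pd.read_csv("residential_existing_homes.csv")

    # Population-average realization rates from the impact evaluation cited
    # in the disclaimer; they apply on average, not to any individual project.
    KWH_REALIZATION = 0.35
    MMBTU_REALIZATION = 0.65

    df["Adjusted Annual kWh Savings"] = (
        df["Estimated Annual kWh Savings"] * KWH_REALIZATION)
    df["Adjusted Annual MMBtu Savings"] = (
        df["Estimated Annual MMBtu Savings"] * MMBTU_REALIZATION)
    print(df[["Adjusted Annual kWh Savings",
              "Adjusted Annual MMBtu Savings"]].sum())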

  7. Enterprise-Driven Open Source Software

    • data.niaid.nih.gov
    • opendatalab.com
    Updated Apr 22, 2020
    Cite
    Kotti, Zoe (2020). Enterprise-Driven Open Source Software [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3653877
    Explore at:
    Dataset updated
    Apr 22, 2020
    Dataset provided by
    Kotti, Zoe
    Theodorou, Georgios
    Spinellis, Diomidis
    Kravvaritis, Konstantinos
    Louridas, Panos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns and to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.

    The main dataset is provided as a 17,264-record tab-separated file named enterprise_projects.txt with the following 29 fields:

    url: the project's GitHub URL

    project_id: the project's GHTorrent identifier

    sdtc: true if selected using the same domain top committers heuristic (9,016 records)

    mcpc: true if selected using the multiple committers from a probable company heuristic (8,314 records)

    mcve: true if selected using the multiple committers from a valid enterprise heuristic (8,015 records)

    star_number: number of GitHub watchers

    commit_count: number of commits

    files: number of files in current main branch

    lines: corresponding number of lines in text files

    pull_requests: number of pull requests

    github_repo_creation: timestamp of the GitHub repository creation

    earliest_commit: timestamp of the earliest commit

    most_recent_commit: date of the most recent commit

    committer_count: number of different committers

    author_count: number of different authors

    dominant_domain: the project's dominant email domain

    dominant_domain_committer_commits: number of commits made by committers whose email matches the project's dominant domain

    dominant_domain_author_commits: corresponding number for commit authors

    dominant_domain_committers: number of committers whose email matches the project's dominant domain

    dominant_domain_authors: corresponding number for commit authors

    cik: SEC's EDGAR "central index key"

    fg500: true if this is a Fortune Global 500 company (2,233 records)

    sec10k: true if the company files SEC 10-K forms (4,180 records)

    sec20f: true if the company files SEC 20-F forms (429 records)

    project_name: GitHub project name

    owner_login: GitHub project's owner login

    company_name: company name as derived from the SEC and Fortune 500 data

    owner_company: GitHub project's owner company name

    license: SPDX license identifier

    The file cohost_project_details.txt provides the full set of 311,223 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.

    url: the project's GitHub URL

    project_id: the project's GHTorrent identifier

    stars: number of GitHub watchers

    commit_count: number of commits
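
    As a quick-start illustration, the main file can be loaded with pandas; this is a sketch under the assumption that the file parses with pandas defaults (boolean columns may come through as strings depending on the export):

    import pandas as pd

    # The main file is tab-separated with the 29 fields listed above.
    projects = pd.read_csv("enterprise_projects.txt", sep="\t")

    # Example: Fortune Global 500 projects, sorted by GitHub watchers.
    # Depending on how booleans were serialized, a string comparison
    # (e.g. == "true") may be needed instead.
    fg500 = projects[projects["fg500"] == True]
    cols = ["url", "dominant_domain", "star_number"]
    print(fg500.sort_values("star_number", ascending=False)[cols].head())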

  8. Airline Dataset

    • kaggle.com
    Updated Sep 26, 2023
    Cite
    Sourav Banerjee (2023). Airline Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sourav Banerjee
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Airline data holds immense importance as it offers insights into the functioning and efficiency of the aviation industry. It provides valuable information about flight routes, schedules, passenger demographics, and preferences, which airlines can leverage to optimize their operations and enhance customer experiences. By analyzing data on delays, cancellations, and on-time performance, airlines can identify trends and implement strategies to improve punctuality and mitigate disruptions. Moreover, regulatory bodies and policymakers rely on this data to ensure safety standards, enforce regulations, and make informed decisions regarding aviation policies. Researchers and analysts use airline data to study market trends, assess environmental impacts, and develop strategies for sustainable growth within the industry. In essence, airline data serves as a foundation for informed decision-making, operational efficiency, and the overall advancement of the aviation sector.

    Content

    This dataset comprises diverse parameters relating to airline operations on a global scale. The dataset prominently incorporates fields such as Passenger ID, First Name, Last Name, Gender, Age, Nationality, Airport Name, Airport Country Code, Country Name, Airport Continent, Continents, Departure Date, Arrival Airport, Pilot Name, and Flight Status. These columns collectively provide comprehensive insights into passenger demographics, travel details, flight routes, crew information, and flight statuses. Researchers and industry experts can leverage this dataset to analyze trends in passenger behavior, optimize travel experiences, evaluate pilot performance, and enhance overall flight operations.

    Dataset Glossary (Column-wise)

    • Passenger ID - Unique identifier for each passenger
    • First Name - First name of the passenger
    • Last Name - Last name of the passenger
    • Gender - Gender of the passenger
    • Age - Age of the passenger
    • Nationality - Nationality of the passenger
    • Airport Name - Name of the airport where the passenger boarded
    • Airport Country Code - Country code of the airport's location
    • Country Name - Name of the country the airport is located in
    • Airport Continent - Continent where the airport is situated
    • Continents - Continents involved in the flight route
    • Departure Date - Date when the flight departed
    • Arrival Airport - Destination airport of the flight
    • Pilot Name - Name of the pilot operating the flight
    • Flight Status - Current status of the flight (e.g., on-time, delayed, canceled)
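
    As a brief illustration of working with these columns (the file name below is a placeholder for the Kaggle download):

    import pandas as pd

    # File name is a placeholder for the CSV downloaded from Kaggle.
    flights = pd.read_csv("airline_dataset.csv", parse_dates=["Departure Date"])

    # Share of flights by status (on-time, delayed, cancelled).
    print(flights["Flight Status"].value_counts(normalize=True))

    # Age distribution of passengers by nationality, using the glossary columns.
    print(flights.groupby("Nationality")["Age"].describe().head())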

    Structure of the Dataset

    [Figure: preview image of the dataset structure]

    Acknowledgement

    The dataset provided here is a simulated example and was generated using the online platform Mockaroo. This web-based tool enables the creation of customizable synthetic datasets that closely resemble real data. It is primarily intended for use by developers, testers, and data experts who require sample data for a range of uses, including testing databases, filling applications with demonstration data, and crafting lifelike illustrations for presentations and tutorials. To explore further details, you can visit their website.

    Cover Photo by: Kevin Woblick on Unsplash

    Thumbnail by: Airplane icons created by Freepik - Flaticon

  9. Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

    • datarade.ai
    .json, .csv
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Explore at:
    Available download formats: .json, .csv
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    United Kingdom, Norway, India, Sint Maarten (Dutch part), Cook Islands, Oman, Western Sahara, Dominican Republic, Barbados, Jordan
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage:

    • A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies.
    • Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models:

    • Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights.
    • Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality:

    • Fully GDPR and CCPA compliant, providing secure and ethically sourced data.
    • Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready:

    • Pre-structured and formatted datasets that are easily ingestible into AI workflows.
    • Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

    How Is the Data Sourced?

    • Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques.
    • Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets.

    This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering

    Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum?

    • Experience and Expertise: A trusted name in structured web data with a proven track record.
    • Flexibility: Datasets can be tailored for any AI/ML application.
    • Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data.
    • Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  10. Cohort Study of Recently Formed Northern Businesses, 1993-1995 - Dataset -...

    • b2find.eudat.eu
    Updated Oct 21, 2023
    Cite
    (2023). Cohort Study of Recently Formed Northern Businesses, 1993-1995 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/09bd53d2-d73b-5622-b83c-d47fa1cf1150
    Explore at:
    Dataset updated
    Oct 21, 2023
    Description

    Abstract copyright UK Data Service and data collection copyright owner.

    The aims of the study are: (i) to identify the key characteristics of (a) a sample of newly formed Northern businesses and (b) the founders of those businesses; (ii) to track the perceptions founders have, at different points during the first two years of life, of likely business prospects over a range of time horizons; (iii) to track the actual development of the businesses in their first two years in such a way that the perceptions identified in (ii) can be compared with what actually happened; (iv) to analyse the relationship between the key characteristics of the sample businesses and their founders identified in (i) and (a) the perceptions identified in (ii), (b) the actual development identified in (iii), and (c) the gap between perceptions and outcomes; (v) to analyse the determinants of changes in the relationship between perceptions and actual outcomes.

    Main Topics: The key characteristics of the respondents and their businesses were analysed. Information was provided on such variables as sectoral breakdown, employment growth, markets, competition, difficulties, challenges, financing and assistance from external agencies. Questions relating to respondents focused on educational, social and economic background, including work experience. The project also looked at how small businesses change and develop over time. At each interview stage, respondents' perceptions of future business prospects over different time horizons were elicited. These periods coincided with planned subsequent interviews, which would facilitate comparison of forecasts with actual outcomes. Although the study mainly addressed the issues of survival, employment, turnover and product mix, respondents were also asked about likely future changes in a number of other aspects of business activity and organisation. Various investigative techniques such as regression, logit analysis and some parametric and non-parametric tests were used to analyse the dataset. The dataset also provided the means to examine whether the survey firms improved their forecasting ability between the first and second six-month periods. Some insight was also provided into the nature of VAT data on registrations and deregistrations. In particular, the opportunity was provided to examine, firstly, the extent to which registrants are involved in setting up entirely new businesses and, secondly, the relationship between the date of registration and the start of trading. The depositor states that the results suggested some caution should be exercised in the use of VAT registration statistics in the analysis of firm births.

    Sampling: volunteer sample; see documentation for further information. Mode of data collection: the first and third interviews were conducted face-to-face, and the second by telephone.

  11. Transition to Clean Energy Enterprise Survey- Jordan, TCEESJ_2023 - Jordan

    • erfdataportal.com
    Updated Mar 31, 2024
    Cite
    Economics Research Forum (2024). Transition to Clean Energy Enterprise Survey- Jordan, TCEESJ_2023 - Jordan [Dataset]. https://www.erfdataportal.com/index.php/catalog/287
    Explore at:
    Dataset updated
    Mar 31, 2024
    Dataset authored and provided by
    Economics Research Forum
    Time period covered
    2023
    Area covered
    Jordan
    Description

    Abstract

    The MENA region faces heightened climate challenges and growing energy issues, especially for energy-importing countries. The transition to clean energy in MENA is crucial, and the region holds inherent comparative advantages due to its natural resources, such as high solar radiation and strong wind nodes. This dataset, collected in one round, includes various company-specific details: sector categorization, employee count, regulatory compliance, experiences with grid-based electricity, and the extent of the clean energy transition among enterprises in Jordan. The data was collected through a comprehensive cross-sectional survey conducted from September to November 2023, exploring how Micro, Small, and Medium Enterprises (MSMEs) in Jordan are transitioning to clean energy. The survey is part of the activities under the newly launched ERF project, 'The Role of MSMEs in Fostering Inclusive and Equitable Economic Growth in the Context of the Clean Energy Transition in MENA,' funded by IDRC. The project involves a series of quantitative national surveys in five targeted countries: Egypt, Jordan, Morocco, Lebanon, and Tunisia. The initiative aims to gather crucial data reflecting the ongoing energy transition in these countries. The survey data's objective is to enhance knowledge and contribute to strategic policy initiatives, with the goal of promoting sustainable, efficient, and equitable energy management. This includes addressing emission mitigation, ensuring energy security, and promoting equity. All these Surveys on Transitions to Clean Energy in MENA Enterprises follow relatively comparable designs, collecting data on enterprises within the Arab countries (Egypt, Jordan, Morocco, Tunisia, and Lebanon). This harmonization is intended to create comparable data, facilitating cross-country and comparative research among the five Arab countries.

    Geographic coverage

    National

    Analysis unit

    Enterprises

    Universe

    The target population is the non-governmental micro, small, and medium enterprises that commenced business operations before 2023.

    Kind of data

    Sample Survey Data [ssd]

    Sampling procedure

    The target population of the survey was businesses with fewer than 100 employees that started business operations before 2023. For the sampling frame, we used data from Kinz, a Jordanian corporate data-mining website that grants access to subscribers. We had access to the complete list of about 82,000 businesses from 15 broad business sectors, presented on about 8,206 pages (10 businesses per page). We selected the sample directly from the Kinz web pages: a random sample of pages within each business sector was selected for the survey, and all businesses on the selected pages were contacted. A stratified sample of 5,884 businesses was selected, stratified according to the 15 business sectors. The sample was selected in two stages: in the first stage, 600 pages were selected from the Kinz website, distributed proportionally to the distribution of all pages by business sector; in the second stage, all businesses on the selected pages were contacted.
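
    As an illustration of this two-stage design, a minimal Python sketch follows; the per-sector page counts and identifiers are hypothetical stand-ins, not part of the survey documentation.

    import random

    random.seed(42)

    # Hypothetical frame: page IDs grouped into 15 business sectors,
    # ~8,206 pages overall, 10 businesses per page (as in the Kinz frame).
    pages_by_sector = {f"sector_{s}": [f"s{s}_p{p}" for p in range(547)]
                       for s in range(15)}
    total_pages = sum(len(p) for p in pages_by_sector.values())
    PAGES_TO_SELECT = 600

    sample = []
    for sector, pages in pages_by_sector.items():
        # Stage 1: allocate the 600 pages proportionally to each sector's
        # share of the frame, then draw pages at random within the sector.
        n_pages = round(PAGES_TO_SELECT * len(pages) / total_pages)
        for page in random.sample(pages, n_pages):
            # Stage 2: every business on a selected page is contacted.
            sample.extend(f"{page}_biz{i}" for i in range(10))

    print(len(sample))  # ~6,000 contacts here; the real frame yielded 5,884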

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    Note: The questionnaire can be seen in the documentation materials tab.

    Response rate

    The response rate is 19.9%, after excluding phones that were not in service and firms that were not eligible from the denominator.
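
    To make the adjusted-rate arithmetic concrete, here is a small sketch; apart from the 5,884 sampled businesses and the 19.9% result, every count below is a hypothetical placeholder.

    # Hypothetical counts chosen only to illustrate the calculation; the survey
    # documentation reports the sample size (5,884) and the rate (19.9%).
    selected = 5884
    not_in_service = 500   # phones not in service (assumed)
    ineligible = 350       # firms outside the target population (assumed)
    completes = 1002       # completed interviews (assumed)

    eligible = selected - not_in_service - ineligible
    print(f"Response rate: {completes / eligible:.1%}")  # -> 19.9%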

  12. Symbol Detection Dataset

    • universe.roboflow.com
    zip
    Updated Jul 25, 2022
    Cite
    symbols (2022). Symbol Detection Dataset [Dataset]. https://universe.roboflow.com/symbols/symbol-detection-xdwl1/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 25, 2022
    Dataset authored and provided by
    symbols
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Characters Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Educational Applications: The "symbol detection" model could be used to create interactive educational software or games. For instance, it can help children identify different letters or symbols: children could photograph symbols they encounter, and the app could help them read or learn about each one.

    2. Assistive Technology: This model could be integrated into assistive technologies for visually impaired individuals, reading detected symbols aloud to help users understand the written text around them.

    3. Document Analysis: It can be applied to document analysis tasks. For example, in the context of business document management, the model could help categorize documents based on the identified characters or symbols, making search and retrieval more efficient.

    4. Auto Translation: The model can be employed in a real-time translation application to identify characters or symbols before translating them into the desired language. For instance, it could translate signs or words in images into another language, helping travelers in foreign countries understand written signs.

    5. Coding Assistance: The model may find use in software development or coding education platforms. It can be used to recognize and verify the syntax and symbols used in different programming languages, thereby offering automated suggestions or corrections.
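
    Any of these use cases starts with running inference against the hosted model. Below is a minimal sketch using the roboflow Python client, assuming a valid API key and a local test image; the workspace, project, and version are read off the citation URL above.

    from roboflow import Roboflow

    # Workspace/project/version come from the dataset URL
    # (universe.roboflow.com/symbols/symbol-detection-xdwl1, version 1).
    rf = Roboflow(api_key="YOUR_API_KEY")  # assumed: a valid Roboflow API key
    model = rf.workspace("symbols").project("symbol-detection-xdwl1").version(1).model

    # Detect character/symbol bounding boxes in a local image (hypothetical path).
    prediction = model.predict("page_scan.jpg", confidence=40, overlap=30)
    print(prediction.json())                    # boxes, class labels, confidences
    prediction.save("page_scan_annotated.jpg")  # visualization of the detections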

  13. Financial Risk Dataset

    • kaggle.com
    Updated Feb 21, 2025
    Cite
    Berker ERYILMAZ (2025). Financial Risk Dataset [Dataset]. https://www.kaggle.com/datasets/berkereryilmaz/financial-risk-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Berker ERYILMAZ
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 1,000 financial records with five key features and one target variable, Loan Default Risk. It is designed for credit risk analysis, helping to predict whether a customer is likely to default on a loan based on financial attributes.

    • Income: The individual's annual income.
    • Credit Score: A credit rating score ranging from 300 to 850, where higher values indicate better creditworthiness.
    • Spending Score: A normalized score between 0 and 100, representing the individual's spending habits.
    • Transaction Count: The number of transactions made by the individual in a given period.
    • Savings Ratio: The ratio of savings to income, ranging from 0 to 1.
    • Loan Default Risk (target): 0 = low risk (likely to repay the loan); 1 = high risk (likely to default on the loan).

    Feel free to use this dataset for research, projects, or educational purposes. If you use it in a publication, kindly provide attribution.

    This dataset was synthetically generated. The features were adjusted to resemble real-world financial data, but they do not represent actual individuals or real financial records.
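
    As a starting point for credit risk analysis, a minimal scikit-learn sketch follows; the CSV file name is an assumption, and the column names are taken from the feature list above (verify both against the Kaggle file listing).

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Hypothetical file name; columns assumed to match the description above.
    df = pd.read_csv("financial_risk.csv")
    X = df[["Income", "Credit Score", "Spending Score",
            "Transaction Count", "Savings Ratio"]]
    y = df["Loan Default Risk"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))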

  14. Ui Detector Dataset

    • universe.roboflow.com
    zip
    Updated May 29, 2023
    Cite
    Tunis Business School (2023). Ui Detector Dataset [Dataset]. https://universe.roboflow.com/tunis-business-school-t3rrg/ui-detector-eulxp/model/2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 29, 2023
    Dataset authored and provided by
    Tunis Business School
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    UI Components Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Automated User Interface Testing: The UI-Detector model can be used to identify all components of a software/website's user interface during automated testing. For example, the model can identify the existence, placement, and functionality of key UI components like buttons, text fields, and checkboxes.

    2. Accessibility Assessment: The model can assist in evaluating and improving the accessibility of a web or software UI by identifying all of its components and assessing how easy they are to use, especially for users with special needs.

    3. Interactive Documentation Creation: The UI-Detector model can automatically generate interactive documentation for software by taking screenshots of the application and identifying each UI component, creating a more engaging and intuitive learning experience for new users.

    4. User Interface Design Analysis: Designers can use the model to automatically analyze a UI design, identifying problematic components such as misplaced buttons or missing forms and providing suggestions for improvements.

    5. Competitor Analysis: The model can be utilized to evaluate competitors' user interfaces, identifying distinctive features that can serve as references for improving one's own UI designs.

  15. Illegal online trade of invasive plants in Australia

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Jul 20, 2023
    Cite
    Jacob Maher; Oliver Stringham; Stephanie Moncayo; Lisa Wood; Charlotte Lassaline; John Virtue; Phill Cassey (2023). Illegal online trade of invasive plants in Australia [Dataset]. http://doi.org/10.6084/m9.figshare.22493944.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jacob Maher; Oliver Stringham; Stephanie Moncayo; Lisa Wood; Charlotte Lassaline; John Virtue; Phill Cassey
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Australia
    Description

    The project contains data and code used in our project investigating the illegal online trade of invasive plants in Australia. The dataset titled 'plant_trade_dataset.csv' contains cleaned and consolidated detection records from online advertisements for plants. The dataset titled 'weed_trade_dataset.csv' is a subset of 'plant_trade_dataset.csv' containing records of advertised plants that are prohibited to trade in at least one Australian jurisdiction. The advertisements were collected from an Australian online classifieds website which we have purposefully kept confidential in accordance with the ethics approval of this project. As such, all information that could identify a user of the website has been removed. However, this dataset still provides all the information needed to replicate the analysis of this study.

    We have provided R scripts which can be used to replicate the results of the study. ‘string_matching.R’ provides an example of how string matching was used to detect desired advertisements; this code uses ‘faux_web_scraped_data.csv’, ‘faux_plant_terms.csv’ and ‘incoreect_term_matches.csv’ to demonstrate how it functions. ‘quantity_price_permutation_code.R’ is the code used for analysing the effect of trade prohibition on quantity and price, along with the formatted datasets ‘qty_law_comp.csv’ for quantity and ‘price_law_comp’ for price. The remaining analysis (i.e., species accumulation, trade quantity, and use of traded taxa) was performed with code from the ‘analysis_code.R’ script.
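
    For readers who prefer Python, the string-matching step can be sketched as follows; the project's own implementation is the R script ‘string_matching.R’, and the column names used here are assumptions about the faux demonstration files.

    import pandas as pd

    # Column names are assumed; the authoritative logic is in string_matching.R.
    ads = pd.read_csv("faux_web_scraped_data.csv")   # assumed column: ad_text
    terms = pd.read_csv("faux_plant_terms.csv")      # assumed column: term
    bad = set(pd.read_csv("incoreect_term_matches.csv")["term"])  # known false hits (assumed column)

    matches = []
    for _, ad in ads.iterrows():
        text = str(ad["ad_text"]).lower()
        for term in terms["term"]:
            # Flag the ad when a plant term occurs in its text, skipping terms
            # on the curated list of known incorrect matches.
            if term.lower() in text and term not in bad:
                matches.append({"ad_text": ad["ad_text"], "term": term})

    pd.DataFrame(matches).to_csv("detected_plant_ads.csv", index=False)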

  16. Hydro Flow Metrics Historical

    • usfs.hub.arcgis.com
    Updated Jan 6, 2020
    Cite
    U.S. Forest Service (2020). Hydro Flow Metrics Historical [Dataset]. https://usfs.hub.arcgis.com/maps/048d23eff1ee409c995f85698a6ae65a
    Explore at:
    Dataset updated
    Jan 6, 2020
    Dataset provided by
    U.S. Department of Agriculture Forest Service (http://fs.fed.us/)
    Authors
    U.S. Forest Service
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Description

    A map service depicting modeled streamflow metrics for the historical period (1977-2006) in the United States. In addition to standard NHD attributes, the streamflow datasets include metrics on mean daily flow (annual and seasonal); flood levels associated with 1.5-year, 10-year, and 25-year floods; annual and decadal minimum weekly flows and the date of minimum weekly flow; center of flow mass date; baseflow index; and average number of winter floods. These files and additional information are available on the project website, https://www.fs.usda.gov/rm/boise/AWAE/projects/modeled_stream_flow_metrics.shtml. Streams without flow metrics (null values) were removed from this dataset to improve display speed; to see all stream lines, use an NHD flowline dataset.

    The flow regime is of fundamental importance in determining the physical and ecological characteristics of a river or stream, but actual flow measurements are available for only a small minority of stream segments, mostly on large rivers. Flows for all other streams must be extrapolated or modeled, and modeling is also necessary to estimate flow regimes under future climate conditions. Climate data such as this dataset are valuable for planning and monitoring purposes. Business use cases include: climate change and water rights assessments; analysis of water availability, runoff, groundwater, and impacts to aquatic organisms; resource management; post-fire recovery; restoration activities; etc.

    Hydro flow metrics data can be downloaded from here.

    This feature layer contains a series of fields from the NHD, including the COMID, which provides a unique identifier for each NHD stream segment, as well as other basic hydrological information. It also contains the Region field, which indicates the NHD region (2-digit hydrologic unit codes) or a subdivision of regions based on NHDPlus production units (https://www.horizon-systems.com/NHDPlus/). Production units are designated by letters appended to the region code, such as 10U (the upper Missouri River basin). Additional documentation about this dataset is located in the data user guide. A StoryMap including a map viewer and map exporter by forest/region is also available. Additional climate and streamflow products from the Office of Sustainability and Climate are available in our Climate Gallery.

    This dataset contains the following data layers:

    • Mean annual flow: calculated as the mean of the yearly discharge values.
    • Mean spring flow: calculated as the mean of the March/April/May discharge values, weighted by the number of days per month.
    • Mean summer flow: calculated as the mean of the June/July/August discharge values, weighted by the number of days per month.
    • Mean autumn flow: calculated as the mean of the September/October/November discharge values, weighted by the number of days per month.
    • Mean winter flow: calculated as the mean of the December/January/February discharge values, weighted by the number of days per month.
    • 1.5-year flood: calculated by first finding the greatest daily flow from each year; the 33rd percentile of this annual maximum series defines the flow that occurs every 1.5 years, on average.
    • 10-year flood: the flow that occurs every 10 years, on average, calculated as the 90th percentile of the annual maximum series.
    • 25-year flood: the flow that occurs every 25 years, on average, calculated as the 96th percentile of the annual maximum series.
    • 1-year minimum weekly flow: the average across years of the lowest 7-day flow during each year. Year is defined either as January/December or June/May, whichever has the lower standard deviation in the date of the low-flow week, so that, for example, in areas with winter droughts a December-to-January drought is not split up by the start of a new year.
    • 10-year minimum weekly flow: the average lowest 7-day flow during a decade (calculated as the 10th percentile of the annual minimum weekly flows).
    • Date of minimum weekly flow: the average date of the center of the lowest 7-day flow of the year, with 'year' defined as above. This prevents erroneous results when the drought season crosses the break between years: e.g., if the lowest flows fell on December 31 of one year (day #365) and January 1 of the next (day #1), a naive average would give day #183 (July 2nd); switching the range of months in this case prevents the error.
    • Baseflow index: the ratio of the average daily flow during the lowest 7-day flow of the year to the average daily flow during the year overall; this can be used as a rough estimate of the proportion of streamflow originating from groundwater discharge rather than from recent precipitation.
    • Center of flow mass/center of timing: the flow-weighted mean day of the water year, CFM = (flow_1*1 + flow_2*2 + ... + flow_365*365) / (flow_1 + flow_2 + ... + flow_365), where flow_i is the flow volume on day i. This can be used to indicate areas where most of the precipitation occurs early in the water year (fall) or later (spring/summer).
    • Number of winter floods: calculated as the average number of daily flows between December 1 and March 31 that exceed the 95th percentile of daily flows across the entire year.
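
    To make the metric definitions concrete, here is a sketch computing three of them from a synthetic daily discharge series; the published dataset provides precomputed values per NHD segment, so nothing below is the Forest Service's own code.

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic daily flows: 30 years x 365 days, for illustration only.
    flows = rng.gamma(shape=2.0, scale=50.0, size=(30, 365))

    # 1.5-year flood: 33rd percentile of the annual maximum series.
    flood_1_5yr = np.percentile(flows.max(axis=1), 33)

    # Baseflow index: lowest 7-day mean flow divided by the annual mean flow.
    def min_weekly(year):
        return np.convolve(year, np.ones(7) / 7, mode="valid").min()

    bfi = np.mean([min_weekly(y) / y.mean() for y in flows])

    # Center of flow mass: flow-weighted mean day of the water year.
    days = np.arange(1, 366)
    cfm = np.mean([(days * y).sum() / y.sum() for y in flows])

    print(f"1.5-yr flood: {flood_1_5yr:.1f}, BFI: {bfi:.2f}, CFM: day {cfm:.0f}")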

  17. Law, Finance and Development Indices, 1970-2005 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Oct 23, 2023
    Cite
    (2023). Law, Finance and Development Indices, 1970-2005 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/626d4fa5-400d-5bb3-bb46-1bfcd1e71aa0
    Explore at:
    Dataset updated
    Oct 23, 2023
    Description

    Abstract copyright UK Data Service and data collection copyright owner. This study examined the links between legal systems and economic development, focusing on the relationship between law and finance. New datasets were created, charting legal change over time in the areas of shareholder protection, creditor protection and labour regulation. Indices with up to 60 indicators were used to code for the law of five significant countries (France, Germany, India, the United Kingdom and the United States of America) for 36 years (1970-2005), and reduced-form indices of 10-12 indicators to code for a wider sample (25 countries) for the period 1995-2005. The coding methods used marked an advance on previous studies, by incorporating a wider range of legal and regulatory variables and taking into account the different ways in which regulatory rules can be expressed (as mandatory rules or as ‘defaults’ applying in the absence of contrary agreement). Time-series and panel data econometric analysis were used to test for correlations between the scores in the indices and economic performance variables. Further information can be found on the Centre for Business Research project web page and the ESRC Award web page.
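
    The econometric setup described can be illustrated with a two-way fixed-effects regression in statsmodels; the file and column names below are hypothetical stand-ins for the index scores and performance variables.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format panel: one row per country-year, with an index
    # score (e.g. shareholder protection) and an outcome variable.
    df = pd.read_csv("law_finance_panel.csv")  # assumed columns: country, year,
                                               # shareholder_protection, stockmarket_cap

    # Outcome regressed on the legal index with country and year fixed effects,
    # clustering standard errors by country.
    model = smf.ols(
        "stockmarket_cap ~ shareholder_protection + C(country) + C(year)",
        data=df,
    ).fit(cov_type="cluster", cov_kwds={"groups": df["country"]})
    print(model.summary())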

  18. Google Map Data, Google Map Data Scraper, Business location Data- Scrape All Publicly Available Data From Google Map & Other Platforms

    • datarade.ai
    Updated May 23, 2022
    Cite
    APISCRAPY (2022). Google Map Data, Google Map Data Scraper, Business location Data- Scrape All Publicly Available Data From Google Map & Other Platforms [Dataset]. https://datarade.ai/data-products/google-map-data-google-map-data-scraper-business-location-d-apiscrapy
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    May 23, 2022
    Dataset authored and provided by
    APISCRAPY
    Area covered
    Switzerland, Svalbard and Jan Mayen, Albania, Japan, Macedonia (the former Yugoslav Republic of), Bulgaria, United States of America, Serbia, Denmark, Gibraltar
    Description

    APISCRAPY is your premier provider of Map Data solutions. Map Data encompasses various information related to geographic locations, including Google Map Data, Location Data, Address Data, and Business Location Data. Our advanced Google Map Data Scraper sets us apart by extracting comprehensive and accurate data from Google Maps and other platforms.

    What sets APISCRAPY's Map Data apart are its key benefits:

    1. Accuracy: Our scraping technology ensures the highest level of accuracy, providing reliable data for informed decision-making. We employ advanced algorithms to filter out irrelevant or outdated information, ensuring that you receive only the most relevant and up-to-date data.

    2. Accessibility: With our data readily available through APIs, integration into existing systems is seamless, saving time and resources. Our APIs are easy to use and well-documented, allowing for quick implementation into your workflows. Whether you're a developer building a custom application or a business analyst conducting market research, our APIs provide the flexibility and accessibility you need.

    3. Customization: We understand that every business has unique needs and requirements. That's why we offer tailored solutions to meet specific business needs. Whether you need data for a one-time project or ongoing monitoring, we can customize our services to suit your needs. Our team of experts is always available to provide support and guidance, ensuring that you get the most out of our Map Data solutions.

    Our Map Data solutions cater to various use cases:

    1. B2B Marketing: Gain insights into customer demographics and behavior for targeted advertising and personalized messaging. Identify potential customers based on their geographic location, interests, and purchasing behavior.

    2. Logistics Optimization: Utilize Location Data to optimize delivery routes and improve operational efficiency. Identify the most efficient routes based on factors such as traffic patterns, weather conditions, and delivery deadlines.

    3. Real Estate Development: Identify prime locations for new ventures using Business Location Data for market analysis. Analyze factors such as population density, income levels, and competition to identify opportunities for growth and expansion.

    4. Geospatial Analysis: Leverage Map Data for spatial analysis, urban planning, and environmental monitoring. Identify trends and patterns in geographic data to inform decision-making in areas such as land use planning, resource management, and disaster response.

    5. Retail Expansion: Determine optimal locations for new stores or franchises using Location Data and Address Data. Analyze factors such as foot traffic, proximity to competitors, and demographic characteristics to identify locations with the highest potential for success.

    6. Competitive Analysis: Analyze competitors' business locations and market presence for strategic planning. Identify areas of opportunity and potential threats to your business by analyzing competitors' geographic footprint, market share, and customer demographics.

    Experience the power of APISCRAPY's Map Data solutions today and unlock new opportunities for your business. With our accurate and accessible data, you can make informed decisions, drive growth, and stay ahead of the competition.

    [ Related tags: Map Data, Google Map Data, Google Map Data Scraper, B2B Marketing, Location Data, Map Data, Google Data, Location Data, Address Data, Business location data, map scraping data, Google map data extraction, Transport and Logistic Data, Mobile Location Data, Mobility Data, and IP Address Data, business listings APIs, map data, map datasets, map APIs, poi dataset, GPS, Location Intelligence, Retail Site Selection, Sentiment Analysis, Marketing Data Enrichment, Point of Interest (POI) Mapping]

  19. Football Players Data

    • kaggle.com
    Updated Nov 13, 2023
    Cite
    Masood Ahmed (2023). Football Players Data [Dataset]. http://doi.org/10.34740/kaggle/dsv/6960429
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Masood Ahmed
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Description:

    This comprehensive dataset offers detailed information on approximately 17,000 FIFA football players, meticulously scraped from SoFIFA.com.

    It encompasses a wide array of player-specific data points, including but not limited to player names, nationalities, clubs, player ratings, potential, positions, ages, and various skill attributes. This dataset is ideal for football enthusiasts, data analysts, and researchers seeking to conduct in-depth analysis, statistical studies, or machine learning projects related to football players' performance, characteristics, and career progressions.

    Features:

    • name: Name of the player.
    • full_name: Full name of the player.
    • birth_date: Date of birth of the player.
    • age: Age of the player.
    • height_cm: Player's height in centimeters.
    • weight_kgs: Player's weight in kilograms.
    • positions: Positions the player can play.
    • nationality: Player's nationality.
    • overall_rating: Overall rating of the player in FIFA.
    • potential: Potential rating of the player in FIFA.
    • value_euro: Market value of the player in euros.
    • wage_euro: Weekly wage of the player in euros.
    • preferred_foot: Player's preferred foot.
    • international_reputation(1-5): International reputation rating from 1 to 5.
    • weak_foot(1-5): Rating of the player's weaker foot from 1 to 5.
    • skill_moves(1-5): Skill moves rating from 1 to 5.
    • body_type: Player's body type.
    • release_clause_euro: Release clause of the player in euros.
    • national_team: National team of the player.
    • national_rating: Rating in the national team.
    • national_team_position: Position in the national team.
    • national_jersey_number: Jersey number in the national team.
    • crossing: Rating for crossing ability.
    • finishing: Rating for finishing ability.
    • heading_accuracy: Rating for heading accuracy.
    • short_passing: Rating for short passing ability.
    • volleys: Rating for volleys.
    • dribbling: Rating for dribbling.
    • curve: Rating for curve shots.
    • freekick_accuracy: Rating for free kick accuracy.
    • long_passing: Rating for long passing.
    • ball_control: Rating for ball control.
    • acceleration: Rating for acceleration.
    • sprint_speed: Rating for sprint speed.
    • agility: Rating for agility.
    • reactions: Rating for reactions.
    • balance: Rating for balance.
    • shot_power: Rating for shot power.
    • jumping: Rating for jumping.
    • stamina: Rating for stamina.
    • strength: Rating for strength.
    • long_shots: Rating for long shots.
    • aggression: Rating for aggression.
    • interceptions: Rating for interceptions.
    • positioning: Rating for positioning.
    • vision: Rating for vision.
    • penalties: Rating for penalties.
    • composure: Rating for composure.
    • marking: Rating for marking.
    • standing_tackle: Rating for standing tackle.
    • sliding_tackle: Rating for sliding tackle.

    Use Case:

    This dataset is ideal for data analysis, predictive modeling, and machine learning projects. It can be used for:

    • Player performance analysis and comparison.
    • Market value assessment and wage prediction.
    • Team composition and strategy planning.
    • Machine learning models to predict future player potential and career trajectories.
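
    A starting-point sketch for the first two use cases above, assuming the data is saved as 'fifa_players.csv' and that the column names match the feature list (both are assumptions to verify against the Kaggle file listing):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("fifa_players.csv")  # hypothetical file name

    # Performance comparison: largest gaps between potential and current rating.
    df["headroom"] = df["potential"] - df["overall_rating"]
    print(df.nlargest(10, "headroom")[["name", "age", "overall_rating", "potential"]])

    # Market value assessment: a simple baseline regressing value on ratings and age.
    feats = ["overall_rating", "potential", "age"]
    train = df.dropna(subset=feats + ["value_euro"])
    model = LinearRegression().fit(train[feats], train["value_euro"])
    print(dict(zip(feats, model.coef_.round(0))))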

    Note:

    Please ensure to adhere to the terms of service of SoFIFA.com and relevant data protection laws when using this dataset. The dataset is intended for educational and research purposes only and should not be used for commercial gains without proper authorization.

  20. Dynamic Pricing Dataset

    • kaggle.com
    Updated Jan 28, 2024
    Cite
    Möbius (2024). Dynamic Pricing Dataset [Dataset]. https://www.kaggle.com/datasets/arashnic/dynamic-pricing-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Möbius
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A ride-sharing company wants to implement a dynamic pricing strategy to optimize fares based on real-time market conditions. Currently, the company sets fares using only the expected ride duration. It aims to leverage data-driven techniques to analyze historical data and develop a predictive model that can dynamically adjust prices in response to changing factors.

    The dataset containing historical ride data has been provided. It includes features such as the number of riders, number of drivers, location category, customer loyalty status, number of past rides, average ratings, time of booking, vehicle type, expected ride duration, and historical cost of the rides.

    Your goal is to build a dynamic pricing model that incorporates the provided features to predict optimal fares for rides in real-time. The model must consider factors such as demand patterns and supply availability.

    [Image: dynamic pricing with machine learning (source: https://i0.wp.com/vitalflux.com/wp-content/uploads/2023/07/dynamic-pricing-machine-learning-strategies-examples.png)]

    Features:

    'Number_of_Riders', 'Number_of_Drivers', 'Location_Category', 'Customer_Loyalty_Status', 'Number_of_Past_Rides', 'Average_Ratings', 'Time_of_Booking', 'Vehicle_Type', 'Expected_Ride_Duration', 'Historical_Cost_of_Ride'
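
    A baseline for the stated goal can be sketched with scikit-learn: one-hot encode the categorical features, pass the numeric ones through, and fit a regressor on the historical fares. The CSV file name is an assumption; the column names are those listed above.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("dynamic_pricing.csv")  # hypothetical file name
    target = "Historical_Cost_of_Ride"
    categorical = ["Location_Category", "Customer_Loyalty_Status",
                   "Time_of_Booking", "Vehicle_Type"]
    # Number_of_Riders vs Number_of_Drivers captures the demand/supply balance.
    numeric = ["Number_of_Riders", "Number_of_Drivers", "Number_of_Past_Rides",
               "Average_Ratings", "Expected_Ride_Duration"]

    X_train, X_test, y_train, y_test = train_test_split(
        df[categorical + numeric], df[target], test_size=0.2, random_state=0)

    pipe = Pipeline([
        ("prep", ColumnTransformer(
            [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
            remainder="passthrough")),
        ("model", GradientBoostingRegressor(random_state=0)),
    ])
    pipe.fit(X_train, y_train)
    print(f"R^2 on held-out rides: {pipe.score(X_test, y_test):.3f}")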

    Some References:

    - Dynamic Pricing Explained: Machine Learning in Revenue Management and Pricing Optimization

    - Dynamic Pricing using Reinforcement Learning

    - Dynamic Pricing on E-commerce Platform with Deep Reinforcement Learning: A Field Experiment

    - Engineering Extreme Event Forecasting at Uber with Recurrent Neural Networks
