100+ datasets found
  1. Riga Data Science Club

    • kaggle.com
    zip
    Updated Mar 29, 2021
    Cite
    Dmitry Yemelyanov (2021). Riga Data Science Club [Dataset]. https://www.kaggle.com/datasets/dmitryyemelyanov/rigadsclub
    Explore at:
    zip (494849 bytes). Available download formats.
    Dataset updated
    Mar 29, 2021
    Authors
    Dmitry Yemelyanov
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Riga
    Description

    Context

    Riga Data Science Club is a non-profit organisation for sharing ideas and experience and building machine learning projects together. A data science community should know its own data, so this is a dataset about ourselves: our website analytics, social media activity, Slack statistics and even meetup transcriptions!

    Content

    The dataset is split into several folders by context:

    • linkedin - company page visitor, follower and post stats
    • slack - messaging and member activity
    • typeform - new member responses
    • website - website visitors by country, language, device, operating system, screen resolution
    • youtube - meetup transcriptions

    Inspiration

    Let's make Riga Data Science Club better! We expect this data to bring lots of insights on how to improve.

    "Know your c̶u̶s̶t̶o̶m̶e̶r̶ member"

    • Explore member interests by analysing sign-up survey (typeform) responses
    • Explore messaging patterns in Slack to understand how members are retained and when they are lost

    Social media intelligence

    • Define LinkedIn posting strategy based on historical engagement data
    • Define target user profile based on LinkedIn page attendance data

    Website

    • Define website localisation strategy based on data about visitor countries and languages
    • Define website responsive design strategy based on data about visitor devices, operating systems and screen resolutions

    Have some fun

    • NLP analysis of meetup transcriptions: word frequencies, question answering, something else?
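As a minimal sketch of the word-frequency idea for the meetup transcriptions, the snippet below counts tokens in a transcript string. The transcript text here is made up for illustration; the real text would come from the files in the dataset's youtube folder.

```python
import re
from collections import Counter

# Made-up stand-in for one meetup transcription from the youtube/ folder.
transcript = (
    "Welcome to Riga Data Science Club. Today we talk about machine "
    "learning and data. Machine learning needs data."
)

# Lowercase, keep alphabetic tokens, drop very short words, then count.
tokens = [t for t in re.findall(r"[a-z]+", transcript.lower()) if len(t) > 2]
freq = Counter(tokens)
top_words = freq.most_common(3)
```

The same `Counter` approach scales to full transcript files once they are read in and concatenated.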

  2. COVID-19 Combined Data-set with Improved Measurement Errors

    • data.mendeley.com
    Updated May 13, 2020
    Cite
    Afshin Ashofteh (2020). COVID-19 Combined Data-set with Improved Measurement Errors [Dataset]. http://doi.org/10.17632/nw5m4hs3jr.3
    Explore at:
    Dataset updated
    May 13, 2020
    Authors
    Afshin Ashofteh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Public health-related decision-making on policies aimed at controlling the COVID-19 pandemic outbreak depends on complex epidemiological models that are compelled to be robust and to use all relevant available data. This data article provides a new combined worldwide COVID-19 dataset obtained from official data sources, with improved systematic measurement errors and a dedicated dashboard for online data visualization and summary. The dataset adds new measures and attributes to the normal attributes of official data sources, such as daily mortality and fatality rates. We used comparative statistical analysis to evaluate the measurement errors of COVID-19 official data collections from the Chinese Center for Disease Control and Prevention (Chinese CDC), the World Health Organization (WHO) and the European Centre for Disease Prevention and Control (ECDC). The data were collected by using text mining techniques and reviewing PDF reports, metadata, and reference data.

    The combined dataset includes complete spatial data such as country area, international number of countries, Alpha-2 code, Alpha-3 code, latitude, longitude, and some additional attributes such as population. The improved dataset benefits from major corrections to the referenced datasets and official reports, such as adjustments to the reporting dates, which suffered from a one- to two-day lag; removal of negative values; detection of unreasonable changes in historical data in new reports; and corrections of systematic measurement errors, which have been increasing as the pandemic outbreak spreads and more countries contribute data to the official repositories. Additionally, the root mean square error of attributes in the paired comparison of datasets was used to identify the main data problems. The data for China is presented separately and in more detail; it has been extracted from the attached reports available on the main page of the CCDC website.
This dataset is a comprehensive and reliable source of worldwide COVID-19 data that can be used in epidemiological models assessing the magnitude and timeline for confirmed cases, long-term predictions of deaths or hospital utilization, the effects of quarantine, stay-at-home orders and other social distancing measures, the pandemic’s turning point or in economic and social impact analysis, helping to inform national and local authorities on how to implement an adaptive response approach to re-opening the economy, re-open schools, alleviate business and social distancing restrictions, design economic programs or allow sports events to resume.
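The paired-comparison RMSE mentioned above can be sketched as follows; the daily case counts are invented for illustration and do not come from the dataset.

```python
import math

# Hypothetical daily new-case counts for one country, as reported by two
# official sources (values are made up).
who_cases = [100, 120, 150, 170]
ecdc_cases = [100, 118, 155, 169]

# Root mean square error over the paired daily observations; large values
# would flag a systematic measurement difference between the two sources.
rmse = math.sqrt(
    sum((a - b) ** 2 for a, b in zip(who_cases, ecdc_cases)) / len(who_cases)
)
```

Running this comparison per attribute and per source pair is one way to reproduce the style of error analysis the article describes.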

  3. Data Science Jobs & Salaries 2024

    • kaggle.com
    zip
    Updated Apr 27, 2024
    Cite
    Fahad Rehman (2024). Data Science Jobs & Salaries 2024 [Dataset]. https://www.kaggle.com/datasets/fahadrehman07/data-science-jobs-and-salary-glassdoor
    Explore at:
    zip (2332449 bytes). Available download formats.
    Dataset updated
    Apr 27, 2024
    Authors
    Fahad Rehman
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The data in this dataset was extracted from Glassdoor, a job-posting website. It covers data science jobs, salaries and much more, offering a clear view of job opportunities. It is packed with essential details like job titles, estimated salaries, job descriptions, company ratings, and key company info such as location, size, and industry. Whether you're job hunting or researching, this dataset helps you understand the job market easily. Start exploring now to make smart career choices!

    Perfect for adding to your Kaggle notebooks, our dataset is a treasure trove for analyzing all kinds of job-related info. Whether you're curious about salary trends or want to find the best-rated companies, this dataset has you covered. It's great for beginners and experts alike, offering lots of chances to learn and discover. You can use it to predict things or find hidden patterns—there's so much you can do! So, get ready to explore the world of jobs with our easy-to-use dataset on Kaggle.

    Please upvote the dataset✨ if you like it; making a high-quality dataset takes time.🍒

    Columns in Dataset:

    1. **Job Title:** Title of the Job
    2. **Salary Estimate:** Estimated salary for the job that the company provides
    3. **Job Description:** The description of the job
    4. **Rating:** Rating of the company
    5. **Company Name:** Name of the Company
    6. **Location:** Location of the job
    7. **Headquarters:** Headquarters of the company
    8. **Size:** Number of employees in the company
    9. **Founded:** The year the company was founded
    10. **Type of ownership:** Ownership type, such as private, public, government, or non-profit
    11. **Industry:** The industry in which the company provides services (e.g., Aerospace, Energy)
    12. **Sector:** The kind of services the company provides within its industry (e.g., industry: Energy; sector: Oil & Gas)
    13. **Revenue:** Total revenue of the company
    14. **Competitors:** Company competitors
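A quick sketch of working with these columns, using two made-up rows; the `$80K-$120K (Glassdoor est.)` salary format is an assumption about how the Salary Estimate strings look, not guaranteed by the dataset.

```python
import re
import pandas as pd

# Hypothetical rows mimicking the columns listed above.
df = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Analyst"],
    "Salary Estimate": ["$80K-$120K (Glassdoor est.)",
                        "$55K-$75K (Glassdoor est.)"],
    "Rating": [4.2, 3.9],
})

def salary_midpoint_k(estimate: str) -> float:
    """Midpoint of a '$80K-$120K ...' range, in thousands of USD."""
    low, high = (int(x) for x in re.findall(r"\$(\d+)K", estimate))
    return (low + high) / 2

# Turn the free-text salary range into a numeric feature.
df["salary_mid_k"] = df["Salary Estimate"].map(salary_midpoint_k)
```

A numeric salary column like this is the usual first step before trend analysis or salary-prediction models on the real file.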
  4. Website Statistics - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Mar 12, 2024
    + more versions
    Cite
    ckan.publishing.service.gov.uk (2024). Website Statistics - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/website-statistics1
    Explore at:
    Dataset updated
    Mar 12, 2024
    Dataset provided by
    CKAN (https://ckan.org/)
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    This Website Statistics dataset has three resources showing usage of the Lincolnshire Open Data website. Web analytics terms used in each resource are defined in their accompanying Metadata file. Please note: due to a change in analytics platform and accompanying metrics, the current files do not contain a full year's data. The files will be updated again in January 2025 with 2024-2025 data. The previous dataset containing web analytics has been archived and can be found at the following link: https://lincolnshire.ckan.io/dataset/website-statistics-archived

    • Website Usage Statistics: a statistical summary of usage of the Lincolnshire Open Data site for the latest calendar year.
    • Website Statistics Summary: a website statistics summary for the Lincolnshire Open Data site for the latest calendar year.
    • Webpage Statistics: statistics for individual webpages on the Lincolnshire Open Data site by calendar year.

    Note: the resources above exclude API calls (automated requests for datasets). These Website Statistics resources are updated annually in February by the Lincolnshire County Council Open Data team.

  5. Company Datasets for Business Profiling

    • datarade.ai
    Updated Feb 23, 2017
    Cite
    Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
    Explore at:
    .json, .xml, .csv, .xlsAvailable download formats
    Dataset updated
    Feb 23, 2017
    Dataset provided by
    Oxy Labs
    Authors
    Oxylabs
    Area covered
    British Indian Ocean Territory, Moldova (Republic of), Northern Mariana Islands, Bangladesh, Isle of Man, Canada, Andorra, Taiwan, Nepal, Tunisia
    Description

    Company Datasets for valuable business insights!

    Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

    These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

    • Owler: Gain valuable business insights and competitive intelligence.
    • AngelList: Receive fresh startup data transformed into actionable insights.
    • CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies.
    • Craft.co: Make data-informed business decisions with Craft.co's company datasets.
    • Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

    We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

    • Company name;
    • Size;
    • Founding date;
    • Location;
    • Industry;
    • Revenue;
    • Employee count;
    • Competitors.

    You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

    Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

    With Oxylabs Datasets, you can count on:

    • Fresh and accurate data collected and parsed by our expert web scraping team.
    • Time and resource savings, allowing you to focus on data analysis and achieving your business goals.
    • A customized approach tailored to your specific business needs.
    • Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

    Pricing Options:

    Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!

  6. Amazon Web Services Public Data Sets

    • neuinfo.org
    • dknet.org
    • +1more
    + more versions
    Cite
    Amazon Web Services Public Data Sets [Dataset]. http://identifiers.org/RRID:SCR_006318
    Explore at:
    Description

    A multidisciplinary repository of public data sets, such as the Human Genome and US Census data, that can be seamlessly integrated into AWS cloud-based applications. AWS hosts the public data sets at no charge for the community. Anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. If you have a public-domain or non-proprietary data set that you think is useful and interesting to the AWS community, please submit a request; the AWS team will review your submission and get back to you. Typically the data sets in the repository are between 1 GB and 1 TB in size (based on the Amazon EBS volume limit), but the team can work with you to host larger data sets as well. You must have the right to make the data freely available.

  7. Dataset: A Systematic Literature Review on the topic of High-value datasets

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 23, 2023
    Cite
    Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
    Explore at:
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    University of Zagreb
    University of the Aegean
    Gdańsk University of Technology
    University of Tartu
    Authors
    Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb). It is being made public both to act as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (a pre-print is available in Open Access here: https://arxiv.org/abs/2305.10234) and so that other researchers can use these data in their own work.

    The protocol is intended for the Systematic Literature Review on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.

    Methodology

    To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).

    These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those where these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and further checked for relevance. As a result, a total of 9 articles were examined in depth. Each study was independently examined by at least two authors.

    To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

    Test procedure

    Each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the survey is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.

    Description of the data in this data set

    Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for the relevant studies. Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e., before filtering out irrelevant studies.

    The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

    Descriptive information
    1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
    2) Complete reference - the complete source information to refer to the study
    3) Year of publication - the year in which the study was published
    4) Journal article / conference paper / book chapter - the type of the paper {journal article, conference paper, book chapter}
    5) DOI / Website - a link to the website where the study can be found
    6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
    7) Availability in OA - availability of the article in Open Access
    8) Keywords - keywords of the paper as indicated by the authors
    9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

    Approach- and research design-related information
    10) Objective / RQ - the research objective / aim, established research questions
    11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR, etc.)
    12) Contributions - the contributions of the study
    13) Method - whether the study uses a qualitative, quantitative, or mixed-methods approach
    14) Availability of the underlying research data - whether there is a reference to publicly available underlying research data, e.g., transcriptions of interviews, collected data, or an explanation of why these data are not shared
    15) Period under investigation - period (or moment) in which the study was conducted
    16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is it used in the study?

    Quality- and relevance-related information
    17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)
    18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused on HVD determination; secondary - mentioned but not studied (e.g., as part of discussion, future work, etc.))

    HVD determination-related information
    19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
    20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
    21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
    22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
    23) Data - what data do HVD cover?
    24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

    Format of the files: .xls, .csv (for the first spreadsheet only), .odt, .docx

    Licenses or restrictions: CC-BY

    For more info, see README.txt

  8. Behance Community Art Data

    • cseweb.ucsd.edu
    json
    + more versions
    Cite
    UCSD CSE Research Project, Behance Community Art Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    json. Available download formats.
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Likes and image data from the community art website Behance. This is a small, anonymized version of a larger proprietary dataset.

    Metadata includes

    • appreciates (likes)

    • timestamps

    • extracted image features

    Basic Statistics:

    • Users: 63,497

    • Items: 178,788

    • Appreciates (likes): 1,000,000
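Since the dump is distributed as JSON, a first pass usually counts appreciates per item. The records below are invented stand-ins, and the field names (`user`, `item`, `timestamp`) are a guess for illustration; the actual keys in the UCSD files may differ.

```python
import json
from collections import Counter

# Hypothetical JSON lines mimicking appreciate records (field names assumed).
records = [
    '{"user": "u1", "item": "art42", "timestamp": 1400000000}',
    '{"user": "u2", "item": "art42", "timestamp": 1400000100}',
    '{"user": "u1", "item": "art7", "timestamp": 1400000200}',
]

# Popularity count: how many distinct appreciates each item received.
appreciates_per_item = Counter(json.loads(r)["item"] for r in records)
```

With the real file, the same loop would simply iterate over lines of the downloaded JSON instead of the in-memory list.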

  9. Ecommerce Dataset (Products & Sizes Included)

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    Anvit kumar (2025). Ecommerce Dataset (Products & Sizes Included) [Dataset]. https://www.kaggle.com/datasets/anvitkumar/shopping-dataset
    Explore at:
    zip (1274856 bytes). Available download formats.
    Dataset updated
    Nov 13, 2025
    Authors
    Anvit kumar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📦 Ecommerce Dataset (Products & Sizes Included)

    🛍️ Essential Data for Building an Ecommerce Website & Analyzing Online Shopping Trends 📌 Overview This dataset contains 1,000+ ecommerce products, including detailed information on pricing, ratings, product specifications, seller details, and more. It is designed to help data scientists, developers, and analysts build product recommendation systems, price prediction models, and sentiment analysis tools.

    🔹 Dataset Features

    • product_id - Unique identifier for the product
    • title - Product name/title
    • product_description - Detailed product description
    • rating - Average customer rating (0-5)
    • ratings_count - Number of ratings received
    • initial_price - Original product price
    • discount - Discount percentage (%)
    • final_price - Discounted price
    • currency - Currency of the price (e.g., USD, INR)
    • images - URL(s) of product images
    • delivery_options - Available delivery methods (e.g., standard, express)
    • product_details - Additional product attributes
    • breadcrumbs - Category path (e.g., Electronics > Smartphones)
    • product_specifications - Technical specifications of the product
    • amount_of_stars - Distribution of star ratings (1-5 stars)
    • what_customers_said - Customer reviews (sentiments)
    • seller_name - Name of the product seller
    • sizes - Available sizes (for clothing, shoes, etc.)
    • videos - Product video links (if available)
    • seller_information - Seller details, such as location and rating
    • variations - Different variants of the product (e.g., color, size)
    • best_offer - Best available deal for the product
    • more_offers - Other available deals/offers
    • category - Product category

    📊 Potential Use Cases

    📌 Build an Ecommerce Website: Use this dataset to design a functional online store with product listings, filtering, and sorting.
    🔍 Price Prediction Models: Predict product prices based on features like ratings, category, and discount.
    🎯 Recommendation Systems: Suggest products based on user preferences, rating trends, and customer feedback.
    🗣 Sentiment Analysis: Analyze what_customers_said to understand customer satisfaction and product popularity.
    📈 Market & Competitor Analysis: Track pricing trends, popular categories, and seller performance.

    🔍 Why Use This Dataset?

    ✅ Rich Feature Set: Includes all necessary ecommerce attributes.
    ✅ Realistic Pricing & Rating Data: Useful for price analysis and recommendations.
    ✅ Multi-Purpose: Suitable for machine learning, web development, and data visualization.
    ✅ Structured Format: Easy-to-use CSV format for quick integration.

    📂 Dataset Format

    CSV file (ecommerce_dataset.csv), 1000+ samples, multi-category coverage.

    🔗 How to Use?

    Download the dataset from Kaggle, then load it in Python using pandas:

    ```python
    import pandas as pd

    df = pd.read_csv("ecommerce_dataset.csv")
    df.head()
    ```

    Explore trends & patterns using visualization tools (Seaborn, Matplotlib), and build models & applications based on the dataset!
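As a further sketch once the file is loaded, the price columns can be cross-checked and summarised per category. The rows below are made up and only reuse the column names listed for this dataset.

```python
import pandas as pd

# Made-up rows standing in for ecommerce_dataset.csv (column names from
# the dataset's feature list; values invented).
df = pd.DataFrame({
    "category": ["Electronics", "Electronics", "Clothing"],
    "initial_price": [100.0, 200.0, 50.0],
    "discount": [10, 25, 20],  # percent
    "final_price": [90.0, 150.0, 40.0],
})

# Sanity check: final_price should equal initial_price after the discount.
df["expected_final"] = df["initial_price"] * (1 - df["discount"] / 100)
prices_consistent = (df["expected_final"].round(2) == df["final_price"]).all()

# Average final price per category, a typical first summary for analysis.
avg_final = df.groupby("category")["final_price"].mean()
```

Consistency checks like this are worth running before any price-prediction modelling, since scraped price fields often disagree.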

  10. Data from: Inventory of online public databases and repositories holding agricultural data in 2017

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. https://catalog.data.gov/dataset/inventory-of-online-public-databases-and-repositories-holding-agricultural-data-in-2017-d4c81
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve both as a current landscape analysis and as a baseline for future studies of ag research data.

    Purpose

    As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:

    • establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
    • compare how much data is in institutional vs. domain-specific vs. federal platforms
    • determine which repositories are recommended by top journals that require or recommend the publication of supporting data
    • ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data

    Approach

    The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data.
    To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

    Search methods

    We first compiled a list of known domain-specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied.
General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals were compiled, in which USDA published in 2012 and 2016, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required? and Recommended data repositories, as provided in the online author guidelines for each the top 50 journals. 
Evaluation

We ran a series of searches on all resulting general-subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, the type of resource searched (datasets, data, images, components, etc.), the percentage of the total database that each term comprised, any search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

Results

A summary of the major findings from our data review:

• Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
• There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
• Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.

See the included README file for descriptions of each individual data file in this dataset.

Resources in this dataset:

• Resource Title: Journals. File Name: Journals.csv
• Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
• Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
• Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
• Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
• Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
• Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
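The screening thresholds described in the evaluation above can be sketched in a few lines of Python; the `evaluate` helper, its field names, and the sample counts below are hypothetical illustrations, not part of the published data files.

```python
# Hypothetical sketch of the evaluation criteria described above:
# flag search terms that make up at least 1% / 5% of a collection,
# or that return more than 100 / 500 results.
def evaluate(term_counts: dict, total: int) -> dict:
    report = {}
    for term, n in term_counts.items():
        share = n / total
        report[term] = {
            "pct_of_collection": round(100 * share, 2),
            "at_least_1pct": share >= 0.01,
            "at_least_5pct": share >= 0.05,
            "over_100_results": n > 100,
            "over_500_results": n > 500,
        }
    return report

# Invented example counts for a repository of 10,000 datasets.
report = evaluate({"agriculture": 600, "soil": 40}, total=10_000)
```

Under this screen, a repository would match the strictest criterion reported above when an ag term returns over 500 datasets that also comprise at least 5% of its collection.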

  11. PhishingWebsites

    • openml.org
    Updated Feb 16, 2016
    Cite
    Rami Mustafa A Mohammad ( University of Huddersfield; rami.mohammad '@' hud.ac.uk; rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield; t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai; fadi '@' cud.ac.ae) (2016). PhishingWebsites [Dataset]. https://www.openml.org/d/4534
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2016
    Authors
    Rami Mustafa A Mohammad ( University of Huddersfield; rami.mohammad '@' hud.ac.uk; rami.mustafa.a '@' gmail.com) Lee McCluskey (University of Huddersfield; t.l.mccluskey '@' hud.ac.uk ) Fadi Thabtah (Canadian University of Dubai; fadi '@' cud.ac.ae)
    Description

    Author: Rami Mustafa A Mohammad (University of Huddersfield; rami.mohammad '@' hud.ac.uk; rami.mustafa.a '@' gmail.com), Lee McCluskey (University of Huddersfield; t.l.mccluskey '@' hud.ac.uk), Fadi Thabtah (Canadian University of Dubai; fadi '@' cud.ac.ae)
    Source: UCI
    Please cite: Please refer to the Machine Learning Repository's citation policy

    Source:

    Rami Mustafa A Mohammad (University of Huddersfield, rami.mohammad '@' hud.ac.uk, rami.mustafa.a '@' gmail.com), Lee McCluskey (University of Huddersfield, t.l.mccluskey '@' hud.ac.uk), Fadi Thabtah (Canadian University of Dubai, fadi '@' cud.ac.ae)

    Data Set Information:

    One of the challenges faced by our research was the unavailability of reliable training datasets; in fact, this challenge faces any researcher in the field. Although plenty of articles about predicting phishing websites have been disseminated, no reliable training dataset has been published publicly, perhaps because there is no agreement in the literature on the definitive features that characterize phishing webpages, which makes it difficult to shape a dataset that covers all possible features. In this dataset, we shed light on the important features that have proved to be sound and effective in predicting phishing websites. In addition, we propose some new features.

    Attribute Information:

    For further information about the features, see the features file in the data folder of UCI.

    Relevant Papers:

    Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi (2012) An Assessment of Features Related to Phishing Websites using an Automated Technique. In: International Conferece For Internet Technology And Secured Transactions. ICITST 2012 . IEEE, London, UK, pp. 492-497. ISBN 978-1-4673-5325-0

    Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2014) Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications, 25 (2). pp. 443-458. ISSN 0941-0643

    Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi Abdeljaber (2014) Intelligent Rule based Phishing Websites Classification. IET Information Security, 8 (3). pp. 153-160. ISSN 1751-8709

    Citation Request:

    Please refer to the Machine Learning Repository's citation policy

  12. LegitPhish Dataset

    • data.mendeley.com
    Updated Apr 7, 2025
    Cite
    Rachana Potpelwar (2025). LegitPhish Dataset [Dataset]. http://doi.org/10.17632/hx4m73v2sf.1
    Explore at:
    Dataset updated
    Apr 7, 2025
    Authors
    Rachana Potpelwar
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains 101,219 URLs and 18 features (including the label). Class distribution:

    Phishing (0): 63,678 URLs

    Legitimate (1): 37,540 URLs

    These URLs have been sourced from the URLHaus database and scraped from many other well-known repositories of malicious websites actively used in phishing attacks. Each entry in this subset has been manually verified and labeled as a phishing URL, making this dataset highly reliable for identifying harmful web content.

    The legitimate URLs have been collected from reputable sources such as Wikipedia and Stack Overflow. These websites are known for hosting user-generated content and community discussions, ensuring that the URLs represent safe, legitimate web addresses. The URLs were randomly scraped to ensure diversity in the types of legitimate sites included. Dataset Features:

    URL: The full web address of each entry, providing the primary feature for analysis.

    Label: A binary label indicating whether the URL is legitimate (1) or phishing (0).

    Applications:

    This dataset is suitable for training and evaluating machine learning models aimed at distinguishing between phishing and legitimate websites. It can be used in a variety of cybersecurity research projects, including URL-based phishing detection, web content analysis, and the development of real-time protection systems.

    Usage:

    Researchers can leverage this dataset to develop and test algorithms for identifying phishing websites with high accuracy, using features derived from the URL structure together with the class label attribute. The inclusion of both phishing and legitimate URLs provides a comprehensive basis for creating robust models capable of detecting phishing attempts in diverse online environments.

    Feature descriptions:

    • URL: The full URL string.
    • url_length: Total number of characters in the URL.
    • has_ip_address: Binary flag (1/0): whether the URL contains an IP address.
    • dot_count: Number of . characters in the URL.
    • https_flag: Binary flag (1/0): whether the URL uses HTTPS.
    • url_entropy: Shannon entropy of the URL string; higher values indicate more randomness.
    • token_count: Number of tokens/words in the URL.
    • subdomain_count: Number of subdomains in the URL.
    • query_param_count: Number of query parameters (after ?).
    • tld_length: Length of the Top-Level Domain (e.g., "com" = 3).
    • path_length: Length of the path part after the domain.
    • has_hyphen_in_domain: Binary flag (1/0): whether the domain contains a hyphen (-).
    • number_of_digits: Total number of numeric characters in the URL.
    • tld_popularity: Binary flag (1/0): whether the TLD is popular.
    • suspicious_file_extension: Binary flag (1/0): whether the URL ends with a suspicious extension (e.g., .exe, .zip).
    • domain_name_length: Length of the domain name.
    • percentage_numeric_chars: Percentage of numeric characters in the URL.
    • ClassLabel: Target label: 1 = Legitimate, 0 = Phishing.
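As a rough illustration of how several of the listed features can be derived from a raw URL, here is a minimal Python sketch; the `url_features` helper and its rounding choices are assumptions for illustration, not the dataset authors' actual extraction pipeline.

```python
import math
from collections import Counter
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Compute a handful of the features listed above for one URL.

    Illustrative only; not the LegitPhish authors' pipeline."""
    n = len(url)
    counts = Counter(url)
    # Shannon entropy over the URL's characters (bits per character).
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    parsed = urlparse(url)
    domain = parsed.netloc
    return {
        "url_length": n,
        "dot_count": url.count("."),
        "https_flag": 1 if parsed.scheme == "https" else 0,
        "url_entropy": round(entropy, 3),
        "has_hyphen_in_domain": 1 if "-" in domain else 0,
        "number_of_digits": sum(ch.isdigit() for ch in url),
        "domain_name_length": len(domain),
    }

feats = url_features("https://en.wikipedia.org/wiki/Phishing")
```

A random-looking phishing URL packed with digits and rare characters would score a noticeably higher `url_entropy` than a short legitimate address like the Wikipedia example above.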

  13. Vehicle licensing statistics data tables

    • gov.uk
    • s3.amazonaws.com
    Updated Oct 15, 2025
    Cite
    Department for Transport (2025). Vehicle licensing statistics data tables [Dataset]. https://www.gov.uk/government/statistical-data-sets/vehicle-licensing-statistics-data-tables
    Explore at:
    Dataset updated
    Oct 15, 2025
    Dataset provided by
    GOV.UK
    Authors
    Department for Transport
    Description

    Data files containing detailed information about vehicles in the UK are also available, including make and model data.

    Some tables have been withdrawn and replaced. The table index for this statistical series has been updated to provide a full mapping between the old and new numbering systems used on this page.

    The Department for Transport is committed to continuously improving the quality and transparency of our outputs, in line with the Code of Practice for Statistics. Accordingly, we have recently concluded a planned review of the processes and methodologies used in the production of Vehicle licensing statistics data. The review sought to identify and introduce further improvements and efficiencies in the coding technologies we use to produce our data, and as part of that work we identified several historical errors across the published data tables, affecting different historical periods. These errors are the result of mistakes in past production processes that we have now identified, corrected and taken steps to eliminate going forward.

    Most of the revisions to our published figures are small, typically changing values by less than 1% to 3%. The key revisions are:

    Licensed Vehicles (2014 Q3 to 2016 Q3)

    We found that some unlicensed vehicles during this period were mistakenly counted as licensed. This caused a slight overstatement, about 0.54% on average, in the number of licensed vehicles during this period.

    3.5 - 4.25 tonnes Zero Emission Vehicles (ZEVs) Classification

    Since 2023, ZEVs weighing between 3.5 and 4.25 tonnes have been classified as light goods vehicles (LGVs) instead of heavy goods vehicles (HGVs). We have now applied this change to earlier data and corrected an error in table VEH0150. As a result, the number of newly registered HGVs has been reduced by:

    • 3.1% in 2024

    • 2.3% in 2023

    • 1.4% in 2022

    Table VEH0156 (2018 to 2023)

    Table VEH0156, which reports average CO₂ emissions for newly registered vehicles, has been updated for the years 2018 to 2023. Most changes are minor (under 3%), but the e-NEDC measure saw a larger correction, up to 15.8%, due to a calculation error. Changes to the other measures (WLTP and Reported) were less notable, except for April 2020, when COVID-19 led to very few new registrations and hence greater volatility in the resulting percentages.

    Neither these specific revisions, nor any of the others introduced, has had a material impact on the overall statistics, the direction of trends, or the key messages they previously conveyed.

    Specific details of each revision have been included in the relevant data table notes to ensure transparency and clarity. Users are advised to review these notes as part of their regular use of the data so that their analysis accounts for these changes.

    If you have questions regarding any of these changes, please contact the Vehicle statistics team.

    All vehicles

    Licensed vehicles

    Overview

    VEH0101: Vehicles at the end of the quarter by licence status and body type: Great Britain and United Kingdom (ODS, 99.7 KB) (https://assets.publishing.service.gov.uk/media/68ecf5acf159f887526bbd7c/veh0101.ods)

    Detailed breakdowns

    VEH0103: Licensed vehicles at the end of the year by tax class: Great Britain and United Kingdom (ODS, 23.8 KB) (https://assets.publishing.service.gov.uk/media/68ecf5abf159f887526bbd7b/veh0103.ods)

    VEH0105: Licensed vehicles at (https://assets.publishing.service.gov.uk/media/68ecf5ac2adc28a81b4acfc8/veh0105.ods)

  14. The Canada Trademarks Dataset

    • zenodo.org
    pdf, zip
    Updated Jul 19, 2024
    Cite
    Jeremy Sheff; Jeremy Sheff (2024). The Canada Trademarks Dataset [Dataset]. http://doi.org/10.5281/zenodo.4999655
    Explore at:
    zip, pdfAvailable download formats
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jeremy Sheff; Jeremy Sheff
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Canada Trademarks Dataset

    18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

    Dataset Selection and Arrangement (c) 2021 Jeremy Sheff

    Python and Stata Scripts (c) 2021 Jeremy Sheff

    Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

    This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

    Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

    Terms of Use:

    As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

    The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

    The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

    Details of Repository Contents:

    This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

    • /csv: contains the .csv versions of the data files
    • /do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset
    • /dta: contains the .dta versions of the data files
    • /py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

    If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

    The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
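The iterparse pattern those scripts rely on (stream the XML, harvest each record at its closing tag, then free it) can be sketched as below; the element and field names here are invented placeholders, not CIPO's actual schema.

```python
# Minimal sketch of iterparse-style XML-to-CSV conversion.
# <application>, <number>, and <mark> are placeholder element names
# for illustration only, not the CIPO archive's real schema.
import csv
import io
import xml.etree.ElementTree as ET

sample = io.BytesIO(b"""<applications>
  <application><number>100</number><mark>ACME</mark></application>
  <application><number>101</number><mark>MAPLE</mark></application>
</applications>""")

rows = []
for event, elem in ET.iterparse(sample, events=("end",)):
    if elem.tag == "application":
        rows.append({"number": elem.findtext("number"),
                     "mark": elem.findtext("mark")})
        elem.clear()  # free the parsed element; keeps memory flat

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["number", "mark"])
writer.writeheader()
writer.writerows(rows)
```

Clearing each element after it is processed is what makes this approach practical on archives as large as the downloads described above.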

    With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

    The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.

    This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.

  15. ATSDR Hazardous Waste Site Polygon Data with CIESIN Modifications, Version 2...

    • data.nasa.gov
    • dataverse.harvard.edu
    • +6 more
    Updated Dec 12, 2014
    + more versions
    Cite
    nasa.gov (2014). ATSDR Hazardous Waste Site Polygon Data with CIESIN Modifications, Version 2 [Dataset]. https://data.nasa.gov/dataset/atsdr-hazardous-waste-site-polygon-data-with-ciesin-modifications-version-2
    Explore at:
    Dataset updated
    Dec 12, 2014
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The Agency for Toxic Substances and Disease Registry (ATSDR) Hazardous Waste Site Polygon Data with CIESIN Modifications, Version 2 is a database providing georeferenced data for 1,572 National Priorities List (NPL) Superfund sites. These were selected from the larger set of the ATSDR Hazardous Waste Site Polygon Data, Version 2 data set, with polygons from May 26, 2010. The modified data set contains only sites that have been proposed, are currently on, or have been deleted from the final NPL as of October 25, 2013. Of the 2,080 ATSDR polygons from 2010, 1,575 were NPL sites, but three were excluded: 2 in the Virgin Islands and 1 in Guam. This data set was modified by the Columbia University Center for International Earth Science Information Network (CIESIN). The modified polygon database includes all the attributes for these NPL sites provided in the ATSDR GRASP Hazardous Waste Site Polygon database, as well as selected attributes from the EPA List 9 Active CERCLIS sites and SCAP 12 NPL sites databases. These polygons represent sites considered for cleanup under the Comprehensive Environmental Response, Compensation and Liability Act (CERCLA, or Superfund). The Geospatial Research, Analysis, and Services Program (GRASP; Division of Health Studies, Agency for Toxic Substances and Disease Registry, Centers for Disease Control and Prevention) has created site boundary data using the best available information for those sites where health assessments or consultations have been requested.

  16. Fuel Economy Label and CAFE Data Inventory

    • catalog.data.gov
    • data.amerigeoss.org
    • +1 more
    Updated Jul 12, 2021
    Cite
    U.S. EPA Office of Air and Radiation (OAR) - Office of Transportation and Air Quality (OTAQ) (2021). Fuel Economy Label and CAFE Data Inventory [Dataset]. https://catalog.data.gov/dataset/fuel-economy-label-and-cafe-data-inventory
    Explore at:
    Dataset updated
    Jul 12, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The Fuel Economy Label and CAFE Data asset contains measured summary fuel economy estimates and test data for light-duty vehicle manufacturers by model, collected for certification as required under the Energy Policy and Conservation Act of 1975 (EPCA) and the Energy Independence and Security Act of 2007 (EISA) to support the creation of Fuel Economy Labels and the calculation of Corporate Average Fuel Economy (CAFE). Manufacturers submit data on an annual basis, or as needed to document vehicle model changes.

    The EPA performs targeted fuel economy confirmatory tests on approximately 15% of vehicles submitted for validation. Confirmatory data on vehicles is associated with its corresponding submission data to verify the accuracy of manufacturer submissions beyond standard business rules. Submitted data comes in XML format or as documents, with the majority of submissions being sent in XML, and includes descriptive information on the vehicle itself, fuel economy information, and the manufacturer's testing approach. This data may contain confidential business information (CBI) such as estimated sales or other data elements indicated by the submitter as confidential. CBI data is not publicly available; however, within the EPA the data can be accessed under the restrictions of the Office of Transportation and Air Quality (OTAQ) CBI policy [RCS Link]. Datasets are segmented by vehicle model/manufacturer and/or year with corresponding fuel economy, test, and certification data. Data assets are stored in EPA's Verify system.

    Coverage began in 1974, with early records being primarily paper documents that did not go through the same level of validation as the primarily digital submissions that started in 2008. Early data is available to the public digitally starting from 1978, but more complete digital certification data is available starting in 2008. Fuel economy submission data prior to 2006 was calculated using an older formula; however, mechanisms exist to make this data comparable to current results.

    Fuel Economy Label and CAFE Data submission documents with metadata, certificate and summary decision information are made publicly available through the EPA/DOE Fuel Economy Guide website (https://www.fueleconomy.gov/) as well as EPA's SmartWay Program website (https://www.epa.gov/smartway/) and Green Vehicle Guide website (http://ofmpub.epa.gov/greenvehicles/Index.do;jsessionid=3F4QPhhYDYJxv1L3YLYxqh6J2CwL0GkxSSJTl2xgMTYPBKYS00vw!788633877) after it has been quality assured. Where summary data appears inaccurate, OTAQ returns the entries for review to their originator.

  17. NASA Global Web-Enabled Landsat Data Annual Global 30 m V031 - Dataset -...

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). NASA Global Web-Enabled Landsat Data Annual Global 30 m V031 - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/nasa-global-web-enabled-landsat-data-annual-global-30-m-v031-37b9c
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The NASA Making Earth System Data Records for Use in Research Environments (MEaSUREs) Global Web-Enabled Landsat Data Annual (GWELDYR) Version 3.1 data product provides Landsat data at 30 meter (m) resolution for terrestrial non-Antarctica locations over annual reporting periods for the 1985, 1990, and 2000 epochs. GWELD data products are generated from all available Landsat 4 and 5 Thematic Mapper (TM) and Landsat 7 Enhanced Thematic Mapper Plus (ETM+) data in the U.S. Geological Survey (USGS) Landsat archive. The GWELD suite of products provides consistent data to derive land cover as well as geophysical and biophysical information for regional assessment of land surface dynamics.

    The GWELD products include Nadir Bidirectional Reflectance Distribution Function (BRDF)-Adjusted Reflectance (NBAR) for the reflective wavelength bands and top of atmosphere (TOA) brightness temperature for the thermal bands. The products are defined in the Sinusoidal coordinate system to promote continuity with NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) land tile grid.

    Provided in the GWELDYR product are layers for surface reflectance bands 1 through 5 and 7, TOA brightness temperature for thermal bands, Normalized Difference Vegetation Index (NDVI), day of year, ancillary angle, and data quality information. A low-resolution red, green, blue (RGB) browse image of bands 5, 4, 3 is also available for each granule.

    Known Issues: GWELDYR known issues can be found in Section 4 of the Algorithm Theoretical Basis Document (ATBD).

    Improvements/Changes from Previous Version: Version 3.1 products use Landsat Collection 1 products as input and have improved per-pixel cloud mask, new quality data, improved calibration information, and improved product metadata that enable view and solar geometry calculations.

  18. Job Postings Dataset for Labour Market Research and Insights

    • datarade.ai
    Updated Sep 20, 2023
    Cite
    Oxylabs (2023). Job Postings Dataset for Labour Market Research and Insights [Dataset]. https://datarade.ai/data-products/job-postings-dataset-for-labour-market-research-and-insights-oxylabs
    Explore at:
    .json, .xml, .csv, .xlsAvailable download formats
    Dataset updated
    Sep 20, 2023
    Dataset authored and provided by
    Oxylabs
    Area covered
    Togo, Zambia, Kyrgyzstan, Luxembourg, Switzerland, Jamaica, Tajikistan, Sierra Leone, Anguilla, British Indian Ocean Territory
    Description

    Introducing Job Posting Datasets: Uncover labor market insights!

    Elevate your recruitment strategies, forecast future labor industry trends, and unearth investment opportunities with Job Posting Datasets.

    Job Posting Datasets Source:

    1. Indeed: Access datasets from Indeed, a leading employment website known for its comprehensive job listings.

    2. Glassdoor: Receive ready-to-use employee reviews, salary ranges, and job openings from Glassdoor.

    3. StackShare: Access StackShare datasets to make data-driven technology decisions.

    Job Posting Datasets provide meticulously acquired and parsed data, freeing you to focus on analysis. You'll receive clean, structured, ready-to-use job posting data, including job titles, company names, seniority levels, industries, locations, salaries, and employment types.

    Choose your preferred dataset delivery options for convenience:

    • Receive datasets in various formats, including CSV, JSON, and more.
    • Opt for storage solutions such as AWS S3, Google Cloud Storage, and more.
    • Customize data delivery frequencies, whether one-time or per your agreed schedule.

    Why Choose Oxylabs Job Posting Datasets:

    1. Fresh and accurate data: Access clean and structured job posting datasets collected by our seasoned web scraping professionals, enabling you to dive into analysis.

    2. Time and resource savings: Focus on data analysis and your core business objectives while we efficiently handle the data extraction process cost-effectively.

    3. Customized solutions: Tailor our approach to your business needs, ensuring your goals are met.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA best practices.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Effortlessly access fresh job posting data with Oxylabs Job Posting Datasets.

  19. Developer Community and Code Datasets

    • datarade.ai
    Cite
    Oxylabs, Developer Community and Code Datasets [Dataset]. https://datarade.ai/data-products/developer-community-and-code-datasets-oxylabs
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset provided by
    Oxylabs
    Authors
    Oxylabs
    Area covered
    El Salvador, Tuvalu, Philippines, Bahamas, Guyana, Saint Pierre and Miquelon, Marshall Islands, South Sudan, United Kingdom, Djibouti
    Description

    Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.

    Data Sources:

    1. GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.

    2. StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.

    3. DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.

    Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.

    With our datasets, you'll receive:

    • Usernames;
    • Companies;
    • Locations;
    • Job Titles;
    • Follower Counts;
    • Contact Details;
    • Employability Statuses;
    • And More.

    Choose from various output formats, storage options, and delivery frequencies:

    • Get datasets in CSV, JSON, or other preferred formats.
    • Opt for data delivery via SFTP or directly to your cloud storage, such as AWS S3.
    • Receive datasets either once or as per your agreed-upon schedule.
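A delivered JSON-formatted dataset can be consumed with a few lines of standard-library Python. The record below is a minimal sketch; the field names are hypothetical, since the actual schema is agreed per customer:

```python
import json

# One record as it might appear in a delivered JSON Lines file.
# Field names are illustrative; the real schema depends on your agreement.
line = '{"username": "octocat", "company": "GitHub", "location": "San Francisco", "followers": 4000}'

record = json.loads(line)
summary = f'{record["username"]} ({record["company"]}) - {record["followers"]} followers'
print(summary)
```

The same pattern applies line by line to a full JSON Lines delivery, or via the csv module for CSV deliveries.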

    Why choose our Datasets?

    1. Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.

    2. Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.

    3. Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.

    Pricing Options:

    Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!

  20. Self-citation analysis data based on PubMed Central subset (2002-2005)

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik, Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
    Explore at:
    Authors
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description

    Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018.

    ## Introduction

    This dataset was created as part of the publication: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed as the following tab-separated text files:

    * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
    * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
    * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle (2nd) authors
    * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
    * COLUMNS_DESC.txt - Descriptions of all columns
    * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection
    * results_all_model.tar.gz - Model coefficient and result files in NumPy format used for plotting; v4.reviewer contains models for analysis done after reviewer comments
    * README.txt

    ## Dataset creation

    Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should request this data from Clarivate Analytics directly. However, we also provide a similar but open dataset based on citations from PubMed Central, which can be used to obtain results comparable to those reported in our analysis. Furthermore, we have freely shared the datasets which, used together with the citation data from Clarivate Analytics, can re-create the dataset used in our experiments.

    These datasets are listed below. If you use any of them, please cite both the dataset and the paper introducing it.

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
    * Author-ity 2009 dataset:
      - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
      - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1-29. https://doi.org/10.1145/1552303.1552304
      - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140-158. https://doi.org/10.1002/asi.20105
    * Genni 2.0 + Ethnea for identifying author gender and ethnicity:
      - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
      - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
      - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
    * MapAffil for identifying article country of affiliation:
      - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
      - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
    * IMPLICIT journal similarity:
      - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
    * Novelty dataset for identifying article-level novelty:
      - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
      - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
      - Code: https://github.com/napsternxg/Novelty
    * Expertise dataset for identifying author expertise on articles
    * Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

    Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October 2016. See NLM's data Terms and Conditions for information on obtaining PubMed/MEDLINE data. Additional data-related updates can be found at the Torvik Research Group.

    ## Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License

    Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
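The training files are headerless TSVs whose column names live in the separate Training_data_2002_2005_pmc_pair_txt.header.txt file. A minimal loading sketch using only the standard library; the column names and values below are invented for illustration (see COLUMNS_DESC.txt for the real ones):

```python
import csv
import io

# Invented stand-ins for the real files: the actual header file is
# Training_data_2002_2005_pmc_pair_txt.header.txt and the data files are
# the Training_data_2002_2005_pmc_pair_*.txt TSVs.
header_txt = "pmid\tauthor_pos\tis_self_citation"   # hypothetical column names
data_txt = "12345\t1\t0\n67890\t1\t1\n"

columns = header_txt.split("\t")
rows = [dict(zip(columns, fields))
        for fields in csv.reader(io.StringIO(data_txt), delimiter="\t")]
print(len(rows), rows[0])
```

For the real 1.2G files, the same approach works with `open(path)` in place of `io.StringIO`, streaming one row at a time rather than materializing the whole list.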

Dmitry Yemelyanov (2021). Riga Data Science Club [Dataset]. https://www.kaggle.com/datasets/dmitryyemelyanov/rigadsclub

Riga Data Science Club

LinkedIn stats, meetup transcriptions, website analytics, typeform responses

Explore at:
Available download formats: zip (494849 bytes)
Dataset updated
Mar 29, 2021
Authors
Dmitry Yemelyanov
License

https://creativecommons.org/publicdomain/zero/1.0/

Area covered
Riga
Description

Context

Riga Data Science Club is a non-profit organisation for sharing ideas and experience and building machine learning projects together. A data science community should know its own data, so this is a dataset about ourselves: our website analytics, social media activity, Slack statistics, and even meetup transcriptions!

Content

The dataset is split into several folders by context:

* linkedin - company page visitor, follower, and post stats
* slack - messaging and member activity
* typeform - new member responses
* website - website visitors by country, language, device, operating system, and screen resolution
* youtube - meetup transcriptions

Inspiration

Let's make Riga Data Science Club better! We expect this data to bring lots of insights on how to improve.

"Know your c̶u̶s̶t̶o̶m̶e̶r̶ member" - Explore member interests by analysing sign-up survey (typeform) responses - Explore messaging patterns in Slack to understand how members are retained and when they are lost

Social media intelligence * Define LinkedIn posting strategy based on historical engagement data * Define target user profile based on LinkedIn page attendance data

Website * Define website localisation strategy based on data about visitor countries and languages * Define website responsive design strategy based on data about visitor devices, operating systems and screen resolutions

Have some fun * NLP analysis of meetup transcriptions: word frequencies, question answering, something else?
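The word-frequency idea needs nothing beyond the standard library. A minimal sketch on a toy transcript (the real input would be the text files in the youtube folder):

```python
import re
from collections import Counter

# Toy stand-in for one meetup transcription from the youtube folder.
transcript = "data science club meetup: machine learning, data pipelines, and more data"

# Lowercase and tokenize on runs of letters/apostrophes, then count.
words = re.findall(r"[a-z']+", transcript.lower())
top_words = Counter(words).most_common(3)
print(top_words)
```

Filtering out common stop words ("and", "the", ...) before counting would make the top of the list more informative on real transcripts.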
