70 datasets found
  1. MONSTER_JOB_POSTING_USA

    • kaggle.com
    zip
    Updated Aug 2, 2022
    Cite
    PromptCloud (2022). MONSTER_JOB_POSTING_USA [Dataset]. https://www.kaggle.com/datasets/promptcloud/monster-job-posting-usa
    Available download formats: zip (16498 bytes)
    Dataset updated
    Aug 2, 2022
    Authors
    PromptCloud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Context

    This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.

    Content

    Total Records Count: 1093713
    Domain Name: monster.usa.com
    Date Range: 01st April 2022 - 31st June 2022
    File Extension: ldjson

    Available Fields : url, job_title, category, company_name, city, state, country, post_date, occupation_category, job_description, job_type, valid_through, html_job_description, extra_fields, uniq_id, crawl_timestamp, job_board, geo, job_post_lang, inferred_iso2_lang_code, is_remote, test1_cities, test1_states, test1_countries, site_name, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, ijp_reprocessed_flag_1, ijp_reprocessed_flag_2, ijp_reprocessed_flag_3, ijp_is_production_ready, fitness_score  
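    The ldjson extension indicates line-delimited JSON: one JSON object per line. A minimal sketch of reading such a file follows. The two sample records are made up for illustration, and only the field names (job_title, city, state) come from the list above.

```python
import json
from typing import Iterable, Iterator


def iter_ldjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Made-up records; the values are illustrative, not taken from the dataset.
sample = [
    '{"job_title": "Data Engineer", "city": "Austin", "state": "TX"}',
    "",
    '{"job_title": "Nurse", "city": "Denver", "state": "CO"}',
]

records = list(iter_ldjson(sample))
titles = [record["job_title"] for record in records]
```

    Because an open file is itself an iterable of lines, the same function should work on a downloaded file opened in text mode with UTF-8 encoding, without loading the whole file into memory.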

    Acknowledgements

    We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and live job data from JobsPikr.

    Inspiration

    This dataset was created keeping in mind our data scientists and researchers across the world.

  2. Investigating the indoor environmental quality of different workplaces...

    • tandf.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    Giorgia Chinazzo (2023). Investigating the indoor environmental quality of different workplaces through web-scraping and text-mining of Glassdoor reviews [Dataset]. http://doi.org/10.6084/m9.figshare.14393067.v1
    Available download formats: docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Giorgia Chinazzo
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The analysis of occupants’ perception can improve building indoor environmental quality (IEQ). Going beyond conventional surveys, this study presents an innovative analysis of occupants’ feedback about the IEQ of different workplaces based on web-scraping and text-mining of online job reviews. A total of 1,158,706 job reviews posted on Glassdoor about 257 large organizations (with more than 10,000 employees) are scraped and analyzed. Within these reviews, 10,593 include complaints about at least one IEQ aspect. The analysis of this large number of feedbacks referring to several workplaces is the first of its kind and leads to two main results: (1) IEQ complaints mostly arise in workplaces that are not office buildings, especially regarding poor thermal and indoor air quality conditions in warehouses, stores, kitchens, and trucks; (2) reviews containing IEQ complaints are more negative than reviews without IEQ complaints. The first result highlights the need for IEQ investigations beyond office buildings. The second result strengthens the potential detrimental effect that uncomfortable IEQ conditions can have on job satisfaction. This study demonstrates the potential of User-Generated Content and text-mining techniques to analyze the IEQ of workplaces as an alternative to conventional surveys, for scientific and practical purposes.

  3. Iris Webpage

    • figshare.com
    html
    Updated Mar 9, 2020
    Cite
    Jesus Rogel-Salazar (2020). Iris Webpage [Dataset]. http://doi.org/10.6084/m9.figshare.7053392.v4
    Available download formats: html
    Dataset updated
    Mar 9, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jesus Rogel-Salazar
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A simple web page containing Fisher's Iris Dataset.

  4. Monster USA Job Posting

    • kaggle.com
    zip
    Updated Feb 22, 2022
    + more versions
    Cite
    PromptCloud (2022). Monster USA Job Posting [Dataset]. https://www.kaggle.com/datasets/promptcloud/monster-usa-job-posting
    Available download formats: zip (55313772 bytes)
    Dataset updated
    Feb 22, 2022
    Authors
    PromptCloud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.

    Content

    Total Records Count: 254888
    Domain Name: monster.usa.com
    Date Range: 01st Jul 2021 - 30th Sep 2021
    File Extension: ldjson

    Available Fields : url, job_title, category, industry, company_name, logo_url, city, state, country, post_date, occupation_category, job_description, job_type, valid_through, html_job_description, extra_fields, uniq_id, crawl_timestamp, apply_url, job_board, geo, is_remote, test_contact_email, contact_email, test1_cities, test1_states, test1_countries, site_name, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, fitness_score  

    Acknowledgements

    We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and live job data from JobsPikr.

    Inspiration

    This dataset was created keeping in mind our data scientists and researchers across the world.

  5. Data from: DigiMOF: A Database of Metal–Organic Framework Synthesis...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated May 18, 2023
    Cite
    Moosavi, Seyed Mohamad; Glasby, Lawson T.; Bence, Rosalee; Oktavian, Rama; Cordiner, Joan L.; Moghadam, Peyman Z.; Cole, Jason C.; Isoko, Kesler; Gubsch, Kristian (2023). DigiMOF: A Database of Metal–Organic Framework Synthesis Information Generated via Text Mining [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000955398
    Dataset updated
    May 18, 2023
    Authors
    Moosavi, Seyed Mohamad; Glasby, Lawson T.; Bence, Rosalee; Oktavian, Rama; Cordiner, Joan L.; Moghadam, Peyman Z.; Cole, Jason C.; Isoko, Kesler; Gubsch, Kristian
    Description

    The vastness of materials space, particularly that which is concerned with metal–organic frameworks (MOFs), creates the critical problem of performing efficient identification of promising materials for specific applications. Although high-throughput computational approaches, including the use of machine learning, have been useful in rapid screening and rational design of MOFs, they tend to neglect descriptors related to their synthesis. One way to improve the efficiency of MOF discovery is to data-mine published MOF papers to extract the materials informatics knowledge contained within journal articles. Here, by adapting the chemistry-aware natural language processing tool, ChemDataExtractor (CDE), we generated an open-source database of MOFs focused on their synthetic properties: the DigiMOF database. Using the CDE web scraping package alongside the Cambridge Structural Database (CSD) MOF subset, we automatically downloaded 43,281 unique MOF journal articles, extracted 15,501 unique MOF materials, and text-mined over 52,680 associated properties including the synthesis method, solvent, organic linker, metal precursor, and topology. Additionally, we developed an alternative data extraction technique to obtain and transform the chemical names assigned to each CSD entry in order to determine linker types for each structure in the CSD MOF subset. This data enabled us to match MOFs to a list of known linkers provided by Tokyo Chemical Industry UK Ltd. (TCI) and analyze the cost of these important chemicals. This centralized, structured database reveals the MOF synthetic data embedded within thousands of MOF publications and contains further topology, metal type, accessible surface area, largest cavity diameter, pore limiting diameter, open metal sites, and density calculations for all 3D MOFs in the CSD MOF subset. 
    The DigiMOF database and associated software are publicly available for other researchers to rapidly search for MOFs with specific properties, conduct further analysis of alternative MOF production pathways, and create additional parsers to search for additional desirable properties.

  6. Job Data USA CareerBuilder

    • kaggle.com
    zip
    Updated Feb 18, 2022
    + more versions
    Cite
    PromptCloud (2022). Job Data USA CareerBuilder [Dataset]. https://www.kaggle.com/promptcloud/job-data-usa-careerbuilder
    Available download formats: zip (52064933 bytes)
    Dataset updated
    Feb 18, 2022
    Authors
    PromptCloud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Context

    This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.

    Content

    Total Records Count: 2470771
    Domain Name: careerbuilder.usa.com
    Date Range: 01st Jul 2021 - 30th Sep 2021
    File Extension: ldjson

    Available Fields : url, job_title, category, company_name, logo_url, city, state, country, post_date, test_months_of_experience, test_educational_credential, occupation_category, job_description, job_type, valid_through, html_job_description, extra_fields, test_onetsoc_code, test_onetsoc_name, uniq_id, crawl_timestamp, apply_url, job_board, geo, job_post_lang, inferred_iso2_lang_code, is_remote, test1_cities, test1_states, test1_countries, site_name, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, fitness_score    

    Acknowledgements

    We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and live job data from JobsPikr.

    Inspiration

    This dataset was created keeping in mind our data scientists and researchers across the world.

  7. Reorganized2 Dataset

    • universe.roboflow.com
    zip
    Updated Apr 27, 2023
    Cite
    bruce baur (2023). Reorganized2 Dataset [Dataset]. https://universe.roboflow.com/bruce-baur/reorganized2/dataset/1
    Available download formats: zip
    Dataset updated
    Apr 27, 2023
    Dataset authored and provided by
    bruce baur
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Webpages Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Web Accessibility Analysis: This model can be used to analyze the accessibility of web pages by identifying different elements and ensuring they follow good practices in design and user accessibility standards, such as having appropriate contrast between text and image, or usage of icons and buttons for UI/UX.

    2. Web Page Redesign: By identifying the classes of elements on a webpage, "Reorganized2" could be used by designers and developers to analyze a current website layout and assist in redesigning a more intuitive and user-friendly interface.

    3. UX Research and Testing: The model can be utilized in user experience (UX) research. It can help in identifying which elements (buttons, icons, dropdowns) on a webpage are getting more attention thus allowing UX designers to create more effective webpages.

    4. Web Scraping: In the field of data mining, the model can serve as a smart web scraper, identifying different elements on a page, thus making web scraping more efficient and targeted rather than pulling irrelevant information.

    5. E-commerce Optimization: "Reorganized2" can be used to analyze various e-commerce websites, spotting common design features amongst the most successful ones, especially regarding the usage and placement of 'cart', 'field', and 'dropdown' elements. These insights can be used to optimize other online retail sites.

  8. Data Driven Start-Ups

    • data.mendeley.com
    Updated Aug 30, 2024
    Cite
    Denilton Darold (2024). Data Driven Start-Ups [Dataset]. http://doi.org/10.17632/9czp5vg5ym.1
    Dataset updated
    Aug 30, 2024
    Authors
    Denilton Darold
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Web-scraped text of data-driven start-ups founded between 2010 and 2023. The data was used for data-driven business model analysis, identifying emergent trends and business model transformation over time. The dataset contains the text split into sentences alongside a reference text (description) and the respective embeddings. The foundation model used for the embeddings is paraphrase-multilingual-MiniLM-L12-v2.

    The data collection process not only respected websites' privacy but also adhered to best practices. The scraper tool was configured to read the robots.txt file at the root of each website and proceed only with actions explicitly allowed by the respective site. Additionally, the collection was limited to 50 pages per firm to avoid excessive harvesting.
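    The collection policy described above can be sketched with Python's standard urllib.robotparser module. This is an illustration only, not the authors' scraper: the user agent name, the example.com URLs, and the robots.txt content are all hypothetical, and the sketch uses the standard robots.txt reading in which any URL not disallowed is treated as permitted.

```python
from urllib.robotparser import RobotFileParser

MAX_PAGES_PER_FIRM = 50  # per-firm cap stated in the description


def allowed_urls(robots_txt: str, urls: list[str],
                 user_agent: str = "example-bot") -> list[str]:
    """Keep URLs that robots.txt permits, capped at MAX_PAGES_PER_FIRM."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as lines, so no network access
    # is needed for this sketch.
    parser.parse(robots_txt.splitlines())
    permitted = [url for url in urls if parser.can_fetch(user_agent, url)]
    return permitted[:MAX_PAGES_PER_FIRM]


# Hypothetical robots.txt and candidate URLs for one firm.
robots = """User-agent: *
Disallow: /private/
"""
candidates = [
    "https://example.com/about",
    "https://example.com/private/data",
] + [f"https://example.com/page/{i}" for i in range(100)]

selected = allowed_urls(robots, candidates)
```

    In a live crawler the parser would instead be pointed at each site's own robots.txt with set_url() and read() before any page is requested.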

  9. Amazon Product Listing Dataset

    • kaggle.com
    zip
    Updated Oct 12, 2021
    + more versions
    Cite
    PromptCloud (2021). Amazon Product Listing Dataset [Dataset]. https://www.kaggle.com/promptcloud/amazon-product-listing-dataset
    Available download formats: zip (4267812 bytes)
    Dataset updated
    Oct 12, 2021
    Authors
    PromptCloud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.

    Content

    Total Records Count: 715945
    Domain Name: amazon.com
    Date Range: 01st Nov 2020 - 31st Dec 2020
    File Extension: csv

    Available Fields : Uniq Id, Crawl Timestamp, Pageurl, Website, Title, Num Of Reviews, Average Rating, Number Of Ratings, Model Num, Sku, Upc, Manufacturer, Model Name, Price, Monthly Price, Stock, Carrier, Color Category, Internal Memory, Screen Size, Specifications, Five Star, Four Star, Three Star, Two Star, One Star, Broken Link, Discontinued 

    Acknowledgements

    We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock.

    Inspiration

    This dataset was created keeping in mind our data scientists and researchers across the world.

  10. Sentiment Analysis ChatGPT YouTube Comments Dataset

    • data.mendeley.com
    • kaggle.com
    Updated Jun 14, 2024
    Cite
    Arif Dwi Nugroho (2024). Sentiment Analysis ChatGPT YouTube Comments Dataset [Dataset]. http://doi.org/10.17632/4vkdjfc4v2.1
    Dataset updated
    Jun 14, 2024
    Authors
    Arif Dwi Nugroho
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    This dataset of Indonesian-language YouTube comments about ChatGPT was obtained by web scraping the video page "ChatGPT dan Masa Depan Pekerjaan Kita". It contains 1249 records, each consisting of Comment attributes.

  11. China Coal Mining Energy Consumption Dataset

    • scidb.cn
    Updated Oct 29, 2024
    Cite
    Gang Lin (2024). China Coal Mining Energy Consumption Dataset [Dataset]. http://doi.org/10.57760/sciencedb.15634
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Gang Lin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    China's coal mining industry boasts a long history, with most mines entering mid to late stages of development. Consequently, many coal resource-dependent cities are facing gradual resource depletion, leading to a proliferation of internal issues and pronounced conflicts across various dimensions. Annually, significant wastage of energy resources occurs due to coal mining in these cities. Understanding the spatial distribution of energy consumption from coal mining is crucial for advancing energy system transformation and planning future coal mining strategies. This study employs web scraping techniques to gather coal production data from coal resource cities across China. Using a top-down approach, the comprehensive energy consumption based on coal production was estimated, and consequently, the energy consumption per unit of coal production for each province was calculated. The results indicate that cities with energy consumption from coal mining exceeding 5 million tonnes of standard coal are primarily concentrated in the eastern regions and the Jin-Shan-Mongolia-Ningxia-Gansu area.

  12. Performance of relationship extraction.

    • plos.figshare.com
    xls
    Updated May 29, 2025
    + more versions
    Cite
    Bei Li; Changbiao Li; Jianwei Sun; Xu Zeng; Xiaofan Chen; Jing Zheng (2025). Performance of relationship extraction. [Dataset]. http://doi.org/10.1371/journal.pone.0325082.t004
    Available download formats: xls
    Dataset updated
    May 29, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Bei Li; Changbiao Li; Jianwei Sun; Xu Zeng; Xiaofan Chen; Jing Zheng
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The field of information extraction (IE) is currently exploring more versatile and efficient methods for minimization of reliance on extensive annotated datasets and integration of knowledge across tasks and domains.

    Objective: We aim to evaluate and refine the application of the universal IE (UIE) technology in the field of Chinese medical expertise in terms of processing accuracy and efficiency.

    Methods: Our model integrates ontology modeling, web scraping, UIE, fine-tuning strategies, and graph databases, thereby covering knowledge modeling, extraction, and storage techniques. The Enhanced Representation through Knowledge Integration-UIE (ERNIE-UIE) model is fine-tuned and optimized using a small amount of annotated data. A medical knowledge graph is then constructed, followed by validating the graph and conducting knowledge mining on the data stored within it.

    Results: Incorporating the characteristics of whole-course management, we implemented a comprehensive medical knowledge graph–construction model and methodology. Entities and relationships were jointly extracted using the pretrained language model, resulting in 8,525 entity data points and 9,522 triple data points. The accuracy of the knowledge graph was verified using graph algorithms.

    Conclusion: We optimized the construction process of a Chinese medical knowledge graph with minimal annotated data by utilizing a generative extraction paradigm, validating the graph’s efficacy and achieving commendable results. This approach addresses the challenge of insufficient annotated training corpora in low-resource knowledge graph construction, thereby contributing to cost savings in the development of knowledge graphs.

  13. monster_uk-monster_uk_job

    • kaggle.com
    zip
    Updated Aug 26, 2022
    + more versions
    Cite
    PromptCloud (2022). monster_uk-monster_uk_job [Dataset]. https://www.kaggle.com/datasets/promptcloud/monster-ukmonster-uk-job
    Available download formats: zip (16422 bytes)
    Dataset updated
    Aug 26, 2022
    Authors
    PromptCloud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United Kingdom
    Description

    Context

    This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.

    Total Records Count: 192481
    Domain Name: Monster.uk.com
    Date Range: 01st April 2022 - 31st June 2022
    File Extension: ldjson

    Available Fields : url, job_title, category, company_name, city, state, country, post_date, occupation_category, job_description, job_type, valid_through, html_job_description, extra_fields, uniq_id, crawl_timestamp, job_board, geo, job_post_lang, inferred_iso2_lang_code, is_remote, test1_cities, test1_states, test1_countries, site_name, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, ijp_reprocessed_flag_1, ijp_reprocessed_flag_2, ijp_reprocessed_flag_3, ijp_is_production_ready, fitness_score  

    Acknowledgements

    We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and live job data from JobsPikr.

    Inspiration

    This dataset was created keeping in mind our data scientists and researchers across the world.

  14. Comparison of training results of different models.

    • plos.figshare.com
    xls
    Updated May 29, 2025
    + more versions
    Cite
    Bei Li; Changbiao Li; Jianwei Sun; Xu Zeng; Xiaofan Chen; Jing Zheng (2025). Comparison of training results of different models. [Dataset]. http://doi.org/10.1371/journal.pone.0325082.t002
    Available download formats: xls
    Dataset updated
    May 29, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Bei Li; Changbiao Li; Jianwei Sun; Xu Zeng; Xiaofan Chen; Jing Zheng
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of training results of different models.

  15. Data from: The geography of digital and green (twin) firms in Germany

    • tandf.figshare.com
    html
    Updated Jun 12, 2025
    Cite
    Lukas Kriesch; Milad Abbasiharofteh; Sebastian Losacker (2025). The geography of digital and green (twin) firms in Germany [Dataset]. http://doi.org/10.6084/m9.figshare.29301095.v1
    Available download formats: html
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Lukas Kriesch; Milad Abbasiharofteh; Sebastian Losacker
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Germany
    Description

    The twin transition, which combines green and digital innovation in economic activities, is increasingly central to policy agendas and is also receiving growing attention in regional research. However, accurately mapping green, digital and twin (both green and digital) economic activities across regions remains challenging, particularly due to data constraints. In this study, we advance this research frontier and present a geographic analysis of digital, green and twin economic activities in Germany, using a web-mined dataset of website texts from 678,381 firms, collected through web scraping in 2023. By processing over 44 million text paragraphs from these websites and applying a cosine similarity filter with green and AI-related terms, we filtered firms that are likely engaged in green, digital and twin activities. Based on this subset, 1437 text paragraphs were manually annotated to fine-tune two transformer models within a SetFit framework, accurately classifying firms as green, digital or both. We aggregate this firm-level data into hexagonal cells to reveal the geographic concentration of the twin transition in Germany. The final map shows a higher number of firms involved in green activities, widely spread across Germany, while AI activities are concentrated in urban centres. We identify 23,819 firms engaged in both green and digital activities, with major hubs like Berlin and Munich leading, and peripheral regions potentially being left behind. Our findings offer critical insights into the geography of the twin transition and highlight the need for policies that address potentially induced spatial inequalities.
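    The cosine similarity filter mentioned above can be illustrated with a small sketch. The vectors, labels, and the 0.8 threshold below are invented for illustration. The description does not state which embeddings or cut-off the authors used, so this shows only the general technique of keeping paragraphs whose vector is close in direction to a term vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def filter_paragraphs(paragraph_vecs: dict[str, list[float]],
                      term_vec: list[float],
                      threshold: float) -> list[str]:
    """Return labels of paragraphs whose similarity meets the threshold."""
    return [
        label
        for label, vec in paragraph_vecs.items()
        if cosine_similarity(vec, term_vec) >= threshold
    ]


# Toy 3-dimensional "embeddings"; real text embeddings have many more dimensions.
toy_paragraphs = {
    "solar": [0.9, 0.1, 0.0],
    "bakery": [0.0, 0.2, 0.9],
    "wind": [0.8, 0.3, 0.1],
}
green_term = [1.0, 0.0, 0.0]  # hypothetical vector for a "green" term

kept = filter_paragraphs(toy_paragraphs, green_term, threshold=0.8)
```

    Here the "solar" and "wind" vectors point in nearly the same direction as the term vector and pass the filter, while "bakery" is orthogonal to it and is dropped.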

  16. Job Data USA Indeed

    • kaggle.com
    zip
    Updated Feb 16, 2022
    + more versions
    Cite
    PromptCloud (2022). Job Data USA Indeed [Dataset]. https://www.kaggle.com/promptcloud/job-data-usa-indeed
    Available download formats: zip (57406403 bytes)
    Dataset updated
    Feb 16, 2022
    Authors
    PromptCloud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Context

    This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.

    Content

    Total Records Count: 3907506
    Domain Name: indeed.usa.com
    Date Range: 01st Jul 2021 - 30th Sep 2021
    File Extension: ldjson

    Available Fields : uniq_id, crawl_timestamp, url, job_title, category, company_name, logo_url, city, state, country, post_date, job_description, job_type, apply_url, company_description, job_board, geo, job_post_lang, inferred_iso2_lang_code, extra_fields, is_remote, test1_cities, test1_states, test1_countries, site_name, html_job_description, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, fitness_score    

    Acknowledgements

    We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and live job data from JobsPikr.

    Inspiration

    This dataset was created keeping in mind our data scientists and researchers across the world.

  17. Walmart Product Data

    • kaggle.com
    zip
    Updated Nov 9, 2021
    + more versions
    Cite
    PromptCloud (2021). Walmart Product Data [Dataset]. https://www.kaggle.com/promptcloud/walmart-product-data
    Available download formats: zip (6360417 bytes)
    Dataset updated
    Nov 9, 2021
    Authors
    PromptCloud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.

    Content

    Total Records Count: 639987
    Domain Name: walmart.com
    Date Range: 01st Jan 2021 - 31st Mar 2021
    File Extension: csv

    Available Fields : Uniq Id, Crawl Timestamp, Pageurl, Website, Title, Num Of Reviews, Average Rating, Number Of Ratings, Model Num, Sku, Upc, Manufacturer, Model Name, Price, Monthly Price, Stock, Carrier, Color Category, Internal Memory, Screen Size, Specifications, Five Star, Four Star, Three Star, Two Star, One Star, Discontinued, Broken Link, Joining Key  

    Acknowledgements

    We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock.

    Inspiration

    This dataset was created keeping in mind our data scientists and researchers across the world.

  18. Medicinal Plant Leaf Dataset with name table(mostly found in Paschim...

    • data.mendeley.com
    Updated Apr 29, 2024
    Cite
    Sheetal Patil (2024). Medicinal Plant Leaf Dataset with name table(mostly found in Paschim Maharashtra.) [Dataset]. http://doi.org/10.17632/xzy9mh2z65.1
    Dataset updated
    Apr 29, 2024
    Authors
    Sheetal Patil
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Paschim Maharashtra, Maharashtra
    Description

    Digitizing a medicinal plant leaf dataset enhances its utility, accessibility, and potential for research and innovation in botany, pharmacology, and natural medicine. Digital datasets enable advanced analysis techniques such as machine learning, statistical analysis, and data mining, which let researchers uncover patterns, correlations, and trends within the data. With advances in technology and analytical techniques, future generations can use this dataset to identify potential drug candidates from natural sources. By studying the chemical composition and biological activity of medicinal leaves, they can develop new pharmaceuticals with improved efficacy and fewer side effects.

    The dataset consists of 45 classes of plant species found in Paschim Maharashtra, totaling around 8000 images. The images were captured using a Redmi K50 with a 64 MP camera. The dataset was likely compiled through a combination of methods, including manual collection and web scraping. Each plant species is associated with its common name and medicinal significance. The dataset is a resource for researchers and enthusiasts interested in studying the medicinal properties of plant species native to the region.

    Accurate classification of medicinal leaves helps identify plants with therapeutic properties, which is crucial for Ayush practitioners who rely on specific plants to prepare herbal medicines and remedies. The dataset can be used to train machine learning models for image classification: given labeled images of medicinal plants, a model can learn to classify new images into one of the predefined classes. This can aid automated identification of plant species, which is useful for botanists, pharmacologists, and herbalists.

  19. Product Listing Walmart

    • kaggle.com
    zip
    Updated Jan 13, 2022
    Cite
    PromptCloud (2022). Product Listing Walmart [Dataset]. https://www.kaggle.com/datasets/promptcloud/product-listing-walmart/discussion
    Explore at:
    Available download formats
    zip (6419348 bytes)
    Dataset updated
    Jan 13, 2022
    Authors
    PromptCloud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.

    Content

    Total Records Count: 439956
    Domain Name: walmart.com
    Date Range: 01st Apr 2021 - 30th Apr 2021
    File Extension: csv

    Available Fields : Uniq Id, Crawl Timestamp, Pageurl, Website, Title, Num Of Reviews, Average Rating, Number Of Ratings, Model Num, Sku, Upc, Manufacturer, Model Name, Price, Monthly Price, Stock, Carrier, Color Category, Internal Memory, Screen Size, Specifications, Five Star, Four Star, Three Star, Two Star, One Star, Discontinued, Broken Link, Joining Key    

    Acknowledgements

    We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock.

    Inspiration

    This dataset was created with data scientists and researchers across the world in mind.

  20. Walmart Product Review Data

    • kaggle.com
    zip
    Updated Oct 22, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    PromptCloud (2021). Walmart Product Review Data [Dataset]. https://www.kaggle.com/promptcloud/walmart-product-review-data
    Explore at:
    Available download formats
    zip (3487110 bytes)
    Dataset updated
    Oct 22, 2021
    Authors
    PromptCloud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 5K records. You can download the full dataset here.

    Content

    Total Records Count: 53584
    Domain Name: walmart.com
    Date Range: 01st Jan 2021 - 28th Feb 2021
    File Extension: tsv

    Available Fields : Uniq Id, Crawl Timestamp, Product Id, Product Company Type Source, Product Category Group Code, Product Category Code, Product Market Code, Product Sector Code, Product Brand Code, Retailer, Product Category, Product Brand, Product Name, Product Price, Sku, Upc, Product Url, Market, Product Description, Product Currency, Product Available Inventory, Product Image Url, Product Model Number, Product Tags, Product Contents, Product Rating, Product Reviews Count, Bsr, Joining Key, Expected Category Count, Expected Brand Count 

    Acknowledgements

    We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock.

    Inspiration

    This dataset was created with data scientists and researchers across the world in mind.

Cite
PromptCloud (2022). MONSTER_JOB_POSTING_USA [Dataset]. https://www.kaggle.com/datasets/promptcloud/monster-job-posting-usa

MONSTER_JOB_POSTING_USA

This dataset includes job data from monster USA

Explore at:
Available download formats
zip (16498 bytes)
Dataset updated
Aug 2, 2022
Authors
PromptCloud
License

https://creativecommons.org/publicdomain/zero/1.0/

Area covered
United States
Description

Context

This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.

Content

Total Records Count: 1093713
Domain Name: monter.usa.com
Date Range: 01st April 2022 - 31st June 2022
File Extension: ldjson

Available Fields : url, job_title, category, company_name, city, state, country, post_date, occupation_category, job_description, job_type, valid_through, html_job_description, extra_fields, uniq_id, crawl_timestamp, job_board, geo, job_post_lang, inferred_iso2_lang_code, is_remote, test1_cities, test1_states, test1_countries, site_name, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, ijp_reprocessed_flag_1, ijp_reprocessed_flag_2, ijp_reprocessed_flag_3, ijp_is_production_ready, fitness_score  

Acknowledgements

We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and the live job data from JobsPikr.

Inspiration

This dataset was created with data scientists and researchers across the world in mind.
