https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.
Total Records Count: 1093713
Domain Name: monter.usa.com
Date Range: 1st April 2022 - 30th June 2022
File Extension: ldjson
Available Fields : url, job_title, category, company_name, city, state, country, post_date, occupation_category, job_description, job_type, valid_through, html_job_description, extra_fields, uniq_id, crawl_timestamp, job_board, geo, job_post_lang, inferred_iso2_lang_code, is_remote, test1_cities, test1_states, test1_countries, site_name, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, ijp_reprocessed_flag_1, ijp_reprocessed_flag_2, ijp_reprocessed_flag_3, ijp_is_production_ready, fitness_score
We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and the live job data from JobsPikr.
This dataset was created keeping in mind our data scientists and researchers across the world.
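Since the file extension is ldjson (line-delimited JSON, one record per line), a reader might load it as sketched below. This is only an illustration: the two sample records are hypothetical, the field names are taken from the Available Fields list, and the assumption that `is_remote` holds a JSON boolean is not confirmed by the listing.

```python
import io
import json


def iter_ldjson(lines):
    """Yield one dict per non-empty line of a line-delimited JSON stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records; real values and types in the dataset may differ.
sample = io.StringIO(
    '{"uniq_id": "a1", "job_title": "Data Analyst", "is_remote": true}\n'
    '{"uniq_id": "b2", "job_title": "Nurse", "is_remote": false}\n'
)

records = list(iter_ldjson(sample))
# Assumes is_remote is a boolean; adjust if it is stored as text.
remote_titles = [r["job_title"] for r in records if r.get("is_remote")]
print(remote_titles)
```

For a real file, the same generator can be fed an open file handle instead of the in-memory sample, which avoids loading all records at once.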
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The analysis of occupants’ perception can improve building indoor environmental quality (IEQ). Going beyond conventional surveys, this study presents an innovative analysis of occupants’ feedback about the IEQ of different workplaces based on web-scraping and text-mining of online job reviews. A total of 1,158,706 job reviews posted on Glassdoor about 257 large organizations (with more than 10,000 employees) are scraped and analyzed. Within these reviews, 10,593 include complaints about at least one IEQ aspect. The analysis of this large volume of feedback referring to several workplaces is the first of its kind and leads to two main results: (1) IEQ complaints mostly arise in workplaces that are not office buildings, especially regarding poor thermal and indoor air quality conditions in warehouses, stores, kitchens, and trucks; (2) reviews containing IEQ complaints are more negative than reviews without IEQ complaints. The first result highlights the need for IEQ investigations beyond office buildings. The second result underscores the potential detrimental effect that uncomfortable IEQ conditions can have on job satisfaction. This study demonstrates the potential of User-Generated Content and text-mining techniques to analyze the IEQ of workplaces as an alternative to conventional surveys, for scientific and practical purposes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A simple web page containing Fisher's Iris Dataset.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.
Total Records Count: 254888
Domain Name: monster.usa.com
Date Range: 1st Jul 2021 - 30th Sep 2021
File Extension: ldjson
Available Fields : url, job_title, category, industry, company_name, logo_url, city, state, country, post_date, occupation_category, job_description, job_type, valid_through, html_job_description, extra_fields, uniq_id, crawl_timestamp, apply_url, job_board, geo, is_remote, test_contact_email, contact_email, test1_cities, test1_states, test1_countries, site_name, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, fitness_score
We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and the live job data from JobsPikr.
This dataset was created keeping in mind our data scientists and researchers across the world.
The vastness of materials space, particularly that which is concerned with metal–organic frameworks (MOFs), creates the critical problem of performing efficient identification of promising materials for specific applications. Although high-throughput computational approaches, including the use of machine learning, have been useful in rapid screening and rational design of MOFs, they tend to neglect descriptors related to their synthesis. One way to improve the efficiency of MOF discovery is to data-mine published MOF papers to extract the materials informatics knowledge contained within journal articles. Here, by adapting the chemistry-aware natural language processing tool, ChemDataExtractor (CDE), we generated an open-source database of MOFs focused on their synthetic properties: the DigiMOF database. Using the CDE web scraping package alongside the Cambridge Structural Database (CSD) MOF subset, we automatically downloaded 43,281 unique MOF journal articles, extracted 15,501 unique MOF materials, and text-mined over 52,680 associated properties including the synthesis method, solvent, organic linker, metal precursor, and topology. Additionally, we developed an alternative data extraction technique to obtain and transform the chemical names assigned to each CSD entry in order to determine linker types for each structure in the CSD MOF subset. This data enabled us to match MOFs to a list of known linkers provided by Tokyo Chemical Industry UK Ltd. (TCI) and analyze the cost of these important chemicals. This centralized, structured database reveals the MOF synthetic data embedded within thousands of MOF publications and contains further topology, metal type, accessible surface area, largest cavity diameter, pore limiting diameter, open metal sites, and density calculations for all 3D MOFs in the CSD MOF subset.
The DigiMOF database and associated software are publicly available for other researchers to rapidly search for MOFs with specific properties, conduct further analysis of alternative MOF production pathways, and create additional parsers to search for additional desirable properties.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.
Total Records Count: 2470771
Domain Name: careerbuilder.usa.com
Date Range: 1st Jul 2021 - 30th Sep 2021
File Extension: ldjson
Available Fields : url, job_title, category, company_name, logo_url, city, state, country, post_date, test_months_of_experience, test_educational_credential, occupation_category, job_description, job_type, valid_through, html_job_description, extra_fields, test_onetsoc_code, test_onetsoc_name, uniq_id, crawl_timestamp, apply_url, job_board, geo, job_post_lang, inferred_iso2_lang_code, is_remote, test1_cities, test1_states, test1_countries, site_name, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, fitness_score
We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and the live job data from JobsPikr.
This dataset was created keeping in mind our data scientists and researchers across the world.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Web Accessibility Analysis: This model can be used to analyze the accessibility of web pages by identifying different elements and ensuring they follow good practices in design and user accessibility standards, such as having appropriate contrast between text and image, or usage of icons and buttons for UI/UX.
Web Page Redesign: By identifying the classes of elements on a webpage, "Reorganized2" could be used by designers and developers to analyze a current website layout and assist in redesigning a more intuitive and user-friendly interface.
UX Research and Testing: The model can be utilized in user experience (UX) research. It can help in identifying which elements (buttons, icons, dropdowns) on a webpage are getting more attention thus allowing UX designers to create more effective webpages.
Web Scraping: In the field of data mining, the model can serve as a smart web scraper, identifying different elements on a page, thus making web scraping more efficient and targeted rather than pulling irrelevant information.
E-commerce Optimization: "Reorganized2" can be used to analyze various e-commerce websites, spotting common design features amongst the most successful ones, especially regarding the usage and placement of 'cart', 'field', and 'dropdown' elements. These insights can be used to optimize other online retail sites.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web-scraped text of data-driven start-ups founded between 2010 and 2023. The data was used to analyze data-driven business models, identifying emergent trends and the transformation of business models over time. The dataset contains the text split into sentences, alongside a reference text (description) and the respective embeddings. The foundation model used for the embeddings is paraphrase-multilingual-MiniLM-L12-v2.
The data collection process not only respected websites' privacy but also adhered to best practices. The scraper tool was configured to read the robots.txt file at the root of each website and proceed only with actions explicitly allowed by the respective site. Additionally, the collection was limited to 50 pages per firm to avoid excessive harvesting.
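The robots.txt check and the 50-page cap described above could look roughly like the following sketch. It is not the authors' scraper: the function name `allowed_urls`, the user agent `example-bot`, and the example.com URLs are all hypothetical, and the rules are parsed from an in-memory list rather than fetched. Note also that Python's standard `urllib.robotparser` treats paths with no matching rule as allowed, which is looser than "explicitly allowed".

```python
from urllib.robotparser import RobotFileParser

MAX_PAGES_PER_FIRM = 50  # cap stated in the dataset description


def allowed_urls(robots_lines, candidate_urls,
                 user_agent="example-bot", limit=MAX_PAGES_PER_FIRM):
    """Return at most `limit` URLs that the robots.txt rules permit."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules supplied as lines; no network access
    kept = []
    for url in candidate_urls:
        if len(kept) >= limit:
            break
        if parser.can_fetch(user_agent, url):
            kept.append(url)
    return kept


# Hypothetical robots.txt content and candidate pages for one firm.
robots = ["User-agent: *", "Disallow: /private/"]
urls = ["https://example.com/about", "https://example.com/private/team"]
urls += [f"https://example.com/page/{i}" for i in range(100)]

selected = allowed_urls(robots, urls)
print(len(selected))
```

In a live crawler, `RobotFileParser.set_url()` followed by `read()` would fetch the real robots.txt from the root of each site instead of the in-memory lines used here.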
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.
Total Records Count: 715945
Domain Name: amazon.com
Date Range: 1st Nov 2020 - 31st Dec 2020
File Extension: csv
Available Fields : Uniq Id, Crawl Timestamp, Pageurl, Website, Title, Num Of Reviews, Average Rating, Number Of Ratings, Model Num, Sku, Upc, Manufacturer, Model Name, Price, Monthly Price, Stock, Carrier, Color Category, Internal Memory, Screen Size, Specifications, Five Star, Four Star, Three Star, Two Star, One Star, Broken Link, Discontinued
We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock.
This dataset was created keeping in mind our data scientists and researchers across the world.
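For the csv sample, the column names in the Available Fields list contain spaces, so reading rows as dictionaries keyed by header is a convenient starting point. The sketch below uses two hypothetical rows and a hypothetical helper `to_float`; the actual value formats (for example, how missing ratings are encoded) are assumptions.

```python
import csv
import io


def to_float(text):
    """Convert a cell to float, returning None for empty or non-numeric text."""
    try:
        return float(text)
    except ValueError:
        return None


# Hypothetical rows using a subset of the listed column names.
sample = io.StringIO(
    "Uniq Id,Title,Average Rating,Discontinued\n"
    "u1,Phone A,4.5,false\n"
    "u2,Phone B,,true\n"
)

rows = list(csv.DictReader(sample))
ratings = [to_float(row["Average Rating"]) for row in rows]
print(ratings)
```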
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset of Indonesian-language YouTube comments about ChatGPT was obtained by web scraping the video page "ChatGPT dan Masa Depan Pekerjaan Kita". It contains 1249 records consisting of Comment attributes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
China's coal mining industry boasts a long history, with most mines entering mid to late stages of development. Consequently, many coal resource-dependent cities are facing gradual resource depletion, leading to a proliferation of internal issues and pronounced conflicts across various dimensions. Annually, significant wastage of energy resources occurs due to coal mining in these cities. Understanding the spatial distribution of energy consumption from coal mining is crucial for advancing energy system transformation and planning future coal mining strategies. This study employs web scraping techniques to gather coal production data from coal resource cities across China. Using a top-down approach, the comprehensive energy consumption based on coal production was estimated, and consequently, the energy consumption per unit of coal production for each province was calculated. The results indicate that cities with energy consumption from coal mining exceeding 5 million tonnes of standard coal are primarily concentrated in the eastern regions and the Jin-Shan-Mongolia-Ningxia-Gansu area.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The field of information extraction (IE) is currently exploring more versatile and efficient methods for minimization of reliance on extensive annotated datasets and integration of knowledge across tasks and domains.
Objective: We aim to evaluate and refine the application of the universal IE (UIE) technology in the field of Chinese medical expertise in terms of processing accuracy and efficiency.
Methods: Our model integrates ontology modeling, web scraping, UIE, fine-tuning strategies, and graph databases, thereby covering knowledge modeling, extraction, and storage techniques. The Enhanced Representation through Knowledge Integration-UIE (ERNIE-UIE) model is fine-tuned and optimized using a small amount of annotated data. A medical knowledge graph is then constructed, followed by validating the graph and conducting knowledge mining on the data stored within it.
Results: Incorporating the characteristics of whole-course management, we implemented a comprehensive medical knowledge graph–construction model and methodology. Entities and relationships were jointly extracted using the pretrained language model, resulting in 8,525 entity data points and 9,522 triple data points. The accuracy of the knowledge graph was verified using graph algorithms.
Conclusion: We optimized the construction process of a Chinese medical knowledge graph with minimal annotated data by utilizing a generative extraction paradigm, validating the graph’s efficacy and achieving commendable results. This approach addresses the challenge of insufficient annotated training corpora in low-resource knowledge graph construction, thereby contributing to cost savings in the development of knowledge graphs.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.
Total Records Count: 192481
Domain Name: Monster.uk.com
Date Range: 1st April 2022 - 30th June 2022
File Extension: ldjson
Available Fields : url, job_title, category, company_name, city, state, country, post_date, occupation_category, job_description, job_type, valid_through, html_job_description, extra_fields, uniq_id, crawl_timestamp, job_board, geo, job_post_lang, inferred_iso2_lang_code, is_remote, test1_cities, test1_states, test1_countries, site_name, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, ijp_reprocessed_flag_1, ijp_reprocessed_flag_2, ijp_reprocessed_flag_3, ijp_is_production_ready, fitness_score
We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and the live job data from JobsPikr.
This dataset was created keeping in mind our data scientists and researchers across the world.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of training results of different models.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The twin transition, which combines green and digital innovation in economic activities, is increasingly central to policy agendas and is also receiving growing attention in regional research. However, accurately mapping green, digital and twin (both green and digital) economic activities across regions remains challenging, particularly due to data constraints. In this study, we advance this research frontier and present a geographic analysis of digital, green and twin economic activities in Germany, using a web-mined dataset of website texts from 678,381 firms, collected through web scraping in 2023. By processing over 44 million text paragraphs from these websites and applying a cosine similarity filter with green and AI-related terms, we filtered firms that are likely engaged in green, digital and twin activities. Based on this subset, 1437 text paragraphs were manually annotated to fine-tune two transformer models within a SetFit framework, accurately classifying firms as green, digital or both. We aggregate this firm-level data into hexagonal cells to reveal the geographic concentration of the twin transition in Germany. The final map shows a higher number of firms involved in green activities, widely spread across Germany, while AI activities are concentrated in urban centres. We identify 23,819 firms engaged in both green and digital activities, with major hubs like Berlin and Munich leading, and peripheral regions potentially being left behind. Our findings offer critical insights into the geography of the twin transition and highlight the need for policies that address potentially induced spatial inequalities.
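The cosine similarity filter mentioned above can be illustrated with a small sketch. It assumes paragraphs and the reference terms have already been turned into numeric vectors by some embedding step, which the abstract does not specify; the function names, the toy two-dimensional vectors, and the 0.7 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(paragraph_vectors, term_vector, threshold):
    """Return indices of paragraphs whose similarity to the term vector meets the threshold."""
    return [
        i for i, vec in enumerate(paragraph_vectors)
        if cosine_similarity(vec, term_vector) >= threshold
    ]


# Toy vectors standing in for real embeddings.
term = [1.0, 0.0]
paragraphs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
kept = filter_by_similarity(paragraphs, term, threshold=0.7)
print(kept)
```

Because cosine similarity ignores vector length, the first paragraph scores 1.0 despite its larger magnitude, the orthogonal second one scores 0.0, and the third scores about 0.707, so only the first and third pass the threshold.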
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.
Total Records Count: 3907506
Domain Name: indeed.usa.com
Date Range: 1st Jul 2021 - 30th Sep 2021
File Extension: ldjson
Available Fields : uniq_id, crawl_timestamp, url, job_title, category, company_name, logo_url, city, state, country, post_date, job_description, job_type, apply_url, company_description, job_board, geo, job_post_lang, inferred_iso2_lang_code, extra_fields, is_remote, test1_cities, test1_states, test1_countries, site_name, html_job_description, domain, postdate_yyyymmdd, predicted_language, inferred_iso3_lang_code, test1_inferred_city, test1_inferred_state, test1_inferred_country, inferred_city, inferred_state, inferred_country, has_expired, last_expiry_check_date, latest_expiry_check_date, dataset, postdate_in_indexname_format, segment_name, duplicate_status, job_desc_char_count, fitness_score
We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock, and the live job data from JobsPikr.
This dataset was created keeping in mind our data scientists and researchers across the world.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.
Total Records Count: 639987
Domain Name: walmart.com
Date Range: 1st Jan 2021 - 31st Mar 2021
File Extension: csv
Available Fields : Uniq Id, Crawl Timestamp, Pageurl, Website, Title, Num Of Reviews, Average Rating, Number Of Ratings, Model Num, Sku, Upc, Manufacturer, Model Name, Price, Monthly Price, Stock, Carrier, Color Category, Internal Memory, Screen Size, Specifications, Five Star, Four Star, Three Star, Two Star, One Star, Discontinued, Broken Link, Joining Key
We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock.
This dataset was created keeping in mind our data scientists and researchers across the world.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Digitizing a medicinal plant leaf dataset enhances its utility, accessibility, and potential for research and innovation in the fields of botany, pharmacology, and natural medicine. Digital datasets enable advanced data analysis techniques, such as machine learning algorithms, statistical analysis, and data mining. Researchers can uncover patterns, correlations, and trends within the dataset, leading to new insights and discoveries. With advancements in technology and analytical techniques, future generations can leverage this dataset to identify potential drug candidates from natural sources. By studying the chemical composition and biological activity of medicinal leaves, they can develop new pharmaceuticals with improved efficacy and fewer side effects. The dataset consists of 45 classes of plant species found in Paschim Maharashtra, totaling around 8000 images. These images were captured using a Redmi K50 with a 64 MP camera. The dataset was likely compiled through a combination of methods, including manual collection and web scraping. Each plant species is associated with its common name and medicinal significance. This dataset serves as a valuable resource for researchers and enthusiasts interested in studying the medicinal properties of various plant species native to the region. Accurate classification of medicinal leaves helps in identifying plants with therapeutic properties. This is crucial for Ayush practitioners who rely on specific plants for preparing herbal medicines and remedies. The dataset can be used to train machine learning models for image classification tasks. Given labeled images of medicinal plants, a model can learn to classify new images into one of the predefined classes. This can aid in automated identification of plant species, which is useful for botanists, pharmacologists, and herbalists.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 30K records. You can download the full dataset here.
Total Records Count: 439956
Domain Name: walmart.com
Date Range: 1st Apr 2021 - 30th Apr 2021
File Extension: csv
Available Fields : Uniq Id, Crawl Timestamp, Pageurl, Website, Title, Num Of Reviews, Average Rating, Number Of Ratings, Model Num, Sku, Upc, Manufacturer, Model Name, Price, Monthly Price, Stock, Carrier, Color Category, Internal Memory, Screen Size, Specifications, Five Star, Four Star, Three Star, Two Star, One Star, Discontinued, Broken Link, Joining Key
We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock.
This dataset was created keeping in mind our data scientists and researchers across the world.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by our in-house Web Scraping and Data Mining teams at PromptCloud and DataStock. This sample contains 5K records. You can download the full dataset here.
Total Records Count: 53584
Domain Name: walmart.com
Date Range: 1st Jan 2021 - 28th Feb 2021
File Extension: tsv
Available Fields : Uniq Id, Crawl Timestamp, Product Id, Product Company Type Source, Product Category Group Code, Product Category Code, Product Market Code, Product Sector Code, Product Brand Code, Retailer, Product Category, Product Brand, Product Name, Product Price, Sku, Upc, Product Url, Market, Product Description, Product Currency, Product Available Inventory, Product Image Url, Product Model Number, Product Tags, Product Contents, Product Rating, Product Reviews Count, Bsr, Joining Key, Expected Category Count, Expected Brand Count
We wouldn't be here without the help of our in-house web scraping and data mining teams at PromptCloud and DataStock.
This dataset was created keeping in mind our data scientists and researchers across the world.