United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve both as a current landscape analysis and as a baseline for future studies of ag research data.

Purpose

As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
- establish where agricultural researchers in the United States (land grant and USDA researchers, primarily ARS, NRCS, USFS, and other agencies) currently publish their data, including general research data repositories, domain-specific databases, and the top journals
- compare how much data is in institutional vs. domain-specific vs. federal platforms
- determine which repositories are recommended by top journals that require or recommend the publication of supporting data
- ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data

Approach

The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data.
To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and hold some amount of ag data, we analysed resources including re3data, libguides, and ARS lists. Databases that are primarily environmental or public health oriented were excluded, but venues where ag grantees would publish data were considered.

Search methods

We first compiled a list of known domain-specific USDA / ARS datasets and databases represented in the Ag Data Commons, including the ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of "agricultural data" / "ag data" / "scientific data" NOT USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used the same search engines to find top agricultural university repositories, using variations of "agriculture", "ag data", and "university" to find schools with agriculture programs. Using that list of universities, we searched each university's web site to see whether the institution had a repository for its unique, independent research data, if one was not apparent in the initial web search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories; results included the Columbia University International Research Institute for Climate and Society, the UC Davis Cover Crops Database, and others. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied.
General university databases containing ag data included Colorado State University Digital Collections, the University of Michigan's ICPSR (Inter-university Consortium for Political and Social Research), and the University of Minnesota's DRUM (Digital Repository of the University of Minnesota). We then split out the NCBI (National Center for Biotechnology Information) repositories. Next, we searched the internet for open general data repositories using a variety of search engines; repositories containing a mix of data, journals, books, and other record types were tested to determine whether they could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGAEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. We compiled extensive lists of journals in which USDA published in 2012 and 2016, combining search results from ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites of the Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The author instructions of the top 50 journals were consulted to see whether they (a) ask or require submitters to provide supplemental data, or (b) require submitters to deposit data in open repositories. Journal data are based on the 2012 and 2016 studies of where USDA employees publish their research, ranked by number of articles, and include the 2015/2016 Impact Factor, author guidelines, whether supplemental data are requested, whether supplemental data are reviewed, whether open data (supplemental or in a repository) are required, and the data repositories recommended in the online author guidelines for each of the top 50 journals.
Evaluation

We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, the type of resource searched (datasets, data, images, components, etc.), the percentage of the total database that each term comprised, any search term that comprised at least 1% and 5% of the total collection, and any search term that returned more than 100 and more than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation.

Results

A summary of the major findings from our data review:
- Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
- There are few general repositories that are both large and contain a significant portion of ag data. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that returned over 500 datasets for at least one ag search term, with that result comprising at least 5% of the total collection.
- Fewer than one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher, regardless of funding or affiliation.

See the included README file for descriptions of each individual data file in this dataset.

Resources in this dataset:
- Resource Title: Journals. File Name: Journals.csv
- Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
- Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
- Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
- Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
- Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
- Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
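The repository evaluation described above reduces to a simple thresholding computation over search-result counts. A minimal sketch, using hypothetical counts rather than the study's actual numbers:

```python
# Hypothetical search-result counts for one repository (not the study's data).
TOTAL_DATASETS = 12000
term_hits = {"agriculture": 950, "crop": 610, "soil": 85, "livestock": 40}

def evaluate(term_hits, total):
    """Flag each search term against the study's thresholds:
    share of the whole collection (1% / 5%) and raw hit counts (>100 / >500)."""
    report = {}
    for term, hits in term_hits.items():
        share = hits / total
        report[term] = {
            "share": share,
            "at_least_1pct": share >= 0.01,
            "at_least_5pct": share >= 0.05,
            "over_100": hits > 100,
            "over_500": hits > 500,
        }
    return report

report = evaluate(term_hits, TOTAL_DATASETS)
```

With these invented counts, "agriculture" would clear both the 5% and the 500-result thresholds, the kind of result that qualified GBIF, ICPSR, and ORNL DAAC in the findings above.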
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This work and any original materials produced and published by Open Development Mekong herein are licensed under a CC BY-SA 4.0. News article summaries are extracted from their sources, as guided by fair-use principles and are copyrighted by their respective sources. Materials on the Open Development Mekong (ODM) website and its accompanying database are compiled from publicly available documentation and provided without fee for general informational purposes only. This is neither a commercial research service nor a domain managed by any governmental or inter-governmental agency; it is managed as a private non-profit open data/open knowledge media group. Information is publicly posted only after a careful vetting and verification process. However, ODM cannot guarantee accuracy, completeness or reliability from third party sources in every instance. ODM makes no representation or warranty, either expressed or implied, in fact or in law, with respect to the accuracy, completeness or appropriateness of the data, materials or documents contained or referenced herein or provided. Site users are encouraged to do additional research in support of their activities and to share the results of that research with our team, contact us to further improve the site accuracy. By accessing this ODM website or database users agree to take full responsibility for reliance on any site information provided and to hold harmless and waive any and all liability against individuals or entities associated with its development, form and content for any loss, harm or damage suffered as a result of its use.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This MTB file contains a collection of publicly available maps that can be used with the MetaPath platform to search and analyse experimental data on metabolism or catabolism. The External Scientific Report by EFSA's contractor, the German Federal Institute for Risk Assessment (BfR), describes how the extraction and coding of the data on which this database is based were conducted.
For more details and background information, please consult the EFSA website.
Intellectual Property Rights Notice
The reproduction, distribution, redistribution, exploiting, making commercial or further use of information, documents and data posted or otherwise made available on this website or in the websites linked to it may be subject to protection under intellectual property rights regulations, data exclusivity clauses or other applicable law, and their utilisation without obtaining the prior permission from the right(s)holder(s) of the respective information, documents and data might violate the pre-existing rights of the respective right(s)holder(s).
For materials subject to intellectual property rights or other rights of a third party, the User must comply with the terms of use associated with such material or obtain the necessary and written permission for reproduction, distribution or any other use from the right(s)holder(s).
EFSA does not accept any responsibility, and shall not be held liable, for any violation of any pre-existing rights or other infringements related to information, documents and data made available on this website or in the websites linked to it.
This dataset contains information on all items designated or under consideration for designation (i.e., calendared) by the New York City Landmarks Preservation Commission (LPC). It includes records for each individual, scenic, or interior landmark, as well as properties or sites located within the boundaries of historic districts. Please note that points in this dataset represent individual buildings as well as non-building sites (such as vacant lots or monuments) regulated by LPC. A single property can carry multiple designations (such as individual and interior designations, or individual and historic district), so it is not uncommon to see multiple points on a single tax lot and multiple records for a single property within the database. Please pay close attention to the "MOST_CURRENT", "BBL_STATUS", and "LAST_ACTION_ON_LP" fields, which together denote the designation status of a site/property (see Attribute Definitions for more information). The geographic locations of the points in this dataset are derived primarily from the Department of City Planning's PLUTO data in combination with the Department of Information Technology & Telecommunications' building footprint information. Because this dataset is not automatically updated when changes occur in the underlying datasets, BIN numbers and tax lot information are potentially out of date. Please pay close attention to the field descriptions in the file's metadata to understand how to use this data set. Time values are auto-generated and do not reflect the official time of any LPC action, including designation or calendaring.
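Because a property can have several records, consumers of the dataset typically combine those fields to recover each item's current status. A minimal sketch over plain records; the field names follow the dataset, but the values and LP numbers here are illustrative, not LPC's actual code lists:

```python
# Illustrative records; field names follow the dataset, values are invented.
records = [
    {"LP_NUMBER": "LP-00001", "MOST_CURRENT": 1, "BBL_STATUS": "DESIGNATED",
     "LAST_ACTION_ON_LP": "DESIGNATED"},
    {"LP_NUMBER": "LP-00001", "MOST_CURRENT": 0, "BBL_STATUS": "CALENDARED",
     "LAST_ACTION_ON_LP": "CALENDARED"},
    {"LP_NUMBER": "LP-00002", "MOST_CURRENT": 1, "BBL_STATUS": "CALENDARED",
     "LAST_ACTION_ON_LP": "CALENDARED"},
]

def current_status(records):
    """Keep only the most current record per LP number and report its status."""
    return {r["LP_NUMBER"]: (r["BBL_STATUS"], r["LAST_ACTION_ON_LP"])
            for r in records if r["MOST_CURRENT"] == 1}

status = current_status(records)
```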
Terms of use: https://www.icpsr.umich.edu/web/ICPSR/studies/38538/terms
The Child Care and Development Fund (CCDF) provides federal money to states and territories to help low-income families obtain quality child care so they can work, attend training, or pursue education. Within the broad federal parameters, states and territories set the detailed policies. Those details determine whether a particular family will or will not be eligible for subsidies, how much the family will have to pay for the care, how families apply for and retain subsidies, the maximum amounts that child care providers will be reimbursed, and the administrative procedures that providers must follow. Thus, while CCDF is a single program from the perspective of federal law, it is in practice a different program in every state and territory. The CCDF Policies Database project is a comprehensive, up-to-date database of CCDF policy information that supports the needs of a variety of audiences through (1) analytic data files, (2) a project website and search tool, and (3) an annual report (Book of Tables). These resources are made available to researchers, administrators, and policymakers with the goal of addressing important questions concerning the effects of child care subsidy policies and practices on the children and families served. A description of the data files, project website and search tool, and Book of Tables is provided below: 1. Detailed, longitudinal analytic data files provide CCDF policy information for all 50 states, the District of Columbia, and the United States territories and outlying areas, capturing the policies actually in effect at a point in time, rather than proposals or legislation. They record changes throughout each year, allowing users to access the policies in place at any point between October 2009 and the most recent data release. The data are organized into 32 categories, with each category of variables separated into its own dataset.
The categories span five general areas of policy:
- Eligibility Requirements for Families and Children (Datasets 1-5)
- Family Application, Terms of Authorization, and Redetermination (Datasets 6-13)
- Family Payments (Datasets 14-18)
- Policies for Providers, Including Maximum Reimbursement Rates (Datasets 19-27)
- Overall Administrative and Quality Information Plans (Datasets 28-32)

The information in the data files is based primarily on the documents that caseworkers use as they work with families and providers (often termed "caseworker manuals"). The caseworker manuals generally provide much more detailed information on eligibility, family payments, and provider-related policies than the CCDF Plans submitted by states and territories to the federal government. The caseworker manuals also provide ongoing detail for periods in between CCDF Plan dates. Each dataset contains a series of variables designed to capture the intricacies of the rules covered in the category. The variables include a mix of categorical, numeric, and text variables. Most variables have a corresponding notes field to capture additional details related to that particular variable. In addition, each category has an additional notes field to capture any information regarding the rules that is not already outlined in the category's variables. 2. The project website and search tool provide a point-and-click user interface in which users can select from the full set of public data to create custom tables. The website also provides access to the full range of reports and products released under the CCDF Policies Database project. The project website, search tool, and data files provide a more detailed set of information than the Book of Tables, including a wider selection of variables and policies over time. 3. The annual Book of Tables provides key policy information for October 1 of each year.
The report presents policy variations across the states and territories and is available on the project website. The Book of Tables summarizes a subset of the information available in the full database and data files, and includes information about eligibility requirements for families; application, redetermination, priority, and waiting list policies; family co-payments; and provider policies and reimbursement rates. In many cases, a variable in the Book of Tables will correspond to a single variable in the data files. Usuall
Measurement data on aboveground litterfall, litter mass, and litter carbon, nitrogen, and nutrient concentrations were extracted from 685 original literature sources and compiled into a comprehensive database to support the analysis of global patterns of carbon and nutrients in litterfall and litter pools. Data are included from sources dating from 1827 to 1997. The reported data include the literature reference, general site information (description, latitude, longitude, and elevation), site climate data (mean annual temperature and precipitation), site vegetation characteristics (management, stand age, ecosystem and vegetation-type codes), annual quantities of litterfall (by class, kg m-2 yr-1), litter pool mass (by class and litter layer, kg m-2), and concentrations of nitrogen (N), phosphorus (P), and base cations for the litterfall (g m-2 yr-1) and litter pool components (g m-2). The investigators' intent was to compile a comprehensive data set of individual direct field measurements as reported by researchers. While the primary emphasis was on acquiring C data, measurements of N, P, and base cations were also obtained, although the database is sparse for elements other than C and N. Each of the 1,497 records in the database represents a measurement site. Replicate measurements were averaged according to conventions described in Section 5 and recorded for each site in the database. The sites were at 575 different locations.
This dataset contains in-situ soil moisture profile and soil temperature data collected at 30-minute intervals at SoilSCAPE (Soil moisture Sensing Controller and oPtimal Estimator) project sites in the United States and New Zealand since 2021. The SoilSCAPE network has used wireless sensor technology to acquire high-temporal-resolution soil moisture and temperature data over varying durations since 2011. Since 2021, the project has upgraded the two previously active sites in Arizona and added several new sites in the United States and New Zealand; these new sites typically use the METER Teros-12 soil moisture sensor. At its maximum, the new network consisted of 57 wireless sensor installations (nodes), with 6 to 8 nodes per site. Each SoilSCAPE site contains multiple wireless end-devices (EDs). Each ED supports up to five soil moisture probes, typically installed at 5, 10, 20, and 30 cm below the surface; sites in Arizona have probes installed at up to 75 cm below the surface. Soil conditions (e.g., hard soil or rocks) may have limited sensor placement. The data enable estimation of local-scale soil moisture at high temporal resolution and validation of remote sensing estimates of soil moisture at regional and national scales (e.g., NASA's Cyclone Global Navigation Satellite System, CYGNSS, and Soil Moisture Active Passive, SMAP). The data are provided in NetCDF format.
This CED spatial web service (an ESRI ArcGIS Online Hosted Feature Layer) is an optimized quick-visualization and unique-value source for fast CED web startup. The CED provides sagebrush-biome spatial representations and attribute information for conservation efforts entered into the Conservation Efforts Database (https://conservationefforts.org) by various data providers. The web service is made up of point and polygon layers and non-spatial tables; feature records are grouped with their respective spatial feature-type layers (point, polygon), and the two spatial layers have identical attribute fields. Read-only access to this data is available ONLY via an interactive web map on the Conservation Efforts Database website or authorized websites. Users who are interested in more access can contact the data providers directly using the contact information available through the CED interactive map's pop-up/identify feature.

The spatially explicit, web-based Conservation Efforts Database is capable of (1) allowing multiple users to enter data from different locations, (2) uploading and storing documents, (3) linking conservation actions to one or more threats (one-to-many relationships), (4) reporting functions that allow summaries of the conservation actions at multiple scales (e.g., management zones, populations, or priority areas for conservation), and (5) accounting for actions at multiple scales, from small easements to statewide planning efforts.

The sagebrush ecosystem is the largest ecosystem type in the continental U.S., providing habitat for more than 350 associated fish and wildlife species. In recognition of the need to conserve a healthy sagebrush ecosystem to provide for the long-term conservation of its inhabitants, the US Fish and Wildlife Service (Service) and United States Geological Survey (USGS) developed the Conservation Efforts Database version 2.0.0 (CED). The purpose of the CED is to efficiently capture the unprecedented level of conservation plans and actions being implemented throughout the sagebrush ecosystem; it is designed to capture actions not only for the ecosystem's most famous resident, the greater sage-grouse (Centrocercus urophasianus; hereafter, sage-grouse), but also for the other species that rely on sagebrush habitats. Understanding the distribution and type of conservation actions happening across the landscape will allow visualization and quantification of the extent to which threats are being addressed. The purpose of this spatial web service is to provide CED data to authorized web sites and authorized users.
FLUXNET is a global network of micrometeorological tower sites that use eddy covariance methods to measure the exchanges of carbon dioxide, water vapor, and energy between terrestrial ecosystems and the atmosphere. This dataset provides information from the ORNL DAAC-hosted FLUXNET site database, which was discontinued in 2016. The files provided contain a list of investigators associated with each tower site, site locations and environmental data, and a bibliography of papers that used FLUXNET data. For more up-to-date information on FLUXNET sites, see http://fluxnet.fluxdata.org/.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code:
Packet_Features_Generator.py & Features.py
To run this code:
pkt_features.py [-h] -i TXTFILE [-x X] [-y Y] [-z Z] [-ml] [-s S] -j

-h, --help  show this help message and exit
-i TXTFILE  input text file
-x X        add the first X number of total packets as features
-y Y        add the first Y number of negative packets as features
-z Z        add the first Z number of positive packets as features
-ml         output to a text file all websites in the format websiteNumber1,feature1,feature2,...
-s S        generate samples using size S
-j
Purpose:
Turns a text file containing lists of incoming and outgoing network packet sizes into separate website objects with associated features.
Uses Features.py to calculate the features.
startMachineLearning.sh & machineLearning.py
To run this code:
bash startMachineLearning.sh
This script runs machineLearning.py in a tmux session with the necessary file paths and flags.
Options (to be edited within this file):
--evaluate-only to test 5 fold cross validation accuracy
--test-scaling-normalization to test 6 different combinations of scalers and normalizers
Note: once the best combination is determined, it should be added to the data_preprocessing function in machineLearning.py for future use
--grid-search to find the best grid-search hyperparameters. Note: the candidate hyperparameters must be added to train_model under 'if not evaluateOnly:'; once the best hyperparameters are determined, add them to train_model under 'if evaluateOnly:'
Purpose:
Using the .ml file generated by Packet_Features_Generator.py & Features.py, this program trains a RandomForest classifier on the provided data and reports results using cross validation. These results include the best scaling and normalization options for each data set as well as the best grid-search hyperparameters based on the provided ranges.
Data
Encrypted network traffic was collected on an isolated computer visiting different Wikipedia and New York Times articles, issuing different Google search queries (collected in the form of their autocomplete results and their results page), and performing different actions on a Virtual Reality headset.
Data for this experiment was stored and analyzed as one txt file per experiment. In each line of a file:
The first number is a classification number denoting which website, query, or VR action is taking place.
The remaining numbers denote, for each packet, the size of the packet and the direction it is traveling:
negative numbers denote incoming packets
positive numbers denote outgoing packets
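A minimal sketch of reading one such line, assuming whitespace-separated values (the actual parsing lives in Packet_Features_Generator.py; the sample line below is invented):

```python
def parse_trace_line(line):
    """Parse one experiment line: a class label followed by signed packet sizes.
    Negative sizes are incoming packets, positive sizes are outgoing."""
    numbers = [int(tok) for tok in line.split()]
    label, packets = numbers[0], numbers[1:]
    incoming = [p for p in packets if p < 0]
    outgoing = [p for p in packets if p > 0]
    return label, incoming, outgoing

# Hypothetical line: class 3, then four packets.
label, incoming, outgoing = parse_trace_line("3 -1500 520 -60 1380")
```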
Figure 4 Data
This data uses specific lines from the Virtual Reality.txt file.
The action 'LongText Search' refers to a user searching for "Saint Basils Cathedral" with text in the Wander app.
The action 'ShortText Search' refers to a user searching for "Mexico" with text in the Wander app.
The .xlsx and .csv files are identical.
Each file includes (from right to left):
The original packet data,
each line of data sorted from smallest to largest packet size in order to calculate the mean and standard deviation of each packet capture,
and the final Cumulative Distribution Function (CDF) calculation that generated the Figure 4 graph.
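The sort, mean/standard-deviation, and CDF steps above can be sketched as follows (the packet sizes here are made up; the actual calculation lives in the spreadsheet files):

```python
import math

def empirical_cdf(sizes):
    """Sort packet sizes, compute mean and (population) standard deviation,
    and return the empirical CDF as (size, cumulative fraction) pairs."""
    ordered = sorted(sizes)
    n = len(ordered)
    mean = sum(ordered) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in ordered) / n)
    cdf = [(x, (i + 1) / n) for i, x in enumerate(ordered)]
    return mean, std, cdf

# Hypothetical capture of four packet sizes.
mean, std, cdf = empirical_cdf([60, 1500, 520, 1380])
```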
OpenWeb Ninja’s Website Contacts Scraper API provides real-time access to B2B contact data directly from company websites and related public sources. The API delivers clean, structured results including B2B email data, phone number data, and social profile links, making it simple to enrich leads and build accurate company contact lists at scale.
What's included:
- Emails & Phone Numbers: extract business emails and phone contacts from a website domain.
- Social Profile Links: capture company accounts on LinkedIn, Facebook, Instagram, TikTok, Twitter/X, YouTube, GitHub, and Pinterest.
- Domain Search: input a company website domain and get all available contact details.
- Company Name Lookup: find a company’s website domain by name, then retrieve its contact data.
- Comprehensive Coverage: scrape across all accessible website pages for maximum data capture.

Coverage & Scale:
- 1,000+ emails and phone numbers per company website supported.
- 8+ major social networks covered.
- Real-time REST API for fast, reliable delivery.

Use cases:
- B2B contact enrichment and CRM updates.
- Targeted email marketing campaigns.
- Sales prospecting and lead generation.
- Digital ads audience targeting.
- Marketing and sales intelligence.
With OpenWeb Ninja’s Website Contacts Scraper API, you get structured B2B email data, phone numbers, and social profiles straight from company websites - always delivered in real time via a fast and reliable API.
A multi-institutionally supported website and database that provides access to a large number of globally used lipidomics resources. It has led the field of lipid curation, classification, and nomenclature internationally since 2003. New open-access databases, informatics tools, and lipidomics-focused training activities are produced and made publicly available for researchers studying lipids in health and disease.
A database and storage service that allows users to create, view, share, and download information from companion websites. RunMyCode lets users create companion websites for their scientific publications and share computer code and data through them. Any software and data format is compatible with RunMyCode.
Privately owned public spaces, also known by the acronym POPS, are outdoor and indoor spaces provided for public enjoyment by private owners in exchange for bonus floor area or waivers, an incentive first introduced into New York City's zoning regulations in 1961. To find out more about POPS, visit the Department of City Planning's website at http://nyc.gov/pops. This database contains detailed information about each privately owned public space in New York City.
Data Source: Privately Owned Public Space Database (2018), owned and maintained by the New York City Department of City Planning and created in collaboration with Jerold S. Kayden and The Municipal Art Society of New York. All previously released versions of this data are available on the DCP Website: BYTES of the BIG APPLE. Current version: 25v2
PredictLeads Job Openings Data provides high-quality hiring insights sourced directly from company websites - not job boards. Using advanced web scraping technology, our dataset offers real-time access to job trends, salaries, and skills demand, making it a valuable resource for B2B sales, recruiting, investment analysis, and competitive intelligence.
Key Features:
✅ 232M+ Job Postings Tracked – data sourced from 92 million company websites worldwide.
✅ 7.1M+ Active Job Openings – updated in real time to reflect hiring demand.
✅ Salary & Compensation Insights – extract salary ranges, contract types, and job seniority levels.
✅ Technology & Skill Tracking – identify emerging tech trends and industry demands.
✅ Company Data Enrichment – link job postings to employer domains, firmographics, and growth signals.
✅ Web Scraping Precision – directly sourced from employer websites for unmatched accuracy.
Primary Attributes:
Job Metadata:
Salary Data (salary_data)
Occupational Data (onet_data) (object, nullable)
Additional Attributes:
📌 Trusted by enterprises, recruiters, and investors for high-precision job market insights.
PredictLeads Dataset: https://docs.predictleads.com/v3/guide/job_openings_dataset
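A minimal sketch of consuming one job-opening record. The field names `salary_data` and `onet_data` come from the attribute list above; the exact record shape and sub-fields (`title`, `minimum`, `maximum`, `currency`, `code`) are assumptions — see the linked dataset documentation for the authoritative schema.

```python
from typing import Optional

def summarize_opening(record: dict) -> str:
    """Format a one-line summary from an assumed job-opening record shape."""
    title = record.get("title", "unknown role")
    salary: Optional[dict] = record.get("salary_data")  # may be absent
    onet: Optional[dict] = record.get("onet_data")      # nullable per the docs
    parts = [title]
    if salary:
        parts.append(
            f"{salary.get('minimum')}-{salary.get('maximum')} "
            f"{salary.get('currency', '')}".strip()
        )
    if onet:
        parts.append(f"O*NET {onet.get('code')}")
    return " | ".join(parts)

# Hypothetical record for illustration only
example = {
    "title": "Data Engineer",
    "salary_data": {"minimum": 90000, "maximum": 120000, "currency": "USD"},
    "onet_data": None,
}
print(summarize_opening(example))  # Data Engineer | 90000-120000 USD
```

Guarding on the nullable `onet_data` field matters in practice, since the attribute list above marks it as `(object, nullable)`.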
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The Multi-Agency Ground Plot (MAGPlot) database (DB) is a pan-Canadian forest ground-plot data repository. The database synthesizes forest ground plot data contributed in its original format by various agencies, including the National Forest Inventory (NFI) and 12 Canadian jurisdictions: Alberta (AB), British Columbia (BC), Manitoba (MB), New Brunswick (NB), Newfoundland and Labrador (NL), Nova Scotia (NS), Northwest Territories (NT), Ontario (ON), Prince Edward Island (PE), Quebec (QC), Saskatchewan (SK), and Yukon Territory (YT). These datasets underwent data cleaning and quality assessment using the rules and standards set by the contributors and the associated documentation, and were standardized, harmonized, and integrated into a single, centralized, analysis-ready database. The primary objective of the MAGPlot project is to collate and harmonize forest ground plot data and to present it in a findable, accessible, interoperable, and reusable (FAIR) format for pan-Canadian forest research. The current version includes both historical and contemporary forest ground plot data provided by data contributors. The standardized and harmonized dataset comprises eight data tables (five site-related and three tree-measurement tables) in a relational database schema. Site-related tables contain information on geographical locations, treatments (e.g., stand tending, regeneration, and cutting), and disturbances caused by abiotic factors (e.g., weather, wildfires) or biotic factors (e.g., disease, insects, animals). Tree-related tables, on the other hand, focus on measured tree attributes, including biophysical and growth parameters (e.g., DBH, height, crown class), species, status, stem conditions (e.g., broken or dead tops), and health conditions. While most contributors provided large- and small-tree plot measurements, only NFI, AB, MB, and SK contributed datasets reported at the regeneration plot level (e.g., stem count, regeneration species).
Future versions are expected to include updated and/or new measurement records as well as additional tables and measured and compiled (e.g., tree volume and biomass) attributes. MAGPlot is hosted through Canada’s National Forest Information System (https://nfi.nfis.org/en/maps).
LATEST SITE TREATMENTS LAYER:
Shows the most recently applied treatment class for each MAGPlot site. These treatment classes are broad categories, with more specific treatment details available in the full dataset.
NOTES:
The MAGPlot releases (v1.0 and v1.1) do not include the NL and SK datasets due to pending Data Sharing Agreements, ongoing data processing, or restrictions on third-party sharing. These datasets will be included in future releases. While certain jurisdictions permit open or public data sharing provided that the requestor signs and adheres to the Data Use Agreement, some jurisdictions require a jurisdiction-specific request form to be signed in addition to the Data Use Agreement form. For the MAGPlot Data Dictionary, other metadata, datasets available for open sharing (with approximate locations), data requests (for other datasets or exact coordinates), and available data visualization products, please check all the folders in the “Data and Resources” section below. Coordinates in web services have been randomized within 5 km of the true location to preserve site integrity. Access the WMS (Web Map Service) layers from the “Data and Resources” section below. A data request must be submitted, using the link below, to access historical datasets, datasets restricted by data-use agreements, or exact plot coordinates.
NFI Data Request Form: https://nfi.nfis.org/en/datarequestform
ACKNOWLEDGEMENT:
We acknowledge and recognize the following agencies that have contributed data to the MAGPlot database:
- Government of Alberta - Ministry of Agriculture, Forestry, and Rural Economic Development - Forest Stewardship and Trade Branch
- Government of British Columbia - Ministry of Forests - Forest Analysis and Inventory Branch
- Government of Manitoba - Ministry of Economic Development, Investment, Trade, and Natural Resources - Forestry and Peatlands Branch
- Government of New Brunswick - Ministry of Natural Resources and Energy Development - Forestry Division, Forest Planning and Stewardship Branch
- Government of Newfoundland & Labrador - Department of Fisheries, Forestry and Agriculture - Forestry Branch
- Government of Nova Scotia - Ministry of Natural Resources and Renewables - Department of Natural Resources and Renewables
- Government of Northwest Territories - Department of Environment & Climate Change - Forest Management Division
- Government of Ontario - Ministry of Natural Resources and Forestry - Science and Research Branch, Forest Resources Inventory Unit
- Government of Prince Edward Island - Department of Environment, Energy, and Climate Action - Forests, Fish, and Wildlife Division
- Government of Quebec - Ministry of Natural Resources and Forests - Forestry Sector
- Government of Saskatchewan - Ministry of Environment - Forest Service Branch
- Government of Yukon - Ministry of Energy, Mines, and Resources - Forest Management Branch
- Government of Canada - Natural Resources Canada - Canadian Forest Service - National Forest Inventory Projects Office
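The site/tree relational layout described above (site-related tables joined to tree-measurement tables) can be sketched with an in-memory database; the table and column names below are illustrative, not the MAGPlot data dictionary:

```python
import sqlite3

# Minimal sketch of a site/tree relational schema in the spirit of the
# MAGPlot description. Table and column names are assumptions for
# illustration -- consult the MAGPlot Data Dictionary for the real schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE site (
    site_id      INTEGER PRIMARY KEY,
    jurisdiction TEXT,   -- e.g. 'AB', 'BC', or 'NFI'
    latitude     REAL,   -- randomized within 5 km in the web services
    longitude    REAL
);
CREATE TABLE tree_measurement (
    tree_id     INTEGER PRIMARY KEY,
    site_id     INTEGER REFERENCES site(site_id),
    species     TEXT,
    dbh_cm      REAL,    -- diameter at breast height
    height_m    REAL,
    crown_class TEXT
);
""")
con.execute("INSERT INTO site VALUES (1, 'BC', 53.9, -122.7)")
con.execute(
    "INSERT INTO tree_measurement VALUES (1, 1, 'Picea glauca', 24.5, 18.2, 'codominant')"
)
row = con.execute("""
    SELECT s.jurisdiction, t.species, t.dbh_cm
    FROM tree_measurement t JOIN site s ON s.site_id = t.site_id
""").fetchone()
print(row)  # ('BC', 'Picea glauca', 24.5)
```

Keeping tree measurements in a separate table keyed by `site_id` is what makes the dataset "analysis-ready": one join recovers any combination of site context and tree attributes.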
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.
The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.
The full set of files, in order of use, is as follows:
Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
HTML.zip: Archived version of the set of individual HTML files.
03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
04 GPT application.py: Python script using OpenAI's API to classify selected pages according to their HTML title and URL.
categorization_applied.csv: Output file containing the classification of selected pages.
exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.
metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
TM_125.RData: RData file containing the results of the 125-topic model.
loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.
Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms "sustainability" and "profitability" over time.
99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.
For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.
The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found at https://haans-mertens.github.io/ and in the following article: "The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data", by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods.
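The "one page deeper" step in the workflow above — parse a front page and keep only links that stay on the same domain — can be sketched with the standard library. This mirrors the idea behind scripts 01 and 99, not their actual code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def one_level_deeper(base_url: str, html: str) -> list[str]:
    """Return same-domain links found on a page, resolved to absolute URLs."""
    parser = LinkCollector()
    parser.feed(html)
    domain = urlparse(base_url).netloc
    resolved = (urljoin(base_url, href) for href in parser.links)
    return sorted({u for u in resolved if urlparse(u).netloc == domain})

page = '<a href="/about">About</a> <a href="https://other.com/x">Ext</a>'
print(one_level_deeper("https://example.com/", page))
# ['https://example.com/about']
```

Deduplicating and restricting to the firm's own domain keeps the crawl bounded, which matters when the tracking file grows to hundreds of thousands of URLs.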
Twitterhttps://www.nist.gov/open/copyright-fair-use-and-licensing-statements-srd-data-software-and-technical-series-publications#SRDhttps://www.nist.gov/open/copyright-fair-use-and-licensing-statements-srd-data-software-and-technical-series-publications#SRD
The NIST Chemistry WebBook provides users with easy access to chemical and physical property data for chemical species through the internet. The data provided in the site are from collections maintained by the NIST Standard Reference Data Program and outside contributors. Data in the WebBook system are organized by chemical species. The WebBook system allows users to search for chemical species by various means. Once the desired species has been identified, the system will display data for the species. Data include thermochemical properties of species and reactions, thermophysical properties of species, and optical, electronic and mass spectra.
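A name-based species lookup on the WebBook can be sketched as a URL construction. The `cbook.cgi` path and the `Name`/`Units` query parameters reflect the site's public search URLs, but treat them as assumptions and verify against the site before relying on them:

```python
from urllib.parse import urlencode

# Base search endpoint of the NIST Chemistry WebBook (verify against the site).
WEBBOOK = "https://webbook.nist.gov/cgi/cbook.cgi"

def species_url(name: str, units: str = "SI") -> str:
    """Build a search-by-name URL for a chemical species."""
    return f"{WEBBOOK}?{urlencode({'Name': name, 'Units': units})}"

print(species_url("water"))
# https://webbook.nist.gov/cgi/cbook.cgi?Name=water&Units=SI
```

Once the species page is identified this way, the thermochemical, thermophysical, and spectral data sections described above are reachable from it.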
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve both as a current landscape analysis and as a baseline for future studies of ag research data.
Purpose
As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
- establish where agricultural researchers in the United States (land grant and USDA researchers, primarily ARS, NRCS, USFS, and other agencies) currently publish their data, including general research data repositories, domain-specific databases, and the top journals;
- compare how much data is in institutional vs. domain-specific vs. federal platforms;
- determine which repositories are recommended by top journals that require or recommend the publication of supporting data;
- ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data.
Approach
The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data.
To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and hold some amount of ag data, we analyzed resources including re3data, libguides, and ARS lists. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.
Search methods
We first compiled a list of known domain-specific USDA/ARS datasets and databases that are represented in the Ag Data Commons, including the ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, the National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then used search engines such as Bing and Google to find non-USDA/federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” NOT USDA (to filter out the federal/USDA results). Most of these results were domain-specific, though some contained a mix of data subjects. We then used the same search engines to find top agricultural university repositories, combining variations of “agriculture”, “ag data”, and “university” to find schools with agriculture programs. Using that list of universities, we searched each university's website to see whether the institution had a repository for its unique, independent research data, if one was not apparent in the initial search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data; the ag-specific university repositories are included in the list of domain-specific repositories. Results included Columbia University's International Research Institute for Climate and Society, the UC Davis Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied.
General university databases that contain ag data included Colorado State University Digital Collections, the University of Michigan's ICPSR (Inter-university Consortium for Political and Social Research), and the University of Minnesota's DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next, we searched the internet for open general data repositories using a variety of search engines; repositories containing a mix of data, journals, books, and other record types were tested to determine whether they could filter for data results after search terms were applied. General-subject data repositories include Figshare, Open Science Framework, PANGAEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. We compiled extensive lists of journals in which USDA researchers published in 2012 and 2016, combining search results from ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA websites of the Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The author instructions of the top 50 journals were consulted to see whether they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for journals from the 2012 and 2016 studies of where USDA employees publish their research, ranked by number of articles, including the 2015/2016 Impact Factor, author guidelines, whether supplemental data is accepted, whether supplemental data is reviewed, whether open data (supplemental or in a repository) is required, and the recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.
Evaluation
We ran a series of searches on all resulting general-subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository; the type of resource searched (datasets, data, images, components, etc.); the percentage of the total database that each term comprised; any dataset with a search term that comprised at least 1% and 5% of the total collection; and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.
Results
A summary of the major findings from our data review:
- Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
- There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
- Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
See the included README file for descriptions of each individual data file in this dataset.
Resources in this dataset:
Resource Title: Journals. File Name: Journals.csv
Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
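The per-term screening described under Evaluation can be expressed as a small computation over (term, result count) pairs; the counts below are made up for illustration, not figures from the study:

```python
def screen_terms(total_records: int, term_counts: dict[str, int]) -> dict[str, dict]:
    """For each search term, compute its share of the collection and the
    threshold flags used in the evaluation: at least 1% / 5% of the total
    collection, and more than 100 / 500 results returned."""
    report = {}
    for term, count in term_counts.items():
        share = count / total_records
        report[term] = {
            "share": share,
            "at_least_1pct": share >= 0.01,
            "at_least_5pct": share >= 0.05,
            "over_100": count > 100,
            "over_500": count > 500,
        }
    return report

# Illustrative numbers only -- not results from the dataset
result = screen_terms(10_000, {"agriculture": 650, "soil": 90})
print(result["agriculture"]["at_least_5pct"], result["soil"]["over_100"])
# True False
```

A repository like GBIF would pass both the over-500 and at-least-5% tests for at least one ag term under this scheme, which is exactly the combination the findings highlight as rare.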