Public Domain Mark 1.0 https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
USPTO patent application no. 09407650 in the United States Patent and Trademark Office
Below is an explanation of the data along with some features that are available on this map (a description is also provided in the "Getting Started" widget of the application).

A variety of different colored circles appear throughout the map. They represent sites that are associated with the following programs:

1) Department of Toxic Substances Control (DTSC) sites:
a) Historical Inactive - Identifies sites from an older database that are non-active sites where, through a Preliminary Endangerment Assessment (PEA) or other evaluation, DTSC has determined that a removal or remedial action or further extensive investigation is required.
b) School Cleanup - Identifies proposed and existing school sites that are being evaluated by DTSC for possible hazardous materials contamination. School sites are further defined as "Cleanup", where remedial actions are occurring or have occurred.
c) School Evaluation - Identifies proposed and existing school sites that are being evaluated by DTSC for possible hazardous materials contamination. School sites are further defined as "Evaluation", where further investigation is needed.
d) Corrective Action - Investigation or cleanup activities at Resource Conservation and Recovery Act (RCRA) or state-only hazardous waste facilities (that were required to obtain a permit or have received a hazardous waste facility permit from DTSC or U.S. EPA).
e) State Response - Identifies confirmed release sites where DTSC is involved in remediation, either in a lead or oversight capacity. These confirmed release sites are generally high priority and high potential risk.
f) Evaluation - Identifies suspected, but unconfirmed, contaminated sites that need or have gone through a limited investigation and assessment process.
g) Tiered Permit - A corrective action cleanup project on a hazardous waste facility that either was eligible to treat or was permitted to treat waste under the Tiered Permitting system.

2) State Water Board or DTSC sites:
a) Leaking Underground Storage Tank (LUST) Cleanup - Includes all Underground Storage Tank (UST) sites that have had an unauthorized release (i.e. leak or spill) of a hazardous substance, usually fuel hydrocarbons, and are being (or have been) cleaned up. These sites are regulated under the State Water Board's UST Cleanup Program and/or similar programs conducted by each of the nine Regional Water Boards or Local Oversight Programs.
b) Cleanup Program - Includes all "non-federally owned" sites that are regulated under the State Water Board's Site Cleanup Program and/or similar programs conducted by each of the nine Regional Water Boards. Cleanup Program sites are also commonly referred to as "Site Cleanup Program sites".
c) Voluntary Cleanup - Identifies sites with either confirmed or unconfirmed releases where the project proponents have requested that the State Water Board or DTSC oversee evaluation, investigation, and/or cleanup activities and have agreed to provide coverage for the lead agency's costs.

3) Other:
a) Permitted Tanks - The "Permitted Tanks" data set includes facilities that are associated with permitted underground storage tanks from the California Environmental Reporting System (CERS) database. The CERS data consists of current and recently closed permitted underground storage tank (UST) facility information provided to CERS by Certified Unified Program Agencies (CUPAs).

*Note: Underground Storage Tank Cleanup and Cleanup Program project records are pulled from the State Water Board's GeoTracker database.
The Permitted Tanks information was obtained from California EPA’s California Environmental Reporting System (CERS) database. All other project records were obtained from DTSC's EnviroStor database. Program descriptions come from DTSC’s EnviroStor Glossary of Terms and the State Water Board’s GeoTracker Site/Facility Type Definitions. The information associated with these records was last updated in the application on 4/24/2023.
THIS DATA ASSET NO LONGER ACTIVE: This is metadata documentation for the National Priorities List (NPL) Publication Assistance Database (PAD), a Lotus Notes application that holds Region 7's universe of NPL site information, such as site description, threats and contaminants, cleanup approach, environmental process, community involvement, site repository, and regional contacts. This database used to be updated annually, at different times for different NPLs, but it is currently no longer being used. This work fell under objectives for EPA's 2003-2008 Strategic Plan (Goal 3) for Land Preservation & Restoration, which are to clean up and reuse contaminated land.
*The data for this dataset is updated daily. The date(s) displayed in the details section on our Open Data Portal is based on the last date the metadata was updated and not the refresh date of the data itself.*

The Cleanup Sites layer provides locations and document links for sites currently in the cleanup process and sites awaiting cleanup funding. Cleanup programs include: Brownfields, Petroleum, EPA Superfund (CERCLA), Drycleaning, Responsible Party Cleanup, State Funded Cleanup, State Owned Lands Cleanup, and Hazardous Waste Cleanup. Please reference the metadata for contact information.
https://creativecommons.org/publicdomain/zero/1.0/
The dataset was obtained by web scraping a Wikipedia page; the code is linked below: https://www.kaggle.com/amruthayenikonda/simple-web-scraping-using-pandas
This dataset can be used to practice data cleaning and manipulation, for example dropping unwanted columns, handling null values, removing symbols, etc.
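As an illustrative sketch of that kind of practice (the filename and column names below are hypothetical placeholders, not fields from the scraped table):

# Hypothetical practice clean-up; substitute the actual columns of the scraped table.
import pandas as pd

df = pd.read_csv("scraped_wikipedia_table.csv")   # hypothetical filename
df = df.drop(columns=["Notes"], errors="ignore")  # drop an unwanted column
df = df.dropna()                                  # drop rows with null values
# strip stray symbols (e.g. footnote markers) from a text column
df["Rank"] = df["Rank"].astype(str).str.replace(r"[^\w\s.]", "", regex=True)
print(df.head())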
CLEAR has public record information and is also used for law enforcement and investigations, including personal identification and financial records, police reports, and credential verification services.
Data to create the List of Contaminated or Potentially Contaminated Sites - Remediation Division is from historical program information or from new program applications and filings. More information regarding the generation of this list can be found at: https://portal.ct.gov/DEEP/Remediation--Site-Clean-Up/List-of-Contaminated-or-Potentially-Contaminated-Sites-in-Connecticut

A separate dataset is published for the List of Contaminated or Potentially Contaminated Sites - SASU Case Management System, which provides a list of Leaking Underground Storage Tank sites. The two database systems are maintained by different Divisions within the agency. There may be sites in both databases due to an overlap in the responsibilities of the two Divisions. https://data.ct.gov/Environment-and-Natural-Resources/List-of-Contaminated-or-Potentially-Contaminated-S/77ya-7twa

The data is updated when documents are received from responsible parties conducting site remediation. For more information regarding the individual remedial programs visit: https://portal.ct.gov/DEEP/Remediation--Site-Clean-Up/Remediation-Site-Clean-Up Those seeking additional information about the information contained in this dataset may use the DEEP FOIA process: https://portal.ct.gov/DEEP/About/FOIA-Requests

Each row represents a remediation project (Property Transfer, Brownfield, Enforcement, Federal Remediation, State Remediation, Landfill Monitoring, RCRA Corrective Action, and Voluntary). Data to compile the list was gathered for each site from information provided to DEEP for requirements within each program. Sites may be in multiple remediation programs and therefore may be listed more than once. Some sites have been fully cleaned up, while others have limited information about the environmental conditions. The list includes only sites that have been reported to DEEP or EPA.

Additional information for sites within the Hazard Notification program can be found at: https://portal.ct.gov/DEEP/Remediation--Site-Clean-Up/Significant-Environmental-Hazard-Program/List-of-Significant-Environmental-Hazards Significant Environmental Hazard Sites GIS map: https://experience.arcgis.com/experience/9c100aa21fbe4ee180df9942d000f676

Details on columns which reference ELUR: Environmental Land Use Restrictions (ELUR) or Notice and Use Limitation (NAUL) are used to minimize the risk of human exposure to pollutants and hazards to the environment by preventing specific uses or activities at a property or a portion of a property. Link to GIS map of ELURs and restriction type: https://ctdeep.maps.arcgis.com/apps/webappviewer/index.html?id=d37eccb2a5c3491d8f0d389a96d9a912

There may be errors in the data, although we strive to minimize them. Examples of errors may include misspelled or incomplete addresses and/or missing data.
The Alaska Geochemical Database Version 4.0 (AGDB4) contains geochemical data compilations in which each geologic material sample has one best value determination for each analyzed species, greatly improving efficiency of use. The relational database includes historical geochemical data archived in the USGS National Geochemical Database (NGDB), the Atomic Energy Commission National Uranium Resource Evaluation (NURE) Hydrogeochemical and Stream Sediment Reconnaissance databases, and the Alaska Division of Geological and Geophysical Surveys (DGGS) Geochemistry database. Data from the U.S. Bureau of Mines and the U.S. Bureau of Land Management are included as well. The data tables describe historical and new quantitative and qualitative geochemical analyses. The analytical results were determined by 120 laboratory and field analytical methods performed on 416,333 rock, sediment, soil, mineral, heavy-mineral concentrate, and oxalic acid leachate samples. The samples were collected as part of various agency programs and projects from 1938 through 2021. Most samples were collected by agency personnel and analyzed in agency laboratories or under contracts in commercial analytical laboratories. Mineralogical data from 18,138 nonmagnetic heavy-mineral concentrate samples are also included in this database. The data in the AGDB4 supersede data in the AGDB, AGDB2, and AGDB3 databases but the background about the data in these earlier versions is needed to understand what has been done to amend, clean up, correct, and format these data. Data that were not included in previous versions because they predate the earliest agency geochemical databases or were excluded for programmatic reasons are included here in the AGDB4. The AGDB4 data are the most accurate and complete to date and should be useful for a wide variety of geochemical studies. They are provided as a Microsoft Access database, as comma-separated values (CSV), and as an Esri geodatabase consisting of point feature classes and related tables.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BOLD CO1 databases reformatted for use in NanoClass (https://github.com/ejongepier/NanoClass; version 0.3.0-beta or higher) and QIIME2. Three separate databases are included for use in combination with the primers mtD, LCO-HCO, and CI. The databases include reference sequences and reference taxonomies for use in NanoClass, as well as pre-trained classifiers for use in QIIME2. See usage instructions below.
For questions, please contact e.jongepier@uva.nl.
==========================================
Please note this version of a custom BOLD CO1 db comes with absolutely no warranties.
When using this db in NanoClass, mind that it has only been tested with the methods ["megablast","minimap","spingo"]. NanoClass cannot be run in combination with these BOLD CO1 databases using the methods ["mothur","centrifuge","kraken"]. Compatibility with ["blast","dcmegablast","qiime","rdp"] is untested. Just remove the tools you want to skip from NanoClass/config.yaml (see also the NanoClass documentation here: https://ejongepier.github.io/NanoClass/).
Never use this database in combination with the NanoClass snakemake -F parameter, or this BOLD CO1 database will be overwritten by the default 16S SILVA database.
==========================================
BOLD CO1 database (last) downloaded on 20210420 and reformatted for use in QIIME2 and NanoClass. To clean up the BOLD CO1 db, these steps were taken (steps 7 to 11 were repeated for each of the 3 primers; an illustrative sketch of a few of these steps follows the list):
- remove identical duplicates [3597874]
- drop seqs with non-IUPAC characters [3597839]
- remove leading and trailing ambiguous bases [3597839]
- remove low quality reads
- remove reads with homopolymer runs
- filter by length
- extract fragments between primer sequences [mtD: 112450; CI: 121391; LCO-HCO: 65307]
- dereplicate / cluster [mtD: 55075; CI: 46470; LCO-HCO: 24835]
- remove uninformative taxonomic labels [mtD: 55073; CI: 46466; LCO-HCO: 24832]
- reformat db for use in NanoClass
- train classifier based on fragments
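For illustration only, here is a minimal Python/Biopython sketch of a few of the steps above (exact-duplicate removal, dropping non-IUPAC sequences, trimming ambiguous bases, and length filtering); it is not the pipeline actually used, and the length bounds are placeholders:

# Illustrative only -- not the actual BOLD CO1 clean-up pipeline.
import re
from Bio import SeqIO
from Bio.Seq import Seq

IUPAC_DNA = re.compile(r"^[ACGTRYSWKMBDHVN]+$")  # IUPAC nucleotide codes

def clean_fasta(infile, outfile, minlen=200, maxlen=2000):  # placeholder bounds
    seen, kept = set(), []
    for rec in SeqIO.parse(infile, "fasta"):
        s = str(rec.seq).upper()
        if s in seen:                # remove identical duplicates
            continue
        if not IUPAC_DNA.match(s):   # drop seqs with non-IUPAC characters
            continue
        seen.add(s)
        s = s.strip("N")             # remove leading/trailing ambiguous bases
        if not (minlen <= len(s) <= maxlen):  # filter by length
            continue
        rec.seq = Seq(s)
        kept.append(rec)
    SeqIO.write(kept, outfile, "fasta")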
==========================================
Use in NanoClass:
Unzip the database and copy the reference taxonomy and (unzipped) reference sequences to the NanoClass/db/common directory, like so:
$ cp mtD/bold-v20210421-taxonomy-mtD.tsv /path/to/NanoClass/db/common/ref-taxonomy.txt
$ gzip -d -c mtD/bold-v20210421-frags-mtD.fa.gz > /path/to/NanoClass/db/common/ref-seqs.fna
Something similar can be done for the other two primers (CI or LCO-HCO). Only these three primers are supported at this point.
Next, create an (empty) ref-seqs.aln file just to prevent NanoClass from automatically downloading the default 16S SILVA database, which would overwrite the BOLD db you just copied into NanoClass/db/common.
$ touch /path/to/NanoClass/db/common/ref-seqs.aln
Finally, you need to make a change to the NanoClass/Snakefile (i.e., change the first line below into the second):
optrules.extend(["plots/precision.pdf"] if len(config["methods"]) > 2 else [])
optrules.extend(["plots/precision.pdf"] if len(config["methods"]) > 200 else [])
This will disable the computation of precision plots by NanoClass as this is not supported in combination with the custom BOLD CO1 databases.
Also mind that you need to change the nanofilt minlen and maxlen in the NanoClass/config.yaml to capture the appropriate fragment length for your primer. For the mtD primer I used minlen 600 and maxlen 900 for testing.
Use in QIIME2:
You can use the trained classifier directly in QIIME2, like so:
$ qiime feature-classifier classify-sklearn \
    --i-classifier mtD/bold-v20210421-classifier-mtD.qza \
    --i-reads .qza \
    --o-classification .qza \
    --verbose
Something similar can be done for the other two primers (CI or LCO-HCO). Only these three primers are supported at this point. The classifiers have only been tested with the sklearn algorithm.
The Michigan Department of Environment, Great Lakes, and Energy's (EGLE) Environmental Remediation Program manages and reduces risk at sites of environmental contamination. This is achieved through activities such as site evaluation, feasibility studies, operation and maintenance of systems, implementing land use and resource use restrictions, and monitoring. This data layer shows facilities that have been identified and mapped under Part 201, Environmental Remediation, of the Natural Resources and Environmental Protection Act, 1994 PA 451, as amended (NREPA): those areas, places, or parcels of property, or portions of a parcel of property, where a hazardous substance in excess of the concentrations that satisfy the cleanup criteria for unrestricted residential use has been released, deposited, disposed of, or otherwise comes to be located. This data layer does not include all of the facilities that are subject to regulation under Part 201 because owners are not required to inform EGLE about the facilities and can pursue cleanup independently. Facilities that are not known to EGLE are not on the Inventory, nor are locations with releases that resulted in low environmental impact. This data is regularly updated.

Field Name - Alias - Description
OBJECTID - N/A - N/A
SITENAME - Site Name - Name for the location assigned by RRD
ADDRESS - Address - Street address for the site
CITY - City - City associated with the street address
ZIPCODE - Zip Code - Zip code of the site
COUNTY - County - County where the site is located
LATITUDE - Latitude - Latitude (Y-coordinate) of the site
LONGITUDE - Longitude - Longitude (X-coordinate) of the site
SITEID - Site ID - Unique identifier for the site within RRD's RIDE database, which connects to the Environmental Mapper
BusinessType - Business Type - General classification of the type of business that is/was associated with the Part 201 site
HorizontalReferenceDatum - Horizontal Reference Datum - Horizontal reference datum
HorizontalCollectionMethod - Horizontal Reference Method of Collection - Describes the method used for identifying the site
HorizontalAccuracy - Horizontal Accuracy (m) - An estimated measure of the horizontal accuracy of the point in meters
ReferencePoint - Reference Point - Provides a description of the relationship between the point feature and the overall site
SourceMapScale - Source Map Scale - The representative fraction or scale at which the point feature was mapped
RiskCondition - Risk Condition - Risk condition classification applied to the site by EGLE's Remediation and Redevelopment Division, which is used by the division to identify sites that are a priority to address, to manage workloads, and to report metrics on the overall facility status consistently across programs
Contaminants - Contaminants - Chemical classification identified on the site
HasBeaOrNom - HasBeaOrNom - Indicates whether EGLE has knowledge of a baseline environmental assessment or a notice of migration for the site
ProjectManager - Project Manager - The RRD staff person assigned to manage that location
LastUpdated - Last Updated - The date the point was updated
Shape - N/A - N/A

For more information about this data, please contact Matt Warner at WarnerM1@Michigan.gov.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The European Union Visitor Visa Database contains statistics on short-stay visa issuing practices. It is based on official administrative data, cleaned up to provide a consistent time series from 2005 to 2022.
Background
As a non-immigrant visa, a visitor visa is typically valid for a visit of up to 3 months and grants access to the entire Schengen area (for the reporting states that are full members of Schengen). Short-stay visas are an important component of border control practices, providing a mechanism for screening visitors before they arrive at the physical borders.
The statistics in the original administrative data are reported at the per-consulate level. A reporting state will typically be a member of the European Union (EU) and the Schengen free travel area. The dataset does, however, include statistics reported by states in the process of becoming Schengen members. It also includes data from European countries that are part of the Schengen area but not members of the EU, such as Norway.
The dataset includes a column with the refusal rate, calculated as the share of visas not issued out of the total number of visas issued and not issued. Note that "visas not issued" includes both explicit refusals and lapsed or otherwise discontinued applications.
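A minimal sketch of that refusal-rate definition; the column names ("issued", "not_issued") are assumptions for illustration, not the dataset's actual fields:

# Refusal rate = not issued / (issued + not issued); column names are assumed.
import pandas as pd

visas = pd.DataFrame({
    "consulate": ["A", "B"],
    "issued": [9000, 4200],
    "not_issued": [1000, 800],  # explicit refusals plus lapsed/discontinued applications
})
visas["refusal_rate"] = visas["not_issued"] / (visas["issued"] + visas["not_issued"])
print(visas)  # refusal_rate: 0.10 and 0.16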
The dataset made available here includes application statistics. It should be noted that these cover only one, albeit important, component of the common visa policy. Other major elements include the common list of countries subject to a visa requirement in the first place, as well as consular cooperation on visa issuing.
The code for creating the dataset, as well as further details on sources, can be found in the Github repository.
Use cases
The dataset can be used to probe questions on the state and evolution of EU cooperation in the area of borders and migration control. Problems that can be investigated with the data include, for example:
- Patterns of liberal and restrictive border practices and their determinants.
- The degree of harmonization between EU states, i.e., convergent and divergent visa practices.
Acknowledgements
As detailed in the repository, the raw data is processed (cleaned up) as evidenced in the source code. The data for 2005-2012 were imported relying on earlier data clean-up done in connection with the construction of the European Visa Database (see the background section). The country classifications (income group and regions) are sourced from the World Bank's Country and Lending Groups country classification dataset.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Scribe Database Collection includes 14 databases containing data from the Deepwater Horizon (DWH) Oil Spill Event Response Phase. These databases are the work of federal agencies, state environmental management agencies and BP and its contractors. The types of information include locations, descriptions, and analysis of water, sediment, oil, tar, dispersant, air and other environmental samples. The versions of the databases included in this collection are the result of the second phase of a clean-up effort by the database owners and contributors to resolve inconsistencies in the initial databases and to harmonize content across the databases in order for these data to be comparable for reliable evaluation and reporting. This effort was initiated in order to meet requirements supporting the Unified Area Command.
This downloadable data package consists of location and facility identification information from EPA's Facility Registry Service (FRS) for all sites that are available in the FRS individual feature layers. The layers comprise the FRS major program databases, including:
- Assessment Cleanup and Redevelopment Exchange System (ACRES): brownfields sites
- Air Facility System (AFS): stationary sources of air pollution
- ICIS-AIR (AIR): stationary sources of air pollution
- Bureau of Indian Affairs (BIA): schools data on Indian land
- Base Realignment and Closure (BRAC) facilities
- Clean Air Markets Division Business System (CAMDBS): market-based air pollution control programs
- Comprehensive Environmental Response, Superfund Enterprise Management System (SEMS): hazardous waste sites
- Integrated Compliance Information System (ICIS): integrated enforcement and compliance information
- National Compliance Database (NCDB): Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA) and the Toxic Substances Control Act (TSCA)
- National Pollutant Discharge Elimination System (NPDES) module of ICIS: NPDES surface water permits
- Radiation Information Database (RADINFO): radiation and radioactivity facilities
- RACT/BACT/LAER Clearinghouse (RBLC): best available air pollution technology requirements
- Resource Conservation and Recovery Act Information System (RCRAInfo): tracks generators, transporters, treaters, storers, and disposers of hazardous waste
- Toxic Release Inventory (TRI): certain industries that use, manufacture, treat, or transport more than 650 toxic chemicals
- Emission Inventory System (EIS): inventory of large stationary sources and voluntarily-reported smaller sources of air point pollution emitters
- Spill Prevention, Control, and Countermeasure (SPCC) and facility response plan (FRP) subject facilities
- Electronic Greenhouse Gas Reporting Tool (E-GGRT): large greenhouse gas emitters
- Emissions and Generation Resource Integrated Database (EGRID): power plants

The Facility Registry Service (FRS) identifies and geospatially locates facilities, sites or places subject to environmental regulations or of environmental interest. Using vigorous verification and data management procedures, FRS integrates facility data from EPA's national program systems, other federal agencies, and State and tribal master facility records, and provides EPA with a centrally managed, single source of comprehensive and authoritative information on facilities. This data set contains the FRS facilities that link to the programs listed above once the program data has been integrated into the FRS database. Additional information on FRS is available at the EPA website https://www.epa.gov/enviro/facility-registry-service-frs. Included in this package are a file geodatabase, an Esri ArcMap map document, and an XML file of this metadata record. Full FGDC metadata records for each layer are contained in the database.
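A hedged sketch of opening one layer of the package's file geodatabase with geopandas; the .gdb path and layer name below are assumptions, so list the real layer names first and substitute your own:

# Assumed path and layer name -- discover the real ones with fiona.listlayers.
import fiona
import geopandas as gpd

gdb_path = "FRS_Download.gdb"      # hypothetical path to the package's geodatabase
print(fiona.listlayers(gdb_path))  # the actual FRS program layer names

sites = gpd.read_file(gdb_path, layer="ACRES")  # layer name is an assumption
print(sites.head())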
Downloadable Information on Waste Sites and Spills
EXXON Valdez Oil Spill (EVOS) data were generated by the National Marine Fisheries Service (NMFS). The EVOS area includes Prince William Sound and adjacent coastal areas. The data were put on a CD-ROM with the EVOS Geographic Information Systems (GIS) database, data dictionary, and bibliography. Data are related to oil spill clean-up, damage assessments, and restoration efforts. Data sets include physical features, biological features, cultural features, land status, boundaries, place names, human use, shoreline oiling, surface oiling, hydrocarbon analysis, EVOS research areas, and miscellaneous.
https://creativecommons.org/publicdomain/zero/1.0/
You run a targeted marketing campaign with a seemingly clean list and sharp messaging. Yet, the results are disappointing. Low open rates, hard bounces, and a few replies that go nowhere. So, you and your team try to find out what went wrong. You pull the list and start checking contacts manually. What do you find? The people you were targeting are no longer working in the company. Many job titles no longer exist. Phone numbers that ring to the wrong departments. Email addresses that were never valid.
This is not a targeting problem or a messaging problem. It is a data quality problem. And it is far more common than most B2B teams are willing to admit. This article explains why data quality is a critical obstacle in CRM systems and clearly argues that data enrichment services provide a targeted solution to the problem of bad data.
Bad data rarely announces itself. And in the world of B2B, most teams find it out after the damage is done. They discover they have a data problem after their efforts fail. It quietly eats your budget, lowers your deliverability scores, and makes your pipeline projections look worse than they should. According to Gartner research, poor data quality costs organizations an average of $12.9 million per year. That number sounds abstract until you map it to a real pipeline. Imagine this. Your team spends on outbound tools, SDR time, and campaign execution. Now, a meaningful chunk of that spend is going toward contacts who cannot be reached. The math gets worse when you account for deliverability. When your emails hard bounce at scale, inbox providers flag your domain. Your sender reputation drops. Even your valid, accurate contacts stop seeing your emails. One bad list can poison months of outbound effort. And there is a subtler cost that rarely gets discussed. When sales reps spend time calling wrong numbers or researching contacts who have moved on, they are not selling. That time cost adds up fast across a team of ten or twenty people.
There is an uncomfortable truth about CRM data that most B2B marketers overlook. It starts decaying the moment it enters the system. People change jobs. They get promoted. Companies get acquired. Departments get restructured. HubSpot research estimates that B2B data decays at a rate of about 22.5% per year. That means roughly one in five contacts in your CRM becomes inaccurate within twelve months of entry. Think about this. Your database has 50,000 contacts, and you last cleaned it eighteen months ago. Now do the math. You are potentially working with 15,000 to 20,000 contacts that are partially or fully wrong. Would you say that is a fringe problem? No! It is a structural issue that is costing your marketing and sales efforts. So, what are the sources of data decay across most B2B organizations? Here are a few that you should be aware of:
- Job changes and promotions that update titles and email formats
- Company rebranding or domain changes that break email addresses
- Mergers and acquisitions that restructure buying committees
- Role eliminations that remove decision-makers entirely
- Manual data entry errors that slip in from the start

None of these are exotic. They happen constantly. And without a systematic process to catch them, your CRM drifts further from reality with every passing quarter.
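To make that math concrete, here is a back-of-the-envelope sketch assuming the cited 22.5% annual decay rate compounds over the eighteen months since the last clean-up:

# Back-of-the-envelope CRM decay estimate under a compounding-decay assumption.
contacts = 50_000
annual_decay = 0.225
years = 1.5  # eighteen months since the database was last cleaned

still_accurate = contacts * (1 - annual_decay) ** years
stale = contacts - still_accurate
print(f"Estimated stale contacts after {years:g} years: {stale:,.0f}")
# ~15,900 stale contacts -- in the 15,000-20,000 range described above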
Data enrichment is not data cleansing, though the two often get confused. Cleansing removes what is wrong. Enrichment adds what is missing and updates what has changed. The distinction matters because a clean record is not the same as a complete or current one. Data cleansing and data enrichment are, in fact, two parts of the same process. First, you clean the data, and then you enrich it. A typical data enrichment services process works like this. You provide your existing contact or account records. The data enrichment provider runs them against verified, regularly updated data sources. What comes back is a record that has been checked for accuracy, filled in with missing fields, and updated to reflect current reality. In practical terms, this means:
- A contact who changed companies now has their current employer, title, and email
- A record missing a direct dial now has one appended from a verified source
- An account with outdated firmographics now reflects the current headcount and revenue range
- A contact with an invalid email has been flagged o...
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Here is a description of how the datasets for a training notebook used for the Telegram ML Contest solution were prepared.
The first part of the code samples was taken from a private version of this notebook.
Here are the statistics for the classes of programming languages from the Github Code Snippets database:
[Chart: class counts for the programming languages in the Github Code Snippets database]
From this database, 2 csv files were created with 50,000 code samples for each of the 20 included programming languages: one with equal class counts and one with stratified sampling. The files referenced here are sample_equal_prop_50000.csv and sample_stratified_50000.csv, respectively.
A second option for capturing additional examples was to run this notebook with a larger number of queries: 10,000.
The resulting file is dataset-10000.csv, included in the data card.
The statistics for the programming-language classes are shown on the next chart; there are 32 labeled classes:
[Chart: class counts across the 32 labeled programming-language classes]
To make the model more robust, code samples for 20 additional languages were collected, 10 to 15 samples each, covering more or less popular use cases. Also, for the class "OTHER" (natural-language examples, per the task of the competition), text examples from this dataset of prompts on Huggingface were added to the file. The resulting file is rare_languages.csv, also in the data card.
The statistics for the rare-language code snippets are as follows:
[Chart: class counts for the rare-language code snippets]
At this stage of dataset creation, the columns in sample_equal_prop_50000.csv and sample_stratified_50000.csv were cut down to just two, "snippet" and "language"; the version of the file with equal class counts is in the data card as sample_equal_prop_50000_clean.csv.
To prepare the BigQuery dataset file, the index column was cut out and the column "content" was renamed to "snippet". These changes were saved in dataset-10000-clean.csv.
After that, the files sample_equal_prop_50000_clean.csv and dataset-10000-clean.csv were combined and saved as github-combined-file.csv.
The prepared files took too much RAM to be read with the pandas library, which is why additional preprocessing was done: symbols such as quotes, commas, ampersands, newlines, and tab characters were cleaned out, as sketched below. After cleaning, the files were merged with the rare_languages.csv file and saved as github-combined-file-no-symbols-rare-clean.csv and sample_equal_prop_50000_-no-symbols-rare-clean.csv, respectively.
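An illustrative sketch of that clean-up and merge (not the author's exact script; the regex follows the list of symbols described above):

# Illustrative symbol clean-up and merge; not the author's exact script.
import re
import pandas as pd

SYMBOLS = re.compile(r"[\"',&\n\t]")  # quotes, commas, ampersands, newlines, tabs

def clean_snippets(path):
    df = pd.read_csv(path)
    df["snippet"] = df["snippet"].astype(str).str.replace(SYMBOLS, " ", regex=True)
    return df

combined = clean_snippets("github-combined-file.csv")
rare = pd.read_csv("rare_languages.csv")
merged = pd.concat([combined, rare], ignore_index=True)
merged.to_csv("github-combined-file-no-symbols-rare-clean.csv", index=False)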
The final distribution of classes turned out as follows:
[Chart: final class distribution]
To make the data suitable for the TF-DF format, each programming language was also assigned a label, as in the sketch below. The final labels are in the data card.
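A minimal sketch of assigning an integer label per language (the authoritative mapping is the one published in the data card):

# Assign integer labels per language for TF-DF; the real mapping is in the data card.
import pandas as pd

df = pd.read_csv("github-combined-file-no-symbols-rare-clean.csv")
df["label"], languages = pd.factorize(df["language"])
label_map = dict(enumerate(languages))  # e.g. {0: "PYTHON", 1: "JAVA", ...}
df.to_csv("labeled-dataset.csv", index=False)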
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Qiime2 formatted NCBI ITS database (fasta + taxonomy) for analysis of fungi ITS amplicon sequencing. All sequences that have not been identified at least to Phylum level were removed.Data download: search -db nuccore -query ""(internal transcribed spacer 1"[All Fields] AND "fungi"[Filter] AND (250[SLEN] : 10000[SLEN])) NOT "uncultured Neocallimastigales"[porgn] NOT "bacteria"[Filter] NOT "uncultured fungus"[Filter] NOT "Uncultured fungus"[Filter] NOT "fungal sp."[Filter]" | efetch -format fasta -mode text > ./NCBI_ITS1_DB_raw.fastaData processing (https://github.com/gzahn/tools/blob/master/make_qiime_database_from_fasta.sh)### Search for and remove any empty sequences ###gawk 'BEGIN {RS = ">" ; FS = " " ; ORS = ""} {if ($2) print ">"$0}' NCBI_ITS1_DB_raw.fasta > NCBI_ITS1_DB_raw.fasta.tidy# Obtain NCBI taxonomy lineages for your input fastapython2 /home/bioinf/bin/entrez_qiime.py -i NCBI_ITS1_DB_raw.fasta.tidy -o NCBI_Taxonomy.txt -r kingdom,phylum,class,order,family,genus,species -a /media/bioinf/Data/NCBI_tax2021/nucl_gb.accession2taxid -n /media/bioinf/Data/NCBI_tax2021### Validate and Tidy up files ###### Edit output file to include rank IDs (QIIME needs them for some scripts)cat NCBI_Taxonomy.txt | sed 's/\t/\tk_/' | sed 's/;/>p_/' | sed 's/;/>c_/' | sed 's/;/>o_/' | sed 's/;/>f_/' | sed 's/;/>g_/' | sed 's/;/>s_/' | sed 's/>/;/g' > NCBI_QIIME_Taxonomy.txt### Edit database to single-line fasta formatawk '/^>/ {printf(" %s ",$0);next; } { printf("%s",$0);} END {printf(" ");}' < NCBI_ITS1_DB_raw.fasta.tidy > NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta### Remove first blank linesed -i '/^$/d' NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta### Remove trailing descriptions after Accession No.sed -i 's/ .*//' NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta### compare read counts in fasta and txt filesgrep -c "^>" NCBI_ITS1_DB_raw.fasta.tidy.oneline.fastawc -l NCBI_QIIME_Taxonomy.txt#if numbers are different, there are duplicates introduced by entrez_qiime.py### if some duplicates may appear in fasta file (i.e., more reads than taxonomy IDs), get lists of Seq/Taxonomy IDs and remove duplicates from fasta filecut -f 1 NCBI_QIIME_Taxonomy.txt > Tax_Namesgrep "^>" NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta | cut -d " " -f 1 | sed 's/>//g' > DB_Namessort DB_Names | uniq -d > Duplicated_IDsgrep -A1 -f Duplicated_IDs NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta | sed '/^--/d' > Duplicated_fastasfor fn in Duplicated_fastas; do count=$(wc -l add_back; donegrep -v -f Duplicated_IDs NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta > tidy.no_reps.fastacat tidy.no_reps.fasta add_back > DB_raw.fasta### Sort fasta database to same order as taxonomy mapecho "Sorting Database...This will take some time."cut -f 1 NCBI_QIIME_Taxonomy.txt > IDs_in_order.txtwhile read ID ; do grep -m 1 -A 1 "^>$ID" DB_raw.fasta ; done < IDs_in_order.txt > DB.fasta #This will take quite a long time to runmv NCBI_QIIME_Taxonomy.txt Taxonomy.txtrm DB_Names DB_raw.fasta Duplicated_fastas Duplicated_IDs IDs_in_order.txt NCBI_Taxonomy.txt Tax_Names tidy.no_reps.fasta NCBI_ITS1_DB_raw.fasta.tidy.oneline.fasta NCBI_ITS1_DB_raw.fasta.tidy add_backcat NCBI_ITS1_DB_raw.fasta.loggrep "^>" DB.fasta | sed 's/>//' >good_acc_listecho "Cleaning Taxonomy to match Database...This may take some time."while read ID ; do grep -m 1 $ID Taxonomy.txt ; done < good_acc_list > Taxonomy_ordered.txt#mv $4/Taxonomy_ordered.txt $4/Taxonomy.txt#rm $4/good_acc_listgrep "k_NA;p_NA;c_NA;o_NA;f_NA;g_NA;s_NA|^:" Taxonomy_ordered.txt | cut -f1 > bad_acc_listsed -e 
'/k_NA;p_NA;c_NA;o_NA;f_NA;g_NA;s_NA/d' Taxonomy_ordered.txt > Taxonomy_clean1.txtsed -e '/^:/d' Taxonomy_clean1.txt > Taxonomy.txtecho "Final cleanup to remove bad accessions..."while read bad; do echo "Removing $bad" ; sed -i -e "/$bad/,+1d" DB.fasta ; done < bad_acc_listsed -i -e '/^>:/,+1d' DB.fastagrep "^>" DB.fasta | sed 's/>//' > DB_IDs_orderedwhile read ID; do grep $ID Taxonomy_ordered.txt ; done < DB_IDs_ordered > Taxonomy_final.txtrm Taxonomy_clean1.txt Taxonomy_ordered.txtmv bad_acc_list bad_acc_list.txtecho -e "Process complete. Final database is DB_ordered.fasta, and associated taxonomy is Taxonomy_ordered.txt Accessions that were removed are in bad_acc_list.txt"
Our Contact Validation and Append solution identifies and fixes errors in your existing customer database whilst appending missing information, including email addresses and telephone numbers. This comprehensive approach allows you to provide excellent customer service, obtain accurate billing information, and achieve high collection rates across all your communications.
What is it?
A combination of cleansing, validation, correction and appending solutions applied to your customer base, whether residential or commercial. The full process involves the following steps:
This multi-step approach ensures your contact database is not only clean and accurate, but also complete with the most up-to-date information available.
Use cases
- Deliver more messaging to the right customers: Ensure your communications reach their intended recipients by maintaining accurate contact details
- Less wastage for your messaging and marketing: Reduce bounce rates and failed delivery attempts, maximising your marketing budget efficiency
- Increase delivery success and engagement propensity: Clean, validated contact data leads to higher open rates, click-through rates, and overall campaign performance
- Improve customer service delivery: Reach customers through their preferred contact methods with confidence in data accuracy
- Enhance billing and collection processes: Accurate contact information supports successful payment reminders and collection activities
- Maintain GDPR compliance: Keep your contact database current and accurate in line with data protection requirements