36 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Dataset - A Computational Simulator for Incineration of Wastes Generated...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Jul 16, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). Dataset - A Computational Simulator for Incineration of Wastes Generated from Cleanup of Chemical and Biological Contamination Incidents [Dataset]. https://catalog.data.gov/dataset/dataset-a-computational-simulator-for-incineration-of-wastes-generated-from-cleanup-of-che
    Explore at:
    Dataset updated
    Jul 16, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The figures in the paper are outputs from the model based on example run conditions. Figures 4 and 5 show comparisons between the model predictions and measured data from the EPA Lab Kiln module. Figure 6 shows the difference between using the Z-Value and the Arrhenius approach to the kinetics of biological agent destruction. Figures 7 and 8 are 3-D heat maps of the temperature and oxygen concentration, respectively, in the commercial rotary kiln module. They are raster images rather than traditional x-y coordinate graphs. Figure 9 shows streamlines within the primary combustion chamber of the commercial rotary kiln and predicted destruction of the GB nerve agent along those streamlines. Figure 10 shows predicted gas temperature along a streamline in the commercial rotary kiln module. Figure 11 shows example predictions of the mole fraction of 3 chemical warfare agents along streamlines in the commercial rotary kiln module. Figures 12 and 13 show predicted destruction and waste "piece" temperature of the biological agent Bacillus anthracis in bundles of carpet in the commercial rotary kiln. This dataset is associated with the following publication: Lemieux, P., T. Boe, A. Tschursin, M. Denison, K. Davis, and D. Swenson. Computational simulation of incineration of chemically and biologically contaminated wastes. JOURNAL OF THE AIR & WASTE MANAGEMENT ASSOCIATION. Air & Waste Management Association, Pittsburgh, PA, USA, 71(4): 462-476, (2021).

  3. Company Datasets for Business Profiling

    • datarade.ai
    Updated Feb 23, 2017
    Cite
    Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
    Explore at:
    .json, .xml, .csv, .xls (available download formats)
    Dataset updated
    Feb 23, 2017
    Dataset authored and provided by
    Oxylabs
    Area covered
    Isle of Man, Canada, Taiwan, Tunisia, Bangladesh, British Indian Ocean Territory, Nepal, Northern Mariana Islands, Andorra, Moldova (Republic of)
    Description

    Company Datasets for valuable business insights!

    Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

    These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

    • Owler: Gain valuable business insights and competitive intelligence.
    • AngelList: Receive fresh startup data transformed into actionable insights.
    • CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies.
    • Craft.co: Make data-informed business decisions with Craft.co's company datasets.
    • Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

    We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

    • Company name;
    • Size;
    • Founding date;
    • Location;
    • Industry;
    • Revenue;
    • Employee count;
    • Competitors.

    You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

    Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

    With Oxylabs Datasets, you can count on:

    • Fresh and accurate data collected and parsed by our expert web scraping team.
    • Time and resource savings, allowing you to focus on data analysis and achieving your business goals.
    • A customized approach tailored to your specific business needs.
    • Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!

  4. Alaska Geochemical Database Version 3.0 (AGDB3) including best value data...

    • s.cnmilf.com
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Alaska Geochemical Database Version 3.0 (AGDB3) including best value data compilations for rock, sediment, soil, mineral, and concentrate sample media [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/alaska-geochemical-database-version-3-0-agdb3-including-best-value-data-compilations-for-r
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Alaska
    Description

    The Alaska Geochemical Database Version 3.0 (AGDB3) contains new geochemical data compilations in which each geologic material sample has one best value determination for each analyzed species, greatly improving speed and efficiency of use. Like the Alaska Geochemical Database Version 2.0 before it, the AGDB3 was created and designed to compile and integrate geochemical data from Alaska to facilitate geologic mapping, petrologic studies, mineral resource assessments, definition of geochemical baseline values and statistics, element concentrations and associations, environmental impact assessments, and studies in public health associated with geology. This relational database, created from databases and published datasets of the U.S. Geological Survey (USGS), Atomic Energy Commission National Uranium Resource Evaluation (NURE), Alaska Division of Geological & Geophysical Surveys (DGGS), U.S. Bureau of Mines, and U.S. Bureau of Land Management serves as a data archive in support of Alaskan geologic and geochemical projects and contains data tables in several different formats describing historical and new quantitative and qualitative geochemical analyses. The analytical results were determined by 112 laboratory and field analytical methods on 396,343 rock, sediment, soil, mineral, heavy-mineral concentrate, and oxalic acid leachate samples. Most samples were collected by personnel of these agencies and analyzed in agency laboratories or, under contracts, in commercial analytical laboratories. These data represent analyses of samples collected as part of various agency programs and projects from 1938 through 2017. In addition, mineralogical data from 18,138 nonmagnetic heavy-mineral concentrate samples are included in this database. The AGDB3 includes historical geochemical data archived in the USGS National Geochemical Database (NGDB) and NURE National Uranium Resource Evaluation-Hydrogeochemical and Stream Sediment Reconnaissance databases, and in the DGGS Geochemistry database. Retrievals from these databases were used to generate most of the AGDB data set. These data were checked for accuracy regarding sample location, sample media type, and analytical methods used. In other words, the data of the AGDB3 supersedes data in the AGDB and the AGDB2, but the background about the data in these two earlier versions is needed by users of the current AGDB3 to understand what has been done to amend, clean up, correct and format this data. Corrections were entered, resulting in a significantly improved Alaska geochemical dataset, the AGDB3. Data that were not previously in these databases because the data predate the earliest agency geochemical databases, or were once excluded for programmatic reasons, are included here in the AGDB3 and will be added to the NGDB and Alaska Geochemistry. The AGDB3 data provided here are the most accurate and complete to date and should be useful for a wide variety of geochemical studies. The AGDB3 data provided in the online version of the database may be updated or changed periodically.

  5. Marine Benthic Abundance Data

    • s.cnmilf.com
    • data.kingcounty.gov
    • +1more
    Updated Sep 13, 2024
    + more versions
    Cite
    data.kingcounty.gov (2024). Marine Benthic Abundance Data [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/marine-benthic-abundance-data
    Explore at:
    Dataset updated
    Sep 13, 2024
    Dataset provided by
    data.kingcounty.gov
    Description

    This dataset contains Puget Sound benthic abundance data. Before accessing data please download the README file. King County monitors the community of animals that live in and on the sediments at the bottom of Puget Sound. These animals, called benthos, are crucial to the health of Puget Sound. We collect sediment samples to help us understand what the sediments are like physically and chemically and how they support different species. Benthic samples have been collected since 2015 from 14 sites. Every 2 years we sample eight sites in Elliott Bay and every 5 years we collect samples from three deep sites in the mainstem of the Central Basin and three smaller embayments. We also collect benthic samples near treatment plant outfalls as well as cleanup sites. Benthic samples are typically collected in replicates (3-5 samples per location and period of time). These replicates help us understand the variability of the benthic community at that location and are distinguished by the fields "Sample Rep ID" and "Field Replicate". For corresponding benthic biomass data see: Marine Benthic Biomass Data. For questions about the data, please contact MarineWQ@kingcounty.gov.

  6. The Scribe Database Collection, compiled in response to the Deepwater...

    • catalog.data.gov
    • gimi9.com
    • +1more
    Updated Mar 1, 2025
    + more versions
    Cite
    (Point of Contact) (2025). The Scribe Database Collection, compiled in response to the Deepwater Horizon oil spill incident in the Gulf of Mexico from 2010-04-23 to 2011-11-08 (NCEI Accession 0086261) [Dataset]. https://catalog.data.gov/dataset/the-scribe-database-collection-compiled-in-response-to-the-deepwater-horizon-oil-spill-incident
    Explore at:
    Dataset updated
    Mar 1, 2025
    Dataset provided by
    (Point of Contact)
    Area covered
    Gulf of Mexico (Gulf of America)
    Description

    The Scribe Database Collection includes 14 databases containing data from the Deepwater Horizon (DWH) Oil Spill Event Response Phase. These databases are the work of federal agencies, state environmental management agencies and BP and its contractors. The types of information include locations, descriptions, and analysis of water, sediment, oil, tar, dispersant, air and other environmental samples. The versions of the databases included in this collection are the result of the second phase of a clean-up effort by the database owners and contributors to resolve inconsistencies in the initial databases and to harmonize content across the databases in order for these data to be comparable for reliable evaluation and reporting. This effort was initiated in order to meet requirements supporting the Unified Area Command.

  7. Cleaned LargeRDFBench dumps

    • data.niaid.nih.gov
    Updated Apr 20, 2022
    + more versions
    Cite
    Huf, Alexis (2022). Cleaned LargeRDFBench dumps [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5008279
    Explore at:
    Dataset updated
    Apr 20, 2022
    Dataset provided by
    Huf, Alexis
    Siqueira, Frank
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Dumps for each of the LargeRDFBench datasets in two formats:

    A single N-Triples (no prefixes, no unquoted numbers/booleans) file compressed with zstd

    A HDT file with sidecar index file (.hdt.index.v1-1) for faster querying.

    .mark files, which are JSON files storing the SHA-256 hashes of the above files and the input dump file from the original LargeRDFBench.

    In addition to individual datasets, there is LargeRDFBench-all.hdt, containing the union of all triples in all datasets.
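
    As a quick sanity check after downloading, the SHA-256 digests can be recomputed and compared with the values stored in the .mark files. The sketch below is only an illustration: the example file name and the assumed layout of the .mark JSON (a plain mapping from file name to hex digest) are assumptions, not documented structure.

    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
        """Stream a file and return its hex-encoded SHA-256 digest."""
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Hypothetical layout: the .mark file maps file names to hex digests.
    marks = json.loads(Path("DBpedia-Subset.mark").read_text())
    for name, expected in marks.items():
        status = "OK" if sha256_of(Path(name)) == expected else "MISMATCH"
        print(f"{name}: {status}")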

    The files in this dataset were generated using the "fix" subcommand of the freqel-driver command-line utility. The files in this zenodo dataset were generated from commit 47cea26 of said repository. Nearly all of the cleanup code, however, is from rdfit version 1.0.6, which is available from Maven Central.

    There are four reasons to use this dataset as a substitute for the original:

    Flatter file structure: there is a single file per dataset

    All data is in N-Triples (no RDF/XML or Turtle syntax in .nt-named files)

    Valid IRIs and valid N-Triples syntax (no parsers errors, at most warnings)

    Provided .hdt files are directly queryable

    Since the original dumps have syntax errors and invalid IRIs, there are multiple ways to handle such issues, and this dataset represents one set of choices for handling them. For example, the Virtuoso endpoints of the original (as of commit 49d1401) ingest and expose invalid IRIs and langtags without complaining. Thus, there are SPARQL queries for which the results obtained using this cleaned version and the original Virtuoso endpoint bundles will differ. As far as we know, this possibility does not apply to the LargeRDFBench SPARQL queries. No triples were discarded in the cleaning process; rather, triples with invalid IRIs (as per RFC 3987) and invalid language tags are mapped to valid counterparts. Literals are mostly unaffected, except for one particular syntax violation in the Affymetrix dataset: non-escaped null characters (U+0000) in lexical forms were replaced with spaces (U+0020) to make HDT files possible. The syntax fixes were made using the RIt.tolerant() functionality of the rdfit library, version 1.0.6. The list of transformations (beyond flattening the file structure and storing as N-Triples and HDT) was:

    Percent-encode characters not allowed at their current position in the IRI by RFC 3987.

    If percent-encoding is not allowed at that position by RFC 3987 (e.g., port rule), the character will be erased

    Erase invalid character encodings (when the byte sequence is not merely a wrong character but outright invalid UTF-8)

    Replace '_' in language tags with '-' (e.g., en_US becomes en-US)

    For NT/Turtle, backslash-escape occurrences of \r (0x0D) and \n (0x0A) inside single-quoted lexical forms.

    For NT/Turtle, replace \ with \\ in any \x-escape where x is not in tbnrf"' (see ECHAR).

    For NT/Turtle, identify UCHAR escape sequences that represent a UTF-8 encoding instead of a Unicode code point. Such sequences are composed of only byte-sized code points, whose value sequence corresponds to a valid UTF-8 sequence and where at least one such byte has a value that is the code point of a control character. Given such conditions, the sequence of UCHARs is replaced by a single UCHAR for the character encoded in UTF-8. Example: \u00C3\u0085, which corresponds to Å in UTF-8, becomes \u00C5, since U+0085 is a control character.

    For NT/Turtle, @PREFIX and @BASE are rewritten to @prefix and @base

    For NT/Turtle, literals true and false with any variation in case (e.g., True) are replaced with the standard true and false.

    For NT/Turtle, a lexical form followed by a run of ^ characters whose length differs from 2 is replaced with the standard ^^ datatype separator.

    For NT/Turtle, replace invalid unquoted plain literals with plain string literals. For this, the code assumes the invalid unquoted literal has no spaces (i.e., whitespace is a separator and never part of the invalid literal). Examples of this fix in action: 2e-3.4 becomes "2e-3.4" (the exponent must be an integer) and falseful becomes "falseful".

    Strip leading whitespace, %20, %09, %0A and %0D, and strip underscores at any position, from IRI schemes. Affymetrix and Jamendo are affected.

    For Turtle/NT/TriG, replace NULL characters (U+0000) in string literals with spaces (U+0020). Use case: only Affymetrix (see the sketch below).
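
    To make the flavor of these fixes concrete, here is a minimal Python sketch of two of the simpler ones (language-tag underscores and NULL characters) applied to a single N-Triples line. It only illustrates the idea and is not the rdfit/freqel-driver implementation; the regular expression is a deliberate simplification.

    import re

    def fix_line(line: str) -> str:
        """Apply two of the fixes listed above to one N-Triples line (illustrative only)."""
        # Replace '_' with '-' inside language tags, e.g. "Ola"@pt_BR -> "Ola"@pt-BR.
        line = re.sub(
            r'"@([A-Za-z0-9_-]+)',
            lambda m: '"@' + m.group(1).replace("_", "-"),
            line,
        )
        # Replace NULL characters (U+0000) in lexical forms with spaces (U+0020).
        return line.replace("\u0000", " ")

    print(fix_line('<http://ex/s> <http://ex/p> "Ol\u00e1"@pt_BR .'))
    # -> <http://ex/s> <http://ex/p> "Olá"@pt-BR .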

    Changelog

    1.0.1: Re-generated LMDB.index.v1-1 to fix wrong results on queries with unbound subject, owl:sameAs predicate and bound object.

    1.0.2: Added LargeRDFBench-all.hdt and sidecar index file

  8. Spill Incidents

    • data.ny.gov
    • datadiscoverystudio.org
    • +3more
    application/rdfxml +5
    Updated Mar 26, 2025
    + more versions
    Cite
    New York State Department of Environmental Conservation (2025). Spill Incidents [Dataset]. https://data.ny.gov/widgets/u44d-k5fk
    Explore at:
    xml, json, application/rdfxml, csv, application/rssxml, tsv (available download formats)
    Dataset updated
    Mar 26, 2025
    Dataset authored and provided by
    New York State Department of Environmental Conservation
    Description

    This dataset contains records of spills of petroleum and other hazardous materials. Under State law and regulations, spills that could pollute the lands or waters of the state must be reported by the spiller (and, in some cases, by anyone who has knowledge of the spill). Examples of what may be included in a spill record include:

    • Administrative information (DEC region and unique seven-digit spill number)
    • Program facility name
    • Spill date/time
    • Location
    • Spill source and cause
    • Material(s) and material type spilled
    • Quantity spilled and recovered
    • Units measured
    • Surface water bodies affected
    • Close date (cleanup activity finished and all paperwork completed)

  9. Dataset for: The Evolution of the Manosphere Across the Web

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 30, 2020
    Cite
    Manoel Horta Ribeiro (2020). Dataset for: The Evolution of the Manosphere Across the Web [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4007912
    Explore at:
    Dataset updated
    Aug 30, 2020
    Dataset provided by
    Jeremy Blackburn
    Summer Long
    Barry Bradlyn
    Gianluca Stringhini
    Savvas Zannettou
    Stephanie Greenberg
    Manoel Horta Ribeiro
    Emiliano De Cristofaro
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The Evolution of the Manosphere Across the Web

    We make available data related to subreddit and standalone forums from the manosphere.

    We also make available Perspective API annotations for all posts.

    You can find the code in GitHub.

    Please cite this paper if you use this data:

    @article{ribeiroevolution2021,
      title={The Evolution of the Manosphere Across the Web},
      author={Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas},
      booktitle = {{Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM'21)}},
      year={2021}
    }

    1. Reddit data

    We make available data for forums and for relevant subreddits (56 of them, as described in subreddit_descriptions.csv). These are available, 1 line per post in each subreddit, in /ndjson/reddit.ndjson. A sample post is:

    { "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.

    Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.

    Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.

    No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.

    I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.

    Tallcels are fakecels and they all can (and should) suck my cock.

    If I were 17cm taller my life would be a heaven and I would be the happiest man alive.

    Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }
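
    A minimal sketch for iterating over the Reddit dump, assuming one JSON object per line as described above (the path is the one given in the description; adjust it to wherever the archive was unpacked):

    import json
    from collections import Counter

    # Count posts per subreddit in the Reddit ndjson dump.
    posts_per_subreddit = Counter()
    with open("ndjson/reddit.ndjson", encoding="utf-8") as fh:
        for line in fh:
            post = json.loads(line)
            posts_per_subreddit[post["subreddit"]] += 1

    print(posts_per_subreddit.most_common(10))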

    2. Forums

    We here describe the .sqlite and .ndjson files that contain the data from the following forums.

    (avfm) --- https://d2ec906f9aea-003845.vbulletin.net
    (incels) --- https://incels.co/
    (love_shy) --- http://love-shy.com/lsbb/
    (redpilltalk) --- https://redpilltalk.com/
    (mgtow) --- https://www.mgtow.com/forums/
    (rooshv) --- https://www.rooshvforum.com/
    (pua_forum) --- https://www.pick-up-artist-forum.com/
    (the_attraction) --- http://www.theattractionforums.com/

    The files are in folders /sqlite/ and /ndjson.

    2.1 .sqlite

    All the tables in the .sqlite datasets follow a very simple {key: value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a Python dictionary or a list. This file contains three tables:

    idx each key is the relative address to a thread and maps to a post. Each post is represented by a dict:

    "type": (list) in some forums you can add a descriptor such as [RageFuel] to each topic, and you may also have special types of posts, like sticked/pool/locked posts.
    "title": (str) title of the thread; "link": (str) link to the thread; "author_topic": (str) username that created the thread; "replies": (int) number of replies, may differ from number of posts due to difference in crawling date; "views": (int) number of views; "subforum": (str) name of the subforum; "collected": (bool) indicates if raw posts have been collected; "crawled_idx_at": (str) datetime of the collection.

    processed_posts each key is the relative address to a thread and maps to a list with posts (in order). Each post is represented by a dict:

    "author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.

    raw_posts each key is the relative address to a thread and maps to a list with unprocessed posts (in order). Each post is represented by a dict:

    "post_raw": (binary) raw html binary; "crawled_at": (str) datetime of the collection.

    2.2 .ndjson

    Each line consists of a json object representing a different comment with the following fields:

    "author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.

    3. Perspective

    We also ran each forum post and Reddit post through Perspective; the files are located in the /perspective/ folder. They are compressed with gzip. One example output:

    { "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }
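
    A minimal sketch for reading these files, assuming one JSON object per line in each gzip-compressed file; the file name is illustrative, while the id_post, hate_output and TOXICITY fields come from the example above.

    import gzip
    import json

    # Collect (id_post, TOXICITY) pairs from one Perspective output file.
    scores = []
    with gzip.open("perspective/incels.ndjson.gz", mode="rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            scores.append((record["id_post"], record["hate_output"]["TOXICITY"]))

    print(sum(score for _, score in scores) / len(scores))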

    4. Working with sqlite

    A nice way to read some of the files of the dataset is using SqliteDict, for example:

    from sqlitedict import SqliteDict

    processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

    for key, posts in processed_posts.items():
        for post in posts:
            # here you could do something with each post in the dataset
            pass

    5. Helpers

    Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to reddit, and not to the forums! They are:

    channel_dict.sqlite a sqlite where each key corresponds to a subreddit and values are lists of dictionaries of the users who posted on it, along with timestamps.

    author_dict.sqlite a sqlite where each key corresponds to an author and values are lists of dictionaries of the subreddits they posted on, along with timestamps.

    These are used in the paper for the migration analyses.

    6. Examples and particularities for forums

    Although we did our best to clean the data and be consistent across forums, this is not always possible. In the following subsections we discuss the particularities of each forum, note directions to improve the parsing that were not pursued, and give some examples of how things work in each forum.

    6.1 incels

    Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

    types: for the incel forums the special types associated with each thread in the idx table are “Sticky”, “Pool”, “Closed”, and the custom types added by users, such as [LifeFuel]. These last ones are all in brackets. You can see some examples of these on the example thread page.

    quotes: quotes in this forum were quite nice and thus, all quotations are deterministic.

    6.2 LoveShy

    Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

    types: no types were parsed. There are some rules in the forum, but they are not significant.

    quotes: quotes were obtained from exact text+author match, or author match + a jaccard

  10. Geolytica POIData.xyz Points of Interest (POI) Geo Data - UAE

    • datarade.ai
    .csv
    Updated Mar 16, 2021
    Cite
    Geolytica (2021). Geolytica POIData.xyz Points of Interest (POI) Geo Data - UAE [Dataset]. https://datarade.ai/data-products/geolytica-poidata-xyz-points-of-interest-poi-geo-data-uae-geolytica
    Explore at:
    .csv (available download formats)
    Dataset updated
    Mar 16, 2021
    Dataset authored and provided by
    Geolytica
    Area covered
    United Arab Emirates
    Description

    A point of interest (POI) is defined as a physical entity (such as a business) at a geographic location (a point) which may be of interest.

    We strive to provide the most accurate, complete and up to date point of interest datasets for all countries of the world. The United Arab Emirates POI Dataset is one of our worldwide POI datasets with over 98% coverage.

    This is our process flow:

    Our machine learning systems continuously crawl for new POI data
    Our geoparsing and geocoding calculates their geo locations
    Our categorization systems cleanup and standardize the datasets
    Our data pipeline API publishes the datasets on our data store
    

    POI Data is in a constant flux - especially so during times of drastic change such as the Covid-19 pandemic.

    Worldwide, on an average day, every minute over 200 businesses will move, over 600 new businesses will open their doors, and over 400 businesses will cease to exist.

    In today's interconnected world, of the approximately 200 million POIs worldwide, over 94% have a public online presence. As a new POI comes into existence its information will appear very quickly in location based social networks (LBSNs), other social media, pictures, websites, blogs, press releases. Soon after that, our state-of-the-art POI Information retrieval system will pick it up.

    We offer our customers perpetual data licenses for any dataset representing this ever changing information, downloaded at any given point in time. This makes our company's licensing model unique in the current Data as a Service - DaaS Industry. Our customers don't have to delete our data after the expiration of a certain "Term", regardless of whether the data was purchased as a one time snapshot, or via a recurring payment plan on our data update pipeline.

    The main differentiators between us vs the competition are our flexible licensing terms and our data freshness.

    The core attribute coverage is as follows:

    Poi Field            Data Coverage (%)
    poi_name             100
    brand                4
    poi_tel              48
    formatted_address    100
    main_category        96
    latitude             100
    longitude            100
    neighborhood         2
    source_url           47
    email                6
    opening_hours        43

    The data may be visualized on a map at https://store.poidata.xyz/ae and a data sample may be downloaded at https://store.poidata.xyz/datafiles/ae_sample.csv

  11. Johns Hopkins COVID-19 Case Tracker

    • data.world
    csv, zip
    Updated Mar 25, 2025
    Cite
    The Associated Press (2025). Johns Hopkins COVID-19 Case Tracker [Dataset]. https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker
    Explore at:
    zip, csv (available download formats)
    Dataset updated
    Mar 25, 2025
    Authors
    The Associated Press
    Time period covered
    Jan 22, 2020 - Mar 9, 2023
    Area covered
    Description

    Updates

    • Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.

    • April 9, 2020

      • The population estimate data for New York County, NY has been updated to include all five New York City counties (Kings County, Queens County, Bronx County, Richmond County and New York County). This has been done to match the Johns Hopkins COVID-19 data, which aggregates counts for the five New York City counties to New York County.
    • April 20, 2020

      • Johns Hopkins death totals in the US now include confirmed and probable deaths in accordance with CDC guidelines as of April 14. One significant result of this change was an increase of more than 3,700 deaths in the New York City count. This change will likely result in increases for death counts elsewhere as well. The AP does not alter the Johns Hopkins source data, so probable deaths are included in this dataset as well.
    • April 29, 2020

      • The AP is now providing timeseries data for counts of COVID-19 cases and deaths. The raw counts are provided here unaltered, along with a population column with Census ACS-5 estimates and calculated daily case and death rates per 100,000 people. Please read the updated caveats section for more information.
    • September 1st, 2020

      • Johns Hopkins is now providing counts for the five New York City counties individually.
    • February 12, 2021

      • The Ohio Department of Health recently announced that as many as 4,000 COVID-19 deaths may have been underreported through the state’s reporting system, and that the "daily reported death counts will be high for a two to three-day period."
      • Because deaths data will be anomalous for consecutive days, we have chosen to freeze Ohio's rolling average for daily deaths at the last valid measure until Johns Hopkins is able to back-distribute the data. The raw daily death counts, as reported by Johns Hopkins and including the backlogged death data, will still be present in the new_deaths column.
    • February 16, 2021

      • Johns Hopkins has reconciled Ohio's historical deaths data with the state.

    Overview

    The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.

    The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

    This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.

    The AP is updating this dataset hourly at 45 minutes past the hour.

    To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.

    Queries

    Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic.
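
    For readers working with a local copy of the county-level extract, a hedged pandas sketch of the derived measures described above might look like the following. The file name and the date, fips and cases column names are assumptions; population and new_deaths are the column names mentioned in the notes above.

    import pandas as pd

    # Load a local copy of the county-level timeseries (file name is illustrative).
    df = pd.read_csv("johns-hopkins-county-timeseries.csv", parse_dates=["date"])

    # Cumulative cases per 100,000 residents, mirroring the rates the AP computes.
    df["cases_per_100k"] = df["cases"] / df["population"] * 100_000

    # 7-day rolling average of newly reported deaths, computed per county.
    df["new_deaths_7day"] = (
        df.sort_values("date")
          .groupby("fips")["new_deaths"]
          .transform(lambda s: s.rolling(7, min_periods=1).mean())
    )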

    Interactive

    The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.

    @(https://datawrapper.dwcdn.net/nRyaf/15/)

    Interactive Embed Code

    <iframe title="USA counties (2018) choropleth map Mapping COVID-19 cases by county" aria-describedby="" id="datawrapper-chart-nRyaf" src="https://datawrapper.dwcdn.net/nRyaf/10/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important;" height="400"></iframe><script type="text/javascript">(function() {'use strict';window.addEventListener('message', function(event) {if (typeof event.data['datawrapper-height'] !== 'undefined') {for (var chartId in event.data['datawrapper-height']) {var iframe = document.getElementById('datawrapper-chart-' + chartId) || document.querySelector("iframe[src*='" + chartId + "']");if (!iframe) {continue;}iframe.style.height = event.data['datawrapper-height'][chartId] + 'px';}}});})();</script>
    

    Caveats

    • This data represents the number of cases and deaths reported by each state and has been collected by Johns Hopkins from a number of sources cited on their website.
    • In some cases, deaths or cases of people who've crossed state lines -- either to receive treatment or because they became sick and couldn't return home while traveling -- are reported in a state they aren't currently in, because of state reporting rules.
    • In some states, there are a number of cases not assigned to a specific county -- for those cases, the county name is "unassigned to a single county"
    • This data should be credited to Johns Hopkins University's COVID-19 tracking project. The AP is simply making it available here for ease of use for reporters and members.
    • Caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.
    • Population estimates at the county level are drawn from 2014-18 5-year estimates from the American Community Survey.
    • The Urban/Rural classification scheme is from the Centers for Disease Control and Prevention's National Center for Health Statistics. It puts each county into one of six categories -- from Large Central Metro to Non-Core -- according to population and other characteristics. More details about the classifications can be found here.

    Johns Hopkins timeseries data

    • Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count.
    • Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here.

    Attribution

    This data should be credited to the Johns Hopkins University COVID-19 tracking project.

  12. Student Marks Dataset

    • kaggle.com
    Updated Jan 4, 2022
    Cite
    M Yasser H (2022). Student Marks Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/student-marks-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 4, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Description:

    The data consists of marks of students, including their study time and number of courses. The dataset was downloaded from the UCI Machine Learning Repository.

    Properties of the Dataset:
    Number of Instances: 100
    Number of Attributes: 3 including the target variable.

    The project is simple yet challenging, as it has very limited features and samples. Can you build a regression model that captures all the patterns in the dataset while maintaining the generalisability of the model?

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build regression models to predict the student marks with respect to multiple features (see the starter sketch below).
    • Also evaluate the models and compare their respective scores, such as R2 and RMSE.
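
    A minimal starter sketch for the objective above, assuming the column names number_courses, time_study and Marks (verify them against the CSV header after downloading, since these names are not guaranteed):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Column names are assumptions; verify them against the downloaded CSV.
    df = pd.read_csv("Student_Marks.csv")
    X, y = df[["number_courses", "time_study"]], df["Marks"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)

    pred = model.predict(X_test)
    print("R2:", r2_score(y_test, pred))
    print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
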
  13. Books called Get Your Hands Dirty on Clean Architecture : a Hands-On Guide...

    • workwithdata.com
    Updated Aug 7, 2024
    Cite
    Work With Data (2024). Books called Get Your Hands Dirty on Clean Architecture : a Hands-On Guide to Creating Clean Web Applications with Code Examples in Java [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Get+Your+Hands+Dirty+on+Clean+Architecture+%3A+a+Hands-On+Guide+to+Creating+Clean+Web+Applications+with+Code+Examples+in+Java
    Explore at:
    Dataset updated
    Aug 7, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is about books and is filtered where the book is Get Your Hands Dirty on Clean Architecture : a Hands-On Guide to Creating Clean Web Applications with Code Examples in Java, featuring 7 columns including author, BNB id, book, book publisher, and ISBN. The preview is ordered by publication date (descending).

  14. Department of Ecology Facility and Site Interactions

    • data-wa-geoservices.opendata.arcgis.com
    • geo.wa.gov
    • +2more
    Updated Dec 25, 2015
    + more versions
    Cite
    Washington State Department of Ecology (2015). Department of Ecology Facility and Site Interactions [Dataset]. https://data-wa-geoservices.opendata.arcgis.com/datasets/e4905453d2a8426a934c8f56fea6fd35
    Explore at:
    Dataset updated
    Dec 25, 2015
    Dataset authored and provided by
    Washington State Department of Ecology (https://ecology.wa.gov/)
    Area covered
    Description

    The Washington State Department of Ecology has defined a facility/site as an operation at a fixed location that is of interest to the agency because it has an active or potential impact upon the environment. Ecology recognizes that this definition is broad and generic, but the agency has found that such a definition is required in order to encompass all the facilities and sites in Washington that are within the purview of its programs. These programs cover a wide variety of environmental aspects and conditions including air quality, water quality, shorelands, water resources, toxics cleanup, hazardous waste, toxics reduction, and nuclear waste. The definitions of a facility and/or a site vary significantly across these programs, both in practice and law. Examples of facilities/sites include: an operation that pollutes the air or water, spill cleanup site, hazardous waste management facility, hazardous waste generator, licensed laboratory, SUPERFUND site, farm which draws water from a well, solid waste recycling center, etc.

  15. Syntegra Synthetic EHR Data | Structured Healthcare Electronic Health Record...

    • datarade.ai
    Updated Feb 23, 2022
    Cite
    Syntegra (2022). Syntegra Synthetic EHR Data | Structured Healthcare Electronic Health Record Data [Dataset]. https://datarade.ai/data-products/syntegra-synthetic-ehr-data-structured-healthcare-electroni-syntegra
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset updated
    Feb 23, 2022
    Dataset authored and provided by
    Syntegra
    Area covered
    United States of America
    Description

    Organizations can license synthetic, structured data generated by Syntegra from electronic health record systems of community hospitals across the United States, reaching beyond just claims and Rx data.

    The synthetic data provides a detailed picture of the patient's journey throughout their hospital stay, including patient demographic information and payer type, as well as rich data not found in any other sources. Examples of this data include: drugs given (timing and dosing), patient location (e.g., ICU, floor, ER), lab results (timing by day and hour), physician roles (e.g., surgeon, attending), medications given, and vital signs. The participating community hospitals with bed sizes ranging from 25 to 532 provide unique visibility and assessment of variation in care outside of large academic medical centers and healthcare networks.

    Our synthetic data engine is trained on a broadly representative dataset made up of deep clinical information of approximately 6 million unique patient records and 18 million encounters over 5 years of history. Notably, synthetic data generation allows for the creation of any number of records needed to power your project.

    EHR data is available in the following formats:

    • Cleaned, analytics-ready (a layer of clean and normalized concepts in Tuva Health's standard relational data model format)
    • FHIR USCDI (labs, medications, vitals, encounters, patients, etc.)

    The synthetic data maintains full statistical accuracy, yet does not contain any actual patients, thus removing any patient privacy liability risk. Privacy is preserved in a way that goes beyond HIPAA or GDPR compliance. Our industry-leading metrics prove that both privacy and fidelity are fully maintained.

    • Generate the data needed for product development, testing, demo, or other needs
    • Access data at a scalable price point
    • Build your desired population, both in size and demographics
    • Scale up and down to fit specific needs, increasing efficiency and affordability

    Syntegra's synthetic data engine also has the ability to augment the original data:

    • Expand population sizes, rare cohorts, or outcomes of interest
    • Address algorithmic fairness by correcting bias or introducing intentional bias
    • Conditionally generate data to inform scenario planning
    • Impute missing values to minimize gaps in the data

  16. L3DAS21 Challenge

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 10, 2021
    Cite
    Danilo Comminiello; Danilo Comminiello; Eric Guizzo; Eric Guizzo (2021). L3DAS21 Challenge [Dataset]. http://doi.org/10.5281/zenodo.4642005
    Explore at:
    zip (available download formats)
    Dataset updated
    May 10, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Danilo Comminiello; Danilo Comminiello; Eric Guizzo; Eric Guizzo
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    L3DAS21: MACHINE LEARNING FOR 3D AUDIO SIGNAL PROCESSING

    IEEE MLSP Data Challenge 2021

    SCOPE OF THE CHALLENGE

    The L3DAS21 Challenge for the IEEE MLSP 2021 aims at encouraging and fostering research on machine learning for 3D audio signal processing. In multi-speaker scenarios it is very important to properly understand the nature of a sound event and its position within the environment, what the content of the sound signal is, and how best to leverage it for a specific application (e.g., teleconferencing rather than assistive listening or entertainment, among others). To this end, the L3DAS21 Challenge presents two tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection, both relying on first-order Ambisonics recordings in a reverberant office environment.

    Each task involves 2 separate tracks: 1-mic and 2-mic recordings, respectively containing sounds acquired by one Ambisonics microphone and by an array of two Ambisonics microphones. The use of two first-order Ambisonics microphones definitely represents one of the main novelties of the L3DAS21 Challenge.

    • Task 1: 3D Speech Enhancement
      The objective of this task is the enhancement of speech signals immersed in the spatial sound field of a reverberant office environment. Here the models are expected to extract the monophonic voice signal from the 3D mixture containing various background noises. The evaluation metric for this task is a combination of the short-time objective intelligibility (STOI) and word error rate (WER); a hedged sketch of one such combination follows this list.
    • Task 2: 3D Sound Event Localization and Detection
      The aim of this task is to detect the temporal activities of a known set of sound event classes and, in particular, to further locate them in space. Here the models must predict a list of the active sound events and their respective locations at regular intervals of 100 milliseconds. Performance on this task is evaluated according to the location-sensitive detection error, which joins the localization and detection errors.
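
    The exact formula used for the official Task 1 ranking is defined by the challenge organizers; purely as an illustration, one natural way to combine the two quantities (assuming the pystoi and jiwer packages) is to average STOI with 1 - WER:

    from jiwer import wer      # word error rate between reference and predicted transcripts
    from pystoi import stoi    # short-time objective intelligibility

    def combined_se_score(clean_wav, enhanced_wav, sample_rate, ref_text, hyp_text):
        """Average of STOI and (1 - WER); an illustrative combination, not the official metric."""
        intelligibility = stoi(clean_wav, enhanced_wav, sample_rate, extended=False)
        word_errors = wer(ref_text, hyp_text)
        return (intelligibility + (1.0 - word_errors)) / 2.0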

    DATASETS

    The L3DAS21 datasets contain multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of a large office room, placing two first-order Ambisonics microphones in the center of the room and moving a speaker reproducing the analytic signal in 252 fixed spatial positions. Relying on the collected Ambisonics impulse responses (IRs), we augmented existing clean monophonic datasets to obtain synthetic tridimensional sound sources by convolving the original sounds with our IRs. We extracted speech signals from the Librispeech dataset and office-like background noises from the FSD50K dataset. We aimed at creating plausible and varied 3D scenarios to reflect possible real-life situations in which sound and disparate types of background noises coexist in the same 3D reverberant environment. We provide normalized raw waveforms as predictor data, while the target data varies according to the task.

    The dataset is divided in two main sections, respectively dedicated to the challenge tasks.

    The first section is optimized for 3D Speech Enhancement and contains more than 30000 virtual 3D audio environments with a duration of up to 10 seconds. In each sample, a spoken voice is always present alongside other office-like background noises. As target data for this section we provide the clean monophonic voice signals.

    The other section, instead, is dedicated to the 3D Sound Event Localization and Detection task and contains 900 60-second-long audio files. Each data point contains a simulated 3D office audio environment in which up to 3 simultaneous acoustic events may be active at the same time. In this section, the samples are not forced to contain a spoken voice. As target data for this section we provide a list of the onset and offset time stamps, the typology class, and the spatial coordinates of each individual sound event present in the data points.

    We split both dataset sections into a training set (44 hours for SE and 600 hours for SELD) and a test set (6 hours for SE and 5 hours for SELD), paying attention to create similar distributions. The train set of the SE section is divided in two partitions, train360 and train100, which contain speech samples extracted from the corresponding partitions of Librispeech (only the samples up to 10 seconds). All sets of the SELD section are divided in OV1, OV2 and OV3. These partitions refer to the maximum number of possible overlapping sounds, which is 1, 2 or 3, respectively.

    The evaluation test datasets can be downloaded here:

    CHALLENGE WEBSITE AND CONTACTS

    L3DAS21 Challenge Website: www.l3das.com/mlsp2021

    GitHub repository: github.com/l3das/L3DAS21

    Paper: arxiv.org/abs/2104.05499

    IEEE MLSP 2021: 2021.ieeemlsp.org/

    Email contact: l3das@uniroma1.it

    Twitter: https://twitter.com/das_l3

  17. The LakeCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1)...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Feb 5, 2025
    + more versions
    Cite
    U.S. Environmental Protection Agency, Office of Research and Development (ORD), Center for Public Health and Environmental Assessment (CPHEA), Pacific Ecological Systems Division (PESD), (2025). The LakeCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments for the Conterminous United States: Dam Density and Storage Volume [Dataset]. https://catalog.data.gov/dataset/the-lakecat-dataset-accumulated-attributes-for-nhdplusv2-version-2-1-catchments-for-the-co-bb77d
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    Contiguous United States, United States
    Description

    This dataset represents the dam density and storage volumes within individual local and accumulated upstream catchments for NHDPlusV2 Waterbodies based on the National Inventory of Dams (NID). Catchment boundaries in LakeCat are defined in one of two ways, on-network or off-network. The on-network catchment boundaries follow the catchments provided in the NHDPlusV2 and the metrics for these lakes mirror metrics from StreamCat, but will substitute the COMID of the NHDWaterbody for that of the NHDFlowline. The off-network catchment framework uses the NHDPlusV2 flow direction rasters to define non-overlapping lake-catchment boundaries and then links them through an off-network flow table. The NID database contains information about the dam's location, size, purpose, type, last inspection, regulatory facts, and other technical data. Structures on streams reduce the longitudinal and lateral hydrologic connectivity of the system. For example, impoundments above dams slow stream flow, cause deposition of sediment and reduce peak flows. Dams change both the discharge and sediment supply of streams, causing channel incision and bed coarsening downstream. Downstream areas are often sediment deprived, resulting in degradation, i.e., erosion of the stream bed and stream banks. This database was improved with locations verified by work from the USGS National Map (Jeff Simley Group). It was observed that some dams, some of them major and which do exist, were not part of the 2009 NID, but were represented in the USGS National Map dataset, and had been in the 2006 NID. Approximately 1,100 such dams were added, based on the USGS National Map lat/long and the 2006 NID attributes (dam height, storage, etc.). Finally, as clean-up, a) about 600 records with duplicate NIDID were removed, and b) about 300 records were removed which represented the same location of the same dam but with a different NIDID, for the largest dams (a visual check was done of dams with storage above 5000 acre feet that are likely duplicated, roughly the 10,000 largest dams). The (dams/catchment) and (dam_storage/catchment) were summarized and accumulated into watersheds to produce local catchment-level and watershed-level metrics as a point data type.
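
    As a rough illustration of the local (non-accumulated) summary step described above, the sketch below groups a table of NID dam points that has already been joined to NHDPlusV2 catchments. The file name and the COMID, NID_ID and NID_STORAGE column names are assumptions, and the upstream accumulation via the flow table is not shown.

    import pandas as pd

    # Point table of NID dams, each tagged with the catchment (COMID) it falls in.
    dams = pd.read_csv("nid_dams_with_comid.csv")

    # Local catchment-level metrics: dam count and total storage per catchment.
    local = dams.groupby("COMID").agg(
        dam_count=("NID_ID", "size"),
        dam_storage=("NID_STORAGE", "sum"),
    )
    print(local.head())

    # Accumulating these local values upstream additionally requires the NHDPlusV2
    # flow table (or LakeCat's off-network flow table), which is not shown here.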

  18. Autoscraping | Google Places Review Data | 10M+ Reviews with Ratings &...

    • datarade.ai
    Updated Aug 15, 2024
    Cite
    AutoScraping (2024). Autoscraping | Google Places Review Data | 10M+ Reviews with Ratings & Comments | Global Coverage [Dataset]. https://datarade.ai/data-products/autoscraping-s-google-places-review-data-consumer-review-da-autoscraping
    Explore at:
    .json, .xml, .csv, .sqlAvailable download formats
    Dataset updated
    Aug 15, 2024
    Dataset authored and provided by
    AutoScraping
    Area covered
    Palestine, Vanuatu, Saint Pierre and Miquelon, Cyprus, Pitcairn, New Zealand, Montserrat, Haiti, Saint Helena, New Caledonia
    Description

    What Makes Our Data Unique?

    Autoscraping’s Google Places Review Data is a premium resource for organizations seeking in-depth consumer insights from a trusted global platform. What sets our data apart is its sheer volume and quality—spanning over 10 million reviews from Google Places worldwide. Each review includes critical attributes such as ratings, comment titles, comment bodies, and detailed sentiment analysis. This data is meticulously curated to capture the authentic voice of consumers, offering a rich source of information for understanding customer satisfaction, brand perception, and market trends.

    Our dataset is unique not only because of its scale but also due to the richness of its metadata. We provide granular details about each review, including the review source, place ID, and post date, allowing for precise temporal and spatial analysis. This level of detail enables users to track changes in consumer sentiment over time, correlate reviews with specific locations, and conduct deep dives into customer feedback across various industries.
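
    As a small illustration of that kind of temporal analysis, the sketch below buckets reviews per place and calendar month and reports the mean rating and review count. It is only a sketch: the field names (place_id, rating, post_date, comment_title, comment_body) are assumptions for the example, so consult the provider's schema for the actual attribute names.

        # Hypothetical review records; the field names are assumptions for illustration.
        import pandas as pd

        reviews = pd.DataFrame([
            {"place_id": "p1", "rating": 5, "post_date": "2024-06-02",
             "comment_title": "Great", "comment_body": "Fast service."},
            {"place_id": "p1", "rating": 2, "post_date": "2024-07-15",
             "comment_title": "Slow", "comment_body": "Long wait at lunch."},
            {"place_id": "p2", "rating": 4, "post_date": "2024-07-20",
             "comment_title": "Good", "comment_body": "Friendly staff."},
        ])

        # Parse post dates and bucket reviews per place and calendar month.
        reviews["post_date"] = pd.to_datetime(reviews["post_date"])
        monthly = (
            reviews
            .groupby(["place_id", pd.Grouper(key="post_date", freq="MS")])["rating"]
            .agg(["mean", "count"])
        )
        monthly = monthly[monthly["count"] > 0]   # keep only months with reviews
        print(monthly)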

    Moreover, the dataset is continuously updated to ensure it reflects the most current opinions and trends, making it an invaluable tool for real-time market analysis and competitive intelligence.

    How is the Data Generally Sourced?

    The data is sourced directly from Google Places, one of the most widely used platforms for business reviews and location-based feedback globally. Our robust web scraping infrastructure is specifically designed to extract every relevant piece of information from Google Places efficiently and accurately. We employ advanced scraping techniques that allow us to capture a wide array of review data across multiple industries and geographic locations.

    The scraping process is conducted at regular intervals to ensure that our dataset remains up-to-date with the latest consumer feedback. Each entry undergoes rigorous data validation and cleaning processes to remove duplicates, correct inconsistencies, and enhance data accuracy. This ensures that users receive high-quality, reliable data that can be trusted for critical decision-making.

    Primary Use-Cases and Verticals

    This Google Places Review Data is a versatile resource with a wide range of applications across various verticals:

    Consumer Insights and Market Research: Companies can leverage this data to gain a deeper understanding of consumer opinions and preferences. By analyzing ratings, comments, and sentiment across different locations and industries, businesses can identify emerging trends, discover potential areas for improvement, and better align their products or services with customer needs.

    Brand Reputation Management: Organizations can use this data to monitor their brand reputation across multiple locations. The dataset enables users to track customer sentiment over time, identify patterns in feedback, and respond proactively to negative reviews. This helps businesses maintain a positive brand image and enhance customer loyalty.

    Competitive Analysis: By analyzing reviews and ratings of competitors, companies can gain valuable insights into their strengths and weaknesses. This data can inform strategic decisions, such as product development, marketing campaigns, and customer engagement strategies.

    Location-Based Marketing: Marketers can utilize this data to tailor their campaigns based on regional customer preferences and sentiments. The geolocation aspect of the data allows for precise targeting, ensuring that marketing efforts resonate with local audiences.

    Product and Service Improvement: Businesses can use the detailed feedback from Google Places reviews to identify specific areas where their products or services may be falling short. This information can be used to drive improvements and innovations, ultimately enhancing customer satisfaction and business performance.

    Real-Time Sentiment Analysis: The continuous update of our dataset makes it ideal for real-time sentiment analysis. Companies can track how customer sentiment evolves in response to new products, services, or market events, allowing them to react quickly and adapt to changing market conditions.

    How Does This Data Product Fit into Our Broader Data Offering?

    Autoscraping’s Google Places Review Data is a vital component of our comprehensive data offering, which spans various industries and geographies. This dataset complements our broader portfolio of consumer feedback data, which includes reviews from other major platforms, social media sentiment data, and customer satisfaction surveys.

    By integrating this Google Places data with other datasets in our portfolio, users can develop a more holistic view of consumer behavior and market dynamics. For example, combining review data with sales data or demographic information can provide deeper insights into how different factors influence customer satisfaction and purchasing decisions.

    Our commitment to delivering high-...

  19. Great Barrier Reef Genomics Database: Seawater Illumina Reads

    • researchdata.edu.au
    • geonetwork.apps.aims.gov.au
    Updated 2024
    Cite
    Australian Institute of Marine Science (AIMS); Yeoh, YK; Yeoh, YK; Yeoh, YK (2024). Great Barrier Reef Genomics Database: Seawater Illumina Reads [Dataset]. https://researchdata.edu.au/great-barrier-reef-illumina-reads/2131233
    Explore at:
    Dataset updated
    2024
    Dataset provided by
    Australian Institute Of Marine Science (http://www.aims.gov.au/)
    Authors
    Australian Institute of Marine Science (AIMS); Yeoh, YK; Yeoh, YK; Yeoh, YK
    License

    Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Description

    This dataset comprises microbial metagenomic sequencing reads from seawater collected at 48 reef sites across the Great Barrier Reef. Samples were collected over four Long Term Monitoring Program (LTMP) field trips between November 2019 and July 2020, combining water chemistry data, LTMP field surveys and microbial metagenomic data. This data collection was a major part of the QRCIF IMOS GBR microbial genomic database project, which aims to generate a comprehensive open-access repository of microbial genomic data from across the region. Seawater was collected in quadruplicate at each reef site, either by SCUBA or using Niskin bottles; 5 L of seawater was pre-filtered through a 5 µm filter and applied to a 0.22 µm Sterivex filter, then snap frozen and stored at -20°C in preparation for DNA extraction. DNA was extracted from the Sterivex filters using phenol:chloroform:isoamyl alcohol extraction and ethanol precipitation, followed by cleanup with the Zymo Clean and Concentrator® kit, before submission for sequencing at the Australian Centre for Ecogenomics sequencing facility (Illumina). The data are presented as Illumina paired-end shotgun metagenomic sequencing runs, in FASTQ format, generated by Microba Life Sciences, Brisbane, QLD, Australia. Each downloadable archive contains forward and reverse reads for all replicate sampling performed at that particular site. Water quality particulate and dissolved nutrient data were generated as previously described (https://doi.org/10.25845/5c09b551f315b) from water samples collected simultaneously at each reef site.

    Zip files are available through the spatial layer under each site's 'illumina.seawater.zip'; please note that these are large downloads (between 6 and 14 GB).
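
    As a minimal sketch of working with one of these archives after download and extraction, the Python snippet below pairs forward (_R1) and reverse (_R2) FASTQ files by sample and counts reads. The _R1/_R2 naming convention and the extraction folder name are assumptions for illustration; check the actual file names inside each site's archive.

        # Pair forward/reverse FASTQ files from an extracted per-site archive
        # and count reads. File and folder names below are assumptions.
        import gzip
        from pathlib import Path

        extract_dir = Path("illumina.seawater")   # hypothetical extraction folder

        def count_reads(fastq_gz: Path) -> int:
            """A FASTQ record is four lines, so reads = line count // 4."""
            with gzip.open(fastq_gz, "rt") as fh:
                return sum(1 for _ in fh) // 4

        # Group files by sample name, stripping the _R1/_R2 read-direction suffix.
        pairs: dict[str, list[Path]] = {}
        for fq in sorted(extract_dir.glob("*_R[12]*.fastq.gz")):
            sample = fq.name.split("_R1")[0].split("_R2")[0]
            pairs.setdefault(sample, []).append(fq)

        for sample, files in pairs.items():
            print(sample, [(f.name, count_reads(f)) for f in files])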

  20. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    bin
    Updated Jul 12, 2024
    + more versions
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
    Explore at:
    binAvailable download formats
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Solenix Engineering GmbH
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are at a 1 Hz sampling frequency.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable for learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches.
    • Fine control over ground truth. Because this is a simulated system with deliberate anomaly injection, the start and end times of the anomalous behaviour are known very precisely. In contrast to real-world datasets, there is no risk that the ground truth contains mislabelled segments, which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to detect by eye (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable for regular benchmark studies as well.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal, ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users of the dataset can add any type of noise, at any amplitude they choose, on top of the provided series (a short sketch after this list shows one way to do this). This makes the dataset well suited to testing how sensitive and robust detection algorithms are to various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
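
    The sketch below illustrates the split implied by the points above: the first 1 million rows serve as the nominal training segment and the remaining 4 million rows as the evaluation segment, with optional Gaussian noise injected for a robustness study. This is a minimal sketch under stated assumptions: the file name and format ("cats.parquet") and the use of pandas are placeholders, so adapt them to the files actually downloaded from Zenodo.

        # Minimal sketch, assuming the data have been converted to a single
        # Parquet file with one column per variable; the actual download
        # format and column names may differ.
        import numpy as np
        import pandas as pd

        df = pd.read_parquet("cats.parquet")      # 17 variables, 5M rows at 1 Hz

        train = df.iloc[:1_000_000]               # nominal-only segment for training
        test = df.iloc[1_000_000:]                # nominal + 200 anomalous segments

        # The signals ship noise-free; inject Gaussian noise of a chosen amplitude
        # to study how detection performance degrades with noise level.
        rng = np.random.default_rng(0)
        sigma = 0.05
        num_cols = test.select_dtypes("number").columns
        noisy_test = test.copy()
        noisy_test[num_cols] = test[num_cols] + rng.normal(
            0.0, sigma, size=(len(test), len(num_cols))
        )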

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

