7 datasets found
  1. MeSH 2023 Update - Delete Report - 4at4-q6rg - Archive Repository

    • healthdata.gov
    csv, xlsx, xml
    Updated Jul 16, 2025
    Cite
    (2025). MeSH 2023 Update - Delete Report - 4at4-q6rg - Archive Repository [Dataset]. https://healthdata.gov/dataset/MeSH-2023-Update-Delete-Report-4at4-q6rg-Archive-R/bjnp-cusd
    Explore at:
    Available download formats: xml, xlsx, csv
    Dataset updated
    Jul 16, 2025
    Description

    This dataset tracks the updates made to the dataset "MeSH 2023 Update - Delete Report" and serves as a repository for previous versions of the data and metadata.

  2. Data from: Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 13, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://catalog.data.gov/dataset/water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Explore at:
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States, Contiguous United States
    Description

    This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories of .parquet files that store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

    Limitations of this dataset:
    - All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
    - Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest-point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
    - Temperature data were not extracted from satellite images with more than 90% cloud cover.
    - Temperature data represent skin temperature at the water surface and may differ from temperature observations taken below the water surface.

    Potential methods for addressing these limitations (see the Python sketch at the end of this description):
    - Identifying and removing unrealistic temperature estimates:
      - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
      - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
      - Keep waterbodies where the deepest point is identified as water (dp_dswe = 1).
    - Handling waterbodies split between multiple tiles:
      - These waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

    Contents:
    - "year_byscene=XXXX.zip" includes temperature summary statistics for individual waterbodies and for the deepest point (the furthest point from land within a waterbody) of each waterbody, by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"); this happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with very little land, or other possible reasons. An example file path for this dataset: year_byscene=2023/tile_hv=002-001/part-0.parquet
    - "year=XXXX.zip" includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land and no output data were generated. An example file path for this dataset: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
    - "example_script_for_using_parquet.R" includes code to download zip files directly from ScienceBase, identify the HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
    - "nhd_HUC04s_ingrid.csv" is a crosswalk file that identifies the HUC04 watersheds within each Landsat ARD tile grid.
    - "site_id_tile_hv_crosswalk.csv" is a crosswalk file that identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. It also includes a column (multiple_tiles) identifying site_ids that fall within multiple Landsat ARD tile grids.
    - "lst_grid.png" is a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
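    The dataset's own example script is in R (example_script_for_using_parquet.R, using the arrow package). As a rough Python counterpart, the sketch below opens an extracted year_byscene archive with pyarrow's hive partitioning and applies the filters suggested above; the column names come from the description, but whether every table carries all of them is an assumption, as is the 50% cloud threshold.

    ```python
    # Hedged sketch: open nested .parquet files from an extracted "year_byscene=2023.zip"
    # and apply the quality filters suggested in the data release description.
    import pyarrow.dataset as ds

    scenes = ds.dataset("year_byscene=2023", format="parquet", partitioning="hive")
    df = scenes.to_table().to_pandas()

    # Drop dummy rows written when no tabular data could be extracted from a tile
    df = df[df["tile_hv"] != "000-000"]

    # Fraction of cloud pixels over each waterbody, per the description's formula
    df["percent_cloud_pixels"] = df["wb_dswe9_pixels"] / (
        df["wb_dswe9_pixels"] + df["wb_dswe1_pixels"]
    )
    df = df[df["percent_cloud_pixels"] < 0.5]          # example threshold (assumption)

    # Require enough water pixels and a deepest point classified as water
    df = df[(df["wb_dswe1_pixels"] >= 10) & (df["dp_dswe"] == 1)]
    ```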

  3. KC_House Dataset -Linear Regression of Home Prices

    • kaggle.com
    zip
    Updated May 15, 2023
    Cite
    vikram amin (2023). KC_House Dataset -Linear Regression of Home Prices [Dataset]. https://www.kaggle.com/datasets/vikramamin/kc-house-dataset-home-prices
    Explore at:
    Available download formats: zip (776807 bytes)
    Dataset updated
    May 15, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/ (CC0 1.0)

    Description
    1. Dataset: House pricing dataset containing 21 columns and 21613 rows.
    2. Programming language: R
    3. Objective: To predict house prices by building a model
    4. Steps:
       A) Import the dataset.
       B) Install and load the required libraries.
       C) Data cleaning: remove null values, change data types, and drop columns that are not important.
       D) Data analysis:
          (i) A linear regression model was used to establish the relationship between the dependent variable (price) and the independent variables.
          (ii) Outliers were identified and removed.
          (iii) The regression model was run again after removing the outliers.
          (iv) Multiple R-squared was calculated, indicating that the independent variables explain 73% of the variation in the dependent variable.
          (v) The p-value was less than alpha = 0.05, showing the model is statistically significant.
          (vi) The meaning of the coefficients was interpreted.
          (vii) The multicollinearity assumption was checked.
          (viii) The VIF (variance inflation factor) was calculated for all independent variables; each absolute value was below 5, so there is no threat of multicollinearity and the specified independent variables can be retained. (A rough Python sketch of this workflow follows.)
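    The analysis described above was done in R. As a rough illustration only, here is a minimal Python sketch of the same workflow (OLS fit, R-squared, p-value, VIF check, outlier removal and refit); the file name and predictor columns are assumptions, not taken from the dataset card.

    ```python
    # Hedged sketch of the described workflow; the original analysis used R.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_csv("kc_house_data.csv")                           # assumed file name
    predictors = ["sqft_living", "bedrooms", "bathrooms", "grade"]  # assumed columns

    X = sm.add_constant(df[predictors])
    y = df["price"]

    model = sm.OLS(y, X).fit()
    print(model.rsquared)        # multiple R-squared
    print(model.f_pvalue)        # overall model p-value, compared against alpha = 0.05

    # Variance inflation factors; values below 5 suggest multicollinearity is not a threat
    for i, col in enumerate(X.columns):
        if col != "const":
            print(col, variance_inflation_factor(X.values, i))

    # One possible outlier rule: drop points with large standardized residuals, then refit
    resid_z = (model.resid - model.resid.mean()) / model.resid.std()
    clean = df[resid_z.abs() < 3]
    model_clean = sm.OLS(clean["price"], sm.add_constant(clean[predictors])).fit()
    ```
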
  4. This file contains the R and Python codes generated for this study.

    • plos.figshare.com
    zip
    Updated Jul 27, 2023
    Cite
    Annie Kim; Inkyung Jung (2023). This file contains the R and Python codes generated for this study. [Dataset]. http://doi.org/10.1371/journal.pone.0288540.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 27, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Annie Kim; Inkyung Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file contains the R and Python codes generated for this study.

  5. Data from: Valid Inference Corrected for Outlier Removal

    • tandf.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v4
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Shuxiao Chen; Jacob Bien
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ordinary least squares (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.
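    As a point of contrast with the article's corrected procedures (implemented in the companion R package outference), the following hedged Python sketch shows the naive "detect-and-forget" strategy the article critiques: outliers are flagged from residuals, removed, and confidence intervals are then reported from the refit as if no selection had occurred. The data and the residual cutoff are purely illustrative.

    ```python
    # Naive "detect-and-forget" inference (the approach the article shows can be invalid).
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(100, 2)))
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)

    fit = sm.OLS(y, X).fit()
    keep = np.abs(fit.resid / fit.resid.std()) < 2.5     # illustrative outlier rule
    refit = sm.OLS(y[keep], X[keep]).fit()

    # These intervals ignore the data-driven removal step, which is exactly the
    # issue the selective-inference correction addresses.
    print(refit.conf_int())
    ```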

  6. Articles metadata from CrossRef

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Kea Kohv (2025). Articles metadata from CrossRef [Dataset]. https://www.kaggle.com/datasets/keakohv/articles-doi-metadata
    Explore at:
    Available download formats: zip (72982417 bytes)
    Dataset updated
    Aug 1, 2025
    Authors
    Kea Kohv
    Description

    This data originates from the Crossref API. It contains metadata for the articles in the Data Citation Corpus for which the cited item in the citation pair is a DOI.

    How to recreate this dataset in a Jupyter notebook:

    1) Prepare the list of articles to query:

    ```python
    import pandas as pd

    # See: https://www.kaggle.com/datasets/keakohv/data-citation-coprus-v4-1-eupmc-and-datacite
    CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"

    # Load the citation pairs from the Parquet file
    citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

    # Remove rows where "https" appears in the 'dataset' column but "doi.org" does not
    citation_pairs = citation_pairs[
        ~((citation_pairs['dataset'].str.contains("https"))
          & (~citation_pairs['dataset'].str.contains("doi.org")))
    ]

    # Remove rows where "figshare" appears in the dataset name
    citation_pairs = citation_pairs[~citation_pairs['dataset'].str.contains("figshare")]

    citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
    citation_pairs_doi = citation_pairs[citation_pairs['is_doi']].copy()

    articles = list(set(citation_pairs_doi['publication'].to_list()))
    articles = [doi.replace("_", "/") for doi in articles]

    # Save the list of articles to a file, one DOI per line
    with open("articles.txt", "w") as f:
        for article in articles:
            f.write(f"{article}\n")
    ```

    2) Query articles from CrossRef API

    
    %%writefile enrich.py
    #!pip install -q aiolimiter
    import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
    from aiolimiter import AsyncLimiter
    
    # ---------- config ----------
    HEADERS  = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"} # Put your email here
    MAX_RPS  = 45           # polite pool limit (50), leave head-room
    BATCH_SIZE = 10_000         # rows per INSERT
    DB_PATH  = pathlib.Path("crossref.sqlite").resolve()
    ARTICLES  = pathlib.Path("articles.txt")
    # -----------------------------
    
    # ---- platform tweak: prefer selector loop on Windows ----
    if sys.platform == "win32":
      asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    
    # ---- read the DOI list ----
    with ARTICLES.open(encoding="utf-8") as f:
      DOIS = [line.strip() for line in f if line.strip()]
    
    # ---- make sure DB & table exist BEFORE the async part ----
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(DB_PATH) as db:
      db.execute("""
        CREATE TABLE IF NOT EXISTS works (
          doi  TEXT PRIMARY KEY,
          json TEXT
        )
      """)
      db.execute("PRAGMA journal_mode=WAL;")   # better concurrency
    
    # ---------- async section ----------
    limiter = AsyncLimiter(MAX_RPS, 1)       # 45 req / second
    sem   = asyncio.Semaphore(100)        # cap overall concurrency
    
    async def fetch_one(session, doi: str):
      url = f"https://api.crossref.org/works/{doi}"
      async with limiter, sem:
        try:
          async with session.get(url, headers=HEADERS, timeout=10) as r:
            if r.status == 404:         # common “not found”
              return doi, None
            r.raise_for_status()        # propagate other 4xx/5xx
            return doi, await r.json()
        except Exception as e:
          return doi, None            # log later, don’t crash
    
    async def main():
      start = time.perf_counter()
      db  = sqlite3.connect(DB_PATH)        # KEEP ONE connection
      db.execute("PRAGMA synchronous = NORMAL;")   # speed tweak
    
      async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
        for chunk_start in range(0, len(DOIS), BATCH_SIZE):
          slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
          tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
          results = await asyncio.gather(*tasks)    # all tuples, no exc
    
          good_rows, bad_dois = [], []
          for doi, payload in results:
            if payload is None:
              bad_dois.append(doi)
            else:
              good_rows.append((doi, orjson.dumps(payload).decode()))
    
          if good_rows:
            db.executemany(
              "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
              good_rows,
            )
            db.commit()
    
          if bad_dois:                # append for later retry
            with open("failures.log", "a", encoding="utf-8") as fh:
              fh.writelines(f"{d}
    " for d in bad_dois)
    
          done = chunk_start + len(slice_)
          rate = done / (time.perf_counter() - start)
          print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")
    
      db.close()
    
    if __name__ == "__main__":
      asyncio.run(main())
    

    Then run the script, either from a notebook cell with `!python enrich.py` or from a shell with `python enrich.py`.

    3) Finally extract the necessary fields

    import sqlite3
    import orjson
    i...
    
  7. Food Reviews - Text Mining & Sentiment Analysis

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    vikram amin (2023). Food Reviews - Text Mining & Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vikramamin/food-reviews-text-mining-and-sentiment-analysis
    Explore at:
    Available download formats: zip (1075643 bytes)
    Dataset updated
    Aug 4, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/ (CC0 1.0)

    Description

    Brief Description: The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiment about the specialty foods that the company offers. This information has been collected through customer reviews on the company website. The dataset consists of about 5000 reviews. The CMO wants answers to the following questions:
    1. What are the most frequently used words in the customer reviews?
    2. How can the data be prepared for text analysis?
    3. What is the overall sentiment towards the products?

    • We will use text mining and sentiment analysis (in R) to offer insights to the CMO with regard to the food reviews.

    Steps:
    - Set the working directory and read the data.
    - Data cleaning: check for missing values and the data types of the variables.
    - Load the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer").
    - Text acquisition and aggregation: create a corpus.
    - Text pre-processing (cleaning the text with tm_map): replace special characters with " ", convert all letters to lower case, remove punctuation, remove whitespace, remove stopwords, remove numbers, and stem the document.
    - Create a term-document matrix, convert it into a matrix, and find the frequency of words; then convert the result into a data frame.
    - Text exploration: find the words that appear most and least frequently.
    - Create a word cloud.

    • TEXT MODELLING
    • Word association: find words that tend to appear together. Here we find the associations for the three most frequently occurring words, "like", "tast", and "flavor", with a correlation limit of 0.2.
    • "like" has an association with "realli" (they appear together about 25% of the time), "dont" (24%), and "one" (21%).
    • "tast" has no association with any word at the set correlation limit.
    • "flavor" has an association with the word "chip" (they appear together about 27% of the time).
    • Sentiment analysis: element_id refers to the review number, sentence_id to the sentence number within that review, and word_count to the number of words in that sentence. Sentiment is either positive or negative.
    • The overall sentiment score across all the reviews indicates that the food review corpus is marginally positive.
    • Sentiment scores were also computed for each of the 5000 reviews; -1 indicates the most extreme negative sentiment and +1 the most extreme positive sentiment.
    • A separate data frame was created for all the negative sentiments: in total there are 726 negative sentiments out of the 5000 reviews (approximately 15%). A rough Python sketch of this workflow is given below.
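
    The workflow above uses R (tm, SnowballC, sentimentr, wordcloud2). As a hedged Python sketch of the same steps, the code below computes term frequencies after basic cleaning and scores sentiment per review; the file name, the "review" column, and the use of NLTK's VADER lexicon in place of sentimentr are assumptions.

    ```python
    # Hedged Python counterpart to the R text-mining and sentiment workflow above.
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.sentiment import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

    reviews = pd.read_csv("food_reviews.csv")["review"].dropna()   # assumed file/column names

    # Pre-processing and term-document counts (lowercasing, stopword removal)
    vec = CountVectorizer(lowercase=True, stop_words="english")
    tdm = vec.fit_transform(reviews)
    freq = pd.Series(tdm.sum(axis=0).A1, index=vec.get_feature_names_out())
    print(freq.sort_values(ascending=False).head(5))   # most frequent words

    # Sentiment per review on roughly a -1 (negative) to +1 (positive) scale
    sia = SentimentIntensityAnalyzer()
    scores = reviews.apply(lambda t: sia.polarity_scores(str(t))["compound"])
    print(scores.mean())        # overall sentiment of the corpus
    print((scores < 0).sum())   # count of negative reviews
    ```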
