This dataset tracks the updates made on the dataset "MeSH 2023 Update - Delete Report" as a repository for previous versions of the data and metadata.
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories that use .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest-point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing these limitations (a short R sketch follows the file descriptions below):
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

The data release contains the following files:
- "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons.
  An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" – This cross-walk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" – This cross-walk file identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal–vertical ID.
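The release's own documented workflow is the R script example_script_for_using_parquet.R; independent of that script, the minimal R sketch below shows one way to open an unzipped year_byscene directory with the arrow package and apply the cloud-coverage and water-pixel filters described above. The local directory path and the 20% cloud threshold are illustrative assumptions, not values from the release.

```r
# Minimal sketch (assumes year_byscene=2023.zip has been unzipped into the working directory)
library(arrow)
library(dplyr)

# open_dataset() reads the nested tile_hv=XXX-XXX/part-0.parquet files as one dataset
byscene <- open_dataset("year_byscene=2023", format = "parquet")

filtered <- byscene |>
  mutate(percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)) |>
  filter(
    percent_cloud_pixels <= 0.2,  # illustrative cloud-coverage threshold
    wb_dswe1_pixels >= 10,        # drop waterbodies with few water pixels
    dp_dswe == 1                  # keep records where the deepest point is water
  ) |>
  collect()
```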
https://creativecommons.org/publicdomain/zero/1.0/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file contains the R and Python code generated for this study.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ordinary least squares (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard "detect-and-forget" approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.
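To make the contrast concrete, here is a minimal R sketch of the standard detect-and-forget workflow that the abstract argues can invalidate inference. It uses only base R (not the outference package), and the simulated data and Cook's-distance cutoff are illustrative choices, not the article's.

```r
# Detect-and-forget: the naive workflow the article critiques
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

fit_full <- lm(y ~ x, data = dat)

# Step 1: flag "outliers" from the fitted model (illustrative Cook's-distance rule)
keep <- cooks.distance(fit_full) <= 4 / nrow(dat)

# Step 2: refit OLS on the remaining rows and report inference as if nothing was removed
fit_trim <- lm(y ~ x, data = dat[keep, ])
confint(fit_trim)    # naive intervals that ignore the selection step
summary(fit_trim)    # naive p-values
```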
This data originates from the Crossref API. It contains metadata on the articles in the Data Citation Corpus for which the dataset in the citation pair is identified by a DOI.
How to recreate this dataset in Jupyter Notebook:
1) Prepare the list of articles to query

```python
import pandas as pd

CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"
citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

# Drop non-DOI https links and figshare entries, then keep dataset DOIs only
citation_pairs = citation_pairs[ ~((citation_pairs['dataset'].str.contains("https")) & (~citation_pairs['dataset'].str.contains("doi.org"))) ]
citation_pairs = citation_pairs[ ~citation_pairs['dataset'].str.contains("figshare") ]
citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
citation_pairs_doi = citation_pairs[citation_pairs['is_doi'] == True].copy()

# Deduplicate the citing-article DOIs and restore "/" that were encoded as "_"
articles = list(set(citation_pairs_doi['publication'].to_list()))
articles = [doi.replace("_", "/") for doi in articles]

# Write one DOI per line
with open("articles.txt", "w") as f:
    for article in articles:
        f.write(f"{article}\n")
```
2) Query articles from CrossRef API
```python
%%writefile enrich.py
#!pip install -q aiolimiter
import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
from aiolimiter import AsyncLimiter

# ---------- config ----------
HEADERS = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"}  # Put your email here
MAX_RPS = 45          # polite pool limit (50), leave head-room
BATCH_SIZE = 10_000   # rows per INSERT
DB_PATH = pathlib.Path("crossref.sqlite").resolve()
ARTICLES = pathlib.Path("articles.txt")
# -----------------------------

# ---- platform tweak: prefer selector loop on Windows ----
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

# ---- read the DOI list ----
with ARTICLES.open(encoding="utf-8") as f:
    DOIS = [line.strip() for line in f if line.strip()]

# ---- make sure DB & table exist BEFORE the async part ----
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
with sqlite3.connect(DB_PATH) as db:
    db.execute("""
        CREATE TABLE IF NOT EXISTS works (
            doi  TEXT PRIMARY KEY,
            json TEXT
        )
    """)
    db.execute("PRAGMA journal_mode=WAL;")  # better concurrency

# ---------- async section ----------
limiter = AsyncLimiter(MAX_RPS, 1)   # 45 req / second
sem = asyncio.Semaphore(100)         # cap overall concurrency

async def fetch_one(session, doi: str):
    url = f"https://api.crossref.org/works/{doi}"
    async with limiter, sem:
        try:
            async with session.get(url, headers=HEADERS, timeout=10) as r:
                if r.status == 404:      # common "not found"
                    return doi, None
                r.raise_for_status()     # propagate other 4xx/5xx
                return doi, await r.json()
        except Exception:
            return doi, None             # log later, don't crash

async def main():
    start = time.perf_counter()
    db = sqlite3.connect(DB_PATH)                # KEEP ONE connection
    db.execute("PRAGMA synchronous = NORMAL;")   # speed tweak
    async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
        for chunk_start in range(0, len(DOIS), BATCH_SIZE):
            slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
            tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
            results = await asyncio.gather(*tasks)  # all tuples, no exceptions raised
            good_rows, bad_dois = [], []
            for doi, payload in results:
                if payload is None:
                    bad_dois.append(doi)
                else:
                    good_rows.append((doi, orjson.dumps(payload).decode()))
            if good_rows:
                db.executemany(
                    "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
                    good_rows,
                )
                db.commit()
            if bad_dois:  # append for later retry
                with open("failures.log", "a", encoding="utf-8") as fh:
                    fh.writelines(f"{d}\n" for d in bad_dois)
            done = chunk_start + len(slice_)
            rate = done / (time.perf_counter() - start)
            print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")
    db.close()

if __name__ == "__main__":
    asyncio.run(main())
```
Then run:

```python
!python enrich.py
```
3) Finally, extract the necessary fields

```python
import sqlite3
import orjson
i...
```
https://creativecommons.org/publicdomain/zero/1.0/
Brief Description: The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiment about the specialty foods that the company offers. This information has been collected through customer reviews on the company website; the dataset consists of about 5,000 reviews. The CMO wants answers to the following questions:
1. What are the most frequently used words in the customer reviews?
2. How can the data be prepared for text analysis?
3. What are the overall sentiments towards the products?
Steps (a consolidated R sketch follows this list):
- Set the working directory and read the data.
- Data cleaning. Check for missing values and data types of variables
- Load the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer")
- TEXT ACQUISITION and AGGREGATION. Create corpus.
- TEXT PRE-PROCESSING. Clean the text:
- replace special characters with " " (we use the tm_map function for this purpose)
- convert all text to lower case
- remove punctuation
- remove whitespace
- remove stopwords
- remove numbers
- stem the document
- create the term-document matrix
- convert into a matrix and compute word frequencies
- convert into a data frame
- TEXT EXPLORATION. Find the words that appear most and least frequently
(figure: Top 5 frequent words)
- Create Wordcloud
(figure: WordCloud)
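The steps above can be consolidated into the following R sketch. It assumes the reviews sit in a file named reviews.csv with the review text in a column named Review; both names are placeholders for the actual file and column.

```r
library(tm)
library(SnowballC)
library(wordcloud2)

reviews <- read.csv("reviews.csv", stringsAsFactors = FALSE)   # placeholder file name
corpus  <- VCorpus(VectorSource(reviews$Review))               # placeholder column name

# Replace special characters with a space, then clean the text
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, to_space, "[/\\|@]")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)

# Term-document matrix and word frequencies
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freq_df <- data.frame(word = names(freq), freq = freq)

head(freq_df, 5)      # most frequent words
tail(freq_df, 5)      # least frequent words
wordcloud2(freq_df)   # word cloud
```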