This dataset was created by Zahra Zolghadr
Safely upload client/brand first-party CRM/loyalty data. TrueData will append relevant digital identifiers (HEM, MAID, UID 2.0, CTV IDs) and distribute directly or via LiveRamp to any destination. Activate across Desktop, Mobile App/Web, CTV, DOOH, and Audio. Ingest raw data to build derivative internal products.
With Versium REACH's Firmographic Append tool in the Business to Business Direct product suite, you can append valuable firmographic data to your customer and prospect contact lists. With only a few attributes on hand, you can tap into Versium's industry-leading identity resolution engine and proprietary database to append rich firmographic data. To append data, you only need one of the following (a quick validation sketch follows the list):
- Email
- Business Domain
- Business Name, Address, City, State
- Business Name, Phone
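As a rough illustration (this is not Versium's API, and the column names are hypothetical), a contact list could be checked for at least one of the accepted key combinations above before it is submitted to the append tool:
import pandas as pd

# Hypothetical contact list; the column names are illustrative only.
contacts = pd.DataFrame([
    {"email": "jane@acme.com", "business_domain": None, "business_name": None,
     "address": None, "city": None, "state": None, "phone": None},
    {"email": None, "business_domain": None, "business_name": "Acme Corp",
     "address": "1 Main St", "city": "Seattle", "state": "WA", "phone": None},
])

def has_match_key(row) -> bool:
    # At least one of the accepted key combinations must be present.
    return (
        pd.notna(row["email"])
        or pd.notna(row["business_domain"])
        or (pd.notna(row["business_name"]) and pd.notna(row["address"])
            and pd.notna(row["city"]) and pd.notna(row["state"]))
        or (pd.notna(row["business_name"]) and pd.notna(row["phone"]))
    )

ready = contacts[contacts.apply(has_match_key, axis=1)]
print(f"{len(ready)} of {len(contacts)} rows have a usable match key")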
With Versium REACH Demographic Append you have access to many attributes for enriching your data, grouped into four output types: Basic; Household, Financial and Auto; Lifestyle and Interests; and Political and Donor. The attributes available for each output type are listed below:
Basic:
- Senior in Household
- Young Adult in Household
- Small Office or Home Office
- Online Purchasing Indicator
- Language
- Marital Status
- Working Woman in Household
- Single Parent
- Online Education
- Occupation
- Gender
- DOB (MM/YY)
- Age Range
- Religion
- Ethnic Group
- Presence of Children
- Education Level
- Number of Children
Household, Financial and Auto:
- Household Income
- Dwelling Type
- Credit Card Holder Bank
- Upscale Card Holder
- Estimated Net Worth
- Length of Residence
- Credit Rating
- Home Own or Rent
- Home Value
- Home Year Built
- Number of Credit Lines
- Auto Year
- Auto Make
- Auto Model
- Home Purchase Date
- Refinance Date
- Refinance Amount
- Loan to Value
- Refinance Loan Type
- Home Purchase Price
- Mortgage Purchase Amount
- Mortgage Purchase Loan Type
- Mortgage Purchase Date
- 2nd Most Recent Mortgage Amount
- 2nd Most Recent Mortgage Loan Type
- 2nd Most Recent Mortgage Date
- 2nd Most Recent Mortgage Interest Rate Type
- Refinance Rate Type
- Mortgage Purchase Interest Rate Type
- Home Pool
Lifestyle and Interests:
- Mail Order Buyer
- Pets
- Magazines
- Reading
- Current Affairs and Politics
- Dieting and Weight Loss
- Travel
- Music
- Consumer Electronics
- Arts
- Antiques
- Home Improvement
- Gardening
- Cooking
- Exercise
- Sports
- Outdoors
- Womens Apparel
- Mens Apparel
- Investing
- Health and Beauty
- Decorating and Furnishing
Political and Donor:
- Donor Environmental
- Donor Animal Welfare
- Donor Arts and Culture
- Donor Childrens Causes
- Donor Environmental or Wildlife
- Donor Health
- Donor International Aid
- Donor Political
- Donor Conservative Politics
- Donor Liberal Politics
- Donor Religious
- Donor Veterans
- Donor Unspecified
- Donor Community
- Party Affiliation
This table feeds multiple apps for Tippecanoe County. The columns are the bare-minimum details needed for generating a GRM. When new GRM data points are collected, they should be appended here.
https://creativecommons.org/publicdomain/zero/1.0/
The motivation behind creating this dataset was to work on an IoT health-monitoring device project.
The columns include heart rate, SysBP, DiaBP, height, weight, BMI, etc.; these parameters are necessary for predicting heart condition.
The height/weight tables with heart rate are taken from this website:
https://www.mymathtables.com/chart/health-wellness/height-weight-table-for-all-ages.html
The following code has been used to generate the data according to research from different resources on the web:
import numpy as np
import pandas as pd

age = np.random.randint(1, 70, 500000)
sex = np.random.randint(0, 2, 500000)
SysBP = np.random.randint(105, 147, 500000)
DiaBP = np.random.randint(73, 120, 500000)
HR = np.random.randint(78, 200, 500000)
weightKg = np.random.randint(2, 120, 500000)
heightCm = np.random.randint(48, 185, 500000)
BMI = weightKg / heightCm / heightCm * 10000

cols = ['age', 'sex', 'SysBP', 'DiaBP', 'HR', 'weightKg', 'heightCm', 'BMI', 'indication']
data = []
for age, sex, SysBP, DiaBP, HR, weightKg, heightCm, BMI in zip(age, sex, SysBP, DiaBP, HR, weightKg, heightCm, BMI):
    if BMI > 40 or BMI < 10:
        continue
    elif age < 20:
        continue
    elif weightKg < 45:
        continue
    elif (1 <= age <= 10) & (17 < BMI < 31) & (104 < SysBP < 121) & (73 < DiaBP < 81) & (99 < HR <= 200) & (3 < weightKg <= 36) & (48 < heightCm <= 139):
        data.append(dict(zip(cols, [age, sex, SysBP, DiaBP, HR, weightKg, heightCm, np.round(BMI), 0])))
    elif (10 < age <= 20) & (17 < BMI < 31) & (104 < SysBP < 121) & (73 < DiaBP <= 81) & (99 < HR <= 200) & (36 < weightKg < 60) & (139 < heightCm < 170):
        data.append(dict(zip(cols, [age, sex, SysBP, DiaBP, HR, weightKg, heightCm, np.round(BMI), 0])))
    elif (20 < age <= 30) & (17 < BMI < 31) & (108 < SysBP <= 134) & (75 <= DiaBP <= 84) & (94 < HR <= 190) & (28 < weightKg < 80) & (137 <= heightCm <= 180):
        data.append(dict(zip(cols, [age, sex, SysBP, DiaBP, HR, weightKg, heightCm, np.round(BMI), 0])))
    elif (30 < age <= 40) & (17 < BMI < 31) & (110 < SysBP <= 135) & (81 <= DiaBP <= 86) & (93 <= HR <= 180) & (50 < weightKg < 90) & (137 <= heightCm <= 213):
        data.append(dict(zip(cols, [age, sex, SysBP, DiaBP, HR, weightKg, heightCm, np.round(BMI), 0])))
    elif (40 < age <= 50) & (17 < BMI < 31) & (112 < SysBP <= 140) & (79 <= DiaBP <= 89) & (90 <= HR <= 170) & (50 < weightKg < 90) & (137 <= heightCm <= 213):
        data.append(dict(zip(cols, [age, sex, SysBP, DiaBP, HR, weightKg, heightCm, np.round(BMI), 0])))
    elif (50 < age <= 90) & (17 < BMI < 31) & (116 < SysBP <= 147) & (81 <= DiaBP <= 91) & (85 <= HR <= 160) & (50 < weightKg < 90) & (137 <= heightCm <= 213):
        data.append(dict(zip(cols, [age, sex, SysBP, DiaBP, HR, weightKg, heightCm, np.round(BMI), 0])))
    elif (20 <= age < 90) & (17 < BMI < 31):
        data.append(dict(zip(cols, [age, sex, SysBP, DiaBP, HR, weightKg, heightCm, np.round(BMI), 0])))
    else:
        data.append(dict(zip(cols, [age, sex, SysBP, DiaBP, HR, weightKg, heightCm, np.round(BMI), 1])))

df1 = pd.DataFrame(data)
df1.to_csv("Health_heart_experimental.csv")
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the main data of the paper "Optimal Rejection-Free Path Sampling," and the source code for generating/appending the independent RFPS-AIMMD and AIMMD runs.
Due to size constraints, the data has been split into separate repositories. The following repositories contain the trajectory files generated by the runs:
all the WQ runs: 10.5281/zenodo.14830317
chignolin, fps0: 10.5281/zenodo.14826023
chignolin, fps1: 10.5281/zenodo.14830200
chignolin, fps2: 10.5281/zenodo.14830224
chignolin, tps0: 10.5281/zenodo.14830251
chignolin, tps1: 10.5281/zenodo.14830270
chignolin, tps2: 10.5281/zenodo.14830280
The trajectory files are not required for running the main analysis, as all necessary information for machine learning and path reweighting is contained in the "PathEnsemble" object files stored in this repository. However, these trajectories are essential for projecting the path ensemble estimate onto an arbitrary set of collective variables.
To reconstruct the full dataset, please merge all the data folders you find in the supplemental repositories.
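A minimal sketch of that merge step, assuming the repositories have been downloaded and extracted side by side (the folder names below are placeholders, not the actual repository names):
import shutil
from pathlib import Path

# Placeholder locations of the main repository and the extracted supplemental repositories.
main_repo = Path("rfps_aimmd_main")
supplemental_repos = [Path("rfps_aimmd_wq"), Path("rfps_aimmd_chignolin_fps0")]

for repo in supplemental_repos:
    # Copy each supplemental "data" tree on top of the main one, keeping files already present.
    shutil.copytree(repo / "data", main_repo / "data", dirs_exist_ok=True)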
Data structure and content
analysis (code for analyzing the data and generating the figures of the paper)
|- figures.ipynb (Jupyter notebook for the analysis)
|- figures (the figures created by the Jupyter notebook)
   |- ...
data (all the AIMMD and reference runs, plus general info about the simulated systems)
|- chignolin
   |- *.py (code for generating/appending AIMMD runs on a Workstation or HPC cluster via Slurm; see the "src" folder below)
   |- run.gro (full system positions in the native conformation)
   |- mol.pdb (only the peptide positions in the native conformation)
   |- topol.top (the system's topology for the GROMACS MD engine)
   |- charmmm22star.ff (force field parameter files)
   |- run.mdp (GROMACS MD parameters when appending a simulation)
   |- randomvelocities.mdp (GROMACS MD parameters when initializing a simulation with random velocities)
   |- signature.npy, r0.npy (parameters for defining the fraction of native contacts involved in the folded/unfolded states definition; used by the params.py function "states_function")
   |- dmax.npy, dmin.npy (parameters for defining the feature representation of the AIMMD NN model; used by the params.py function "descriptors_function")
   |- equilibrium (reference long equilibrium trajectory files; only the peptide positions are saved!)
      |- run0.xtc, ..., run3.xtc
   |- validation
      |- validation.xtc (the validation SPs all together in an XTC file)
      |- validation.npy (for each SP, collects the cumulative shooting results after 10 two-way shooting simulations)
   |- fps0 (the first AIMMD-RFPS independent run)
      |- equilibriumA (the free simulations around A, already processed in PathEnsemble files)
         |- traj000001.h5
         |- traj000001.tpr (for running the simulation; in that case, please retrieve all the trajectory files in the right supplemental repository first)
         |- traj000001.cpt (for appending the simulation; in that case, please retrieve all the trajectory files in the right supplemental repository first)
         |- traj000002.h5 (in case of re-initialization)
         |- ...
      |- equilibriumB (the free simulations around B, ...)
         |- ...
      |- shots0
         |- chain.h5 (the path sampling chain)
         |- pool.h5 (the selection pool, containing the frames from which shooting points are currently selected)
      |- params.py (file containing the states and descriptors definitions, the NN fit function, and the AIMMD run hyperparameters; it can be modified to allow for RFPS-AIMMD or original-algorithm AIMMD runs)
      |- initial.trr (the initial transition for path sampling)
      |- manager.log (reports info about the run)
      |- network.h5 (NN weights of the model at different path sampling steps)
   |- fps1, fps2 (the other RFPS-AIMMD runs)
   |- tps0 (the first AIMMD-TPS, or "standard" AIMMD, run)
      |- ...
      |- shots0
         |- ...
         |- chain_weights.npy (weights of the trials in TPS; only the trials with non-zero weight have been accepted)
   |- tps1, tps2 (the other AIMMD runs, with TPS for the shooting simulations)
|- wq (Wolfe-Quapp 2D system)
   |- *.py (code for generating/appending AIMMD runs on a Workstation or HPC cluster via Slurm)
   |- run.gro (dummy gro file produced for compatibility reasons)
   |- integrator.py (custom MD engine)
   |- equilibrium (reference long simulation)
      |- transition000001.xtc (extracted from the reference long simulation)
      |- transition000002.xtc
      |- ...
      |- transitions.h5 (PathEnsemble file with all the transitions)
   |- reference
      |- grid_X.npy, grid_Y.npy (X, Y grid for 2D plots)
      |- grid_V.npy (PES projected on the grid)
      |- grid_committor_relaxation.npy (true committor on the grid solved with the relaxation method on the backward Kolmogorov equation; the code for doing this is in utils.py)
      |- grid_boltzmann_distribution.npy (Boltzmann distribution on the grid)
      |- pe.h5 (equilibrium distribution processed as a PathEnsemble file)
      |- tpe.h5 (TPE distribution processed as a PathEnsemble file)
      |- ...
   |- uniform_tps (reference TPS run with uniform SP selection)
      |- chain.h5 (PathEnsemble file containing all the accepted paths with their correct weight)
   |- fps0, ..., fps9 (the independent AIMMD-RFPS runs)
      |- ...
   |- tps0, ..., tps9 (the independent AIMMD-TPS, or "standard" AIMMD, runs)
src (code for generating/appending AIMMD runs on a Workstation or HPC cluster via Slurm)
|- generate.py (on a Workstation: initializes the processes; on an HPC cluster: creates the sh file for submitting a job)
|- slurm_options.py (to customize and use in case of running on HPC)
|- manager.py (controls SP selection; reweights the paths)
|- shooter.py (performs path sampling simulations)
|- equilibrium.py (performs free simulations)
|- pathensemble.py (code of the PathEnsemble class)
|- utils.py (auxiliary functions for data production and analysis)
Running/appending AIMMD runs
Create a "run directory" folder (same depth as "fps0")
Copy "initial.trr" and "params.py" from another AIMMD run folder. It is possible to change "params.py" to customize the run.
(On a Workstation) call:
python generate.py <nsteps> <n> <nA> <nB>
where nsteps is the final number of path sampling steps for the run, n the number of independent path sampling chains, nA the number of independent free simulators around A, and nB that of free simulators around B.
(On an HPC cluster) call:
python generate.py <nsteps> <n> <nA> <nB> -s slurm_options.py
sbatch ._job.sh
Merge the supplemental repository with the trajectory files into this one.
Just call again (on a Workstation)
python generate.py
or (on an HPC cluster)
sbatch ._job.sh
after updating the "nsteps" parameters.
Reproducing the analysis
Run the analysis/figures.ipynb notebook. Some groups of cells have to be run multiple times after changing the parameters in the preamble.
https://creativecommons.org/publicdomain/zero/1.0/
The sample_id is created sequentially from the 1st year (hence, it is different from the sample_id in the Kaggle Dataset). Note that while the original data follows a naming convention with 'train_...', this dataset simply uses integer IDs.
The data is divided into 12 chunks, each containing data from February of year 8 to January of year 9.
Source code:
from pathlib import Path
import click
import pandas as pd
import polars as pl
@click.command()
@click.argument("subsample-rate")
@click.argument("offset")
def main(subsample_rate, offset):
NUM_YEARS = 8
MONTH_DAY = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
NUM_SAMPLES = (8 * sum(MONTH_DAY) * 72) // subsample_rate
assert (
0 <= offset < subsample_rate
), f"assertion failed: 0 <= offset < subsample_rate, got {offset} and {subsample_rate}."
idx = 0
file_id = 0
data = []
try:
for year in range(NUM_YEARS):
for month in range(1, 13):
for day in range(MONTH_DAY[month % 12]):
for term in range(72):
if file_id == NUM_SAMPLES:
raise Exception
if idx % subsample_rate == offset:
data.append(
dict(
sample_id=file_id,
year=year + 1,
real_year=year + (month // 12) + 1,
month=month % 12 + 1,
day=day + 1,
min_of_day=term * 1200,
)
)
file_id += 1
idx += 1
except Exception:
print("error")
pass
output_path = Path(
"/ml-docker/working/kaggle-leap-private/data/hugging_face_download"
)
if not output_path.exists():
output_path.mkdir()
df = pl.from_pandas(pd.DataFrame(data))
print(df.filter(pl.col("year").eq(8)))
df.write_parquet(output_path / f"subsample_s{subsample_rate}_o{offset}.pqt")
if __name__ == "__main__":
main()
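As a quick usage sketch (the subsample rate and offset values are illustrative, and the hard-coded output path in the script above is environment specific), the command can be exercised from Python with click's test runner:
from click.testing import CliRunner

# Illustrative values: keep 1 sample out of every 10, starting at offset 0.
runner = CliRunner()
result = runner.invoke(main, ["10", "0"])
print(result.exit_code, result.output)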
Low-Resolution Real Geography
11.5° x 11.5° horizontal resolution (384 grid columns)
100 million total samples (744 GB)
1.9 MB per input file, 1.1 MB per output file
Therefore, it is appropriate to evaluate the model trained on the Kaggle Dataset with this dataset.
Extension modules: https://thomasnyberg.com/cpp_extension_modules.html
https://pypi.org/project/ct-python/
# coding: utf-8
import numpy as np
from scipy.sparse import lil_matrix, csc_matrix
from scipy.sparse.linalg import lsqr
from PIL import Image
import math

def gen_col_index(NX, NY, angle, step):
    L = int(max(NX, NY) * 1.42)
    nx = math.cos(angle * math.pi / 180)
    ny = math.sin(angle * math.pi / 180)
    x0 = NX / 2 + nx * step
    y0 = NY / 2 + ny * step
    c = -nx * x0 - ny * y0
    cols = []
    if abs(nx) > abs(ny):
        for y in range(NY):
            x1 = int(-(ny * y + c) / nx + 0.5)
            for x in range(x1 - 5, x1 + 6):
                if x < 0 or x >= NX:
                    continue
                dist = abs(nx * x + ny * y + c)
                if dist < 5:
                    wt = math.exp(-dist**2 / 3.0)
                    k = x + y * NX
                    cols.append([k, wt])
    else:
        for x in range(NX):
            y1 = int(-(nx * x + c) / ny + 0.5)
            for y in range(y1 - 5, y1 + 6):
                if y < 0 or y >= NY:
                    continue
                dist = abs(nx * x + ny * y + c)
                if dist < 5:
                    wt = math.exp(-dist**2 / 3.0)
                    k = x + y * NX
                    cols.append([k, wt])
    return cols

def gen_matrix(NX, NY, data):
    N = data.shape[0]
    M = NX * NY
    W = lil_matrix((N, M))
    F = np.zeros((N,))
    for j, x in enumerate(data):
        angle = x[0]
        step = x[1]
        val = x[2]
        cols = gen_col_index(NX, NY, angle, step)
        for k, wt in cols:
            W[j, k] = wt
        F[j] = val
    return W, F

###
data = np.load('scanned_data.npy')
NX = 128
NY = 128
W, F = gen_matrix(NX, NY, data)
W = csc_matrix(W)
res = lsqr(W, F, show=True, iter_lim=100)
X = res[0]
image = np.zeros((NX, NY))
for x in range(NX):
    for y in range(NY):
        val = X[x + y * NX]
        image[x, y] = val
pil_img = Image.fromarray(image.astype(np.uint8))
pil_img.show()
pil_img.save('reconstructed_image.png')
# coding: utf-8
import numpy as np
from PIL import Image
import math

def calc_signal(img, angle, step):
    NX = img.shape[0]
    NY = img.shape[1]
    x_c = NX / 2
    y_c = NY / 2
    nx = math.cos(angle * math.pi / 180)
    ny = math.sin(angle * math.pi / 180)
    x0 = x_c + nx * step
    y0 = y_c + ny * step
    c = -nx * x0 - ny * y0
    sig = 0
    for x in range(NX):
        for y in range(NY):
            dist = abs(nx * x + ny * y + c)
            if dist < 5:
                wt = math.exp(-dist**2 / 3.0)
                sig = sig + wt * img[x, y]
    return sig

def scan_image(img):
    data = []
    L = int(np.max(img.shape) / 2 * math.sqrt(2) + 1)
    for angle in np.arange(0, 180, 5):  # step angle of 5 degrees
        print("angle=", angle)
        for step in range(-L, L, 1):
            sig = calc_signal(img, angle, step)
            if sig > 0:
                data.append([angle, step, sig])
    return np.array(data)

#
img = np.array(Image.open('lena.png'))
data = scan_image(img)
np.save('scanned_data.npy', data)
pip install ct-python
pip install git+https://github.com/configtree/ct-python-sdk
import ct_python
python setup.py install --user
import ct_python
configuration.api_key_prefix['Authorization'] = 'Bearer'
# create an instance of the API class
api_instance = ct_python.CTApi(ct_python.ApiClient(configuration))
body = ct_python.Version()  # Version |
id = 'id_example'  # str |
organization_slug = 'organization_slug_example'  # str |
try:
    api_response = api_instance.partial_update_version(body, id, organization_slug)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling CTApi->partial_update_version: %s " % e)

# Configure API key authorization: Bearer
configuration = ct_python.Configuration()
configuration.api_key['Authorization'] = 'YOUR_API_KEY'
configuration.api_key_prefix['Authorization'] = 'Bearer'
# create an instance of the API class
api_instance = ct_python.CTApi(ct_python.ApiClient(configuration))
body = ct_python.TokenRefresh()  # TokenRefresh |
try:
    api_response = api_instance.refresh_token(body)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling CTApi->refresh_token: %s " % e)

# Configure API key authorization: Bearer
configuration = ct_python.Configuration()
configuration.api_key['Authorization'] = 'YOUR_API_KEY'
configuration.api_key_prefix['Authorization'] = 'Bearer'
# create an instance of the API class
api_instance = ct_python.CTApi(ct_python.ApiClient(configuration))
body = ct_python.Application()  # Application |
id = 'id_example'  # str |
organization_slug = 'organization_slug_example'  # str |
try:
    api_response = api_instance.update_application(body, id, organization_slug)
    pprint(api_response)
except ApiException as e:
    print("Exception when calling CTApi->update_application: %s " % e)

# Configure API key authorization: Bearer
configuration = ct_python.Configuration()
configuration.api_key['Authorization'] = 'YOUR_API_KEY'
configuration.api_key_prefix['Authorization'] = 'Bearer'
# create an instance of the API class
api_instance = ct_python.CTApi(ct_python.ApiClient(configuration))
body = ct_python.Configuration()  # Configuration |
id = 'id_example'  # str |
organization_slug = 'organization_slug_example'  # str |
try:
    api_response = api_instance.update_configurat...
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
API for www.legislation.gov.uk - launched by The National Archives on 29/07/2010 - giving access to the statute book at various levels, for various times, as reusable HTML fragments, XML and RDF. The API is RESTful and uses content negotiation, so full access to the data can be achieved using HTTP requests. Alternatively, just append data.xml or data.rdf to any legislation page on the website to return the underlying data. The full API is also available from http://legislation.data.gov.uk.
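For example, the underlying data for a single item can be fetched either by appending data.xml to the page URL or via content negotiation. A short sketch follows; the item URL is just an example, and the exact Accept type honoured by the service is an assumption here:
import requests

# Example legislation page; any item on www.legislation.gov.uk works the same way.
page = "https://www.legislation.gov.uk/ukpga/2010/15"

# Option 1: append data.xml (or data.rdf) to the page URL.
xml_text = requests.get(page + "/data.xml", timeout=30).text

# Option 2: content negotiation on the same resource (assumed Accept header).
xml_negotiated = requests.get(page, headers={"Accept": "application/xml"}, timeout=30).text

print(xml_text[:200])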
https://creativecommons.org/publicdomain/zero/1.0/
https://www.kaggle.com/code/tmyok1984/mayo-tile-generation-using-pyvips
import gc

import numpy as np
import pyvips


def make_tiles(image_path, tile_size=1024, max_tiles=64, avg_thr=230, std_thr=15, clip_edge=0.05):
    image = pyvips.Image.new_from_file(image_path, access='sequential')

    # cropping
    offset_x = int(image.width * clip_edge)
    offset_y = int(image.height * clip_edge)
    w = int(image.width * (1 - clip_edge * 2))
    h = int(image.height * (1 - clip_edge * 2))
    image = image.crop(offset_x, offset_y, w, h)

    # padding
    pad_w = (tile_size - image.width % tile_size) % tile_size
    pad_h = (tile_size - image.height % tile_size) % tile_size
    image = image.embed(
        pad_w // 2, pad_h // 2,
        image.width + pad_w, image.height + pad_h,
        extend="mirror")

    # Get the scanning positions of the image
    x_pos_list = []
    y_pos_list = []
    for y in range(0, image.height, tile_size):
        for x in range(0, image.width, tile_size):
            x_pos_list.append(x)
            y_pos_list.append(y)

    # Get the cropping positions of the image
    selected_x_pos_list = []
    selected_y_pos_list = []
    avg_list = []
    for x, y in zip(x_pos_list, y_pos_list):
        tile = image.crop(x, y, tile_size, tile_size)
        avg = tile.avg()
        std = tile.deviate()
        if avg < avg_thr and std > std_thr:
            selected_x_pos_list.append(x)
            selected_y_pos_list.append(y)
            avg_list.append(avg)

    # Sort by ascending order of average brightness
    sorted_idx = np.argsort(np.array(avg_list))
    selected_x_pos_array = np.array(selected_x_pos_list)[sorted_idx][:max_tiles]
    selected_y_pos_array = np.array(selected_y_pos_list)[sorted_idx][:max_tiles]

    # crop
    images = []
    for x, y in zip(selected_x_pos_array, selected_y_pos_array):
        tile = image.crop(x, y, tile_size, tile_size)
        img = tile.numpy()
        images.append(img)
    if len(images) > 0:
        images = np.stack(images)

    del image
    gc.collect()

    return images
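A possible invocation, assuming a pyvips-readable whole-slide image (the file name and settings below are placeholders):
# Illustrative usage only; substitute a real image path.
tiles = make_tiles("example_slide.tiff", tile_size=1024, max_tiles=64)
print(len(tiles), "tiles selected")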
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The data was collected on 2024-04-05 and contains 3492 problems.
It was cleaned via the following script:
import json
import csv
from io import TextIOWrapper
def clean(data: dict):
    questions = data['data']['problemsetQuestionList']['questions']
    for q in questions:
        yield {
            'id': q['frontendQuestionId'],
            'difficulty': q['difficulty'],
            'title': q['title'],
            'titleCn': q['titleCn'],
            'titleSlug': q['titleSlug'],
            'paidOnly': q['paidOnly'],
            'acRate': round(q['acRate'], 3),
            'topicTags': [t['name'] for t in q['topicTags']],
        }

def out_jsonl(f: TextIOWrapper):
    for id in range(0, 35):
        with open(f'data/{id}.json', encoding='u8') as f2:
            data = json.load(f2)
            for q in clean(data):
                f.write(json.dumps(q, ensure_ascii=False))
                f.write('\n')

def out_json(f: TextIOWrapper):
    l = []
    for id in range(0, 35):
        with open(f'data/{id}.json', encoding='u8') as f2:
            data = json.load(f2)
            for q in clean(data):
                l.append(q)
    json.dump(l, f, ensure_ascii=False)

def out_csv(f: TextIOWrapper):
    writer = csv.DictWriter(f, fieldnames=[
        'id', 'difficulty', 'title', 'titleCn', 'titleSlug', 'paidOnly', 'acRate', 'topicTags'
    ])
    writer.writeheader()
    for id in range(0, 35):
        with open(f'data/{id}.json', encoding='u8') as f2:
            data = json.load(f2)
            writer.writerows(clean(data))

with open('data.jsonl', 'w', encoding='u8') as f:
    out_jsonl(f)
with open('data.json', 'w', encoding='u8') as f:
    out_json(f)
with open('data.csv', 'w', encoding='u8', newline='') as f:
    out_csv(f)
According to our latest research, the global Data Enrichment Platform market size reached USD 2.47 billion in 2024, reflecting robust adoption across multiple industries. The market is projected to grow at a CAGR of 14.2% from 2025 to 2033, with the total market value expected to reach USD 7.72 billion by 2033. This remarkable growth is fueled by the increasing demand for high-quality, actionable data to drive decision-making, enhance customer engagement, and support digital transformation initiatives across sectors.
A primary growth driver for the Data Enrichment Platform market is the exponential rise in data volumes generated by businesses worldwide. Organizations are increasingly recognizing the importance of transforming raw data into valuable insights, which is only possible through advanced data enrichment solutions. These platforms enable companies to append, cleanse, and validate their datasets, ensuring high data accuracy and relevancy. The proliferation of digital channels, IoT devices, and cloud-based applications has further intensified the need for real-time data enrichment, as enterprises strive to personalize customer experiences and optimize operational efficiency. Additionally, the rapid adoption of artificial intelligence and machine learning technologies within data enrichment platforms has significantly improved the speed and accuracy of data processing, making these solutions indispensable for modern enterprises.
Another significant factor propelling market growth is the rising focus on regulatory compliance and risk mitigation. With stringent data privacy regulations such as GDPR, CCPA, and others coming into effect, organizations must ensure that their data repositories are accurate, up-to-date, and compliant. Data enrichment platforms help businesses identify outdated or incorrect information, reduce compliance risks, and maintain robust audit trails. This capability is especially crucial for sectors such as BFSI, healthcare, and government, where data integrity and compliance are paramount. The integration of enrichment solutions with existing CRM, ERP, and marketing automation systems has further expanded their applications, making it easier for organizations to maintain clean and compliant datasets across all functions.
The evolving landscape of customer engagement and marketing strategies is also fueling demand for data enrichment platforms. Businesses are increasingly leveraging enriched data to gain a 360-degree view of their customers, segment audiences more effectively, and deliver hyper-personalized content. Enhanced data quality empowers sales and marketing teams to target prospects with precision, improve lead scoring, and drive higher conversion rates. Moreover, in highly competitive sectors like retail and e-commerce, enriched data supports dynamic pricing, inventory management, and customer retention initiatives. As digital transformation accelerates across industries, the ability to derive actionable insights from enriched data is becoming a key differentiator for businesses seeking to gain a competitive edge.
From a regional perspective, North America continues to dominate the Data Enrichment Platform market, accounting for the largest share in 2024. This leadership is attributed to the presence of major technology providers, high adoption rates of advanced analytics solutions, and a mature digital infrastructure. Europe follows closely, driven by stringent data privacy regulations and a strong focus on data-driven decision-making. The Asia Pacific region is emerging as a high-growth market, supported by rapid digitalization, expanding e-commerce sectors, and increasing investments in cloud and AI technologies. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as organizations in these regions ramp up their digital transformation efforts.
Merchant Data Enrichment is becoming a pivotal aspect of the data enrichment landscape, especially as businesses seek to enhance their understanding of transaction data. By leveraging merchant data enrichment, organizations can gain deeper insights into consumer spending patterns and merchant behaviors, which are critical for tailoring marketing strategies and improving customer engagement. This process involves appending
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Code and data to reproduce all results and graphs reported in Tannenbaum et al. (2022). This folder contains data files (.dta files) and a Stata do-file (code.do) that stitches together the different data files, executes all analyses, and produces all figures reported in the paper. The do-file uses a number of user-written packages, listed below; most of these can be installed using the ssc install command in Stata. Also, users will need to change the current directory path (at the start of the do-file) before executing the code.
List of user-written packages (descriptions):
- revrs (reverse-codes variables)
- ereplace (extends the egen command to permit replacing)
- grstyle (changes the settings for the overall look of graphs)
- spmap (used for graphing spatial data)
- qqvalue (used for obtaining Benjamini-Hochberg corrected p-values)
- parmby (creates a dataset by calling an estimation command for each by-group)
- domin (used to perform dominance analyses)
- coefplot (used for creating coefficient plots)
- grc1leg (combines graphs with a single common legend)
- xframeappend (appends data frames to the end of the current data frame)
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is Sergey's Home Credit public notebook code repo.
The calculation was obtained using the snippet below:
import numpy as np
import polars as pl
from tqdm import tqdm

# train_disk_usage is an existing pandas DataFrame with one row per file and a "path" column.
shapes, nan_total_count = [], []
for fp in tqdm(train_disk_usage.path):
    df = pl.read_csv(fp)
    shapes.append(df.shape)
    nan_total_count.append(df.null_count().to_pandas().sum().sum())
    del df

train_disk_usage[['height', 'width']] = shapes
train_disk_usage['null_count'] = nan_total_count
train_disk_usage['isna_%'] = train_disk_usage.null_count / np.prod(shapes, 1)
train_disk_usage.to_csv('data/train_disk_usage.csv', index=False)
Vivli is an independent, non-profit organization that has developed a global data-sharing and analytics platform to serve all elements of the international research community. Our mission is to promote, coordinate, and facilitate scientific sharing and reuse of clinical research data through the creation and implementation of a sustainable global data-sharing enterprise. The Vivli platform includes an independent data repository, in-depth search engine and a cloud-based, secure analytics platform.
from datasets import Dataset, DatasetDict
from collections import defaultdict
import re
import random
import json

data = defaultdict(list)

paths = {
    "test_online": "./test_online.jsonl",
    "train_online": "./train_online.jsonl",
    "test_rejection": "./test_rejection.jsonl",
    "train_rejection": "./train_rejection.jsonl",
}

for name, path in paths.items():
    with open(path, "r") as f:
        for line in f:
            data[name].append(json.loads(line))
def split_data(text): …
See the full description on the dataset page: https://huggingface.co/datasets/kh4dien/hh-rlhf-helpful-only.
https://search.gesis.org/research_data/datasearch-httpsdataverse-unc-eduoai--hdl1902-2911631
Part 1 of the course will offer an introduction to SPSS and teach how to work with data saved in SPSS format. Part 2 will demonstrate how to work with SPSS syntax, how to create your own SPSS data files, and how to convert data in other formats to SPSS. Part 3 will teach how to append and merge SPSS files, demonstrate basic analytical procedures, and show how to work with SPSS graphics.