100+ datasets found

Additional file 1 of SEDE-GPS: socio-economic data enrichment based on GPS...
springernature.figshare.com
txt
Updated May 31, 2023
Cite
Theodor Sperlea; Stefan Füser; Jens Boenigk; Dominik Heider (2023). Additional file 1 of SEDE-GPS: socio-economic data enrichment based on GPS information [Dataset]. http://doi.org/10.6084/m9.figshare.7405250.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.7405250.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Figshare (http://figshare.com/)
Authors
Theodor Sperlea; Stefan Füser; Jens Boenigk; Dominik Heider
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This table contains names, positions, and references for the samples contained in the sequence dataset and whether Prokaryotes and/or Eukaryotes were analyzed from the sample in this study. (CSV 3 kb)
Matlab example for Local Enrichment Analysis (LEA) analysis with real data
data.niaid.nih.gov
search.dataone.org
+1more
zip
Updated Aug 29, 2022
Cite
Berend Snijder; Yannik Severin (2022). Matlab example for Local Enrichment Analysis (LEA) analysis with real data [Dataset]. http://doi.org/10.5061/dryad.2jm63xssk
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2jm63xssk
Dataset updated
Aug 29, 2022
Dataset provided by
ETH Zurich
Authors
Berend Snijder; Yannik Severin
License
https://spdx.org/licenses/CC0-1.0.html
Description
Phenotypic plasticity is essential to the immune system, yet the factors that shape it are not fully understood. Here, we comprehensively analyze immune cell phenotypes including morphology across human cohorts by single-round multiplexed immunofluorescence, automated microscopy, and deep learning. Using the uncertainty of convolutional neural networks to cluster the phenotypes of 8 distinct immune cell subsets, we find that the resulting maps are influenced by donor age, gender, and blood pressure, revealing distinct polarization and activation-associated phenotypes across immune cell classes. We further associate T-cell morphology to transcriptional state based on their joint donor variability, and validate an inflammation-associated polarized T-cell morphology, and an age-associated loss of mitochondria in CD4+ T-cells. Taken together, we show that immune cell phenotypes reflect both molecular and personal health information, opening new perspectives into the deep immune phenotyping of individual people in health and disease.

Methods

This dataset accompanies the manuscript "Multiplexed high-throughput immune cell imaging reveals molecular health-associated phenotypes" by Yannik Severin et al., Science Advances, 2022. It includes:

- knnlea.m: Matlab function for the presented Local Enrichment Analysis method
- LEA_Example_Data.mat containing data from the manuscript to reproduce a LEA analysis
- LEA_Example_Script.mat that runs through the analysis steps
- README.txt
Consumer Marketing Data, B2C Consumer Address Enrichment, USA, CCPA...
datarade.ai
.json, .csv
Updated Mar 11, 2023
Cite
Versium (2023). Consumer Marketing Data, B2C Consumer Address Enrichment, USA, CCPA Compliant [Dataset]. https://datarade.ai/data-products/versium-reach-b2c-consumer-address-enrichment-usa-gdpr-an-versium
Explore at:
Available download formats: .json, .csv
Dataset updated
Mar 11, 2023
Dataset authored and provided by
Versium
Area covered
United States
Description
With Versium REACH's Contact Append or Contact Append Plus, you can add consumer contact data, including multiple phone numbers or mobile-only numbers, to your list of customers or prospects. Versium REACH connects you to our proprietary database of more than 300 million consumers, 1 billion emails, and more than 150 million households in the United States. Through either our API or our platform, you can have contact data appended to your records from any of the following supplied values:

Email Address
Phone
Postal Address, City, State, ZIP
First Name, Last Name, City, State
First Name, Last Name, ZIP
Additional file 20: of MGSEA – a multivariate Gene set enrichment analysis...
springernature.figshare.com
zip
Updated Jun 2, 2023
Cite
Khong-Loon Tiong; Chen-Hsiang Yeang (2023). Additional file 20: of MGSEA – a multivariate Gene set enrichment analysis [Dataset]. http://doi.org/10.6084/m9.figshare.7861256.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.7861256.v1
Dataset updated
Jun 2, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Khong-Loon Tiong; Chen-Hsiang Yeang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary File S1. The R source codes of the MGSEA program, a toy example dataset, and a brief explanation for running the program. (ZIP 1832 kb)
Small Molecule-Protein Interaction Data
kaggle.com
zip
Updated Apr 19, 2024
Cite
Indranil Bhattacharyya (2024). Small Molecule-Protein Interaction Data [Dataset]. https://www.kaggle.com/datasets/photon98/leash-bio-engineered-data-training
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Apr 19, 2024
Authors
Indranil Bhattacharyya
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
About the Dataset and How I augmented the data:

The dataset used in this augmentation process (a subset of the original training data) is sourced from the Leash Bio - Predict New Medicines with BELKA competition. It comprises examples of small molecules categorized through binary classification, determining whether each molecule is a binder to one of three protein targets. The data were collected using DNA-encoded chemical library (DEL) technology.

Chemical representations are expressed in SMILES (Simplified Molecular-Input Line-Entry System), while the labels denote binary binding classifications, corresponding to three distinct protein targets.

I've expanded the original dataset by augmenting it with additional features derived from the existing data. Specifically, I've calculated and included three new features:

mol_wt (Molecular Weight): Calculated based on the SMILES data using RDKit, providing insight into the mass of each molecule.

logP (Partition Coefficient): Also derived from the SMILES data using RDKit, representing the logarithm of the partition coefficient, a measure of a molecule's hydrophobicity and its ability to partition between a hydrophobic solvent and water.

rotamers (Number of Rotamers): Determined from the SMILES data using RDKit, indicating the number of distinct conformations or rotational isomers a molecule can adopt. These additional features aim to enrich the feature matrix, potentially enhancing the predictive power of models trained on the augmented dataset.

Data Description:

id - A unique example_id we use to identify the molecule-binding target pair.
buildingblock1_smiles - The structure, in SMILES, of the first building block.
buildingblock2_smiles - The structure, in SMILES, of the second building block.
buildingblock3_smiles - The structure, in SMILES, of the third building block.
molecule_smiles - The structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note we use a [Dy] as the stand-in for the DNA linker.
protein_name - The protein target name.
binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.
mol_wt - The molecule's molecular weight derived from SMILES data using RDKit.
logP - The logP of the molecule derived from SMILES data using RDKit.
rotamers - The number of rotamers of the molecule derived from SMILES data using RDKit.

Targets: binds

Proteins are encoded in the genome, and names of the genes encoding those proteins are typically bestowed by their discoverers and regulated by the HUGO Gene Nomenclature Committee. The protein products of these genes can sometimes have different names, often due to the history of their discovery.
Data from: Argon data for enriched MORB from the 8°20' N seamount chain
s.cnmilf.com
data.usgs.gov
+1more
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Argon data for enriched MORB from the 8°20' N seamount chain [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/argon-data-for-enriched-morb-from-the-820-n-seamount-chain
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This dataset accompanies the planned publication 'Near-Ridge Magmatism Constrained Using 40Ar/39Ar Dating of Enriched MORB from the 8°20' N Seamount Chain'. The Ar/Ar data are for samples that record the volcanic history of the area. The geochronology provides time constraints for the eruption of rocks studied in the manuscript. Samples were collected from the 8°20' N seamount chain by Molly Anderson (University of Florida), who sent them to the USGS Denver Argon Geochronology Laboratory for Ar/Ar analysis.
EESQ and EESQ-M reliability and validity raw data
data.mendeley.com
Updated Dec 2, 2024
Cite
nik nasihah nik ramli (2024). EESQ and EESQ-M reliability and validity raw data [Dataset]. http://doi.org/10.17632/ct74hk8wbw.1
Explore at:
Unique identifier
https://doi.org/10.17632/ct74hk8wbw.1
Dataset updated
Dec 2, 2024
Authors
nik nasihah nik ramli
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains raw data from the pilot study samples used for the validity and reliability testing of the Environmental Enrichment Scale Questionnaire (EESQ) and its translated Malay version (EESQ-M).
Enhancing MovieLens Dataset: Enriching Recommendations with Audio...
zenodo.org
data.niaid.nih.gov
bin, zip
Updated Jun 16, 2023
Cite
Victor Botti-Cebria; Laura Sebastia; Vanessa Moscardo (2023). Enhancing MovieLens Dataset: Enriching Recommendations with Audio Information, Transcriptions, and Metadata [Dataset]. http://doi.org/10.5281/zenodo.8037433
Explore at:
Available download formats: zip, bin
Unique identifier
https://doi.org/10.5281/zenodo.8037433
Dataset updated
Jun 16, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Victor Botti-Cebria; Laura Sebastia; Vanessa Moscardo
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Nowadays, there are lots of datasets available for training and experimentation in the field of recommender systems. Specifically, in the recommendation of audiovisual content, the MovieLens dataset is a prominent example. It is focused on the user-item relationship, providing actual interaction data between users and movies. However, although movies can be described with several characteristics, this dataset only offers limited information about the movie genres.

In this work, we propose enriching the MovieLens dataset by incorporating metadata available on the web (such as cast, description, keywords, etc.) and movie trailers. By leveraging the trailers, we extract audio information and generate transcriptions for each trailer, introducing a crucial textual dimension to the dataset. The audio information was extracted by the waveform and frequency analysis, followed by the application of dimensionality reduction techniques. For the transcription generation, the deep learning model Whisper was used. Finally, metadata was obtained from TMDB, and the BERT model was applied to extract embeddings.

These additional attributes enrich the original dataset, enabling deeper and more precise analysis. The use of this extended and enhanced dataset could drive significant advancements in recommendation systems, enhancing user experiences by providing more relevant and tailored movie recommendations based on users' tastes and preferences.
Factori USA Consumer Graph Data | socio-demographic, location, interest and...
datarade.ai
.json, .csv
Updated Jul 23, 2022
Cite
Factori (2022). Factori USA Consumer Graph Data | socio-demographic, location, interest and intent data | E-Commere |Mobile Apps | Online Services [Dataset]. https://datarade.ai/data-products/factori-usa-consumer-graph-data-socio-demographic-location-factori
Explore at:
Available download formats: .json, .csv
Dataset updated
Jul 23, 2022
Dataset authored and provided by
Factori
Area covered
United States of America
Description
Our consumer data is gathered and aggregated via surveys, digital services, and public data sources. We use powerful profiling algorithms to collect and ingest only fresh and reliable data points.

Our comprehensive data enrichment solution includes a variety of data sets that can help you address gaps in your customer data, gain a deeper understanding of your customers, and power superior client experiences.

Geography - City, State, ZIP, County, CBSA, Census Tract, etc.

Demographics - Gender, Age Group, Marital Status, Language etc.

Financial - Income Range, Credit Rating Range, Credit Type, Net worth Range, etc

Persona - Consumer type, Communication preferences, Family type, etc

Interests - Content, Brands, Shopping, Hobbies, Lifestyle etc.

Household - Number of Children, Number of Adults, IP Address, etc.

Behaviours - Brand Affinity, App Usage, Web Browsing etc.

Firmographics - Industry, Company, Occupation, Revenue, etc

Retail Purchase - Store, Category, Brand, SKU, Quantity, Price etc.

Auto - Car Make, Model, Type, Year, etc.

Housing - Home type, Home value, Renter/Owner, Year Built etc.

Consumer Graph Schema & Reach: Our data reach represents the total number of counts available within various categories and comprises attributes such as country location, MAU, DAU & Monthly Location Pings:

Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method on a suitable interval (daily/weekly/monthly).

Consumer Graph Use Cases:

360-Degree Customer View: Get a comprehensive image of customers by means of internal and external data aggregation.

Data Enrichment: Leverage online-to-offline consumer profiles to build holistic audience segments and improve campaign targeting using user data enrichment.

Fraud Detection: Use multiple digital (web and mobile) identities to verify real users and detect anomalies or fraudulent activity.

Advertising & Marketing: Understand audience demographics, interests, lifestyle, hobbies, and behaviors to build targeted marketing campaigns.

Using Factori Consumer Data graph you can solve use cases like:

Acquisition Marketing: Expand your reach to new users and customers using lookalike modeling with your first-party audiences to extend to other potential consumers with similar traits and attributes.

Lookalike Modeling: Build lookalike audience segments using your first-party audiences as a seed to extend your reach for running marketing campaigns to acquire new users or customers.

And also: CRM Data Enrichment, Consumer Data Enrichment, B2B Data Enrichment, B2C Data Enrichment, Customer Acquisition, Audience Segmentation, 360-Degree Customer View, Consumer Profiling, Consumer Behaviour Data

Here's the schema of Consumer Data:

person_id, first_name, last_name, age, gender, linkedin_url, twitter_url, facebook_url, city, state, address, zip, zip4, country, delivery_point_bar_code, carrier_route, walk_seuqence_code, fips_state_code, fips_country_code, country_name, latitude, longtiude, address_type, metropolitan_statistical_area, core_based+statistical_area, census_tract, census_block_group, census_block, primary_address, pre_address, streer, post_address, address_suffix, address_secondline, address_abrev, census_median_home_value, home_market_value, property_build+year, property_with_ac, property_with_pool, property_with_water, property_with_sewer, general_home_value, property_fuel_type, year, month, household_id, Census_median_household_income, household_size, marital_status, length+of_residence, number_of_kids, pre_school_kids, single_parents, working_women_in_house_hold, homeowner, children, adults, generations, net_worth, education_level, occupation, education_history, credit_lines, credit_card_user, newly_issued_credit_card_user, credit_range_new, credit_cards, loan_to_value, mortgage_loan2_amount, mortgage_loan_type, mortgage_loan2_type, mortgage_lender_code, mortgage_loan2_render_code, mortgage_lender, mortgage_loan2_lender, mortgage_loan2_ratetype, mortgage_rate, mortgage_loan2_rate, donor, investor, interest, buyer, hobby, personal_email, work_email, devices, phone, employee_title, employee_department, employee_job_function, skills, recent_job_change, company_id, company_name, company_description, technologies_used, office_address, office_city, office_country, office_state, office_zip5, office_zip4, office_carrier_route, office_latitude, office_longitude, office_cbsa_code, office_census_block_group, office_census_tract, office_county_code, company_phone, company_credit_score, company_csa_code, company_dpbc, company_franchiseflag, company_facebookurl, company_linkedinurl, company_twitterurl, company_website, company_fortune_rank, company_government_type, company_headquarters_branch, company_home_business, company_industry, company_num_pcs_used, company_num_employees, company_firm_individual, company_msa, company_msa_name, company_naics_code, company_naics_description, company_naics_code2, company_naics_description2, company_sic_code2, company_sic_code2_desc...
Data from: Large-Scale Learning of Structure−Activity Relationships Using a...
acs.figshare.com
zip
Updated May 30, 2023
Cite
Georg Hinselmann; Lars Rosenbaum; Andreas Jahn; Nikolas Fechner; Claude Ostermann; Andreas Zell (2023). Large-Scale Learning of Structure−Activity Relationships Using a Linear Support Vector Machine and Problem-Specific Metrics [Dataset]. http://doi.org/10.1021/ci100073w.s001
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1021/ci100073w.s001
Dataset updated
May 30, 2023
Dataset provided by
ACS Publications
Authors
Georg Hinselmann; Lars Rosenbaum; Andreas Jahn; Nikolas Fechner; Claude Ostermann; Andreas Zell
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. This linear support vector machine formulation performs very well when applied to high-dimensional sparse feature vectors. An additional advantage is the average linear complexity in the number of non-zero features of a prediction. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted extensive benchmarking to evaluate the performance on large-scale problems up to a size of 175,000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. LIBLINEAR outperformed these reference approaches in a direct comparison. A comparison to literature results showed that the LIBLINEAR performance is competitive but does not match the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.
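The leave-cluster-out cross-validation mentioned in this description can be illustrated with a minimal sketch. It is not the authors' code: the function name `leave_cluster_out_splits` and the cluster labels are hypothetical, and the sketch only shows the splitting idea (every compound of one chemotype cluster is held out together), not the nested procedure or the weighted AUC measures.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_cluster_out_splits(
    cluster_ids: Sequence[Hashable],
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices), holding out one whole cluster per fold."""
    # Group sample indices by their cluster label, keeping first-seen order.
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    for held_out, test in members.items():
        # Training set: every sample that belongs to any other cluster.
        train = [i for c, idx in members.items() if c != held_out for i in idx]
        yield train, test


# Hypothetical chemotype labels for six compounds.
clusters = ["A", "A", "B", "C", "B", "C"]
folds = list(leave_cluster_out_splits(clusters))
```

Because a cluster never appears on both sides of a split, the test fold measures how well a model generalises to chemotypes it has not seen during training.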
Gene set enrichment data files
figshare.com
txt
Updated Oct 27, 2022
Cite
Leon French (2022). Gene set enrichment data files [Dataset]. http://doi.org/10.6084/m9.figshare.21404907.v3
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21404907.v3
Dataset updated
Oct 27, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Leon French
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Files example_GO_groups.csv: example Gene Ontology group to gene symbol mapping.
Data from: Assessment of targeted enrichment locus capture across time and...
datadryad.org
search.dataone.org
+2more
zip
Updated May 18, 2023
Cite
Rhema Uche-Dike; Aaron Goodman; Ethan Tolman; John Abbott; Jesse Breinholt; Seth Bybee; Paul Frandsen; Stephen Gosnell; Rob Guralnick; Vincent Kalkman; Manpreet Kohli; Judicael Fomekong-Lontchi; Pungki Lupiyaningdyah; Lacie Newton; Jessica Ware (2023). Assessment of targeted enrichment locus capture across time and museums using odonate specimens [Dataset]. http://doi.org/10.5061/dryad.kprr4xh8z
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.kprr4xh8z
Dataset updated
May 18, 2023
Dataset provided by
Dryad
Authors
Rhema Uche-Dike; Aaron Goodman; Ethan Tolman; John Abbott; Jesse Breinholt; Seth Bybee; Paul Frandsen; Stephen Gosnell; Rob Guralnick; Vincent Kalkman; Manpreet Kohli; Judicael Fomekong-Lontchi; Pungki Lupiyaningdyah; Lacie Newton; Jessica Ware
Time period covered
2023
Description
IQ-Tree v.2.1.3 (Data matrix - fasta file)
UNIX/Command line or a Text Editor for viewing (fastq files - raw data)
FigTree (Tree file - .treefile)
BBEdit (Partition files - Nexus)
MAT-Builder datasets
data.niaid.nih.gov
Updated Apr 19, 2023
Cite
Chiara Pugliese (2023). MAT-Builder datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7839805
Explore at:
Dataset updated
Apr 19, 2023
Dataset provided by
Chiara Pugliese
Chiara Renso
Francesco Lettich
Fabio Pinelli
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The archive contains two datasets that have been used to empirically evaluate MAT-Builder, a system to generate multiple aspect trajectories.

The first one is located in the "rome" folder and contains 26395 trajectories from 3181 individuals. The trajectories cover the city of Rome and were collected from OpenStreetMap. The folder also contains auxiliary datasets: the set of POIs within the boundaries of the province of Rome (downloaded from OpenStreetMap; see the "poi" subfolder), historical weather information (downloaded from Meteostat, https://meteostat.net/it/; see the "weather" subfolder), and a synthetically generated dataset of social media posts from the individuals (see the "tweets" subfolder). All the datasets are pandas DataFrames, except for the POI dataset, which is a GeoPandas GeoDataFrame. All the datasets are stored in the Parquet format.

The second one is located in the "geolife" folder and contains the GeoLife dataset, with 17621 trajectories from 178 users. The timestamps of the trajectory samples have been adjusted from the GMT to the GMT+8 timezone. As with the first dataset, this folder also contains a dataset of POIs, a dataset of historical weather information, and a dataset of synthetically generated social media posts.
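As an illustration of the GMT to GMT+8 adjustment described above, here is a minimal sketch using only the Python standard library. The helper name `gmt_to_gmt_plus_8`, the timestamp string format, and the sample value are assumptions made for this example; they are not taken from the MAT-Builder code or from the dataset files.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset timezone eight hours ahead of GMT/UTC.
GMT_PLUS_8 = timezone(timedelta(hours=8))


def gmt_to_gmt_plus_8(timestamp: str) -> datetime:
    """Parse a naive 'YYYY-MM-DD HH:MM:SS' GMT timestamp and convert it to GMT+8."""
    naive = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # Mark the parsed value as GMT first, then shift it to the target offset.
    return naive.replace(tzinfo=timezone.utc).astimezone(GMT_PLUS_8)


# Hypothetical sample timestamp.
local = gmt_to_gmt_plus_8("2008-10-23 18:30:00")
print(local.isoformat())  # 2008-10-24T02:30:00+08:00
```

Note that the shift can move a sample onto the next calendar day, which matters when trajectories are later joined with daily weather records.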

For more information on the MAT-Builder project (published papers, how to use the datasets, how the information within the datasets is structured, and so on), see MAT-Builder's GitHub page: https://github.com/chiarap2/MAT_Builder.
Clust_100_GE_datasets
zenodo.org
zip
Updated Jan 24, 2020
Cite
Basel Abu-Jamous; Steven Kelly (2020). Clust_100_GE_datasets [Dataset]. http://doi.org/10.5281/zenodo.1298541
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.1298541
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Basel Abu-Jamous; Steven Kelly
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
100 microarray and RNA-seq gene expression datasets from five model species (human, mouse, fruit fly, arabidopsis plants, and baker's yeast). These datasets represent the benchmark set that was used to test our clust clustering method and to compare it with seven widely used clustering methods (Cross-Clustering, k-means, self-organising maps, MCL, hierarchical clustering, CLICK, and WGCNA). This data resource includes raw data files, pre-processed data files, clustering results, clustering results evaluation, and scripts.

The files are split into eight zipped parts, 100Datasets_0.zip to 100Datasets_7.zip. The contents of all the zipped files should be extracted to a single folder (e.g. 100Datasets).
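One way to perform this extraction is sketched below with Python's standard `zipfile` module. The function name `extract_parts` and its parameters are hypothetical, and the sketch assumes each part can simply be unpacked into the same target folder; the internal layout of the archives is not described in this listing.

```python
import zipfile
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def extract_parts(
    source_dir: PathLike,
    target_dir: PathLike,
    n_parts: int = 8,
    prefix: str = "100Datasets_",
) -> List[Path]:
    """Extract <prefix>0.zip .. <prefix>{n_parts-1}.zip into one target folder."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    extracted = []
    for part in range(n_parts):
        archive = Path(source_dir) / f"{prefix}{part}.zip"
        if not archive.exists():
            # Fail early instead of silently producing an incomplete folder.
            raise FileNotFoundError(archive)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(archive)
    return extracted
```

Calling `extract_parts("downloads", "100Datasets")` would unpack all eight parts from a hypothetical `downloads` directory into `100Datasets`.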

Below is a thorough description of the files and folders in this data resource.

Scripts

The scripts used to apply each one of the clustering methods to each one of the 100 datasets and to evaluate their results are all included in the folder (scripts/).

Datasets and clustering results (folders starting with D)

The datasets are labelled as D001 to D100. Each dataset has two folders: D###/ and D###_Res/, where ### is the number of the dataset. The first folder only includes the raw dataset while the second folder includes the results of applying the clustering methods to that dataset. The files ending with _B.tsv include clustering results in the form of a partition matrix. The files ending with _E include metrics evaluating the clustering results. The files ending with _go and _go_E respectively include the enriched GO terms in the clustering results and evaluation metrics of these GO terms. The files ending with _REACTOME and _REACTOME_E are similar to the GO term files but for the REACTOME pathway enrichment analysis. Each of these D###_Res/ folders includes a sub-folder "ParamSweepClust" which includes the results of applying clust multiple times to the same dataset while sweeping some parameters.
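The suffix conventions above can be turned into a small lookup, sketched here under stated assumptions. The function name `classify_result_file`, the category strings, and the sample file names are hypothetical, and the sketch assumes the suffix sits directly before an optional file extension. Longer suffixes are checked first so that a file ending in `_go_E` is not mistaken for a plain `_E` file.

```python
from pathlib import PurePath

# Ordered so that more specific suffixes are tested before shorter ones.
SUFFIX_KINDS = [
    ("_REACTOME_E", "REACTOME enrichment evaluation"),
    ("_go_E", "GO enrichment evaluation"),
    ("_REACTOME", "REACTOME enriched pathways"),
    ("_go", "GO enriched terms"),
    ("_B", "partition matrix"),
    ("_E", "clustering evaluation metrics"),
]


def classify_result_file(filename: str) -> str:
    """Return the kind of result file implied by the name's suffix."""
    # Drop a trailing extension such as ".tsv" before matching.
    stem = PurePath(filename).stem
    for suffix, kind in SUFFIX_KINDS:
        if stem.endswith(suffix):
            return kind
    return "unknown"


# Hypothetical file names following the described conventions.
print(classify_result_file("clust_B.tsv"))     # partition matrix
print(classify_result_file("kmeans_go_E.tsv")) # GO enrichment evaluation
```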

Large datasets analysis results

The folder LargeDatasets/ includes data and results for what we refer to as "large" datasets. These are 19 datasets that have more than 50 samples including replicates and have therefore not been included in the set of 100 datasets. However, they fit all of the other dataset selection criteria. We have compared clust with the other clustering methods over these datasets to demonstrate that clust still outperforms the other methods over larger datasets. This folder includes folders LD001/ to LD019/ and LD001_Res/ to LD019_Res/. These have a similar format and contents to the D###/ and D###_Res/ folders described above.

Simultaneous analysis of multiple datasets (folders starting with MD)

As our clust method is designed to be able to extract clusters from multiple datasets simultaneously, we also tested it over multiple datasets. All folders starting with MD_ are related to "multiple datasets (MD)" results. Each MD experiment simultaneously analyses d randomly selected datasets either out of a set of 10 arabidopsis datasets or out of a set of 10 yeast datasets. For each one of the two species, all d values from 2 to 10 were tested, and at each one of these d values, 10 different runs were conducted, where at each run a different subset of d datasets is selected randomly.

The folders MD_10A and MD_10Y include the full sets of 10 arabidopsis or 10 yeast datasets, respectively. Each folder with the format MD_10#_d#_Res## includes the results of applying the eight clustering methods at one of the 10 random runs of one of the selected d values. For example, the "MD_10A_d4_Res03/" folder includes the clustering results of the 3rd random selection of 4 arabidopsis datasets (the letter A in the folder's name refers to arabidopsis).
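The folder naming scheme can be decoded programmatically. The following is a minimal sketch, not part of the published scripts: the names `MD_PATTERN`, `MDRun`, and `parse_md_folder` are hypothetical, and the pattern assumes a two-digit run number, as in the "Res03" example.

```python
import re
from typing import NamedTuple, Optional

# MD_10<species>_d<number of datasets>_Res<run>, with an optional trailing slash.
MD_PATTERN = re.compile(r"^MD_10(?P<species>[AY])_d(?P<d>\d+)_Res(?P<run>\d{2})/?$")
SPECIES = {"A": "arabidopsis", "Y": "yeast"}


class MDRun(NamedTuple):
    species: str
    n_datasets: int
    run: int


def parse_md_folder(name: str) -> Optional[MDRun]:
    """Decode a multiple-datasets results folder name, or return None if it does not match."""
    match = MD_PATTERN.match(name)
    if match is None:
        return None
    return MDRun(SPECIES[match["species"]], int(match["d"]), int(match["run"]))


print(parse_md_folder("MD_10A_d4_Res03/"))
# MDRun(species='arabidopsis', n_datasets=4, run=3)
```

Folders that hold the full dataset collections, such as MD_10A, do not carry the `_d#_Res##` part and therefore return `None`.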

Our clust method is applied directly over multiple datasets where each dataset is in a separate data file. Each "MD_10#_d#_Res##" folder includes these individual files in a sub-folder named "Processed_Data/". However, the other clustering methods only accept a single input data file. Therefore, the datasets are merged first before being submitted to these methods. Each "MD_10#_d#_Res##" folder includes a file "X_merged.tsv" for the merged data.

Evaluation metrics (folders starting with Metrics)

Each clustering results folder (D###_Res or MD_10#_d#_Res##) includes some clustering evaluation files ending with _E. This information is combined into tables for all datasets, and these tables appear in the folders starting with "Metrics_".

Other files and folders

The GO folder includes the reference GO term annotations for arabidopsis and yeast. Similarly, the REACTOME folder includes the reference REACTOME pathway annotations for arabidopsis and yeast. The Datasets file includes a tab-delimited table describing the 100 datasets. The SearchCriterion file includes the objective methodology of searching the NCBI database to select these 100 datasets. The Specials file includes some special considerations for a couple of datasets that differ a bit from what is described in the SearchCriterion file. The Norm### files and the files in the Reps/ folder describe normalisation codes and replicate structures for the datasets and were fed to the clust method as inputs. The Plots/ folder includes plots of the gene expression profiles of the individual genes in the clusters generated by each one of the eight methods over each one of the 100 datasets. Only up to 14 clusters per method are plotted.
Employee Data | The Largest Dataset Of Active Profiles | Global / 1B Records...
datarade.ai
.json
Updated Apr 19, 2025
Cite
Avanteer (2025). Employee Data | The Largest Dataset Of Active Profiles | Global / 1B Records / Updated Daily [Dataset]. https://datarade.ai/data-products/employee-data-the-largest-dataset-of-active-profiles-glob-avanteer
Explore at:
Available download formats: .json
Dataset updated
Apr 19, 2025
Dataset authored and provided by
Avanteer
Area covered
Fiji, Gambia, Nicaragua, Pitcairn, United Arab Emirates, Tunisia, Bulgaria, Maldives, Anguilla, State of
Description
//// 🌍 Avanteer Employee Data ////

The Largest Dataset of Active Global Profiles 1B+ Records | Updated Daily | Built for Scale & Accuracy

Avanteer’s Employee Data offers unparalleled access to the world’s most comprehensive dataset of active professional profiles. Designed for companies building data-driven products or workflows, this resource supports recruitment, lead generation, enrichment, and investment intelligence — with unmatched scale and update frequency.

//// 🔧 What You Get ////

1B+ active profiles across industries, roles, and geographies

Work history, education history, languages, skills and multiple additional datapoints.

AI-enriched datapoints include: gender, age, normalized seniority, normalized department, normalized skillset, and MBTI assessment.

Daily updates, with change-tracking fields to capture job changes, promotions, and new entries.

Flexible delivery via API, S3, or flat file.

Choice of formats: raw, cleaned, or AI-enriched.

Built-in compliance aligned with GDPR and CCPA.

//// 💡 Key Use Cases ////

✅ Smarter Talent Acquisition: Identify, enrich, and engage high-potential candidates using up-to-date global profiles.

✅ B2B Lead Generation at Scale: Build prospecting lists with confidence using job-related and firmographic filters to target decision-makers across verticals.

✅ Data Enrichment for SaaS & Platforms: Supercharge ATS, CRMs, or HR tech products by syncing enriched, structured employee data through real-time or batch delivery.

✅ Investor & Market Intelligence: Analyze team structures, hiring trends, and senior leadership signals to discover early-stage investment opportunities or evaluate portfolio companies.

//// 🧰 Built for Top-Tier Teams Who Move Fast ////

Zero duplicates, by design

<300ms API response time

99.99% guaranteed API uptime

Onboarding support including data samples, test credits, and consultations

Advanced data quality checks

//// ✅ Why Companies Choose Avanteer ////

➔ The largest daily-updated dataset of global professional profiles

➔ Trusted by sales, HR, and data teams building at enterprise scale

➔ Transparent, compliant data collection with opt-out infrastructure baked in

➔ Dedicated support with fast onboarding and hands-on implementation help

////////////////////////////////

Empower your team with reliable, current, and scalable employee data — all from a single source.
Teosto Open Api – open interface for live music data
data.europa.eu
unknown
Updated Dec 6, 2023
Cite
Yhdistykset ja säätiöt (2023). Teosto Open Api – open interface for live music data [Dataset]. https://data.europa.eu/data/datasets/3c7de080-ea97-4ddb-9a26-218579825170?locale=en
Explore at:
Available download formats: unknown
Dataset updated
Dec 6, 2023
Dataset authored and provided by
Yhdistykset ja säätiöt
Description
The live music data collected by Teosto is the largest and most comprehensive in Finland. The data opened through the open interface now includes all live gigs announced to Teosto in Finland last year (2014): the dates of the gigs, the venues with their location and coordinates, the performers, the songs presented and the authors of the songs.

We challenge developers to enrich live music spatial data and develop new, innovative uses for it. Examples of data enrichment include combining other open spatial datasets with event data or music-related metadata with song-specific data.

The development of live data is part of the Open Finland Challenge competition and the Ultrahack event.
NBA WNBA play-by-play and shots data
kaggle.com
zip
Updated Jun 26, 2025
Cite
Vladislav Shufinskiy (2025). NBA WNBA play-by-play and shots data [Dataset]. https://www.kaggle.com/datasets/brains14482/nba-playbyplay-and-shotdetails-data-19962021
Explore at:
Available download formats: zip (1683596108 bytes)
Dataset updated
Jun 26, 2025
Authors
Vladislav Shufinskiy
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Description

The NBA and WNBA dataset is a large-scale play-by-play and shot-detail dataset covering both NBA and WNBA games, collected from multiple public sources (e.g., official league APIs and stats sites). It provides every in-game event, from period starts, jump balls, fouls, turnovers, rebounds, and field-goal attempts through free throws, along with detailed shot metadata (shot location, distance, result, assisting player, etc.).

You can also download the dataset from GitHub or Google Drive.

Tutorials

NBA play-by-play dataset R example

I will be grateful for ratings and stars on GitHub, but the best thanks is using the dataset in your projects.

Useful links:

nba-on-court: package for working with NBA and WNBA play-by-play data

Ryan Davis: Analyze the Play by Play Data

Python nba_api package for working with the NBA API: https://github.com/swar/nba_api

R hoopR package for working with the NBA API: https://hoopr.sportsdataverse.org/

Motivation

I made this dataset because I want to simplify and speed up work with play-by-play data so that researchers spend their time studying data, not collecting it. Due to the limits on requests on the NBA and WNBA website, and also because you can get play-by-play of only one game per request, collecting this data is a very long process.

Using this dataset, you can reduce the time to get information about one season from a few hours to a couple of seconds and spend more time analyzing data or building models.

I also added play-by-play information from other sources: pbpstats.com, data.nba.com, cdnnba.com. This data will enrich information about the progress of each game and hopefully add opportunities to do interesting things.

Contact Me

If you have any questions or suggestions about the dataset, you can reach me through whichever channel is most convenient for you:

LinkedIn

GitHub

X

Telegram
Phone Number Data | Global Coverage | 100M+ B2B Mobile Phone Numbers | 95%+...
datarade.ai
.json, .csv
Cite
Forager.ai, Phone Number Data | Global Coverage | 100M+ B2B Mobile Phone Numbers | 95%+ Accuracy [Dataset]. https://datarade.ai/data-products/global-mobile-phone-number-data-90m-95-accuracy-api-b-forager-ai-905f
Explore at:
Available download formats: .json, .csv
Dataset provided by
Forager.ai
Area covered
Botswana, Martinique, South Georgia and the South Sandwich Islands, Macedonia (the former Yugoslav Republic of), Japan, Colombia, United Arab Emirates, Uruguay, Moldova (Republic of), Cambodia
Description
Global B2B Mobile Phone Number Database | 100M+ Verified Contacts | 95% Accuracy

Forager.ai provides the world’s most reliable mobile phone number data for businesses that refuse to compromise on quality. With 100 million+ professionally verified mobile numbers refreshed every 3 weeks, our database ensures 95% accuracy – so your teams never waste time on dead-end leads.

Why Our Data Wins

✅ Accuracy You Can Trust: 95% of mobile numbers are verified against live carrier records and tied to current job roles. Say goodbye to “disconnected number” voicemails.

✅ Depth Beyond Digits: Each contact includes 150+ data points:

Direct mobile numbers

Current job title, company, and department

Full career history + education background

Location data + LinkedIn profiles

Company size, industry, and revenue

✅ Freshness Guaranteed: Bi-weekly updates combat job-hopping and role changes – critical for sales teams targeting decision-makers.

✅ Ethically Sourced & Compliant: First-party collected data with full GDPR/CCPA compliance.

Who Uses This Data?

Sales Teams: Cold-call C-suite prospects with verified mobile numbers.

Marketers: Run hyper-personalized SMS/WhatsApp campaigns.

Recruiters: Source passive candidates with up-to-date contact intel.

Data Vendors: License premium datasets to enhance your product.

Tech Platforms: Power your SaaS tools via API with enterprise-grade B2B data.

Flexible Delivery, Instant Results

API (REST): Real-time integration for CRMs, dialers, or marketing stacks

CSV/JSON: Campaign-ready files.

PostgreSQL: Custom databases for large-scale enrichment

Compliance: Full audit trails + opt-out management

Why Forager.ai?

→ Proven ROI: Clients see 62% higher connect rates vs. industry averages (request case studies).

→ No Guesswork: Test-drive free samples before committing.

→ Scalable Pricing: Pay per record, license datasets, or get unlimited API access.

B2B Mobile Phone Data | Verified Contact Database | Sales Prospecting Lists | CRM Enrichment | Recruitment Phone Numbers | Marketing Automation | Phone Number Datasets | GDPR-Compliant Leads | Direct Dial Contacts | Decision-Maker Data

Need Proof? Contact us to see why Fortune 500 companies and startups alike trust Forager.ai for mission-critical outreach.
Data from: Enriching the ant tree of life: enhanced UCE bait set for...
datadryad.org
data.niaid.nih.gov
+1more
zip
Updated Feb 14, 2017
Cite
Michael G. Branstetter; John T. Longino; Philip S. Ward; Brant C. Faircloth (2017). Enriching the ant tree of life: enhanced UCE bait set for genome-scale phylogenetics of ants and other Hymenoptera [Dataset]. http://doi.org/10.5061/dryad.89n87
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.89n87
Dataset updated
Feb 14, 2017
Dataset provided by
Dryad
Authors
Michael G. Branstetter; John T. Longino; Philip S. Ward; Brant C. Faircloth
Time period covered
Feb 12, 2017
Area covered
Global
Description
Targeted enrichment of conserved genomic regions (e.g., ultraconserved elements or UCEs) has emerged as a promising tool for inferring evolutionary history in many organismal groups. Because the UCE approach is still relatively new, much remains to be learned about how best to identify UCE loci and design baits to enrich them.

We test an updated UCE identification and bait design workflow for the insect order Hymenoptera, with a particular focus on ants. The new strategy augments a previous bait design for Hymenoptera by (a) changing the parameters by which conserved genomic regions are identified and retained, and (b) increasing the number of genomes used for locus identification and bait design. We perform in vitro validation of the approach in ants by synthesizing an ant-specific bait set that targets UCE loci and a set of “legacy” phylogenetic markers. Using this bait set, we generate new data for 84 taxa (16/17 ant subfamilies) and extract loci from an additional 17 genome-e...
Data on the Enrichment and Isolation of the Acetylenotrophic and...
catalog.data.gov
data.usgs.gov
+1more
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Data on the Enrichment and Isolation of the Acetylenotrophic and Diazotrophic Isolate Bradyrhizobium sp. strain I71 (ver. 2.0, September 2022) [Dataset]. https://catalog.data.gov/dataset/data-on-the-enrichment-and-isolation-of-the-acetylenotrophic-and-diazotrophic-isolate-brad
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
Acetylene (C2H2) is a molecule rarely found in nature, with few known natural sources, but acetylenotrophic microorganisms can use acetylene as their primary carbon and energy source. As of 2018 there were 15 known strains of aerobic and anaerobic acetylenotrophs; however, we hypothesized that there may be as yet unrecognized diversity of acetylenotrophs in nature. In this study, we expanded this diversity by isolating an aerobic acetylenotroph, Bradyrhizobium sp. strain I71, from trichloroethene (TCE)-contaminated soils undergoing bioremediation. TCE-contaminated soils from the NASA Ames Research Center in California were used to establish soil microcosms with acetylene as the primary carbon substrate, and acetylene uptake was tracked over time and reported in T1_soil_microcosm_v2.0.csv. DNA was extracted from soil microcosm samples for microbial community analysis based on 16S rRNA gene sequencing; the resulting operational taxonomic units are presented in T2_soil_OTU_v2.0.csv. Bradyrhizobium sp. strain I71 was isolated from the soil microcosms, and acetylene uptake and cell growth data for the isolate over time are shown in T3_soil_isolate_v2.0.csv. Nitrogen fixation assays for the pure culture of Bradyrhizobium sp. strain I71 are reported in T4_N2_fixation_v2.0.csv. Acetylene concentrations and cell densities from acetylenotrophic and heterotrophic growth assays for Bradyrhizobium sp. strain I71 are reported in T5_GrowthCurve_v2.0.csv.


