Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This LatentJam dataset contains 12 music latent representations (9 content-based and 3 collaborative-filtering) for 29 275 music tracks from Jamendo. It is released as part of the publication "Similarity of Nearest-Neighbor Query Results in Deep Latent Spaces". For more details on the individual models used for extraction, refer to the text of the paper.
The example code that uses this dataset and reproduces the experiments and analysis done in the paper is available at https://github.com/philtgun/compare-embeddings. See the README for more details on how to use this dataset and individual files.
The content-based representations have been extracted with the Essentia library (essentia-tensorflow 2.1b6.dev374) during the first author's internship at Jamendo in 2021 as part of the MIP-Frontiers project. The collaborative filtering representations have been computed from data provided by Jamendo with the Implicit library (AlternatingLeastSquares algorithm, implicit 0.4.4). The Python version used is 3.7.13.
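For readers unfamiliar with the Implicit library, the sketch below illustrates how ALS item embeddings of this kind can be computed; the interaction matrix, factor count, and shapes are illustrative assumptions, not the exact configuration used for LatentJam.

# Illustrative sketch of computing collaborative-filtering track embeddings
# with implicit 0.4.4; matrix contents and factor count are assumptions.
import scipy.sparse as sparse
from implicit.als import AlternatingLeastSquares

# Sparse item-user interaction matrix (tracks x users); implicit 0.4.x
# expects items on the rows when calling fit().
item_user = sparse.random(29275, 1000, density=0.01, format="csr")

model = AlternatingLeastSquares(factors=128)  # factor count is an assumption
model.fit(item_user)

track_embeddings = model.item_factors  # one latent vector per track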
This dataset is released under CC BY-NC-SA 4.0 License.
Please cite the publication if you use this dataset:
@inproceedings{tovstogan_similarity_2022,
title = {Similarity of nearest-neighbor query results in deep latent spaces},
author = {Tovstogan, Philip and Serra, Xavier and Bogdanov, Dmitry},
booktitle = {Proceedings of the 19th Sound and Music Computing Conference ({SMC})},
year = {2022}
}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overview

Hessian QM9 is the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the $\omega$B97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as in water, tetrahydrofuran, and toluene using an implicit solvation model. A pre-print article associated with this dataset is available here.

Data records

The dataset is stored in Hugging Face's dataset format. For each of the four implicit solvent environments (vacuum, THF, toluene, and water), the data is divided into separate datasets containing vibrational analysis of 41,645 optimized geometries. Labels follow the QM9 molecule labelling system given by Ramakrishnan et al. Please note that only molecules containing H, C, N, and O were considered. This exclusion was due to the limited number of molecules containing fluorine in the QM9 dataset, which was not sufficient to build a good description of the chemical environment for fluorine atoms. Including these molecules may have reduced the overall precision of any models trained on our data.

Load the dataset

Use the following Python script to load the dataset dictionary:

from datasets import load_from_disk

dataset = load_from_disk(root_directory)  # root_directory: path to the downloaded dataset
print(dataset)
Expected output:

DatasetDict({
    vacuum: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    thf: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    toluene: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    water: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    })
})
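Individual records can then be indexed per solvent split. The sketch below is a minimal illustration; the field names come from the output above, while the array shapes are assumptions (N denotes the number of atoms).

# Minimal sketch of inspecting one record; field names come from the
# dataset description above, array shapes are assumptions.
import numpy as np

record = dataset["vacuum"][0]
positions = np.array(record["positions"])      # assumed shape (N, 3)
frequencies = np.array(record["frequencies"])  # one value per vibrational mode
print(record["label"], record["energy"], positions.shape, frequencies.shape)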
DFT Methods

All DFT calculations were carried out using the NWChem software package. The density functional used was $\omega$B97x with a 6-31G* basis set, to create data compatible with the ANI-1/ANI-1x/ANI-2x datasets. The self-consistent field (SCF) cycle was considered converged when changes in total energy and density were less than 1e-6 eV. All molecules in the set are neutral with a multiplicity of 1. The Mura-Knowles radial quadrature and Lebedev angular quadrature were used in the integration. Structures were optimized in vacuum and in three solvents (tetrahydrofuran, toluene, and water) using an implicit solvation model. The Hessian matrices, vibrational frequencies, and normal modes were computed for a subset of 41,645 molecular geometries using the finite-differences method.

Example model weights

An example model trained on the Hessian data is included in this dataset. Full details of the model will be provided in an upcoming publication. The model is an E(3)-equivariant graph neural network built with the e3x package. To load the model weights, use:

import jax.numpy as jnp

params = jnp.load('params_train_f128_i5_b16.npz', allow_pickle=True)['params'].item()
This dataset contains all data used in the linked paper. In addition, it contains all Python scripts used for the evaluation of the data. Note that the Python module pynocular is used within the scripts; this module has not yet been published, but its release is planned via https://github.com/baw-de.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 10 files with the data used to create the 11 figures in the journal paper:
Elsender, Daniel & Bate, Matthew R., 2024, Monthly Notices of the Royal Astronomical Society
The files contain the data and plotting scripts required to make the figures. Some files contain smoothed particle hydrodynamics (SPH) dump files; alongside these are the SPLASH (Price 2007) config files that were used to produce the figures in the paper.
The SPH dump files are Fortran binary files written in big endian format and generated by the sphNG code (Benz 1990; Bate 1995; Whitehouse & Bate 2004; Price & Monaghan 2007). They can be read, visualised, and manipulated using the free, publicly available SPLASH visualisation code (which reads sphNG dump files), written by Daniel J. Price, that can be downloaded from:
http://users.monash.edu.au/~dprice/splash/
Files are as follows:
fig_1: Used for plotting the results of the dustywave tests. Contains 4 sub folders: k_1, k_10, k_100, k_1000, which refer to the drag coefficients. Each sub folder has 2 SPH dump files, VD- and VG-, along with 3 SPLASH config files. VD- contains information to plot the dust quantities, and VG- contains information to plot the gas quantities.
fig_2: Used to plot the results of the dustyshock test. Contains 1 SPH dump and 2 SPLASH configs.
fig_3_4: Used to plot maximum densities in 4 different protostellar collapse calculations. Within this folder is a Python script (central_dens.py) and two sub folders: non-rotating and B008. Within these sub folders are dust-as-mixture and dust-as-particles folders, which refer to the two different dust evolution methods used. Within each of these folders there are 6 folders, one for each dust grain species, each containing a text file 'dustmass.txt'. The 'dustmass.txt' files have three columns: time, gas density, and dust density (see the reading sketch after this file list).
fig_5: Contains SPH dump files used to plot snapshots of three of the dust-as-mixture calculations, grain sizes 1, 30, 100 microns. The SPLASH config files are in the top level of this folder with the SPH dump files within sub folders 1mu, 30mu, and 100mu.
fig_6: The same as fig_5 except to plot snapshots from some of the dust-as-particles calculations, with grain sizes 1, 30, and 100 microns.
fig_7: Used to plot a radial dust density profile from 4 dust-as-mixture calculations; 2 calculations use the implicit method described in the paper, and 2 use an explicit timestepping method. In the top level of the folder is the Python script profile.py that will recreate the figure. The data used to make the plots are in files beginning with QD-; radius is in units of [cm] and density in units of [g cm^(-3)].
fig_8: Contains data used to plot snapshots from 2 planet-in-disc calculations that employ either the implicit method from the paper or an explicit method. The SPLASH configs are in the top level of the folder, and the explicit and implicit data files are in sub folders named as such. Each folder contains the 4 SPH dumps used in the paper.
fig_9: 3 text files and 1 Python script. The text files contain 3 columns: time [yrs], dust mass, and gas mass [solar masses]. The files correspond to calculations of discs with embedded planets using different implementations of dust. The data from the implicit method described in the paper are in dustmass_imp.txt, the explicit version of this calculation is in dustmass.txt, and an explicit calculation with a different dust formulation (described in the paper) is in dustmass_old.txt.
fig_10: Contains, in sub folders implicit and explicit, SPH dump files for the results of the dust settle test after 20 orbits, and the SPLASH config files to set the plot up.
fig_11: Two sub folders, explicit and implicit, each containing the SPH dump file needed to plot the cross section of the disc. The SPLASH config files are set up to take a cross section of the disc.
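As a minimal illustration, the three-column text files described above (the 'dustmass.txt' files in fig_3_4, and similarly the fig_9 files) can be read and plotted with a few lines of Python. The file path below is an assumption (adjust it to the sub folder and grain species of interest); only the column layout comes from the descriptions above.

# Illustrative sketch for reading one 'dustmass.txt' file from fig_3_4
# (columns: time, gas density, dust density); the path is an assumption.
import numpy as np
import matplotlib.pyplot as plt

time, gas_density, dust_density = np.loadtxt(
    "fig_3_4/non-rotating/dust-as-mixture/1mu/dustmass.txt", unpack=True
)

plt.semilogy(time, gas_density, label="gas")
plt.semilogy(time, dust_density, label="dust")
plt.xlabel("time")
plt.ylabel("maximum density")
plt.legend()
plt.show()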
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data files contain source code for dataset creation & model learning (neural-jsdf.zip) and a collected synthetic dataset of free & collided postures for the Franka robotic arm (sdf_3m_full_mesh.mat). Follow the Readme.MD files to launch the code if needed.
Corresponding Git repo: https://github.com/epfl-lasa/Neural-JSDF
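The .mat dataset can be inspected from Python as sketched below; the variable names stored inside the file are not documented here, so the snippet simply lists whatever keys are present. This is an illustrative assumption, not part of the released code.

# Illustrative sketch: list the variables stored in the released .mat file.
import scipy.io as sio

data = sio.loadmat("sdf_3m_full_mesh.mat")
print([key for key in data if not key.startswith("__")])  # skip MATLAB metadata keys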
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "LLM World of Words" (LWOW) [1] is a collection of datasets of English free association norms generated by various large language models (LLMs). Currently, the collection consists of datasets generated by Mistral, LLaMA3, and Claude Haiku. The datasets are modeled after the "Small World of Words" (SWOW) (https://smallworldofwords.org/en/project/) [2] English free association norms, generated by humans, consisting of over 12,000 cue words and over 3 million responses. The purpose of the LWOW datasets is to provide a way to investigate various aspects of the semantic memory of LLMs using an approach that has been applied extensively for investigating the semantic memory of humans. These datasets, together with the SWOW dataset, can be used to gain insights about similarities and differences in the language structures possessed by humans and LLMs.
Free associations are implicit mental connections between words or concepts. They are typically accessed by presenting humans (or AI agents) with a cue word and then asking them to respond with the first words that come to mind. The responses represent implicit associations that connect different concepts in the mind, reflecting the semantic representations that underlie patterns of thought, memory, and language. For example, given the cue word "woman", a common free association response might be "man", reflecting the associative mental relation between these two concepts.
Free associations have been extensively used in cognitive psychology and linguistics as a tool for studying language and cognitive information processing. They provide a way for researchers to understand how conceptual knowledge is organized and accessed in the mind. Free associations are often used to build network models of semantic memory by connecting cue words to their responses. When thousands of cues and responses are connected in this way, the result is a large network model that represents the complex organization of semantic knowledge. Such models enable the investigation of cognitive processes that take place within semantic memory, and can be used to study a variety of cognitive phenomena such as language learning, creativity, personality traits, and cognitive biases.
The LWOW datasets were validated using data from the Semantic Priming Project (https://www.montana.edu/attmemlab/spp.html) [3], which implements a lexical decision task (LDT) to study semantic priming. The semantic priming effect is the cognitive phenomenon whereby a target word (e.g. nurse) is more easily recognized when it is preceded by a related prime word (e.g. doctor) than by an unrelated prime word (e.g. doctrine). We simulated the semantic priming effect within network models of semantic memory built from both the LWOW and the SWOW free association norms by implementing spreading activation processes within the networks [4]. We found that the final activation levels of prime-target pairs correlated significantly with reaction time data for the same prime-target pairs from the LDT. Specifically, the activation of a target node (e.g. nurse) is higher when a related prime node (e.g. doctor) is activated compared to an unrelated prime node (e.g. doctrine). These results show how the LWOW datasets can be used for investigating cognitive and linguistic phenomena in LLMs and demonstrate the validity of the datasets.
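For intuition, the toy function below sketches one generic form of spreading activation on a free-association network: activation starts at a prime node, and a fraction of it is passed to neighbours over a few iterations. This is a hypothetical illustration, not the implementation used in the paper (see [4]); the edge-list file name, retention parameter, and step count are all assumptions.

# Hypothetical sketch of spreading activation on a free-association network;
# not the authors' implementation (reference [4] describes that).
import networkx as nx

G = nx.read_edgelist("edge_lists/mistral.csv", delimiter=",")  # assumed file name/format

def spread_activation(graph, prime, steps=3, retention=0.5):
    # Each step, every node keeps a fraction of its activation and
    # distributes the rest equally among its neighbours.
    activation = {node: 0.0 for node in graph}
    activation[prime] = 1.0
    for _ in range(steps):
        updated = {node: value * retention for node, value in activation.items()}
        for node, value in activation.items():
            if value == 0.0:
                continue
            share = value * (1 - retention) / max(graph.degree(node), 1)
            for neighbour in graph.neighbors(node):
                updated[neighbour] += share
        activation = updated
    return activation

# A related prime should leave more activation on the target than an unrelated one:
print(spread_activation(G, "doctor").get("nurse"))
print(spread_activation(G, "doctrine").get("nurse"))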
To demonstrate how this dataset can be used to investigate gender biases in LLMs compared to humans, we conducted an analysis using network models of semantic memory built from both the LWOW and the SWOW free association norms. We applied a methodology that simulates semantic priming within the networks to measure the strength of association between pairs of concepts, for example, "woman" and "forceful" vs. "man" and "forceful". We applied this methodology using a set of female-related and male-related primes, and a set of female-related and male-related targets. This analysis revealed that certain adjectives like "forceful" and "strong" are more strongly associated with certain genders, shedding light on the types of stereotypical gender biases that both humans and LLMs possess.
The free associations were generated (either via API or locally, depending on the LLM) by providing each LLM with a set of cue words and the following prompt: "You will be provided with an input word. Write the first 3 words you associate to it separated by a comma." This prompt was repeated 100 times for each cue word, resulting in a dataset of 11,545 unique cue words and 3,463,500 total responses for each LLM.
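As a hedged illustration of this procedure, the loop below shows how such responses might be collected; query_llm is a hypothetical stand-in for whichever API or local inference call each model used, and the cue and output file names are assumptions.

# Hypothetical sketch of the generation loop described above; query_llm is a
# placeholder for the actual API or local inference call used for each LLM.
import csv

PROMPT = ("You will be provided with an input word. Write the first 3 words "
          "you associate to it separated by a comma.")

def query_llm(prompt, cue):
    raise NotImplementedError("replace with an API call or local inference")

with open("cues.txt") as f:
    cues = [line.strip() for line in f]

with open("responses.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for cue in cues:
        for _ in range(100):  # each cue is presented 100 times
            responses = query_llm(PROMPT, cue).split(",")
            writer.writerow([cue] + [r.strip() for r in responses[:3]])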
The LWOW datasets for Mistral, Llama3, and Haiku can be found in the LWOW_datasets folder, which contains two subfolders. The .csv files of the processed cues and responses can be found in the processed_datasets folder while the .csv files of the edge lists of the semantic networks constructed from the datasets can be found in the graphs/edge_lists folder.
Since the LWOW datasets are intended to be used in comparison to humans, we have further processed the original SWOW dataset to create a Human dataset that is aligned with the processing we applied to the LWOW datasets. While this Human dataset is not included in this repository due to the license of the original SWOW dataset, it can be easily reproduced by running the code provided in the reproducibility folder. We highly encourage you to generate this dataset, as it enables a direct comparison between humans and LLMs.
To reproduce the analyses, the required external files first need to be downloaded.
Once the files are saved in the correct folders, follow the instructions in each script, which can be found in the reproducibility folder. The scripts should be run in the following order:
Abramski, K., et al. (2024). The "LLM World of Words" English free association norms generated by large language models (https://arxiv.org/abs/2412.01330)
For speaking requests and enquiries, please contact:
[1] Abramski, K., et al. (2024). The "LLM World of Words": English free association norms generated by large language models. https://arxiv.org/abs/2412.01330