2 datasets found
  1. simple_english_wikipedia

    • huggingface.co
    Updated Feb 9, 2024
    Cite
    Bowen Li (2024). simple_english_wikipedia [Dataset]. https://huggingface.co/datasets/aisuko/simple_english_wikipedia
    Available download formats
    Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    41 scholarly articles cite this dataset
    Dataset updated
    Feb 9, 2024
    Authors
    Bowen Li
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    For research use only. The original data comes from http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz. We use the nq_distilbert-base-v1 model to encode all the data into PyTorch tensors, and normalize the embeddings using sentence_transformers.util.normalize_embeddings.

  How to use

    See the notebook Wikipedia Q&A Retrieval-Semantic Search.

  Installing the package

    !pip install sentence-transformers==2.3.1

  The converting process

    The whole process takes… See the full description on the dataset page: https://huggingface.co/datasets/aisuko/simple_english_wikipedia.
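
    Based on that description, here is a minimal sketch of the conversion and a search over the result, assuming only the standard sentence-transformers API; the passages and query below are placeholders, not the dataset's contents:

    from sentence_transformers import SentenceTransformer, util

    # Encoder named in the dataset description.
    model = SentenceTransformer("nq_distilbert-base-v1")

    # Placeholder passages; the dataset itself encodes simplewiki-2020-11-01.jsonl.gz.
    passages = ["London is the capital of England.", "The Moon orbits the Earth."]
    corpus = model.encode(passages, convert_to_tensor=True)
    corpus = util.normalize_embeddings(corpus)

    # With normalized embeddings, dot product equals cosine similarity.
    query = model.encode(["What is the capital of England?"], convert_to_tensor=True)
    query = util.normalize_embeddings(query)
    hits = util.semantic_search(query, corpus, top_k=1, score_function=util.dot_score)
    print(hits)  # [[{'corpus_id': ..., 'score': ...}]]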

  2. Constraining the origins of binary black holes using normalising flows

    • zenodo.org
    application/gzip
    Updated Mar 4, 2025
    Cite
    Storm Colloms; Christopher Berry; John Veitch; Michael Zevin (2025). Constraining the origins of binary black holes using normalising flows [Dataset]. http://doi.org/10.5281/zenodo.14967688
    Available download formats
    application/gzip
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Storm Colloms; Christopher Berry; John Veitch; Michael Zevin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is the data release for Colloms et al. 2025 (arXiv, dcc). In this work we demonstrate the use of normalising flows for emulation of population synthesis simulations and for continuous inference of the simulation inputs: natal spin, common envelope efficiency, and the relative rates between five formation channels.

    This data release includes hyperposterior samples from our discrete and continuous inference, the normalising flow models used for the analysis, the processed gravitational wave event samples, and additional auxiliary data for the remaining paper results.

    The analysis was produced using the updated AMAZE framework, which can be found in this git repository. The results were produced with this version of the code. The population synthesis models used to train the normalising flows were initially produced for Zevin et al. (2021) and are contained in this data release.

    Data included:

    • `inference_samples.tar.gz` contains the hyperposterior samples from our three inference results: continuous inference with normalising flows, discrete inference with normalising flows, and discrete inference with KDEs. Each inference result contains five hdf5 files, each for a different run instance with a different random seed, which were combined for the published result.
      • `cont_GWTC3/` contains the continuous result files, with the natal spin, common envelope efficiency, and underlying branching fraction samples
      • `discrete_GWTC3/flow/` contains the discrete result files using normalising flows
      • `discrete_GWTC3/KDEs/` contains the discrete result files using KDEs

    Within each hdf5 file, the keys are as follows (a reading sketch is given after this data listing):

      • ‘model_selection/samples’ contains the hyperposterior samples for natal spin, common envelope efficiency, and the five underlying branching fractions inferred (for the discrete results, the natal spin and common envelope efficiency samples are represented by the model index).
      • ‘model_selection/obsdata’ contains the combined GW posterior samples of chirp mass, mass ratio, effective inspiral spin, and redshift for each event used in the inference.
      • ‘model_selection/lnprob’ contains the log probability of each hyperposterior sample.
      • ‘model_selection/raw_samples’ contains the raw MCMC samples without the flooring to a particular model index for the discrete result samples. These are identical to ‘model_selection/samples’ for the continuous inference.

    This also includes

      • `cont_detectable_GWTC3/` contains the hyperposterior samples for natal spin, common envelope efficiency, and the detectable branching fractions with continuous inference.
    • `flow_models.tar.gz` contains the normalising flow models (as pytorch version 1.12.1 Model objects), the training and validation losses for each training epoch, the mapping constants used for the initial transformation of the training data, and a config file with the architecture for each normalising flow. We include these data products for each formation channel: common envelope (CE), chemically homogeneous evolution (CHE), globular clusters (GC), nuclear star clusters (NSC), and stable mass transfer (SMT).
      • `{channel}.pt` is the trained normalising flow model used in the analysis, as a pytorch model. These may be loaded as Nflow objects within the AMAZE framework.
      • `{channel}_loss_history.csv` contains the training epoch number, training loss, validation loss, and learning rate at each epoch of training for each normalising flow.
      • `{channel}_mappings.npy` contains the constants used in the logistic mapping for the chirp mass, mass ratio, and redshift samples for each channel. See Colloms et al. Appendix A for details of how these are used.
        • We also include `flows_mapping.json` as a human-readable version of the mappings.
      • `flowconfig.json` contains the network architecture (number of transforms, number of neurons per layer, and number of spline bins) used for each normalising flow.
    • `gw_events.tar.gz` contains the posterior samples from the GWTC-2.1 and GWTC-3 data releases. Each event contains samples of chirp mass, mass ratio, effective inspiral spin, and redshift, along with a prior value calculated for each sample `p_theta_jcb`. These samples were created with the notebook process_GWTC_data.ipynb.
    • `plot_data.tar.gz` contains auxiliary data used for plotting the samples drawn from the normalising flow and KDE models, and the log likelihood ratio between the normalising flows and the KDEs.
      • `dataspace_samps.hdf5` contains samples from the normalising flow used to make Figure 5, and samples from parametric results drawn from the default models used in Abbott et al. (2022). The normalising flow samples are stored in `flow_samps/{channel}`, where the number of samples for each channel is representative of the inferred branching fractions from continuous inference.
      • `emulation_samps.hdf5` contains normalising flow and KDE samples from the CE channel model representations at a natal spin value of 0 and a CE efficiency value of 2, used to make Figure 1.
      • `test_flow_samps.hdf5` contains samples from the CE flow model trained with this test population removed, at a natal spin value of 0.1 and a CE efficiency value of 1, used to make Figure 2.
      • `KLs.json` contains the Kullback-Leibler divergence values between the normalising flows and the KDEs for each channel, at each of the population synthesis model points.
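
    As a minimal sketch (not part of the data release), the files above might be opened as follows. The hdf5 filename and the choice of the CE channel are illustrative assumptions; the trained `{channel}.pt` flows are omitted here because the release says they are meant to be loaded as Nflow objects within the AMAZE framework:

    import json

    import h5py
    import numpy as np
    import pandas as pd

    # One inference run (filename assumed; each result has five files, one per seed).
    with h5py.File("cont_GWTC3/run_0.hdf5", "r") as f:
        samples = f["model_selection/samples"][()]          # hyperposterior samples
        obsdata = f["model_selection/obsdata"][()]          # combined GW event posteriors
        lnprob = f["model_selection/lnprob"][()]            # log probability per sample
        raw_samples = f["model_selection/raw_samples"][()]  # raw MCMC samples

    # Auxiliary flow-model data for the common envelope (CE) channel.
    loss_history = pd.read_csv("CE_loss_history.csv")  # epoch, losses, learning rate
    mappings = np.load("CE_mappings.npy")              # logistic-mapping constants
    with open("flowconfig.json") as fh:
        config = json.load(fh)                         # architecture per flow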
