Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This submission contains the data reduction and processing files, dynamical refinement files, refinement files for theoretical structure factors, and CIF files for five inorganic compounds: quartz, natrolite, borane, caesium lead bromide, and lutetium aluminium garnet. The data were collected by 3D electron diffraction (3D ED) to study the ionisation of atoms by kappa refinement against 3D ED data.
The data set for quartz was collected using the precession-assisted 3D ED method; the data sets for borane, caesium lead bromide, and lutetium aluminium garnet were collected using the continuous-rotation 3D ED method. For natrolite, two data sets were collected from the same crystal, one with the continuous-rotation method and one with the precession-assisted method. Data reduction and processing were done with the PETS2 software (1), and the dynamical refinements were performed with the JANA2020 software (2). The refinements were performed in two primary stages: independent atom model (IAM) refinements, which do not take charge transfer between the atoms into account, and kappa refinements, which do.
The submission also contains JANA2020 files of refinements against theoretical structure factors. These structure factors were obtained from periodic DFT calculations on the structure model obtained after the IAM refinement of each experimental data set.
The folders are organised by compound. Each folder contains the relevant data reduction and processing files (PETS2 files), dynamical refinement files (JANA2020 files for IAM and kappa refinements), refinement files for theoretical structure factors (JANA2020 files for IAM and kappa refinements), and final CIF files (for IAM and kappa refinements).
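For orientation, the difference between the two refinement stages can be written in the textbook form of the kappa formalism. This is a sketch of the standard expression, not a statement of what JANA2020 implements internally; the symbols P_v (valence population) and κ (expansion/contraction parameter) are the conventional names and do not appear in the description above.

```latex
\rho_{\mathrm{atom}}(r) = \rho_{\mathrm{core}}(r) + P_v \, \kappa^{3} \, \rho_{\mathrm{valence}}(\kappa r)
```

In this picture an IAM refinement corresponds to fixing P_v at the neutral-atom valence electron count and κ = 1, whereas a kappa refinement lets P_v (and usually κ) vary, so that charge transfer between atoms shows up as a change in the valence populations.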
References
1. L. Palatinus, P. Brázda, M. Jelínek, J. Hrdá, G. Steciuk, M. Klementová, Specifics of the data processing of precession electron diffraction tomography data and their implementation in the program PETS2.0. Acta Cryst. B 75, 512–522 (2019).
2. V. Petříček, L. Palatinus, J. Plášil, M. Dušek, Jana2020 – a new version of the crystallographic computing system Jana. Zeitschrift für Kristallographie - Crystalline Materials 238, 271–282 (2023).
The following table summarises the crystallographic information and data collection parameters for the data sets.
Crystal data
Sample | Quartz | Natrolite | Natrolite | Borane | Caesium lead bromide | Lutetium aluminium garnet
Chemical formula | SiO2 | Na2Al2Si3O12H4 | Na2Al2Si3O12H4 | B18H22 | CsPbBr3 | Lu3Al5O12
Mr | 60.1 | 380.2 | 380.2 | 108.4 | 579.8 | 851.8
Crystal system, space group | Trigonal, P3₂21 | Orthorhombic, Fdd2 | Orthorhombic, Fdd2 | Orthorhombic, Pccn | Orthorhombic, Pbnm | Cubic, Ia3̅d
a, b, c (Å) | 4.9012(24), 4.9012, 5.4068(26) | 18.3885(1), 18.7183(32), 6.6569(11) | 18.4125(9), 18.7073(7), 6.6306(2) | 10.7789(17), 11.9869(16), 10.7338(17) | 8.1189(4), 8.359(4), 11.7593(5) | 11.9105(4), 11.9105(4), 11.9105(4)
α, β, γ (°) | 90, 90, 120 | 90, 90, 90 | 90, 90, 90 | 90, 90, 90 | 90, 90, 90 | 90, 90, 90
V (Å³) | 112.48(8) | 2291.31(54) | 2283.90(16) | 1386.87(36) | 798.1(1) | 1689.6(1)
Z | 3 | 8 | 8 | 4 | 4 | 8
Crystal size (mm) | 0.0004 | 0.0005 | 0.0005 | 0.0015 | 0.0004 | 0.0003

Data collection
Diffractometer | TEM FEI Tecnai G2 20 | TEM FEI Tecnai G2 20 | TEM FEI Tecnai G2 20 | TEM FEI Tecnai G2 20 | TEM FEI Tecnai G2 20 | TEM FEI Tecnai G2 20
3D ED method | Precession | Precession | Continuous rotation | Continuous rotation | Continuous rotation | Continuous rotation
Detector | Medipix 3 ASI Cheetah | Medipix 3 ASI Cheetah | Medipix 3 ASI Cheetah | Medipix 3 ASI Cheetah | Medipix 3 ASI Cheetah | Medipix 3 ASI Cheetah
Radiation source | LaB6 | LaB6 | LaB6 | LaB6 | LaB6 | LaB6
Radiation type | Electron, λ = 0.0251 Å | Electron, λ = 0.0251 Å | Electron, λ = 0.0251 Å | Electron, λ = 0.0251 Å | Electron, λ = 0.0251 Å | Electron, λ = 0.0251 Å
Temperature (K) | 293 | 95 | 95 | 100 | 153 | 153
(sin θ/λ)max (Å⁻¹) | 1.25 | 1.1 | 1.00 | 0.85 | 1.00 | 1.4
No. of measured, independent and observed [I > 3σ(I)] reflections | 3631, 1076, 1004 | 15767, 6018, 4419 | 12368, 4546, 4422 | 30304, 13809, 4779 | 16736, 422, 363 | 23256, 1562, 1363

Software used
Data collection | RATS software | RATS software | RATS software | RATS software | RATS software | RATS software
Data reduction and processing | PETS2 | PETS2 | PETS2 | PETS2 | PETS2 | PETS2
Refinement | JANA2020 | JANA2020 | JANA2020 | JANA2020 | JANA2020 | JANA2020
DFT calculation | WIEN2k and Crystal23 | WIEN2k | Crystal23 | Crystal23 | WIEN2k | WIEN2k
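Several tabulated quantities can be cross-checked from the others. The sketch below recomputes the unit-cell volume from the cell parameters, converts (sin θ/λ)max into a resolution limit via Bragg's law, and evaluates the relativistic electron wavelength. The 200 kV accelerating voltage is an assumption (it is not stated in the table, but it is consistent with λ = 0.0251 Å), and the function names are illustrative only.

```python
import math


def cell_volume(a, b, c, alpha=90.0, beta=90.0, gamma=90.0):
    """Unit-cell volume (Å^3) from cell lengths (Å) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


def d_min(sin_theta_over_lambda_max):
    """Resolution limit d_min (Å) from (sin θ/λ)max (Å^-1): d = 1 / (2 sin θ/λ)."""
    return 1.0 / (2.0 * sin_theta_over_lambda_max)


def electron_wavelength(voltage_v):
    """Relativistic electron wavelength (Å) for an accelerating voltage in volts."""
    h = 6.62607015e-34     # Planck constant, J s
    m0 = 9.1093837015e-31  # electron rest mass, kg
    e = 1.602176634e-19    # elementary charge, C
    c = 299792458.0        # speed of light, m/s
    ev = e * voltage_v     # kinetic energy in joules
    wavelength_m = h / math.sqrt(2 * m0 * ev * (1 + ev / (2 * m0 * c**2)))
    return wavelength_m * 1e10


# Cell parameters taken from the table; the voltage is an assumed value.
quartz_volume = cell_volume(4.9012, 4.9012, 5.4068, 90, 90, 120)
garnet_volume = cell_volume(11.9105, 11.9105, 11.9105)
quartz_d_min = d_min(1.25)
wavelength_200kv = electron_wavelength(200e3)

print(f"quartz V = {quartz_volume:.2f} Å^3, garnet V = {garnet_volume:.1f} Å^3")
print(f"quartz d_min = {quartz_d_min:.2f} Å, λ(200 kV) = {wavelength_200kv:.4f} Å")
```

With these inputs the recomputed volumes agree with the tabulated V values within the quoted uncertainties.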
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Pile -- NIHExPorter (refined by Data-Juicer)
A refined version of the NIHExPorter dataset in The Pile, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset is available here (about 2.0G).
Dataset Information
Number of samples: 858,492 (~91.36% of the original dataset kept)
Refining… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/the-pile-nih-refined-by-data-juicer.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(a) Rsym = ∑|Ii − ⟨I⟩|/∑|Ii|, where Ii is the intensity of the ith measurement and ⟨I⟩ is the mean intensity for that reflection.
(b) Reflections with I > σ were used in the refinement.
(c) Rwork = ∑|Fobs − Fcalc|/∑|Fobs|, where Fcalc and Fobs are the calculated and observed structure factor amplitudes, respectively.
(d) Rfree = as for Rwork, but for 5% of the total reflections chosen at random and omitted from refinement.
(e) Individual B-factor refinements were calculated.
* The high-resolution bin details are in parentheses.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(a) The values in parentheses refer to statistics in the highest bin.
(b) Rmerge = ∑hkl∑i|Ii(hkl) − ⟨I(hkl)⟩|/∑hkl∑i Ii(hkl), where Ii(hkl) is the intensity of an observation and ⟨I(hkl)⟩ is the mean value for its unique reflection; summations are over all reflections.
(c) R-factor = ∑h|Fo(h) − Fc(h)|/∑h Fo(h), where Fo and Fc are the observed and calculated structure-factor amplitudes, respectively.
(d) R-free was calculated with 5% of the data excluded from the refinement.
(e) Root-mean-square deviation from ideal values.
(f) Categories were defined by MolProbity.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- C4 (refined by Data-Juicer)
A refined version of the C4 dataset in RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset is available here (about 832GB).
Dataset Information
Number of samples: 344,491,171 (~94.42% of the original dataset kept)
Refining Recipe
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(a) Data in parentheses pertain to the highest resolution shell (2.0–1.9 Å).
(b) Rint = ∑|I − ⟨I⟩|/∑I, where I is the observed intensity of a measured reflection and ⟨I⟩ is the mean intensity for all observations of symmetry-related reflections.
(c) R factor = Σ|Foh − Fch|/Σ Foh, where Foh and Fch are the observed and calculated structure factor amplitudes for the 32,658 reflections h that were used in structure refinement.
(d) R free = Σ|Foh − Fch|/Σ Foh, where Foh and Fch are the observed and calculated structure factor amplitudes pertaining to the 2,070 reflections h that were not used in structure refinement.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(a) Values in parentheses apply to the high-resolution shell.
(b) […]; Nh, multiplicity for each reflection; Ii, the intensity of the ith observation of reflection h; […], the mean of the intensity of all observations of reflection h, with […]; […] is taken over all reflections; […] is taken over all observations of each reflection.
(c) […]; […]; Rcryst and Rfree were calculated using the working and test hkl reflection sets, respectively.
(d) Total refined protein residues equal 3172, from which 28 terminal amino acids (the N- and C-termini on the 9 chains; plus residues: TS#399, TS#409 (in chains A, B & C), Fab#27, Fab#29 (in chain H), Fab#137, Fab#139 (in chain I), all flanking unmodeled gaps) were not included in the Ramachandran analysis (as implemented in Coot v 0.6.2-pre-1).
This worksheet displays the results of mineral abundance estimates based on Rietveld refinement of X-ray diffraction (XRD) analyses of mill tailings and other ore processing materials from worldwide localities. Data are also provided to show the variation in mineral abundance estimates for subsplits of individual samples. Samples were analyzed on a PANalytical X'Pert Pro diffractometer with Cu radiation, and the results were interpreted using HighScore Plus v.4.7.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Pile -- USPTO (refined by Data-Juicer)
A refined version of the USPTO dataset in The Pile, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset is available here (about 18G).
Dataset Information
Number of samples: 4,516,283 (~46.77% of the original dataset kept)
Refining Recipe
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- ArXiv (refined by Data-Juicer)
A refined version of the ArXiv dataset in RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset is available here (about 85GB).
Dataset Information
Number of samples: 1,655,259 (~95.99% of the original dataset kept)
Refining Recipe… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer.
These environmental DNA (eDNA) data and the corresponding water quality data were collected and analyzed by the Fish and Wildlife Service in 2017. The samples were collected from 4 sites in pools 17 and 18 of the Upper Mississippi River on 3 sampling trips. The data were used to study occupancy modeling of eDNA data and to determine the optimal sampling effort required for reliable detection of invasive Bighead Carp and Silver Carp in streams with attributes similar to those of the Mississippi River.
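To illustrate what "sampling effort required for reliable detection" means in the simplest case, the sketch below computes how many independent water samples are needed to reach a target cumulative detection probability at an occupied site. It assumes a constant per-sample detection probability and independent samples, which is far simpler than a full occupancy model; the function names and the value p = 0.3 are hypothetical and not taken from the data set.

```python
import math


def cumulative_detection(p, n):
    """Probability of at least one detection in n independent samples,
    each with per-sample detection probability p (site assumed occupied)."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p, target=0.95):
    """Smallest n such that cumulative_detection(p, n) >= target."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# Hypothetical per-sample detection probability, for illustration only.
n_required = samples_needed(0.3, target=0.95)
print(n_required, round(cumulative_detection(0.3, n_required), 4))
```

In an occupancy analysis the per-sample detection probability would itself be estimated from repeated samples, typically with covariates such as the water quality measurements.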
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the supplementary data to our contribution "Particle Detection by means of Neural Networks and Synthetic Training Data Refinement in Defocusing Particle Tracking Velocimetry" to the 2022 Measurement Science and Technology special issue on the topic "Machine Learning and Data Assimilation techniques for fluid flow measurements". The data include annotated images used to train neural networks for particle detection on DPTV recordings, as well as unannotated particle images used to train the image-to-image translation networks that generate refined synthetic training data, as presented in the manuscript. The neural networks for particle detection trained on these data are also contained in this repository. An explanation of how to use the data and the trained neural networks, including an example script, can be found on GitHub (https://github.com/MaxDreisbach/DPTV_ML_Particle_detection).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Numbers in parentheses refer to values in the highest resolution shell.
(a) Rsym = ΣjΣh|Ih,j − ⟨Ih⟩|/ΣjΣh⟨Ih⟩, where Ih,j is the jth observation of reflection h and ⟨Ih⟩ is the mean intensity of that reflection.
(b) Rcryst = Σ||Fobs| − |Fcalc||/Σ|Fobs|, where Fobs and Fcalc are the observed and calculated structure factor amplitudes, respectively.
(c) Rfree is equivalent to Rcryst for a 4% subset of reflections not used in the refinement.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(a) Values in parentheses are for the highest resolution shell.
(b) Values in parentheses are target values.
N.B. One crystal was used for the full data set.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(a) Values in parentheses are for the highest resolution shell.
(b) Rmerge = ∑h∑j|Ihj − ⟨Ih⟩|/∑h∑j Ihj, where Ihj is the intensity of observation j of reflection h.
(c) Rwork = ∑h||Fo| − |Fc||/∑h|Fo| for all reflections, where Fo and Fc are the observed and calculated structure factors, respectively. Rfree is calculated analogously for the test reflections, randomly selected and excluded from the refinement.
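The merging and refinement R-factors defined in these footnotes are straightforward to compute. The following is a minimal sketch, assuming that symmetry-equivalent observations have already been mapped to a common hkl key and that the mean intensity is an unweighted average; the function names and the toy numbers are illustrative and do not come from any of the data sets listed here.

```python
from collections import defaultdict


def r_merge(observations):
    """Rmerge = sum_h sum_j |I_hj - <I_h>| / sum_h sum_j I_hj.

    `observations` is an iterable of (hkl, intensity) pairs in which
    symmetry-equivalent reflections share the same hkl key.
    """
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)

    numerator = 0.0
    denominator = 0.0
    for intensities in groups.values():
        mean = sum(intensities) / len(intensities)  # unweighted mean (assumption)
        numerator += sum(abs(i - mean) for i in intensities)
        denominator += sum(intensities)
    return numerator / denominator


def r_factor(f_obs, f_calc):
    """R = sum ||Fo| - |Fc|| / sum |Fo| over the supplied reflections.

    Passing the working set gives Rwork; passing the held-out test set gives Rfree.
    """
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


# Toy data for illustration only.
toy_observations = [
    ((1, 0, 0), 10.0), ((1, 0, 0), 12.0),
    ((0, 1, 1), 5.0), ((0, 1, 1), 5.0),
]
toy_r_merge = r_merge(toy_observations)
toy_r_work = r_factor([10.0, 20.0], [9.0, 22.0])
print(toy_r_merge, toy_r_work)
```

The same `r_factor` function serves for both Rwork and Rfree; the only difference is which subset of reflections is passed in.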
Data on petroleum inputs, production, yield, and capacity. Weekly, monthly and annual data available. Users of the EIA API are required to obtain an API Key via this registration form: http://www.eia.gov/beta/api/register.cfm
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Pile -- FreeLaw (refined by Data-Juicer)
A refined version of the FreeLaw dataset in The Pile, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset is available here (about 45GB).
Dataset Information
Number of samples: 2,942,612 (~82.61% of the original dataset kept)
Refining Recipe… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/the-pile-freelaw-refined-by-data-juicer.
(a) Numbers in parentheses indicate the outer-resolution shell.
(b) Rmerge = ∑hkl∑i|Ii(hkl) − ⟨I(hkl)⟩|/∑hkl∑i Ii(hkl), where Ii(hkl) is the ith observation of reflection hkl and ⟨I(hkl)⟩ is the weighted average intensity for all observations i of reflection hkl.
(c) Rcryst = ∑hkl|Fobs − Fcalc|/∑hkl|Fobs|.
(d) Rfree is the same as Rcryst except for 5% of the data excluded from the refinement.
(e) Sum of the TLS and residual B-factor contributions.
ADAPTIVE MODEL REFINEMENT FOR THE IONOSPHERE AND THERMOSPHERE
ANTHONY M. D'AMATO, AARON J. RIDLEY, AND DENNIS S. BERNSTEIN
Abstract. Mathematical models of physical phenomena are of critical importance in virtually all applications of science and technology. This paper addresses the problem of how to use data to improve the fidelity of a given model. We approach this problem using retrospective cost optimization, a novel technique that uses data to recursively update an unknown subsystem interconnected to a known system. This research is relevant to a wide range of applications that depend on large-scale models based on first-principles physics, such as the Global Ionosphere-Thermosphere Model (GITM). Using GITM as the truth model, we demonstrate that measurements can be used to identify unknown physics. Specifically, we estimate static thermal conductivity parameters, and we identify a dynamic cooling process.
(a) The data for the highest resolution shell are shown in parentheses.
(b) Rfree is calculated using 10% of the total number of reflections.