Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the simulation data of the combinatorial metamaterial as used for the paper 'Machine Learning of Combinatorial Rules in Mechanical Metamaterials', as published in XXX.
In this paper, the data is used to classify each \(k \times k\) unit cell design into one of two classes (C or I) based on the scaling (linear or constant) of the number of zero modes \(M_k(n)\) for metamaterials consisting of an \(n\times n\) tiling of the corresponding unit cell. Additionally, a random walk through the design space starting from class C unit cells was performed to characterize the boundary between class C and I in design space. A more detailed description of the contents of the dataset follows below.
Modescaling_raw_data.zip
This file contains uniformly sampled unit cell designs and \(M_k(n)\) for \(1\leq n\leq 4\), which were used to classify the unit cell designs for the data set. There is a small subset of designs for \(k=\{3, 4, 5\}\) that do not neatly fall into the class C and I classification, and instead require additional simulation for \(4 \leq n \leq 6\) before either saturating to a constant number of zero modes (class I) or increasing linearly (class C). This file contains the simulation data for unit cells of size \(3 \leq k \leq 8\). The data is organized as follows.
Simulation data for \(3 \leq k \leq 5\) and \(1 \leq n \leq 4\) is stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. These files are named "data_new_rrQR_i_n_M_kxk_fixn4.npy", and contain a [Nsim, 1+k*k+4] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Note: the unit cell design uses the numbers \(\{0, 1, 2, 3\}\) to refer to each building block orientation. The building block orientations can be characterized through the orientation of the missing diagonal bar (see Fig. 2 in the paper), which can be Left Up (LU), Left Down (LD), Right Up (RU), or Right Down (RD). The numbers correspond to the building block orientation \(\{0, 1, 2, 3\} = \{\mathrm{LU, RU, RD, LD}\}\).
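For example, a minimal Python sketch of loading and unpacking one of these arrays; the column layout (label number, k*k design entries, then the four \(M_k(n)\) values) is an assumption based on the shapes and descriptions above, and the file index i is hypothetically set to 0:

import numpy as np

k = 3
data = np.load("data_new_rrQR_0_n_M_3x3_fixn4.npy")   # shape [Nsim, 1 + k*k + 4]
labels = data[:, 0]                                   # label number of each unit cell
designs = data[:, 1:1 + k*k].reshape(-1, k, k)        # designs, entries in {0, 1, 2, 3}
modes = data[:, 1 + k*k:]                             # M_k(n) for n = 1..4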
Simulation data for \(3 \leq k \leq 5\) and \(1 \leq n \leq 6\) for unit cells that cannot be classified as class C or I for \(1 \leq n \leq 4\) is stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. These files are named "data_new_rrQR_i_n_M_kxk_fixn4_classX_extend.npy", and contain a [Nsim, 1+k*k+6] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Simulation data for \(6 \leq k \leq 8\) unit cells are stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. Note that the number of modes is now calculated for \(n_x \times n_y\) metamaterials, where we calculate \((n_x, n_y) = \{(1,1), (2, 2), (3, 2), (4,2), (2, 3), (2, 4)\}\) rather than \(n_x=n_y=n\) to save computation time. These files are named "data_new_rrQR_i_n_Mx_My_n4_kxk(_extended).npy", and contain a [Nsim, 1+k*k+8] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Modescaling_classification_results.zip
This file contains the classification, slope, and offset of the scaling of the number of zero modes \(M_k(n)\) for the unit cells in Modescaling_raw_data.zip. The data is organized as follows.
The results for \(3 \leq k \leq 5\) based on the \(1 \leq n \leq 4\) mode scaling data is stored in "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4.txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class, where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n \leq 4\))
col 2: slope from \(n \geq 2\) onward (undefined for class X)
col 3: the offset is defined as \(M_k(2) - 2 \cdot \mathrm{slope}\)
col 4: \(M_k(1)\)
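A minimal sketch of loading this results file in Python, assuming the file index i is 0 and k = 3 (hypothetical values for the placeholders in the file name):

import numpy as np

res = np.loadtxt("results_analysis_new_rrQR_0_Scen_slope_offset_M1k_3x3_fixn4.txt",
                 delimiter=",")
label, cls, slope, offset, M1 = res.T   # the five columns listed above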
The results for \(3 \leq k \leq 5\) based on the extended \(1 \leq n \leq 6\) mode scaling data is stored in "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4_classC_extend.txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class, where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n \leq 6\))
col 2: slope from \(n \geq 2\) onward (undefined for class X)
col 3: the offset is defined as \(M_k(2) - 2 \cdot \mathrm{slope}\)
col 4: \(M_k(1)\)
The results for \(6 \leq k \leq 8\) based on the \(1 \leq n \leq 4\) mode scaling data is stored in "results_analysis_new_rrQR_i_Scenx_Sceny_slopex_slopey_offsetx_offsety_M1k_kxk(_extended).txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare to the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class_x based on \(M_k(n_x, 2)\), where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n_x \leq 4\))
col 2: the class_y based on \(M_k(2, n_y)\), where 0 corresponds to class I, 1 to class C, and 2 to class X (neither class I nor C for \(1 \leq n_y \leq 4\))
col 3: slope_x from \(n_x \geq 2\) onward (undefined for class X)
col 4: slope_y from \(n_y \geq 2\) onward (undefined for class X)
col 5: the offset_x is defined as \(M_k(2, 2) - 2 \cdot \mathrm{slope_x}\)
col 6: the offset_y is defined as \(M_k(2, 2) - 2 \cdot \mathrm{slope_y}\)
col 7: \(M_k(1, 1)\)
Random Walks Data
This file contains the random walks for \(3 \leq k \leq 8\) unit cells. Each random walk starts from a class C unit cell design; at each step \(s\), a randomly picked building block of the unit cell is changed to a random new orientation, for a total of \(s=k^2\) steps. The data is organized as follows.
The configurations for each step are stored in the files named "configlist_test_i.npy", where i is a number and corresponds to a different starting unit cell. The stored array has the shape [k*k+1, 2*k+2, 2*k+2]. The first dimension denotes the step \(s\), where \(s=0\) is the initial configuration. The second and third dimensions denote the unit cell configuration in the pixel representation (see paper), padded with a single-pixel-wide layer using periodic boundary conditions.
The class for each configuration is stored in "lmlist_test_i.npy", where i corresponds to the same number as for the configurations in the "configlist_test_i.npy" file. The stored array has
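A minimal sketch of loading one random walk in Python (k = 3 and i = 0 are hypothetical choices):

import numpy as np

k = 3
configs = np.load("configlist_test_0.npy")    # shape [k*k + 1, 2*k + 2, 2*k + 2]
classes = np.load("lmlist_test_0.npy")        # class of each configuration
initial = configs[0]                          # padded pixel representation at step s = 0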
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university was entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by the econometric models and discussed in the paper.
The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section.
The unit of observation, or a single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k) and the question number in the SET questionnaire (n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}). This means that for each pair (j, k) we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows.
The full list of variables, or columns, in the data set included in the analysis is presented in the attached file section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, the variable takes the same values for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.
Two attachments:
- Word file with the variables description
- RData file with the data set (for the R language)
Appendix 1. The SET questionnaire used for this paper.
Evaluation survey of the teaching staff of [university name]
Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 = I strongly agree; 4 = I agree; 3 = Neutral; 2 = I don't agree; 1 = I strongly don't agree.
1. I learnt a lot during the course.
2. I think that the knowledge acquired during the course is very useful.
3. The professor used activities to make the class more engaging.
4. If it was possible, I would enroll for the course conducted by this lecturer again.
5. The classes started on time.
6. The lecturer always used time efficiently.
7. The lecturer delivered the class content in an understandable and efficient way.
8. The lecturer was available when we had doubts.
9. The lecturer treated all students equally regardless of their race, background and ethnicity.
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives:
1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
6. Provide the necessary income data to serve in calculating poverty indices and identifying the characteristics of the poor, as well as drawing poverty maps
7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty
National
Sample survey data [ssd]
The Household Expenditure and Income Survey sample for 2010 was designed to serve the basic objectives of the survey by providing a relatively large sample in each sub-district to enable drawing a poverty map of Jordan. The General Census of Population and Housing in 2004 provided a detailed frame of housing and households for the different administrative levels in the country. Jordan is administratively divided into 12 governorates; each governorate is composed of a number of districts, and each district (Liwa) includes one or more sub-districts (Qada). In each sub-district there are a number of communities (cities and villages), and each community was divided into a number of blocks. In each block, the number of houses ranged between 60 and 100. Nomads and persons living in collective dwellings such as hotels, hospitals and prisons were excluded from the survey frame.
A two-stage stratified cluster sampling technique was used. In the first stage, a cluster sample proportional to size was selected, where the number of households in each cluster was considered the weight of the cluster. At the second stage, a sample of 8 households was selected from each cluster, in addition to another 4 households selected as a backup for the basic sample, using a systematic sampling technique. Those 4 households were sampled to be used during the first visit to the block in case the visit to the original selected household was not possible for any reason. For the purposes of this survey, each sub-district was considered a separate stratum to ensure the possibility of producing results at the sub-district level. In this respect, the survey adopted the frame provided by the General Census of Population and Housing in dividing the sample strata. To estimate the sample size, the coefficient of variation and the design effect of the expenditure variable from the Household Expenditure and Income Survey for the year 2008 were calculated for each sub-district. These results were used to estimate the sample size at the sub-district level so that the coefficient of variation of the expenditure variable in each sub-district is less than 10%, with a minimum number of clusters in each sub-district (6 clusters). This is to ensure adequate representation of clusters in different administrative areas to enable drawing an indicative poverty map.
It should be noted that, in addition to the standard non-response rate assumed, higher rates were expected in areas of major cities where poor households are concentrated. These were taken into consideration during the sampling design phase, and a higher number of households was selected from those areas, aiming at good coverage of all regions where poverty is concentrated.
Face-to-face [f2f]
Raw Data:
- Organizing forms/questionnaires: A compatible archive system was used to classify the forms according to different rounds throughout the year. A registry was prepared to indicate different stages of the process of data checking, coding and entry until forms were returned to the archive system.
- Data office checking: This phase was carried out concurrently with the data collection phase in the field, where questionnaires completed in the field were immediately sent to the data office checking phase.
- Data coding: A team was trained to work on the data coding phase, which in this survey is limited to education specialization, profession and economic activity. In this respect, international classifications were used, while for the rest of the questions coding was predefined during the design phase.
- Data entry/validation: A team consisting of system analysts, programmers and data entry personnel worked on the data at this stage. System analysts and programmers started by identifying the survey frame and questionnaire fields to help build computerized data entry forms. A set of validation rules was added to the entry form to ensure the accuracy of data entered. A team was then trained to complete the data entry process. Forms prepared for data entry were provided by the archive department to ensure forms were correctly extracted and put back in the archive system. A data validation process was run on the data to ensure the data entered was free of errors.
- Results tabulation and dissemination: After the completion of all data processing operations, ORACLE was used to tabulate the survey final results. Those results were further checked using similar outputs from SPSS to ensure that the tabulations produced were correct. A check was also run on each table to guarantee consistency of the figures presented, together with required editing of table titles and report formatting.
Harmonized Data:
- The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets.
- The harmonization process started with cleaning all raw data files received from the Statistical Office.
- Cleaned data files were then merged to produce one data file on the individual level containing all variables subject to harmonization.
- A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables.
- A post-harmonization cleaning process was run on the data.
- Harmonized data was saved on the household as well as the individual level, in SPSS, and converted to STATA format.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wind Spacecraft:
The Wind spacecraft (https://wind.nasa.gov) was launched on November 1, 1994 and currently orbits the first Lagrange point between the Earth and sun. A comprehensive review can be found in Wilson et al. [2021]. It holds a suite of instruments from gamma ray detectors to quasi-static magnetic field instruments, Bo. The instruments used for this data product are the fluxgate magnetometer (MFI) [Lepping et al., 1995] and the radio receivers (WAVES) [Bougeret et al., 1995]. The MFI measures 3-vector Bo at ~11 samples per second (sps); WAVES observes electromagnetic radiation from ~4 kHz to >12 MHz which provides an observation of the upper hybrid line (also called the plasma line) used to define the total electron density and also takes time series snapshot/waveform captures of electric and magnetic field fluctuations, called TDS bursts herein.
WAVES Instrument:
The WAVES experiment [Bougeret et al., 1995] on the Wind spacecraft is composed of three orthogonal electric field antennas and three orthogonal search coil magnetometers. The electric fields are measured through five different receivers: Low Frequency FFT receiver called FFT (0.3 Hz to 11 kHz), Thermal Noise Receiver called TNR (4-256 kHz), Radio receiver band 1 called RAD1 (20-1040 kHz), Radio receiver band 2 called RAD2 (1.075-13.825 MHz), and the Time Domain Sampler (TDS). The electric field antennas are dipole antennas, with two orthogonal antennas in the spin plane and one spin-axis stacer antenna.
The TDS receiver allows one to examine the electromagnetic waves observed by Wind as time series waveform captures. There are two modes of operation, TDS Fast (TDSF) and TDS Slow (TDSS). TDSF returns 2048 data points for two channels of the electric field, typically Ex and Ey (i.e., spin plane components), with little to no gain below ~120 Hz (the data herein have been high-pass filtered above ~150 Hz for this reason). TDSS returns four channels with three electric (magnetic) field components and one magnetic (electric) component. The search coils show a gain roll-off at ~3.3 Hz [e.g., see Wilson et al., 2010; Wilson et al., 2012; Wilson et al., 2013 and references therein for more details].
The original calibration of the electric field antenna found that the effective antenna lengths are roughly 41.1 m, 3.79 m, and 2.17 m for the X, Y, and Z antenna, respectively. The +Ex antenna was broken twice during the mission as of June 26, 2020. The first break occurred on August 3, 2000 around ~21:00 UTC and the second on September 24, 2002 around ~23:00 UTC. These breaks reduced the effective antenna length of Ex from ~41 m to 27 m after the first break and ~25 m after the second break [e.g., see Malaspina et al., 2014; Malaspina & Wilson, 2016].
TDS Bursts:
TDS bursts are waveform captures/snapshots of electric and magnetic field data. The data is triggered by the largest amplitude waves which exceed a specific threshold and are then stored in a memory buffer. The bursts are ranked according to a quality filter which mostly depends upon amplitude. Due to the age of the spacecraft and ubiquity of large amplitude electromagnetic and electrostatic waves, the memory buffer often fills up before dumping onto the magnetic tape drive. If the memory buffer is full, then the bottom ranked TDS burst is erased every time a new TDS burst is sampled. That is, the newest TDS burst sampled by the instrument is always stored and if it ranks higher than any other in the list, it will be kept. This results in the bottom ranked burst always being erased. Earlier in the mission, there were also so called honesty bursts, which were taken periodically to test whether the triggers were working properly. It was found that the TDSF triggered properly, but not the TDSS. So the TDSS was set to trigger off of the Ex signals.
A TDS burst from the Wind/WAVES instrument is always 2048 time steps for each channel. The sample rate for TDSF bursts ranges from 1875 samples/second (sps) to 120,000 sps. Every TDS burst is marked with a unique set of numbers (unique on any given date) to help distinguish it from others and to ensure any set of channels is appropriately connected. For instance, during one spacecraft downlink interval there may be 95% of the TDS bursts with a complete set of channels (i.e., TDSF has two channels, TDSS has four) while the remaining 5% can be missing channels (just example numbers, not quantitatively accurate). During another downlink interval, those missing channels may be returned if they have not been overwritten. During every downlink, the flight operations team at NASA Goddard Space Flight Center (GSFC) generates level zero binary files from the raw telemetry data. Those files are filled with data received on that date and the file name is labeled with that date. There is no attempt to sort the data chronologically within a file, so any given level zero file can contain data from multiple dates. Thus, it is often necessary to load upwards of five days of level zero files to find as many full channel sets as possible. The remaining unmatched channel sets comprise a much smaller fraction of the total.
All data provided here are from TDSF, so only two channels. Most of the time channel 1 will be associated with the Ex antenna and channel 2 with the Ey antenna. The data are provided in the spinning instrument coordinate basis with associated angles necessary to rotate into a physically meaningful basis (e.g., GSE).
TDS Time Stamps:
Each TDS burst is tagged with a time stamp called a spacecraft event time or SCET. The TDS datation time is sampled after the burst is acquired, which requires a delay buffer. The datation time requires two corrections. The first correction arises from tagging the TDS datation with an associated spacecraft major frame in housekeeping (HK) data. The second correction removes the delay buffer duration. Both inaccuracies are essentially artifacts of ground-derived values in the archives created by the WINDlib software (K. Goetz, Personal Communication, 2008) found at https://github.com/lynnbwilsoniii/Wind_Decom_Code.
The WAVES instrument's HK mode sends relevant low rate science back to ground once every spacecraft major frame. If multiple TDS bursts occur in the same major frame, it is possible for the WINDlib software to assign them the same SCETs. The reason is that this top-level SCET is only accurate to within +300 ms (in 120,000 sps mode) due to the issues described above (at lower sample rates, the error can be slightly larger). The time stamp uncertainty is a positive definite value because it results from digitization rounding errors. One can correct these issues to within +10 ms if using the proper HK data.
*** The data stored here have not corrected the SCETs! ***
The 300 ms uncertainty, due to the HK corrections mentioned above, results from WINDlib trying to recreate the time stamp after it has been telemetered back to ground. If a burst stays in the TDS buffer for extended periods of time (i.e., >2 days), the interpolation done by WINDlib can make mistakes in the 11th significant digit. The positive definite nature of this uncertainty is due to rounding errors associated with the onboard DPU (digital processing unit) clock rollover. The DPU clock is a 24 bit integer clock sampling at ∼50,018.8 Hz. The clock rolls over at ∼5366.691244092221 seconds, i.e., (16*2^24)/50,018.8. The sample rate is a temperature sensitive issue and thus subject to change over time. From a sample of 384 different points on 14 different days, a statistical estimate of the rollover time is 5366.691124061162 ± 0.000478370049 seconds (calculated by Lynn B. Wilson III, 2008). Note that the WAVES instrument team used UR8 times, which are the number of 86,400 second days from 1982-01-01/00:00:00.000 UTC.
The method to correct the SCETs to within +10 ms, were one to do so, is given as follows:
1. Retrieve the DPU clock times, SCETs, UR8 times, and DPU Major Frame Numbers from the WINDlib libraries on the VAX/ALPHA systems for the TDSS(F) data of interest.
2. Retrieve the same quantities from the HK data.
3. Match the HK event number with the same DPU Major Frame Number as the TDSS(F) burst of interest.
4. Find the difference in DPU clock times between the TDSS(F) burst of interest and the HK event with matching major frame number (Note: the TDSS(F) DPU clock time will always be greater than the HK DPU clock time if they share the same DPU Major Frame Number and the DPU clock has not rolled over).
5. Convert the difference to a UR8 time and add this to the HK UR8 time. The new UR8 time is the corrected UR8 time to within +10 ms.
6. Find the difference between the new UR8 time and the UR8 time WINDlib associates with the TDSS(F) burst. Add the difference to the DPU clock time assigned by WINDlib to get the corrected DPU clock time (Note: watch for the DPU clock rollover).
7. Convert the new UR8 time to a SCET using either the IDL WINDlib libraries or the TMLib (STEREO S/WAVES software) libraries of available functions. This new SCET is accurate to within +10 ms.
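A minimal Python sketch of the arithmetic in steps 4 and 5, assuming DPU clock readings in counts and at most one rollover between the two events (all names here are hypothetical, not part of WINDlib):

DPU_CLOCK_HZ = 50018.8          # nominal DPU clock rate [Hz]
ROLLOVER_COUNTS = 16 * 2**24    # counts per rollover (~5366.69 s)

def corrected_ur8(dpu_tds, dpu_hk, ur8_hk):
    # Difference in DPU clock counts between the TDS burst and the matching
    # HK event; the TDS reading should be the larger one unless the clock
    # rolled over in between (step 4).
    dt_counts = dpu_tds - dpu_hk
    if dt_counts < 0:
        dt_counts += ROLLOVER_COUNTS
    dt_seconds = dt_counts / DPU_CLOCK_HZ
    # UR8 times are days of 86,400 s since 1982-01-01/00:00:00.000 UTC (step 5).
    return ur8_hk + dt_seconds / 86400.0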
One can find a UR8 to UTC conversion routine at https://github.com/lynnbwilsoniii/wind_3dp_pros in the ~/LYNN_PRO/Wind_WAVES_routines/ folder.
Examples of good waveforms can be found in the notes PDF at https://wind.nasa.gov/docs/wind_waves.pdf.
Data Set Description
Each Zip file contains 300+ IDL save files, one for each day of the year with available data. This data set is not complete, as the software used to retrieve and calibrate these TDS bursts did not have sufficient error handling for some of the more nuanced bit errors or major frame errors in some of the level zero files. There is currently (as of June 27, 2020) an effort (by Keith Goetz et al.) to generate the entire TDSF and TDSS data set in one repository to be put on SPDF/CDAWeb as CDF files. Once that data set is available, it will supersede this one.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘🍷 Alcohol vs Life Expectancy’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/alcohol-vs-life-expectancye on 13 February 2022.
--- Dataset description provided by original source is as follows ---
There is a surprising relationship between alcohol consumption and life expectancy. In fact, the data suggest that life expectancy and alcohol consumption are positively correlated: 1.2 additional years for every 1 liter of alcohol consumed annually. This is, of course, a spurious finding, because the correlation of this relationship is very low (0.28). This indicates that other factors in those countries where alcohol consumption is comparatively high or low are contributing to differences in life expectancy, and further analysis is warranted.
Plot: https://data.world/api/databeats/dataset/alcohol-vs-life-expectancy/file/raw/LifeExpectancy_v_AlcoholConsumption_Plot.jpg
The original drinks.csv file in the UNCC/DSBA-6100 dataset was missing values for The Bahamas, Denmark, and Macedonia for the wine, spirits, and beer attributes, respectively. Drinks_solution.csv shows these values filled in, for which I used the Mean of the rest of the data column.
Other methods were considered and ruled out:
- Filling missing values with 0 - The missing values sit in the three servings columns (beer_servings, spirit_servings, and wine_servings), and upon reviewing the Bahamas, Denmark, and Macedonia more closely, it is apparent that 0 would be a poor choice for the missing values, as all three countries clearly consume alcohol.
- Filling missing values with MEAN - In the case of the drinks dataset, this is the best approach. The MEAN averages for the columns happen to be very close to the actual data from where we sourced this exercise. In addition, the MEAN will not skew the data, which the prior approaches would do.
The original drinks.csv dataset also had an empty data column, total_litres_of_pure_alcohol. This column needed to be calculated in order to do a simple 2D plot and trendline. It would have been possible to instead run a multi-variable regression on the data and therefore skip this step, but this adds an extra layer of complication to understanding the analysis - not to mention the point of the exercise is to go through an example of calculating new attributes (or "feature engineering") using domain knowledge.
The graphic found at the Wikipedia / Standard Drink page shows the following breakdown:
The conversion factor from fl oz to L is 1 fl oz : 0.0295735 L
Therefore, the following formula was used to compute the empty column:
total_litres_of_pure_alcohol = (beer_servings * 12 fl oz per serving * 0.05 ABV + spirit_servings * 1.5 fl oz * 0.4 ABV + wine_servings * 5 fl oz * 0.12 ABV) * 0.0295735 liters per fl oz
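A minimal pandas sketch of that computation (the input file name is assumed from the text above):

import pandas as pd

FL_OZ_TO_L = 0.0295735

drinks = pd.read_csv("drinks_solution.csv")
drinks["total_litres_of_pure_alcohol"] = (
    drinks["beer_servings"] * 12.0 * 0.05       # 12 fl oz per serving, 5% ABV
    + drinks["spirit_servings"] * 1.5 * 0.40    # 1.5 fl oz per serving, 40% ABV
    + drinks["wine_servings"] * 5.0 * 0.12      # 5 fl oz per serving, 12% ABV
) * FL_OZ_TO_L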
The lifeexpectancy.csv datafile in the https://data.world/uncc-dsba/dsba-6100-fall-2016 dataset contains life expectancy data for each country. The following query will join this data to the cleaned drinks.csv data file:
# Life Expectancy vs Alcohol Consumption
PREFIX drinks: <http://data.world/databeats/alcohol-vs-life-expectancy/drinks_solution.csv/drinks_solution#>
PREFIX life: <http://data.world/uncc-dsba/dsba-6100-fall-2016/lifeexpectancy.csv/lifeexpectancy#>
PREFIX countries: <http://data.world/databeats/alcohol-vs-life-expectancy/countryTable.csv/countryTable#>
SELECT ?country ?alc ?years
WHERE {
SERVICE <https://query.data.world/sparql/databeats/alcohol-vs-life-expectancy> {
?r1 drinks:total_litres_of_pure_alcohol ?alc .
?r1 drinks:country ?country .
?r2 countries:drinksCountry ?country .
?r2 countries:leCountry ?leCountry .
}
SERVICE <https://query.data.world/sparql/uncc-dsba/dsba-6100-fall-2016> {
?r3 life:CountryDisplay ?leCountry .
?r3 life:GhoCode ?gho_code .
?r3 life:Numeric ?years .
?r3 life:YearCode ?reporting_year .
?r3 life:SexDisplay ?sex .
}
FILTER ( ?gho_code = "WHOSIS_000001" && ?reporting_year = 2013 && ?sex = "Both sexes" )
}
ORDER BY ?country
The resulting joined data can then be saved to local disk and imported into any analysis tool like Excel, Numbers, R, etc. to make a simple scatterplot. A trendline and R^2 should be added to determine the relationship between Alcohol Consumption and Life Expectancy (if any).
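For instance, a short Python sketch of fitting the trendline and computing R^2, assuming the joined query results were saved as joined.csv with columns alc and years (hypothetical names):

import numpy as np
import pandas as pd

df = pd.read_csv("joined.csv")
slope, intercept = np.polyfit(df["alc"], df["years"], 1)   # linear trendline
pred = slope * df["alc"] + intercept
ss_res = ((df["years"] - pred) ** 2).sum()
ss_tot = ((df["years"] - df["years"].mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot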
This dataset was created by Jonathan Ortiz and contains around 200 samples along with Beer Servings, Spirit Servings, technical information and other features such as Total Litres Of Pure Alcohol, Wine Servings, and more.
- Analyze Beer Servings in relation to Spirit Servings
- Study the influence of Total Litres Of Pure Alcohol on Wine Servings
If you use this dataset in your research, please credit Jonathan Ortiz
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Authors:
*Corresponding author: mathias.sable-meyer@ucl.ac.uk
The perception and production of regular geometric shapes is a characteristic trait of human cultures since prehistory, whose neural mechanisms are unknown. Behavioral studies suggest that humans are attuned to discrete regularities such as symmetries and parallelism, and rely on their combinations to encode regular geometric shapes in a compressed form. To identify the relevant brain systems and their dynamics, we collected functional MRI and magnetoencephalography data in both adults and six-year-olds during the perception of simple shapes such as hexagons, triangles and quadrilaterals. The results revealed that geometric shapes, relative to other visual categories, induce a hypoactivation of ventral visual areas and an overactivation of the intraparietal and inferior temporal regions also involved in mathematical processing, whose activation is modulated by geometric regularity. While convolutional neural networks captured the early visual activity evoked by geometric shapes, they failed to account for subsequent dorsal parietal and prefrontal signals, which could only be captured by discrete geometric features or by more advanced transformer models of vision. We propose that the perception of abstract geometric regularities engages an additional symbolic mode of visual perception.
We separately share the MEG dataset at https://openneuro.org/datasets/ds006012. Below are some notes about the fMRI dataset of N=20 adult participants (sub-2xx, numbers between 204 and 223) and N=22 children (sub-3xx, numbers between 301 and 325).
Preprocessing was done with fMRIPrep version 20.0.5, called as:
/usr/local/miniconda/bin/fmriprep /data /out participant --participant-label <label> --output-spaces MNI152NLin6Asym:res-2 MNI152NLin2009cAsym:res-2
Defacing was performed with bidsonym, running the pydeface masking and the nobrainer brain registration pipeline.
sub-325 was acquired by a different experimenter and defaced before being shared with the rest of the research team, hence the slightly different defacing mask. That participant was also preprocessed separately, using a more recent fMRIPrep version: 20.2.6.
- sub-313 and sub-316 are missing one run of the localizer each
- sub-316 has no data at all for the geometry task
- sub-308 has no usable data for the intruder task
Since all of these still have some data to contribute to either task, all available files were kept in this dataset. The analysis code reflects these inconsistencies where required with specific exceptions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CSV dataset generated by gathering data from a production wireless mesh community network. Data was gathered every 5 minutes during the interval 2021-04-13 00:00:00 to 2021-04-16 00:00:00. During the interval 2021-04-14 02:00:00 to 2021-04-14 17:50:00 (both included) there is a failure of a gateway in the mesh (nodeid 24).
Live mesh network monitoring link: http://dsg.ac.upc.edu/qmpsu
The dataset consists of a single gzip-compressed CSV file. The first line of the file is a header describing the features. The first column is a GMT timestamp of the sample in a format such as "2021-03-16 00:00:00". The rest of the columns provide the comma-separated values of the features collected from each node in the corresponding capture.
A suffix with the nodeid is appended to each feature name. For instance, the feature holding the number of processes of the node with nodeid 24 is named "processes-24". In total, 63 different nodes showed up during the sampling period, each assigned a different nodeid.
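A minimal pandas sketch of reading the file and selecting one such feature (the file name is hypothetical; pandas infers the gzip compression from the extension):

import pandas as pd

df = pd.read_csv("mesh_dataset.csv.gz", parse_dates=[0])  # first column is the timestamp
processes_24 = df["processes-24"]                         # number of processes on node 24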
Features are of two types: (i) absolute values, for instance the CPU 1-minute load average, and (ii) monotonically increasing counters, for instance the number of transmitted packets. We have converted counter-type kernel variables to rates by dividing the difference between two consecutive samples by the difference of the corresponding timestamps in seconds, as shown in the following pseudo-code:
feature.rate columns are computed from feature as
feature.rate <- (feature[2:n] - feature[1:(n-1)]) / (epoch[2:n] - epoch[1:(n-1)])
feature.rate <- feature.rate[feature.rate >= 0]  # discard samples where the counter is restarted
where n is the number of samples.
Features:
- processes: number of processes
- loadavg.m1: 1-minute load average
- softirq.rate: servicing softirqs
- iowait.rate: waiting for I/O to complete
- intr.rate
- system.rate: processes executing in kernel mode
- idle.rate: twiddling thumbs
- user.rate: normal processes executing in user mode
- irq.rate: servicing interrupts
- ctxt.rate: total number of context switches across all CPUs
- nice.rate: niced processes executing in user mode
- nr_slab_unreclaimable: the part of the Slab that can't be reclaimed under memory pressure
- nr_anon_pages: anonymous memory pages
- swap_cache: memory that once was swapped out, is swapped back in, but is still also in the swapfile
- page_tables: memory used to map between virtual and physical memory addresses
- swap
- eth.txe.rate: tx errors over all ethernet interfaces
- eth.rxe.rate: rx errors over all ethernet interfaces
- eth.txb.rate: tx bytes over all ethernet interfaces
- eth.rxb.rate: rx bytes over all ethernet interfaces
- eth.txp.rate: tx packets over all ethernet interfaces
- eth.rxp.rate: rx packets over all ethernet interfaces
- wifi.txe.rate: tx errors over all wireless interfaces
- wifi.rxe.rate: rx errors over all wireless interfaces
- wifi.txb.rate: tx bytes over all wireless interfaces
- wifi.rxb.rate: rx bytes over all wireless interfaces
- wifi.txp.rate: tx packets over all wireless interfaces
- wifi.rxp.rate: rx packets over all wireless interfaces
- txb.rate: tx bytes over all ethernet and wifi interfaces
- txp.rate: tx packets over all ethernet and wifi interfaces
- rxb.rate: rx bytes over all ethernet and wifi interfaces
- rxp.rate: rx packets over all ethernet and wifi interfaces
- sum.xb.rate: tx+rx bytes over all ethernet and wifi interfaces
- sum.xp.rate: tx+rx packets over all ethernet and wifi interfaces
- diff.xb.rate: tx-rx bytes over all ethernet and wifi interfaces
- diff.xp.rate: tx-rx packets over all ethernet and wifi interfaces
Accessible Tables and Improved Quality
As part of the Analysis Function Reproducible Analytical Pipeline Strategy, processes to create all National Travel Survey (NTS) statistics tables have been improved to follow the principles of Reproducible Analytical Pipelines (RAP). This has resulted in improved efficiency and quality of NTS tables, and therefore some historical estimates have seen very minor changes, at the fifth decimal place or smaller.
All NTS tables have also been redesigned in an accessible format so that they can be used by as many people as possible, including people with impaired vision, motor difficulties, cognitive impairments or learning disabilities, and deafness or impaired hearing.
If you wish to provide feedback on these changes then please email national.travelsurvey@dft.gov.uk.
Revision to table NTS9919
On 16 April 2025, the figures in table NTS9919 were revised and recalculated to include only day 1 of the travel diary, where short walks of less than a mile are recorded (from 2017 onwards), whereas previous versions included all days. This is to more accurately capture the proportion of trips which include short walks before a surface rail stage. This revision has resulted in fewer available breakdowns than previously published due to the smaller sample sizes.
NTS0303: Average number of trips, stages, miles and time spent travelling by mode: England, 2002 onwards (ODS, 53.9 KB) - https://assets.publishing.service.gov.uk/media/66ce0f118e33f28aae7e1f75/nts0303.ods
NTS0308: Average number of trips and distance travelled by trip length and main mode; England, 2002 onwards (ODS, 191 KB) - https://assets.publishing.service.gov.uk/media/66ce0f128e33f28aae7e1f76/nts0308.ods
NTS0312: Walks of 20 minutes or more by age and frequency: England, 2002 onwards (ODS, 35.1 KB) - https://assets.publishing.service.gov.uk/media/66ce0f12bc00d93a0c7e1f71/nts0312.ods
NTS0313: Frequency of use of different transport modes: England, 2003 onwards (ODS, 27.1 KB) - https://assets.publishing.service.gov.uk/media/66ce0f12bc00d93a0c7e1f72/nts0313.ods
NTS0412: Commuter trips and distance by employment status and main mode: England, 2002 onwards (ODS, 53.8 KB) - https://assets.publishing.service.gov.uk/media/66ce0f1325c035a11941f653/nts0412.ods
NTS0504: Average number of trips by day of the week or month and purpose or main mode: England, 2002 onwards (ODS, 141 KB) - https://assets.publishing.service.gov.uk/media/66ce0f141aaf41b21139cf7d/nts0504.ods
The basic goal of this survey is to provide the necessary database for formulating national policies at various levels. It represents the contribution of the household sector to the Gross National Product (GNP). Household surveys also help in determining the incidence of poverty and in providing weighted data that reflects the relative importance of consumption items, to be employed in determining the benchmark for rates and prices of items and services. Generally, the Household Expenditure and Consumption Survey is a fundamental cornerstone in the process of studying the nutritional status in the Palestinian territory.
The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality. Data is a public good, in the interest of the region, and it is consistent with the Economic Research Forum's mandate to make micro data available, aiding regional research on this important topic.
The survey data covers urban, rural and camp areas in West Bank and Gaza Strip.
1- Household/families. 2- Individuals.
The survey covered all Palestinian households whose usual residence is in the Palestinian Territory.
Sample survey data [ssd]
The sampling frame consists of all enumeration areas which were enumerated in 1997; the enumeration area consists of buildings and housing units and is composed of an average of 120 households. The enumeration areas were used as Primary Sampling Units (PSUs) in the first stage of the sampling selection. The enumeration areas of the master sample were updated in 2003.
The sample is a stratified cluster systematic random sample with two stages: First stage: selection of a systematic random sample of 299 enumeration areas. Second stage: selection of a systematic random sample of 12-18 households from each enumeration area selected in the first stage. A person (18 years and more) was selected from each household in the second stage.
The population was divided by: 1- Governorate 2- Type of Locality (urban, rural, refugee camps)
The calculated sample size is 3,781 households.
The target cluster size or "sample-take" is the average number of households to be selected per PSU. In this survey, the sample take is around 12 households.
Detailed information/formulas on the sampling design are available in the user manual.
Face-to-face [f2f]
The PECS questionnaire consists of two main sections:
First section: Certain articles/provisions of the form are filled in at the beginning of the month, and the remainder at the end of the month. The questionnaire includes the following provisions:
Cover sheet: Contains detailed particulars of the family, date of visit, particulars of the field/office work team, and the number and sex of the family members.
Statement of the family members: Contains social, economic and demographic particulars of the selected family.
Statement of the long-lasting commodities and income generation activities: Includes a number of basic and indispensable items (e.g., livestock or agricultural lands).
Housing Characteristics: Includes information and data pertaining to the housing conditions, including type of shelter, number of rooms, ownership, rent, water, electricity supply, connection to the sewer system, source of cooking and heating fuel, and remoteness/proximity of the house to education and health facilities.
Monthly and Annual Income: Data pertaining to the income of the family is collected from different sources at the end of the registration / recording period.
Second section: The second section of the questionnaire includes a list of 54 consumption and expenditure groups, itemized and serially numbered according to their importance to the family. Each of these groups contains important commodities. The total number of commodity and service items across all groups stood at 667. Groups 1-21 include food, drink, and cigarettes. Group 22 includes homemade commodities. Groups 23-45 include all items except food, drink and cigarettes. Groups 50-54 include all of the long-lasting commodities. Data on each of these groups was collected over different intervals of time so as to reflect expenditure over a period of one full year.
Both data entry and tabulation were performed using the ACCESS and SPSS software programs. The data entry process was organized in 6 files, corresponding to the main parts of the questionnaire. A data entry template was designed to reflect an exact image of the questionnaire, and included various electronic checks: logical check, range checks, consistency checks and cross-validation. Complete manual inspection was made of results after data entry was performed, and questionnaires containing field-related errors were sent back to the field for corrections.
The survey sample consists of about 3,781 households interviewed over a twelve-month period between January 2004 and January 2005. There were 3,098 households that completed the interview, of which 2,060 were in the West Bank and 1,038 in the Gaza Strip. The response rate was 82% in the Palestinian Territory.
The calculations of standard errors for the main survey estimations enable the user to identify the accuracy of estimations and the survey reliability. Total errors of the survey can be divided into two kinds: statistical errors, and non-statistical errors. Non-statistical errors are related to the procedures of statistical work at different stages, such as the failure to explain questions in the questionnaire, unwillingness or inability to provide correct responses, bad statistical coverage, etc. These errors depend on the nature of the work, training, supervision, and conducting all various related activities. The work team spared no effort at different stages to minimize non-statistical errors; however, it is difficult to estimate numerically such errors due to absence of technical computation methods based on theoretical principles to tackle them. On the other hand, statistical errors can be measured. Frequently they are measured by the standard error, which is the positive square root of the variance. The variance of this survey has been computed by using the “programming package” CENVAR.
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other one with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
Sample survey data [ssd]
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
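A minimal Python sketch of this two-stage selection, assuming eas is a frame of enumeration areas (columns ea_id, stratum, n_households) and hh is a frame of households (columns hh_id, ea_id); all names are hypothetical, and the authoritative version is the R script distributed with the dataset:

import numpy as np
import pandas as pd

def draw_sample(eas, hh, total_hh=8000, hh_per_ea=25, seed=1):
    rng = np.random.default_rng(seed)
    n_eas = total_hh // hh_per_ea   # 320 enumeration areas in total
    # Stage 1: allocate EAs to strata proportionally to stratum size
    # (rounding may need adjustment so the allocation sums to n_eas).
    sizes = eas.groupby("stratum")["n_households"].sum()
    alloc = (sizes / sizes.sum() * n_eas).round().astype(int)
    chosen = np.concatenate([
        rng.choice(g["ea_id"].to_numpy(), size=alloc[st], replace=False)
        for st, g in eas.groupby("stratum")
    ])
    # Stage 2: select 25 households at random within each selected EA.
    return (hh[hh["ea_id"].isin(chosen)]
            .groupby("ea_id", group_keys=False)
            .sample(n=hh_per_ea, random_state=seed))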
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: Population surveys are vital for wildlife management, yet traditional methods are typically effort-intensive, leading to data gaps. Modern technologies — such as drones — facilitate field surveys but increase the data analysis burden. Citizen Science (CS) can alleviate this issue by engaging non-specialists in data collection and analysis. We evaluated this approach for population monitoring using the endangered Galápagos marine iguana as a case study, assessing citizen scientists' ability to detect and count animals in aerial images. Comparing against a Gold Standard dataset of expert counts in 4345 images, we explored optimal aggregation methods from CS inputs and evaluated the accuracy of CS counts. During three phases of our project — hosted on Zooniverse.org — over 13,000 volunteers made 1,375,201 classifications from 57,838 images, each image being independently classified up to 30 times. Volunteers achieved 68% to 94% accuracy in detecting iguanas, with more false negatives than false positives. Image quality strongly influenced accuracy; by excluding data from suboptimal pilot-phase images, volunteers counted with 91% to 92% accuracy. For detecting iguanas, the standard 'majority vote' aggregation approach (where the answer selected is that given by the majority of individual inputs) produced less accurate results than when a minimum threshold of five (from the total independent classifications) was used. For counting iguanas, HDBSCAN clustering yielded the best results. We conclude that CS can accurately identify and count marine iguanas from drone images, though there is a tendency to underestimate. CS-based data analysis is still resource-intensive, underscoring the need to develop a Machine Learning approach.
Methods: We created a citizen science project, named Iguanas from Above, on Zooniverse.org. There, we uploaded 'sliced' images from drone imagery of several colonies of the Galápagos marine iguana. Citizen scientists (CS) were asked to classify the images in two tasks: first, to say yes or no for iguana presence in the image, and second, to count the individuals when present. Each image was classified by 20 or 30 volunteers. Once all the images corresponding to the three launched phases were classified, we downloaded the data from the Zooniverse portal and used the Panoptes Aggregation python package to extract and aggregate CS data (source code: https://github.com/cwinkelmann/iguanas-from-above-zooniverse).
We randomly selected 5–10% of all the images to create a Gold Standard (GS) dataset. Three experts from the research team identified presence and absence of marine iguanas in the images and counted them. The consensus answers are presented in this dataset and are referred to as expert data. The aggregated CS data from Task 1 (a total number of yes and no answers per image) was accepted as iguana presence when 5 or more volunteers (of the 20–30) selected yes (a minimum-threshold rule); otherwise absence was accepted. Then, we compared all CS accepted answers against the expert data, as correct or incorrect, and calculated a percentage of CS accuracy for marine iguana detection.
For Task 2, we selected all the images identified by the volunteers to have iguanas under this minimum-threshold rule and aggregated (summarized) all classifications into one value (count) per image using the statistical metrics median and mode and the spatial clustering methods DBSCAN and HDBSCAN. The rest of the images obtained 0 counts.
CS data was incorporated into this dataset. We then compared the total counts in this GS dataset calculated by the expert and by all the aggregation methods used, in terms of percentages of agreement with the expert data. These percentages show CS accuracy for marine iguana counting. We also investigated the numbers of marine iguanas under- and overestimated by each aggregation method. Finally, applying generalized linear models, we used this dataset to explore statistical differences among the different methods used to count marine iguanas (expert, median, mode and HDBSCAN) and how the factors phase analyzed, image quality (assessed by the experts) and number of marine iguanas present in the image could affect CS accuracy.
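A minimal sketch of the two Task 1 decision rules being compared, for a single image (function and variable names are hypothetical):

def iguana_present(yes_votes, total_votes, min_threshold=5):
    majority_vote = yes_votes > total_votes / 2    # standard majority-vote rule
    threshold_rule = yes_votes >= min_threshold    # minimum-threshold rule adopted here
    return majority_vote, threshold_rule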
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the underlying dataset used for the country analysis of the percentage of papers in Dimensions and Web of Science (WoS), published between 2015 and 2019, that are open access (OA), regardless of mode of OA. A paper was assigned a country affiliation based on the affiliation of the first author, so each paper is counted only once, regardless of whether the paper had multiple coauthors. Each row represents the data for a country, and a country only appears once (i.e., each row is unique).
Column headings:
- iso_alpha_2 = the ISO alpha-2 country code of the country
- country = the name of the country as stated in either Dimensions or WoS
- world_bank_region_2021 = the World Bank region of the country (2021 classification)
- pub_wos = total number of papers (document types articles and reviews) indexed in WoS, published from 2015 to 2019
- oa_pers_wos = percentage of pub_wos that are OA
- pub_dim = total number of papers (document type journal articles) indexed in Dimensions, published from 2015 to 2019
- oa_pers_dim = percentage of pub_dim that are OA
- relative_diff = the relative difference between oa_pers_dim and oa_pers_wos, computed as (x - y) / (x + y), with x representing the percentage of papers for the country in the Dimensions dataset that are OA, and y representing the percentage of papers for the country in the WoS dataset that are OA; an "N/A" in a cell indicates that a division by 0 occurred (see the sketch below)
Data availability: Restrictions apply to both datasets used to generate the aggregate data. The Web of Science data is owned by Clarivate Analytics. To obtain the bibliometric data in the same manner as the authors (i.e., by purchasing them), readers can contact Clarivate Analytics at the following URL: https://clarivate.com/webofsciencegroup/solutions/web-of-science/contact-us/. The Dimensions data is owned by Digital Science, which has a programme that provides no-cost access to its data. It can be accessed at: https://dimensions.ai/data_access.
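As a minimal illustration of the relative_diff definition and its "N/A" behaviour (the values are hypothetical):

```python
def relative_diff(x: float, y: float):
    """(x - y) / (x + y), where x and y are the OA percentages in
    Dimensions and WoS respectively; None when x + y == 0."""
    return None if (x + y) == 0 else (x - y) / (x + y)


print(relative_diff(40.0, 30.0))  # 0.142857...
print(relative_diff(0.0, 0.0))    # None, i.e. "N/A" in the dataset
```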
Access to up-to-date socio-economic data is a widespread challenge in Pacific Island Countries. To increase data availability and promote evidence-based policymaking, the Pacific Observatory provides innovative solutions and data sources to complement existing survey data and analysis. One of these data sources is a series of High Frequency Phone Surveys (HFPS), which began in 2020 as a way to monitor the socio-economic impacts of the COVID-19 Pandemic, and since 2023 has grown into a series of continuous surveys for socio-economic monitoring. See https://www.worldbank.org/en/country/pacificislands/brief/the-pacific-observatory for further details.
In Fiji, monthly HFPS data collection commenced in February 2024 on topics including employment, income, food security, health, food prices, assets and well-being. Fieldwork took place in rounds of roughly one month in length using a panel method, where each household was only recontacted at least thirty days after the previous interview. Each month has approximately 700 households in the sample and is representative of urban and rural areas and divisions. This dataset contains combined monthly survey data from February to October 2024. There is one data file for household-level data with a unique household ID, and a separate file for individual-level data within each household, which can be matched to the household file using the household ID and which also has a unique individual ID that can be used to track individuals over time within households.
Urban and rural areas of Fiji.
Household, individual.
Sample survey data [ssd]
The initial sample was drawn through Random Digit Dialing (RDD) with geographic stratification. As an objective of the survey was to measure changes in household economic wellbeing over time, the HFPS sought to contact a consistent number of households across each division month to month. It had a probability-based weighted design, with a proportionate stratification to achieve geographical representation. A panel was established from the outset, where in each subsequent round after February 2024, the survey firm would first attempt to contact all households from the previous month and then attempt to contact households from earlier months that had dropped out. After previous numbers were exhausted, RDD with geographic stratification was used for replacement households. This dataset includes 4,120 completed interviews with 1,360 unique households.
Computer Assisted Telephone Interview [cati]
The questionnaire, which can be found in the External Resources of this documentation, is available in English, with an iTaukei translation available. There were few changes to the questionnaire across the survey months, with some sections only asked in some rounds, such as the digital governance module in rounds 3 and 4. The survey instrument consists of the following modules, with notes in parentheses on dates of collection for questions which were not collected consistently across the whole survey period:
- Basic information
- Household roster
- Access to Services and Shocks (additional questions on water disruption were asked since April 2024)
- Subjective well-being
- Food insecurity experience scale (FIES)
- Views on the economy and government (some questions were added since May 2024)
- Household income
- Labor
- Agriculture
- Medical service utilization
- Climate migration (April 2024)
- Digital government services (May and June 2024)
The raw data were cleaned by the World Bank team using STATA. This included formatting and correcting errors identified through the survey's monitoring and quality control process. The data are presented in two datasets: a household dataset and an individual dataset. The individual dataset contains information on individual demographics and labor market outcomes of all household members aged 15 and above, and the household data set contains information about household demographics, food security, household income, agriculture activities, social protection, subjective well-being, access to services, shocks, and perceptions. The household identifier (panel_hid) is available in both the household dataset and the individual dataset. The individual identifier (panel_indid) can be found in the individual dataset.
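A minimal pandas sketch of combining the two files; the file names and the round variable are assumptions for illustration, while panel_hid and panel_indid are the documented identifiers.

```python
import pandas as pd

# Hypothetical file names; only the identifiers panel_hid and panel_indid
# come from the documentation above.
households = pd.read_csv("fiji_hfps_household.csv")
individuals = pd.read_csv("fiji_hfps_individual.csv")

# Attach household-level variables to each individual record. Because the
# files pool several monthly rounds, a round/month variable (its name is
# assumed here as "round") should normally be part of the join key as well.
merged = individuals.merge(households, on=["panel_hid", "round"], how="left")

# panel_hid plus panel_indid together identify the same person across rounds.
merged = merged.sort_values(["panel_hid", "panel_indid"])
```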
To facilitate comparisons with the Latin America and the Caribbean (LAC) High-Frequency Surveys collected in 2021, harmonized versions of the COVID-19 High Frequency Phone Surveys 2022 Brazil databases have been produced. The databases follow the same structure as those for the countries in the region (for example, see: COVID-19 LAC High Frequency Phone Surveys 2021 (Wave 1)).
The Brazil 2021 COVID-19 Phone Survey was conducted to provide information on how the pandemic had been affecting Brazilian households in 2021, collecting information along multiple dimensions relevant to the welfare of the population (e.g. changes in employment and income, coping mechanisms, access to health and education services, gender inequalities, and food insecurity). A total of 2,166 phone interviews were conducted across all Brazilian states between July 26 and October 1, 2021. The survey followed a Random Digit Dialing (RDD) sampling methodology using a dual sampling frame of cellphone and landline numbers. The sampling frame was stratified by type of phone and state. Results are nationally representative for households with a landline or at least one cell phone and of individuals of ages 18 years and above who have an active cell phone number or a landline at home.
National level.
Households and individuals of 18 years of age and older.
The sample is based on a dual frame of cell phone and landline numbers that was generated through a Random Digit Dialing (RDD) process and consisted of all possible phone numbers under the national phone numbering plan. Numbers were screened through an automated process to identify active numbers and cross-checked with business registries to identify business numbers not eligible for the survey. This method ensures coverage of all landline and cellphone numbers active at the time of the survey. The sampling frame was stratified by type of phone and state. See Sampling Design and Weighting document for more detail.
Computer Assisted Telephone Interview [cati]
Available in Portuguese. The questionnaire closely followed the LAC HFPS Questionnaire of Phase II Wave I but had some critical variations.
TL2ATMTN_7 is the Tropospheric Emission Spectrometer (TES)/Aura Level 2 Atmospheric Temperatures Nadir Version 7 data product. TES was an instrument aboard NASA's Aura satellite, launched from California on July 15, 2004. Data collection for TES is complete. TES Level 2 data contain retrieved species (or temperature) profiles at the observation targets and the estimated errors. The geolocation, quality, and other data (e.g., surface characteristics for nadir observations) were also provided. L2 modeled spectra were evaluated using radiative transfer modeling algorithms. The process, referred to as retrieval, compared observed spectra to the modeled spectra and iteratively updated the atmospheric parameters. L2 standard product files included information for one molecular species (or temperature) for an entire global survey or special observation run. A global survey consisted of a maximum of 16 consecutive orbits. Nadir and limb observations were written to separate L2 files, and a single ancillary file held the data common to both nadir and limb files. A Nadir sequence within the TES Global Survey was a fixed number of observations within an orbit. Prior to April 24, 2005, it consisted of two low-resolution scans over the same ground locations; after April 24, 2005, Global Survey data consisted of three low-resolution scans. The Nadir standard product consisted of four files, where each file was composed of the Global Survey Nadir observations from one of four focal planes for a single orbit, i.e., 72 orbit sequences. The Global Survey Nadir observations used only a single filter mix. A Limb sequence within the TES Global Survey involved three high-resolution scans over the same limb locations. The Limb standard product consisted of four files, where each file was composed of the Global Survey Limb observations from one of four focal planes for a single orbit, i.e., 72 orbit sequences. The Global Survey Limb observations used a repeating sequence of filter wheel positions. Special Observations could only be scheduled during the 9- or 10-orbit gaps in the Global Surveys, and were conducted in any of three basic modes: stare, transect, or step-and-stare, depending on the science requirement. A Global Survey consisted of observations along 16 consecutive orbits at the start of a two-day cycle, over which 4,608 retrievals were performed (1,152 nadir retrievals and 1,152 retrievals in time-ordered sequence for each limb observation). Each observation was the input for retrievals of species Volume Mixing Ratios (VMR), temperature profiles, surface temperature, and other data parameters with associated pressure levels, precision, total error, vertical resolution, total column density, and other diagnostic quantities. Each TES Level 2 standard product reported information in a swath format conforming to the HDF-EOS Aura File Format Guidelines. Each Swath object was bounded by the number of observations in a global survey and a predefined set of pressure levels, representing slices through the atmosphere. Each standard product could have a variable number of observations depending upon the Global Survey configuration and whether averaging was employed. Also, missing or bad retrievals were not reported. Each limb observation (Limb 1, Limb 2, and Limb 3) was processed independently. Thus, each limb standard product consisted of three sets, where each set consisted of 1,152 observations. For TES, the swath object represented one of these sets.
Thus, each limb standard product consisted of three swath objects, one for each observation, Limb 1, Limb 2, and Limb 3. The organization of data within the Swath object was based on a superset of Upper Atmosphere Research Satellite (UARS) pressure levels used to report concentrations of trace atmospheric gases. The reporting grid was the same pressure grid used for modeling. There were 67 reporting levels from 1211.53 hPa, which allow
Access to up-to-date socio-economic data is a widespread challenge in Vanuatu and other Pacific Island Countries. To increase data availability and promote evidence-based policymaking, the Pacific Observatory provides innovative solutions and data sources to complement existing survey data and analysis. One of these data sources is a series of High Frequency Phone Surveys (HFPS), which began in 2020 to monitor the socio-economic impacts of the COVID-19 Pandemic, and since 2023 has grown into a series of continuous surveys for socio-economic monitoring. See https://www.worldbank.org/en/country/pacificislands/brief/the-pacific-observatory for further details.
For Vanuatu, data for December 2023 – January 2025 was collected, with each month having approximately 1,000 households in the sample; the sample is representative of urban and rural areas but not at the province level. This dataset contains combined monthly survey data for all months of the continuous HFPS in Vanuatu. There is one data file for household-level data with a unique household ID, and a separate file for individual-level data within each household, which can be matched to the household file using the household ID and which also has a unique individual ID that can be used to track individuals over time within households, since the data is panel data.
National, urban and rural. Six provinces were covered by this survey: Sanma, Shefa, Torba, Penama, Malampa and Tafea.
Household and individuals.
Sample survey data [ssd]
The Vanuatu High Frequency Phone Survey (HFPS) sample is drawn from the list of customer phone numbers (MSISDNs) provided by Digicel Vanuatu, one of the country’s two main mobile providers. Digicel’s customer base spans all regions of Vanuatu. For the initial data collection, Digicel filtered their MSISDN database to ensure a representative distribution across regions. Recognizing the challenge of reaching low-income respondents, Digicel also included low-income areas and customers with a low-income profile (defined by monthly spending between 50 and 150 VT), as well as those with only incoming calls or using the IOU service without repayment. These filtered lists were then randomized, and enumerators began calling the numbers.
This approach was used to complete the first round of 1,000 interviews. The respondents from this first round formed a panel to be surveyed monthly. Each month, phone numbers from the panel are contacted until all have been interviewed, at which point new phone numbers (fresh MSISDNs from Digicel’s database) are used to replace those that have been exhausted. These new respondents are then added to the panel for future surveys.
Computer Assisted Telephone Interview [cati]
The questionnaire was developed in both English and Bislama. Sections of the Questionnaire:
- Interview Information
- Household Roster (separate modules for new households and returning households)
- Labor (separate modules for new households and returning households)
- Food Security
- Household Income
- Agriculture
- Social Protection
- Access to Services
- Assets
- Perceptions
- Follow-up
At the end of data collection, the raw dataset was cleaned by the survey firm and the World Bank team. Data cleaning mainly included formatting, relabeling, and excluding survey monitoring variables (e.g., interview start and end times). Data was edited using the software STATA.
The data are presented in two datasets: a household dataset and an individual dataset. The total number of observations is 13,779 in the household dataset and 77,501 in the individual dataset. The individual dataset contains information on individual demographics and labor market outcomes of all household members aged 15 and above, and the household data set contains information about household demographics, education, food security, household income, agriculture activities, social protection, access to services, and durable asset ownership. The household identifier (hhid) is available in both the household dataset and the individual dataset. The individual identifier (hhid_mem) can be found in the individual dataset.
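As a hedged illustration of using these identifiers for panel analysis, the sketch below assumes a hypothetical file name and month variable ("round"); hhid and hhid_mem are the documented identifiers.

```python
import pandas as pd

# Hypothetical file name; hhid and hhid_mem come from the documentation
# above, while the month variable name ("round") is an assumption.
ind = pd.read_csv("vanuatu_hfps_individual.csv")

# hhid_mem identifies an individual within a household, so grouping by
# (hhid, hhid_mem) yields each person's monthly panel history.
rounds_per_person = ind.groupby(["hhid", "hhid_mem"])["round"].nunique()
```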
In November 2024, a total of 7,874 calls were made. Of these, 2,251 calls were successfully connected, and 1,000 respondents completed the survey. By February 2024, the sample was composed entirely of returning respondents, with a re-contact rate of 99.9 percent.
The GINGAMODE database table contains selected information from the Large Area Counter (LAC) aboard the third Japanese X-ray astronomy satellite, Ginga. The Ginga experiment began on day 36, 5 February 1987, and ended in November 1991. Ginga consisted of the LAC, the all-sky monitor (ASM) and the gamma-ray burst detector (GBD). The satellite was in a circular orbit at 31-degree inclination with apogee 670 km, perigee 510 km, and a period of 96 minutes. A Ginga observation consisted of varying numbers of major frames, which had lengths of 4, 32, or 128 seconds, depending on the setting of the bitrate. Each GINGAMODE database entry consists of data from the first record of a series of observations having the same values of the following: "BITRATE", "LACMODE", "DISCRIMINATOR", or "ACS MONITOR". When any of these changed, a new entry was written into GINGAMODE. The other Ginga catalog database, GINGALOG, is also a subset of the same LAC dump file used to create GINGAMODE. GINGALOG contains a listing only whenever the "ACS monitor" (Attitude Control System) changes. Thus, GINGAMODE monitors changes in four parameters, while GINGALOG is a basic log database mapping the individual FITS files. Ginga FITS files may have more than one entry in the GINGAMODE database. Both databases point to the same archived Flexible Image Transport System (FITS) files created from the LAC dump files. The user is invited to browse through the observations available from Ginga using GINGALOG or GINGAMODE, then extract the FITS files for more detailed analysis. The Ginga LAC Mode Catalog was prepared from data sent to NASA/GSFC from the Institute of Space and Astronautical Science (ISAS) in Japan.
Duplicate entries were removed from the HEASARC implementation of this catalog in June 2019. This is a service provided by NASA HEASARC.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Methods: A literature search was performed using Google Scholar on 19th March 2019, which identified 368 citations of the original paper for TreeMap (Page 1994) and 332 citations of the original paper for ParaFit (Legendre et al. 2002), resulting in a total of 700 articles that were screened to extract metrics for inclusion in our meta-analysis. Articles that did not contain cophylogenetic analyses were immediately excluded. Studies focusing on the population level were also excluded, as these do not represent true cophylogenetic analyses at the macroevolutionary level. Additionally, studies that included fewer than four taxa were excluded from consideration, as these do not provide sufficient power for inclusion in the meta-analysis. Studies that did not report the test statistic for congruence were also necessarily excluded. A short citation of each study was recorded under ‘authors’, and the year of publication was recorded in ‘year’. Hosts and symbionts were classified broadly according to Linnean taxonomy for ‘host_tax_broad’ and ‘symbiont_tax_broad’ as either invertebrate, vertebrate, plant or microbe (i.e. microscopic symbionts such as fungi, protozoa, bacteria, viruses). We adopted the mode of symbiosis and mode of transmission between host species specified by the authors in each individual study for ‘symbiosis’ and ‘mode_of_transmission_broad’. In cases where either the mode of symbiosis or the mode of transmission was not directly specified by the authors, we consulted the literature for clarification. In a small number of studies restricted to bacterial intracellular symbionts, the mutualism-parasitism distinction was not defined by the authors and either no further information was available, or a symbiont was cited in the literature as either a mutualist or a parasite, depending on which study was considered. The nature of the relationship between bacterial intracellular symbionts and their hosts is complex, and in some cases they may display both beneficial and detrimental effects simultaneously. In a few cases of conflict, or where authors did not explicitly state the mode of transmission for bacterial intracellular symbionts, we assumed a mode of transmission in line with the majority of available references. We encountered only one study where the authors categorised the mode of symbiosis as commensalism. On the continuum of symbioses from pure parasitism (fitness losses for the host) to mutualism (fitness gains for the host), commensalism represents a single point where losses and gains for the host precisely equal zero. Consequently, commensalism is an unlikely and unstable state, easily tipped to one side or the other by any small change in external conditions. Thus, the lack of widely recognized groups of commensals is the likeliest explanation for the scarcity of studies on commensalism in our data (note that we did not include this category, commensalism, in our analyses). The total number of host tips linked to a symbiont taxon was summed to provide ‘host_tips_linked’, which in a very few cases was corrected to remove multiple sampling of the same host species, to provide ‘host_tips_linked_corrected’. The total number of symbiont tips with a link to a host taxon was summed to provide ‘symbiont_tips_linked’, while the total number of individual links between hosts and symbionts was recorded as ‘total_host_symbiont_links’.
If all symbionts in a phylogeny were strict specialists, such that each one had a single link to a single host, ‘total_host_symbiont_links’ would simply equal ‘symbiont_tips_linked’. However, because symbionts are often associated with more than one host, the value of ‘total_host_symbiont_links’ was often higher than the total number of symbionts included in a study. Thus, a measure of symbiont generalism was captured using ‘host_range_link_ratio’, defined as ‘total_host_symbiont_links’ divided by ‘symbiont_tips_linked’, providing the mean number of host-symbiont links observed per symbiont taxon, with the measure increasing with increasing generalism. An alternative estimate of symbiont host specificity was captured using ‘host_range_taxonomic_breadth’, which considers Linnean taxonomic rank, and was calculated by assigning an incremental score to successive host taxonomic ranks per symbiont in turn (i.e. single host species = 1, multiple host species in the same genus = 2, multiple host genera = 3, multiple host families = 4, multiple host orders = 5), summing the total score across all symbionts, and dividing by ‘symbiont_tips_linked’ (i.e. the total number of symbionts). Consequently, ‘host_range_taxonomic_breadth’ increases with symbiont generalism, such that symbiont phylogenies containing symbionts capable of infecting hosts from a wide range of taxonomic ranks are assigned a greater score. The number of phylogenetic permutations performed by authors during cophylogenetic analyses was recorded as ‘no_randomizations’, which poses a unique problem in our meta-analysis (discussed in the section ‘Publication bias and sensitivity analysis’). The resultant p value from each study was recorded as ‘p_value’, whereby observed p values decrease with a decreasing likelihood of observing host-symbiont cophylogeny by chance alone (i.e., as calculated during permutation tests performed by authors during TreeMap or ParaFit analyses). File '2021-09-01-source-data-dat.txt' is in tab-delimited text format. File 'Supporting_Information.Rmd' is the accompanying R code used for analysis of the source data.
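The two generalism measures can be illustrated with a short Python sketch over a toy link table. The column names here are assumptions for illustration, not the dataset's actual headers, and only the first three levels of the taxonomic scoring scheme are shown.

```python
import pandas as pd

# Toy link table: one row per host-symbiont link; column names are
# assumptions for illustration only.
links = pd.DataFrame({
    "symbiont":     ["s1",  "s1",  "s2"],
    "host_genus":   ["A",   "A",   "B"],
    "host_species": ["A_x", "A_y", "B_z"],
})

symbiont_tips_linked = links["symbiont"].nunique()
total_host_symbiont_links = len(links)

# host_range_link_ratio: mean number of host links per symbiont taxon.
host_range_link_ratio = total_host_symbiont_links / symbiont_tips_linked


def breadth_score(group: pd.DataFrame) -> int:
    """Incremental score per symbiont: 1 = single host species,
    2 = several species in one genus, 3 = multiple genera; the higher
    ranks (families = 4, orders = 5) are omitted in this sketch."""
    if group["host_species"].nunique() == 1:
        return 1
    if group["host_genus"].nunique() == 1:
        return 2
    return 3


scores = links.groupby("symbiont")[["host_genus", "host_species"]].apply(breadth_score)
host_range_taxonomic_breadth = scores.sum() / symbiont_tips_linked
```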
Syngenta is committed to increasing crop productivity and to using limited resources such as land, water and inputs more efficiently. Since 2014, Syngenta has been measuring trends in agricultural input efficiency on a global network of real farms. The Good Growth Plan dataset shows aggregated productivity and resource efficiency indicators by harvest year. The data has been collected from more than 4,000 farms and covers more than 20 different crops in 46 countries. The data (except USA data and for Barley in UK, Germany, Poland, Czech Republic, France and Spain) was collected, consolidated and reported by Kynetec (previously Market Probe), an independent market research agency. It can be used as benchmarks for crop yield and input efficiency.
National Coverage
Agricultural holdings
Sample survey data [ssd]
A. Sample design
Farms are grouped in clusters, which represent a crop grown in an area with homogeneous agro-ecological conditions and include comparable types of farms. The sample includes reference and benchmark farms. The reference farms were selected by Syngenta and the benchmark farms were randomly selected by Kynetec within the same cluster.
B. Sample size
Sample sizes for each cluster are determined with the aim of measuring statistically significant increases in crop efficiency over time. This is done by Kynetec based on target productivity increases and assumptions regarding the variability of farm metrics in each cluster. The smaller the expected increase, the larger the sample size needed to measure significant differences over time. Variability within clusters is assumed based on public research and expert opinion. In addition, growers are also grouped in clusters as a means of keeping variances under control, as well as distinguishing between growers in terms of crop size, region and technological level. A minimum sample size of 20 interviews per cluster is needed. The minimum number of reference farms is 5 of 20; the optimal number is 10 of 20 (balanced sample).
C. Selection procedure
The respondents were picked randomly using a “quota-based random sampling” procedure. Growers were first randomly selected and then checked for compliance with the quotas for crops, region, farm size, etc. To avoid clustering a high number of interviews at one sampling point, interviewers were instructed to conduct a maximum of 5 interviews in one village.
Screened Bangladesh benchmark farms (BF) were from Jessore, Rajshahi, Rangpur, Bogra, Comilla and Mymensingh and were selected based on the following criteria:
- Rice growers
- Partly smallholder
- Professional farmer with rice being main income source
- Manual planting and harvesting, but land preparation and threshing are mechanized
- Receive tech supports from SYT FFs, CP suppliers or dealers
- Hire labor
- Leading local farmer
- Using SYT products (read remark in next column)
- Loyal to SYT (only for RF - read remark in next column)
- Rice to rice rotation
Face-to-face [f2f]
Data collection tool for 2019 covered the following information:
(A) PRE- HARVEST INFORMATION
PART I: Screening
PART II: Contact Information
PART III: Farm Characteristics
a. Biodiversity conservation
b. Soil conservation
c. Soil erosion
d. Description of growing area
e. Training on crop cultivation and safety measures
PART IV: Farming Practices - Before Harvest
a. Planting and fruit development - Field crops
b. Planting and fruit development - Tree crops
c. Planting and fruit development - Sugarcane
d. Planting and fruit development - Cauliflower
e. Seed treatment
(B) HARVEST INFORMATION
PART V: Farming Practices - After Harvest
a. Fertilizer usage
b. Crop protection products
c. Harvest timing & quality per crop - Field crops
d. Harvest timing & quality per crop - Tree crops
e. Harvest timing & quality per crop - Sugarcane
f. Harvest timing & quality per crop - Banana
g. After harvest
PART VI: Other inputs - After Harvest
a. Input costs
b. Abiotic stress
c. Irrigation
See all questionnaires in external materials tab
A. Data processing
Kynetec uses SPSS (Statistical Package for the Social Sciences) for data entry, cleaning, analysis, and reporting. After collection, the farm data is entered into a local database, reviewed, and quality-checked by the local Kynetec agency. In the case of missing values or inconsistencies, farmers are re-contacted. In some cases, grower data is verified with local experts (e.g. retailers) to ensure data accuracy and validity. After country-level cleaning, the farm-level data is submitted to the global Kynetec headquarters for processing. In the case of missing values or inconsistencies, the local Kynetec office is re-contacted to clarify and resolve issues.
B. Quality assurance
Various consistency checks and internal controls are implemented throughout the entire data collection and reporting process in order to ensure unbiased, high-quality data.
• Screening: Each grower is screened and selected by Kynetec based on cluster-specific criteria to ensure a comparable group of growers within each cluster. This helps keep variability low.
• Evaluation of the questionnaire: The questionnaire aligns with the global objective of the project and is adapted to the local context (e.g. interviewers and growers should understand what is asked). Each year the questionnaire is evaluated based on several criteria, and updated where needed.
• Briefing of interviewers: Each year, local interviewers, familiar with the local context of farming, are thoroughly briefed to fully comprehend the questionnaire and obtain unbiased, accurate answers from respondents.
• Cross-validation of the answers:
o Kynetec captures all growers' responses through a digital data-entry tool. Various logical and consistency checks are automated in this tool (e.g. total crop size in hectares cannot be larger than farm size)
o Kynetec cross validates the answers of the growers in three different ways:
1. Within the grower (check if growers respond consistently during the interview)
2. Across years (check if growers respond consistently throughout the years)
3. Within cluster (compare a grower's responses with those of others in the group)
o All the above-mentioned inconsistencies are followed up by contacting the growers and asking them to verify their answers. The data is updated after verification. All updates are tracked.
• Check and discuss evolutions and patterns: Global evolutions are calculated, discussed and reviewed on a monthly basis jointly by Kynetec and Syngenta.
• Sensitivity analysis: sensitivity analysis is conducted to evaluate the global results in terms of outliers, retention rates and overall statistical robustness. The results of the sensitivity analysis are discussed jointly by Kynetec and Syngenta.
• It is recommended that users interested in using the administrative level 1 variable in the location dataset use this variable with care and crosscheck it with the postal code variable.
Due to the above-mentioned checks, irregularities in fertilizer usage data were discovered, which had to be corrected:
For data collection wave 2014, respondents were asked to give a total estimate of the fertilizer NPK rates applied in the fields. From 2015 onwards, the questionnaire was redesigned to be more precise and to obtain data for individual fertilizer products. The new method of measuring fertilizer inputs leads to more accurate results, but also makes year-on-year comparison difficult. After evaluating several solutions to this problem, 2014 fertilizer usage (NPK input) was re-estimated by calculating a weighted average of fertilizer usage in the following years.
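The documentation does not state the weighting scheme, so the sketch below uses invented numbers and weights purely to illustrate a weighted-average re-estimation of this kind.

```python
import numpy as np

# Hypothetical NPK rates (kg/ha) observed with the redesigned questionnaire
# in the following years (e.g. 2015, 2016, 2017); the weights are invented
# for illustration, e.g. favouring years nearer to 2014.
npk_later_years = [180.0, 175.0, 172.0]
weights = [0.5, 0.3, 0.2]

npk_2014_reestimated = np.average(npk_later_years, weights=weights)
```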
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bibliographic databases now include a large number of Early Access (EA) articles. Taking 47 IEEE journals as examples, this study analyzed and compared the differences in publication stages of EA articles across three typical bibliographic databases: Web of Science Core Collection, Scopus, and Engineering Village Compendex. A qualitative analysis of the data sets that may appear in these three databases and of their publication stage modes was conducted, together with a quantitative analysis of the number of records, proportions, and journal distributions of each data set and each publication stage mode. In total there were 7 sub-data sets and 26 corresponding publication stage modes, comprising 14 “undifferentiated publication stage modes” and 12 “differentiated publication stage modes”. Although the proportion of EA records from each “differentiated publication stage mode” was mostly below 1.0%, the absolute number of EA records with differences in publication stage was noteworthy, reaching 2,516. Among the 47 journals, 23 had 7–8 publication stage modes, 1 had 18 modes, and 40 had one or more “differentiated publication stage modes”. Therefore, in IEEE journals, whether for the same EA article or the same journal, differences in publication stage between these three databases were pervasive and complex.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the simulation data of the combinatorial metamaterial as used for the paper 'Machine Learning of Combinatorial Rules in Mechanical Metamaterials', as published in XXX.
In this paper, the data is used to classify each \(k \times k\) unit cell design into one of two classes (C or I) based on the scaling (linear or constant) of the number of zero modes \(M_k(n)\) for metamaterials consisting of an \(n\times n\) tiling of the corresponding unit cell. Additionally, a random walk through the design space starting from class C unit cells was performed to characterize the boundary between class C and I in design space. A more detailed description of the contents of the dataset follows below.
Modescaling_raw_data.zip
This file contains uniformly sampled unit cell designs and \(M_k(n)\) for \(1\leq n\leq 4\), which was used to classify the unit cell designs for the data set. There is a small subset of designs for \(k=\{3, 4, 5\}\) that do not neatly fall into the class C and I classification, and instead require additional simulation for \(4 \leq n \leq 6\) before either saturating to a constant number of zero modes (class I) or linearly increasing (class C). This file contains the simulation data of size \(3 \leq k \leq 8\) unit cells. The data is organized as follows.
Simulation data for \(3 \leq k \leq 5\) and \(1 \leq n \leq 4\) is stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. These files are named "data_new_rrQR_i_n_M_kxk_fixn4.npy", and contain a [Nsim, 1+k*k+4] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Note: the unit cell design uses the numbers \(\{0, 1, 2, 3\}\) to refer to each building block orientation. The building block orientations can be characterized through the orientation of the missing diagonal bar (see Fig. 2 in the paper), which can be Left Up (LU), Left Down (LD), Right Up (RU), or Right Down (RD). The numbers correspond to the building block orientation \(\{0, 1, 2, 3\} = \{\mathrm{LU, RU, RD, LD}\}\).
Simulation data for \(3 \leq k \leq 5\) and \(1 \leq n \leq 6\) for unit cells that cannot be classified as class C or I for \(1 \leq n \leq 4\) is stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. These files are named "data_new_rrQR_i_n_M_kxk_fixn4_classX_extend.npy", and contain a [Nsim, 1+k*k+6] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Simulation data for \(6 \leq k \leq 8\) unit cells are stored in numpy array format (.npy) and can be readily loaded in Python with the Numpy package using the numpy.load command. Note that the number of modes is now calculated for \(n_x \times n_y\) metamaterials, where we calculate \((n_x, n_y) = \{(1,1), (2, 2), (3, 2), (4,2), (2, 3), (2, 4)\}\) rather than \(n_x=n_y=n\) to save computation time. These files are named "data_new_rrQR_i_n_Mx_My_n4_kxk(_extended).npy", and contain a [Nsim, 1+k*k+8] sized array, where Nsim is the number of simulated unit cells. Each row corresponds to a unit cell. The columns are organized as follows:
Modescaling_classification_results.zip
This file contains the classification, slope, and offset of the scaling of the number of zero modes \(M_k(n)\) for the unit cells in Modescaling_raw_data.zip. The data is organized as follows.
The results for \(3 \leq k \leq 5\) based on the \(1 \leq n \leq 4\) mode scaling data are stored in "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4.txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare with the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class, where 0 corresponds to class I, 1 to class C and 2 to class X (neither class I nor C for \(1 \leq n \leq 4\))
col 2: slope from \(n \geq 2\) onward (undefined for class X)
col 3: the offset is defined as \(M_k(2) - 2 \cdot \mathrm{slope}\)
col 4: \(M_k(1)\)
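As an illustration of loading this results file with numpy (the "kxk" part of the quoted name pattern must be substituted with a concrete size, e.g. 3x3, following the naming scheme above):

```python
import numpy as np

# Substitute "kxk" in the name pattern for a concrete size per the naming
# scheme described above before loading.
results = np.loadtxt(
    "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4.txt",
    delimiter=",",
)

labels  = results[:, 0]  # col 0: label number
classes = results[:, 1]  # col 1: 0 = class I, 1 = class C, 2 = class X
slopes  = results[:, 2]  # col 2: slope from n >= 2 onward
offsets = results[:, 3]  # col 3: offset M_k(2) - 2*slope
M_1     = results[:, 4]  # col 4: M_k(1)
```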
The results for \(3 \leq k \leq 5\) based on the extended \(1 \leq n \leq 6\) mode scaling data are stored in "results_analysis_new_rrQR_i_Scen_slope_offset_M1k_kxk_fixn4_classC_extend.txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare with the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class, where 0 corresponds to class I, 1 to class C and 2 to class X (neither class I nor C for \(1 \leq n \leq 6\))
col 2: slope from \(n \geq 2\) onward (undefined for class X)
col 3: the offset is defined as \(M_k(2) - 2 \cdot \mathrm{slope}\)
col 4: \(M_k(1)\)
The results for \(6 \leq k \leq 8\) based on the \(1 \leq n \leq 4\) mode scaling data are stored in "results_analysis_new_rrQR_i_Scenx_Sceny_slopex_slopey_offsetx_offsety_M1k_kxk(_extended).txt". The data can be loaded using ',' as delimiter. Every row corresponds to a unit cell design (see the label number to compare with the earlier data). The columns are organized as follows:
col 0: label number to keep track
col 1: the class_x based on \(M_k(n_x, 2)\), where 0 corresponds to class I, 1 to class C and 2 to class X (neither class I nor C for \(1 \leq n_x \leq 4\))
col 2: the class_y based on \(M_k(2, n_y)\), where 0 corresponds to class I, 1 to class C and 2 to class X (neither class I nor C for \(1 \leq n_y \leq 4\))
col 3: slope_x from \(n_x \geq 2\) onward (undefined for class X)
col 4: slope_y from \(n_y \geq 2\) onward (undefined for class X)
col 5: the offset_x is defined as \(M_k(2, 2) - 2 \cdot \mathrm{slope_x}\)
col 6: the offset_y is defined as \(M_k(2, 2) - 2 \cdot \mathrm{slope_y}\)
col 7: \(M_k(1, 1)\)
Random Walks Data
This file contains the random walks for \(3 \leq k \leq 8\) unit cells. The random walk starts from a class C unit cell design; at each step \(s\), a randomly picked building block is changed to a random new orientation, for a total of \(s = k^2\) steps. The data is organized as follows.
The configurations for each step are stored in files named "configlist_test_i.npy", where i is a number corresponding to a different starting unit cell. The stored array has the shape [k*k+1, 2*k+2, 2*k+2]. The first dimension denotes the step \(s\), where \(s=0\) is the initial configuration. The second and third dimensions denote the unit cell configuration in the pixel representation (see paper), padded with a single-pixel-wide layer using periodic boundary conditions.
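A minimal loading sketch, with the file index i and the unit cell size k chosen purely for illustration:

```python
import numpy as np

k = 4                                        # unit cell size; example value
configs = np.load("configlist_test_0.npy")   # i = 0 chosen for illustration

# Shape check against the description above: one configuration per step,
# padded by one pixel on each side using periodic boundary conditions.
assert configs.shape == (k * k + 1, 2 * k + 2, 2 * k + 2)

initial = configs[0]    # s = 0: the class C starting configuration
final   = configs[-1]   # s = k^2: configuration at the end of the walk
```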
The class for each configuration is stored in "lmlist_test_i.npy", where i corresponds to the same number as in the "configlist_test_i.npy" file. The stored array has