100+ datasets found
  1. Sport Activity Dataset - MTS-5

    • kaggle.com
    zip
    Updated Jul 13, 2023
    Cite
    Jarno Matarmaa (2023). Sport Activity Dataset - MTS-5 [Dataset]. https://www.kaggle.com/datasets/jarnomatarmaa/sportdata-mts-5
    Available download formats
    zip (498699 bytes)
    Dataset updated
    Jul 13, 2023
    Authors
    Jarno Matarmaa
    License

    https://ec.europa.eu/info/legal-notice_en

    Description

    The dataset contains data in five categories: walking, running, biking, skiing, and roller skiing. The sport activities were recorded by a single active (non-competitive) athlete. The data is pre-processed, standardized, and split into four parts (each dimension in its own file):
    * HR-DATA_std_1140x69 (heart rate signals)
    * SPD-DATA_std_1140x69 (speed signals)
    * ALT-DATA_std_1140x69 (altitude signals)
    * META-DATA_1140x4 (labels and details)

    NOTE: The signal order must stay consistent across the separate files when processing the data. Signal order is critical: the first index in each file comes from the same activity, whose label corresponds to the first index in the target data file, and so on. The data should therefore be combined into a single table while reading the files, ideally using a nested data structure, something like in the picture below:

    You may check the related TSC projects on GitHub:
    - Sport Activity Classification Using Classical Machine Learning and Time Series Methods (https://github.com/JABE22/MasterProject)
    - Symbolic Representation of Multivariate Time Series Signals in Sport Activity Classification
    - Kaggle Project

    [Figure: Nested data structure for multivariate time series classifiers] https://mediauploads.data.world/e1ccd4d36522e04c0061d12d05a87407bec80716f6fe7301991eaaccd577baa8_mts_data.png
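
    The alignment requirement in the note above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file format inside the zip is not described here, so the sketch starts from rows that have already been read into lists, and the function name `combine_signals`, the record keys, and the use of plain string labels are all hypothetical.

```python
from typing import Dict, List, Sequence


def combine_signals(
    hr: Sequence[Sequence[float]],
    spd: Sequence[Sequence[float]],
    alt: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> List[Dict[str, object]]:
    """Zip row i of every file into one nested record, preserving row order."""
    if not (len(hr) == len(spd) == len(alt) == len(labels)):
        raise ValueError("all files must contain the same number of rows")
    records: List[Dict[str, object]] = []
    for i, (h, s, a, label) in enumerate(zip(hr, spd, alt, labels)):
        if not (len(h) == len(s) == len(a)):
            raise ValueError(f"row {i}: signals differ in length")
        records.append({"label": label, "hr": list(h), "spd": list(s), "alt": list(a)})
    return records


# Tiny stand-in data: 2 activities x 3 time steps (the real files are 1140 x 69).
hr = [[0.1, 0.2, 0.3], [1.0, 1.1, 1.2]]
spd = [[0.5, 0.4, 0.6], [2.0, 2.1, 2.2]]
alt = [[-0.1, 0.0, 0.1], [0.3, 0.3, 0.2]]
labels = ["walking", "running"]

records = combine_signals(hr, spd, alt, labels)
```

    Building the records in a single pass, and never sorting or shuffling one file independently of the others, keeps each label attached to its three signals.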

    The following picture shows five signal samples for each dimension (heart rate, speed, altitude) in standardized feature-value format. Each figure contains signals from five different random activities (which can be from the same or different categories). However, signals with the same index, for example index 1, in each of the three figures come from the same activity. The figures only illustrate what kind of signals the dataset contains; they do not have any particular meaning.

    [Figure: Signals from sport activities (Heart Rate, Speed, and Altitude)] https://mediauploads.data.world/162b7086448d8dbd202d282014bcf12bd95bd3174b41c770aa1044bab22ad655_signal_samples.png

    Dataset size and construction procedure

    The original number of sport activities is 228. From each of them, starting at index 100 (seconds), five consecutive 69-second segments were picked, as expressed by the formula below:

    [Figure: Data segmentation and augmentation formula] https://mediauploads.data.world/68ce83092ec65f6fbaee90e5de6e12df40498e08fa6725c111f1205835c1a842_segment_equation.png

    where D = original filtered data, N = number of activities, s = segment start index, l = segment length, and n = the number of segments from a single original sequence D_i, resulting in the new set of equal-length segments D_seg. In this particular case the equation takes the form:

    [Figure: Data segmentation and augmentation formula with values] https://mediauploads.data.world/63dd87bf3d0010923ad05a8286224526e241b17bbbce790133030d8e73f3d3a7_data_segmentation_formula.png

    Thus, the dataset has dimensions of 1140 x 69 x 3.
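
    The segmentation described above can be illustrated with a short Python sketch. It is not the author's code: the function name `segment` is hypothetical, and it assumes 0-based indexing, one sample per second, and non-overlapping segments, which is one reading of "consecutive". The parameter names s, l, and n follow the symbols defined in the text.

```python
from typing import List, Sequence


def segment(signal: Sequence[float], s: int = 100, l: int = 69, n: int = 5) -> List[List[float]]:
    """Cut n consecutive, non-overlapping segments of length l, starting at index s."""
    if len(signal) < s + n * l:
        raise ValueError("signal too short for the requested segments")
    return [list(signal[s + k * l : s + (k + 1) * l]) for k in range(n)]


# Stand-in recording: each sample equals its own index, so slices are easy to read.
recording = list(range(600))
segments = segment(recording)

# 228 original activities x 5 segments each gives the number of rows per file.
total_rows = 228 * len(segments)
```

    With 228 activities and 5 segments each, `total_rows` comes to 1140, which matches the first dimension stated above; 69 is the segment length and 3 is the number of signal types.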

    Additional information

    The data was recorded without knowing that it would be used in research. It therefore represents a real-world data source well and can serve as an excellent tool for testing algorithms on real data.

    Recording devices

    The data was recorded using two types of Garmin devices: the Forerunner 920XT and the vivosport. The vivosport is an activity tracker that measures heart rate from the wrist using an optical sensor, whereas the 920XT requires an external sensor belt (heart rate + inertial) worn on the chest during exercise. Otherwise the devices are not essentially different: both use GPS location to measure speed and an inertial barometer to measure elevation changes.

    Device manuals - Garmin FR-920XT - Garmin Vivosport

    Person profile

    Age: 30-31, Weight: 82, Height: 181, Active athlete (non-competitive)

  2. Consumer Price Index 2021 - West Bank and Gaza

    • pcbs.gov.ps
    Updated May 18, 2023
    + more versions
    Cite
    Palestinian Central Bureau of Statistics (2023). Consumer Price Index 2021 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/711
    Dataset updated
    May 18, 2023
    Dataset authored and provided by
    Palestinian Central Bureau of Statistics https://pcbs.gov/
    Time period covered
    2021
    Area covered
    Gaza, West Bank, Gaza Strip
    Description

    Abstract

    The consumer price surveys primarily provide the following:
    • Data on CPI in Palestine covering the West Bank, Gaza Strip and Jerusalem J1 for major and sub groups of expenditure.
    • Statistics needed for decision-makers, planners and those who are interested in the national economy.
    • Contribution to the preparation of quarterly and annual national accounts data.

    Consumer prices and indices are used for a wide range of purposes, the most important of which are as follows:
    • Adjustment of wages, government subsidies and social security benefits to compensate in part or in full for the changes in living costs.
    • To provide an index to measure the price inflation of the entire household sector, which is used to eliminate the inflation impact of the components of the final consumption expenditure of households in national accounts and to dispose of the impact of price changes from income and national groups.
    • Price index numbers are widely used to measure inflation rates and economic recession.
    • Price indices are used by the public as a guide for the family with regard to its budget and its constituent items.
    • Price indices are used to monitor changes in the prices of the goods traded in the market and the consequent position of price trends, market conditions and living costs. However, the price index does not reflect other factors affecting the cost of living, e.g. the quality and quantity of purchased goods. Therefore, it is only one of many indicators used to assess living costs.
    • It is used as a direct method to identify the purchasing power of money, where the purchasing power of money is inversely proportional to the price index.
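
    The inverse relationship between purchasing power and the price index can be sketched as follows. This is an illustration only: it assumes an index expressed relative to a base period value of 100, the function names are hypothetical, and the figures are made up rather than taken from the PCBS data.

```python
def purchasing_power(cpi: float, base: float = 100.0) -> float:
    """Purchasing power of one currency unit relative to the base period."""
    if cpi <= 0:
        raise ValueError("cpi must be positive")
    return base / cpi


def deflate(nominal: float, cpi: float, base: float = 100.0) -> float:
    """Express a nominal amount in base-period prices."""
    return nominal * purchasing_power(cpi, base)


# Illustrative values only: an index of 125 means prices rose 25% since the base period.
power = purchasing_power(125.0)
real_amount = deflate(500.0, 125.0)
```

    Here an index of 125 leaves one currency unit with 0.8 of its base-period purchasing power, so a nominal amount of 500 is worth 400 in base-period prices. The same division is what removes the effect of inflation from expenditure figures.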

    Geographic coverage

    Palestine, West Bank, Gaza Strip, Jerusalem

    Analysis unit

    The target population for the CPI survey is the shops and retail markets such as grocery stores, supermarkets, clothing shops, restaurants, public service institutions, private schools and doctors.

    Universe

    The target population for the CPI survey is the shops and retail markets such as grocery stores, supermarkets, clothing shops, restaurants, public service institutions, private schools and doctors.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A non-probability purposive sample of sources from which the prices of different goods and services are collected was updated based on the establishment census 2017, in a manner that achieves full coverage of all goods and services that fall within the Palestinian consumer system. These sources were selected based on the availability of the goods within them. It is worth mentioning that the sample of sources was selected from the main cities inside Palestine: Jenin, Tulkarm, Nablus, Qalqiliya, Ramallah, Al-Bireh, Jericho, Jerusalem, Bethlehem, Hebron, Gaza, Jabalia, Dier Al-Balah, Nusseirat, Khan Yunis and Rafah. The selection of these sources was considered to be representative of the variation that can occur in the prices collected from the various sources. The CPI includes approximately 730 goods and services, whose prices were collected from 3,200 sources. The COICOP classification is used for consumer data, as recommended by the United Nations System of National Accounts (SNA-2008).

    Sampling deviation

    Not applicable

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    A tablet-supported electronic form was designed for price surveys to be used by the field teams in collecting data from different governorates, with the exception of Jerusalem J1. The electronic form is supported with GIS and GPS mapping techniques that allow the field workers to locate the outlets exactly on the map and the administrative staff to manage the field remotely. The electronic questionnaire is divided into a number of screens, namely:
    • First screen: shows the metadata for the data source, governorate name, governorate code, source code, source name, full source address, and phone number.
    • Second screen: shows the source interview result, which is either completed, temporarily paused or permanently closed. It also shows the change activity as incomplete or rejected with the explanation for the reason of rejection.
    • Third screen: shows the item code, item name, item unit, item price, product availability, and reason for unavailability.
    • Fourth screen: checks the price data of the related source and verifies their validity through the auditing rules, which were designed specifically for the price programs.
    • Fifth screen: saves and sends data through a VPN connection and Wi-Fi.

    In case of the Jerusalem J1 Governorate, a paper form has been designed to collect the price data so that the form in the top part contains the metadata of the data source and in the lower section contains the price data for the source collected. After that, the data are entered into the price program database.

    Cleaning operations

    The price survey forms were already encoded by the project management depending on the specific international statistical classification of each survey. After the researcher collected the price data and sent them electronically, the data was reviewed and audited by the project management. Achievement reports were reviewed on a daily and weekly basis. Also, the detailed price reports at data source levels were checked and reviewed on a daily basis by the project management. If there were any notes, the researcher was consulted in order to verify the data and call the owner in order to correct or confirm the information.

    At the end of the data collection process in all governorates, the data will be edited using the following process:
    • Logical revision of prices by comparing the prices of goods and services with others from different sources and other governorates. Whenever a mistake is detected, it should be returned to the field for correction.
    • Mathematical revision of the average prices for items in governorates and the general average in all governorates.
    • Field revision of prices through selecting a sample of the prices collected from the items.

    Response rate

    Not applicable

    Sampling error estimates

    The findings of the survey may be affected by sampling errors due to the use of samples in conducting the survey rather than total enumeration of the units of the target population, which increases the chances of variances between the actual values we expect to obtain from the data if we had conducted the survey using total enumeration. The computation of differences between the most important key goods showed that the variation of these goods differs due to the specialty of each survey. For example, for the CPI, the variation between its goods was very low, except in some cases such as banana, tomato, and cucumber goods that had a high coefficient of variation during 2019 due to the high oscillation in their prices. The variance of the key goods in the computed and disseminated CPI survey that was carried out on the Palestine level was for reasons related to sample design and variance calculation of different indicators since there was a difficulty in the dissemination of results by governorates due to lack of weights.

    Non-sampling errors are probable at all stages of data collection or data entry. Non-sampling errors include:
    • Non-response errors: the selected sources demonstrated significant cooperation with interviewers, so no case of non-response was reported during 2019.
    • Response errors (respondent), interviewing errors (interviewer), and data entry errors: to avoid these types of errors and reduce their effect to a minimum, project managers adopted a number of procedures. More than one visit was made to every source to explain the objectives of the survey and emphasize the confidentiality of the data. The visits to data sources contributed to strengthening relations, cooperation, and the verification of data accuracy.
    • Interviewer errors: a number of procedures were taken to ensure data accuracy throughout the process of field data compilation. Interviewers were selected based on educational qualification, competence, and assessment. Interviewers were trained theoretically and practically on the questionnaire. Meetings were held to remind interviewers of instructions. In addition, explanatory notes were supplied with the surveys.

    A number of procedures were taken to verify data quality and consistency and ensure data accuracy for the data collected by questionnaire throughout processing and data entry (knowing that data collected through paper questionnaires did not exceed 5%):
    • Data entry staff were selected from among specialists in computer programming and were fully trained on the entry programs.
    • Data verification was carried out for 10% of the entered questionnaires to ensure that data entry staff had entered data correctly and in accordance with the provisions of the questionnaire. The result of the verification was consistent with the original data to a degree of 100%.
    • The files of the entered data were received, examined, and reviewed by project managers before findings were extracted.
    • Project managers carried out many checks on data logic and coherence, such as comparing the data of the current month with that of the previous month, and comparing the data of sources and between governorates.
    • Data collected by tablet devices were checked for consistency and accuracy by applying rules at item level.

    Data appraisal

    Other technical procedures to improve data quality: Seasonal adjustment processes

  3. Index, WA Annual Population and Growth Analysis Dataset: A Comprehensive...

    • neilsberg.com
    csv, json
    Updated Jul 30, 2024
    + more versions
    Cite
    Neilsberg Research (2024). Index, WA Annual Population and Growth Analysis Dataset: A Comprehensive Overview of Population Changes and Yearly Growth Rates in Index from 2000 to 2023 // 2024 Edition [Dataset]. https://www.neilsberg.com/insights/index-wa-population-by-year/
    Available download formats
    json, csv
    Dataset updated
    Jul 30, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Index, Washington
    Variables measured
    Annual Population Growth Rate, Population Between 2000 and 2023, Annual Population Growth Rate Percent
    Measurement technique
    The data presented in this dataset is derived from more than 20 years of U.S. Census Bureau Population Estimates Program (PEP) data, 2000 - 2023. To measure the variables, namely (a) population and (b) population change (in absolute terms and as a percentage), we initially analyzed and tabulated the data for each of the years between 2000 and 2023. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Index population over the last 20-plus years. It lists the population for each year, along with the year-on-year change in population, as well as the change in percentage terms for each year. The dataset can be used to understand the population change of Index across the last two decades. For example, using this dataset, we can identify whether the population is declining or increasing and, if there is a change, when the population peaked, or whether it is still growing and has not reached its peak. We can also compare the trend with the overall trend of the United States population over the same period of time.

    Key observations

    In 2023, the population of Index was 157, a 1.29% increase year-over-year from 2022. Previously, in 2022, the Index population was 155, an increase of 1.31% compared to a population of 153 in 2021. Over the last 20-plus years, between 2000 and 2023, the population of Index decreased by 3. In this period, the peak population was 211 in the year 2019. The numbers suggest that the population has already reached its peak and is showing a trend of decline. Source: U.S. Census Bureau Population Estimates Program (PEP).

    Content

    When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

    Data Coverage:

    • From 2000 to 2023

    Variables / Data Columns

    • Year: This column displays the data year (Measured annually and for years 2000 to 2023)
    • Population: This column shows the population of Index for the specific year.
    • Year on Year Change: This column displays the change in Index population for each year compared to the previous year.
    • Change in Percent: This column displays the year on year change as a percentage. Please note that the sum of all percentages may not equal one due to rounding of values.
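
    The two derived columns described above can be recomputed from the Year and Population columns. The sketch below is a hedged illustration, not Neilsberg's method: the function name `year_on_year` is hypothetical, and rounding to two decimals is an assumption based on how the percentages are quoted in the key observations.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, int, Optional[int], Optional[float]]


def year_on_year(population: Dict[int, int]) -> List[Row]:
    """Return (year, population, change, percent change) rows in year order."""
    rows: List[Row] = []
    previous: Optional[int] = None
    for year in sorted(population):
        current = population[year]
        # The first year has no predecessor, so both derived values stay None.
        change = None if previous is None else current - previous
        percent = None
        if change is not None and previous != 0:
            percent = round(100.0 * change / previous, 2)
        rows.append((year, current, change, percent))
        previous = current
    return rows


# Population figures quoted in the key observations above (2021 to 2023).
rows = year_on_year({2021: 153, 2022: 155, 2023: 157})
```

    For these three years the sketch reproduces the quoted increases of 1.31% for 2022 and 1.29% for 2023.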

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report or presentation, you can contact our research staff at research@neilsberg.com to check the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Index Population by Year.

  4. Index, WA Population Dataset: Yearly Figures, Population Change, and Percent...

    • neilsberg.com
    csv, json
    Updated Sep 18, 2023
    + more versions
    Cite
    Neilsberg Research (2023). Index, WA Population Dataset: Yearly Figures, Population Change, and Percent Change Analysis [Dataset]. https://www.neilsberg.com/research/datasets/6ea812e6-3d85-11ee-9abe-0aa64bf2eeb2/
    Available download formats
    csv, json
    Dataset updated
    Sep 18, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Index, Washington
    Variables measured
    Annual Population Growth Rate, Population Between 2000 and 2022, Annual Population Growth Rate Percent
    Measurement technique
    The data presented in this dataset is derived from more than 20 years of U.S. Census Bureau Population Estimates Program (PEP) data, 2000 - 2022. To measure the variables, namely (a) population and (b) population change (in absolute terms and as a percentage), we initially analyzed and tabulated the data for each of the years between 2000 and 2022. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Index population over the last 20-plus years. It lists the population for each year, along with the year-on-year change in population, as well as the change in percentage terms for each year. The dataset can be used to understand the population change of Index across the last two decades. For example, using this dataset, we can identify whether the population is declining or increasing and, if there is a change, when the population peaked, or whether it is still growing and has not reached its peak. We can also compare the trend with the overall trend of the United States population over the same period of time.

    Key observations

    In 2022, the population of Index was 156, a 0.00% decrease year-over-year from 2021. Previously, in 2021, the Index population was 156, an increase of 0.65% compared to a population of 155 in 2020. Over the last 20-plus years, between 2000 and 2022, the population of Index decreased by 4. In this period, the peak population was 211 in the year 2019. The numbers suggest that the population has already reached its peak and is showing a trend of decline. Source: U.S. Census Bureau Population Estimates Program (PEP).

    Content

    When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

    Data Coverage:

    • From 2000 to 2022

    Variables / Data Columns

    • Year: This column displays the data year (Measured annually and for years 2000 to 2022)
    • Population: This column shows the population of Index for the specific year.
    • Year on Year Change: This column displays the change in Index population for each year compared to the previous year.
    • Change in Percent: This column displays the year on year change as a percentage. Please note that the sum of all percentages may not equal one due to rounding of values.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report or presentation, you can contact our research staff at research@neilsberg.com to check the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Index Population by Year.

  5. Case Mix Index

    • data.chhs.ca.gov
    • data.ca.gov
    • +2 more
    docx, pdf, xlsx, zip
    Updated Nov 6, 2025
    + more versions
    Cite
    Department of Health Care Access and Information (2025). Case Mix Index [Dataset]. https://data.chhs.ca.gov/dataset/case-mix-index
    Available download formats
    docx, pdf, xlsx (192727), zip
    Dataset updated
    Nov 6, 2025
    Dataset authored and provided by
    Department of Health Care Access and Information
    Description

    The Case Mix Index (CMI) is the average relative DRG weight of a hospital’s inpatient discharges, calculated by summing the Medicare Severity-Diagnosis Related Group (MS-DRG) weight for each discharge and dividing the total by the number of discharges. The CMI reflects the diversity, clinical complexity, and resource needs of all the patients in the hospital. A higher CMI indicates a more complex and resource-intensive case load. Although the MS-DRG weights, provided by the Centers for Medicare & Medicaid Services (CMS), were designed for the Medicare population, they are applied here to all discharges regardless of payer. Note: It is not meaningful to add the CMI values together.
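
    The calculation described above, summing the MS-DRG weight of each discharge and dividing by the number of discharges, can be sketched as follows. The function name `case_mix_index` is hypothetical and the weights are made-up illustrative values, not figures from the dataset or from CMS.

```python
from typing import Sequence


def case_mix_index(drg_weights: Sequence[float]) -> float:
    """Average MS-DRG weight over a hospital's inpatient discharges."""
    if not drg_weights:
        raise ValueError("at least one discharge is required")
    return sum(drg_weights) / len(drg_weights)


# Hypothetical weights for four discharges (one weight per discharge).
cmi = case_mix_index([0.75, 1.25, 2.5, 1.5])
```

    Because the CMI is an average, combining hospitals requires recomputing it from the pooled discharges (or weighting each hospital's CMI by its discharge count) rather than adding the values, which is consistent with the note above.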

  6. Index, WA Population Breakdown by Gender and Age Dataset: Male and Female...

    • neilsberg.com
    csv, json
    Updated Feb 19, 2024
    + more versions
    Cite
    Neilsberg Research (2024). Index, WA Population Breakdown by Gender and Age Dataset: Male and Female Population Distribution Across 18 Age Groups // 2024 Edition [Dataset]. https://www.neilsberg.com/research/datasets/8df69e40-c989-11ee-9145-3860777c1fe6/
    Available download formats
    json, csv
    Dataset updated
    Feb 19, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Index, Washington
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Index by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Index. The dataset can be used to understand the population distribution of Index by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Index. Additionally, it can be used to see how the gender ratio changes from birth to the most senior age group, as well as the male-to-female ratio across each age group for Index.

    Key observations

    Largest age group (population): Male # 45-49 years (16) | Female # 40-44 years (15). Source: U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Scope of gender :

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

    Variables / Data Columns

    • Age Group: This column displays the age group for the Index population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): This column shows the male population in Index.
    • Population (Female): This column shows the female population in Index.
    • Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in Index for each age group.
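
    The Gender Ratio column can be derived from the two population columns. The following is a minimal sketch with a hypothetical function name; rounding to two decimals and returning `None` for an age group with no females are assumptions, since the source does not say how such cases are handled.

```python
from typing import Optional


def gender_ratio(males: int, females: int) -> Optional[float]:
    """Males per 100 females; None when the group contains no females."""
    if females == 0:
        return None
    return round(100.0 * males / females, 2)


# Illustrative counts for a single age group (not taken from the dataset).
ratio = gender_ratio(12, 10)
empty_group = gender_ratio(3, 0)
```

    In a town as small as Index, individual age groups can contain very few people, so the zero-female case is worth handling explicitly instead of letting the division fail.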

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report or presentation, you can contact our research staff at research@neilsberg.com to check the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Index Population by Gender.

  7. Consumer Price Index (CPI)

    • catalog.data.gov
    • datasets.ai
    Updated May 16, 2022
    Cite
    Bureau of Labor Statistics (2022). Consumer Price Index (CPI) [Dataset]. https://catalog.data.gov/dataset/consumer-price-index-cpi-ee18b
    Dataset updated
    May 16, 2022
    Dataset provided by
    Bureau of Labor Statistics http://www.bls.gov/
    Description

    The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. Indexes are available for the U.S. and various geographic areas. Average price data for select utility, automotive fuel, and food items are also available. Prices for the goods and services used to calculate the CPI are collected in 75 urban areas throughout the country and from about 23,000 retail and service establishments. Data on rents are collected from about 43,000 landlords or tenants. More information and details about the data provided can be found at http://www.bls.gov/cpi

  8. Data from: Long-term, gridded standardized precipitation index for Hawai‘i

    • hydroshare.org
    • dataone.org
    • +1 more
    zip
    Updated Sep 22, 2020
    Cite
    Matthew Lucas; Clay Trauernicht; Abby Frazier; Tomoaki Miura (2020). Long-term, gridded standardized precipitation index for Hawai‘i [Dataset]. http://doi.org/10.4211/hs.822553ead1d04869b5b3e1e3a3817ec6
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Sep 22, 2020
    Dataset provided by
    HydroShare
    Authors
    Matthew Lucas; Clay Trauernicht; Abby Frazier; Tomoaki Miura
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1920 - Dec 31, 2011
    Area covered
    Description

    This dataset contains gridded monthly Standardized Precipitation Index (SPI) at 10 timescales: 1-, 3-, 6-, 9-, 12-, 18-, 24-, 36-, 48-, and 60-month intervals from 1920 to 2012 at 250 m resolution for seven of the eight main Hawaiian Islands (18.849°N, 154.668°W to 22.269°N, 159.816°W; the island of Ni‘ihau is excluded due to lack of data). The gridded data use a World Geographic Coordinate System 1984 (WGS84) and are stored as individual GeoTIFF files for each month-year, organized by SPI interval, as indicated by the GeoTIFF file name. Thus, for example, the file “spi3_1999_11.tif” would contain the gridded 3-month SPI values calculated for the month of November in the year 1999. Currently, the data are available from 1920 to 2012, but the datasets will be updated as new gridded monthly rainfall data become available.

    SPI is a normalized drought index that converts monthly rainfall totals into the number of standard deviations (z-score) by which the observed, cumulative rainfall diverges from the long-term mean. The conversion of raw rainfall to a z-score is done by fitting a designated probability distribution function to the observed precipitation data for a site. In doing so, anomalous rainfall quantities take the form of positive and negative SPI z-scores. Additionally, because distribution fitting is based on long-term (>30 years) precipitation data at that location, SPI score is relative, making comparisons across different climates possible.

    The creation of a statewide Hawai‘i SPI dataset relied on a 93-year (1920-2012) high resolution (250 m) spatially interpolated monthly gridded rainfall dataset [1]. This dataset is recognized as the highest quality precipitation data available [2] for the main Hawaiian Islands. After performing extensive quality control on the monthly rainfall station data (including homogeneity testing of over 1,100 stations [1,3]) and a geostatistical method comparison, ordinary kriging was used to generate a time series of gridded monthly rainfall from January 1920 to December 2012 at 250 m resolution [3]. This dataset was then used to calculate monthly SPI for 10 timescales (1-, 3-, 6-, 9-, 12-, 18-, 24-, 36-, 48-, and 60-month) at each grid cell. A 3-month SPI in May 2001, for example, represents the March-April-May (MAM) total rainfall in 2001 compared to the MAM rainfall in the entire time series. The resolution of the gridded rainfall dataset provides a more precise representation of drought (and pluvial) events compared to the other available drought products.

    [1] Frazier, A.G.; Giambelluca, T.W.; Diaz, H.F.; Needham, H.L. Comparison of geostatistical approaches to spatially interpolate month-year rainfall for the Hawaiian Islands. Int. J. Climatol. 2016, 36, 1459–1470, doi:10.1002/joc.4437.
    [2] Giambelluca, T.W.; Chen, Q.; Frazier, A.G.; Price, J.P.; Chen, Y.-L.; Chu, P.-S.; Eischeid, J.K.; Delparte, D.M. Online Rainfall Atlas of Hawai‘i. B. Am. Meteorol. Soc. 2013, 94, 313–316, doi:10.1175/BAMS-D-11-00228.1.
    [3] Frazier, A.G.; Giambelluca, T.W. Spatial trend analysis of Hawaiian rainfall from 1920 to 2012. Int. J. Climatol. 2017, 37, 2522–2531, doi:10.1002/joc.4862.
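
    The file naming convention described for the SPI GeoTIFFs (for example “spi3_1999_11.tif”) can be handled with a small helper. The following Python sketch is illustrative only: the function names are hypothetical, and zero-padding single-digit months to two digits is an assumption, since the description only shows a two-digit month.

```python
import re
from typing import NamedTuple

# Timescales (in months) listed in the dataset description.
SPI_TIMESCALES = (1, 3, 6, 9, 12, 18, 24, 36, 48, 60)

# Accepts one- or two-digit months, because the padding rule is not documented.
_SPI_NAME = re.compile(r"^spi(\d+)_(\d{4})_(\d{1,2})\.tif$")


class SpiFile(NamedTuple):
    timescale: int  # SPI interval in months
    year: int
    month: int


def spi_filename(timescale: int, year: int, month: int) -> str:
    """Build a file name following the pattern of "spi3_1999_11.tif"."""
    if timescale not in SPI_TIMESCALES:
        raise ValueError(f"unsupported SPI timescale: {timescale}")
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    # Assumption: months are zero-padded to two digits.
    return f"spi{timescale}_{year}_{month:02d}.tif"


def parse_spi_filename(name: str) -> SpiFile:
    """Split a file name back into timescale, year and month."""
    match = _SPI_NAME.match(name)
    if match is None:
        raise ValueError(f"not an SPI GeoTIFF name: {name!r}")
    timescale, year, month = (int(group) for group in match.groups())
    return SpiFile(timescale, year, month)


print(spi_filename(3, 1999, 11))  # spi3_1999_11.tif
```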

  9. Management, Organization and Innovation Survey 2009 - Serbia

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Sep 26, 2013
    + more versions
    Cite
    European Bank for Reconstruction and Development (2013). Management, Organization and Innovation Survey 2009 - Serbia [Dataset]. https://microdata.worldbank.org/index.php/catalog/317
    Explore at:
    Dataset updated
    Sep 26, 2013
    Dataset provided by
    World Bank Group (http://www.worldbank.org/)
    European Bank for Reconstruction and Development
    Time period covered
    2008 - 2009
    Area covered
    Serbia
    Description

    Abstract

    The study was conducted in Serbia between October 2008 and February 2009 as part of the first round of The Management, Organization and Innovation Survey. Data from 135 manufacturing companies with 50 to 5,000 full-time employees was analyzed.

    The survey topics include detailed information about a company and its management practices - production performance indicators, production target, ways employees are promoted/dealt with when underperforming. The study also focuses on organizational matters, innovation, spending on research and development, production outsourcing to other countries, competition, and workforce composition.

    Analysis unit

    The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment is defined as a separate production unit, regardless of whether or not it has its own financial statements separate from those of the firm, and whether it has its own management and control over payroll. So the bottling plant of a brewery would be counted as an establishment.

    Universe

    The survey universe was defined as manufacturing establishments with at least fifty, but fewer than 5,000, full-time employees.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Random sampling was used in the study. For all MOI countries, except Russia, there was a requirement that all regions must be covered and that the percentage of the sample in each region was required to be equal to at least one half of the percentage of the sample frame population in each region.
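
    The regional coverage requirement above can be checked mechanically. The Python sketch below is a hypothetical illustration, not part of the survey methodology: the function name, region names and counts are invented, and it simply tests that every region is covered and that its share of the sample is at least half its share of the sample frame.

```python
def regions_meeting_requirement(frame_counts: dict, sample_counts: dict) -> dict:
    """Return {region: True/False} for the regional coverage rule.

    A region passes when it has at least one sampled establishment and its
    share of the sample is at least one half of its share of the frame.
    """
    frame_total = sum(frame_counts.values())
    sample_total = sum(sample_counts.values())
    result = {}
    for region, n_frame in frame_counts.items():
        n_sample = sample_counts.get(region, 0)
        covered = n_sample > 0
        # sample_share >= frame_share / 2, written with integers to avoid
        # floating-point rounding: 2 * n_sample / S >= n_frame / F.
        enough = 2 * n_sample * frame_total >= n_frame * sample_total
        result[region] = covered and enough
    return result


# Hypothetical counts, for illustration only.
frame = {"North": 600, "South": 300, "East": 100}
sample = {"North": 100, "South": 30, "East": 5}
print(regions_meeting_requirement(frame, sample))
```

    In this invented example the East region fails the rule: it holds 10% of the frame but only 5 of the 135 sampled establishments (about 3.7%), which is below the required 5%.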

    In most countries the sample frame used was an extract from the Orbis database of Bureau van Dijk, which was provided to the Consultant by the EBRD. The sample frame contained details of company names, location, company size (number of employees), company performance measures and contact details. The sample frame downloaded from Orbis was cleaned by the EBRD through the addition of regional variables, updating addresses and phone numbers of companies.

    Examination of the Orbis sample frames showed their geographic distributions to be wide with many locations, a large number of which had only a small number of records. Each establishment was selected with two substitutes that can be used if it proves impossible to conduct an interview at the first establishment. In practice selection was confined to locations with the most records in the sample frame, so the sample frame was filtered to just the cities with the most establishments.

    The quality of the frame was assessed at the onset of the project. The frame proved to be useful though it showed positive rates of non-eligibility, repetition, non-existent units, etc. These problems are typical of establishment surveys. For Serbia, the percentage of confirmed non-eligible units as a proportion of the total number of contacts to complete the survey was 26.7% (82 out of 307 establishments).

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    Two different versions of the questionnaire were used. Questionnaire A was used when interviewing establishments that are part of multi-establishment firms, while Questionnaire B was used when interviewing single-establishment firms. Questionnaire A incorporates all questions from Questionnaire B; the only difference is the reference point, which is the so-called national firm in the first part of Questionnaire A and the firm in Questionnaire B. The second part of the questionnaire refers only to the interviewed establishment in both Questionnaire A and Questionnaire B. Each variation of the questionnaire is identified by the index variable, a0.

    Response rate

    Item non-response was addressed by two strategies:

    • For sensitive questions that may generate negative reactions from the respondent, such as ownership information, enumerators were instructed to collect the refusal to respond as (-8).
    • Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary.

    However, there were clear cases of low response.

    Survey non-response was addressed by maximising efforts to contact establishments that were initially selected for interviews. At least 4 and up to 15 attempts were made to contact an establishment for interview at different times/days of the week before a replacement establishment (with similar characteristics) was suggested for interview. Survey non-response did occur, but substitutions were made in order to achieve the survey goals.

    Additional information about sampling, response rates and survey implementation can be found in "MOI Survey Report on Methodology and Observations 2009" in "Technical Documents" folder.

  10. Data from: The Software Heritage License Dataset (2022 Edition)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 10, 2024
    Cite
    Jesus M. Gonzalez-Barahona; Sergio Montes-Leon; Gregorio Robles; Stefano Zacchiroli (2024). The Software Heritage License Dataset (2022 Edition) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8200351
    Explore at:
    Dataset updated
    Jan 10, 2024
    Dataset provided by
    LTCI, Télécom Paris, Institut Polytechnique de Paris, Paris, France
    Universidad Rey Juan Carlos, Madrid, Spain
    Authors
    Jesus M. Gonzalez-Barahona; Sergio Montes-Leon; Gregorio Robles; Stefano Zacchiroli
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (Other, possibly more recent, versions of the datasets can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/).

    In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Some name examples are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file name was not expected to be at the project root, because project subdirectories can contain different licenses than the top-level one, and we wanted to include those too.

    Format

    The dataset is organized as follows:

    blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6’859’189 blobs, for a total uncompressed size on disk of 66 GiB.

    The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:

    blobs/ is the root directory containing all license blobs

    8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:

    $ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
    GNU GENERAL PUBLIC LICENSE
    Version 3, 29 June 2007

    $ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
    8624bcdae55baeef00cd11d5dfcfa60f68710a02 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02

    86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1
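
    The sharded layout described above can be reproduced in a few lines. This is a minimal Python sketch, with hypothetical function names, that derives the path of a blob inside the tarball from its SHA1, using the first and second groups of two hex digits as directory names.

```python
import hashlib
from pathlib import PurePosixPath


def blob_path(sha1_hex: str) -> PurePosixPath:
    """Map a blob SHA1 to its sharded path, e.g. blobs/86/24/8624..."""
    sha1_hex = sha1_hex.lower()
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError(f"not a SHA1 hex digest: {sha1_hex!r}")
    # First and second groups of two hex digits become the shard directories.
    return PurePosixPath("blobs", sha1_hex[0:2], sha1_hex[2:4], sha1_hex)


def blob_path_for_content(content: bytes) -> PurePosixPath:
    """Compute the path from raw file content, mirroring `sha1sum`."""
    return blob_path(hashlib.sha1(content).hexdigest())


print(blob_path("8624bcdae55baeef00cd11d5dfcfa60f68710a02"))
# blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
```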

    One blob is missing because its size (313MB) prevented its inclusion (it was originally a tarball containing source code):

    swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"

    blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing “only” 20’000 randomly selected license blobs

    license-blobs.csv.zst: a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"

    where:

    SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2

    SHA1: the blob SHA1, that can be used to cross-reference blobs in the blobs/ directory

    NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it’s an original typo from some repository!).
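
    Because NAME values are quoted and may contain commas, the index is best read with a real CSV parser rather than a plain split on commas. The Python sketch below groups file names by blob SHA1; it works on already-decompressed text (decompression of the .zst file is not shown), the function name is hypothetical, and the header spelling in the sample is an assumption, which is why columns are read by position.

```python
import csv
import io
from collections import defaultdict


def names_by_sha1(index_text: str) -> dict[str, list[str]]:
    """Group the NAME column by SHA1 for decompressed license-blobs.csv text."""
    reader = csv.reader(io.StringIO(index_text))
    next(reader, None)  # skip the header row
    names = defaultdict(list)
    for swhid, sha1, name in reader:
        names[sha1].append(name)
    return dict(names)


# Header spelling is assumed; the data rows are the example from the description.
sample = (
    "SWHID,SHA1,NAME\n"
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"\n'
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"\n'
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"\n'
)
print(names_by_sha1(sample))
```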

    blobs-fileinfo.csv.zst: a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:

    SHA1: blob SHA1

    MIME_TYPE: blob MIME type, as detected by libmagic

    ENCODING: blob character encoding, as detected by libmagic

    LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)

    WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)

    SIZE: blob size in bytes

    blobs-scancode.csv.zst: a Zst-compressed CSV mapping from blobs to software license detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:

    SHA1: blob SHA1

    LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)

    SCORE: confidence score in the result, as a decimal number between 0 and 100

    There may be zero or arbitrarily many lines for each blob.

    blobs-scancode.ndjson.zst: a Zst-compressed line-delimited JSON, containing a superset of the information in blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:

    sha1: blob SHA1

    licenses: output of scancode.api.get_licenses(..., min_score=0)

    copyrights: output of scancode.api.get_copyrights(...)

    There is exactly one line for each blob. licenses and copyrights keys are omitted for files not detected as plain text.
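
    Since the licenses and copyrights keys can be absent, code that reads the line-delimited JSON should not index them unconditionally. The following Python sketch shows one way to do this on decompressed lines; the function name and the sample payloads are hypothetical, and the ScanCode output is treated as an opaque value because its structure is not described here.

```python
import json
from typing import Iterable, Iterator


def iter_scancode_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one normalized record per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {
            "sha1": record["sha1"],
            # Both keys are omitted for files not detected as plain text.
            "licenses": record.get("licenses"),
            "copyrights": record.get("copyrights"),
            "is_plain_text": "licenses" in record and "copyrights" in record,
        }


# Hypothetical lines: the payloads are placeholders, not real ScanCode output.
sample_lines = [
    '{"sha1": "8624bcdae55baeef00cd11d5dfcfa60f68710a02", "licenses": {}, "copyrights": {}}',
    '{"sha1": "1340d4e2da173c92d432026ecdc54b4859fe9911"}',
]
for rec in iter_scancode_records(sample_lines):
    print(rec["sha1"], rec["is_plain_text"])
```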

    blobs-origins.csv.zst: a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob to one of its origins in the format SWHID URL, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 https://github.com/pombreda/Artemis

    Note that a license blob can come from many different places; only an arbitrary (and somewhat random) one is listed in this mapping.

    If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when the origins were either still being loaded when the dataset was generated, or the loader process crashed before completing the ingestion of the blob’s origin.

    blobs-nb-origins.csv.zst: a Zst-compressed CSV mapping of how many origins of this blob are known to Software Heritage. Each line in the index associates a license blob to this count in the format SWHID NUMBER, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260

    Two blobs are missing because the computation crashed:

    swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc

    This issue will be fixed in a future version of the dataset.
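
    A line of blobs-nb-origins.csv.zst, as shown in the example above, can be split into its two fields as follows. This Python sketch is illustrative: the function name is hypothetical, and splitting on any whitespace is an assumption made because the exact separator is not visible in the description.

```python
def parse_origin_count(line: str) -> tuple[str, int]:
    """Parse one "SWHID NUMBER" line into (swhid, number_of_origins)."""
    # Assumption: the two fields are separated by whitespace (space or tab).
    swhid, count = line.split()
    if not swhid.startswith("swh:1:cnt:"):
        raise ValueError(f"not a content SWHID: {swhid!r}")
    return swhid, int(count)


swhid, n_origins = parse_origin_count(
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260"
)
print(n_origins)  # 2822260
```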

    blobs-earliest.csv.zst: a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID EARLIEST_SWHID EARLIEST_TS OCCURRENCES, where:

    SWHID: blob SWHID

    EARLIEST_SWHID: SWHID of the earliest known commit containing the blob

    EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer

    OCCURRENCES: number of known commits containing the blob

    replication-package.tar.gz: code and scripts used to produce the dataset

    licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.

    Changes since the 2021-03-23 dataset

    More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.

    Values in the NAME column of license-blobs.csv.zst are quoted, as some file names now contain commas.

    Replication package now contains all the steps needed to reproduce all artefacts including the licenseblobs/fetch.py script.

    blobs-nb-origins.csv.zst is added.

    blobs-origins.csv.zst is now generated using the first origin returned by swh-graph’s leaves endpoint, instead of its randomwalk endpoint. This should have no impact on the result, other than a different distribution of “random” origins being picked.

    blobs-origins.csv.zst was missing ~10% of its results in previous versions of the dataset, due to errors and/or timeouts in its generation; this is now down to 0.02% (1254 of the 6859445 unique blobs). Blobs with no known origins are now present, with a blank instead of URL.

    blobs-earliest.csv.zst was missing ~10% of its results in previous versions of the dataset. It is complete now.

    blobs-scancode.csv.zst is generated with a newer scancode-toolkit version (31.2.1)

    blobs-scancode.ndjson.zst is added.

    Errata

    A file named .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:

    pv blobs-fileinfo.csv.zst | zstdcat | grep -v ".tmp" | zstd -19
    pv blobs.tar.zst| zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12

    The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.

    Citation

    If you use this dataset for research purposes, please acknowledge its use by citing one or both of the following papers:

    Jesús M. González-Barahona, Sergio Raúl Montes León, Gregorio Robles, Stefano Zacchiroli. The software heritage license dataset (2022 edition). Empirical Software Engineering, Volume 28, Number 6, Article number 147 (2023).

    Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

    References

    The dataset has been built using primarily the data sources described in the following papers:

    Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.

    Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.

    Errata (v2, 2024-01-09)

    licenses-annotated-sample.tar.gz: some comments not intended for publication were removed, and 4

  11. Sequence and Fitness Datasets for Variant Fitness Prediction using Protein...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 11, 2024
    Cite
    Yuanfei Sun (2024). Sequence and Fitness Datasets for Variant Fitness Prediction using Protein Language Models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6784458
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset provided by
    Texas A&M University
    Authors
    Yuanfei Sun
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset bundle contains three sets: 1) domain sequences for pretraining, 2) domain sequences for finetuning and 3) variant fitness scores. Files are in lmdb format.

    1. Domain sequences for pretraining

    Two bz2 compressed files are provided:

    rp15_seq_lmdb.tar.bz2: representative proteome sequences at 15% level from Pfam-V32 database. The whole dataset is randomly split into train and validation sets: number of sequences in training set: 12,681,738; number of sequences in validation set: 1,042,103. Sequence lengths range from 18 to 500 (inclusive), and this length-filtered set covers more than 95% of the sequences in the whole set.

    rp75_seq_lmdb.tar.bz2: representative proteome sequences at 75% level from Pfam-V32 database. The whole dataset is randomly split into train and validation sets: number of sequences in training set: 68,810,960; number of sequences in validation set: 5,687,282. Sequence lengths range from 18 to 500 (inclusive), and this length-filtered set covers more than 95% of the sequences in the whole set.

    Information of each sequence is stored as key-value pairs:

    {
      'primary': protein amino acid sequence,
      'protein_length': length of the sequence,
      'family': sequence Pfam family id (without 'PF'),
      'clan': sequence Pfam clan id (without 'CL', -1 if not exists),
      'unpIden': sequence Uniprot_id.version_number,
      'range': domain residue start-end indices (follow indices of Uniprot seq),
      'id': an index number for each sequence from 0 to N
    }

    One example:

    {'primary': 'ALQTTDKHHVATPANWRPGDDVIVPPPATQEAAEERLREG', 'protein_length': 40, 'family': 10417, 'clan': -1, 'unpIden': 'A0A147JSN0.1', 'range': '162-201', 'id': '0'}
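
    The fields of a pretraining record are redundant in a useful way, so a record can be sanity-checked after loading. The Python sketch below is illustrative only: the function name is hypothetical, reading the LMDB files is not shown, and treating 'range' as an inclusive start-end pair is an assumption (it is consistent with the example, where 201 - 162 + 1 = 40). The Pfam accession is rebuilt using the usual 'PF' plus five digits form.

```python
def check_pretraining_record(record: dict) -> dict:
    """Cross-check the length fields and rebuild the Pfam accession."""
    start, end = (int(part) for part in record["range"].split("-"))
    return {
        "length_matches_sequence": record["protein_length"] == len(record["primary"]),
        # Assumption: 'range' is inclusive on both ends.
        "length_matches_range": record["protein_length"] == end - start + 1,
        # 'family' is stored without the 'PF' prefix.
        "pfam_accession": f"PF{int(record['family']):05d}",
    }


example = {
    "primary": "ALQTTDKHHVATPANWRPGDDVIVPPPATQEAAEERLREG",
    "protein_length": 40,
    "family": 10417,
    "clan": -1,
    "unpIden": "A0A147JSN0.1",
    "range": "162-201",
    "id": "0",
}
print(check_pretraining_record(example))
```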

    2. Domain sequences for finetuning

    We collected homologous sequences of 33 proteins from [Shin2021]. The sequences are domain sequences queried over the UniRef100 database. Each family is split into train and validation sets with a ratio of 9:1.

    Information of each sequence is stored as key-value pairs:

    {
      'unp_range': Uniprot record name/start index - end index (indices follow Uniprot seq),
      'primary': protein amino acid sequence,
      'seq_reweight': sequence weighting score from Shin2021,
      'family_reweight': family weighting score from Shin2021 (sum of seq_reweight score for all family sequences),
      'seq_reweight_mmseqs2': sequence weighting score calculated by us using mmseqs2,
      'family_reweight_mmseqs2': family weighting score based on seq_reweight_mmseqs2 (sum of seq_reweight_mmseqs2 score for all family sequences)
    }

    One example:

    { 'unp_range': 'AMIE_PSEAE/1-346', 'primary': 'MRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMKQGLPGMDLVVFPEYSLQGIMYDPAEMMETAVAIPGEETEIFSRACRKANVWGVFSLTGERHEEHPRKAPYNTLVLIDNNGEIVQKYRKIIPWCPIEGWYPGGQTYVSEGPKGMKISLIICDDGNYPEIWRDCAMKGAELIVRCQGYMYPAKDQQVMMAKAMAWANNCYVAVANAAGFDGVYSYFGHSAIIGFDGRTLGECGEEEMGIQYAQLSLSQIRDARANDQSQNHLFKILHRGYSGLQASGDGDRGLAECPFEFYRTWVTDAEKARENVERLTRSTTGVAQCPVGRLPYEGLEKEA', 'seq_reweight': 0.0714285714286, 'family_reweight': 19553.99941694187, 'seq_reweight_mmseqs2': 0.0021413276231263384, 'family_reweight_mmseqs2': 25236.560885598774 }

    3. Variant fitness scores

    This fitness benchmark set contains 42 mutagenesis sets, which were originally curated by [DeepSequence]; [Shin2021] later used a subset of them.

    Information of each variant is stored as key-value pairs:

    {
      'set_nm': set name,
      'wt_seq': WT sequence,
      'seq_len': sequence length,
      'mutants': amino acid variants list (could have multi-site mutations),
      'mut_relative_idxs': list of relative amino acid indices for variants,
      'mut_seq': mutant sequence,
      'fitness': fitness score
    }

    One example:

    { 'set_nm': 'AMIE_PSEAE_Whitehead', 'wt_seq': 'MRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMKQGLPGMDLVVFPEYSLQGIMYDPAEMMETAVAIPGEETEIFSRACRKANVWGVFSLTGERHEEHPRKAPYNTLVLIDNNGEIVQKYRKIIPWCPIEGWYPGGQTYVSEGPKGMKISLIICDDGNYPEIWRDCAMKGAELIVRCQGYMYPAKDQQVMMAKAMAWANNCYVAVANAAGFDGVYSYFGHSAIIGFDGRTLGECGEEEMGIQYAQLSLSQIRDARANDQSQNHLFKILHRGYSGLQASGDGDRGLAECPFEFYRTWVTDAEKARENVERLTRSTTGVAQCPVGRLPYEG', 'seq_len': 341, 'mutants': ['M1W'], 'mut_relative_idxs': [0], 'mut_seq': 'WRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMKQGLPGMDLVVFPEYSLQGIMYDPAEMMETAVAIPGEETEIFSRACRKANVWGVFSLTGERHEEHPRKAPYNTLVLIDNNGEIVQKYRKIIPWCPIEGWYPGGQTYVSEGPKGMKISLIICDDGNYPEIWRDCAMKGAELIVRCQGYMYPAKDQQVMMAKAMAWANNCYVAVANAAGFDGVYSYFGHSAIIGFDGRTLGECGEEEMGIQYAQLSLSQIRDARANDQSQNHLFKILHRGYSGLQASGDGDRGLAECPFEFYRTWVTDAEKARENVERLTRSTTGVAQCPVGRLPYEG', 'fitness': -0.5174 }
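
    The relationship between 'wt_seq', 'mutants', 'mut_relative_idxs' and 'mut_seq' in the example can be made explicit in code. This Python sketch is an illustration under stated assumptions: the function name is hypothetical, each entry of 'mutants' is assumed to read as wild-type letter, position, mutant letter (as in 'M1W'), and 'mut_relative_idxs' is assumed to hold the matching 0-based index into 'wt_seq'.

```python
def apply_mutants(wt_seq: str, mutants: list[str], rel_idxs: list[int]) -> str:
    """Rebuild the mutant sequence from the wild type and its variant list."""
    if len(mutants) != len(rel_idxs):
        raise ValueError("mutants and mut_relative_idxs differ in length")
    seq = list(wt_seq)
    for mutant, idx in zip(mutants, rel_idxs):
        wt_aa, mut_aa = mutant[0], mutant[-1]
        # Guard against an index that does not point at the stated residue.
        if wt_seq[idx] != wt_aa:
            raise ValueError(f"{mutant}: expected {wt_aa} at index {idx}, found {wt_seq[idx]}")
        seq[idx] = mut_aa
    return "".join(seq)


# First ten residues of the example wild-type sequence above.
print(apply_mutants("MRHGDISSSN", ["M1W"], [0]))  # WRHGDISSSN
```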

    Reference

    DeepSequence: Riesselman, Adam J., John B. Ingraham, and Debora S. Marks. "Deep generative models of genetic variation capture the effects of mutations." Nature methods 15.10 (2018): 816-822.

    Shin2021: Shin, Jung-Eun, et al. "Protein design and variant prediction using autoregressive generative models." Nature communications 12.1 (2021): 1-11.

  12. Index, WA Age Group Population Dataset: A Complete Breakdown of Index Age...

    • neilsberg.com
    csv, json
    Updated Feb 22, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Neilsberg Research (2025). Index, WA Age Group Population Dataset: A Complete Breakdown of Index Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/index-wa-population-by-age/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Index, Washington
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the age groups. We divided ages between 0 and 85 into buckets of roughly 5 years each and aggregated all ages over 85 into a single group. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Index population distribution across 18 age groups. It lists the population in each age group along with its percentage of the total population of Index. The dataset can be utilized to understand the population distribution of Index by age. For example, using this dataset, we can identify the largest age group in Index.

    Key observations

    The largest age group in Index, WA was the 10 to 14 years group, with a population of 24 (14.63%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in Index, WA was the Under 5 years group, with a population of 0 (0%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Variables / Data Columns

    • Age Group: This column displays the age group in consideration
    • Population: The population for the specific age group in Index is shown in this column.
    • % of Total Population: This column displays the population of each age group as a percentage of the total population of Index. Please note that the sum of all percentages may not equal 100% due to rounding of values.
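
    The note about rounding can be illustrated with a short calculation. The Python sketch below uses hypothetical group names and counts, not values from this dataset, and assumes percentages are rounded to two decimals, as in the figures quoted above.

```python
def population_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each group's share of the total as a percentage (2 decimals)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total population is zero")
    return {group: round(100 * n / total, 2) for group, n in counts.items()}


# Hypothetical counts, chosen so that the rounded shares do not add up to 100.
shares = population_shares({"Group A": 1, "Group B": 1, "Group C": 1})
print(shares)                          # each share is 33.33
print(round(sum(shares.values()), 2))  # 99.99
```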

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Index Population by Age. You can refer to the same here

  13. AirNow Air Quality Monitoring Data (Current) - Dataset - CKAN

    • nationaldataplatform.org
    Updated Feb 28, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). AirNow Air Quality Monitoring Data (Current) - Dataset - CKAN [Dataset]. https://nationaldataplatform.org/catalog/dataset/airnow-air-quality-monitoring-data-current
    Explore at:
    Dataset updated
    Feb 28, 2024
    Description

    This United States Environmental Protection Agency (US EPA) feature layer represents monitoring site data, updated hourly concentrations and Air Quality Index (AQI) values for the latest hour received from monitoring sites that report to AirNow.

    Map and forecast data are collected using federal reference or equivalent monitoring techniques or techniques approved by the state, local or tribal monitoring agencies. To maintain "real-time" maps, the data are displayed after the end of each hour. Although preliminary data quality assessments are performed, the data in AirNow are not fully verified and validated through the quality assurance procedures monitoring organizations use to officially submit and certify data on the EPA Air Quality System (AQS).

    This data sharing and centralization creates a one-stop source for real-time and forecast air quality data. The benefits include quality control, national reporting consistency, access to automated mapping methods, and data distribution to the public and other data systems. The U.S. Environmental Protection Agency, National Oceanic and Atmospheric Administration, National Park Service, tribal, state, and local agencies developed the AirNow system to provide the public with easy access to national air quality information. State and local agencies report the Air Quality Index (AQI) for cities across the US and parts of Canada and Mexico. AirNow data are used only to report the AQI, not to formulate or support regulation, guidance or any other EPA decision or position.

    About the AQI

    The Air Quality Index (AQI) is an index for reporting daily air quality. It tells you how clean or polluted your air is, and what associated health effects might be a concern for you. The AQI focuses on health effects you may experience within a few hours or days after breathing polluted air. EPA calculates the AQI for five major air pollutants regulated by the Clean Air Act: ground-level ozone, particle pollution (also known as particulate matter), carbon monoxide, sulfur dioxide, and nitrogen dioxide. For each of these pollutants, EPA has established national air quality standards to protect public health. Ground-level ozone and airborne particles (often referred to as "particulate matter") are the two pollutants that pose the greatest threat to human health in this country.

    A number of factors influence ozone formation, including emissions from cars, trucks, buses, power plants, and industries, along with weather conditions. Weather is especially favorable for ozone formation when it’s hot, dry and sunny, and winds are calm and light. Federal and state regulations, including regulations for power plants, vehicles and fuels, are helping reduce ozone pollution nationwide.

    Fine particle pollution (or "particulate matter") can be emitted directly from cars, trucks, buses, power plants and industries, along with wildfires and woodstoves. But it also forms from chemical reactions of other pollutants in the air. Particle pollution can be high at different times of year, depending on where you live. In some areas, for example, colder winters can lead to increased particle pollution emissions from woodstove use, and stagnant weather conditions with calm and light winds can trap PM2.5 pollution near emission sources. Federal and state rules are helping reduce fine particle pollution, including clean diesel rules for vehicles and fuels, and rules to reduce pollution from power plants, industries, locomotives, and marine vessels, among others.

    How Does the AQI Work?

    Think of the AQI as a yardstick that runs from 0 to 500. The higher the AQI value, the greater the level of air pollution and the greater the health concern. For example, an AQI value of 50 represents good air quality with little potential to affect public health, while an AQI value over 300 represents hazardous air quality.

    An AQI value of 100 generally corresponds to the national air quality standard for the pollutant, which is the level EPA has set to protect public health. AQI values below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy: at first for certain sensitive groups of people, then for everyone as AQI values get higher.

    Understanding the AQI

    The purpose of the AQI is to help you understand what local air quality means to your health. To make it easier to understand, the AQI is divided into six categories:

    Air Quality Index (AQI) Values    Levels of Health Concern          Colors
    When the AQI is in this range:    ...air quality conditions are:    ...as symbolized by this color:
    0 to 50                           Good                              Green
    51 to 100                         Moderate                          Yellow
    101 to 150                        Unhealthy for Sensitive Groups    Orange
    151 to 200                        Unhealthy                         Red
    201 to 300                        Very Unhealthy                    Purple
    301 to 500                        Hazardous                         Maroon

    Note: Values above 500 are considered Beyond the AQI. Follow recommendations for the Hazardous category. Additional information on reducing exposure to extremely high levels of particle pollution is available here.

    Each category corresponds to a different level of health concern. The six levels of health concern and what they mean are:

    "Good" AQI is 0 to 50. Air quality is considered satisfactory, and air pollution poses little or no risk.

    "Moderate" AQI is 51 to 100. Air quality is acceptable; however, for some pollutants there may be a moderate health concern for a very small number of people. For example, people who are unusually sensitive to ozone may experience respiratory symptoms.

    "Unhealthy for Sensitive Groups" AQI is 101 to 150. Although the general public is not likely to be affected at this AQI range, people with lung disease, older adults and children are at a greater risk from exposure to ozone, whereas persons with heart and lung disease, older adults and children are at greater risk from the presence of particles in the air.

    "Unhealthy" AQI is 151 to 200. Everyone may begin to experience some adverse health effects, and members of the sensitive groups may experience more serious effects.

    "Very Unhealthy" AQI is 201 to 300. This would trigger a health alert signifying that everyone may experience more serious health effects.

    "Hazardous" AQI is greater than 300. This would trigger health warnings of emergency conditions. The entire population is more likely to be affected.

    AQI colors

    EPA has assigned a specific color to each AQI category to make it easier for people to understand quickly whether air pollution is reaching unhealthy levels in their communities. For example, the color orange means that conditions are "unhealthy for sensitive groups," while red means that conditions may be "unhealthy for everyone," and so on.

    Air Quality Index Levels of Health Concern, with Numerical Value and Meaning:

    Good (0 to 50): Air quality is considered satisfactory, and air pollution poses little or no risk.
    Moderate (51 to 100): Air quality is acceptable; however, for some pollutants there may be a moderate health concern for a very small number of people who are unusually sensitive to air pollution.
    Unhealthy for Sensitive Groups (101 to 150): Members of sensitive groups may experience health effects. The general public is not likely to be affected.
    Unhealthy (151 to 200): Everyone may begin to experience health effects; members of sensitive groups may experience more serious health effects.
    Very Unhealthy (201 to 300): Health alert: everyone may experience more serious health effects.
    Hazardous (301 to 500): Health warnings of emergency conditions. The entire population is more likely to be affected.

    Note: Values above 500 are considered Beyond the AQI. Follow recommendations for the "Hazardous" category. Additional information on reducing exposure to extremely high levels of particle pollution is available here.
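
    The category ranges described above can be expressed as a small lookup. This is a minimal sketch in Python, not part of the AirNow dataset or any EPA tool: the function name `aqi_category` and the table name `AQI_CATEGORIES` are hypothetical, and the breakpoints and colors are taken only from the description in this entry.

```python
from typing import Optional, Tuple

# (upper bound, level of health concern, color), copied from the description.
# The names AQI_CATEGORIES and aqi_category are illustrative assumptions.
AQI_CATEGORIES = [
    (50, "Good", "Green"),
    (100, "Moderate", "Yellow"),
    (150, "Unhealthy for Sensitive Groups", "Orange"),
    (200, "Unhealthy", "Red"),
    (300, "Very Unhealthy", "Purple"),
    (500, "Hazardous", "Maroon"),
]


def aqi_category(aqi: float) -> Tuple[str, Optional[str]]:
    """Return (level of health concern, color) for an AQI value."""
    if aqi < 0:
        raise ValueError("AQI values start at 0")
    for upper, level, color in AQI_CATEGORIES:
        if aqi <= upper:
            return level, color
    # Values above 500 are "Beyond the AQI"; the description gives no color
    # for them and says to follow the Hazardous recommendations.
    return "Beyond the AQI", None


if __name__ == "__main__":
    for value in (42, 101, 250, 620):
        print(value, aqi_category(value))
```

    The lookup walks the upper bounds in ascending order, so a value is assigned to the first category whose upper bound it does not exceed.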

  14. Data from: Evictions

    • data.cityofnewyork.us
    • nycopendata.socrata.com
    • +5more
    csv, xlsx, xml
    Updated Dec 2, 2025
    Cite
    Department of Investigation (DOI) (2025). Evictions [Dataset]. https://data.cityofnewyork.us/City-Government/Evictions/6z8x-wfk4
    Explore at:
    xlsx, xml, csv (available download formats)
    Dataset updated
    Dec 2, 2025
    Dataset provided by
    New York City Department of Investigation (http://www.nyc.gov/doi)
    Authors
    Department of Investigation (DOI)
    Description

    This dataset lists executed evictions within the five boroughs for the years 2017-Present (data prior to January 1, 2017, is not available). The data may be sorted by 20 categories of information, including Court Index Number, Docket Number, Eviction Address, Marshal First or Last Name, and Borough.

    Eviction data is compiled from New York City Marshals. City Marshals are independent public officials appointed by the Mayor. Marshals can be contacted directly regarding evictions, and their contact information can be found at https://www1.nyc.gov/site/doi/offices/marshals-list.page.

  15. DEEPEN 3D PFA Index Models for Exploration Datasets at Newberry Volcano

    • catalog.data.gov
    • gdr.openei.org
    • +2more
    Updated Jan 20, 2025
    + more versions
    Cite
    National Renewable Energy Laboratory (2025). DEEPEN 3D PFA Index Models for Exploration Datasets at Newberry Volcano [Dataset]. https://catalog.data.gov/dataset/deepen-3d-pfa-index-models-for-exploration-datasets-at-newberry-volcano-327cd
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    National Renewable Energy Laboratory
    Area covered
    Newberry Volcano
    Description

    DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments. As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), index models needed to be developed to map values in geoscientific exploration datasets to favorability index values. This GDR submission includes those index models.

    Index models were created by binning values in exploration datasets into chunks based on their favorability, and then assigning a number between 0 and 5 to each chunk, where 0 represents very unfavorable data values and 5 represents very favorable data values. To account for differences in how exploration methods are used to detect each play component, separate index models are produced for each exploration method for each component of each play type.

    Index models were created using histograms of the distributions of each exploration dataset in combination with literature and input from experts about what combinations of geophysical, geological, and geochemical signatures are considered favorable at Newberry. This is an attempt to create similarly sized bins based on the current understanding of how different anomalies map to favorable areas for the different types of geothermal plays (i.e., conventional hydrothermal, superhot EGS, and supercritical). For example, an area of partial melt would likely appear as an area of low density, high conductivity, low vp, and high vp/vs. These target anomalies would therefore be given high (4 or 5) index values for the purpose of imaging the heat source.

    Index models were produced for the following datasets:
    - Geologic model
    - Alteration model
    - vp/vs
    - vp
    - vs
    - Temperature model
    - Seismicity (density*magnitude)
    - Density
    - Resistivity
    - Fault distance
    - Earthquake cutoff depth model
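
    The binning idea described above can be sketched in a few lines of Python. This is an illustration only: the bin edges, the density units, and the names `DENSITY_EDGES`, `DENSITY_SCORES` and `index_value` are hypothetical and are not taken from the DEEPEN submission, whose actual index models are defined per exploration method, play component and play type.

```python
from bisect import bisect_right

# Hypothetical bin edges for a density dataset (g/cm^3), sorted ascending.
# Five edges split the value range into six bins.
DENSITY_EDGES = [2.3, 2.4, 2.5, 2.6, 2.7]

# One favorability index (0 to 5) per bin. Low density is treated as
# favorable here because the description says partial melt would likely
# appear as low density; the exact scores are an assumption.
DENSITY_SCORES = [5, 4, 3, 2, 1, 0]


def index_value(value, edges, scores):
    """Map a raw data value to the favorability index of the bin it falls in."""
    if len(scores) != len(edges) + 1:
        raise ValueError("need exactly one score per bin")
    return scores[bisect_right(edges, value)]


if __name__ == "__main__":
    for density in (2.25, 2.45, 2.65, 2.80):
        print(density, index_value(density, DENSITY_EDGES, DENSITY_SCORES))
```

    Keeping the edges and scores as data, rather than as branching logic, makes it straightforward to hold a separate index model for each dataset and play component.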

  16. Household Health Survey 2012-2013, Economic Research Forum (ERF)...

    • catalog.ihsn.org
    • datacatalog.ihsn.org
    Updated Jun 26, 2017
    + more versions
    Cite
    Central Statistical Organization (CSO) (2017). Household Health Survey 2012-2013, Economic Research Forum (ERF) Harmonization Data - Iraq [Dataset]. https://catalog.ihsn.org/index.php/catalog/6937
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    Kurdistan Regional Statistics Office (KRSO)
    Central Statistical Organization (CSO)
    Economic Research Forum
    Time period covered
    2012 - 2013
    Area covered
    Iraq
    Description

    Abstract

    The harmonized data set on health, created and published by the ERF, is a subset of Iraq Household Socio Economic Survey (IHSES) 2012. It was derived from the household, individual and health modules, collected in the context of the above mentioned survey. The sample was then used to create a harmonized health survey, comparable with the Iraq Household Socio Economic Survey (IHSES) 2007 micro data set.

    ----> Overview of the Iraq Household Socio Economic Survey (IHSES) 2012:

    Iraq is considered a leader in household expenditure and income surveys: the first was conducted in 1946, followed by surveys in 1954 and 1961. After the establishment of the Central Statistical Organization, household expenditure and income surveys were carried out every 3-5 years (1971/1972, 1976, 1979, 1984/1985, 1988, 1993, 2002/2007). As part of the cooperation between CSO and the World Bank (WB), the Central Statistical Organization (CSO) and Kurdistan Region Statistics Office (KRSO) launched fieldwork on IHSES on 1/1/2012. The survey was carried out over a full year, covering all governorates including those in Kurdistan Region.

    The survey has six main objectives. These objectives are:

    1. Provide data for poverty analysis and measurement, and monitor, evaluate and update the implementation of the Poverty Reduction National Strategy issued in 2009.
    2. Provide a comprehensive data system to assess household social and economic conditions and prepare the indicators related to human development.
    3. Provide data that meet the needs and requirements of national accounts.
    4. Provide detailed indicators on consumption expenditure that support decision-making related to production, consumption, export and import.
    5. Provide detailed indicators on the sources of household and individual income.
    6. Provide the data necessary for formulating a new consumer price index.

    The raw survey data provided by the Statistical Office were then harmonized by the Economic Research Forum, to create a comparable version with the 2006/2007 Household Socio Economic Survey in Iraq. Harmonization at this stage only included unifying variables' names, labels and some definitions. See: Iraq 2007 & 2012- Variables Mapping & Availability Matrix.pdf provided in the external resources for further information on the mapping of the original variables on the harmonized ones, in addition to more indications on the variables' availability in both survey years and relevant comments.

    Geographic coverage

    National coverage: Covering a sample of urban, rural and metropolitan areas in all the governorates including those in Kurdistan Region.

    Analysis unit

    1- Household/family. 2- Individual/person.

    Universe

    The survey was carried out over a full year covering all governorates including those in Kurdistan Region.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    ----> Design:

    The sample size was 25488 households for the whole of Iraq: 216 households in each of the 118 districts, organized as 2832 clusters of 9 households each, distributed across districts and governorates for rural and urban areas.

    ----> Sample frame:

    The listing and numbering results of the 2009-2010 Population and Housing Survey were adopted in all the governorates, including Kurdistan Region, as a frame to select households. The sample was selected in two stages. Stage 1: Primary sampling units (blocks) within each stratum (district) for urban and rural areas were systematically selected with probability proportional to size to reach 2832 units (clusters). Stage 2: 9 households from each primary sampling unit were selected to create a cluster; thus the total sample size was 25488 households distributed across the governorates, 216 households in each district.

    ----> Sampling Stages:

    In each district, the sample was selected in two stages. Stage 1: Based on the 2010 listing and numbering frame, 24 sample points were selected within each stratum through systematic sampling with probability proportional to size, in addition to the implicit urban and rural breakdown and the geographic breakdown (sub-district, quarter, street, county, village and block). Stage 2: Using households as secondary sampling units, 9 households were selected from each sample point using systematic equal probability sampling. Sampling frames for each stage can be developed based on the 2010 building listing and numbering without updating household lists. In some small districts, the random selection of primary sampling units may yield fewer than 24 distinct units, so a sampling unit may be selected more than once; two or more clusters may be drawn from the same enumeration unit when necessary.
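
    The two-stage design above can be sketched in Python to make the arithmetic and the selection steps concrete. This is a generic textbook-style sketch, not the survey's actual selection program: the function names `pps_systematic` and `systematic_sample`, the block sizes, and the household list are hypothetical. Only the counts (118 districts, 24 sample points per district, 9 households per point) come from the description.

```python
import random
from bisect import bisect_right
from itertools import accumulate

DISTRICTS = 118
CLUSTERS_PER_DISTRICT = 24
HOUSEHOLDS_PER_CLUSTER = 9

# Consistency check of the figures quoted in the description.
TOTAL_CLUSTERS = DISTRICTS * CLUSTERS_PER_DISTRICT            # 2832
TOTAL_HOUSEHOLDS = TOTAL_CLUSTERS * HOUSEHOLDS_PER_CLUSTER    # 25488


def pps_systematic(sizes, n, rng):
    """Stage 1: pick n unit indices systematically, with probability
    proportional to size. A large unit can be picked more than once."""
    cumulative = list(accumulate(sizes))
    step = cumulative[-1] / n
    start = rng.random() * step
    points = [start + k * step for k in range(n)]
    # Unit i owns the interval [cumulative[i-1], cumulative[i]).
    return [min(bisect_right(cumulative, p), len(sizes) - 1) for p in points]


def systematic_sample(items, k, rng):
    """Stage 2: systematic equal-probability sample of k items (len(items) >= k)."""
    step = len(items) / k
    start = rng.random() * step
    return [items[min(int(start + j * step), len(items) - 1)] for j in range(k)]


if __name__ == "__main__":
    rng = random.Random(2012)
    block_sizes = [120, 80, 300, 50, 210, 95]   # hypothetical household counts per block
    blocks = pps_systematic(block_sizes, CLUSTERS_PER_DISTRICT, rng)
    households = systematic_sample(list(range(120)), HOUSEHOLDS_PER_CLUSTER, rng)
    print(TOTAL_CLUSTERS, TOTAL_HOUSEHOLDS)
    print(blocks)
    print(households)
```

    With only six hypothetical blocks and 24 sample points, several blocks are necessarily selected more than once, which mirrors the small-district case mentioned in the description.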

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    ----> Preparation:

    The questionnaire of the 2006 survey was adopted in designing the questionnaire of the 2012 survey, to which many revisions were made. Two rounds of pre-tests were carried out. Revisions were made based on the feedback of the fieldwork team, World Bank consultants and others, and further revisions were made before the final version was implemented in a pilot survey in September 2011. After the pilot survey, further revisions were made based on the challenges and feedback that emerged during its implementation, yielding the final version used in the actual survey.

    ----> Questionnaire Parts:

    The questionnaire consists of four parts, each with several sections:

    Part 1: Socio-Economic Data:
    - Section 1: Household Roster
    - Section 2: Emigration
    - Section 3: Food Rations
    - Section 4: Housing
    - Section 5: Education
    - Section 6: Health
    - Section 7: Physical measurements
    - Section 8: Job seeking and previous job

    Part 2: Monthly, Quarterly and Annual Expenditures:
    - Section 9: Expenditures on Non-Food Commodities and Services (past 30 days)
    - Section 10: Expenditures on Non-Food Commodities and Services (past 90 days)
    - Section 11: Expenditures on Non-Food Commodities and Services (past 12 months)
    - Section 12: Expenditures on Non-food Frequent Food Stuff and Commodities (7 days)
    - Section 12, Table 1: Meals Had Within the Residential Unit
    - Section 12, Table 2: Number of Persons Participate in the Meals within Household Expenditure Other Than its Members

    Part 3: Income and Other Data:
    - Section 13: Job
    - Section 14: Paid jobs
    - Section 15: Agriculture, forestry and fishing
    - Section 16: Household non-agricultural projects
    - Section 17: Income from ownership and transfers
    - Section 18: Durable goods
    - Section 19: Loans, advances and subsidies
    - Section 20: Shocks and strategy of dealing in the households
    - Section 21: Time use
    - Section 22: Justice
    - Section 23: Satisfaction in life
    - Section 24: Food consumption during past 7 days

    Part 4: Diary of Daily Expenditures: The diary of expenditure is an essential component of this survey. It is left at the household to record all the daily purchases, such as expenditures on food and frequent non-food items (gasoline, newspapers, etc.), during 7 days. Two pages were allocated for recording the expenditures of each day, so the roster consists of 14 pages.

    Cleaning operations

    ----> Raw Data:

    Data Editing and Processing: To ensure accuracy and consistency, the data were edited at the following stages:
    1. Interviewer: Checks all answers on the household questionnaire, confirming that they are clear and correct.
    2. Local Supervisor: Checks to make sure that questions have been correctly completed.
    3. Statistical analysis: After exporting data files from Excel to SPSS, the Statistical Analysis Unit uses program commands to identify irregular or non-logical values in addition to auditing some variables.
    4. World Bank consultants in coordination with the CSO data management team: The World Bank technical consultants use additional programs in SPSS and STATA to examine and correct remaining inconsistencies within the data files. The software detects errors by analyzing questionnaire items according to the expected parameter for each variable.

    ----> Harmonized Data:

    • The SPSS package is used to harmonize the Iraq Household Socio Economic Survey (IHSES) 2007 with Iraq Household Socio Economic Survey (IHSES) 2012.
    • The harmonization process starts with raw data files received from the Statistical Office.
    • A program is generated for each dataset to create harmonized variables.
    • Data is saved on the household and individual level, in SPSS and then converted to STATA, to be disseminated.

    Response rate

    The Iraq Household Socio Economic Survey (IHSES) reached a total of 25488 households. The number of households that refused to respond was 305, and the response rate was 98.6%. The highest interview rates were in Ninevah and Muthanna (100%), while the lowest rate was in Sulaimaniya (92%).

  17. Pavement Condition Index

    • catalog.data.gov
    • data.montgomerycountymd.gov
    • +1more
    Updated Jun 21, 2025
    + more versions
    Cite
    data.montgomerycountymd.gov (2025). Pavement Condition Index [Dataset]. https://catalog.data.gov/dataset/pavement-condition-index-2019
    Explore at:
    Dataset updated
    Jun 21, 2025
    Dataset provided by
    data.montgomerycountymd.gov
    Description

    Pavement Condition Index is provided by the Department of Transportation for analyzing the condition of the pavement for all 5,200 lane miles of roadways within the County. Pavement Condition Index, or PCI, is a numerical expression between 0 and 100 representing the pavement's condition. For example, a PCI of 30 is considered “poor”, whereas a PCI of 80 indicates pavement in very good condition. The pavement's numerical PCI score is developed through an analysis of nineteen (19) discrete pavement distresses (cracking, potholes, environmental distress, utility cuts, etc.) and ranges from 1-100, with 1 being an absolute failure and 100 representing perfect conditions. This dataset is updated biennially.

  18. House Price Index; existing own homes; 2010=100 1995-2017

    • data.overheid.nl
    • cbs.nl
    • +2more
    atom, json
    Updated Jan 22, 2018
    + more versions
    Cite
    Centraal Bureau voor de Statistiek (Rijk) (2018). House Price Index; existing own homes; 2010=100 1995-2017 [Dataset]. https://data.overheid.nl/dataset/4491-house-price-index--existing-own-homes--2010-100-----1995-2017
    Explore at:
    atom, json (available download formats)
    Dataset updated
    Jan 22, 2018
    Dataset provided by
    Centraal Bureau voor de Statistiek
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The figures for existing own homes relate to the stock of existing own homes. Besides the price indices, figures are also published on the numbers sold, the average purchase price, and the total sum of the purchase prices of the sold dwellings. The House Price Index of existing own homes is based on a complete registration of sales of dwellings by the Dutch Land Registry Office (Kadaster) and the (WOZ) value of all dwellings in the Netherlands. Indices can fluctuate, for example when a limited number of dwellings of a certain type is sold. In such cases we recommend using the long-term figures. The average purchase price of existing own homes may differ from the price index of existing own homes. The change in the average purchase price, however, is not an indicator of price developments of existing own homes.

    Data available from: January 1995 - 2017

    Status of the figures: The figures are definitive.

    Changes as of 21 February 2014: Price information for 2008 onwards has been revised because of an improvement in the weighting scheme. The weighting scheme is based on the stock of existing own homes instead of the stock of all existing homes. The effect of the revision is very small.

    Changes as of 21 February 2018: None, this table has been discontinued. This table is followed by the table House Price Index; existing own homes 2015 = 100. See paragraph 3

    When will new figures be published? Does not apply.

  19. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    + more versions
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Explore at:
    zip (available download formats)
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Network (http://www.hptn.org/)
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer); the pipeline uses it to further demultiplex the reads and to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
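
    The sUMI/dUMI comparison step described above (finding matched sequences that differ between the two collections) can be illustrated with a short Python sketch. This is not the authors' compare_seqs.py, whose contents are not shown here: the function names `parse_fasta` and `discordant_ids`, the assumption that matched sequences share the same record ID in both files, and the toy sequences are all hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # id is the first token of the header
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def discordant_ids(sumi, dumi):
    """Return the ids present in both collections whose sequences differ."""
    shared = sumi.keys() & dumi.keys()
    return sorted(name for name in shared if sumi[name] != dumi[name])


if __name__ == "__main__":
    # Toy records; real input would be the two fasta files of matched
    # sUMI and rank 1 dUMI consensus sequences.
    sumi_text = ">fam1\nACGTACGT\n>fam2\nACGTTCGT\n>fam3\nGGGTACGT\n"
    dumi_text = ">fam1\nACGTACGT\n>fam2\nACGTACGT\n"
    print(discordant_ids(parse_fasta(sumi_text), parse_fasta(dumi_text)))
```

    Families reported by such a comparison would be the candidates for the alignment-based inspection the description goes on to outline.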

  20. Nepal Number Dataset

    • listtodata.com
    • st.listtodata.com
    .csv, .xls, .txt
    Updated Jul 17, 2025
    Cite
    List to Data (2025). Nepal Number Dataset [Dataset]. https://listtodata.com/nepal-dataset
    Explore at:
    .csv, .xls, .txt (available download formats)
    Dataset updated
    Jul 17, 2025
    Authors
    List to Data
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    Jan 1, 2025 - Dec 31, 2025
    Area covered
    Nepal
    Variables measured
    phone numbers, Email Address, full name, Address, City, State, gender, age, income, ip address
    Description

    Nepal Number Dataset is an index of Nepal contact numbers that are 100% accurate and valid. We always double-check to make sure that this record is correct. So, when you use this number library, you can trust that Nepal contact numbers work. And if you ever get an incorrect number, you get a replacement guarantee: if a phone number doesn’t work, you receive a new number at no extra cost. Moreover, the Nepal Number Dataset is very reliable. Use it with confidence, knowing that you are following the right steps for a smooth, successful outreach effort. The people on the list have agreed to share their mobile numbers. So, you are not breaking any rules when you use this database. And getting the customer’s consent makes contacting them more welcoming and effective.

    Nepal phone data is detailed information about Nepal contact numbers. Trusted sources collect this phone data to ensure its reliability. The sources from which this library comes may include websites, government records, and phone service providers. We verify each source, and you can check the URLs where we got the data. This ensures that the mobile data is accurate and reliable. Nepal phone data providers also offer 24/7 support. In addition, Nepal phone data follows an opt-in policy. This means that people choose to share their numbers, which ensures they know their information is being used. You won’t get in trouble for using contact details without permission. List to Data helps you to find Nepal contact data for your business.

    Nepal phone number list is a collection of phone numbers of people living in Nepal. You can sort these contact numbers by gender, age, and relationship status. This means that you can view only the numbers that match your needs. For example, if you want to contact young and single people, you can do so. This contact list also follows GDPR rules. In addition, the Nepal phone number list helps you remove invalid data. Sometimes, contact numbers may change or stop working. This list checks for this and removes those numbers, so you don’t waste time calling people who don’t answer. Using the Nepal phone number list, you reach the right people. Therefore, you get accurate, current information.

Cite
Jarno Matarmaa (2023). Sport Activity Dataset - MTS-5 [Dataset]. https://www.kaggle.com/datasets/jarnomatarmaa/sportdata-mts-5

Sport Activity Dataset - MTS-5

Multivariate Time Series (MTS) outdoor sport activity dataset

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats
zip (498699 bytes)
Dataset updated
Jul 13, 2023
Authors
Jarno Matarmaa
License

https://ec.europa.eu/info/legal-notice_en

Description

The dataset contains data in five categories: walking, running, biking, skiing, and roller skiing. The sport activities were recorded by an individual active (non-competitive) athlete. The data is pre-processed, standardized, and split into four parts (each dimension in its own file):
* HR-DATA_std_1140x69 (heart rate signals)
* SPD-DATA_std_1140x69 (speed signals)
* ALT-DATA_std_1140x69 (altitude signals)
* META-DATA_1140x4 (labels and details)

NOTE: The signal order must be kept consistent across the separate files when processing the data. Signal order is critical: the first index in each file comes from the same activity, whose label corresponds to the first index in the target data file, and so on. The data should therefore be constructed and the files combined into the same table while reading them, ideally using a nested data structure, something like in the picture below:

You may check the related TSC projects on GitHub:
- Sport Activity Classification Using Classical Machine Learning and Time Series Methods (https://github.com/JABE22/MasterProject)
- Symbolic Representation of Multivariate Time Series Signals in Sport Activity Classification - Kaggle Project

Figure: Nested data structure for multivariate time series classifiers (https://mediauploads.data.world/e1ccd4d36522e04c0061d12d05a87407bec80716f6fe7301991eaaccd577baa8_mts_data.png)
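The index alignment described in the note above can be sketched in Python. This is only an illustration under stated assumptions: the on-disk file format is not given here, so random arrays of the documented shape (1140 x 69) stand in for the loaded heart rate, speed, and altitude files, and the names `combine_signals`, `hr`, `spd`, `alt`, and `labels` are hypothetical. A 3-D NumPy array is used in place of the nested table shown in the figure; the point is only that row i of every file ends up in the same sample i.

```python
import numpy as np

n_activities, n_timesteps = 1140, 69
rng = np.random.default_rng(0)

# Hypothetical stand-ins for the three signal files and the label column.
# In practice these would be read from the HR/SPD/ALT/META files.
hr = rng.normal(size=(n_activities, n_timesteps))
spd = rng.normal(size=(n_activities, n_timesteps))
alt = rng.normal(size=(n_activities, n_timesteps))
labels = rng.integers(0, 5, size=n_activities)


def combine_signals(hr, spd, alt, labels):
    """Stack per-dimension signals so that index i refers to one activity."""
    if not (hr.shape == spd.shape == alt.shape):
        raise ValueError("signal files must have identical shapes")
    if len(labels) != hr.shape[0]:
        raise ValueError("one label is required per signal row")
    # Resulting axes: (activity, time step, dimension).
    signals = np.stack([hr, spd, alt], axis=-1)
    return signals, np.asarray(labels)


X, y = combine_signals(hr, spd, alt, labels)
print(X.shape)  # (1140, 69, 3)
```

Any shuffling or train/test splitting should then be applied to `X` and `y` together, never to the individual files, so the row correspondence is preserved.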

The following picture shows five signal samples for each dimension (heart rate, speed, altitude) in standard feature value format. Each figure contains signals from five different random activities (which may be from the same or different categories). However, the signals with index 1 in each of the three figures, for example, come from the same activity. The figures merely illustrate what kind of signals the dataset contains; they do not have any particular meaning.

Figure: Signals from sport activities (Heart Rate, Speed, and Altitude) (https://mediauploads.data.world/162b7086448d8dbd202d282014bcf12bd95bd3174b41c770aa1044bab22ad655_signal_samples.png)

Dataset size and construction procedure

The original number of sport activities is 228. From each of them, starting at index 100 (seconds), five consecutive 69-second segments were picked, as expressed in the formula below:

Figure: Data segmentation and augmentation formula (https://mediauploads.data.world/68ce83092ec65f6fbaee90e5de6e12df40498e08fa6725c111f1205835c1a842_segment_equation.png)

where D = original filtered data, N = number of activities, s = segment start index, l = segment length, and n = the number of segments taken from a single original sequence D_i, resulting in the new set of equal-length segments D_seg. In this particular case the equation takes the form:

Figure: Data segmentation and augmentation formula with values (https://mediauploads.data.world/63dd87bf3d0010923ad05a8286224526e241b17bbbce790133030d8e73f3d3a7_data_segmentation_formula.png)

Thus, the dataset has dimensions of 1140 x 69 x 3.
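The segment count can be checked with a short Python sketch. The formula itself is only available as an image, so this is a reading of the prose rather than the author's code: it assumes the five segments are non-overlapping and back to back, so segment k of a sequence covers indices s + k*l up to (but not including) s + (k+1)*l. The function name `segment_sequences` and the synthetic recordings (including their lengths) are hypothetical.

```python
import numpy as np


def segment_sequences(sequences, start=100, length=69, n_segments=5):
    """Cut n_segments consecutive, non-overlapping windows from each sequence."""
    needed = start + n_segments * length
    segments = []
    for seq in sequences:
        seq = np.asarray(seq)
        if len(seq) < needed:
            raise ValueError(f"sequence too short: {len(seq)} < {needed}")
        for k in range(n_segments):
            lo = start + k * length
            segments.append(seq[lo:lo + length])
    return np.stack(segments)


rng = np.random.default_rng(0)
# Synthetic stand-ins for one dimension of the 228 original recordings;
# the lengths are made up and only need to be at least 100 + 5 * 69 = 445.
activities = [rng.normal(size=rng.integers(445, 2000)) for _ in range(228)]

segmented = segment_sequences(activities)
print(segmented.shape)  # (1140, 69)
```

With N = 228 activities and n = 5 segments each, this gives 228 x 5 = 1140 segments of length 69 per dimension; repeating it for heart rate, speed, and altitude yields the stated 1140 x 69 x 3.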

Additional information

The data was recorded without knowing that it would be used in research. It therefore represents a real-world data source well and can serve as an excellent tool for testing algorithms on real data.

Recording devices

The data was recorded using two types of Garmin devices: the Forerunner 920XT and the vivosport. The vivosport is an activity tracker that measures heart rate from the wrist using an optical sensor, whereas the 920XT requires an external sensor belt (heart rate + inertial) worn under the chest during exercise. Otherwise the devices are not essentially different: both use GPS location to measure speed and an inertial barometer to measure elevation changes.

Device manuals:
- Garmin FR-920XT
- Garmin Vivosport

Person profile

Age: 30-31, Weight: 82, Height: 181, Active athlete (non-competitive)
