45 datasets found
  1. ERA5 monthly averaged data on single levels from 1979 to present (Copernicus)

    • sextant.ifremer.fr
    • pigma.org
    www:link +1
    Updated Sep 6, 2021
    Cite
    ERA5 monthly averaged data on single levels from 1979 to present (2021). Copernicus [Dataset]. https://sextant.ifremer.fr/geonetwork/srv/api/records/ff2cd349-ecab-48e1-817a-1ed87dc0c4be
    Explore at:
    www:link-1.0-http--publication-url, www:link. Available download formats
    Dataset updated
    Sep 6, 2021
    Dataset provided by
    ERA5 monthly averaged data on single levels from 1979 to present
    Area covered
    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 4 to 7 decades. Currently data is available from 1950, split into Climate Data Store entries for 1950-1978 (preliminary back extension) and from 1979 onwards (final release plus timely updates, this page). ERA5 replaces the ERA-Interim reanalysis.

    Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.

    ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

    ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later. So far this has not been the case; should it occur, users will be notified.

    The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications.

    An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines.

    Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper-air fields) and single levels (atmospheric, ocean-wave and land-surface quantities).

    The present entry is "ERA5 monthly mean data on single levels from 1979 to present".

  2. Primary data/laser scan data of the laser scanner measurement recording...

    • data.europa.eu
    compressed las file
    Updated Jun 16, 2024
    Cite
    (2024). Primary data/laser scan data of the laser scanner measurement recording including classification — Free State of Saxony [Dataset]. https://data.europa.eu/88u/dataset/99815cb4-f6de-4bef-ac7f-e8e005e3f1d9~~1
    Explore at:
    compressed las file. Available download formats
    Dataset updated
    Jun 16, 2024
    License

    Data licence Germany – Attribution – Version 2.0, https://www.govdata.de/dl-de/by-2-0
    License information was derived automatically

    Description

    In order to create terrain and surface models, the earth's surface is captured by laser scanner measurement. The result is an irregular point cloud (so-called primary data or laser scan data). The laser scan data are georeferenced and are distinguished into first-echo, last-echo and only-echo points. The laser scan data are classified into ground points and non-ground points. Gaps are filled by interpolated points (so-called supplementary points). The laser scan data are stored in the LAZ data format. The supplementary points are also stored in the files.
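
    Since the point clouds are delivered as LAZ files with the ground/non-ground classification described above, here is a minimal sketch, assuming the laspy package (2.x, with a LAZ backend such as lazrs) and the standard ASPRS convention that class code 2 marks ground points; the file name is hypothetical.

    # Minimal sketch: read a LAZ tile and split ground from non-ground points.
    # Assumes laspy 2.x with a LAZ backend (pip install "laspy[lazrs]").
    # The tile name is hypothetical; class 2 = ground is the ASPRS convention.
    import numpy as np
    import laspy

    las = laspy.read("laser_scan_tile.laz")

    points = np.vstack((las.x, las.y, las.z)).T      # georeferenced coordinates
    classes = np.asarray(las.classification)         # per-point class codes

    ground = points[classes == 2]                    # ground points
    non_ground = points[classes != 2]                # vegetation, buildings, etc.

    print(f"{len(ground)} ground points, {len(non_ground)} non-ground points")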

  3. Wind WAVES TDSF Dataset

    • data.niaid.nih.gov
    Updated Jul 10, 2024
    Cite
    Wilson III, Lynn B (2024). Wind WAVES TDSF Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3911204
    Explore at:
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    NASA Goddard Space Flight Center
    Authors
    Wilson III, Lynn B
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wind Spacecraft:

    The Wind spacecraft (https://wind.nasa.gov) was launched on November 1, 1994 and currently orbits the first Lagrange point between the Earth and Sun. A comprehensive review can be found in Wilson et al. [2021]. It carries a suite of instruments, from gamma-ray detectors to quasi-static magnetic field instruments measuring Bo. The instruments used for this data product are the fluxgate magnetometer (MFI) [Lepping et al., 1995] and the radio receivers (WAVES) [Bougeret et al., 1995]. The MFI measures the 3-vector Bo at ~11 samples per second (sps); WAVES observes electromagnetic radiation from ~4 kHz to >12 MHz, which provides an observation of the upper hybrid line (also called the plasma line) used to define the total electron density, and also takes time-series snapshot/waveform captures of electric and magnetic field fluctuations, called TDS bursts herein.

    WAVES Instrument:

    The WAVES experiment [Bougeret et al., 1995] on the Wind spacecraft is composed of three orthogonal electric field antennas and three orthogonal search coil magnetometers. The electric fields are measured through five different receivers: the Low Frequency FFT receiver called FFT (0.3 Hz to 11 kHz), the Thermal Noise Receiver called TNR (4-256 kHz), Radio receiver band 1 called RAD1 (20-1040 kHz), Radio receiver band 2 called RAD2 (1.075-13.825 MHz), and the Time Domain Sampler (TDS). The electric field antennas are dipole antennas, with two orthogonal antennas in the spin plane and one spin-axis stacer antenna.

    The TDS receiver allows one to examine the electromagnetic waves observed by Wind as time-series waveform captures. There are two modes of operation, TDS Fast (TDSF) and TDS Slow (TDSS). TDSF returns 2048 data points for two channels of the electric field, typically Ex and Ey (i.e. spin-plane components), with little to no gain below ~120 Hz (the data herein have been high-pass filtered above ~150 Hz for this reason). TDSS returns four channels with three electric (magnetic) field components and one magnetic (electric) component. The search coils show a gain roll-off at ~3.3 Hz [e.g., see Wilson et al., 2010; Wilson et al., 2012; Wilson et al., 2013 and references therein for more details].

    The original calibration of the electric field antenna found that the effective antenna lengths are roughly 41.1 m, 3.79 m, and 2.17 m for the X, Y, and Z antenna, respectively. The +Ex antenna was broken twice during the mission as of June 26, 2020. The first break occurred on August 3, 2000 around ~21:00 UTC and the second on September 24, 2002 around ~23:00 UTC. These breaks reduced the effective antenna length of Ex from ~41 m to 27 m after the first break and ~25 m after the second break [e.g., see Malaspina et al., 2014; Malaspina & Wilson, 2016].

    TDS Bursts:

    TDS bursts are waveform captures/snapshots of electric and magnetic field data. The data is triggered by the largest amplitude waves which exceed a specific threshold and are then stored in a memory buffer. The bursts are ranked according to a quality filter which mostly depends upon amplitude. Due to the age of the spacecraft and ubiquity of large amplitude electromagnetic and electrostatic waves, the memory buffer often fills up before dumping onto the magnetic tape drive. If the memory buffer is full, then the bottom ranked TDS burst is erased every time a new TDS burst is sampled. That is, the newest TDS burst sampled by the instrument is always stored and if it ranks higher than any other in the list, it will be kept. This results in the bottom ranked burst always being erased. Earlier in the mission, there were also so called honesty bursts, which were taken periodically to test whether the triggers were working properly. It was found that the TDSF triggered properly, but not the TDSS. So the TDSS was set to trigger off of the Ex signals.

    A TDS burst from the Wind/WAVES instrument is always 2048 time steps for each channel. The sample rate for TDSF bursts ranges from 1875 samples/second (sps) to 120,000 sps. Every TDS burst is marked with a unique set of numbers (unique on any given date) to help distinguish it from others and to ensure any set of channels is appropriately connected to the others. For instance, during one spacecraft downlink interval there may be 95% of the TDS bursts with a complete set of channels (i.e., TDSF has two channels, TDSS has four) while the remaining 5% can be missing channels (just example numbers, not quantitatively accurate). During another downlink interval, those missing channels may be returned if they have not been overwritten. During every downlink, the flight operations team at NASA Goddard Space Flight Center (GSFC) generates level zero binary files from the raw telemetry data. Those files are filled with data received on that date and the file name is labeled with that date. There is no attempt to sort the data chronologically, so any given level zero file can contain data from multiple dates. Thus, it is often necessary to load upwards of five days of level zero files to find as many full channel sets as possible. The remaining unmatched channel sets comprise a much smaller fraction of the total.

    All data provided here are from TDSF, so only two channels. Most of the time channel 1 will be associated with the Ex antenna and channel 2 with the Ey antenna. The data are provided in the spinning instrument coordinate basis with associated angles necessary to rotate into a physically meaningful basis (e.g., GSE).

    TDS Time Stamps:

    Each TDS burst is tagged with a time stamp called a spacecraft event time, or SCET. The TDS datation time is sampled after the burst is acquired, which requires a delay buffer. The datation time requires two corrections. The first correction arises from tagging the TDS datation with an associated spacecraft major frame in housekeeping (HK) data. The second correction removes the delay buffer duration. Both inaccuracies are essentially artifacts of ground-derived values in the archives created by the WINDlib software (K. Goetz, personal communication, 2008) found at https://github.com/lynnbwilsoniii/Wind_Decom_Code.

    The WAVES instrument's HK mode sends relevant low-rate science data back to the ground once every spacecraft major frame. If multiple TDS bursts occur in the same major frame, it is possible for the WINDlib software to assign them the same SCETs. The reason is that this top-level SCET is only accurate to within +300 ms (in 120,000 sps mode) due to the issues described above (at lower sample rates, the error can be slightly larger). The time stamp uncertainty is a positive definite value because it results from digitization rounding errors. One can correct these issues to within +10 ms by using the proper HK data.

    *** The data stored here have not corrected the SCETs! ***

    The 300 ms uncertainty, due to the HK corrections mentioned above, results from WINDlib trying to recreate the time stamp after it has been telemetered back to the ground. If a burst stays in the TDS buffer for an extended period of time (i.e., >2 days), the interpolation done by WINDlib can make mistakes in the 11th significant digit. The positive definite nature of this uncertainty is due to rounding errors associated with the onboard DPU (digital processing unit) clock rollover. The DPU clock is a 24-bit integer clock sampling at ~50,018.8 Hz. The clock rolls over at ~5366.691244092221 seconds, i.e., (16 * 2^24)/50,018.8. The sample rate is temperature sensitive and thus subject to change over time. From a sample of 384 different points on 14 different days, a statistical estimate of the rollover time is 5366.691124061162 ± 0.000478370049 seconds (calculated by Lynn B. Wilson III, 2008). Note that the WAVES instrument team used UR8 times, which are the number of 86,400-second days from 1982-01-01/00:00:00.000 UTC.
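
    As a purely numerical illustration of the constants quoted above, the sketch below reproduces the nominal rollover period from the DPU clock description and converts a UR8 time (86,400-second days since 1982-01-01 UTC) to UTC; it is plain arithmetic on the figures given here, not WINDlib code.

    # Arithmetic on the DPU clock and UR8 constants described above.
    from datetime import datetime, timedelta, timezone

    DPU_CLOCK_RATE_HZ = 50_018.8                    # nominal DPU clock rate
    ROLLOVER_S = (16 * 2**24) / DPU_CLOCK_RATE_HZ   # ~5366.69 s, as quoted above

    UR8_EPOCH = datetime(1982, 1, 1, tzinfo=timezone.utc)

    def ur8_to_utc(ur8_days: float) -> datetime:
        """Convert a UR8 time (86,400-second days since 1982-01-01 UTC) to UTC."""
        return UR8_EPOCH + timedelta(days=ur8_days)

    def dpu_ticks_to_seconds(ticks: float) -> float:
        """Convert a DPU clock count difference to seconds."""
        return ticks / DPU_CLOCK_RATE_HZ

    print(f"DPU clock rollover period: {ROLLOVER_S:.6f} s")
    print(ur8_to_utc(10_000.5))  # -> 2009-05-19 12:00:00+00:00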

    The method to correct the SCETs to within +10 ms, were one to do so, is given as follows (a code sketch follows these steps):

    Retrieve the DPU clock times, SCETs, UR8 times, and DPU Major Frame Numbers from the WINDlib libraries on the VAX/ALPHA systems for the TDSS(F) data of interest.

    Retrieve the same quantities from the HK data.

    Match the HK event number with the same DPU Major Frame Number as the TDSS(F) burst of interest.

    Find the difference in DPU clock times between the TDSS(F) burst of interest and the HK event with matching major frame number (Note: The TDSS(F) DPU clock time will always be greater than the HK DPU clock if they are the same DPU Major Frame Number and the DPU clock has not rolled over).

    Convert the difference to a UR8 time and add this to the HK UR8 time. The new UR8 time is the corrected UR8 time to within +10 ms.

    Find the difference between the new UR8 time and the UR8 time WINDlib associates with the TDSS(F) burst. Add the difference to the DPU clock time assigned by WINDlib to get the corrected DPU clock time (Note: watch for the DPU clock rollover).

    Convert the new UR8 time to a SCET using either the IDL WINDlib libraries or TMLib (STEREO S/WAVES software) libraries of available functions. This new SCET is accurate to within +10 ms.
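
    Strictly as an illustration of the bookkeeping in the steps above, here is a hedged sketch in Python. All inputs (DPU clock counts, UR8 times) are assumed to have already been pulled from WINDlib and the HK data as described in the first three steps; the final UR8-to-SCET conversion via WINDlib or TMLib is not reproduced here.

    # Sketch of the SCET correction arithmetic described in the steps above.
    # Inputs are assumed to come from WINDlib and the HK data (not shown here).
    DPU_CLOCK_RATE_HZ = 50_018.8
    ROLLOVER_TICKS = 16 * 2**24        # rollover period in clock counts (per the figure above)
    SECONDS_PER_UR8_DAY = 86_400.0

    def corrected_times(tds_dpu, hk_dpu, hk_ur8, tds_ur8_windlib, tds_dpu_windlib):
        """Return (corrected UR8 time, corrected DPU clock time) for one TDS burst.

        tds_dpu, hk_dpu   : DPU clock counts of the burst and of the matching HK event
        hk_ur8            : UR8 time of the HK event
        tds_ur8_windlib   : UR8 time WINDlib assigned to the burst
        tds_dpu_windlib   : DPU clock time WINDlib assigned to the burst
        """
        # Difference in DPU clock counts, guarding against a clock rollover.
        dticks = tds_dpu - hk_dpu
        if dticks < 0:
            dticks += ROLLOVER_TICKS

        # Convert the difference to UR8 days and add it to the HK UR8 time.
        ur8_corrected = hk_ur8 + (dticks / DPU_CLOCK_RATE_HZ) / SECONDS_PER_UR8_DAY

        # Shift the WINDlib DPU clock time by the same offset, expressed in counts.
        dur8 = ur8_corrected - tds_ur8_windlib
        dpu_corrected = tds_dpu_windlib + dur8 * SECONDS_PER_UR8_DAY * DPU_CLOCK_RATE_HZ

        # The corrected UR8 time can then be converted to a SCET with WINDlib/TMLib.
        return ur8_corrected, dpu_corrected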

    One can find a UR8 to UTC conversion routine at https://github.com/lynnbwilsoniii/wind_3dp_pros in the ~/LYNN_PRO/Wind_WAVES_routines/ folder.

    Examples of good waveforms can be found in the notes PDF at https://wind.nasa.gov/docs/wind_waves.pdf.

    Data Set Description

    Each Zip file contains 300+ IDL save files, one for each day of the year with available data. This data set is not complete, as the software used to retrieve and calibrate these TDS bursts did not have sufficient error handling for some of the more nuanced bit errors or major frame errors in some of the level zero files. There is currently (as of June 27, 2020) an effort (by Keith Goetz et al.) to generate the entire TDSF and TDSS data set in one repository to be put on SPDF/CDAWeb as CDF files. Once that data set is available, it will supersede

  4. ERA5 monthly averaged data on single levels from 1940 to present

    • cds.climate.copernicus.eu
    grib
    Updated Nov 6, 2025
    Cite
    ECMWF (2025). ERA5 monthly averaged data on single levels from 1940 to present [Dataset]. http://doi.org/10.24381/cds.f17050d7
    Explore at:
    grib. Available download formats
    Dataset updated
    Nov 6, 2025
    Dataset provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    Authors
    ECMWF
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis.

    Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.

    ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

    ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later. Should this occur, users are notified.

    The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines.

    Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper-air fields) and single levels (atmospheric, ocean-wave and land-surface quantities). The present entry is "ERA5 monthly mean data on single levels from 1940 to present".
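
    For readers who want to retrieve this record programmatically, a hedged sketch using the cdsapi client is shown below. The dataset identifier and request keys are assumptions based on the usual Climate Data Store conventions and should be checked against the catalogue entry; a CDS account and ~/.cdsapirc key are required.

    # Sketch of a CDS request for ERA5 monthly means on single levels.
    # The dataset ID and request keys are assumptions; verify them in the CDS catalogue.
    import cdsapi

    client = cdsapi.Client()  # reads the API key from ~/.cdsapirc

    client.retrieve(
        "reanalysis-era5-single-levels-monthly-means",   # assumed dataset ID
        {
            "product_type": "monthly_averaged_reanalysis",
            "variable": "2m_temperature",
            "year": "2020",
            "month": ["01", "02", "03"],
            "time": "00:00",
            "format": "grib",      # matches the GRIB format listed above
        },
        "era5_monthly_t2m_2020.grib",
    )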

  5. Mortality Statistics in US Cities

    • kaggle.com
    zip
    Updated Jan 23, 2023
    Cite
    The Devastator (2023). Mortality Statistics in US Cities [Dataset]. https://www.kaggle.com/datasets/thedevastator/mortality-statistics-in-us-cities
    Explore at:
    zip (96624 bytes). Available download formats
    Dataset updated
    Jan 23, 2023
    Authors
    The Devastator
    Area covered
    United States
    Description

    Mortality Statistics in US Cities

    Deaths by Age and Cause of Death in 2016

    By Health [source]

    About this dataset

    This dataset contains mortality statistics for 122 U.S. cities in 2016, providing detailed information about all deaths that occurred due to any cause, including pneumonia and influenza. The data is voluntarily reported by cities with populations of 100,000 or more, and it includes the place of death and the week during which the death certificate was filed. Data is broken down by age group and includes a flag indicating the reliability of each record to help inform analysis. Each row also provides longitude and latitude information for each reporting area to make further analysis easier. These comprehensive mortality statistics are an invaluable resource for tracking disease trends and for making comparisons between areas across the country in order to identify public health risks quickly and effectively.


    How to use the dataset

    This dataset contains mortality rates for 122 U.S. cities in 2016, including deaths by age group and cause of death. The data can be used to study various trends in mortality and contribute to the understanding of how different diseases impact different age groups across the country.

    In order to use the data, first identify which variables you would like to use from this dataset. These include: reporting area; MMWR week; All causes by age greater than 65 years; All causes by age 45-64 years; All causes by age 25-44 years; All causes by age 1-24 years; All causes less than 1 year old; Pneumonia and Influenza total fatalities; Location (1 & 2); and a flag indicating the reliability of the data.

    Once you have identified the variables you are interested in, filter the dataset so that it only includes information relevant to your analysis or research purposes. For example, if you are looking at trends between different ages, then all you need is information on those three specific cause groups (greater than 65, 45-64 and 25-44). You can do this with a selection tool that picks only certain columns from your data set, or with an Excel filter tool if your data is stored as a CSV file.

    The next step is preparing your data. This is important for efficient analysis, and it is especially helpful when there are too many variables or columns, which can confuse the analysis: eliminate unnecessary columns, rename column labels where needed, and so on. In addition, clean up any missing values, outliers or incorrect entries before further investigation; outliers or corrupt entries may lead to incorrect conclusions. Once the cleaning steps are complete, it is safe to move on to drawing insights.

    The last step involves using statistical methods, such as linear regression with multiple predictors or descriptive measures such as the mean and median, to draw key insights from the analysis and generate actionable points.

    With these steps taken care of, it becomes easier for anyone who decides to dive into another project involving this dataset to build on the work done in previous investigations.
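
    As a concrete starting point for the workflow described above, here is a short pandas sketch. The column names are placeholders inferred from the variable list on this page; the exact headers should be taken from rows.csv itself.

    # Sketch of the load -> select -> clean -> summarize workflow described above.
    # Column names are placeholders; check them against the header of rows.csv.
    import pandas as pd

    df = pd.read_csv("rows.csv")

    # Keep only the columns needed for an age-group comparison (hypothetical names).
    cols = [
        "Reporting Area",
        "MMWR Week",
        "All causes, by age >=65",
        "All causes, by age 45-64",
        "All causes, by age 25-44",
    ]
    subset = df[[c for c in cols if c in df.columns]].copy()

    # Basic cleaning: drop rows with missing counts before summarizing.
    subset = subset.dropna()

    # Simple summary statistics per reporting area.
    if "Reporting Area" in subset.columns:
        print(subset.groupby("Reporting Area").mean(numeric_only=True).head())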

    Research Ideas

    • Creating population health profiles for cities in the U.S.
    • Tracking public health trends across different age groups
    • Analyzing correlations between mortality and geographical locations

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: Dataset copyright by authors.

    You are free to:
    • Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt - remix, transform, and build upon the material for any purpose, even commercially.

    You must:
    • Give appropriate credit - provide a link to the license, and indicate if changes were made.
    • ShareAlike - distribute your contributions under the same license as the original.
    • Keep intact - all notices that refer to this license, including copyright notices.

    Columns

    File: rows.csv | Column name | Description | |:--------------------------------------------|:-----------------------------------...

  6. ERA5 monthly averaged data on pressure levels from 1940 to present

    • cds.climate.copernicus.eu
    grib
    Updated Nov 6, 2025
    + more versions
    Cite
    ECMWF (2025). ERA5 monthly averaged data on pressure levels from 1940 to present [Dataset]. http://doi.org/10.24381/cds.6860a573
    Explore at:
    grib. Available download formats
    Dataset updated
    Nov 6, 2025
    Dataset provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    Authors
    ECMWF
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis.

    Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.

    ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

    ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later, and users are notified should this occur. So far this has only been the case for September 2021, and it will also be the case for October, November and December 2021. For months prior to September 2021 the final release has always been equal to ERA5T, and the goal is to align the two again after December 2021.

    The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines.

    Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper-air fields) and single levels (atmospheric, ocean-wave and land-surface quantities). The present entry is "ERA5 monthly mean data on pressure levels from 1940 to present".
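
    As with the single-levels entry above, a hedged cdsapi sketch is shown below; the dataset identifier and request keys are assumptions based on CDS conventions, with the pressure_level key being the main difference from the single-levels request.

    # Sketch of a CDS request for ERA5 monthly means on pressure levels.
    # The dataset ID and request keys are assumptions; verify them in the CDS catalogue.
    import cdsapi

    client = cdsapi.Client()

    client.retrieve(
        "reanalysis-era5-pressure-levels-monthly-means",  # assumed dataset ID
        {
            "product_type": "monthly_averaged_reanalysis",
            "variable": "temperature",
            "pressure_level": ["500", "850"],   # hPa levels (upper-air fields)
            "year": "2020",
            "month": "01",
            "time": "00:00",
            "format": "grib",
        },
        "era5_monthly_pl_2020.grib",
    )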

  7. Inflation Data

    • dataverse.unc.edu
    • dataverse-staging.rdmc.unc.edu
    Updated Oct 9, 2022
    Cite
    UNC Dataverse (2022). Inflation Data [Dataset]. http://doi.org/10.15139/S3/QA4MPU
    Explore at:
    Dataset updated
    Oct 9, 2022
    Dataset provided by
    UNC Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is not going to be an article or op-ed about Michael Jordan. Since 2009 we've been in the longest bull market in history: that's 11 years and counting. However, a few metrics, like the stock market P/E, the call-to-put ratio and of course the Shiller P/E, suggest a great crash is coming, somewhere between the levels of 1929 and the dot-com bubble. Mean reversion is historically inevitable, and the Fed's money-printing experiment could end in disaster for the stock market in late 2021 or 2022. You can read Jeremy Grantham's Last Dance article here. You are likely well aware of Michael Burry's predicament as well. It's easier for you just to skim through two related videos on this topic of a stock market crash: Michael Burry's warning (see this YouTube video) and Jeremy Grantham's warning (see this YouTube video). Typically when there is a major event in the world, there is a crash, then a bear market, and a recovery that takes many, many months. In March 2020 that's not what we saw, since the Fed did some astonishing things that meant a glut of liquidity and the risk of a major inflation event. The pandemic represented the quickest decline of at least 30% in the history of the benchmark S&P 500, but the recovery was not correlated to anything but Fed intervention. Since the pandemic clearly isn't disappearing and many sectors such as travel, business travel, tourism and supply chains appear significantly disrupted, the so-called economic recovery isn't so great. And there's this little problem at the heart of global capitalism today: the stock market just keeps going up. Crashes and corrections typically occur frequently in a normal market, but Fed liquidity and irresponsible money printing are creating a scenario where normal behavior isn't occurring in the markets. According to data provided by market analytics firm Yardeni Research, the benchmark index has undergone 38 declines of at least 10% since the beginning of 1950. Since March 2020 we've barely seen a down month; September 2020 was flat-ish. The S&P 500 has more than doubled since those lows. Look at the angle of the curve: the S&P 500 was 735 at the low in 2009, so in this bull market alone it has gone up 6x in valuation. That's not a normal cycle, and it could mean we are due for an epic correction. I have to agree with the analysts who claim that the long, long bull market since 2009 has finally matured into a fully fledged epic bubble. There is a complacency, a buy-the-dip frenzy and a general meme environment around what BigTech can do in such conditions. The combined weight of Apple, Amazon, Alphabet, Microsoft, Facebook, Nvidia and Tesla in the S&P and Nasdaq is approaching a ridiculous level. When these stocks are seen as growth, value and unbeatable-moat companies all at once, the entire dynamics of the stock market begin to break down. Check out FANG during the pandemic. BigTech is seen as bullet-proof; meme valuations and hysterical speculative behavior lead to even higher highs, even as 2020 offered many younger people an on-ramp into investing for the first time. Some analysts at JP Morgan are even saying that until retail investors stop charging into stocks, markets probably don't have too much to worry about. Hedge funds with payment for order flow can predict exactly how these retail investors are behaving and monetize them. PFOF might even have to be banned by the SEC. The risk-on market theoretically just keeps going up until the Fed raises interest rates, which could be in 2023!
    For some context, we're more than 1.4 years removed from the bear-market bottom of the coronavirus crash and haven't had even a 5% correction in nine months. This is the most over-priced the market has likely ever been. At the height of the dot-com bubble the S&P 500 was only 1,400; today it is 4,500, not so many years later. Clearly something is not quite right if you look at history and the P/E ratios. A market pumped with liquidity produces higher earnings with historically low interest rates; it's an environment where dangerous things can occur. In late 1997, as the S&P 500 passed its previous 1929 peak of 21x earnings, that seemed like a lot, but it is nothing compared to today. For some context, the S&P 500 Shiller P/E closed last week at 38.58, which is nearly a two-decade high. It's also well over double the average Shiller P/E of 16.84, dating back 151 years. So the stock market is likely around 2x over-valued. Try to think rationally about what this means for valuations today and for your favorite stock prices: what should they be in historical terms? The S&P 500 is up 31% in the past year. It will likely hit 5,000 before a correction, given the amount of liquidity added to the system and the QE the Fed is using, which amounts to a huge abuse of MMT, or Modern Monetary Theory. This has also led to bubbles in the housing market, crypto and even commodities like gold, while long-term global GDP faces many headwinds in the years ahead due to a demographic shift toward an ageing population and significant technological automation. So if you think that stocks or equities or ETFs are the best place to put your money in 2022, you might want to think again. The crash of the OTC and small-cap market since February 2021 has been quite an indication of what a correction looks like. According to the Motley Fool, what happens after major downturns in the market, historically speaking? In each of the previous four instances in which the S&P 500's Shiller P/E shot above and sustained 30, the index lost anywhere from 20% to 89% of its value. That is what we too are due for; reversion to the mean will be realistically brutal after the Fed's hyper-extreme intervention has run its course. Of course, what the Fed stimulus has really done is simply allow the 1% to get a whole lot richer, to the point of wealth inequality spiraling out of control in the decades ahead, likely leading us to a dystopia: an unfair and unequal version of BigTech capitalism. This has also led to a trend of short squeezes in these tech stocks, as shown in recent years' data. Of course the Fed has to say that it has done all of these things for the people, the employment numbers and the labor market. Women in the workplace have likely been set back 15 years in social progress due to the pandemic and the Fed's response. While the 89% lost during the Great Depression would be virtually impossible today thanks to ongoing intervention from the Federal Reserve and Capitol Hill, a correction of 20% to 50% would be pretty fair and would simply return the curve to a normal trajectory as interest rates eventually go back up in the 2023 to 2025 period. It's very unlikely the market has taken Fed tapering into account (priced it in), since the euphoria of a can't-miss market just keeps pushing the markets higher. But all good things must come to an end. Earlier this month, the U.S. Bureau of Labor Statistics released inflation data from July. This report showed that the Consumer Price Index for All Urban Consumers rose 5.2% over the past 12 months.
    While the Fed and economists promise us this inflation is temporary, others are not so certain. As so much money is printed, the money you have is worth less and certain goods cost more. Wage gains in some industries cannot be taken back; they are permanent in service sectors like restaurants, hospitality and travel, which have been among the hardest hit. The pandemic has led to a paradigm shift in the future of work, and that too is not temporary. The Great Resignation means white-collar jobs will be more work-from-home than ever before, with a new software revolution, different transport and energy behaviors and so forth. Climate change alone could slow down global GDP in the 21st century. How can inflation be temporary when so many trends don't appear to be temporary? Sure, the price of lumber or used cars could be temporary, but a global chip shortage is exacerbating problems in the automobile sector. The stock market isn't even behaving like it cares about anything other than the Fed and its billions of dollars of bond buying each month. Some central banks (like the European) will start to taper around December 2021. However, Delta could further mutate into a variant that makes the first generation of vaccines less effective. Such a macro event could be enough to trigger the correction we've been speaking about. So stay safe, and keep your money safe. The last dance of the 2009 bull market could feel especially painful because we've been spoiled for so long in the markets. We can barely remember what March 2020 felt like. Some people sold their life savings simply due to scare tactics by the likes of Bill Ackman. His scare tactics on CNBC likely won him hundreds of millions as the stock market tanked. Hedge funds further gamed the Reddit and GameStop movement, orchestrating it and leading new retail investors into meme speculation and a whole bunch of other unsavory things, like options trading at a scale we've never seen before. It's not just inflation and higher interest rates; it's how absurdly high valuations have become. Still, correlation does not imply causation. Just because inflation has picked up, it doesn't guarantee that stocks will head lower. Nevertheless, weaker buying power associated with higher inflation can't be overlooked as a potential negative for the U.S. economy and equities. The current S&P 500 10-year P/E ratio is 38.7. This is 97% above the modern-era market average of 19.6, putting the current P/E 2.5 standard deviations above the modern-era average. This is just math, folks: history is saying the stock market is at 2x its true value. So why, and who, would be all-in on the market or on an asset class like crypto that is mostly speculative in nature to begin with? Study the following on a historical basis, and do your own due diligence as to the health of the markets: the debt-to-GDP ratio and the call-to-put ratio.

  8. North American Atlas - Populated Places Collection

    • data.wu.ac.at
    esri arc export, gml +1
    Updated Oct 10, 2013
    + more versions
    Cite
    Canada (2013). North American Atlas - Populated Places Collection [Dataset]. https://data.wu.ac.at/schema/datahub_io/N2NhMTVlZmQtOGI4Yi00MDg4LTk0MjctNGViMTRlOTVkNTU1
    Explore at:
    gml, shape, esri arc export. Available download formats
    Dataset updated
    Oct 10, 2013
    Dataset provided by
    Canada
    Area covered
    North America
    Description

    A joint venture involving the National Atlas programs in Canada (Natural Resources Canada), Mexico (Instituto Nacional de Estadística Geografía e Informática), and the United States (U.S. Geological Survey), as well as the North American Commission for Environmental Co-operation, has led to the release (June 2004) of several new products: an updated paper map of North America, and its associated geospatial data sets and their metadata. These data sets are available online from each of the partner countries both for visualization and download. The North American Atlas data are standardized geospatial data sets at 1:10,000,000 scale. A variety of basic data layers (e.g. roads, railroads, populated places, political boundaries, hydrography, bathymetry, sea ice and glaciers) have been integrated so that their relative positions are correct. This collection of data sets forms a base with which other North American thematic data may be integrated. Any data outside of Canada, Mexico, and the United States of America included in the North American Atlas data sets is strictly to complete the context of the data. The North American Atlas - Populated Places data set shows a selection of named populated places suitable for use at a scale of 1:10,000,000. Places, which refer to individual municipalities, are always shown using point symbols. These symbols have been fitted to the North American Atlas roads, railroads, and hydrography layers, so that the points represent the approximate locations of places relative to data in these other layers. The selection of populated places was based on local importance (as shown by population size), importance as a cross-border point, and, occasionally, on other factors. All capital cities (national, provincial, territorial or State) are shown for Canada, Mexico, and the United States of America. Attributes were added to the data to reflect population class, name, and capital. Cartographic considerations were taken into account so that names do not overlap in crowded areas, nor are there too many names shown for sparsely-populated areas.

  9. School Student Health and Wellbeing

    • kaggle.com
    zip
    Updated Jan 29, 2023
    Cite
    The Devastator (2023). School Student Health and Wellbeing [Dataset]. https://www.kaggle.com/thedevastator/school-student-health-and-wellbeing
    Explore at:
    zip (4585462 bytes). Available download formats
    Dataset updated
    Jan 29, 2023
    Authors
    The Devastator
    Description

    School Student Health and Wellbeing

    Physical Activity, Nutrition, Lifestyle, and Emotional Health Behaviors

    By data.world's Admin [source]

    About this dataset

    The ‘My Health, My School’ (MHMS) annual school survey is an invaluable online survey tool for primary and secondary school pupils. It is available to students in Years 5, 6, 7, 9 and 11, with the survey now extended to include SILC and post-16 settings too.

    This survey allows schools to measure student health behaviours in order to improve their well-being, as it provides an instant snapshot of data on important areas such as healthy eating, physical activity and sport, drug, alcohol and tobacco consumption, and sexual health, amongst others.

    Where blank sections appear in the data set, it does not mean that there is no information; those questions were simply directed at different year groups. This dataset provides a comprehensive picture of how school-age students perceive their own health and well-being across key areas such as societal values; mental states, including feelings associated with being bullied or worried; and physical activity levels, measured in active minutes per week or by whether they take part in certain activities either through school/college or outside it. Schools can also measure drug use on a weekly basis, as well as games clubs they offer that are enjoyed during break times or lunchtimes. All of this taken together should help a school develop a more holistic view of its students' individual needs, both for their physical health and for the day-to-day triggers that may affect their mental well-being within the learning environment.


    How to use the dataset

    • Explore the available columns - The first step when using this dataset is to become familiar with the available columns. Take some time to read through the data points included in the table so that you know what information you have access to and can plan your analysis accordingly.
    • Clean your data - There might be missing or inaccurate data points in the table, which will negatively affect any analysis you do on it. Make sure that all of your data is accurate and complete before beginning any exploration or analysis!
    • Choose summary statistics - If you want to summarize large amounts of data quickly, summary statistics like the mean or median are great options! They give you an overall snapshot of how students responded to each question without having to look through every response individually.
    • Use visualization tools - Visualization tools such as graphs and charts can bring new life to the raw numbers in the table! Seeing trends visually can make patterns in the responses easier to spot than text-based analysis alone.
    • Interpret results relative to British values - Finally, remember why we're here: British values! Once you have conducted your analysis across different questions and data points, be sure to compare the results against relevant aspects of British values, so as not to get lost in individual findings without understanding their context within Britain's broader culture. (A short code sketch follows this list.)
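
    Below is a minimal sketch of the first two steps (exploring the available columns and checking for blanks), assuming pandas and the survey file named in the Columns section; no specific column names are assumed.

    # Sketch: explore columns and missing data in the survey file listed below.
    import pandas as pd

    df = pd.read_csv("school-survey-2018-19-1.csv")

    print(df.shape)                # number of responses and questions
    print(list(df.columns[:20]))   # first few column names

    # Blank sections are expected: questions are directed at specific year groups.
    print(df.isna().mean().sort_values(ascending=False).head(10))

    # Quick numeric summary of the remaining columns.
    print(df.describe().T.head())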

    Research Ideas

    • The dataset can be used to identify any physical activity, social-emotional and mental health issues, or unhealthy behaviours being exhibited by students across different age groups and school/college settings.
    • The data can be analysed to measure the level of understanding of British values among school pupils and the amount of useful information received about them through lessons in school/college.
    • It can also be used to assess how safe students feel in certain places and evaluate the attendance rate of pupils at educational institutions by exploring questions related to missed lessons without ill-health being a factor

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    See the dataset description for more information.

    Columns

    File: school-survey-2018-19-1.csv | Column name | Description | |:---------------------------------------------------------------------------------------------------------|:--------------------------...

  10. Data from: Aircraft Marshaling Signals Dataset of FMCW Radar and Event-Based...

    • data.niaid.nih.gov
    Updated Dec 11, 2023
    + more versions
    Cite
    Leon Müller; Manolis Sifalakis; Sherif Eissa; Amirreza Yousefzadeh; Sander Stuijk; Federico Corradi; Paul Detterer (2023). Aircraft Marshaling Signals Dataset of FMCW Radar and Event-Based Camera for Sensor Fusion [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7656910
    Explore at:
    Dataset updated
    Dec 11, 2023
    Dataset provided by
    IMEC
    Eindhoven University of Technology
    Authors
    Leon Müller; Manolis Sifalakis; Sherif Eissa; Amirreza Yousefzadeh; Sander Stuijk; Federico Corradi; Paul Detterer
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Introduction

    The advent of neural networks capable of learning salient features from variance in the radar data has expanded the breadth of radar applications, often as an alternative sensor or a complementary modality to camera vision. Gesture recognition for command control is arguably the most commonly explored application. Nevertheless, more suitable benchmarking datasets than currently available are needed to assess and compare the merits of the different proposed solutions and to explore a broader range of scenarios than simple hand-gesturing a few centimeters away from a radar transmitter/receiver. Most current publicly available radar datasets used in gesture recognition provide limited diversity, do not provide access to raw ADC data, and are not significantly challenging. To address these shortcomings, we created and make available a new dataset that combines FMCW radar and a dynamic vision camera for 10 aircraft marshalling signals (whole body) at several distances and angles from the sensors, recorded from 13 people. The two modalities are hardware-synchronized using the radar's PRI signal. Moreover, in the supporting publication we propose a sparse encoding of the time-domain (ADC) signals that achieves a dramatic data-rate reduction (>76%) while retaining the efficacy of the downstream FFT processing (<2% accuracy loss on recognition tasks), and can be used to create a sparse event-based representation of the radar data. In this way the dataset can be used as a two-modality neuromorphic dataset.

    Synchronization of the two modalities

    The PRI pulses from the radar have been hard-wired to the event stream of the DVS sensor and timestamped using the DVS clock. Based on this signal, the DVS event stream has been segmented such that groups of events (time-bins) of the DVS are mapped to individual radar pulses (chirps).

    Data storage

    DVS events (x,y coordinates and timestamps) are stored in structured arrays, and one such structured-array object is associated with the data of a radar transmission (pulse/chirp). A radar transmission is a vector of 512 ADC levels that correspond to sampling points of the chirping signal (FMCW radar), which lasts about ~1.3 ms. Every 192 radar transmissions are stacked in a matrix called a radar frame (each transmission is a row in that matrix). A data capture (recording), consisting of some thousands of continuous radar transmissions, is therefore segmented into a number of radar frames. Finally, radar frames and the corresponding DVS structured arrays are stored in separate containers in a custom-made multi-container file format (extension .rad). We provide a (rad file) parser for extracting the data out of these files. There is one file per capture of continuous gesture recording of about 10 s. Note that the number of 192 transmissions per radar frame is an ad-hoc segmentation that suits the purpose of obtaining sufficient signal resolution in a 2D FFT typical of radar signal processing, for the range resolution of this specific radar. It also served the purpose of fast streaming storage of the data during capture. For extracting individual data points from the dataset, however, one can pool together (concatenate) all the radar frames from a single capture file and re-segment them as preferred. The data loader that we provide offers this, with a default of re-segmenting every 769 transmissions (about 1 s of gesturing).

    Data captures directory organization (radar8Ghz-DVS-marshaling_signals_20220901_publication_anonymized.7z)

    The dataset captures (recordings) are organized in a common directory structure which encodes additional metadata about the captures: dataset_dir///--/ofxRadar8Ghz_yyyy-mm-dd_HH-MM-SS.rad

    Identifiers

    • stage: [train, test].
    • room: [conference_room, foyer, open_space].
    • subject: [0-9]. Note that 0 stands for no person, and 1 for an unlabeled, random person (only present in test).
    • gesture: ['none', 'emergency_stop', 'move_ahead', 'move_back_v1', 'move_back_v2', 'slow_down', 'start_engines', 'stop_engines', 'straight_ahead', 'turn_left', 'turn_right'].
    • distance: 'xxx', '100', '150', '200', '250', '300', '350', '400', '450'. Note that xxx is used for 'none' gestures when there is no person present in front of the radar (i.e. background samples), or when a person is walking in front of the radar at varying distances but performing no gesture.

    The test data captures contain both subjects that appear in the train data and previously unseen subjects. Similarly, the test data contain captures from the spaces where the train data were recorded, as well as from a new, unseen open space.

    Files List

    • radar8Ghz-DVS-marshaling_signals_20220901_publication_anonymized.7z - The actual archive bundle with the data captures (recordings).
    • rad_file_parser_2.py - Parser for individual .rad files, which contain capture data.
    • loader.py - A convenience PyTorch Dataset loader (partly Tonic compatible). You practically only need this to quick-start if you don't want to delve too much into code reading. When you instantiate a DvsRadarAircraftMarshallingSignals class object, it automatically downloads the dataset archive and the .rad file parser, unpacks the archive, and imports the .rad parser to load the data. One can then request from it a training set, a validation set and a test set as torch Datasets to work with.
    • aircraft_marshalling_signals_howto.ipynb - Jupyter notebook for exemplary basic use of loader.py.

    Contact

    For further information or questions try contacting first M. Sifalakis or F. Corradi.
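
    A hedged sketch of getting started with the provided loader is shown below. The class name comes from the description above, but the accessor names for the train/validation/test splits and the per-sample structure are assumptions, so check loader.py and the notebook for the actual API.

    # Sketch of using the provided PyTorch loader (see loader.py for the real API).
    # The split accessors and sample layout below are assumptions.
    from torch.utils.data import DataLoader
    from loader import DvsRadarAircraftMarshallingSignals

    # Instantiating the dataset downloads the archive and the .rad parser,
    # unpacks the captures and imports the parser (per the description above).
    dataset = DvsRadarAircraftMarshallingSignals()

    train_set = dataset.get_train()        # hypothetical accessor names
    val_set = dataset.get_validation()
    test_set = dataset.get_test()

    train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
    for radar_frames, dvs_events, label in train_loader:   # assumed sample layout
        print(radar_frames.shape, label)
        break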

  11. Ski jumping results database

    • kaggle.com
    zip
    Updated Jan 9, 2022
    Cite
    Wiktor Florek (2022). Ski jumping results database [Dataset]. https://www.kaggle.com/wrotki8778/ski-jumping-results-database-2009now
    Explore at:
    zip (11389097 bytes). Available download formats
    Dataset updated
    Jan 9, 2022
    Authors
    Wiktor Florek
    License

    GNU General Public License, version 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    Hello. As a big ski jumping fan, I would like to invite everybody to something like a project called the "Ski Jumping Data Center". The primary goal is as below:

    Collect as much data about ski jumping as possible and create as many useful insights from it as possible

    In mid-September last year (12.09.2020) I thought, "Hmm, I don't know of any statistical analyses of ski jumping." In fact, the only easily found public data analysis about SJ that I know of is https://rstudio-pubs-static.s3.amazonaws.com/153728_02db88490f314b8db409a2ce25551b82.html

    The question is: why? This discipline is in fact overloaded with data, but almost nobody has taken the topic seriously. Therefore I decided to start collecting data and analyzing it. However, the amount of work needed to capture the various data (i.e. jumps and results of competitions) was so big, and there are so many ways to use this information, that making it public was the obvious choice. In fact, I plan to expand my database to be as big as possible, but it requires more time and (I hope) more help.

    Content

    The data below is (in a broad sense) created by merging a large number (>6000) of PDFs with the results of almost 4000 ski jumping competitions organized between (roughly) 2009 and 2021. Creating this dataset cost me about 150 hours of coding and parsing data and over 4 months of hard work. My current algorithm can parse the results of new events almost instantly, so this dataset can be easily extended. For details see the GitHub page: https://github.com/wrotki8778/Ski_jumping_data_center The observations contain standard information about every jump - style points, distance, take-off speed, wind etc. The main advantage of this dataset is the number of jumps - it's quite high (at the time of uploading it's almost 250,000 rows), so the data can be analyzed in various ways, even though the number of columns is not so large.
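
    For anyone curious how the result PDFs can be turned into text before parsing, here is a minimal sketch with the tika package credited below; the file name is hypothetical, and the author's actual parsing pipeline lives in the GitHub repository above.

    # Minimal sketch: extract text from a single results PDF with tika.
    # Requires a Java runtime (tika starts a local Tika server); the file name is hypothetical.
    import re
    from tika import parser

    parsed = parser.from_file("competition_results.pdf")
    text = parsed.get("content") or ""

    # Individual jump lines can then be picked apart with regular expressions,
    # e.g. distances written like "128.5 m":
    distances = re.findall(r"\b\d{2,3}\.\d m\b", text)
    print(len(distances), "distance entries found")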

    Acknowledgements

    A big "thank you" goes to the creators of the tika package; without their contribution I probably wouldn't have created this dataset at all.

    Inspiration

    I plan to extract at least a few insights from this data: 1) Are the wind/gate factors well adjusted? 2) How strong is the correlation between distance and style marks? Is the judging always fair? 3) (advanced) Can we create a model that predicts the performance/distance of an athlete in a given competition? Maybe some deep learning model? 4) Which characteristics of athletes (height, weight, etc.) matter most for achieving the best jumps?

  12. Osu! Standard Rankings

    • kaggle.com
    zip
    Updated Jan 30, 2023
    + more versions
    Cite
    Julliane Pierre (2023). Osu! Standard Rankings [Dataset]. https://www.kaggle.com/datasets/jullianepierre/osu-standard-rankings/data
    Explore at:
    zip(3788 bytes)Available download formats
    Dataset updated
    Jan 30, 2023
    Authors
    Julliane Pierre
    Description

    Context:

    osu! is a music rhythm game that has 4 modes (check for more info). In this dataset, you can examine the rankings of the standard mode, taken on 30/01/2023 at around 3 PM. The ranking is based on pp (performance points) awarded after every play, which are influenced by play accuracy and score. The pp values are then summed with weights: your top play awards its full pp, and each subsequent play is weighted by a decreasing percentage (this helps keep the balance between strong players and players who simply play a lot). Many other statistics are included as well.
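    As a rough illustration of that weighting, here is a minimal sketch assuming the commonly cited decay factor of 0.95 per rank (the description above only says the percentage decreases, so the exact factor is an assumption):

    def weighted_pp(top_plays_pp):
        # Sort plays from best to worst and apply a geometric weight per rank.
        plays = sorted(top_plays_pp, reverse=True)
        return sum(pp * 0.95 ** i for i, pp in enumerate(plays))

    # The top play counts in full; each later play contributes progressively less.
    print(weighted_pp([400.0, 395.0, 390.0]))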

    Contents:

    The dataset contains several columns (see below) reporting statistics for every player in the top 100 of the game's standard mode. The ranking is ordered by pp. Some players appear to have the same number of points, but that is only because the decimals are not shown in the ranking chart on the site.

    Variables:

    • rank: global rank (you can use this like an id too)
    • player_name: in-game nickname
    • country: country of origin
    • accuracy: mean accuracy of your top plays
    • play_count: lifetime plays
    • level: level (not very influential on stats)
    • hours: total hours played
    • performance_points: pp which determine the rankings
    • ss: number of ss plays (accuracy=100% and no miss)
    • s: number of s plays (accuracy>=93% and no miss)
    • a: number of a plays (accuracy>=93% but there are misses)
    • watched_by: number of replays of the player watched by others

    Acknowledgements:

    I created this dataset for my upcoming Data Science project.

    I used the 2017 osu! rankings and description by Svidon as a reference to produce this 2023 top-100 osu! ranking as of January 30, 2023.

    The source data is public and accessible at https://osu.ppy.sh/rankings/osu/performance.

    Here is his Kaggle profile: https://www.kaggle.com/svidon

  13. Microsoft Professional Capstone DataSet

    • kaggle.com
    zip
    Updated Oct 22, 2017
    Cite
    Harsh Sharma (2017). Microsoft Professional Capstone DataSet [Dataset]. https://www.kaggle.com/sharmaharsh/microsoft-capstone
    Explore at:
    zip(6858551 bytes)Available download formats
    Dataset updated
    Oct 22, 2017
    Authors
    Harsh Sharma
    Description

    Problem Description

    About the data

    Your goal is to predict a student's earnings a set number of years after they have enrolled in a United States institution of higher education. The data is compiled from a wide range of sources and made publicly available by the United States Department of Education.

    Target Variable

    We're trying to predict the variable income, which represents earnings in thousands of US dollars at a set interval after the student first enrolled.

    Submission Format

    The format for the submission file is two columns, row_id and income. The data type of income is a float, so make sure there is a decimal point in your submission. For example, 0.0 is a valid float; 0 is not.

    For example, if you predicted...

    row_id  income
    2       0.0
    8       0.0
    9       0.0
    10      0.0
    11      0.0

    The first few lines of the .csv file that you submit would look like:

    row_id,income
    2,0.0
    8,0.0
    9,0.0
    10,0.0
    11,0.0
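
    A minimal sketch of producing such a file with the standard library (the predictions dictionary is a hypothetical placeholder):

    import csv

    # Hypothetical placeholder predictions: row_id -> income (thousands of USD).
    predictions = {2: 0.0, 8: 0.0, 9: 0.0, 10: 0.0, 11: 0.0}

    with open("submission.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["row_id", "income"])
        for row_id, income in predictions.items():
            # Format as a float so every value contains a decimal point.
            writer.writerow([row_id, f"{float(income):.1f}"])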
    

    Performance Metric

    We're predicting a numeric quantity, so this is a regression problem. To measure performance, we'll use a metric called root mean squared error (RMSE). It is an error metric, so a lower value is better (as opposed to an accuracy metric, where a higher value is better).

    \[RMSE = \sqrt{\frac{1}{N}\sum_{n=1}^{N} (\hat{y}_n - y_n)^2 }\]

    Where $\hat{y}_n$ is the predicted earnings and $y_n$ is the actual earnings. The best possible score is 0, but the worst possible score can be infinite.
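    A quick NumPy check of the formula above:

    import numpy as np

    def rmse(y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return np.sqrt(np.mean((y_pred - y_true) ** 2))

    # Errors of 2, -2 and 3 give sqrt((4 + 4 + 9) / 3), roughly 2.38.
    print(rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))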

    Features

    There are 297 variables in this dataset. Each row in the dataset represents a United States institution of higher education in a specific year. The dataset we are working with covers four particular years, denoted year_a, year_f, year_w, and year_z. An institution may have a row for all, some, or just one of the years. We don't provide a unique identifier for an individual institution, just a row_id for each row.

    The variables in the dataset have names of the form category_variable, where category is the high-level category of the variable (e.g. academics or students) and variable describes what the specific column contains.

    Categories

    • academics

      • program_assoc_agriculture: Associate degree in Agriculture, Agriculture Operations, And Related Sciences.
      • program_assoc_architecture: Associate degree in Architecture And Related Services.
      • program_assoc_biological: Associate degree in Biological And Biomedical Sciences.
      • program_assoc_business_marketing: Associate degree in Business, Management, Marketing, And Related Support Services.
      • program_assoc_communication: Associate degree in Communication, Journalism, And Related Programs.
      • program_assoc_communications_technology: Associate degree in Communications Technologies/Technicians And Support Services.
      • program_assoc_computer: Associate degree in Computer And Information Sciences And Support Services.
      • program_assoc_construction: Associate degree in Construction Trades.
      • program_assoc_education: Associate degree in Education.
      • program_assoc_engineering: Associate degree in Engineering.
      • program_assoc_engineering_technology: Associate degree in Engineering Technologies And Engineering-Related Fields.
      • program_assoc_english: Associate degree in English Language And Literature/Letters.
      • program_assoc_ethnic_cultural_gender: Associate degree in Area, Ethnic, Cultural, Gender, And Group Studies.
      • program_assoc_family_consumer_science: Associate degree in Family And Consumer Sciences/Human Sciences.
      • program_assoc_health: Associate degree in Health Professions And Related Programs.
      • program_assoc_history: Associate degree in History.
      • program_assoc_humanities: Associate degree in Liberal Arts And Sciences, General Studies And Humanities.
      • program_assoc_language: Associate degree in Foreign Languages, Literatures, And Linguistics.
      • program_assoc_legal: Associate degree in Legal Professions And Studies.
      • program_assoc_library: Associate degree in Library Science.
      • program_assoc_mathematics: Associate degree in Mathematics And Statistics.
      • program_assoc_mechanic_repair_technology: Associate degree in Mechanic And Repair Technologies/Technicians.
      • program_assoc_military: Associate degree in Military Technologies And Applied Sciences. ...
  14. League Of Legends Data

    • kaggle.com
    zip
    Updated Oct 5, 2022
    Cite
    Preston Robertson (2022). League Of Legends Data [Dataset]. https://www.kaggle.com/datasets/prestonrobertson7/league-of-legends-data-9292022
    Explore at:
    zip(15143027 bytes)Available download formats
    Dataset updated
    Oct 5, 2022
    Authors
    Preston Robertson
    Description

    Basic Data Description

    X matches of the most recent League of Legends games as of the date specified in the CSV file. Each game has 10 separate players, and each recorded player has 68 features.

    Here, X is the number of matches I am able to pull at a given time, with a maximum of 10,000 matches.

    League of Legends

    1. Data Set Description

    1.1. Introduction to the Data Set

    This data set is about an online competitive game called League of Legends. I chose this data set to challenge myself; its unique nature requires me to apply techniques I have learned in classes this semester while also teaching myself and applying new ones. The objective is to find the features that most impact a win, so that it becomes easier to balance the game. "Balancing" refers to updating the game, for example by weakening strategies that are too strong and strengthening strategies that feel too weak. This document serves as a preliminary step toward champion balance by correlating specific stats with winning, so that in the future someone may correlate these stats with the champions in the game. This goal may seem convoluted; however, each game has 5 winners and 5 losers, meaning that a single champion's impact on the game is roughly 10%. If a champion is unbalanced, say double the strength of another champion, that only raises the 10% to 20%. Due to the law of averages, each champion will still have around a 50% win rate despite being too weak or too strong. Therefore, to properly balance a champion, one must look at the correlation between the champion's stats and the win rate associated with each of those stats. This effectively removes the problem of bias, since every game has at least one winner and one loser. This dataset has 23,752 data points and 24 features (or columns). Features refer to a measurable piece of data, such as champion name, damage done, etc.; they do not refer to game features but rather to the names of the data being measured. It is a complicated dataset, with several variables requiring several stages of feature modification before the code can run, and the code is large enough to be significant. This dataset was specifically chosen because of my prior familiarity with the data, allowing me to focus on the machine learning techniques.
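    As a purely illustrative sketch of that correlation idea (the file and column names are hypothetical and must be matched to the actual CSV header):

    import pandas as pd

    # Hypothetical file and column names; adjust to the actual dataset.
    df = pd.read_csv("league_of_legends_matches.csv")
    stats = ["gold_earned", "damage_dealt", "kills", "deaths", "assists"]

    # Correlate each per-player stat with the win outcome (assumed to be a 0/1 column).
    correlations = (
        df[stats + ["win"]].corr()["win"].drop("win").sort_values(ascending=False)
    )
    print(correlations)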

    1.2. Brief Description of League of Legends

    The online competitive multiplayer game League of Legends is part of the MOBA genre and is considered the most popular game of all time. Its most recent tournaments have made more money than the Super Bowl; in 2020 League of Legends made 1.7 billion dollars. The premise of the game is two teams fighting to destroy the enemy Nexus. The map of the game (Figure 1: Map of League of Legends [1]) makes it easier to reference the variables given.

    1.3. Description of the Map

    The map of League of Legends contains 3 paths, each a lane with a corresponding name: Top Lane, Mid Lane (short for Middle), and Bot Lane (short for Bottom Lane). Each lane spawns "minions" that help push the lanes. These minions are very easy monsters to defeat and provide gold if a player lands the killing blow. Each lane has 3 towers and an inhibitor. All three of a lane's towers plus its inhibitor must be destroyed before a player can reach the Nexus. The towers protect players by damaging enemy champions in range. The inhibitor provides no benefit to the allied team; however, if a player destroys the enemy inhibitor, "Super Minions" spawn in that lane. These buffed minions help push to finish the game. There are also forests, called the Jungle, between the lanes. In the Jungle there are several monsters that are worth gold and grow stronger as the game goes on. Some monsters, if killed, even provide special bonuses. All of these monsters can be killed by one player. The blue section seen in Figure 1 that splits the map into two sides is known as the River; it is the equivalent of the half-way line in soccer. In this part of the map, large monsters spawn that require a group effort to take down but give huge bonuses.

    1.4. Description of the Gameplay

    The game is known for its complexity; if you want a comprehensive guide, I have provided links that I think do an excellent job ([2] and [3] provide great comprehensive guides). This paper will explain only the minimum necessary to follow the data. There are 5 players on each team, and each player plays a champion: a character with unique gameplay, stats, and abilities. These 5 players each fill a specific role: the Top Laner goes to the Top Lane, the Mid Laner goes to the Mid Lane, the Jungler goes to the Jungle, and the Attack Damage Carry (ADC) and Support go to the Bot Lane. In their respective locations, each role attempts to earn gold and level up. The gold is used to buy items (each with unique effects, ...

  15. Aqua/AIRS L3 Daily Standard Physical Retrieval (AIRS-only) 1 degree x 1...

    • catalog.data.gov
    • gimi9.com
    • +5more
    Updated Sep 19, 2025
    + more versions
    Cite
    NASA/GSFC/SED/ESD/TISL/GESDISC (2025). Aqua/AIRS L3 Daily Standard Physical Retrieval (AIRS-only) 1 degree x 1 degree V7.0 at GES DISC [Dataset]. https://catalog.data.gov/dataset/aqua-airs-l3-daily-standard-physical-retrieval-airs-only-1-degree-x-1-degree-v7-0-at-ges-d-141e0
    Explore at:
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    NASAhttp://nasa.gov/
    Description

    The Atmospheric Infrared Sounder (AIRS) is a grating spectrometer (R = 1200) aboard the second Earth Observing System (EOS) polar-orbiting platform, EOS Aqua. The AIRS Level 3 Daily Gridded Product contains standard retrieval means, standard deviations and input counts. Each file covers a temporal period of 24 hours for either the descending (equatorial crossing North to South at 1:30 AM local time) or ascending (equatorial crossing South to North at 1:30 PM local time) orbit. The data starts at the international dateline and progresses westward (as do the subsequent orbits of the satellite) so that neighboring gridded cells of data are no more than a swath of time apart (about 90 minutes). The two parts of a scan line crossing the dateline are included in separate L3 files, according to the date, so that data points in a grid box are always coincident in time. The edge of the AIRS Level 3 gridded cells is at the date line (the 180E/W longitude boundary). When plotted, this produces a map with 0 degrees longitude in the center of the image unless the bins are reordered. This method is preferred because the left (West) side of the image and the right (East) side of the image contain data farthest apart in time. The gridding scheme used by AIRS is the same as that used by TOVS Pathfinder to create Level 3 products. The daily Level 3 products have gores between satellite paths where there is no coverage for that day.

    The geophysical parameters have been averaged and binned into 1 x 1 deg grid cells, from -180.0 to +180.0 deg longitude and from -90.0 to +90.0 deg latitude. For each grid map of 4-byte floating-point mean values there is a corresponding 4-byte floating-point map of standard deviation and a 2-byte integer grid map of counts. The counts map provides the user with the number of points per bin that were included in the mean and can be used to generate custom multi-day maps from the daily gridded products.

    The thermodynamic parameters are: Skin Temperature (land and sea surface), Air Temperature at the surface, Profiles of Air Temperature and Water Vapor, Tropopause Characteristics, Column Precipitable Water, Cloud Amount/Frequency, Cloud Height, Cloud Top Pressure, Cloud Top Temperature, Reflectance, Emissivity, Surface Pressure, and Cloud Vertical Distribution. The trace gas parameters are: Total Amounts and Vertical Profiles of Carbon Monoxide, Methane, and Ozone. The actual names of the variables in the data files should be inferred from the Processing File Description document. The value for each grid box is the sum of the values that fall within the 1 x 1 deg area divided by the number of points in the box.
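    The last sentence describes simple count-weighted averaging, so custom multi-day means can be reconstructed from the daily means and counts. Below is a minimal NumPy sketch of that aggregation, using randomly generated placeholder arrays rather than real AIRS files:

    import numpy as np

    # Hypothetical stand-ins for 7 daily 1 x 1 degree grids (180 x 360 cells).
    rng = np.random.default_rng(0)
    daily_means = rng.normal(280.0, 5.0, size=(7, 180, 360))    # e.g. skin temperature [K]
    daily_counts = rng.integers(0, 20, size=(7, 180, 360))      # observations per bin per day

    # Count-weighted multi-day mean: sum of (mean * count) divided by total count.
    weighted_sum = (daily_means * daily_counts).sum(axis=0)
    total_counts = daily_counts.sum(axis=0)
    multi_day_mean = np.where(total_counts > 0,
                              weighted_sum / np.maximum(total_counts, 1),
                              np.nan)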

  16. Complete UERRA regional reanalysis for Europe from 1961 to 2019

    • cds.climate.copernicus.eu
    netcdf
    Updated Sep 10, 2019
    + more versions
    Cite
    ECMWF (2019). Complete UERRA regional reanalysis for Europe from 1961 to 2019 [Dataset]. http://doi.org/10.24381/cds.dd7c6d66
    Explore at:
    netcdfAvailable download formats
    Dataset updated
    Sep 10, 2019
    Dataset provided by
    European Centre for Medium-Range Weather Forecastshttp://ecmwf.int/
    Authors
    ECMWF
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Europe
    Description

    The UERRA datasets contain reanalysis data of the atmosphere, the surface and near-surface, as well as of the soil, covering Europe. Essential climate variables are generated with the UERRA-HARMONIE and MESCAN-SURFEX systems. UERRA-HARMONIE is a 3-dimensional variational data assimilation system, while MESCAN-SURFEX is a complementary surface analysis system. Using the Optimal Interpolation method, MESCAN provides the best estimate of daily accumulated precipitation and six-hourly air temperature and relative humidity at 2 meters above the model topography. The land surface platform SURFEX is forced with downscaled forecast fields from UERRA-HARMONIE as well as MESCAN analyses. It is run offline, i.e. without feedback to the atmospheric analysis performed in MESCAN or the UERRA-HARMONIE data assimilation cycles. Using SURFEX offline allows taking full advantage of the precipitation analysis and using the more advanced physics options to better represent surface variables such as surface temperature and surface fluxes, as well as soil processes related to water and heat transfer in the soil and snow.

    In general, reanalysis combines model data with observations into a complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (6 hours in the UERRA-HARMONIE system) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called the analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations and, when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. The assimilation system is able to estimate biases between observations and to sift good-quality data from poor data. The laws of physics allow for estimates at locations where data coverage is low. The provision of estimates at each grid point in Europe for each regular output time, over a long period, always using the same format, makes reanalysis a very convenient and popular dataset to work with. The observing system has changed drastically over time, and although the assimilation system can resolve data holes, the initially much sparser networks will lead to less accurate estimates.

    UERRA-HARMONIE data is available from 1961 and is updated once a month with a delay to real time of about 4 months. The system provides four analyses per day, at 0 UTC, 6 UTC, 12 UTC, and 18 UTC. Between the analyses, forecasts of the system are available with hourly resolution. Hence, estimates of the state of the atmosphere are available for every hour since 1961. Moreover, forecasts up to 30 hours are available from the analyses initialised at 0 UTC and 12 UTC. In addition to observations in the model domain, a regional reanalysis needs information at its lateral boundaries. For the UERRA-HARMONIE system, this information is taken from the global reanalyses ERA40 (until the end of 1978) and ERA-Interim (from 1979). The improvement over global products comes from the higher horizontal resolution, which allows incorporating more regional details, e.g. topography. Moreover, it even enables the system to consider more observations at places with dense observation networks.

    The UERRA-HARMONIE regional reanalysis is produced at a horizontal resolution of 11 km and MESCAN-SURFEX provides data at a resolution of 5.5 km. For the UERRA-HARMONIE system, variables are produced at the surface and on model levels (65 levels) but are also interpolated to two other level types: pressure levels (24 levels between 1000-10 hPa) and height levels (11 levels between 15 m-500 m). The height-level output was introduced with a special focus on the wind energy sector and its needs. Soil data is available on 14 levels from the surface to a depth of 12 m. The number of available parameters varies between the different level types. In order to make data access more manageable, the UERRA-HARMONIE and MESCAN-SURFEX dataset has been split into four records. Analysis time steps are available via the CDS, whereas forecasts are available only through the CDS API.
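    As a minimal sketch of programmatic access, the snippet below shows the general shape of a CDS API request; the dataset name and request keys are assumptions that should be checked against the CDS catalogue entry for this dataset, and a configured ~/.cdsapirc file is required:

    import cdsapi

    c = cdsapi.Client()
    c.retrieve(
        "reanalysis-uerra-europe-single-levels",   # assumed CDS dataset name
        {
            "origin": "uerra_harmonie",            # assumed request keys/values
            "variable": "2m_temperature",
            "year": "2010",
            "month": "01",
            "day": "01",
            "time": "06:00",
            "format": "netcdf",
        },
        "uerra_t2m_20100101.nc",
    )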

  17. Living Standards Measurement Survey 2003 (Wave 3 Panel) - Bosnia-Herzegovina...

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Jan 30, 2020
    + more versions
    Cite
    State Agency for Statistics (BHAS) (2020). Living Standards Measurement Survey 2003 (Wave 3 Panel) - Bosnia-Herzegovina [Dataset]. https://microdata.worldbank.org/index.php/catalog/67
    Explore at:
    Dataset updated
    Jan 30, 2020
    Dataset provided by
    Republika Srpska Institute of Statistics (RSIS)
    Federation of BiH Institute of Statistics (FIS)
    State Agency for Statistics (BHAS)
    Time period covered
    2003
    Area covered
    Bosnia and Herzegovina
    Description

    Abstract

    In 2001, the World Bank, in co-operation with the Republika Srpska Institute of Statistics (RSIS), the Federal Institute of Statistics (FOS) and the Agency for Statistics of BiH (BHAS), carried out a Living Standards Measurement Survey (LSMS). The LSMS, in addition to collecting the information necessary to obtain as comprehensive a measure as possible of the basic dimensions of household living standards, has three basic objectives, as follows:

    1. To provide the public sector, government, the business community, scientific institutions, international donor organizations and social organizations with information on different indicators of the population's living conditions, as well as on available resources for satisfying basic needs.

    2. To provide information for the evaluation of the results of different forms of government policy and programs developed with the aim to improve the population's living standard. The survey will enable the analysis of the relations between and among different aspects of living standards (housing, consumption, education, health, labor) at a given time, as well as within a household.

    3. To provide key contributions for development of government's Poverty Reduction Strategy Paper, based on analyzed data.

    The Department for International Development, UK (DFID) contributed funding to the LSMS and provided funding for a further two years of data collection for a panel survey, known as the Household Survey Panel Series (HSPS). Birks Sinclair & Associates Ltd. were responsible for the management of the HSPS, with technical advice and support provided by the Institute for Social and Economic Research (ISER), University of Essex, UK. The panel survey provides longitudinal data through re-interviewing approximately half the LSMS respondents for two years following the LSMS, in the autumn of 2002 and 2003. The LSMS constitutes Wave 1 of the panel survey, so there are three years of panel data available for analysis. For the purposes of this documentation we are using the following convention to describe the different rounds of the panel survey:

    - Wave 1: LSMS conducted in 2001, forming the baseline survey for the panel
    - Wave 2: second interview of 50% of LSMS respondents in autumn/winter 2002
    - Wave 3: third interview with sub-sample respondents in autumn/winter 2003

    The panel data allow the analysis of key transitions and events over this period, such as labour market or geographical mobility, and observation of the consequent outcomes for the well-being of individuals and households in the survey. The panel data provide information on income and labour market dynamics within FBiH and RS. A key policy area is developing strategies for the reduction of poverty within FBiH and RS. The panel will provide information on the extent to which continuous poverty is experienced by different types of households and individuals over the three-year period. Most importantly, the covariates associated with moves into and out of poverty and the relative risks of poverty for different people can be assessed. As such, the panel aims to provide data which will inform the policy debates within FBiH and RS at a time of social reform and rapid change.

    Geographic coverage

    National coverage. Domains: Urban/rural/mixed; Federation; Republic

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The Wave 3 sample consisted of 2,878 households who had been interviewed at Wave 2, plus a further 73 households who were interviewed at Wave 1 but could not be contacted at Wave 2. A total of 2,951 households (1,301 in the RS and 1,650 in FBiH) were issued for Wave 3. As at Wave 2, the sample could not be replaced with any other households.

    Panel design

    Eligibility for inclusion

    The household and household membership definitions are the same standard definitions as at Wave 2. The sample membership status and eligibility for interview are as follows: i) All members of households interviewed at Wave 2 have been designated as original sample members (OSMs). OSMs include children within households even if they are too young for interview. ii) Any new members joining a household containing at least one OSM are eligible for inclusion and are designated as new sample members (NSMs). iii) At each wave, all OSMs and NSMs are eligible for inclusion, apart from those who move out-of-scope (see discussion below). iv) All household members aged 15 or over are eligible for interview, including OSMs and NSMs.

    Following rules

    The panel design means that sample members who move from their previous wave address must be traced and followed to their new address for interview. In some cases the whole household will move together but in others an individual member may move away from their previous wave household and form a new split-off household of their own. All sample members, OSMs and NSMs, are followed at each wave and an interview attempted. This method has the benefit of maintaining the maximum number of respondents within the panel and being relatively straightforward to implement in the field.

    Definition of 'out-of-scope'

    It is important to maintain movers within the sample to maintain sample sizes and reduce attrition and also for substantive research on patterns of geographical mobility and migration. The rules for determining when a respondent is 'out-of-scope' are as follows:

    i. Movers out of the country altogether, i.e. outside FBiH and RS. This category of mover is clear: sample members moving to another country outside FBiH and RS will be out-of-scope for that year of the survey and not eligible for interview.

    ii. Movers between entities. Respondents moving between entities are followed for interview. The personal details of the respondent are passed between the statistical institutes and a new interviewer is assigned in that entity.

    iii. Movers into institutions. Although institutional addresses were not included in the original LSMS sample, individuals who have subsequently moved into some institutions are followed at Wave 3. The definitions of which institutions are included are found in the Supervisor Instructions.

    iv. Movers into the district of Brcko. These are followed for interview. When coding entity, Brcko is treated as the entity from which the household that moved into Brcko originated.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    Questionnaire design

    Approximately 90% of the questionnaire (Annex B) is based on the Wave 2 questionnaire, carrying forward core measures that are needed to measure change over time. The questionnaire was widely circulated and changes were made as a result of comments received.

    Pretesting

    In order to undertake a longitudinal test the Wave 2 pretest sample was used. The Control Forms and Advance letters were generated from an Access database containing details of ten households in Sarajevo and fourteen in Banja Luka. The pretest was undertaken from March 24-April 4 and resulted in 24 households (51 individuals) successfully interviewed. One mover household was successfully traced and interviewed.
    In order to test the questionnaire under the hardest circumstances a briefing was not held. A list of the main questionnaire changes was given to experienced interviewers.

    Issues arising from the pretest

    Interviewers were asked to complete a Debriefing and Rating form. The debriefing form captured opinions on the following three issues:

    1. General reaction to being re-interviewed. In some cases there was a wariness about being asked to participate again, with some individuals asking "Why me?". Interviewers did a good job of persuading people to take part; only one household refused and another asked to be removed from the sample next year. Having the same interviewer return to the same households was considered an advantage. Most respondents asked what the benefit to them was of taking part in the survey. This aspect was re-emphasised in the Advance Letter, Respondent Report and training of the Wave 3 interviewers.

    2. Length of the questionnaire. The average time of interview was 30 minutes. No problems were mentioned in relation to the timing, though interviewers noted that some respondents, particularly the elderly, tended to wander off the point and that control was needed to bring them back to the questions in the questionnaire. One interviewer noted that the economic situation of many respondents seems to have got worse since the previous year and that it was necessary to listen to respondents' "stories" during the interview.

    3. Confidentiality. No problems were mentioned in relation to confidentiality, though interviewers suggested it might be worth mentioning the new Statistics Law in the Advance Letter. The Rating Form asked for details of specific questions that were unclear. These are described below with a description of the changes made.

    • Module 3. Q29-31 have been added to capture funds received for education, scholarships etc.

    • Module 4. Pretest respondents complained that the 6 questions on "Has your health limited you..." and the 16 on "in the last 7 days have you felt depressed” etc were too many. These were reduced by half (Q38-Q48). The LSMS data was examined and those questions where variability between the answers was widest were chosen.

    • Module 5. The new employment questions (Q42-Q44) worked well and have been kept in the main questionnaire.

    • Module 7. There were no problems reported with adding the credit questions (Q28-Q36)

    • Module 9. SIG recommended that some of Questions 1-12 were relevant only to those aged over 18 so additional skips have been added. Some respondents complained the questionnaire was boring. To try and overcome

  18. Raw hydrological data (HYGON)

    • ckan.mobidatalab.eu
    download +1
    Updated Dec 14, 2021
    Cite
    Geoportal (2021). Raw hydrological data (HYGON) [Dataset]. https://ckan.mobidatalab.eu/dataset/hydrological-raw-data-hygon
    Explore at:
    http://publications.europa.eu/resource/authority/file-type/csv, downloadAvailable download formats
    Dataset updated
    Dec 14, 2021
    Dataset provided by
    Geoportal
    License

    Data licence Germany - Zero - Version 2.0https://www.govdata.de/dl-de/zero-2-0
    License information was derived automatically

    Description

    Current unchecked raw data on water level (W), water temperature, water quality and precipitation (N) from HYGON. Any errors that may occur, including incorrect values, can only be corrected in the subsequent check by the operator. For this reason, no guarantee can be given for the completeness and correctness of the data accessed here.

    Water levels: The water level in above-ground bodies of water is measured using gauges. In North Rhine-Westphalia there are around 500 gauges whose measurements form the basis for the planning and control of water management systems. The so-called gauge system includes not only the pure water level measurement at many gauges, but also additional measurements to determine the discharge.

    Water temperature: The basis of the temperature measuring network consists primarily of the continuously measuring quality stations that LANUV operates nationwide. These stations have been installed at sampling points relevant to water management in NRW (e.g. the mouths of large tributaries of the Rhine, on the Ruhr and at border water measuring points). In addition, so-called combination probes for measuring water level and water temperature have been installed together with new water level sensors at gauges in NRW in recent years, which also allow the water temperature to be measured as a by-product at the location of the water level measurement.

    Water quality: At six control stations on the Rhine and at the alarm stations at the mouths of the most important Rhine tributaries, as well as on the Weser and Ems, water samples are continuously taken by the State Agency for Nature, Environment and Consumer Protection (LANUV) NRW and analyzed at short intervals.

    Precipitation: The measured values from around 200 precipitation stations, which are equipped with technical devices for data transmission, are currently available as raw data values. The amount of precipitation is recorded at all stations in minute resolution. Depending on the technical equipment, the stations are either polled once a day or they send current values every hour.

    The unaudited raw data is stored in HYGON's database for 2 years. For older data, reference is made to the specialist information system ELWAS.

  19. daigt data - llama 70b and falcon180b

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    Nicholas Broad (2023). daigt data - llama 70b and falcon180b [Dataset]. https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b
    Explore at:
    zip(6163526 bytes)Available download formats
    Dataset updated
    Nov 26, 2023
    Authors
    Nicholas Broad
    Description

    This is for the LLM - Detect AI Generated Text (DAIGT) competition.

    Versions

    1. Very light processing out of LLM. 1k from llama-70b-chat, 1k from falcon-180b-chat across all persuade prompts and some extras from gpt-4

    2. Added llama70b_v2.csv by cleaning up Llama 70b output as seen in this notebook. Same data, just with some text removed from samples.

    3. 500 generated samples from llama 70b and falcon 180b for each prompt in RDizzl3_seven (3,500 total for llama70b; 3,500 total for falcon180b). These had sources in the prompt, unlike earlier versions.

    I will be updating it more in the future as I improve the prompts. If you notice anything odd or if you have any questions, please don't hesitate to ask!

    Prompts

    The prompts were a combination of the PERSUADE corpus and some from GPT-4. Essays for the same prompt are generated with different temperatures, top_k values, and slightly different prompts.

    All together there were 15 prompts from PERSUADE and 20 from GPT-4. All prompts are below:

    persuade_prompts = ['Today the majority of humans own and operate cell phones on a daily basis. In essay form, explain if drivers should or should not be able to use cell phones in any capacity while operating a vehicle.',
     'Write an explanatory essay to inform fellow citizens about the advantages of limiting car usage. Your essay must be based on ideas and information that can be found in the passage set. Manage your time carefully so that you can read the passages; plan your response; write your response; and revise and edit your response. Be sure to use evidence from multiple sources; and avoid overly relying on one source. Your response should be in the form of a multiparagraph essay. Write your essay in the space provided.',
     'Some schools require students to complete summer projects to assure they continue learning during their break. Should these summer projects be teacher-designed or student-designed? Take a position on this question. Support your response with reasons and specific examples.',
     "You have just read the article, 'A Cowboy Who Rode the Waves.' Luke's participation in the Seagoing Cowboys program allowed him to experience adventures and visit many unique places. Using information from the article, write an argument from Luke's point of view convincing others to participate in the Seagoing Cowboys program. Be sure to include: reasons to join the program; details from the article to support Luke's claims; an introduction, a body, and a conclusion to your essay.",
     'Your principal has decided that all students must participate in at least one extracurricular activity. For example, students could participate in sports, work on the yearbook, or serve on the student council. Do you agree or disagree with this decision? Use specific details and examples to convince others to support your position. ',
     'In "The Challenge of Exploring Venus," the author suggests studying Venus is a worthy pursuit despite the dangers it presents. Using details from the article, write an essay evaluating how well the author supports this idea. Be sure to include: a claim that evaluates how well the author supports the idea that studying Venus is a worthy pursuit despite the dangers; an explanation of the evidence from the article that supports your claim; an introduction, a body, and a conclusion to your essay.',
     'In the article "Making Mona Lisa Smile," the author describes how a new technology called the Facial Action Coding System enables computers to identify human emotions. Using details from the article, write an essay arguing whether the use of this technology to read the emotional expressions of students in a classroom is valuable.',
     "You have read the article 'Unmasking the Face on Mars.' Imagine you are a scientist at NASA discussing the Face with someone who thinks it was created by aliens. Using information in the article, write an argumentative essay to convince someone that the Face is just a natural landform.Be sure to include: claims to support your argument that the Face is a natural landform; evidence from the article to support your claims; an introduction, a body, and a conclusion to your argumentative essay.",
     'Some of your friends perform community service. For example, some tutor elementary school children and others clean up litter. They think helping the community is very important. But other friends of yours think community service takes too much time away from what they need or want to do. 
    Your principal is deciding whether to require all students to perform community service. 
    Write a letter to your principal in which you take a position on whether students should be required to perform community service. Support your position with examples.',
     "Your principal is considering changing school po...
    
  20. iEEG-Multicenter-Dataset

    • openneuro.org
    Updated Dec 2, 2020
    + more versions
    Cite
    Adam Li; Sara Inati; Kareem Zaghloul; Nathan Crone; William Anderson; Emily Johnson; Iahn Cajigas; Damian Brusko; Jonathan Jagid; Angel Claudio; Andres Kanner; Jennifer Hopp; Stephanie Chen; Jennifer Haagensen; Sridevi Sarma (2020). iEEG-Multicenter-Dataset [Dataset]. http://doi.org/10.18112/openneuro.ds003029.v1.0.1
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    OpenNeurohttps://openneuro.org/
    Authors
    Adam Li; Sara Inati; Kareem Zaghloul; Nathan Crone; William Anderson; Emily Johnson; Iahn Cajigas; Damian Brusko; Jonathan Jagid; Angel Claudio; Andres Kanner; Jennifer Hopp; Stephanie Chen; Jennifer Haagensen; Sridevi Sarma
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Fragility Multi-Center Retrospective Study

    iEEG and EEG data from 5 centers, with a total of 100 subjects, are organized in our study. We publish 4 centers' datasets here due to data sharing issues.

    Acquisitions include ECoG and SEEG. Each run specifies a different snapshot of EEG data from that specific subject's session. For seizure sessions, this means that each run is an EEG snapshot around a different seizure event.

    For additional clinical metadata about each subject, refer to the clinical Excel table in the publication.

    Data Availability

    NIH, JHH, UMMC, and UMF agreed to share. Cleveland Clinic did not, so its data requires an additional DUA.

    All data, except for Cleveland Clinic's, were approved by their centers to be de-identified and shared. No data in this dataset contain PHI or other identifiers associated with the patients. In order to access the Cleveland Clinic data, please forward all requests to Amber Sours, SOURSA@ccf.org:

    Amber Sours, MPH Research Supervisor | Epilepsy Center Cleveland Clinic | 9500 Euclid Ave. S3-399 | Cleveland, OH 44195 (216) 444-8638

    You will need to sign a data use agreement (DUA).

    Sourcedata

    For each subject, there was a raw EDF file, which was converted into the BrainVision format with mne_bids. Each subject with SEEG implantation also has an Excel table, called electrode_layout.xlsx, which outlines where the clinicians marked each electrode anatomically. Note that no rigorous atlas is applied, so the main points of interest are WM, GM, VENTRICLE, CSF, and OUT, which represent white matter, gray matter, ventricle, cerebrospinal fluid and outside the brain. WM, VENTRICLE, CSF and OUT channels were removed from further analysis. These were labeled in the corresponding BIDS channels.tsv sidecar file as status=bad. The dataset uploaded to openneuro.org does not contain the sourcedata, since an extra anonymization step occurred when fully converting to BIDS.
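    A minimal sketch of that EDF-to-BIDS conversion with MNE and MNE-BIDS (file names, subject labels and bad-channel choices are hypothetical placeholders, not the exact script used for this dataset):

    import mne
    from mne_bids import BIDSPath, write_raw_bids

    # Load one (hypothetical) raw EDF recording and mark its contacts as SEEG.
    raw = mne.io.read_raw_edf("sub01_seizure01.edf", preload=True)
    raw.set_channel_types({name: "seeg" for name in raw.ch_names})

    # Channels judged to be in WM/ventricle/CSF/outside the brain end up as
    # status=bad in the BIDS channels.tsv sidecar (placeholder selection here).
    raw.info["bads"] = raw.ch_names[:2]

    bids_path = BIDSPath(subject="01", session="presurgery", task="ictal", run="01",
                         datatype="ieeg", root="bids_dataset")
    write_raw_bids(raw, bids_path, format="BrainVision",
                   allow_preload=True, overwrite=True)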

    Derivatives

    Derivatives include:

    * fragility analysis
    * frequency analysis
    * graph metrics analysis
    * figures

    These can be computed by following this paper: Neural Fragility as an EEG Marker for the Seizure Onset Zone

    Events and Descriptions

    Within each EDF file there are event markers annotated by clinicians, which may inform you of specific clinical events occurring in time, or of when they saw seizure onset and offset (clinical and electrographic).

    During a seizure event, the event markers may follow this time course:

    * eeg onset, or clinical onset - the onset of a seizure that is either marked electrographically, or by clinical behavior. Note that the clinical onset may not always be present, since some seizures manifest without clinical behavioral changes.
    * Marker/Mark On - these are annotations present in some cases, where a health practitioner injects a chemical marker for use in ICTAL SPECT imaging after a seizure occurs. This is commonly done to see which portions of the brain are metabolically active.
    * Marker/Mark Off - This is when the ICTAL SPECT stops imaging.
    * eeg offset, or clinical offset - this is the offset of the seizure, as determined either electrographically, or by clinical symptoms.
    

    Other included events may help you understand the time course of each seizure. Note that ICTAL SPECT occurs in all Cleveland Clinic data. Note also that seizure markers are not consistent in their description naming, so one might encode specific regular-expression rules to consistently capture seizure onset/offset markers across all datasets. In the case of the UMMC data, all onset and offset markers were provided by the clinicians in an Excel sheet instead of via the EDF file, so we added the annotations manually to each EDF file.
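    An illustrative sketch of such regular-expression rules (the example descriptions are hypothetical variants of the markers listed above):

    import re

    # Hypothetical annotation descriptions; the exact wording varies across centers.
    annotations = ["EEG onset", "clinical onset", "Marker/Mark On",
                   "Marker/Mark Off", "eeg offset"]

    onset_re = re.compile(r"\b(eeg|clinical)\s*onset\b", re.IGNORECASE)
    offset_re = re.compile(r"\b(eeg|clinical)\s*offset\b", re.IGNORECASE)

    onsets = [a for a in annotations if onset_re.search(a)]
    offsets = [a for a in annotations if offset_re.search(a)]
    print(onsets, offsets)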

    Seizure Electrographic and Clinical Onset Annotations

    For various datasets, there are seizures present within the dataset. Generally there is only one seizure per EDF file. When seizures are present, they are marked electrographically (and clinically if present) via standard approaches in the epilepsy clinical workflow.

    Clinical onset is simply the manifestation of the seizure as clinical symptoms. Sometimes this marker may not be present.

    Seizure Onset Zone Annotations

    What is most important in the evaluation of these datasets are the clinicians' annotations of their localization hypotheses for the seizure onset zone.

    These generally include:

    * early onset: the earliest-onset electrodes that clinicians saw participating in the seizure
    * early/late spread (optional): the electrodes that showed epileptic spread activity after seizure onset. Not all seizures have spread contacts annotated.
    

    Surgical Zone (Resection or Ablation) Annotations

    For patients with a post-surgical MRI available, the segmentation process outlined above tells us which electrodes were within the surgically removed brain region.

    Otherwise, clinicians give us their best estimate of which electrodes were resected/ablated, based on their surgical notes.

    For surgical patients whose postoperative medical records did not explicitly indicate specific resected or ablated contacts, manual visual inspection was performed to determine the approximate contacts that were located in later resected/ablated tissue. Postoperative T1 MRI scans were compared against post-SEEG implantation CT scans or CURRY coregistrations of preoperative MRI/post SEEG CT scans. Contacts of interest in and around the area of the reported resection were selected individually and the corresponding slice was navigated to on the CT scan or CURRY coregistration. After identifying landmarks of that slice (e.g. skull shape, skull features, shape of prominent brain structures like the ventricles, central sulcus, superior temporal gyrus, etc.), the location of a given contact in relation to these landmarks, and the location of the slice along the axial plane, the corresponding slice in the postoperative MRI scan was navigated to. The resected tissue within the slice was then visually inspected and compared against the distinct landmarks identified in the CT scans, if brain tissue was not present in the corresponding location of the contact, then the contact was marked as resected/ablated. This process was repeated for each contact of interest.

    References

    Adam Li, Chester Huynh, Zachary Fitzgerald, Iahn Cajigas, Damian Brusko, Jonathan Jagid, Angel Claudio, Andres Kanner, Jennifer Hopp, Stephanie Chen, Jennifer Haagensen, Emily Johnson, William Anderson, Nathan Crone, Sara Inati, Kareem Zaghloul, Juan Bulacio, Jorge Gonzalez-Martinez, Sridevi V. Sarma. Neural Fragility as an EEG Marker of the Seizure Onset Zone. bioRxiv 862797; doi: https://doi.org/10.1101/862797

    Appelhoff, S., Sanderson, M., Brooks, T., Vliet, M., Quentin, R., Holdgraf, C., Chaumon, M., Mikulan, E., Tavabi, K., Höchenberger, R., Welke, D., Brunner, C., Rockhill, A., Larson, E., Gramfort, A. and Jas, M. (2019). MNE-BIDS: Organizing electrophysiological data into the BIDS format and facilitating their analysis. Journal of Open Source Software 4: (1896). https://doi.org/10.21105/joss.01896

    Holdgraf, C., Appelhoff, S., Bickel, S., Bouchard, K., D'Ambrosio, S., David, O., … Hermes, D. (2019). iEEG-BIDS, extending the Brain Imaging Data Structure specification to human intracranial electrophysiology. Scientific Data, 6, 102. https://doi.org/10.1038/s41597-019-0105-7

    Pernet, C. R., Appelhoff, S., Gorgolewski, K. J., Flandin, G., Phillips, C., Delorme, A., Oostenveld, R. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific Data, 6, 103. https://doi.org/10.1038/s41597-019-0104-8
