Archaeology is awash in digital data. Archaeologists generate large numbers of digital files in their field, laboratory, and records investigations. We use digital mapping, digital photography, and digital means of data analysis, and our reports are drafted and produced digitally. Good curation of digital data provides easy means by which it can be discovered and accessed, and ensures that it is preserved for future use. In many ways, planning for and carrying out good digital curation involves steps similar to those for good curation of artifacts, samples, and paper records; however, the digital techniques are different. We summarize best practices in this emerging part of archaeology with real-world examples.
SAA 2015 abstracts made available in tDAR courtesy of the Society for American Archaeology and Center for Digital Antiquity Collaborative Program to improve digital data in archaeology. If you are the author of this presentation you may upload your paper, poster, presentation, or associated data (up to 3 files/30MB) for free. Please visit http://www.tdar.org/SAA2015 for instructions and more information.
Archaeologists generate large numbers of digital materials during the course of field, laboratory, and records investigations. Maps, photographs, data analyses, and reports are often produced digitally. Good curation of digital data means it can be discovered and accessed, and preserving these materials means they remain accessible for future use. In many ways, managing, curating, and preserving digital materials involves steps similar to those taken with physical artifacts, samples, and paper records. However, digital materials are different, and the process can appear daunting at first.
In this poster we outline some simple steps for managing and curating digital materials that can be integrated into existing or future projects and that can be applied to digital materials from completed projects. We also use real-world examples from tDAR (the Digital Archaeological Record) to illustrate how people are preserving their digital materials for access and future use.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset (Exploring Workforce Dataset A) contains quantitative data from the fall 2023 Exploring the Data Workforce project survey. Exploring Workforce Dataset A v.2 was updated and added on June 19, 2025, and is the most recent version to date. There is an explanatory README tab at the front of the Excel workbook. Data in the Excel workbook have been cleaned, and tabs are sorted in the order of the survey questions, followed by data analysis (mean and standard deviation) for each question. The Qualtrics values text file (Word document) contains the Qualtrics survey and the numerical response values that correspond with the Likert scales in the Excel spreadsheet tabs. You can use these values to interpret the participant responses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information
The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets, however, revealed considerable differences, suggesting the benefit of a consensus dataset. We have therefore combined and curated information from five established databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1,144,803 compounds with 10,915,362 bioactivities on 5,613 targets (including defined macromolecular targets as well as cell lines and phenotypic readouts). It also provides simplified information on the assay types underlying the bioactivity data and on bioactivity confidence obtained by comparing data from different sources. We have unified the source databases, brought them into a common format, and combined them, making the data straightforward to reuse in applications such as chemogenomics and data-driven drug design.
The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.
Structure and content of the dataset
Column headers: ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check | Source
The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV file and a compressed CSV file.
Except for the canonical SMILES columns, all columns use the datatype 'string'; the canonical SMILES columns use the SMILES datatype. We recommend the File Reader node for using the dataset in KNIME, as it allows the column data types to be set exactly and is the only node that can read the compressed format.
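Outside of KNIME, the uncompressed or gzip-compressed CSV export can also be read with standard tooling. The following pandas sketch is only an illustration; the file name consensus_dataset.csv.gz is an assumption, and you should substitute the name of your local download.

```python
# Minimal sketch for loading the consensus CSV outside of KNIME.
# "consensus_dataset.csv.gz" is a placeholder file name, not the official one.
import pandas as pd

# Read all columns as strings, mirroring the 'string' datatype described above;
# pandas infers gzip compression from the .gz suffix.
df = pd.read_csv("consensus_dataset.csv.gz", dtype=str, low_memory=False)

print(len(df), "rows")
print(df.columns.tolist()[:10])            # first few column headers
print(df["Target"].value_counts().head())  # most frequent targets
```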
Column content:
This is a PDF copy of the PPT slides used for this presentation at the SAA symposium. The Digital Index of North American Archaeology (DINAA) has a massive compilation of archaeological site data. This paper presents recent findings from development of DINAA’s site database, efforts to link DINAA with mined references from digital literature, and efforts to prepare DINAA for future crowd-sourced professional data citations. The continental United States spans eight million square kilometers, with a multicultural past of over 15,000 years. Archaeologists have been practically and theoretically frustrated in search of curatorial practices, digital or otherwise, to make comprehensible the reporting and interpretation of such a vast spatiotemporal set. The federal organization of State and Tribal Historic Preservation Offices and similar entities under the National Historic Preservation Act guarantees local systems of information management will maintain records of archaeological sites within territorial jurisdictions. DINAA has successfully interoperated and made completely public the non-sensitive, scientific information from many of these systems. Linkage of these data with other datasets at large scales, crosscutting political borders, facilitates archaeological and interdisciplinary studies of human adaptation. In cultivating an open source community, DINAA hopes to add value to site and collections data (digital and otherwise), make these accessible to researchers and stakeholders, and highlight ethical approaches toward distributed data curation.
https://spdx.org/licenses/CC0-1.0.html
DNA metabarcoding is promising for cost-effective biodiversity monitoring, but reliable diversity estimates are difficult to achieve and validate. Here we present and validate a method, called LULU, for removing erroneous molecular operational taxonomic units (OTUs) from community data derived by high-throughput sequencing of amplified marker genes. LULU identifies errors by combining sequence similarity and co-occurrence patterns. To validate the LULU method, we use a unique data set of high quality survey data of vascular plants paired with plant ITS2 metabarcoding data of DNA extracted from soil from 130 sites in Denmark spanning major environmental gradients. OTU tables are produced with several different OTU definition algorithms and subsequently curated with LULU, and validated against field survey data. LULU curation consistently improves α-diversity estimates and other biodiversity metrics, and does not require a sequence reference database; thus, it represents a promising method for reliable biodiversity estimation.
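LULU itself is distributed as an R package; purely to illustrate the post-clustering logic described above (a less abundant OTU is flagged as an error when it is sufficiently similar to a more abundant OTU and only occurs where that OTU occurs), here is a simplified Python sketch. The thresholds and the table layout are illustrative assumptions, not the package defaults.

```python
# Simplified illustration of LULU-style post-clustering curation (not the official R package).
# otu_table: rows = OTUs, columns = samples, values = read counts.
# matchlist: columns ['daughter', 'parent', 'similarity'] from pairwise sequence matching.
import pandas as pd

def lulu_like_curation(otu_table, matchlist, min_match=84.0,
                       min_cooccurrence=0.95, min_ratio=1.0):
    totals = otu_table.sum(axis=1)
    discard = set()
    for _, row in matchlist.iterrows():
        d, p, sim = row["daughter"], row["parent"], float(row["similarity"])
        if d == p or sim < min_match:
            continue
        if d not in otu_table.index or p not in otu_table.index:
            continue
        if totals[p] <= totals[d]:
            continue  # the potential parent must be more abundant overall
        d_present = otu_table.loc[d] > 0
        p_present = otu_table.loc[p] > 0
        if d_present.sum() == 0:
            continue
        # fraction of the daughter's occurrences that are shared with the parent
        cooccurrence = (d_present & p_present).sum() / d_present.sum()
        both = d_present & p_present
        if cooccurrence < min_cooccurrence or both.sum() == 0:
            continue
        # parent/daughter abundance ratio in the samples where both occur
        ratio = (otu_table.loc[p, both] / otu_table.loc[d, both]).min()
        if ratio >= min_ratio:
            discard.add(d)  # treat the daughter as an erroneous variant of the parent
    return otu_table.drop(index=list(discard))
```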
This is a PDF copy of the PPT slides used for this presentation in the SAA symposium. The first principle of the SAA’s Ethics states: “The archaeological record …[including]... archaeological collections, records and reports, is irreplaceable. It is the responsibility of all archaeologists to work for the long-term conservation and protection of the archaeological record...” As a profession, we have been reasonably responsible as stewards of archaeological sites, but considerably less responsible when we think about digital records and reports. The long-term and ready availability of the complete records of any archaeological activity is essential for the credibility of archaeology. A recent article in Science reports that after redoing 100 major psychology experiments, only 39% could be replicated. The ability of others to reproduce results is a central tenet of modern research. In archaeology, we commonly destroy our object of study; it is only through careful reassessment of the data from our work that we have any hope of a foundation that is not built on shifting sands. At the same time, the increasing use of high-density survey and measurement in the field means that we can move from our tradition of recording information (once removed from data) to recording (and preserving) data, making preservation even more critical.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Title: YouTube Video Curation (Metadata and URLs) 😇
Subtitle: Analyzing YouTube Content: From Video Descriptions to Viewer Engagement Metrics
Introduction
The YouTube Video Metadata Explorer dataset is a comprehensive collection of metadata related to YouTube videos, encompassing a wide range of information including video IDs, content details, statistical data, descriptions, and associated URLs. This rich dataset provides a unique opportunity to explore, analyze, and understand the digital media landscape on one of the world's largest video-sharing platforms.
Content
The dataset consists of 307,623 entries and six main attributes, detailed as follows:
- ID: Unique identifier for each video
- Snippet: Contains detailed information, including:
  - Category ID: YouTube video category identifier
  - Channel ID: Unique identifier for the channel hosting the video
  - Channel Title: Name of the channel hosting the video
  - Default Audio Language: The default audio language of the video
  - Default Language: The default language of the video
  - Live Broadcast Content: Indicator for live broadcast content
  - Localized: Information related to localization
  - Title: Title of the video
  - Published At: Publication date and time
  - Tags: Associated tags for the video
  - Thumbnails: Different resolution thumbnails, including:
    - Default: 90x120 pixels
    - High: 360x480 pixels
    - Maxres: 720x1280 pixels
    - Medium: 180x320 pixels
    - Standard: 480x640 pixels
- Content Details: Includes information about the video's technical specifications and features:
  - Caption: Indicates whether captions are available (true or false)
  - Content Rating: YouTube content rating (e.g., 'ytRating': None)
  - Definition: Video definition quality (e.g., 'hd' for high definition)
  - Dimension: Video dimension (e.g., '2d' for 2-dimensional)
  - Duration: Duration of the video (e.g., 'PT16M34S' for 16 minutes and 34 seconds)
  - Licensed Content: Indicates whether the content is licensed (true or false)
  - Projection: Type of video projection (e.g., 'rectangular')
  - Region Restriction: Any region restrictions applied to the video
- Statistics: Features video engagement metrics:
  - Comment Count: Number of comments on the video
  - Favorite Count: Number of times the video has been marked as a favorite (e.g., '0')
  - Like Count: Number of likes on the video (e.g., '29942')
  - View Count: Number of views for the video (e.g., '704710')
- Description: A brief description or summary of the video content
- URLs: Links associated with the video's description
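Records with this nested snippet/contentDetails/statistics layout can be flattened into tabular form for analysis. The sketch below uses pandas.json_normalize on a single made-up record whose field names follow the YouTube Data API conventions; it is an illustration, not an excerpt from the dataset.

```python
# Sketch: flattening nested YouTube-style metadata into a flat table with pandas.
# The record below is invented for illustration; field names follow YouTube Data API naming.
import pandas as pd

records = [{
    "id": "abc123",
    "snippet": {"title": "Example Video", "channelTitle": "Example Channel",
                "publishedAt": "2023-01-01T00:00:00Z", "categoryId": "22"},
    "contentDetails": {"duration": "PT16M34S", "definition": "hd", "caption": "false"},
    "statistics": {"viewCount": "704710", "likeCount": "29942", "commentCount": "512"},
}]

df = pd.json_normalize(records, sep=".")

# Engagement counts arrive as strings; cast them to numbers before analysis.
for col in ["statistics.viewCount", "statistics.likeCount", "statistics.commentCount"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df[["id", "snippet.title", "statistics.viewCount"]])
```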
https://huggingface.co/datasets/JeanKaddour/minipile
The MiniPile Challenge for Data-Efficient Language Models
MiniPile is a 6GB subset of the deduplicated Pile corpus. To curate MiniPile, we perform a simple, three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters.
The primary motivation for curating MiniPile is that (i) diverse pre-training datasets (like the Pile) are often too large for academic budgets and (ii) most smaller-scale datasets are fairly homogeneous and thereby unrepresentative of contemporary general-purpose language models. MiniPile aims to fill this gap and thereby facilitate data-efficient research on model architectures, training procedures, optimizers, etc.
More details on the MiniPile curation procedure and some pre-training results can be found in the MiniPile paper.
For more details on the Pile corpus, we refer the reader to the Pile datasheet.
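As a usage note, the dataset linked above can be pulled directly with the Hugging Face datasets library; the split and column names below ('train', 'text') are assumptions based on typical Pile-derived releases.

```python
# Sketch: loading MiniPile from the Hugging Face Hub (dataset ID taken from the URL above).
from datasets import load_dataset

minipile = load_dataset("JeanKaddour/minipile", split="train")  # split name assumed
print(minipile)                   # number of documents and available columns
print(minipile[0]["text"][:300])  # first 300 characters of one document ("text" column assumed)
```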
English (EN)
MiniPile is a subset of the Pile, curated by Jean Kaddour. The Pile was created by Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy.
Since MiniPile is a subset of the Pile, the same MIT License holds.
@article{kaddour2023minipile,
title={The MiniPile Challenge for Data-Efficient Language Models},
author={Kaddour, Jean},
journal={arXiv preprint arXiv:2304.08442},
year={2023}
}
@article{gao2020pile,
title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
journal={arXiv preprint arXiv:2101.00027},
year={2020}
}
Contains global weather station locations with data for monthly means from 1981 through 2010 for:
- Daily Mean Temperature (°C)
- Daily Maximum Temperature (°C)
- Daily Minimum Temperature (°C)
- Precipitation (mm)
- Highest Daily Temperature (°C)
- Lowest Daily Temperature (°C)
Additional monthly fields containing the equivalent values in °F and inches are available at the far right of the attribute table. GHCND stations were included if there were at least fifteen average daily values available in each month for all twelve months of the year, and for at least ten years between 1981 and 2010. 3,197 of the 7,480 stations did not collect or lacked sufficient precipitation data. These data are compiled from archived station values which have not undergone rigorous curation, and thus there may be unexpected values, particularly in the daily extreme high and low fields. Esri is working to further curate this layer and will make updates as improvements are found. If your area of study is within the United States, we recommend using the U.S. Historical Climate - Monthly Averages for GHCN-D Stations 1981 - 2010 layer, because the data in that service were compiled from web services produced by the Applied Climate Information System (ACIS). ACIS staff curate the values for the U.S., including correcting erroneous values and reconciling data from stations that have been moved over their history, so the data in the U.S. service are of higher quality.
Revision History:
- Initially Published: 6 Feb 2019
- Updated: 12 Feb 2019 - Improved the initial extraction algorithm to remove stations with extreme values, including values higher than the highest temperature ever recorded on Earth, or mean values considerably different from those of adjacent neighboring stations.
- Updated: 18 Feb 2019 - Updated after finding an error in initial processing that excluded 2,870 stations.
- Updated: 16 Apr 2019 - More precise coordinates for station locations became available from the Enhanced Master Station History Report (EMSHR) published by NOAA NCDC. With the publication of this layer, the geometry and attributes for 635 of 7,452 stations now have more precise coordinates. The schema was updated to include the NCDC station identifier, and elevation fields in feet and meters are also included. A large subset of the EMSHR metadata is available via EMSHR Stations Locations and Metadata 1738 to Present.
Cite as:
Esri, 2019: World Historical Climate - Monthly Averages for GHCN-D Stations for 1981 - 2010. ArcGIS Online, Accessed April 2019. https://www.arcgis.com/home/item.html?id=ed59d3b4a8c44100914458dd722f054f
Source Data:
- Station locations: initially compiled using station locations from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt (Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3.24); amended to use the most recent station locations from Russell S. Vose, Shelley McNeill, Kristy Thomas, Ethan Shepherd (2011): Enhanced Master Station History Report of March 2019. NOAA National Climatic Data Center. Access Date: April 10, 2019. doi:10.7289/V5NV9G8D.
- Station monthly means compiled from daily data: ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd_all.tar.gz (Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3.24).
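Returning to the inclusion rule stated above (at least fifteen daily values in each of the twelve months, for at least ten years between 1981 and 2010), the sketch below shows how such a rule could be applied to one station's daily records with pandas. The column names DATE and TMEAN are assumptions for illustration, not the GHCN-D schema.

```python
# Sketch: applying the stated inclusion rule to one station's daily records.
# `daily` is assumed to have columns DATE (datetime64) and TMEAN (°C); names are illustrative.
import pandas as pd

def monthly_normals(daily: pd.DataFrame, min_days: int = 15, min_years: int = 10):
    daily = daily.dropna(subset=["TMEAN"]).copy()
    daily["year"] = daily["DATE"].dt.year
    daily["month"] = daily["DATE"].dt.month

    # Keep only (year, month) groups with at least `min_days` daily values.
    counts = daily.groupby(["year", "month"])["TMEAN"].transform("count")
    daily = daily[counts >= min_days]

    # Keep only years in which all twelve months qualify.
    months_per_year = daily.groupby("year")["month"].nunique()
    complete_years = months_per_year[months_per_year == 12].index
    if len(complete_years) < min_years:
        return None  # station does not meet the ten-year requirement
    daily = daily[daily["year"].isin(complete_years)]

    # 1981-2010-style normals: average the monthly means across qualifying years.
    return daily.groupby(["year", "month"])["TMEAN"].mean().groupby("month").mean()
```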
The synthesis of novel, complex drug molecules to establish structure-activity relationships (SAR) is often the limiting step in early drug discovery. To expedite SAR exploration and enhance the pharmacological profiles of lead structures within the design-make-test-analyze (DMTA) cycle, it is crucial to refine synthetic methodologies. Late-stage functionalization (LSF) offers an effective, step-saving approach for modifying advanced leads by directly substituting C–H bonds with other moieties, thereby facilitating chemical space exploration and modulating absorption, distribution, metabolism and excretion (ADME) properties. However, the similarity of C–H bonds within structurally intricate drug and drug-like molecules necessitates a detailed understanding of their reactivity for targeted functionalization, which complicates the standardization of experimental protocols. This complexity often results in resource-intensive wet lab explorations, which may conflict with the stringent timelines and budgets of drug discovery projects.

High-throughput experimentation (HTE) has emerged as a key technology to streamline synthesis by efficiently evaluating reaction conditions in a plate format using automation equipment. By tackling certain remaining bottlenecks of HTE, specifically in the field of software/hardware integration and data governance, the technology has the potential to efficiently assess LSF reaction methodologies with the lowest possible material consumption. LSF reaction data sets from HTE campaigns, combined with big data analytics and machine learning (ML), are expected to enable the development of predictive models for C–H bond transformations. This would allow the estimation of reaction outcomes before carrying out resource- and time-intensive experimentation in the laboratory, facilitating the synthesis of target molecules in an environmentally conscious and material-efficient manner. Despite the potential of making LSF a more efficient methodology to enable fast drug diversification and, consequently, speed up the development of novel medicines, a seamless connection between all three research fields, namely LSF, HTE and reactivity prediction, has not been made so far.

This thesis presents the development of a digital, semi-automated HTE system designed to systematically evaluate LSF methodologies on drug-like molecules. Dolphin, the Data orchestrated laboratory platform harnessing innovative neural network, is an end-to-end platform tailored for LSF that incorporates automation, digitalization, and ML to enhance compound synthesis efficiency in early drug discovery. Advanced automated laboratory equipment, such as solid and liquid dosing robots, is employed to simultaneously initiate reactions and prepare controls, ensuring sample quality for subsequent analyses. A high level of software/hardware integration supports the workflow from literature analysis and reaction plate screening to scale-up planning and data management. To allow the extraction, curation, storage and analysis of reaction data from the literature, in parallel with the development of Dolphin, efforts have been directed towards the development of a simple, user-friendly reaction format (SURF). After evaluating current data-sharing practices and identifying bottlenecks, SURF was designed to be both human- and machine-readable, streamlining the use of reaction data in ML applications.
Application of this format to curate data from selected publications enabled systematic HTE plate design and provided high-quality data sets for ML model development. Applying Dolphin and SURF in two case studies with different LSF reaction types enabled reactivity prediction. The first case study was centered around assessing the applicability of C–H borylation reactions for the late-stage diversification of complex molecules. Hundreds of HTE reactions were performed on systematically chosen commercial drugs under a wide array of conditions. The data generated from these experiments were captured in SURF and used to support the development of an ML algorithm capable of predicting binary reaction outcomes, yields, and regioselectivity for novel substrates. The influence of steric and electronic effects on model performance was quantified by featurization of the input molecular graphs with 2D, 3D and quantum mechanics (QM) augmented information. The reactivity of novel reactions with known and unknown substrates was classified with a balanced accuracy of 92% and 67%, respectively, while computational models predicted reaction yields for diverse reaction conditions with a mean absolute error (MAE) margin of 4–5%. The platform delivered numerous starting points for the structural diversification of commercial pharmaceuticals and advanced drug-like fragments.

The second case study investigated a library-type screening approach for determining the substrate scope of late-stage Minisci-type C–H alkylations to explore new exit vectors. This approach aimed to facilitate the in silico prediction of suitable substrates that can undergo coupling with a diverse array of sp3-rich carboxylic acids. Again, Dolphin and SURF provided the experimental data sets to train ML models for the described task. The algorithms predicted reaction yields with an MAE of 11–12% and suggested starting points for scale-up reactions of 3180 advanced heterocyclic building blocks with various carboxylic acid building blocks. From those, a set of promising candidates was chosen, reactions were scaled up to the 50 to 100 mg range and products were isolated and characterized. This process led to the creation of 30 novel, functionally modified molecules that hold potential for further optimization. The results from both case studies positively advocate the application of ML based on high-quality HTE data for reactivity prediction in the LSF space and beyond.

In summary, this thesis established a semi-automated platform (Dolphin) and a new reaction format (SURF), facilitating the development of ML models for LSF reaction screening, thereby contributing to enhancing the compound synthesis efficiency in drug discovery through the strategic application of laboratory automation and artificial intelligence.
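The Dolphin/SURF models themselves are not reproduced here; purely as a generic illustration of the kind of fingerprint-based yield regression described in the case studies, a minimal sketch with RDKit and scikit-learn might look as follows (the substrates and yields are invented placeholders).

```python
# Generic illustration only: regress reaction yield on Morgan-fingerprint features
# of the substrate. This is not the thesis's Dolphin/SURF pipeline.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def featurize(smiles: str, n_bits: int = 2048) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Placeholder training data; real inputs would come from HTE campaigns recorded in SURF.
substrates = ["c1ccccc1", "c1ccncc1", "Cc1ccccc1O"]
yields_pct = [72.0, 15.0, 48.0]  # invented values for illustration

X = np.vstack([featurize(s) for s in substrates])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, np.array(yields_pct))
print(model.predict(featurize("c1ccco1").reshape(1, -1)))  # predicted yield for a new substrate
```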
https://www.law.cornell.edu/uscode/text/17/106
Contains: Wholebrain Searchlight mean (presence vs. absence of motion) from Phase 1; Mean EEG decoding accuracy for each TMS session for each subject from Experiments 2 and 3; Mean recognition memory performance for Experiment 4
BACKGROUND In animation, text animation refers to creating text that moves in some fashion across the screen, within an area, or by following a pattern of motion. A less frequently used meaning is an animation created using only text characters, so that each element within the animation is made of letters, numbers, punctuation marks, or other symbols, as in this work, which is made almost entirely of typography. Alessio Cavallaro and Emma McRae from ACMI were invited to curate a package of moving image works by contemporary Australian artists. Prominently featured in the physical spaces of the Australian Pavilion, the exhibit included works by well-known Australian practitioners, Daniel Von Sturmer and the Lycette Bros, which were cycled with live performances. CONTRIBUTION Not My Type IV: the Modern Courting Rituals of Potential Office Suitors, a black and white digital animation exhibited primarily online, was shown at festivals and galleries in other formats as the work evolved into a series. Its narrative themes explore human relationships through the technology that now defines them, and it continues the Lycettes' research in the field of animation. An office and its occupants, presented as typographic characters, create a theatre of emotion. The text-based Flash animation plays on texting as a form of communication that can be used to convey the depth and contradictions of human emotional life. The work establishes and displays new methodologies in digital animation techniques, utilising the intersection of positive and negative space to create a sense of dimensional depth and to explore minimalism and the notions of 'what is not said' and 'what is not shown'. SIGNIFICANCE This research originally established a significant aesthetic direction for graphic animation practice by utilising typographic elements and has been widely awarded and exhibited. Over 73 million people visited the Expo, with the Australian Pavilion among the most popular.
https://www.imperial.ac.uk/medicine/research-and-impact/groups/icare/icare-facility/information-for-researchers/
The iCARE SDE is a cloud-based, big data analytics platform sitting within Imperial College Healthcare NHS Trust (ICHT) NHS infrastructure. This, combined with the iCARE Team's robust method of data de-identification, makes the Environment an incredibly secure platform. Because it can be accessed remotely using the Trust's Virtual Desktop Infrastructure, researchers can perform their work remotely and are not constrained by location. (imperial.dcs@nhs.net)
The iCARE SDE enables clinicians, researchers and data scientists to access large-scale, highly curated databases for the purposes of research, clinical audit and service evaluation. The iCARE SDE enables advanced data analytics through a scalable virtual infrastructure supporting Azure Machine Learning, Python, R and Stata, and a large variety of Snowflake SQL tooling.
The main iCARE data model is an HRA REC-approved database covering all routinely captured information from the Imperial College Healthcare NHS Trust (ICHT) Electronic Health Record and 39 linked (at the patient level) clinical and non-clinical systems. It contains data for all patients from 2015 onwards and is updated at least weekly, and close to real time when required. It includes inpatient, outpatient, A&E, pathology, cancer, imaging, treatments, e-prescribing, procedures, clinical notes, consent, clinical trials, tissue bank samples, patient safety and incidents, patient experience, and staffing and environment data.
Data can also be linked to primary care data for the 2.8 million population in Northwest London via the HRA REC-approved, Whole Systems Integrated Care (WSIC) hosted database, and to other health and social care providers when approved.
On a project-by-project basis, the model can be expanded to curate and include new data (including multi-modality data) that is captured either routinely or through approved research and clinical trials. There are streamlined processes to approve and curate new data (imperial.dataaccessrequest@nhs.net), and data will always remain hosted in the SDE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current and future consequences of anthropogenic impacts such as climate change and habitat loss on ecosystems will be better understood and therefore addressed if diverse ecological data from multiple environmental contexts are more effectively shared. Re-use requires that data are readily available to the scientific scrutiny of the research community. A number of repositories to store shared data have emerged in different ecological domains and developments are underway to define common data and metadata standards. Nevertheless, the goal is far from being achieved and many challenges still need to be addressed. The definition of best practices for data sharing and re-use can benefit from the experience accumulated by pilot collaborative projects. The Euromammals bottom-up initiative has pioneered collaborative science in spatial animal ecology since 2007. It involves more than 150 institutes to address scientific, management and conservation questions regarding terrestrial mammal species in Europe using data stored in a shared database. In this manuscript we present some key lessons that we have learnt from the process of making shared data and knowledge accessible to researchers and we stress the importance of data management for data quality assurance. We suggest putting in place a pro-active data review before data are made available in shared repositories via robust technical support and users’ training in data management and standards. We recommend pursuing the definition of common data collection protocols, data and metadata standards, and shared vocabularies with direct involvement of the community to boost their implementation. We stress the importance of knowledge sharing, in addition to data sharing. We show the crucial relevance of collaborative networking with pro-active involvement of data providers in all stages of the scientific process. Our main message is that for data-sharing collaborative efforts to obtain substantial and durable scientific returns, the goals should not only consist in the creation of e-infrastructures and software tools but primarily in the establishment of a network and community trust. This requires moderate investment, but over long-term horizons.
As per our latest research, the global real-time data widgets for signage market size reached USD 2.47 billion in 2024, driven by the rapid adoption of dynamic digital signage solutions across various industries. The market is experiencing robust expansion, registering a CAGR of 13.2% from 2025 to 2033. By the end of the forecast period in 2033, the market is projected to attain a value of USD 7.18 billion. Key growth factors include the increasing demand for interactive and personalized customer experiences, the proliferation of smart cities, and the integration of IoT and AI technologies in digital signage systems.
The growth of the real-time data widgets for signage market is significantly influenced by the surging demand for dynamic content in retail and transportation sectors. Retailers are increasingly leveraging real-time data widgets to deliver targeted promotions, live updates, and interactive content to enhance the in-store customer journey. Similarly, transportation hubs such as airports, train stations, and bus terminals are adopting these solutions to provide live updates on schedules, weather, and emergency alerts. This capability to deliver up-to-the-minute information not only improves operational efficiency but also elevates the overall user experience, making real-time data widgets a critical component in modern digital signage ecosystems.
Another major growth driver is the advancement and integration of cloud-based technologies and IoT devices. Cloud-based deployment enables centralized management, scalability, and seamless content updates, which are essential for enterprises operating across multiple locations. IoT integration allows for the collection and display of real-time data from various sensors and external sources, further enhancing the relevance and immediacy of displayed content. These technological advancements are fostering innovation in the signage industry, enabling businesses to deliver highly contextual and engaging content that adapts to changing conditions in real-time.
The ongoing digital transformation across corporate, hospitality, education, and healthcare sectors is also fueling market growth. Organizations are increasingly embracing digital signage with real-time data widgets to improve internal communications, streamline information dissemination, and engage stakeholders effectively. In corporate settings, these widgets are used for live dashboards, performance metrics, and emergency notifications, while in healthcare, they provide patient updates, wayfinding, and health tips. The ability to customize and automate content based on live data feeds is proving invaluable across these sectors, driving widespread adoption and market expansion.
From a regional perspective, North America currently dominates the real-time data widgets for signage market, accounting for the largest revenue share due to early technology adoption and strong presence of key market players. Europe follows closely, fueled by smart city initiatives and increasing investments in digital infrastructure. The Asia Pacific region is expected to witness the fastest growth, propelled by rapid urbanization, expanding retail and transportation networks, and a growing focus on enhancing customer engagement through digital means. Latin America and the Middle East & Africa are also showing promising growth trajectories, driven by rising digitalization efforts and increasing awareness of the benefits of real-time data-driven signage solutions.
The component segment of the real-time data widgets for signage market is broadly categorized into software, hardware, and services. Software solutions form the backbone of this market, enabling the seamless integration of real-time data feeds with digital signage displays. These software platforms are equipped with advanced features such as data aggregation, content management, analytics, and automation, which allow businesses to curate and dis
https://creativecommons.org/publicdomain/zero/1.0/
Bradley Melting Point Dataset
The Jean-Claude Bradley Open Melting Point Dataset is one of the largest openly available collections of experimentally measured melting points for organic compounds.
This dataset is provided as an extra dataset for this competition.
📊 Size: Over 20,000 data points of experimentally reported melting points.
🧪 Content: Contains organic molecules with their associated melting point values, chemical identifiers, and references.
🌍 Use Cases:
Thermophysical property prediction
QSAR/QSPR modeling
Machine learning and cheminformatics research
Benchmarking predictive models for melting point estimation
This dataset was released by Jean-Claude Bradley and collaborators under an open license to support reproducible and open cheminformatics research.
Bradley Double Plus Good Melting Point Dataset
The Jean-Claude Bradley Double Plus Good Highly Curated and Validated Melting Point Dataset is a refined and high-quality subset of the original open dataset.
🧹 Curation: Carefully cleaned, validated, and standardized to reduce noise and improve data reliability.
📊 Size: Approximately 4,000 compounds with highly reliable melting point measurements.
🎯 Key Value: Offers a trustworthy benchmark for building and evaluating predictive models with minimal experimental noise.
This curated dataset is especially recommended for machine learning applications where data quality is critical.
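As a sketch of the kind of QSPR benchmarking mentioned above, the snippet below fits a simple descriptor-based regressor to melting point data. The file name and the column names ('smiles', 'mpC') are assumptions; adjust them to the actual download.

```python
# Minimal melting-point QSPR baseline sketch (file and column names are assumptions).
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("bradley_melting_points.csv")  # hypothetical file name

def simple_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumRotatableBonds(mol)]

feats = df["smiles"].map(simple_descriptors)
mask = feats.notna()
X = np.array(feats[mask].tolist())
y = df.loc[mask, "mpC"].astype(float).values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("MAE (°C):", mean_absolute_error(y_te, model.predict(X_te)))
```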
🔖 License
Both datasets are released under the Creative Commons Zero (CC0) license (public domain dedication), meaning they are free to use, share, and remix without restriction.
📌 Sources
Jean-Claude Bradley Open Melting Point Dataset
Jean-Claude Bradley Double Plus Good (Highly Curated and Validated) Melting Point Dataset
Objective(s): The 2024 Pediatric Sepsis Data Challenge provides an opportunity to address the lack of appropriate mortality prediction models for LMICs. For this challenge, we are asking participants to develop a working, open-source algorithm to predict in-hospital mortality and length of stay using only the provided synthetic dataset. The original data used to generate the real-world data (RWD) informed synthetic training set available to participants was obtained from a prospective, multisite, observational cohort study of children with suspected sepsis aged 6 months to 60 months at the time of admission to hospitals in Uganda. For this challenge, we have created a RWD-informed, synthetically generated training data set to reduce the risk of re-identification in this highly vulnerable population. The synthetic training set was generated from a random subset of the original data (full dataset A) of 2686 records (70% of the total dataset - training dataset B). All challenge solutions will be evaluated against the remaining 1235 records (30% of the total dataset - test dataset C).
Data Description: Report describing the comparison of univariate and bivariate distributions between the Synthetic Dataset and Test Dataset C. Additionally, a report showing the maximum mean discrepancy (MMD) and Kullback–Leibler (KL) divergence statistics. Data dictionary for the synthetic training dataset containing 148 variables.
NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days. Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator at sepsiscolab@bcchr.ca or visit our website.
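For readers unfamiliar with the two comparison statistics named in the report, the sketch below computes an RBF-kernel estimate of squared MMD and a histogram-based KL divergence for two toy samples; it is illustrative only and unrelated to the challenge's actual evaluation code.

```python
# Illustration of the two statistics mentioned above: RBF-kernel MMD^2 between two
# multivariate samples, and KL divergence between histograms of a single variable.
import numpy as np
from scipy.stats import entropy

def mmd_rbf(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD with an RBF kernel; x and y are (n, d) arrays."""
    def k(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq_dists)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def kl_from_samples(x: np.ndarray, y: np.ndarray, bins: int = 20) -> float:
    """KL(P||Q) estimated from two 1-D samples via shared histogram bins (with smoothing)."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    return float(entropy(p + 1e-9, q + 1e-9))  # scipy normalizes the histograms internally

rng = np.random.default_rng(0)
a, b = rng.normal(0.0, 1.0, (500, 3)), rng.normal(0.2, 1.0, (500, 3))
print("MMD^2:", mmd_rbf(a, b), "KL:", kl_from_samples(a[:, 0], b[:, 0]))
```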
Per- and polyfluoroalkyl substances (PFAS) are a class of man-made chemicals of global concern for many health and regulatory agencies due to their widespread use and persistence in the environment (in soil, air, and water), bioaccumulation, and toxicity. This concern has catalyzed a need to aggregate data to support research efforts that can, in turn, inform regulatory and statutory actions. An ongoing challenge regarding PFAS has been the shifting definition of what qualifies a substance to be a member of the PFAS class. There is no single definition for a PFAS, but various attempts have been made to utilize substructural definitions that either encompass broad working scopes or satisfy narrower regulatory guidelines. Depending on the size and specificity of PFAS substructural filters applied to the U.S. Environmental Protection Agency (EPA) DSSTox database, currently exceeding 900,000 unique substances, PFAS substructure-defined space can span hundreds to tens of thousands of compounds. This manuscript reports on the curation of PFAS chemicals and assembly of lists that have been made publicly available to the community via the EPA’s CompTox Chemicals Dashboard. Creation of these PFAS lists required the harvesting of data from EPA and online databases, peer-reviewed publications, and regulatory documents. These data have been extracted and manually curated, annotated with structures, and made available to the community in the form of lists defined by structure filters, as well as lists comprising non-structurable PFAS, such as polymers and complex mixtures. These lists, along with their associated linkages to predicted and measured data, are fueling PFAS research efforts within the EPA and are serving as a valuable resource to the international scientific community.
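As an illustration of what a substructural PFAS filter can look like in practice, the sketch below flags molecules containing a saturated carbon bearing at least two fluorines. The SMARTS pattern is a deliberately simplified example, not the EPA's actual filter set.

```python
# Simplified, illustrative PFAS substructure filter (not the EPA's filter definitions).
from rdkit import Chem

pfas_like = Chem.MolFromSmarts("[CX4](F)(F)")  # saturated carbon with >= 2 fluorines (-CF2-/-CF3)

examples = {
    "PFOA": "OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
    "toluene": "Cc1ccccc1",
}
for name, smi in examples.items():
    mol = Chem.MolFromSmiles(smi)
    print(name, "matches filter:", mol.HasSubstructMatch(pfas_like))
```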
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The historical settlement data compilation for Spain (HISDAC-ES) is a geospatial dataset consisting of over 240 gridded surfaces measuring the physical, functional, age-related, and evolutionary characteristics of the Spanish building stock. We scraped, harmonized, and aggregated cadastral building footprint data for Spain, covering over 12,000,000 building footprints including construction year attributes, to create a multi-faceted series of gridded surfaces (GeoTIFF format) describing the evolution of human settlements in Spain from 1900 to 2020, at 100 m spatial and 5-year temporal resolution. The dataset also contains aggregated characteristics and completeness statistics at the municipality level, in CSV and GeoPackage format.

UPDATE 08-2023: We provide a new, improved version of HISDAC-ES. Specifically, we fixed two bugs in the production code that caused an incorrect rasterization of the multitemporal BUFA layers and of the PHYS layers (BUFA, BIA, DWEL, BUNITS sum and mean). Moreover, we added decadal raster datasets measuring residential building footprint and building indoor area (1900-2020), and provide a country-wide, harmonized building footprint centroid dataset in GeoPackage vector data format.

File descriptions:
Datasets are available in three spatial reference systems:
- HISDAC-ES_All_LAEA.zip: Raster data in Lambert Azimuthal Equal Area (LAEA) covering all Spanish territory.
- HISDAC-ES_IbericPeninsula_UTM30.zip: Raster data in UTM Zone 30N covering all the Iberic Peninsula + Céuta and Melilla.
- HISDAC-ES_CanaryIslands_REGCAN.zip: Raster data in REGCAN-95, covering the Canary Islands only.
- HISDAC-ES_MunicipAggregates.zip: Municipality-level aggregates and completeness statistics (CSV, GeoPackage), in LAEA projection.
- ES_building_centroids_merged_spatjoin.gpkg: 7,000,000+ building footprint centroids in GeoPackage format, harmonized from the different cadastral systems, representing the input data for HISDAC-ES. These data can be used for sanity checks or for the creation of further, user-defined gridded surfaces.

Source data:
HISDAC-ES is derived from cadastral building footprint data, available from different authorities in Spain:
- Araba province: https://geo.araba.eus/WFS_Katastroa?SERVICE=WFS&VERSION=1.1.0&REQUEST=GetCapabilities
- Bizkaia province: https://web.bizkaia.eus/es/inspirebizkaia
- Gipuzkoa province: https://b5m.gipuzkoa.eus/web5000/es/utilidades/inspire/edificios/
- Navarra region: https://inspire.navarra.es/services/BU/wfs
- Other regions: http://www.catastro.minhap.es/INSPIRE/buildings/ES.SDGC.bu.atom.xml
Data source of municipality polygons: Centro Nacional de Información Geográfica (https://centrodedescargas.cnig.es/CentroDescargas/index.jsp)

Technical notes:
Gridded data
File nomenclature: ./region_projection_theme/hisdac_es_theme_variable_version_resolution[m][_year].tif
Regions:
- all: complete territory of Spain
- can: Canary Islands only
- ibe: Iberic peninsula + Céuta + Melilla
Projections:
- laea: Lambert azimuthal equal area (EPSG:3035)
- regcan: REGCAN95 / UTM zone 28N (EPSG:4083)
- utm: ETRS89 / UTM zone 30N (EPSG:25830)
Themes:
- evolution / evol: multi-temporal physical measurements
- landuse: multi-temporal building counts per land use (i.e., building function) class
- physical / phys: physical building characteristics in 2020
- temporal / temp: temporal characteristics (construction year statistics)
Variables: evolution
- budens: building density (count per grid cell area)
- bufa: building footprint area
- deva: developed area (any grid cell containing at least one building)
- resbufa: residential building footprint area
- resbia: residential building indoor area
Variables: physical
- bia: building indoor area
- bufa: building footprint area
- bunits: number of building units
- dwel: number of dwellings
Variables: temporal
- mincoy: minimum construction year per grid cell
- maxcoy: maximum construction year per grid cell
- meancoy: mean construction year per grid cell
- medcoy: median construction year per grid cell
- modecoy: mode (most frequent) construction year per grid cell
- varcoy: variety of construction years per grid cell
Variable: landuse
- Counts of buildings per grid cell and land use type.

Municipality-level data
- hisdac_es_municipality_stats_multitemporal_longform_v1.csv: Zonal sums of the gridded surfaces (e.g., number of buildings per year and municipality) in long form. Note that a value of 0 for the year attribute denotes the statistics for records without construction year information.
- hisdac_es_municipality_stats_multitemporal_wideform_v1.csv: Zonal sums of the gridded surfaces (e.g., number of buildings per year and municipality) in wide form. Note that a value of 0 for the year suffix denotes the statistics for records without construction year information.
- hisdac_es_municipality_stats_completeness_v1.csv: Missingness rates (in %) of the building attributes per municipality, ranging from 0.0 (attribute exists for all buildings) to 100.0 (attribute exists for none of the buildings) in a given municipality.

Column names for the completeness statistics tables:
- NATCODE: National municipality identifier*
- num_total: number of buildings per municipality
- perc_bymiss: Percentage of buildings with missing built year (construction year)
- perc_lumiss: Percentage of buildings with missing landuse attribute
- perc_luother: Percentage of buildings with landuse type "other"
- perc_num_floors_miss: Percentage of buildings without valid number of floors attribute
- perc_num_dwel_miss: Percentage of buildings without valid number of dwellings attribute
- perc_num_bunits_miss: Percentage of buildings without valid number of building units attribute
- perc_offi_area_miss: Percentage of buildings without valid official area (building indoor area, BIA) attribute
- perc_num_dwel_and_num_bunits_miss: Percentage of buildings missing both number of dwellings and number of building units attributes
The same statistics are available as a GeoPackage file including municipality polygons in Lambert azimuthal equal area (EPSG:3035).
*From the NATCODE, other regional identifiers can be derived as follows, e.g., NATCODE 34 01 04 04001: Country: 34; Comunidad autónoma (CA_CODE): 01; Province (PROV_CODE): 04; LAU code: 04001 (province + municipality code).
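To give a sense of how the GeoTIFF surfaces can be consumed, the sketch below opens one raster with rasterio and summarizes it; the file path merely follows the nomenclature described above and is an assumption, as is the choice of variable.

```python
# Sketch: reading one HISDAC-ES GeoTIFF surface and summarizing it with rasterio.
# The path is a hypothetical example following the stated file nomenclature.
import rasterio

path = "all_laea_evolution/hisdac_es_evolution_bufa_v1_100m_2000.tif"  # assumed name
with rasterio.open(path) as src:
    band = src.read(1, masked=True)          # masked array honoring the nodata value
    print("CRS:", src.crs, "cell size:", src.res)
    print("sum of cell values:", float(band.sum()))
    print("non-empty cells:", int((band > 0).sum()))
```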