I am kicking off a new series of posts that will capture research in information/resource discovery that I am pursuing as part of my role as Emerging Technologies Librarian. I recently finished reading the University of Oxford Resource Discovery report that came out in late 2015. Following are the highlights/notes from my reading. Any conversations/comments about these findings would be welcome.
Highlight, page 4
Good resource discovery tools, though, are not simply about making research easier and faster, but about facilitating the creation, preservation and discovery of knowledge by enabling new modes of research—especially across disciplines.
Highlight, page 5
Resource discovery is defined as any activity which makes it possible for an individual to locate information which he or she needs. Such material and such activities may be digital or analogue in nature.
Highlight, page 11
Firstly, resource discovery is very discipline-specific. While quite a few people do start their search at Google, many start at the library catalogue. Within certain disciplines, though, searchers will jump straight to the top resources in their field (arXiv for Physics, PubMed for Medicine, WestLaw or similarly specialized tools for Law). These findings are consistent with the well-documented understanding of the differences in ‘known-item’ versus subject searching, and emphasize that while both happen in all disciplines, the sciences are often dominated by the former. One notable exception to this is evidence-based medicine, where researchers are often engaged in very thorough subject-based searching.
Students need to learn how to search.
Discovery is not as simple as ‘novice’ vs. ‘expert’. Experts in their fields may use some of the same discovery tools and techniques as incoming students in certain circumstances. A professor in one discipline may, for example, use Wikipedia or basic Google searches to familiarize themselves with a new topic just as a new student might.
Asking people, and knowing who to ask, seems to make the difference between simply finding what you need to complete an assignment and becoming an expert researcher.
Highlight, page 12
None of the respondents said that they used open/public social media platforms for asking resource discovery questions.
Expertise in a domain requires two things: an understanding of the parameters of your domain and an understanding of the available and relevant resources in those areas.
Highlight, page 13
Interactive visualization and visual analytics should have significant roles in the next generation of resource discovery technology. It is important to understand what visualization is really for. The key is to save the user’s time and reduce their cognitive load.
Highlight, page 15
Expertise requires two things: an understanding of the parameters of your domain and an understanding of the available and relevant resources in those areas. ‘Experts’ have varying levels of confidence about their mastery of these domains, but all seem to have a clear sense of its ‘borders’. Therefore, resource discovery should at least in part be about helping people to identify and define these borders.
Highlight, pages 16-17
Using collection-level metadata, provide an interactive diagram that represents the range of collections: for example, works on paper vs. objects; printed vs. manuscripts; visual vs. textual; digitized vs. not digitized; catalogued vs. not catalogued. Dates could be contextualized within ranges of centuries.
Highlight, page 17
Provide an immediate visual guide to how many collections there are and their relative sizes, which ones are searchable electronically, which are catalogued in print indices and which are not yet catalogued.
Overlaps in collection provenance, topic or format could provide starting points for navigation.
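The collection-level overview described above rests on simple facet tallies: how many collections fall on each side of a distinction, and how large they are. A minimal sketch of that counting step, using invented collection records (the names, sizes, and facet fields are illustrative assumptions, not data from the report):

```python
from collections import Counter

# Invented collection-level records standing in for real metadata.
collections = [
    {"name": "Western MSS", "size": 12000, "digitized": False, "catalogued": True},
    {"name": "Maps", "size": 3000, "digitized": True, "catalogued": True},
    {"name": "Ephemera", "size": 800, "digitized": False, "catalogued": False},
]

def facet_summary(collections, facet):
    """Tally collection counts and total item counts for one facet --
    the raw numbers an interactive overview diagram would draw."""
    counts, sizes = Counter(), Counter()
    for c in collections:
        counts[c[facet]] += 1
        sizes[c[facet]] += c["size"]
    return counts, sizes

counts, sizes = facet_summary(collections, "digitized")
print(counts)  # how many collections are digitized vs. not
print(sizes)   # their relative sizes in items
```

The same function applied to "catalogued" (or any other binary facet) yields the numbers behind the "catalogued in print indices vs. not yet catalogued" view the report envisages.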
Highlight, page 18
Cross collection search would ideally allow discovery across libraries and museums, while still allowing users to narrow their search to one or more collections (and allowing the individual collections to brand their own home pages).
Index and expose existing item-level metadata.
Provide a clear sense of what is being missed when searching.
The object would be to visualize not only what is being found, but what is not being found due to the lack of item-level metadata.
Provide an immediate visual representation of what is available and in what format.
Highlight, page 19
Searchers at all levels rely on people — librarians, colleagues, supervisors, mentors or experts in their field — to find resources when other search methods have failed.
Provide a graph of the professional networks
Highlight, page 21
Providing a reliable source for upcoming talks by division or subject area would be heavily used and well-received.
Asking people (colleagues, librarians, curators) for help in locating resources was universal among the users interviewed.
For many (across the disciplines), the process of research is as important as the outcome.
Highlight, page 22
Exposing metadata for indexing by Google and Google Scholar would undoubtedly assist those who start their searches on the open web. Working with subject-specific repositories like arXiv and PubMed, and publishers like JSTOR would further assist in connecting users with specific resources.
Citation chaining is ubiquitous in all areas of research, across all disciplines. CrossRef and other individual tools and databases have gone some of the way towards making this easier, but citation chaining is still not well supported in discovery tools. Searchers in all disciplines use cited references as authoritative points of departure for finding more resources on a topic.
Facilitate precise ‘known-item’ searching.
Highlight, page 23
Investment in the analytics and data infrastructure to support evidence based decision making across collections.
UK National Archives Discovery page provides a good example of a portal for discovery services across diverse collections. It combines a cross-collection search with prominent featured ‘popular collections’ as well as research guides.
Highlight, page 24
Collections cannot be discovered using electronic search tools unless they have some sort of representative electronic description. High quality description is key to discovery.
Highlight, page 32
The vast majority of respondents (even those who are known experts in their field) did not have high confidence that they were “on top of” everything happening in their domains.
Of the incoming students (both graduate and undergraduate), very few tried to monitor new publications; most responded to suggestions from supervisors and instructors.
Most senior academics had developed mechanisms for coping with ‘keeping up’. These usually involved a combination of social media, informal communications (email from colleagues), conferences, Zetoc and table of contents alerts from specific journals, and/or more formal roles such as serving as editor or reviewer for relevant journals.
Those that do use social media, use it as a way to monitor interest groups, people, conferences, blogs in their field, or as a mechanism to promote their own projects or work.
Highlight, page 33
The more people expand the boundaries of what and where they are searching (with tacit and explicit assistance from mentors in their domains, and often through their own trial and error), the more expert they become in their field because they learn the boundaries as well as the tips and tricks for finding the most credible sources and the less well-known parts of the collections.
Highlight, page 40
The differences between [web-scales services] are not that significant… thinking that …there are some ‘good’ and some ‘bad’… is probably wrong. It’s not really about the product, it’s about the willingness of the vendor to overcome problems, and about their attitude to their customers.
In the categories of relevance ranking and successful full-text linking, the differences among EDS, Summon, and Metalib+ were statistically insignificant, and no one system greatly outperformed the others; Google Scholar fell behind in both categories. In the category of up-to-date results, the disparities were slightly greater, though not significant enough to warrant a recommendation: EDS scored best, followed by Metalib+, then Summon, with Google Scholar falling far behind the others, except for searches in the Sciences, where it scored higher than the other systems. Known-item searches are the only area where Google Scholar significantly outperformed the other systems across disciplines.
Highlight, page 44
All the participants use simple search as default. Users are now used to the Google model where ‘people do a search first and then filter afterwards’.
Users tend to use the default tab the most, regardless of what it is.
There are two ways in which search results are presented: a single list and a Bento box. Blacklight and VuFind customers use the Bento box approach, as it is part of the platform design. According to one university, the Bento box is popular with institutions that preferred a home-grown system. Another university mentioned that external research showed a 50/50 split for user preferences for a single list vs. the Bento box approach. When they were investigating the options, there was no clear winner. The choice of approach seems to largely depend on the underlying solution.
There is general agreement amongst the participants that a significant proportion of resource discovery starts outside the library, mainly in Google or Google Scholar. This is the primary reason why Utrecht University decided to concentrate its efforts on providing discovery support for users irrespective of where discovery starts, rather than investing significant resources and effort in acquiring and maintaining a resource discovery platform.
Highlight, page 45
Since the first Faculty Survey in 2000, we have seen faculty members steadily shifting towards reliance on network-level electronic resources, and a corresponding decline in interest in using locally provided tools for discovery.
Thinking the unthinkable: A library without a catalogue – reconsidering the future of discovery tools for the Utrecht University Library
Highlight, page 48
Optimization of records for search engines and linked data are seen as important but are not fully explored at present.
It is important to collect user behaviour statistics/data in new ways, such as monitoring Pinterest traffic to the organization’s collections.
Highlight, page 53
As a rule, people follow the Principle of Least Effort, preferring easy-to-get information over harder-to-get information, no matter how high the quality of the latter.
They want instant results and instant gratification, because a fundamental tenet is that convenience trumps quality.
“Berrypicking” describes an evolving strategy, refining how discovery is carried out as initial information discovered changes the conceptual model the user has of what they are looking for: “gathering information a piece at a time while the information need and search criteria continue to evolve.”
Users of discovery services tend to blithely repeat methods which gave acceptable results in the past. This behaviour is more as predicted by Gratification Theory, where past success in finding relevant material means that the same method is re-used in the future.
The Information Search Process described by Kuhlthau divides the discovery process into task initiation, topic selection, prefocus exploration, focus formulation, information collection, search closure, and starting writing. Anxiety decreases through this process, accompanying a progression from ambiguity to specificity.
Highlight, page 54
Students carry out resource discovery principally to satisfy immediate academic requirements (essays, examinations, etc.), and it is associated with a “certain amount of anxiety”. As a result, convenience and familiarity outweigh suitability as criteria for the services and methods used for discovery, and this is accompanied by “hesitancy” about asking tutors or librarians for assistance.
Six typical behaviours seen in university faculty members:
starting: reading reviews and review articles, initial exploratory searches, etc. – actions undertaken before the main discovery exercise
chaining: tracking citations forwards and backwards from a known item
browsing: semi-directed searching, e.g. using author names or looking along a shelf of physical items
differentiating: using differences between items to determine relevance
monitoring: current awareness of activity in research field
extracting: systematic analysis of a specific source (e.g. publisher’s web pages) to identify material of interest.
Highlight, page 55
Recent research shows “less overall difference between the physical sciences and the humanities” than expected. A series of case studies across the disciplines found that “from a broad sociological view, it is striking how much consistency there is across the fields and disciplines”.
Highlight, page 62
They [commercial systems] primarily use a Lucene index at the backend. Lucene provides search index (also known as “inverted index”) creation, storage, and management facilities with document ranking algorithms (e.g., Boolean, TF-IDF, Cosine, Fuzzy, and so on). These modern tools also use probabilistic models to map documents to terms and then rank results. Probabilities are generated using methods such as TF-IDF and other language models. Although the exact methods of commercial systems are unpublished, it is likely they use techniques similar to Lucene.
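As a rough illustration of the inverted index and TF-IDF ranking described above, here is a minimal sketch in plain Python (not Lucene itself; Lucene’s actual scoring is more elaborate). The toy corpus and the summed TF-IDF score are simplifying assumptions:

```python
import math
from collections import Counter, defaultdict

# Toy corpus standing in for catalogue records (illustrative only).
docs = {
    "d1": "medieval manuscripts in the bodleian library",
    "d2": "digitized manuscripts and printed books",
    "d3": "physics preprints on the arxiv repository",
}

# Build an inverted index: term -> {doc_id: term frequency}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def tfidf_search(query):
    """Rank documents by a TF-IDF score summed over the query terms."""
    n_docs = len(docs)
    scores = defaultdict(float)
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Rarer terms get a higher inverse-document-frequency weight.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(tfidf_search("digitized manuscripts"))
```

A query for “digitized manuscripts” ranks d2 first, since it matches both terms and “digitized” occurs in only one document, so it carries more weight than the commoner “manuscripts”.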
Highlight, page 63
In a library, many of the files are non-textual (e.g., media, archives, images, scanned copies of invoices, books, catalogues, etc.), so improved metadata-based retrieval is essential. As in Web search, an Enterprise Search Engine can use relevance feedback to learn from users and improve the ranking of its search results, since no search is guaranteed to find all relevant files and users cannot always specify their search criteria correctly. Most importantly, it would be useful to support search with effective visualization and interaction.
Highlight, page 64
Resource discovery in libraries is powered at the back end by either federated search or pre-harvested search. In both, users search via a single access point. In federated search, the query is run across multiple databases in real time; in pre-harvested search, it is run against pre-harvested indexes of content, with weighting applied to help with relevancy ranking. The single search screen lets users retrieve results from multiple databases at once. In both cases, the results are returned, often as a ranked list in a paginated format. Although a single search box makes retrieval easier, users report difficulty comprehending the list of results returned to them.
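The federated pattern can be sketched in a few lines: one query is fanned out to several sources and the results are merged into a single ranked list. The source functions and scores below are invented stand-ins, not any real library backend:

```python
def search_catalogue(query):
    # Stand-in for a live query against the library catalogue.
    return [("Catalogue: History of Science", 0.9)]

def search_archive(query):
    # Stand-in for a live query against an archival database.
    return [("Archive: Science Correspondence", 0.7),
            ("Archive: Lab Notebooks", 0.4)]

def federated_search(query, sources):
    """Fan the query out to every source, then merge into one ranked list."""
    merged = []
    for source in sources:
        merged.extend(source(query))
    # A pre-harvested system would instead rank one unified index;
    # here we simply sort the merged results by each source's own score.
    return sorted(merged, key=lambda item: item[1], reverse=True)

results = federated_search("science", [search_catalogue, search_archive])
for title, score in results:
    print(f"{score:.1f}  {title}")
```

The merge step is where the comprehension problem the report notes arises: each source scores on its own scale, so a naive sort like this one mixes incomparable relevance signals.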
Examples include tag clouds and inkblots that display categorical and statistical information about documents, revealing signature patterns for comparative analysis; visualizations of changes in themes and topics over time within a collection of documents; visualizations of the multi-faceted relationships of keywords within documents or across a large collection; a visual search engine for exploring Wikipedia through its semantic relationships; and a visual analytics tool for exploring academic publications through citations, ranking, summarization, and automatic clustering.
Highlight, page 65
Among all of the scholarly work that has been published by the visualization and information-retrieval communities on navigation and exploration of Web information space, the VisGets, Fluid Views, and PivotPaths are the most significant.
Highlight, page 66
Another example is using a Voronoi treemap to organize search results.
Highlight, page 70
Glyphs are visual entities composed of several visual channels representing multivariate qualitative and/or quantitative attributes.
Highlight, page 73
Despite its many advantages, our interviewees mentioned several drawbacks to Google Scholar: (i) it returns a vast number of results from which it is difficult to filter out the relevant articles; (ii) the sheer volume of results makes it hard to detect the omission of relevant articles, and there is concern that the list may not be accurate and up to date; (iii) as a keyword-based search, results can be skewed by the use of “incorrect” search terms; (iv) it cannot search across PDF documents for embedded scientific data; and (v) it lacks the subject categorizations (or tagging) that would make it possible to carry out a “top-down search”, drilling down into a particular subject or retrieving articles on similar (or associated) topics. Other drawbacks are directed more at Google itself: results can sometimes be overwhelmed by commercial interests (e.g., adverts), and there are questions about the validity and provenance of the information and whether it can be trusted.
Highlight, page 75
A number of interviewees explained that the search terms they used typically came from conversations with colleagues, peers, and collaborators. Some search terms were based on their experience of reading through articles, as well as heuristic refinement. Library orientations were also mentioned as a source of recommendations for search terms and for domain-specific databases to search.
Highlight, page 77
Most of the interviewees usually search for the latest published articles and for specialized information relevant to their fields. For science-based subjects, interviewees would search for algorithms and experimental methods. Other information searched for includes patents, talks, overview explanations of specific topics, other researchers in a similar field, and who is doing what in specific subject areas.
Highlight, page 78
[Search system should] generate the most common and related keywords and term-based subjects. It should also bridge the gaps between different fields.
Until you have captured all of the “synonyms” of a search term, you could be missing many references that may be important.
Generate a view of the co-citations of articles, visualizing the strength of the connections between them.
Display the timeline
Show the evolution of a paper by seeing who referenced it and whom it referenced.
Integrate publication and social metadata, so that you could receive social recommendations on articles and books and find articles based on what other researchers in similar fields are accessing.
View the author’s profile and see their publications list.
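The co-citation item in the wish list above has a simple core: two articles are connected whenever the same paper cites both, and the number of such co-citing papers is the connection strength a visualization would map to edge thickness. A minimal sketch of that counting step, with an invented citation map:

```python
from itertools import combinations
from collections import Counter

# Toy citation data: each citing paper -> the articles it references.
# All identifiers are invented for illustration.
references = {
    "p1": ["a", "b", "c"],
    "p2": ["a", "b"],
    "p3": ["b", "c"],
}

def co_citation_counts(references):
    """Count how often each pair of articles is cited together.
    Each count is the connection strength between the two articles."""
    counts = Counter()
    for cited in references.values():
        # Every unordered pair in one reference list is one co-citation.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts

print(co_citation_counts(references).most_common())
```

Here articles a and b are co-cited twice (by p1 and p2), so their edge would be drawn thicker than the single-count edge between a and c.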
Highlight, page 79
Search and extract embedded information inside PDF documents
An aggregator that would integrate the different platforms, such as Mendeley, Google Scholar, and JSTOR.
A tool that could help promote collaboration between researchers by showing researchers’ institutional affiliations based on their expertise and providing a chat system to communicate with them.
Construct a visual analytics tool that would complement the existing search engines and reference managers by implementing the items mentioned in our interviewees’ wish list, and more. Perhaps such a tool could help bring back “physical shelf browsing and serendipitous discovery”.
Highlight, page 88
Today’s scholars can take advantage of “altmetrics” both to measure the impact of their own work and as an aid for the discovery of well-regarded research articles. Altmetrics measure the informal citations of articles in various forms of social media, immediately giving a picture of the importance of an article which rounds out the information given by traditional citation counting (as well as being quicker to respond to new citations, and applicable to a wider set of academic outputs by including such things as research data).