LongEmbed: Extending Embedding Models for Long Context Retrieval (2024)

Dawei Zhu$^{\eta}$  Liang Wang$^{\pi}$  Nan Yang$^{\pi}$  Yifan Song$^{\eta}$  Wenhao Wu$^{\eta}$
Furu Wei$^{\pi}$  Sujian Li$^{\eta}$
$^{\eta}$Peking University  $^{\pi}$Microsoft Corporation
Work done during Dawei’s internship at MSR Asia. Prof. Sujian Li is the corresponding author.

Abstract

Embedding models play a pivotal role in modern NLP applications such as IR and RAG. While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, which keeps them out of application scenarios requiring long inputs such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Based on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context is 512 or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods, such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5Base-4k and E5-RoPEBase, along with the LongEmbed benchmark.


1 Introduction

Text embeddings are vector representations of natural language that encode its semantic information. They play a pivotal role in various natural language processing (NLP) tasks, including information retrieval (IR) and retrieval-augmented generation (RAG). However, embedding models for producing these vector representations still operate within a very narrow context window, typically 512 input tokens (Wang et al., 2022; Xiao et al., 2023; Ni et al., 2022). This narrow context window has greatly hindered their application in scenarios requiring long inputs, such as long Wikipedia articles and meeting transcripts (Saad-Falcon et al., 2024).

Previous efforts that train a long context embedding model from scratch incur significant computational overhead, due to the combined demand for large batch sizes and long sequences. For example, Chen et al. (2024) utilized 96 A100 GPUs to train BGE-M3, which supports 8k context. Meanwhile, there have been many successes in extending the context window of existing LLMs in a plug-and-play way or via efficient fine-tuning, pushing their context from 4k to 128k (Xiong et al., 2023) and even 2 million tokens (Ding et al., 2024). Motivated by this, instead of training long context embedding models from scratch, this paper explores context window extension of existing embedding models.

First, we examine the capability of existing embedding models in processing long context. Retrieval is selected as the proxy task, as it closely mirrors real-world application scenarios. While there have been some retrieval benchmarks such as BEIR (Thakur et al., 2021) and LoCo (Saad-Falcon et al., 2024), we identify two major limitations with these existing benchmarks: 1) limited document length, 2) biased distribution of target information. To overcome this, we introduce the LongEmbed benchmark, which integrates two synthetic tasks to enable flexible control over document length, and four real tasks featuring dispersed target information. Results on LongEmbed indicate huge room for improvement in current embedding models.

Based on this, we explore plug-and-play strategies to extend embedding models, including parallel context windows, reorganizing position ids, and position interpolation. Comprehensive experiments show that these strategies can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context is 512 or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show the possibility of harvesting further improvements via fine-tuning while strictly preserving original behavior within the short context. In this way, we have extended E5Base (Wang et al., 2022) from 512 to 4k (see Figure 1(c)).

For models utilizing RoPE (Su et al., 2021), substantial enhancements on LongEmbed are observed when employing methods that fully leverage RoPE's advantages, such as NTK (Peng & Quesnelle, 2023) and SelfExtend (Jin et al., 2024). As illustrated in Figure 1(b) and 1(c), leveraging NTK extends the context window of E5-Mistral to 32k, achieving close-to-perfect accuracy on passkey retrieval and state-of-the-art performance on LongEmbed. Further, for fair comparison of APE / RoPE-based embedding models, we pre-train E5-RoPE following the training procedure and data of E5. Thorough comparison of E5 and E5-RoPE reveals the superiority of RoPE-based embedding models in context window extension.

To facilitate future research in long context embedding models, we release E5Base-4k, E5-RoPEBase, and the LongEmbed benchmark. E5Base-4k is further fine-tuned from E5Base to support 4k context, while strictly preserving original behavior for inputs not exceeding 512 tokens. E5-RoPEBase follows the same training procedure as E5Base, except for the substitution of APE with RoPE. It is released to facilitate comparison between APE-based and RoPE-based embedding models. Furthermore, we have integrated LongEmbed into MTEB (Muennighoff et al., 2023) to make evaluation more convenient.

2 Related Work

Text Embedding Models. Text embeddings are continuous, low-dimensional vector representations of text that encode semantic information, laying the foundation of numerous NLP applications. Early attempts at text embeddings include latent semantic indexing (Deerwester et al., 1990) and weighted averages of word embeddings (Mikolov et al., 2013). Modern embedding models (Wang et al., 2022; Xiao et al., 2023; Neelakantan et al., 2022) exploit supervision from labeled query-document pairs, adopting a multi-stage training paradigm, where they are first pre-trained on large-scale weakly-supervised text pairs using contrastive loss, then fine-tuned on small-scale but high-quality datasets. More recently, Muennighoff et al. (2024) explore the combination of generative and embedding tasks on LLMs, introducing GritLM, which harvests improvements in both aspects.

Existing efforts in developing long-context embedding models typically involve first obtaining a long-context backbone model, either by pre-training with long inputs from scratch (Günther et al., 2023; Nussbaum et al., 2024; Chen et al., 2024) or using existing ones (Wang et al., 2023b), followed by training the backbone model to produce embeddings. Instead, this paper endows existing embedding models with the ability to handle long context through context window extension.

Context Window Extension for Large Language Models. Due to the high cost of pre-training an LLM from scratch, there have been many efforts towards extending the context window of existing LLMs in a plug-and-play manner. We categorize these efforts as follows: 1) Divide-and-conquer, which involves segmenting long inputs into short chunks, processing each chunk with the model, and aggregating the results, as demonstrated by PCW (Ratner et al., 2023); 2) Position reorganization, which reorganizes position ids to boost length extrapolation, as exemplified by SelfExtend (Jin et al., 2024), DCA (An et al., 2024), and others; 3) Position interpolation, which introduces new position embeddings by interpolating existing ones, as in PI (Chen et al., 2023), NTK (Peng & Quesnelle, 2023), YaRN (Peng et al., 2023), and Resonance RoPE (Wang et al., 2024a). Our paper thoroughly investigates these three lines of methods on embedding models. We also acknowledge other efforts for extending the context window, such as prompt & KV compression (Jiang et al., 2023; Ge et al., 2023; Zhang et al., 2024a) and memory-based transformers (Wang et al., 2024b; Xiao et al., 2024). However, the former is not applicable to bidirectional attention, and the latter requires complex mechanisms for accessing encoded content, hence we do not experiment with these two categories.

In addition to their plug-and-play usability, further fine-tuning on top of these methods with long training samples has been proven to yield better performance (Xiong et al., 2023; Fu et al., 2024; Zhang et al., 2024b; Yen et al., 2024). To address the overhead of training on long inputs and the scarcity of extremely long training data, a line of research investigates simulating long inputs within short context, including Randomized Positions (Ruoss et al., 2023) and Positional Skip-wise (PoSE) training (Zhu et al., 2023). This paper also leverages these efforts to synthesize long training samples from the original training data, facilitating further fine-tuning on top of plug-and-play methods.

3 The LongEmbed benchmark

In this section, we first identify two limitations of existing retrieval benchmarks for evaluating long-context capabilities (Section 3.1). Then, we introduce the retrieval tasks adopted in LongEmbed, including both synthetic ones (Section 3.2) and real ones (Section 3.3).

3.1 Examination of Existing Retrieval Benchmarks

There are two main desiderata for curating a benchmark for long context retrieval. First, the candidate documents should be long enough. Second, the target information needed to answer the user query should be distributed as uniformly across the document as possible. This prevents embedding models from focusing solely on specific parts, such as the beginning (Coelho et al., 2024), to achieve unreasonably high scores. Based on these criteria, we evaluate existing benchmarks for text retrieval as follows:


BEIR Benchmark (Thakur et al., 2021) is a collection of 18 information retrieval datasets, ranging across ad-hoc web search, question answering, fact verification, duplicate question retrieval, etc. However, documents in this benchmark contain fewer than 300 words on average (see Table 5 in the Appendix), making it unsuitable for measuring long context retrieval, which usually involves documents of thousands or tens of thousands of words.

LoCo Benchmark (Saad-Falcon et al., 2024) consists of 12 retrieval tasks that require long context reasoning, spanning diverse domains such as law, science, finance, etc. However, we show that it still suffers from biased distribution of key information. Figure 2 presents results of E5Base on the 8 LoCo tasks that are publicly available. With only 512 context length, E5Base achieves >85% nDCG scores on 3 out of 8 retrieval tasks. This severely biased distribution of target information undermines the benchmark's ability to reflect model performance as context length increases.

3.2 Synthetic Tasks in LongEmbed

First, we tailor the passkey retrieval and needle-in-a-haystack retrieval tasks, originally designed for LLMs, to measure the context length of embedding models as follows:

Personalized Passkey Retrieval. Passkey retrieval (Mohtashami & Jaggi, 2023) requires LLMs to recover a random passkey hidden within a long document comprising garbage information. For embedding models, we adopt the personalized passkey retrieval task proposed by Wang et al. (2023b), where each document contains a unique person name and his/her passkey at a random position. The goal is to retrieve the document containing the given person's passkey from all candidate documents.


Needle-in-a-haystack Retrieval. While passkey retrieval surrounds key information with garbage sentences, needle-in-a-haystack retrieval (Kamradt, 2023) randomly inserts key information into an arbitrary position of a long essay, making the task more challenging. To tailor this task for embedding models, we instruct GPT-4 to generate 100 facts covering a variety of domains including physics, history, geometry, art, etc., and 100 corresponding queries. The facts are treated as needles and randomly inserted into the PaulGrahamEssay to form 100 candidate documents. The task is to retrieve, given the query, the document that contains the corresponding needle.

The advantage of synthetic data is that we can flexibly control context length and distribution of target information. For both tasks, we evaluate a broad context range of $\{0.25, 0.5, 1, 2, 4, 8, 16, 32\} \times 1024$ tokens. (Footnote 3: Since token counts vary across tokenizers, we use a rough estimation that 1 token = 0.75 word, and constrain the word counts to not exceed $\{0.25, 0.5, 1, 2, 4, 8, 16, 32\} \times 1024 \times 0.75$.) For each context length, we include 50 test samples, each comprising 1 query and 100 candidate documents. (Footnote 4: The original version of personalized passkey retrieval uses different candidate documents for each query, resulting in 50 queries and 5,000 documents to encode for each context length. To speed up evaluation, we share the candidate documents for different queries within each context length.) In this way, we can measure the effective context size of embedding models for up to 32k tokens. Examples for both synthetic tasks are presented in Figure 3. For the passkey test, the <prefix> and <suffix> are repeats of "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again." For the needle test, the <prefix> and <suffix> form a long essay.
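The word budgets implied by the context range above, and the way a passkey document might be assembled from the repeated filler text, can be sketched as follows. This is only a minimal illustration: the key sentence wording and the function names `word_budget` and `build_passkey_document` are assumptions made for this example, not the benchmark's actual template or code.

```python
import random

# Filler text quoted in the paper for the passkey test.
FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again.")
CONTEXT_LENGTHS_K = [0.25, 0.5, 1, 2, 4, 8, 16, 32]


def word_budget(length_k, words_per_token=0.75):
    """Maximum number of words for a context of length_k * 1024 tokens."""
    return int(length_k * 1024 * words_per_token)


def build_passkey_document(name, passkey, length_k, rng):
    """Hide a key sentence at a random position inside repeated filler text."""
    # Hypothetical wording of the key sentence; the real template may differ.
    key_sentence = f"{name}'s pass key is {passkey}."
    budget = word_budget(length_k) - len(key_sentence.split())
    n_repeats = max(budget // len(FILLER.split()), 0)
    n_before = rng.randint(0, n_repeats)  # filler repeats placed before the key
    parts = [FILLER] * n_before + [key_sentence] + [FILLER] * (n_repeats - n_before)
    return " ".join(parts)


budgets = [word_budget(k) for k in CONTEXT_LENGTHS_K]
doc = build_passkey_document("Alice", "12345", 0.25, random.Random(0))
```

Because the insertion point is sampled uniformly over the filler repeats, the target information is spread evenly over document positions across many samples, which is the property the benchmark is after.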

3.3 Real Tasks in LongEmbed

While synthetic tasks offer flexibility in manipulating context length and distributing key information, they still differ from real-world scenarios. To conduct a comprehensive evaluation, we have tailored the following long-form QA and summarization tasks for long context retrieval. Note that for QA and summarization datasets, we use the questions and summaries as queries, respectively.

NarrativeQA (Kočiský et al., 2018) is a QA dataset comprising long stories averaging 50,474 words and corresponding questions about specific content such as characters and events. As these details are dispersed throughout the story, models must process the entire long context to get correct answers.

2WikiMultihopQA (Ho et al., 2020) is a multi-hop QA dataset featuring questions with up to 5 hops, synthesized through manually designed templates to prevent shortcut solutions. This necessitates the ability to process and reason over long context, ensuring that answers cannot be obtained by merely focusing on a short span within the document.

QMSum (Zhong et al., 2021) is a query-based meeting summarization dataset that requires selecting and summarizing relevant segments of meetings in response to queries. Due to the involvement of multiple participants and topics in the meeting, summarization regarding specific queries naturally requires aggregating information dispersed throughout the entire text.

SummScreenFD (Chen et al., 2022) is a screenplay summarization dataset comprising pairs of TV series transcripts and human-written summaries. Similar to QMSum, its plot details are scattered throughout the transcript and must be integrated to form succinct descriptions in the summary.

Table 1 presents the overall statistics of LongEmbed. Considering that computational cost increases quadratically with input length, we intentionally restrict the number of candidate documents in each task to not exceed $10^{3}$. In this way, we can efficiently evaluate the basic long context capabilities of embedding models. For further elaboration on the source and examples for each dataset, please refer to Appendix C.

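| Dataset | Domain | # Queries | # Docs | Avg. Query Words | Avg. Doc Words |
|---|---|---|---|---|---|
| Real Tasks | | | | | |
| NarrativeQA | Literature, Film | 10,449 | 355 | 9 | 50,474 |
| QMSum | Meeting | 1,527 | 197 | 71 | 10,058 |
| 2WikimQA | Wikipedia | 300 | 300 | 12 | 6,132 |
| SummScreenFD | ScreenWriting | 336 | 336 | 102 | 5,582 |
| Synthetic Tasks | | | | | |
| Passkey | Synthetic | 400 | 800 | 11 | - |
| Needle | Synthetic | 400 | 800 | 7 | - |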

4 Methodology

4.1 Absolute Position Embedding (APE) & Rotary Position Embedding (RoPE)

Absolute Position Embedding (APE) stands as the predominant positional encoding strategy for embedding models, as the majority of them follow the BERT architecture (Devlin et al., 2019). APE-based models first embed absolute position ids into position vectors and add token embeddings to their corresponding position vectors, before feeding them to a stack of transformer layers.

Rotary Position Embedding (RoPE) is the most pervasive position embedding strategy in the era of LLMs, adopted by LLaMA (Touvron et al., 2023), Gemma (Team et al., 2024), QWen (Bai et al., 2023a), etc. It encodes position information of tokens with a rotation matrix that naturally incorporates explicit relative position dependency. To elucidate, given a hidden vector $\bm{h} = [h_0, h_1, \ldots, h_{d-1}]$ of dimension $d$, and a position index $m$, RoPE operates as follows:

$$f(\bm{h}, m) = \left[(h_0 + \mathrm{i}h_1)e^{\mathrm{i}m\theta_0},\ (h_2 + \mathrm{i}h_3)e^{\mathrm{i}m\theta_1},\ \ldots,\ (h_{d-2} + \mathrm{i}h_{d-1})e^{\mathrm{i}m\theta_{d/2-1}}\right] \tag{1}$$

where $\theta_j = 10000^{-2j/d}$, $j \in \{0, 1, \ldots, d/2-1\}$, and $\mathrm{i} = \sqrt{-1}$ is the imaginary unit. Unlike APE, which is applied directly to the input vector $\bm{x}$, RoPE is applied to the query and key vectors at each layer. The attention score $a(\bm{q}, \bm{k})$ between a query $\bm{q}$ at position $m$ and a key $\bm{k}$ at position $n$ is defined as:

$$\begin{aligned} a(\bm{q}, \bm{k}) &= \mathrm{Re}\langle f(\bm{q}, m), f(\bm{k}, n)\rangle = \mathrm{Re}\left[\sum_{j=0}^{d/2-1} (q_{2j} + \mathrm{i}q_{2j+1})(k_{2j} - \mathrm{i}k_{2j+1})\, e^{\mathrm{i}(m-n)\theta_j}\right] \\ &:= g(\bm{q}, \bm{k}, (m-n)\bm{\theta}) \end{aligned} \tag{2}$$

where $g(\cdot)$ is an abstract mapping function exclusively dependent on $\bm{q}$, $\bm{k}$, and $(m-n)\bm{\theta}$.
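The relative-position property stated by Equation 2 can be checked numerically. The sketch below follows Equations 1 and 2 directly, using Python's built-in complex numbers; the function names `rope` and `attention_score` and the toy vectors are illustrative choices, not part of any released implementation.

```python
import cmath


def rope(h, m, base=10000.0):
    """Equation 1: rotate each pair (h[2j], h[2j+1]) by the angle m * theta_j."""
    d = len(h)
    rotated = []
    for j in range(d // 2):
        theta_j = base ** (-2 * j / d)
        rotated.append(complex(h[2 * j], h[2 * j + 1]) * cmath.exp(1j * m * theta_j))
    return rotated


def attention_score(q, k, m, n):
    """Equation 2: real part of the inner product of the rotated query and key."""
    fq, fk = rope(q, m), rope(k, n)
    return sum((a * b.conjugate()).real for a, b in zip(fq, fk))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
near = attention_score(q, k, 5, 2)      # m - n = 3
far = attention_score(q, k, 103, 100)   # m - n = 3, shifted by 98 positions
same = attention_score(q, k, 7, 7)      # m - n = 0 reduces to the plain dot product
```

Here `near` and `far` agree up to floating-point error because both pairs of positions have the same offset $m - n$, which is what makes position reorganization and frequency rescaling natural operations for RoPE-based models.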

4.2 Context Window Extension for APE-based Models

As delineated in Section 2, training-free context extension strategies applicable to embedding models can be classified into 3 categories: 1) Divide-and-conquer; 2) Position reorganization; 3) Position interpolation. In this section, we introduce methods from each of these categories to assess their applicability to embedding models. Further fine-tuning on top of these methods is also included. Let $L_o$ represent the original context length, $\mathcal{D} = \{x_1, x_2, \ldots, x_{L_t}\}$ denote a long document of target context length $L_t$, and $s = \lceil L_t / L_o \rceil$ indicate the context scaling factor. The context extension methods we investigated are described below:


Parallel Context Windows (PCW). To process a long document with a short-context model, PCW divides the long document into multiple short chunks, processes each chunk in parallel, and aggregates their results (Ratner et al., 2023; Yen et al., 2024). In practice, we first segment $\mathcal{D}$ into chunks of $L_o$ tokens, then average over each chunk's embeddings to get the embedding of $\mathcal{D}$. For simplicity, we set the overlap between adjacent chunks to 0, except for the last chunk, which conditionally overlaps with the preceding chunk to ensure it contains $L_o$ tokens.
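The chunking and averaging described for PCW can be sketched as follows. This is an illustrative sketch only: `embed_fn` stands in for any short-context embedding model, and both function names are hypothetical.

```python
def chunk_token_ids(token_ids, chunk_len):
    """Split into non-overlapping chunks; shift the last chunk left so it is full-length."""
    if len(token_ids) <= chunk_len:
        return [list(token_ids)]
    starts = range(0, len(token_ids) - chunk_len + 1, chunk_len)
    chunks = [list(token_ids[i:i + chunk_len]) for i in starts]
    if len(token_ids) % chunk_len != 0:
        # The final chunk overlaps with the preceding one to reach chunk_len tokens.
        chunks.append(list(token_ids[-chunk_len:]))
    return chunks


def pcw_embed(token_ids, chunk_len, embed_fn):
    """Embed each chunk independently, then average the chunk embeddings."""
    chunk_embs = [embed_fn(chunk) for chunk in chunk_token_ids(token_ids, chunk_len)]
    dim = len(chunk_embs[0])
    return [sum(emb[i] for emb in chunk_embs) / len(chunk_embs) for i in range(dim)]


def toy_embed(chunk):
    """Toy stand-in for a real model: a 2-dimensional 'embedding' of a chunk."""
    return [float(sum(chunk)), float(len(chunk))]


chunks = chunk_token_ids(list(range(10)), 4)
doc_embedding = pcw_embed(list(range(10)), 4, toy_embed)
```

With 10 tokens and a chunk length of 4, the third chunk covers tokens 6 to 9 and therefore shares two tokens with the second chunk, matching the conditional overlap described above.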

Grouped Positions (GP) & Recurrent Positions (RP). Dividing inputs into chunks and processing them separately sacrifices the interaction between chunks. By contrast, position reorganization accommodates longer context by reusing the original position ids. To be specific, we experiment with two simple strategies: Grouped Positions and Recurrent Positions. The former groups the original position ids as $f_{gp}(pid) \rightarrow \lfloor pid / s \rfloor$, while the latter assigns the position ids recurrently within the range $\{0, 1, \ldots, L_o - 1\}$, formulated as $f_{rp}(pid) \rightarrow pid \bmod L_o$.
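The two mappings can be written in a few lines. The sketch below applies $f_{gp}$ and $f_{rp}$ with $s = \lceil L_t / L_o \rceil$; the function names are illustrative.

```python
import math


def scaling_factor(target_len, original_len):
    """Context scaling factor s = ceil(L_t / L_o)."""
    return math.ceil(target_len / original_len)


def grouped_position_ids(target_len, original_len):
    """GP: every s consecutive tokens share one original position id."""
    s = scaling_factor(target_len, original_len)
    return [pid // s for pid in range(target_len)]


def recurrent_position_ids(target_len, original_len):
    """RP: position ids cycle through 0 .. L_o - 1."""
    return [pid % original_len for pid in range(target_len)]


gp = grouped_position_ids(8, 4)    # [0, 0, 1, 1, 2, 2, 3, 3]
rp = recurrent_position_ids(8, 4)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

In both cases every remapped id stays below $L_o$, so the model only ever sees position embeddings it was trained on.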

Linear Position Interpolation (PI). Instead of reusing position ids, Chen et al. (2023) introduce new position embeddings via linear interpolation of existing ones. To apply PI to APE-based models, we map the position ids as $f_{pi}(pid) \rightarrow pid / s$, and assign embeddings for non-integer positions as the linear interpolation of the embeddings of the neighboring integers. In practice, we first extend the original position embedding matrix $E_o \in \mathbb{R}^{L_o \times d}$ into $E_t \in \mathbb{R}^{L_t \times d}$, where $d$ stands for hidden size. Next, we assign $E_t[i \cdot s] = E_o[i]$, $i \in \{0, 1, \ldots, L_o - 1\}$. For a non-integer position id $j$ between $i$ and $i+1$, we determine the embedding as follows: $E_t[s \cdot j] = (i + 1 - j)\,E_t[i \cdot s] + (j - i)\,E_t[(i+1) \cdot s]$.
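A minimal sketch of this matrix extension, using plain Python lists for the embedding rows, is given below. One detail is an assumption of this example rather than something stated above: target positions past $(L_o - 1) \cdot s$ have no right-hand neighbor to interpolate with, so the sketch simply reuses the last original row there.

```python
def interpolate_position_embeddings(E_o, s):
    """Extend an L_o x d position embedding matrix to (L_o * s) x d by linear interpolation."""
    L_o = len(E_o)
    E_t = []
    for p in range(L_o * s):
        j = p / s        # fractional position in the original index space
        i = int(j)       # left neighbor (floor, since j >= 0)
        if i + 1 >= L_o:
            # Assumption: no right neighbor exists, so reuse the last original row.
            E_t.append(list(E_o[L_o - 1]))
            continue
        w = j - i        # weight of the right neighbor; (i + 1 - j) == 1 - w
        E_t.append([(1 - w) * a + w * b for a, b in zip(E_o[i], E_o[i + 1])])
    return E_t


E_o = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
E_t = interpolate_position_embeddings(E_o, 2)
```

Rows at multiples of $s$ are exact copies of the original embeddings, while the rows in between are new; these in-between rows are the natural candidates for further tuning.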

Further Tuning. Except for PCW, which divides long texts into smaller blocks and processes them separately, GP, RP, and PI can all be seen as extending the position embedding matrix. Since APE-based models assign an independent vector to each position, we can freeze the original model parameters while updating only the newly added position embeddings. In this way, we can strictly maintain model ability within 512 context, while harvesting further performance gains in handling long context as a free lunch. Specifically, further fine-tuning on top of RP and PI is explored in this paper, as illustrated in Figure 4 (Right). Since the traditional training data for embedding models are short queries and passages not exceeding 512 tokens, we manipulate position ids to simulate long training samples, as proposed in Zhu et al. (2023). See Appendix B for details of further fine-tuning.
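To give a feel for how position ids can be manipulated so that a short sample covers a long position range, here is a rough two-chunk sketch in the spirit of skip-wise training. It is an assumption-laden illustration: the two-chunk split, the uniform sampling of the skip, and the function name are choices made for this example, and the actual procedure is the one specified in Appendix B and Zhu et al. (2023).

```python
import random


def simulate_long_position_ids(seq_len, target_len, rng):
    """Assign position ids in [0, target_len) to a short sequence of seq_len tokens.

    The sequence is cut into two chunks; the second chunk's ids are shifted by a
    random skip so that training touches positions beyond the original range.
    """
    if seq_len >= target_len or seq_len < 2:
        return list(range(seq_len))
    split = rng.randint(1, seq_len - 1)          # boundary between the two chunks
    skip = rng.randint(0, target_len - seq_len)  # keeps the largest id below target_len
    first = list(range(split))
    second = [pid + skip for pid in range(split, seq_len)]
    return first + second


position_ids = simulate_long_position_ids(512, 4096, random.Random(0))
```

Each chunk keeps consecutive ids internally, so local order is preserved while the gap between chunks exposes the model to large position offsets at the cost of a 512-token forward pass.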

4.3 Context Window Extension for RoPE-based Models

For RoPE-based models, we further explore SelfExtend and NTK, which respectively advance over GP and PI, harnessing the inherent advantages of RoPE. Since there is no simple strategy for further training that exactly maintains original performance, as there is for APE, we leave comprehensive exploration of training-based context window extension for RoPE-based models for future work.

SelfExtend (SE). Compared with APE, RoPE operates on the query and key vectors at each layer to encode relative positions, offering enhanced flexibility for position reorganization. For each token, instead of assigning grouped relative positions to all other tokens, SelfExtend (Jin et al., 2024) re-introduces normal relative positions within the nearest neighbor window $w$, achieving improved performance. For example, consider a document of 10 tokens $\{x_0, x_1, \ldots, x_9\}$ with a neighbor window size $w = 4$ and a group size $g = 2$. The relative positions for $x_0$ are $\{0, 1, 2, 3, 4, 4, 5, 5, 6, 6\}$. For $x_4$, the relative positions of the other tokens are $\{-4, -3, -2, -1, 0, 1, 2, 3, 4, 4\}$.
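The worked example above can be reproduced with a small function. The formula below (grouped positions shifted by $w - \lfloor w/g \rfloor$ outside the neighbor window, applied symmetrically in both directions) is our reading of SelfExtend chosen to match this example; it is a sketch, not the reference implementation.

```python
def self_extend_relative_position(m, n, w, g):
    """Relative position of token n as seen from token m under SelfExtend.

    Inside the neighbor window (|n - m| < w) the normal relative position is kept.
    Outside, positions are grouped by g and shifted so that they continue from the
    edge of the window.
    """
    d = n - m
    if abs(d) < w:
        return d
    grouped = abs(n // g - m // g) + (w - w // g)
    return grouped if d > 0 else -grouped


from_x0 = [self_extend_relative_position(0, n, w=4, g=2) for n in range(10)]
from_x4 = [self_extend_relative_position(4, n, w=4, g=2) for n in range(10)]
```

Under these settings `from_x0` and `from_x4` match the two sets listed in the example, and the largest relative distance over 10 tokens is 6 rather than 9.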

NTK-Aware Interpolation (NTK). Given a scaling factor $s$, PI proportionally down-scales position index $m$ to $m/s$. In this way, the attention score $a(\bm{q}, \bm{k})$ defined in Equation 2 becomes $g(\bm{q}, \bm{k}, (m-n)\bm{\theta}/s)$. This is also equivalent to reducing the frequencies $\bm{\theta}$ uniformly, which may prevent the model from learning high-frequency features, as shown by the Neural Tangent Kernel (NTK) theory (Jacot et al., 2018). To remedy this, NTK-Aware interpolation (Peng & Quesnelle, 2023) scales high frequencies less and low frequencies more to spread out the interpolation pressure across multiple dimensions. This is achieved by directly altering the original $\theta_j = 10000^{-2j/d}$ into $\theta'_j = (10000\lambda)^{-2j/d}$, where $\lambda$ is conventionally chosen to be slightly greater than $s$.
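The difference between the uniform scaling of PI and the base change of NTK-Aware interpolation is easy to see numerically. In the sketch below, `rope_frequencies` is an illustrative helper, and the values $d = 8$ and $\lambda = 8$ are arbitrary toy settings.

```python
def rope_frequencies(d, base=10000.0, lam=1.0):
    """theta'_j = (base * lam) ** (-2j / d); lam = 1 recovers the original RoPE."""
    return [(base * lam) ** (-2 * j / d) for j in range(d // 2)]


d, lam = 8, 8.0
original = rope_frequencies(d)
ntk = rope_frequencies(d, lam=lam)

# Per-dimension shrink factor theta'_j / theta_j = lam ** (-2j / d).
ntk_ratios = [new / old for new, old in zip(ntk, original)]
# PI with scaling factor s shrinks every frequency by the same factor 1 / s.
pi_ratios = [1.0 / lam for _ in original]
```

The highest frequency ($j = 0$) is left untouched by NTK, and the shrink factor decreases monotonically towards roughly $1/\lambda$ for the lowest frequency, whereas PI divides every frequency by the same factor.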

5 Experiments

5.1 Experimental Setup

Benchmarked Models. We evaluate both open-source and proprietary models on LongEmbed, including E5Base (Wang et al., 2022), GTEBase (Li et al., 2023), BGE-Base (Xiao et al., 2023), Contriever (Izacard et al., 2021), GTR-Base (Ni et al., 2022), E5-Mistral (Wang et al., 2023b), Jina-V2 (Günther et al., 2023), Nomic-V1 (Nussbaum et al., 2024), BGE-M3 (Chen et al., 2024), and OpenAI-ada-002. For BGE-M3, we utilize dense vectors. M2 (Saad-Falcon et al., 2024) is not included in our evaluation, given that its training data partly overlaps with test samples in LongEmbed.

Candidate Models for Extension. From each of the APE-based and RoPE-based categories, we select 2 candidate models for comprehensive study. The former includes E5Base and GTEBase. The latter includes the 4,096-context E5-Mistral and a newly trained E5-RoPEBase, which supports a 512-token context (see Appendix A for its training details and BEIR results). Note that E5-RoPEBase employs the same training procedure and training data as E5Base, only with APE replaced by RoPE. This facilitates a fair comparison of APE- and RoPE-based models in context window extension, as presented in Section 5.4. For implementation details of each context window extension strategy on each model, please refer to Appendix B.

5.2 Main Results

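Synthetic tasks (Passkey, Needle) are scored by Acc@1; real tasks (NQA, QMSum, SFD, 2WmQA) are scored by nDCG@10.

| Model | Param. | Passkey | Needle | NQA | QMSum | SFD | 2WmQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| 512 Context Models | | | | | | | | |
| E5Base (Wang et al., 2022) | 110M | 38.0 | 28.5 | 25.3 | 23.8 | 74.7 | 55.8 | 41.0 |
| E5-RoPEBase | 110M | 38.5 | 31.5 | 24.6 | 23.2 | 66.6 | 58.8 | 40.5 |
| GTEBase (Li et al., 2023) | 110M | 31.0 | 24.5 | 28.6 | 21.8 | 55.8 | 47.3 | 34.8 |
| BGE-Base (Xiao et al., 2023) | 110M | 18.0 | 25.3 | 25.6 | 22.4 | 60.3 | 51.7 | 33.9 |
| Contriever (Izacard et al., 2021) | 110M | 38.5 | 29.0 | 26.7 | 25.5 | 73.5 | 47.3 | 40.1 |
| GTR-Base (Ni et al., 2022) | 110M | 38.5 | 26.3 | 26.5 | 18.3 | 63.7 | 52.2 | 36.5 |
| ≥ 4k Context Models | | | | | | | | |
| E5-Mistral (Wang et al., 2023b) | 7B | 71.0 | 48.3 | 44.6 | 43.6 | 96.8 | 82.0 | 64.4 |
| Jina-V2 (Günther et al., 2023) | 137M | 50.3 | 54.5 | 37.9 | 38.9 | 93.5 | 74.0 | 58.2 |
| Nomic-V1 (Nussbaum et al., 2024) | 137M | 32.3 | 25.3 | 38.3 | 35.0 | 91.0 | 73.4 | 49.2 |
| BGE-M3 (Chen et al., 2024) | 568M | 59.3 | 40.5 | 45.8 | 35.5 | 94.0 | 78.0 | 58.9 |
| OpenAI-Ada-002 | - | 50.8 | 36.8 | 41.1 | 40.0 | 91.8 | 80.1 | 56.8 |
| Our Extended Models | | | | | | | | |
| E5Base + Tuning (4k) | 110M | 67.3 | 41.5 | 30.4 | 35.7 | 95.2 | 69.2 | 56.6 |
| E5-RoPEBase + SelfExtend (4k) | 110M | 73.5 | 53.5 | 32.3 | 39.1 | 91.9 | 74.6 | 60.8 |
| E5-Mistral + NTK (32k) | 7B | 93.8 | 66.8 | 49.8 | 49.2 | 97.1 | 95.2 | 75.3 |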

Table 2 reports the performance of existing embedding models on our LongEmbed benchmark. Among the 512-context models, E5Base achieves the highest average score of 41.0 points, closely followed by E5-RoPEBase and Contriever. As the supported context length increases beyond 4k, exemplified by E5-Mistral and Jina-V2, a discernible increase in scores is observed. This verifies both the efficacy of these long-context models and the validity of LongEmbed for assessing long-context retrieval. Note that even the best-performing model attains only 64.4 points on average, indicating huge room for improvement in current models.

In the last row block of Table 2, we further include the best results achieved by E5Base, E5-RoPEBase and E5-Mistral after context window extension. For E5Base and E5-RoPEBase, we extend their contexts from 512 to 4,096. For E5-Mistral, we extend its context from 4,096 to 32,768. Compared to the original versions, the extended models achieve average score increases of +15.6 / +20.3 / +10.9 points, respectively. This indicates the efficacy of these context extension strategies on embedding models, enabling them to handle inputs several times longer. A detailed performance comparison of different extension strategies on APE- and RoPE-based embedding models is presented in Section 5.3.

5.3 Performance Comparison of Context Extension Methods


APE-Based Models. Figure 5(a) illustrates the impact of various context extension strategies on E5Base and GTEBase across different target context lengths. We observe that plug-and-play methods, including the GP, RP, LPI and PCW strategies, yield comparable results with no significant disparities. On the other hand, further tuning consistently yields additional performance gains for both models, across all target context lengths. Particularly noteworthy is GTEBase, which shows a substantial average score increase of approximately 5 points after further tuning. This suggests that freezing the original model weights and fine-tuning exclusively the added position embeddings can effectively extend the model's context window while strictly maintaining the model's original ability.

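Synthetic tasks (Passkey, Needle) are scored by Acc@1; real tasks (NQA, QMSum, SFD, 2WmQA) are scored by nDCG@10.

| Model | Passkey | Needle | NQA | QMSum | SFD | 2WmQA | Avg. |
|---|---|---|---|---|---|---|---|
| E5-RoPEBase | 38.5 | 31.5 | 24.6 | 23.2 | 66.6 | 58.8 | 40.5 |
| + PCW (4k) | 42.5 | 50.8 | 25.1 | 34.9 | 94.9 | 69.3 | 52.9 |
| + GP (4k) | 68.0 | 38.8 | 25.9 | 30.9 | 85.8 | 65.8 | 52.5 |
| + PI (4k) | 68.3 | 36.0 | 25.9 | 30.8 | 84.9 | 65.3 | 51.9 |
| + SE (4k) | 73.5 | 53.5 | 32.3 | 39.1 | 91.9 | 74.6 | 60.8 |
| + NTK (4k) | 66.3 | 46.5 | 25.5 | 35.8 | 90.8 | 71.7 | 56.1 |
| E5-Mistral | 71.0 | 48.3 | 44.6 | 43.6 | 96.8 | 82.0 | 64.4 |
| + PCW (32k) | 63.5 | 49.5 | 59.3 | 51.3 | 97.3 | 91.2 | 68.7 |
| + GP (32k) | 81.0 | 48.8 | 37.0 | 42.9 | 90.6 | 88.1 | 64.7 |
| + PI (32k) | 89.8 | 48.5 | 37.8 | 40.4 | 76.8 | 63.0 | 59.4 |
| + SE (32k) | 90.8 | 52.0 | 49.3 | 48.7 | 97.2 | 96.4 | 72.4 |
| + NTK (32k) | 93.8 | 66.8 | 49.8 | 49.2 | 97.1 | 95.2 | 75.3 |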

RoPE-Based Models. Table 3 reports the results of E5-RoPEBase and E5-Mistral on each dataset of LongEmbed after context window extension via PCW, GP, PI, SE and NTK. RoPE-specific methods including NTK and SE yield significant improvements for both models across all datasets, surpassing PCW, PI and GP by a large margin in average score.

5.4 Analysis

Tuning on PI vs. RP. Figure 5(b) compares further tuning on top of RP vs. PI. In the former approach, the initial 512 position embeddings are frozen while the remaining embeddings are tuned, whereas in the latter, the frozen and learnable embedding vectors are arranged in an interleaved manner. Our observations indicate that tuning applied to PI consistently produces superior results across both models. This superiority may be attributed to the fixed vectors acting as anchors, thereby preventing the learnable vectors from converging to suboptimal values.
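
The two frozen/learnable layouts can be sketched as boolean masks over the extended position embedding matrix. This is a minimal illustration rather than the released implementation: the function names, the toy embedding width, and the plain SGD step are hypothetical, and placing the original vectors at every $s$-th slot under PI is an assumption consistent with the position id mapping described in Appendix B.

```python
import numpy as np

def frozen_mask_rp(l_o: int, l_t: int) -> np.ndarray:
    """RP-style layout: the first l_o slots hold the original vectors and stay frozen."""
    mask = np.zeros(l_t, dtype=bool)
    mask[:l_o] = True
    return mask

def frozen_mask_pi(l_o: int, l_t: int) -> np.ndarray:
    """PI-style layout (assumed): original vectors sit at slots 0, s, 2s, ... and stay frozen."""
    if l_t % l_o != 0:
        raise ValueError("l_t must be a multiple of l_o")
    s = l_t // l_o
    mask = np.zeros(l_t, dtype=bool)
    mask[::s] = True
    return mask

def masked_sgd_step(emb: np.ndarray, grad: np.ndarray,
                    frozen: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """Illustrative update: apply a gradient step to learnable rows only."""
    learnable = (~frozen)[:, None].astype(emb.dtype)
    return emb - lr * grad * learnable

l_o, l_t, dim = 512, 2048, 4          # dim is a toy embedding width
rp = frozen_mask_rp(l_o, l_t)
pi = frozen_mask_pi(l_o, l_t)
emb = np.zeros((l_t, dim))
updated = masked_sgd_step(emb, np.ones((l_t, dim)), pi)
```

In the PI layout every learnable row has a frozen neighbour at most $s-1$ slots away, which is one way to picture the "anchor" effect suggested above; in the RP layout all learnable rows lie beyond position 511.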


RoPE vs. APE. We further discuss the potential of APE- and RoPE-based models for context window extension. E5Base and E5-RoPEBase are selected as the comparison subjects thanks to their shared training process, training data, and comparable performance on the BEIR and LongEmbed benchmarks. At each target context length ($\{1k, 2k, 4k\}$), we report the best scores achieved by each model on LongEmbed, as illustrated in Figure 6. Without requiring further training, E5-RoPEBase consistently demonstrates superior performance compared to E5Base across all target lengths. Furthermore, as the target window length increases, this superiority becomes more pronounced, even surpassing the fine-tuned version of E5Base by a large margin. This suggests that RoPE-based models can better extrapolate to longer contexts. Consequently, we advocate for the use of RoPE in future embedding models.

6 Conclusion

This paper explores context window extension of existing embedding models. Through extensive experiments on our LongEmbed benchmark, we show that training-free context window extension strategies can effectively increase the input length of these models several-fold. Further, our analysis reveals the superiority of RoPE-based embedding models over APE-based ones in context window extension. Hence, we advocate for the use of RoPE in future embedding models.

Limitations

As a pioneering work in applying context window extension to embedding models, this paper is still limited in several aspects, particularly in that most of the context extension strategies explored in this paper are training-free. As evidenced by previous findings (Xiong et al., 2023; Fu et al., 2024; Zhang et al., 2024b; Yen et al., 2024) and the additional performance gain achieved via tuning on E5Base and GTEBase, we believe further fine-tuning on top of plug-and-play methods can bring even better extension results. In the future, we will comprehensively explore training-based context window extension for embedding models, especially for RoPE-based ones.

References

  • An et al. (2024) Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. arXiv preprint arXiv:2402.17463, 2024.
  • Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a.
  • Bai et al. (2023b) Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023b.
  • Chen et al. (2024) Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216, 2024.
  • Chen et al. (2022) Mingda Chen, Zewei Chu, Sam Wiseman, and Kevin Gimpel. Summscreen: A dataset for abstractive screenplay summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8602–8615, 2022.
  • Chen et al. (2023) Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
  • Chiang & Cholak (2022) David Chiang and Peter Cholak. Overcoming a theoretical limitation of self-attention. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7654–7664, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.527. URL https://aclanthology.org/2022.acl-long.527.
  • Coelho et al. (2024) João Coelho, Bruno Martins, João Magalhães, Jamie Callan, and Chenyan Xiong. Dwell in the beginning: How language models embed long documents for dense retrieval. arXiv preprint arXiv:2404.04163, 2024.
  • Deerwester et al. (1990) Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407, 1990.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  • Ding et al. (2024) Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024.
  • Fu et al. (2024) Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171, 2024.
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910, 2021.
  • Ge et al. (2023) Tao Ge, Jing Hu, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945, 2023.
  • Günther et al. (2023) Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, et al. Jina embeddings 2: 8192-token general-purpose text embeddings for long documents. arXiv preprint arXiv:2310.19923, 2023.
  • Ho et al. (2020) Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 6609–6625, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. URL https://www.aclweb.org/anthology/2020.coling-main.580.
  • Izacard et al. (2021) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Towards unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2(3), 2021.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • Jiang et al. (2023) Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 13358–13376, 2023.
  • Jin et al. (2024) Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. Llm maybe longlm: Self-extend llm context window without tuning. arXiv preprint arXiv:2401.01325, 2024.
  • Kamradt (2023) Greg Kamradt. Needle in a haystack - pressure testing llms. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781, 2020.
  • Kočiský et al. (2018) Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. doi: 10.1162/tacl_a_00023. URL https://aclanthology.org/Q18-1023.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019.
  • Lefaudeux et al. (2022) Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, Luca Wehrstedt, Jeremy Reizenstein, and Grigory Sizov. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022.
  • Li et al. (2023) Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281, 2023.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Mohtashami & Jaggi (2023) Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300, 2023.
  • Muennighoff et al. (2023) Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. Mteb: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037, 2023.
  • Muennighoff et al. (2024) Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. arXiv preprint arXiv:2402.09906, 2024.
  • Neelakantan et al. (2022) Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005, 2022.
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. Ms marco: A human-generated machine reading comprehension dataset. 2016.
  • Ni et al. (2022) Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, et al. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 9844–9855, 2022.
  • Nussbaum et al. (2024) Zach Nussbaum, John X Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic embed: Training a reproducible long context text embedder. arXiv preprint arXiv:2402.01613, 2024.
  • Peng & Quesnelle (2023) Bowen Peng and Jeffrey Quesnelle. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have, 2023.
  • Peng et al. (2023) Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
  • Ratner et al. (2023) Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. Parallel context windows for large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6383–6402, 2023.
  • Ruoss et al. (2023) Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. Randomized positional encodings boost length generalization of transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1889–1903, 2023.
  • Saad-Falcon et al. (2024) Jon Saad-Falcon, Daniel Y Fu, Simran Arora, Neel Guha, and Christopher Ré. Benchmarking and building long-context retrieval models with loco and m2-bert. arXiv preprint arXiv:2402.07440, 2024.
  • Shaham et al. (2022) Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, and Omer Levy. SCROLLS: Standardized CompaRison over long language sequences. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 12007–12021, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.823. URL https://aclanthology.org/2022.emnlp-main.823.
  • Su (2021) Jianlin Su. Understanding attention scaling from the perspective of entropy invariance. https://spaces.ac.cn/archives/8823, Dec 2021.
  • Su et al. (2021) Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
  • Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  • Thakur et al. (2021) Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=wCu6T5xFjeJ.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
  • Wang et al. (2023a) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Simlm: Pre-training with representation bottleneck for dense passage retrieval. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2244–2258, 2023a.
  • Wang et al. (2023b) Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368, 2023b.
  • Wang et al. (2024a) Suyuchen Wang, Ivan Kobyzev, Peng Lu, Mehdi Rezagholizadeh, and Bang Liu. Resonance rope: Improving context length generalization of large language models. arXiv preprint arXiv:2403.00071, 2024a.
  • Wang et al. (2024b) Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. Advances in Neural Information Processing Systems, 36, 2024b.
  • Xiao et al. (2024) Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Song Han, and Maosong Sun. Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory. arXiv preprint arXiv:2402.04617, 2024.
  • Xiao et al. (2023) Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding. arXiv preprint arXiv:2309.07597, 2023.
  • Xiong et al. (2023) Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.
  • Yen et al. (2024) Howard Yen, Tianyu Gao, and Danqi Chen. Long-context language modeling with parallel context encoding, 2024.
  • Zhang et al. (2024a) Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. Soaring from 4k to 400k: Extending llm's context with activation beacon. arXiv preprint arXiv:2401.03462, 2024a.
  • Zhang et al. (2024b) Yikai Zhang, Junlong Li, and Pengfei Liu. Extending llms' context window with 100 samples. arXiv preprint arXiv:2401.07004, 2024b.
  • Zhong et al. (2021) Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir Radev. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In North American Association for Computational Linguistics (NAACL), 2021.
  • Zhu et al. (2023) Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. Pose: Efficient context window extension of llms via positional skip-wise training. In The Twelfth International Conference on Learning Representations, 2023.

Appendix A Training Details for E5-RoPEBase

| Params | Pre-training: E5Base | Pre-training: E5-RoPEBase | Fine-tuning: E5Base | Fine-tuning: E5-RoPEBase |
|---|---|---|---|---|
| learning rate | $2\times 10^{-4}$ | $2\times 10^{-4}$ | $2\times 10^{-5}$ | $2\times 10^{-5}$ |
| GPUs (V100) | 32 | 32 | 8 | 8 |
| warmup steps | 1000 | 1000 | 400 | 400 |
| max length | 128 | 512 | 192 | 192 |
| batch size | 32k | 16k | 256 | 256 |
| max steps | 20k | 20k | n.a. | n.a. |
| epochs | n.a. | n.a. | 3 | 3 |
| $\tau$ | 0.01 | 0.01 | 0.01 | 0.01 |
| $\alpha$ | n.a. | n.a. | 0.2 | 0.2 |
| weight decay | 0.01 | 0.01 | 0.01 | 0.01 |
| hard negatives | 0 | 0 | 7 | 7 |
| pos embedding | APE | RoPE | APE | RoPE |

In this section, we describe the training details of E5-RoPEBase. Our training procedure and data exactly follow those of E5 (Wang et al., 2022), where we first perform contrastive pre-training on their collected CCPairs, then perform fine-tuning on the concatenation of 3 datasets: MS-MARCO passage ranking (Nguyen et al., 2016), NQ (Karpukhin et al., 2020; Kwiatkowski et al., 2019), and NLI (Gao et al., 2021). Each example is paired with 7 hard negatives. We leverage the mined hard negatives and re-ranker scores from SimLM (Wang et al., 2023a) for the first two datasets. As the NLI dataset only provides 1 hard negative per example, we randomly sample 6 sentences from the entire corpus. xFormers (Lefaudeux et al., 2022) is used for memory-efficient training. As presented in Table 4, training hyperparameters for E5Base and E5-RoPEBase are identical, except in two aspects:

  • Initialization. Before contrastive pre-training, E5Base is initialized from BERTBase (Devlin et al., 2019), which employs absolute position embeddings (APE). For the initialization of E5-RoPEBase, we simply replace the APE part of BERTBase with RoPE. It is worth noting that the BERTBase model cannot function properly after this replacement. We count on the subsequent pre-training phase to adapt the model to RoPE.

  • Pre-training length and batch size. E5Base does not update its position embedding matrix during the training phase, i.e., it utilizes the same position embedding matrix as BERTBase. This allows it to generalize to input sequences of up to 512 tokens, while being trained with a max training length of 192. As for E5-RoPEBase, replacing APE with RoPE during initialization prevents us from directly inheriting the original model's capability in handling 512 tokens. Consequently, in the pre-training phase of E5-RoPEBase, we set the maximum training length to 512, and reduce the batch size to 16k due to memory constraints.

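| Tasks | # W/Q. | # W/D. | E5Base | E5-RoPEBase |
|---|---|---|---|---|
| MS MARCO | 6.0 | 56.0 | 41.8 | 42.4 |
| Trec-Covid | 10.6 | 160.8 | 69.6 | 73.3 |
| NFCorpus | 3.3 | 232.3 | 35.4 | 34.9 |
| NQ | 9.2 | 78.9 | 58.2 | 60.1 |
| HotpotQA | 17.6 | 46.3 | 69.1 | 61.0 |
| FiQA | 10.8 | 132.3 | 39.8 | 36.4 |
| ArguAna | 193.0 | 166.8 | 44.6 | 54.2 |
| Touche-2020 | 6.6 | 292.4 | 26.4 | 26.6 |
| CQADupStack | 8.6 | 129.1 | 37.4 | 36.5 |
| Quora | 9.5 | 11.4 | 86.6 | 87.7 |
| DBPedia | 5.4 | 49.7 | 42.2 | 40.0 |
| Scidocs | 9.4 | 176.2 | 18.7 | 18.1 |
| Fever | 8.1 | 84.8 | 85.0 | 68.0 |
| Climate-Fever | 20.1 | 84.8 | 26.6 | 19.0 |
| Scifact | 12.4 | 213.6 | 72.0 | 71.0 |
| Average | < 200 | < 300 | 50.23 | 48.61 |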

Table 5 reports the results of E5Base and E5-RoPEBase on 15 publicly available BEIR tasks. We observe comparable overall scores between the two models. This comparable performance, along with their shared training process and training data, facilitates a fair comparison of the length extrapolation capabilities of APE- and RoPE-based models. Note that the slight performance loss of E5-RoPEBase could possibly be attributed to the replacement of position embedding in the initialization phase, or the reduced batch size in the pre-training phase, as mentioned before.

| Extension | PCW & GP & RP & PI | NTK | SE |
|---|---|---|---|
| GTEBase & E5Base | | | |
| 512 -> 1,024 | $L_o=512$, $L_t=1{,}024$, $s=2$ | - | - |
| 512 -> 2,048 | $L_o=512$, $L_t=2{,}048$, $s=4$ | - | - |
| 512 -> 4,096 | $L_o=512$, $L_t=4{,}096$, $s=8$ | - | - |
| E5-RoPEBase | | | |
| 512 -> 1,024 | $L_o=512$, $L_t=1{,}024$, $s=2$ | $\lambda=3$ (10,000 -> 30,000) | $g=3$, $w=256$ |
| 512 -> 2,048 | $L_o=512$, $L_t=2{,}048$, $s=4$ | $\lambda=5$ (10,000 -> 50,000) | $g=5$, $w=128$ |
| 512 -> 4,096 | $L_o=512$, $L_t=4{,}096$, $s=8$ | $\lambda=10$ (10,000 -> 100,000) | $g=9$, $w=64$ |
| E5-Mistral | | | |
| 4,096 -> 8,192 | $L_o=4{,}096$, $L_t=8{,}192$, $s=2$ | $\lambda=3$ (10,000 -> 30,000) | $g=3$, $w=2{,}048$ |
| 4,096 -> 16,384 | $L_o=4{,}096$, $L_t=16{,}384$, $s=4$ | $\lambda=5$ (10,000 -> 50,000) | $g=5$, $w=1{,}024$ |
| 4,096 -> 32,768 | $L_o=4{,}096$, $L_t=32{,}768$, $s=8$ | $\lambda=10$ (10,000 -> 100,000) | $g=9$, $w=512$ |

Appendix B Implementation Details for Context Extension Strategies

This section describes implementation details for the explored context extension strategies. For plug-and-play methods including PCW, RP, GP, PI, NTK and SE, Table 6 summarizes their hyperparameters under each condition.

Further Tuning. On top of PI and RP, we perform further tuning on both E5Base and GTEBase, utilizing the fine-tuning dataset mentioned in Appendix A. Following the practice of PoSE (Zhu et al., 2023), we manipulate position ids to simulate long training samples. Concretely, given an input document $\mathcal{D}=\{x_{0},x_{1},...,x_{L_{o}-1}\}$ of original context length $L_{o}$, we introduce a skipping bias term $u$ at the beginning of $\mathcal{D}$, transforming the original position ids $\{0,1,...,L_{o}-1\}$ into $\{u,u+1,...,u+L_{o}-1\}$. (The original practice of PoSE focuses on relative position and hence introduces bias terms in the middle of document $\mathcal{D}$. For APE-based models, we simply skip from the beginning.) For every piece of training data, $u$ is re-sampled from the discrete uniform distribution $\mathcal{U}(\{0,1,...,L_{t}-L_{o}\})$. In this way, we ensure comprehensive coverage of the target context window. The training procedure spans 3 epochs on 2 A100 GPUs, with a learning rate of $5\times 10^{-4}$, a batch size of 512, and 100 steps for warmup. Other hyperparameters are the same as in Table 4.
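
The position id manipulation used during further tuning can be sketched in a few lines. This is an illustrative reading of the description above, not the released training code; the function name `skipped_position_ids` and the use of Python's `random` module are assumptions.

```python
import random
from typing import List

def skipped_position_ids(l_o: int, l_t: int, rng: random.Random) -> List[int]:
    """Shift the position ids {0, ..., l_o - 1} by a skipping bias u.

    u is drawn uniformly from {0, 1, ..., l_t - l_o}, so the shifted ids
    {u, ..., u + l_o - 1} always stay inside the target window [0, l_t - 1].
    """
    u = rng.randint(0, l_t - l_o)  # randint is inclusive on both ends
    return list(range(u, u + l_o))

# Re-sample u for every training example (hypothetical 512 -> 4,096 setting).
rng = random.Random(0)
batch_ids = [skipped_position_ids(512, 4096, rng) for _ in range(4)]
```

Because $u$ is re-sampled per example, every slot of the extended position embedding matrix is eventually visited even though each individual training sample only spans $L_{o}$ tokens.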

Inference. At inference time, attention scaling (Su, 2021; Chiang & Cholak, 2022) is used by default for all tested models for better length extrapolation ability. In particular, for GTEBase and E5Base tuned on PI, we use the original position ids when the input length does not exceed 512. This is achieved by mapping the position ids $\{0,1,...,l\}$ into $\{0,s,...,l\times s\}$, where $s$ is the scaling factor and $l<512$.
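As a rough sketch of this mapping, the snippet below builds the position ids for one input. The function name `inference_position_ids`, the integer scaling factor of 8, and the use of consecutive ids for inputs longer than 512 are assumptions made for illustration. The text only specifies the short-input case.

```python
def inference_position_ids(length, scale, orig_window=512):
    """Position ids for a PI-tuned APE model at inference time (hypothetical helper).

    Short inputs (length <= orig_window) get ids {0, s, 2s, ...}, which correspond
    to the original positions inside the interpolated position table.
    Longer inputs are assumed here to use consecutive ids {0, 1, ..., length - 1}.
    """
    if length <= orig_window:
        return [i * scale for i in range(length)]
    return list(range(length))


# Assumed scaling factor s = 8 (for example, a 512 window extended to 4096).
print(inference_position_ids(4, 8))       # [0, 8, 16, 24]
print(inference_position_ids(600, 8)[:4])  # [0, 1, 2, 3]
```

Under these assumptions, the largest id produced for a short input is 511 × 8 = 4088, which stays within a 4096-entry position table.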

Appendix C Further details on LongEmbed


Figure 7 presents the source and examples for each dataset included in LongEmbed. For QA datasets, including NarrativeQA and 2WikiMultihopQA, we adopt their test splits. Note that for 2WikiMultihopQA, we adopt the length-uniformly sampled version from Bai et al. (2023b) to better assess the model's capabilities across various context lengths. For summarization datasets, including QMSum and SummScreenFD, we adopt the version processed by SCROLLS (Shaham et al., 2022). Since SCROLLS does not include ground-truth summaries in its test sets, we switch to the validation set for these two datasets. For QMSum in particular, whose validation set has only 60 documents, which is too few for document retrieval, we include the train set as well.


