Linear assembly producing misannotated sequences #136

JamesBagley · 2023-08-22T21:26:08Z

I've noticed that in certain cases the Assembly.assemble_linear() method can produce misannotated sequences. Specifically, it seems to shift the annotations along the sequence: they keep the same length and the same name, but correspond to different base pairs.

The test case below should reproduce the error. It's based on a modified version of the original assemble_linear() test case you have in pydna/tests.py.

The differences from the original test are that a) annotations are added, and b) an extra base is added to Dseqrecord c.

If the first G is removed from Dseqrecord c, the test passes. Introducing extra bases to 5' or 3' ends of Dseqrecords a & b does not produce an error.

def test_linear_with_annotations2(monkeypatch):
    from pydna._pretty import pretty_str
    from pydna.assembly import Assembly
    from pydna.dseqrecord import Dseqrecord

    a = Dseqrecord("acgatgctatactgtgCCNCCtgtgctgtgctcta")
    a.add_feature(0,10,label='a_feat')
    a_feat_seq = a.features[0].extract(a)
    # 12345678901234
    b = Dseqrecord("tgtgctgtgctctaTTTTTTTtattctggctgtatcCCCCCC")
    b.add_feature(0,10,label='b_feat')
    b_feat_seq = b.features[0].extract(b)

    # 123456789012345
    c = Dseqrecord("GtattctggctgtatcGGGGGtacgatgctatactgtg")
    c.add_feature(0,10,label='c_feat')
    c_feat_seq = c.features[0].extract(c)

    feature_sequences = {'a_feat':a_feat_seq,
                         'b_feat':b_feat_seq,
                         'c_feat':c_feat_seq}

    a.name = "aaa"  # 1234567890123456
    b.name = "bbb"
    c.name = "ccc"
    asm = Assembly((a, b, c), limit=14)
    x = asm.assemble_linear()[0]
    #print(x.features)
    #print(x)
    answer = 'aaa|14\n    \\/\n    /\\\n    14|bbb|15\n           \\/\n           /\\\n           15|ccc'

    assert x.figure() == answer.strip()
    answer = 'acgatgctatactgtgCCNCCtgtgctgtgctcta\n                     TGTGCTGTGCTCTA\n                     tgtgctgtgctctaTTTTTTTtattctggctgtatc\n                                          TATTCTGGCTGTATC\n                                          tattctggctgtatcGGGGGtacgatgctatactgtg\n'
    assert x.detailed_figure()
    for feat in x.features:
        label = feat.qualifiers['label']
        extracted = feat.extract(x).seq
        original = feature_sequences[label].seq
        # If a feature moved, report its label and show both sequences.
        assert extracted == original, (
            f"{label}: extracted feat {extracted} != original feat {original}")

# monkeypatch is not used inside the test, so a placeholder is enough for a direct call
test_linear_with_annotations2(None)
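
To put a number on the shift, here is a minimal sketch in plain Python that does not depend on pydna. The helper name `feature_shift` is hypothetical, and the `assembled` string is hand-built from the fragments above (without the extra G), on the assumption that each overlap appears once in the product, as in the detailed figure string in the test. It searches for the original feature bases in the assembled sequence and reports how far the annotated start is from the nearest match.

```python
def feature_shift(assembled, feature_seq, annotated_start):
    """Return annotated_start minus the nearest true start of feature_seq.

    0 means the annotation covers the bases it was created on; a nonzero
    value is the displacement in bases. Returns None if the original
    feature bases do not occur in the assembled sequence at all.
    """
    assembled = assembled.lower()
    feature_seq = feature_seq.lower()
    hits = []
    pos = assembled.find(feature_seq)
    while pos != -1:
        hits.append(pos)
        pos = assembled.find(feature_seq, pos + 1)
    if not hits:
        return None
    nearest = min(hits, key=lambda p: abs(p - annotated_start))
    return annotated_start - nearest


# Hand-built product (an assumption, not pydna output):
# fragment a, then the non-overlapping parts of b and c.
assembled = (
    "acgatgctatactgtgCCNCCtgtgctgtgctcta"  # a
    "TTTTTTTtattctggctgtatc"               # b after the 14 bp overlap
    "GGGGGtacgatgctatactgtg"               # c after the 15 bp overlap
)

b_feat = "tgtgctgtgc"  # first 10 bases of b
print(feature_shift(assembled, b_feat, 21))  # annotation in the right place
print(feature_shift(assembled, b_feat, 22))  # annotation displaced by one base
```

In the real test, `annotated_start` would come from the start of each feature's location on the assembled record, and `feature_seq` from the sequence extracted before assembly.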

I've also implemented it as a Colab notebook that includes the original test case and a case where the assembled sequence is annotated correctly:
https://colab.research.google.com/drive/1akdSdrVGu7w5mD2jd7HJ-hD17J4zApJj?usp=sharing

Thank you for creating such a brilliant (and beautifully written) package


BjornFJohansson · 2023-10-17T16:48:30Z

Hi, and thanks for the positive review. Thank you for taking the time to report this and making pydna better.
I think I have solved this bug and a new alpha version will be available shortly.
