Linear assembly producing misannotated sequences #136

Open
JamesBagley opened this issue Aug 22, 2023 · 1 comment

@JamesBagley

I've noticed that in certain cases the Assembly.assemble_linear() method produces misannotated sequences. Specifically, it seems to shift the annotations along the sequence: the features keep the same length and name but correspond to different base pairs.

The test case below should reproduce the error. It is based on a modified version of the original assemble_linear() test case that you have in pydna/tests.py.

The differences in this test are that a) annotations are added, and b) an extra base is added to Dseqrecord c.

If the first G is removed from Dseqrecord c, the test passes. Introducing extra bases to 5' or 3' ends of Dseqrecords a & b does not produce an error.

def test_linear_with_annotations2(monkeypatch):
    from pydna.assembly import Assembly
    from pydna.dseqrecord import Dseqrecord

    a = Dseqrecord("acgatgctatactgtgCCNCCtgtgctgtgctcta")
    a.add_feature(0,10,label='a_feat')
    a_feat_seq = a.features[0].extract(a)
    # 12345678901234
    b = Dseqrecord("tgtgctgtgctctaTTTTTTTtattctggctgtatcCCCCCC")
    b.add_feature(0,10,label='b_feat')
    b_feat_seq = b.features[0].extract(b)

    # 123456789012345
    c = Dseqrecord("GtattctggctgtatcGGGGGtacgatgctatactgtg")
    c.add_feature(0,10,label='c_feat')
    c_feat_seq = c.features[0].extract(c)

    feature_sequences = {'a_feat':a_feat_seq,
                         'b_feat':b_feat_seq,
                         'c_feat':c_feat_seq}

    a.name = "aaa"  # 1234567890123456
    b.name = "bbb"
    c.name = "ccc"
    asm = Assembly((a, b, c), limit=14)
    x = asm.assemble_linear()[0]
    #print(x.features)
    #print(x)
    answer = 'aaa|14\n    \\/\n    /\\\n    14|bbb|15\n           \\/\n           /\\\n           15|ccc'

    assert x.figure() == answer.strip()
    answer = 'acgatgctatactgtgCCNCCtgtgctgtgctcta\n                     TGTGCTGTGCTCTA\n                     tgtgctgtgctctaTTTTTTTtattctggctgtatc\n                                          TATTCTGGCTGTATC\n                                          tattctggctgtatcGGGGGtacgatgctatactgtg\n'
    # Only checks that a figure is produced; `answer` above is not compared.
    assert x.detailed_figure()
    # Each feature extracted from the assembly must match the sequence it
    # covered on its source fragment; the message names the feature that moved.
    for feat in x.features:
        label = feat.qualifiers['label']
        extracted = feat.extract(x).seq
        original = feature_sequences[label].seq
        assert extracted == original, (
            f"{label}: extracted feat {extracted}, original feat {original}"
        )


# monkeypatch is unused, so any placeholder works when run outside pytest
test_linear_with_annotations2('')
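
To make the symptom concrete without depending on pydna, here is a minimal sketch that uses plain strings and (start, end) tuples for the same three sequences. It is not pydna's implementation. The helper `map_feature`, the `placement` table, and the choice to drop a feature that is not fully retained are all assumptions made for illustration. The point is that the leading G of c is not part of the assembled product, so a feature at 0..10 on c has no exact counterpart there, and a naive shift by the fragment offset lands on different bases.

```python
# Illustration only: plain strings, not pydna objects.
a = "acgatgctatactgtgCCNCCtgtgctgtgctcta"
b = "tgtgctgtgctctaTTTTTTTtattctggctgtatcCCCCCC"
c = "GtattctggctgtatcGGGGGtacgatgctatactgtg"

# Linear product a -> b -> c, joined on the 14 bp and 15 bp overlaps.
product = a + b[14:36] + c[16:]


def map_feature(start, end, kept_start, kept_end, product_offset):
    """Map a fragment-local span to product coordinates (hypothetical helper).

    kept_start, kept_end: slice of the fragment that survives in the product.
    product_offset: index in the product where fragment[kept_start] lands.
    Returns None when the span is not fully inside the kept slice
    (an assumed policy; truncating the feature would be another option).
    """
    if start < kept_start or end > kept_end:
        return None
    shift = product_offset - kept_start
    return (start + shift, end + shift)


# (kept_start, kept_end, product_offset) for each fragment
placement = {"a": (0, 35, 0), "b": (0, 36, 21), "c": (1, 38, 42)}

a_span = map_feature(0, 10, *placement["a"])  # (0, 10)
b_span = map_feature(0, 10, *placement["b"])  # (21, 31)
c_span = map_feature(0, 10, *placement["c"])  # None: base 0 of c is trimmed

# A naive shift that ignores the trimmed base keeps the length but moves
# the feature onto different bases.
naive_c = (0 + 42, 10 + 42)
print(c[0:10], "original c_feat")
print(product[naive_c[0]:naive_c[1]], "naively shifted c_feat")
```

In this sketch the a and b features map onto identical bases in the product, while the c feature either has to be dropped or truncated, or else ends up one base off.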

I've also implemented it as a Colab notebook that includes the original test case and a case where the assembled sequence is annotated correctly:
https://colab.research.google.com/drive/1akdSdrVGu7w5mD2jd7HJ-hD17J4zApJj?usp=sharing

Thank you for creating such a brilliant (and beautifully written) package

@BjornFJohansson
Owner

Hi, and thanks for the positive review. Thank you for taking the time to report this and making pydna better.
I think I have solved this bug and a new alpha version will be available shortly.
